Reliability First#
The ops-db-api is designed with one overriding principle: observatory operations must never fail due to network issues. This document explains how we achieve this through transaction buffering, LSN tracking, and eventual consistency.
The Core Requirement#
During telescope observations at 5600m altitude in Chile:
What must work (observatory operations):
Recording observation start/end times
Logging instrument configurations
Registering raw data files
Tracking observation status changes
What can fail:
Network connection to main database (Cologne)
Replication lag to local replica
Remote API calls for auxiliary services
The guarantee: Observatory operations never block or fail due to network issues.
Transaction Buffering: The Safety Net#
How It Works#
When operating at a secondary site (observatory):
1. Write request arrives at a critical endpoint (e.g., create observation)
2. Transaction is built with pre-generated IDs
3. Buffer to Redis immediately (sub-millisecond)
4. Return success to client with the pre-generated ID
5. Background processor executes the transaction on the main DB asynchronously
6. LSN tracking confirms when data reaches the replica
7. Cache cleanup occurs after replication is confirmed
sequenceDiagram
participant Client
participant API
participant Redis
participant BG as Background<br/>Processor
participant MainDB as Main DB<br/>(Cologne)
participant Replica as Local<br/>Replica
Client->>API: POST /executed_obs_units/start
Note over API: Build transaction<br/>Pre-generate UUID
API->>Redis: LPUSH to buffer (instant)
Redis-->>API: OK
API-->>Client: 201 Created (UUID: abc123)
Note over Client: Client continues<br/>operation immediately
loop Background Processing
BG->>Redis: RPOP from buffer
BG->>MainDB: Execute transaction
MainDB-->>BG: LSN: 0/12345678
BG->>Replica: Poll replay LSN
alt Replicated
Replica-->>BG: LSN: 0/12345678 (caught up)
BG->>Redis: Cleanup cache
else Not Yet
Replica-->>BG: LSN: 0/12345600 (behind)
BG->>Redis: Extend cache TTL
end
end
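In code, the write path for a buffered endpoint might look roughly like the sketch below. This is illustrative only: the buffer key name, the transaction payload shape, and the module-level redis.asyncio client are assumptions, not the actual implementation.
import json
import uuid

import redis.asyncio as redis

r = redis.Redis()  # local Redis instance at the observatory (assumed)

async def start_observation(data: dict, site_name: str) -> uuid.UUID:
    # Steps 1-2: pre-generate the ID and build the transaction payload (shape is illustrative)
    obs_id = uuid.uuid4()
    transaction = {
        "id": str(uuid.uuid4()),  # transaction ID
        "operation": "create",
        "model": "Observation",
        "data": {"id": str(obs_id), **data},
    }

    # Step 3: buffer to Redis immediately (sub-millisecond, never waits on the network)
    await r.lpush(f"site:{site_name}:transaction_buffer", json.dumps(transaction))

    # Step 4: return right away; the background processor handles steps 5-7
    return obs_id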
Benefits#
Immediate Response:
Client never waits for remote database
Sub-millisecond latency for buffered operations
No difference in experience during network issues
Guaranteed Execution:
Redis persistence (AOF) ensures durability
Automatic retry with exponential backoff
Failed transactions move to dead-letter queue for investigation
Operational Continuity:
Telescope observations never pause
Operators get immediate feedback
Data accumulates in buffer until network restored
Pre-Generated IDs#
A critical feature: IDs are generated before database insertion:
# Traditional approach (doesn't work for buffering)
def create_observation(data):
    obs = Observation(**data)
    db.add(obs)
    db.commit()
    return obs.id  # ID only available after DB commit!

# Our approach (works with buffering)
def create_observation(data):
    obs_id = uuid.uuid4()  # Generate immediately
    transaction = builder.create(
        model_class=Observation,
        data={"id": obs_id, **data}
    )
    buffer_transaction(transaction)
    return obs_id  # Available immediately!
Why this matters:
Client can reference the observation immediately
Can create related records before DB commit (e.g., files linked to observation)
Status queries work immediately (check buffer + DB)
See Transaction Builder for implementation.
LSN Tracking: Precision Over Guesswork#
The Problem with Guessing#
Without LSN tracking, we’d have to guess when data has replicated:
# Naive approach (unreliable)
async def wait_for_replication(record_id):
    await asyncio.sleep(5)  # Hope it's replicated by now?
    return check_replica(record_id)  # Might not be there yet!
Problems:
Hard-coded delays waste time (too long) or miss replication (too short)
No way to know actual replication state
Can’t adjust to varying network conditions
Cache management is guesswork
LSN: The Solution#
PostgreSQL Log Sequence Numbers (LSN) provide precision:
After write to main database:
-- Get LSN after transaction commit
SELECT pg_current_wal_lsn();
-- Returns: '0/12345678'
Check replica progress:
-- Get LSN that replica has replayed
SELECT pg_last_wal_replay_lsn();
-- Returns: '0/12345600' (slightly behind)
Compare:
main_lsn = "0/12345678"
replica_lsn = "0/12345600"

# Compare numerically (string comparison of LSNs is unreliable)
if parse_lsn(replica_lsn) >= parse_lsn(main_lsn):
    print("Replica has our data!")
else:
    bytes_behind = parse_lsn(main_lsn) - parse_lsn(replica_lsn)
    print(f"Replica is {bytes_behind} bytes behind")
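The parse_lsn helper used above is not shown elsewhere in this document; a minimal version, assuming the standard high/low hexadecimal LSN text format, could look like this:
def parse_lsn(lsn: str) -> int:
    # PostgreSQL prints an LSN as two hex numbers, e.g. "0/12345678".
    # The high part counts 4 GiB WAL segments, so shift it left by 32 bits.
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)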
How We Use LSN#
1. Track Transaction LSN:
When the background processor executes a buffered transaction:
# Execute transaction
await executor.execute_transaction(transaction, session)
await session.commit()

# Capture LSN of the committed transaction
result = await session.execute("SELECT pg_current_wal_lsn()")
lsn = result.scalar()

# Store with transaction (key expires after 1 hour)
await redis.setex(f"transaction:{tx_id}:lsn", 3600, lsn)
2. Poll Replica Status:
Periodically check if replica has caught up:
async def check_replication(transaction_lsn):
    # Query replica for the last replayed WAL position
    result = await replica_session.execute(
        "SELECT pg_last_wal_replay_lsn()"
    )
    replica_lsn = result.scalar()

    # Compare numerically (string comparison of LSNs is unreliable)
    if parse_lsn(replica_lsn) >= parse_lsn(transaction_lsn):
        return True  # Replicated!
    else:
        return False  # Still waiting
3. Smart Cache Management:
Adjust cache TTL based on replication state:
if replicated:
    # Data is in replica, can cleanup buffer
    await cleanup_transaction_cache(tx_id)
else:
    # Still replicating, keep cache longer
    await extend_cache_ttl(tx_id, additional_seconds=60)
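As a rough sketch (the Redis key layout is an assumption), these two helpers boil down to deleting or refreshing the TTL of the transaction's cache keys:
async def cleanup_transaction_cache(tx_id: str):
    # Data is safely in the replica: drop the buffered copy and its LSN marker
    await redis.delete(f"transaction:{tx_id}", f"transaction:{tx_id}:lsn")

async def extend_cache_ttl(tx_id: str, additional_seconds: int = 60):
    # Still replicating: reset the TTL so the cached copy survives a while longer
    await redis.expire(f"transaction:{tx_id}", additional_seconds)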
4. Timeout Handling:
If replication takes too long:
if time_waiting > LSN_TIMEOUT:
    # Replication very slow or stuck
    logger.warning(
        f"LSN {transaction_lsn} not replicated after {LSN_TIMEOUT}s"
    )
    # Keep cache indefinitely (or until manual cleanup)
    await set_cache_ttl(tx_id, ttl=None)
See LSN Tracking for full implementation.
Automatic Retry with Exponential Backoff#
When transaction execution fails, we retry intelligently:
Retry Configuration#
# .env
TRANSACTION_RETRY_ATTEMPTS=3
TRANSACTION_RETRY_DELAY=5 # Initial delay in seconds
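For reference, these settings might be loaded into the Python constants used below along these lines (the loading code itself is an assumption):
import os

TRANSACTION_RETRY_ATTEMPTS = int(os.getenv("TRANSACTION_RETRY_ATTEMPTS", "3"))
TRANSACTION_RETRY_DELAY = int(os.getenv("TRANSACTION_RETRY_DELAY", "5"))  # seconds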
Retry Strategy#
async def execute_with_retry(transaction):
    attempt = 0
    delay = TRANSACTION_RETRY_DELAY

    while attempt < TRANSACTION_RETRY_ATTEMPTS:
        try:
            result = await execute_transaction(transaction)
            return result
        except NetworkError as e:
            attempt += 1
            if attempt >= TRANSACTION_RETRY_ATTEMPTS:
                # Move to failed queue
                await move_to_failed_queue(transaction)
                raise

            # Exponential backoff
            logger.warning(
                f"Transaction failed (attempt {attempt}), "
                f"retrying in {delay}s: {e}"
            )
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff
Why exponential backoff:
Transient network issues often resolve quickly (first retry)
Persistent issues need longer wait (later retries)
Prevents overwhelming failed service with retry storm
Failed Transaction Handling#
If all retries fail:
# Move to dead-letter queue
await redis.lpush(
    f"site:{site_name}:failed_transactions",
    json.dumps(transaction)
)

# Log for investigation
logger.error(
    f"Transaction {tx_id} failed after {attempts} attempts: {error}"
)

# Alert monitoring system
await alert_monitoring(
    severity="error",
    message=f"Transaction {tx_id} failed permanently"
)
Human intervention needed:
Investigate root cause (network? database? bug?)
Manually replay the failed transaction (see the sketch below)
Or mark as resolved if no longer needed
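Replaying can be as simple as moving entries from the dead-letter queue back onto the normal buffer so the background processor picks them up again. A hypothetical sketch (the buffer key name is an assumption):
async def replay_failed_transactions(redis, site_name: str) -> int:
    failed_key = f"site:{site_name}:failed_transactions"
    buffer_key = f"site:{site_name}:transaction_buffer"  # assumed buffer key
    replayed = 0

    while True:
        raw = await redis.rpop(failed_key)
        if raw is None:
            break  # dead-letter queue is empty
        # Re-queue the raw transaction for normal background processing
        await redis.lpush(buffer_key, raw)
        replayed += 1

    return replayed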
See Debugging Transaction Buffering for how to investigate and replay failed transactions.
Eventual Consistency: Accept the Lag#
Why Eventual Is Good Enough#
Strong consistency would require:
async def create_observation(data):
    # Write to main DB
    await main_db.insert(data)

    # Wait for all replicas to confirm
    for replica in replicas:
        await replica.wait_for_replication()  # Blocks!

    # If any replica unreachable: FAIL
    return success
This doesn't work at the observatory because the network link to Cologne can be down.
Eventual consistency means:
async def create_observation(data):
    # Buffer locally
    await buffer_transaction(data)  # Never fails!

    # Return immediately
    return success  # Client continues

# Background: sync when possible
# Eventually all sites will have consistent data
The guarantee: Data will become consistent, we just can’t say exactly when.
How Long is “Eventually”?#
Best case (good network):
Buffered: < 1ms (Redis write)
Executed on main: 1-5 seconds (background processor)
Replicated to observatory: 1-10 seconds (depending on lag)
Total: Seconds
Worst case (network down):
Buffered: < 1ms (Redis write)
Executed on main: Hours to days (when network restored)
Replicated to observatory: Hours to days + replication time
Total: Until network restoration + sync time
Acceptable because:
Observatory operations don’t block (buffering works)
Scientists query data hours/days later (lag doesn’t matter)
Real-time monitoring uses WebSockets (immediate local updates)
Smart Query Manager: Consistent View#
Even during replication lag, reads are consistent:
async def get_observations(obs_unit_id):
    # Query local replica
    db_records = await db.query(
        Observation
    ).filter(
        Observation.obs_unit_id == obs_unit_id
    )

    # Query buffer
    buffered_records = await get_buffered_records(
        "Observation",
        {"obs_unit_id": obs_unit_id}
    )

    # Query read buffer (updates to buffered records)
    read_buffer_updates = await get_read_buffer_updates(
        "Observation",
        {"obs_unit_id": obs_unit_id}
    )

    # Merge: buffer + updates override DB
    return merge_records(
        db_records,
        buffered_records,
        read_buffer_updates
    )
Result: Client always sees latest state, even if not yet in database.
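The merge step might be implemented along these lines, assuming records can be treated as dicts keyed by id and that read-buffer updates arrive as a mapping of record ID to changed fields (a sketch, not the actual Smart Query Manager code):
def merge_records(db_records, buffered_records, read_buffer_updates):
    # Start from the replica's (possibly stale) rows
    merged = {r["id"]: dict(r) for r in db_records}

    # Buffered records take precedence over replica rows
    for r in buffered_records:
        merged[r["id"]] = dict(r)

    # Read-buffer updates patch individual fields on top
    for record_id, updates in read_buffer_updates.items():
        if record_id in merged:
            merged[record_id].update(updates)

    return list(merged.values())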
See Smart Query Manager for details.
Read Buffer: Mutable Buffered Records#
The Problem#
Buffered records sometimes need updates before replication:
# Create observation (buffered)
obs_id = await create_observation({
    "status": "running",
    "start_time": "2025-01-01T00:00:00Z"
})

# 10 seconds later: observation finishes
# But original record still in buffer (not in DB yet)
# How to update it?
We can't modify the database (the record isn't there yet), and we can't modify the buffered transaction itself (that would break its atomicity).
The Solution: Read Buffer#
Separate buffer for tracking updates to buffered records:
# Update buffered observation
await update_read_buffer(
    model_class="Observation",
    record_id=obs_id,
    updates={
        "status": "completed",
        "end_time": "2025-01-01T00:10:00Z"
    },
    transaction_id=original_tx_id
)
Read buffer tracks:
Which buffered record to update
What fields changed
When update happened
Which transaction created original record
Smart query manager merges:
Database record (if replicated)
Buffered record (if still in buffer)
Read buffer updates (latest changes)
Cleanup happens when LSN confirms replication.
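A sketch of what update_read_buffer might store, covering the fields listed above (the Redis key layout and field names are assumptions):
import json
import time

async def update_read_buffer(model_class, record_id, updates, transaction_id):
    entry = {
        "model": model_class,              # which buffered record to update
        "record_id": str(record_id),
        "updates": updates,                # what fields changed
        "updated_at": time.time(),         # when the update happened
        "transaction_id": transaction_id,  # transaction that created the original record
    }
    await redis.lpush(
        f"read_buffer:{model_class}:{record_id}", json.dumps(entry)
    )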
See Read Buffer Manager for implementation.
UI Can Tolerate Staleness#
Why UI Users Don’t Mind#
The web UI (ops-db-ui) queries the API for data visualization:
Typical queries:
Transfer overview dashboard
Observation history for past week
Source visibility calculations
Raw data package listings
Staleness tolerance:
Seconds of lag: Imperceptible to human users
Minutes of lag: Acceptable for most dashboards
Hours of lag: Only matters for real-time monitoring
Real-time needs (WebSockets):
For truly real-time data (active transfers, current observations):
WebSockets provide immediate updates
Redis pub/sub broadcasts changes
UI updates instantly regardless of replication
Example: Transfer monitoring
# REST API: May be seconds behind
GET /api/transfer/overview
# WebSocket: Immediate updates
WS /api/transfer/ws/overview
# Receives: { "transfer_id": 123, "status": "completed" }
# Instantly!
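For illustration, a client could consume these updates like this (a minimal sketch using the third-party websockets package; the endpoint path and payload shape are taken from the example above):
import asyncio
import json

import websockets

async def watch_transfers():
    uri = "ws://localhost:8000/api/transfer/ws/overview"
    async with websockets.connect(uri) as ws:
        async for message in ws:
            update = json.loads(message)
            print(f"Transfer {update['transfer_id']} -> {update['status']}")

asyncio.run(watch_transfers())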
Health Monitoring#
The API exposes health information about the buffering system:
Health Endpoint#
curl http://localhost:8000/health
Response:
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "transaction_buffer": {
    "size": 12,
    "failed": 0,
    "oldest_pending_seconds": 5.2
  },
  "background_processor": {
    "status": "running",
    "last_run": "2025-01-01T00:00:00Z",
    "transactions_processed": 150,
    "success_rate": 0.98
  },
  "replication_lag": {
    "current_lag_seconds": 2.5,
    "main_lsn": "0/12345678",
    "replica_lsn": "0/12345600"
  }
}
Buffer Statistics#
curl http://localhost:8000/buffer-stats
Response:
{
  "pending_transactions": 12,
  "failed_transactions": 0,
  "processing_rate": 10.5,
  "average_execution_time_ms": 125,
  "buffer_size_mb": 2.3
}
Monitoring Alerts#
Set up monitoring for the following conditions (a polling sketch follows this list):
High buffer size:
Alert if buffer > 1000 transactions
Indicates network issue or main DB problem
Failed transactions:
Alert on any failed transaction
Requires human investigation
Replication lag:
Alert if lag > 5 minutes
Indicates replica performance issue
Background processor stopped:
Alert if processor not running
Critical - buffer won’t drain
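A minimal polling sketch that checks these thresholds against the /health response shown earlier (the alert helper and the use of httpx are assumptions; any HTTP client and alerting backend will do):
import asyncio

import httpx

HEALTH_URL = "http://localhost:8000/health"

def alert(message: str):
    # Placeholder: forward to your monitoring/alerting system of choice
    print(f"ALERT: {message}")

async def check_health_once():
    async with httpx.AsyncClient() as client:
        health = (await client.get(HEALTH_URL)).json()

    buffer = health["transaction_buffer"]
    if buffer["size"] > 1000:
        alert("High buffer size: possible network or main DB problem")
    if buffer["failed"] > 0:
        alert("Failed transactions present: human investigation required")
    if health["replication_lag"]["current_lag_seconds"] > 300:
        alert("Replication lag above 5 minutes")
    if health["background_processor"]["status"] != "running":
        alert("Background processor stopped: buffer will not drain")

asyncio.run(check_health_once())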
Summary#
Reliability is achieved through:
Transaction Buffering:
Immediate buffering to Redis
Never blocks client
Automatic async execution
Pre-generated IDs for immediate reference
LSN Tracking:
Precise replication state knowledge
Smart cache management
No guesswork about consistency
Automatic Retry:
Exponential backoff
Failed transaction queue
Human oversight for permanent failures
Eventual Consistency:
Accept lag in exchange for reliability
Smart queries merge buffered + persisted
Read buffer tracks updates to buffered records
Monitoring:
Health endpoints
Buffer statistics
Alerting for anomalies
The result: Observatory operations continue regardless of network state, data eventually becomes consistent, and we know precisely when that has happened.
Next Steps#
Transaction Buffering Overview - Technical implementation details
Debugging Transaction Buffering - Troubleshooting reliability issues