# Reliability First ```{eval-rst} .. verified:: 2025-11-12 :reviewer: Christof Buchbender ``` The ops-db-api is designed with one overriding principle: **observatory operations must never fail due to network issues**. This document explains how we achieve this through transaction buffering, LSN tracking, and eventual consistency. ```{contents} Table of Contents :depth: 2 :local: true ``` ## The Core Requirement During telescope observations at 5600m altitude in Chile: **What must work - Operations**: - Recording observation start/end times - Logging instrument configurations - Registering raw data files - Tracking observation status changes **What can fail**: - Network connection to main database (Cologne) - Replication lag to local replica - Remote API calls for auxiliary services **The guarantee**: Observatory operations **never** block or fail due to network issues. ## Transaction Buffering: The Safety Net ### How It Works When operating at a secondary site (observatory): 1. **Write request arrives** at critical endpoint (e.g., create observation) 2. **Transaction is built** with pre-generated IDs 3. **Buffer to Redis** immediately (sub-millisecond) 4. **Return success** to client with pre-generated ID 5. **Background processor** executes transaction on main DB asynchronously 6. **LSN tracking** confirms when data reaches replica 7. **Cache cleanup** occurs after replication confirmed ```{eval-rst} .. mermaid:: sequenceDiagram participant Client participant API participant Redis participant BG as Background
Processor participant MainDB as Main DB
(Cologne) participant Replica as Local
Replica Client->>API: POST /executed_obs_units/start Note over API: Build transaction
Pre-generate UUID API->>Redis: LPUSH to buffer (instant) Redis-->>API: OK API-->>Client: 201 Created (UUID: abc123) Note over Client: Client continues
operation immediately loop Background Processing BG->>Redis: RPOP from buffer BG->>MainDB: Execute transaction MainDB-->>BG: LSN: 0/12345678 BG->>Replica: Poll replay LSN alt Replicated Replica-->>BG: LSN: 0/12345678 (caught up) BG->>Redis: Cleanup cache else Not Yet Replica-->>BG: LSN: 0/12345600 (behind) BG->>Redis: Extend cache TTL end end ``` ### Benefits **Immediate Response**: - Client never waits for remote database - Sub-millisecond latency for buffered operations - No difference in experience during network issues **Guaranteed Execution**: - Redis persistence (AOF) ensures durability - Automatic retry with exponential backoff - Failed transactions move to dead-letter queue for investigation **Operational Continuity**: - Telescope observations never pause - Operators get immediate feedback - Data accumulates in buffer until network restored ### Pre-Generated IDs A critical feature: IDs are generated **before** database insertion: ```python # Traditional approach (doesn't work for buffering) def create_observation(data): obs = Observation(**data) db.add(obs) db.commit() return obs.id # ID only available after DB commit! # Our approach (works with buffering) def create_observation(data): obs_id = uuid.uuid4() # Generate immediately transaction = builder.create( model_class=Observation, data={"id": obs_id, **data} ) buffer_transaction(transaction) return obs_id # Available immediately! ``` **Why this matters**: - Client can reference the observation immediately - Can create related records before DB commit (e.g., files linked to observation) - Status queries work immediately (check buffer + DB) See {doc}`../deep-dive/transaction-buffering/transaction-builder` for implementation. ## LSN Tracking: Precision Over Guesswork ### The Problem with Guessing **Without LSN tracking**, we'd have to guess when data has replicated: ```python # Naive approach (unreliable) def wait_for_replication(record_id): await asyncio.sleep(5) # Hope it's replicated by now? return check_replica(record_id) # Might not be there yet! ``` **Problems**: - Hard-coded delays waste time (too long) or miss replication (too short) - No way to know actual replication state - Can't adjust to varying network conditions - Cache management is guesswork ### LSN: The Solution **PostgreSQL Log Sequence Numbers (LSN)** provide precision: **After write to main database**: ```sql -- Get LSN after transaction commit SELECT pg_current_wal_lsn(); -- Returns: '0/12345678' ``` **Check replica progress**: ```sql -- Get LSN that replica has replayed SELECT pg_last_wal_replay_lsn(); -- Returns: '0/12345600' (slightly behind) ``` **Compare**: ```python main_lsn = "0/12345678" replica_lsn = "0/12345600" if replica_lsn >= main_lsn: print("Replica has our data!") else: bytes_behind = parse_lsn(main_lsn) - parse_lsn(replica_lsn) print(f"Replica is {bytes_behind} bytes behind") ``` ### How We Use LSN **1. Track Transaction LSN**: When background processor executes a buffered transaction: ```python # Execute transaction await executor.execute_transaction(transaction, session) await session.commit() # Capture LSN result = await session.execute("SELECT pg_current_wal_lsn()") lsn = result.scalar() # Store with transaction await redis.setex( f"transaction:{tx_id}:lsn", ttl=3600, value=lsn ) ``` **2. Poll Replica Status**: Periodically check if replica has caught up: ```python async def check_replication(transaction_lsn): # Query replica result = await replica_session.execute( "SELECT pg_last_wal_replay_lsn()" ) replica_lsn = result.scalar() # Compare if replica_lsn >= transaction_lsn: return True # Replicated! else: return False # Still waiting ``` **3. Smart Cache Management**: Adjust cache TTL based on replication state: ```python if replicated: # Data is in replica, can cleanup buffer await cleanup_transaction_cache(tx_id) else: # Still replicating, keep cache longer await extend_cache_ttl(tx_id, additional_seconds=60) ``` **4. Timeout Handling**: If replication takes too long: ```python if time_waiting > LSN_TIMEOUT: # Replication very slow or stuck logger.warning( f"LSN {transaction_lsn} not replicated after {LSN_TIMEOUT}s" ) # Keep cache indefinitely (or until manual cleanup) await set_cache_ttl(tx_id, ttl=None) ``` See {doc}`../deep-dive/transaction-buffering/lsn-tracking` for full implementation. ## Automatic Retry with Exponential Backoff When transaction execution fails, we retry intelligently: ### Retry Configuration ```bash # .env TRANSACTION_RETRY_ATTEMPTS=3 TRANSACTION_RETRY_DELAY=5 # Initial delay in seconds ``` ### Retry Strategy ```python async def execute_with_retry(transaction): attempt = 0 delay = TRANSACTION_RETRY_DELAY while attempt < TRANSACTION_RETRY_ATTEMPTS: try: result = await execute_transaction(transaction) return result except NetworkError as e: attempt += 1 if attempt >= TRANSACTION_RETRY_ATTEMPTS: # Move to failed queue await move_to_failed_queue(transaction) raise # Exponential backoff logger.warning( f"Transaction failed (attempt {attempt}), " f"retrying in {delay}s: {e}" ) await asyncio.sleep(delay) delay *= 2 # Exponential backoff ``` **Why exponential backoff**: - Transient network issues often resolve quickly (first retry) - Persistent issues need longer wait (later retries) - Prevents overwhelming failed service with retry storm ### Failed Transaction Handling If all retries fail: ```python # Move to dead-letter queue await redis.lpush( f"site:{site_name}:failed_transactions", json.dumps(transaction) ) # Log for investigation logger.error( f"Transaction {tx_id} failed after {attempts} attempts: {error}" ) # Alert monitoring system await alert_monitoring( severity="error", message=f"Transaction {tx_id} failed permanently" ) ``` **Human intervention needed**: - Investigate root cause (network? database? bug?) - Manually replay failed transaction - Or mark as resolved if no longer needed See {doc}`../development/debugging-buffering` for debugging failed transactions. ## Eventual Consistency: Accept the Lag ### Why Eventual Is Good Enough **Strong consistency** would require: ```python def create_observation(data): # Write to main DB await main_db.insert(data) # Wait for all replicas to confirm for replica in replicas: await replica.wait_for_replication() # Blocks! # If any replica unreachable: FAIL return success ``` **This doesn't work at observatory** because network to Cologne can be down. **Eventual consistency** means: ```python def create_observation(data): # Buffer locally await buffer_transaction(data) # Never fails! # Return immediately return success # Client continues # Background: sync when possible # Eventually all sites will have consistent data ``` **The guarantee**: Data *will* become consistent, we just can't say exactly *when*. ### How Long is "Eventually"? **Best case** (good network): - Buffered: < 1ms (Redis write) - Executed on main: 1-5 seconds (background processor) - Replicated to observatory: 1-10 seconds (depending on lag) - **Total**: Seconds **Worst case** (network down): - Buffered: < 1ms (Redis write) - Executed on main: Hours to days (when network restored) - Replicated to observatory: Hours to days + replication time - **Total**: Until network restoration + sync time **Acceptable because**: - Observatory operations don't block (buffering works) - Scientists query data hours/days later (lag doesn't matter) - Real-time monitoring uses WebSockets (immediate local updates) ### Smart Query Manager: Consistent View Even during replication lag, reads are consistent: ```python async def get_observations(obs_unit_id): # Query local replica db_records = await db.query( Observation ).filter( Observation.obs_unit_id == obs_unit_id ) # Query buffer buffered_records = await get_buffered_records( "Observation", {"obs_unit_id": obs_unit_id} ) # Query read buffer (updates to buffered records) read_buffer_updates = await get_read_buffer_updates( "Observation", {"obs_unit_id": obs_unit_id} ) # Merge: buffer + updates override DB return merge_records( db_records, buffered_records, read_buffer_updates ) ``` **Result**: Client always sees latest state, even if not yet in database. See {doc}`../deep-dive/transaction-buffering/smart-query-manager` for details. ## Read Buffer: Mutable Buffered Records ### The Problem Buffered records sometimes need updates before replication: ```python # Create observation (buffered) obs_id = await create_observation({ "status": "running", "start_time": "2025-01-01T00:00:00Z" }) # 10 seconds later: observation finishes # But original record still in buffer (not in DB yet) # How to update it? ``` **Can't modify database** (record not there yet) **Can't modify buffer** (breaks transaction atomicity) ### The Solution: Read Buffer Separate buffer for tracking updates to buffered records: ```python # Update buffered observation await update_read_buffer( model_class="Observation", record_id=obs_id, updates={ "status": "completed", "end_time": "2025-01-01T00:10:00Z" }, transaction_id=original_tx_id ) ``` **Read buffer tracks**: - Which buffered record to update - What fields changed - When update happened - Which transaction created original record **Smart query manager merges**: 1. Database record (if replicated) 2. Buffered record (if still in buffer) 3. Read buffer updates (latest changes) **Cleanup happens** when LSN confirms replication. See {doc}`../deep-dive/transaction-buffering/read-buffer-manager` for implementation. ## UI Can Tolerate Staleness ### Why UI Users Don't Mind The web UI (ops-db-ui) queries the API for data visualization: **Typical queries**: - Transfer overview dashboard - Observation history for past week - Source visibility calculations - Raw data package listings **Staleness tolerance**: - **Seconds of lag**: Imperceptible to human users - **Minutes of lag**: Acceptable for most dashboards - **Hours of lag**: Only matters for real-time monitoring **Real-time needs** (WebSockets): For truly real-time data (active transfers, current observations): - WebSockets provide immediate updates - Redis pub/sub broadcasts changes - UI updates instantly regardless of replication **Example**: Transfer monitoring ```python # REST API: May be seconds behind GET /api/transfer/overview # WebSocket: Immediate updates WS /api/transfer/ws/overview # Receives: { "transfer_id": 123, "status": "completed" } # Instantly! ``` ## Health Monitoring The API exposes health information about the buffering system: ### Health Endpoint ```bash curl http://localhost:8000/health ``` Response: ```json { "status": "healthy", "database": "connected", "redis": "connected", "transaction_buffer": { "size": 12, "failed": 0, "oldest_pending_seconds": 5.2 }, "background_processor": { "status": "running", "last_run": "2025-01-01T00:00:00Z", "transactions_processed": 150, "success_rate": 0.98 }, "replication_lag": { "current_lag_seconds": 2.5, "main_lsn": "0/12345678", "replica_lsn": "0/12345600" } } ``` ### Buffer Statistics ```bash curl http://localhost:8000/buffer-stats ``` Response: ```json { "pending_transactions": 12, "failed_transactions": 0, "processing_rate": 10.5, "average_execution_time_ms": 125, "buffer_size_mb": 2.3 } ``` ### Monitoring Alerts Set up monitoring for: **High buffer size**: - Alert if buffer > 1000 transactions - Indicates network issue or main DB problem **Failed transactions**: - Alert on any failed transaction - Requires human investigation **Replication lag**: - Alert if lag > 5 minutes - Indicates replica performance issue **Background processor stopped**: - Alert if processor not running - Critical - buffer won't drain ## Summary Reliability is achieved through: **Transaction Buffering**: - Immediate buffering to Redis - Never blocks client - Automatic async execution - Pre-generated IDs for immediate reference **LSN Tracking**: - Precise replication state knowledge - Smart cache management - No guesswork about consistency **Automatic Retry**: - Exponential backoff - Failed transaction queue - Human oversight for permanent failures **Eventual Consistency**: - Accept lag in exchange for reliability - Smart queries merge buffered + persisted - Read buffer tracks updates to buffered records **Monitoring**: - Health endpoints - Buffer statistics - Alerting for anomalies **The result**: Observatory operations continue regardless of network state, data eventually becomes consistent, and we know precisely when that has happened. ## Next Steps - {doc}`../deep-dive/transaction-buffering/overview` - Technical implementation details - {doc}`../development/debugging-buffering` - Troubleshooting reliability issues