# Reliability First
```{eval-rst}
.. verified:: 2025-11-12
:reviewer: Christof Buchbender
```
The ops-db-api is designed with one overriding principle: **observatory operations must never fail due to network issues**. This document explains how we achieve this through transaction buffering, LSN tracking, and eventual consistency.
```{contents} Table of Contents
:depth: 2
:local: true
```
## The Core Requirement
During telescope observations at 5600m altitude in Chile:
**What must work - Operations**:
- Recording observation start/end times
- Logging instrument configurations
- Registering raw data files
- Tracking observation status changes
**What can fail**:
- Network connection to main database (Cologne)
- Replication lag to local replica
- Remote API calls for auxiliary services
**The guarantee**: Observatory operations **never** block or fail due to network issues.
## Transaction Buffering: The Safety Net
### How It Works
When operating at a secondary site (observatory):
1. **Write request arrives** at critical endpoint (e.g., create observation)
2. **Transaction is built** with pre-generated IDs
3. **Buffer to Redis** immediately (sub-millisecond)
4. **Return success** to client with pre-generated ID
5. **Background processor** executes transaction on main DB asynchronously
6. **LSN tracking** confirms when data reaches replica
7. **Cache cleanup** occurs after replication confirmed
```{eval-rst}
.. mermaid::

   sequenceDiagram
       participant Client
       participant API
       participant Redis
       participant BG as Background Processor
       participant MainDB as Main DB (Cologne)
       participant Replica as Local Replica

       Client->>API: POST /executed_obs_units/start
       Note over API: Build transaction<br/>Pre-generate UUID
       API->>Redis: LPUSH to buffer (instant)
       Redis-->>API: OK
       API-->>Client: 201 Created (UUID: abc123)
       Note over Client: Client continues<br/>operation immediately

       loop Background Processing
           BG->>Redis: RPOP from buffer
           BG->>MainDB: Execute transaction
           MainDB-->>BG: LSN: 0/12345678
           BG->>Replica: Poll replay LSN
           alt Replicated
               Replica-->>BG: LSN: 0/12345678 (caught up)
               BG->>Redis: Cleanup cache
           else Not Yet
               Replica-->>BG: LSN: 0/12345600 (behind)
               BG->>Redis: Extend cache TTL
           end
       end
```
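The steps above can be sketched end-to-end. Since the real implementation lives in the transaction-buffering module, the following is a minimal simulation that substitutes an in-memory deque for the Redis list (same LPUSH/RPOP semantics); names like `buffer_write` and `drain_one` are illustrative, not the actual API:

```python
import json
import uuid
from collections import deque

# In-memory stand-in for the Redis buffer list (LPUSH/RPOP semantics).
buffer = deque()


def buffer_write(model, data):
    """Buffer a write locally and return a pre-generated ID immediately."""
    record_id = str(uuid.uuid4())  # ID exists before any DB insert
    transaction = {"id": record_id, "model": model, "data": data}
    buffer.appendleft(json.dumps(transaction))  # LPUSH: instant local write
    return record_id  # client can reference the record right away


def drain_one():
    """Background processor step: pop the oldest buffered transaction."""
    if not buffer:
        return None
    return json.loads(buffer.pop())  # RPOP: first-in, first-out
```

Pushing at one end and popping at the other gives first-in, first-out processing, so buffered writes reach the main database in the order they were issued.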
### Benefits
**Immediate Response**:
- Client never waits for remote database
- Sub-millisecond latency for buffered operations
- No difference in experience during network issues
**Guaranteed Execution**:
- Redis persistence (AOF) ensures durability
- Automatic retry with exponential backoff
- Failed transactions move to dead-letter queue for investigation
**Operational Continuity**:
- Telescope observations never pause
- Operators get immediate feedback
- Data accumulates in buffer until network restored
### Pre-Generated IDs
A critical feature: IDs are generated **before** database insertion:
```python
# Traditional approach (doesn't work for buffering)
def create_observation(data):
    obs = Observation(**data)
    db.add(obs)
    db.commit()
    return obs.id  # ID only available after DB commit!


# Our approach (works with buffering)
def create_observation(data):
    obs_id = uuid.uuid4()  # Generate immediately
    transaction = builder.create(
        model_class=Observation,
        data={"id": obs_id, **data},
    )
    buffer_transaction(transaction)
    return obs_id  # Available immediately!
```
**Why this matters**:
- Client can reference the observation immediately
- Can create related records before DB commit (e.g., files linked to observation)
- Status queries work immediately (check buffer + DB)
See {doc}`../deep-dive/transaction-buffering/transaction-builder` for implementation.
## LSN Tracking: Precision Over Guesswork
### The Problem with Guessing
**Without LSN tracking**, we'd have to guess when data has replicated:
```python
# Naive approach (unreliable)
async def wait_for_replication(record_id):
    await asyncio.sleep(5)  # Hope it's replicated by now?
    return check_replica(record_id)  # Might not be there yet!
```
**Problems**:
- Hard-coded delays waste time (too long) or miss replication (too short)
- No way to know actual replication state
- Can't adjust to varying network conditions
- Cache management is guesswork
### LSN: The Solution
**PostgreSQL Log Sequence Numbers (LSN)** provide precision:
**After write to main database**:
```sql
-- Get LSN after transaction commit
SELECT pg_current_wal_lsn();
-- Returns: '0/12345678'
```
**Check replica progress**:
```sql
-- Get LSN that replica has replayed
SELECT pg_last_wal_replay_lsn();
-- Returns: '0/12345600' (slightly behind)
```
**Compare**:
```python
main_lsn = "0/12345678"
replica_lsn = "0/12345600"

# Compare numerically: comparing LSN strings lexicographically is
# unreliable once the hex components differ in length.
if parse_lsn(replica_lsn) >= parse_lsn(main_lsn):
    print("Replica has our data!")
else:
    bytes_behind = parse_lsn(main_lsn) - parse_lsn(replica_lsn)
    print(f"Replica is {bytes_behind} bytes behind")
```
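The `parse_lsn` helper used above is small: a textual LSN is two 32-bit hex numbers separated by `/`, the high and low halves of a 64-bit WAL byte position. A minimal implementation:

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN string like '0/12345678' to a 64-bit byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)
```

Comparing the parsed integers is safe where lexicographic string comparison is not (`'F/0' > '10/0'` as strings, but `0xF < 0x10` numerically).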
### How We Use LSN
**1. Track Transaction LSN**:
When background processor executes a buffered transaction:
```python
from sqlalchemy import text

# Execute the buffered transaction on the main database
await executor.execute_transaction(transaction, session)
await session.commit()

# Capture the WAL position (LSN) of the commit
result = await session.execute(text("SELECT pg_current_wal_lsn()"))
lsn = result.scalar()

# Store with the transaction (redis-py: setex(name, time, value))
await redis.setex(f"transaction:{tx_id}:lsn", 3600, lsn)
```
**2. Poll Replica Status**:
Periodically check if replica has caught up:
```python
async def check_replication(transaction_lsn):
    # Ask the replica which LSN it has replayed so far
    result = await replica_session.execute(
        text("SELECT pg_last_wal_replay_lsn()")  # sqlalchemy.text
    )
    replica_lsn = result.scalar()

    # Compare numerically via parse_lsn, not as strings
    return parse_lsn(replica_lsn) >= parse_lsn(transaction_lsn)
```
**3. Smart Cache Management**:
Adjust cache TTL based on replication state:
```python
if replicated:
    # Data is in the replica; the buffer cache can be cleaned up
    await cleanup_transaction_cache(tx_id)
else:
    # Still replicating; keep the cache around longer
    await extend_cache_ttl(tx_id, additional_seconds=60)
```
**4. Timeout Handling**:
If replication takes too long:
```python
if time_waiting > LSN_TIMEOUT:
    # Replication is very slow or stuck
    logger.warning(
        f"LSN {transaction_lsn} not replicated after {LSN_TIMEOUT}s"
    )
    # Keep cache indefinitely (or until manual cleanup)
    await set_cache_ttl(tx_id, ttl=None)
```
See {doc}`../deep-dive/transaction-buffering/lsn-tracking` for full implementation.
## Automatic Retry with Exponential Backoff
When transaction execution fails, we retry intelligently:
### Retry Configuration
```bash
# .env
TRANSACTION_RETRY_ATTEMPTS=3
TRANSACTION_RETRY_DELAY=5 # Initial delay in seconds
```
### Retry Strategy
```python
async def execute_with_retry(transaction):
    attempt = 0
    delay = TRANSACTION_RETRY_DELAY

    while attempt < TRANSACTION_RETRY_ATTEMPTS:
        try:
            return await execute_transaction(transaction)
        except NetworkError as e:
            attempt += 1
            if attempt >= TRANSACTION_RETRY_ATTEMPTS:
                # Out of retries: move to the dead-letter queue
                await move_to_failed_queue(transaction)
                raise
            logger.warning(
                f"Transaction failed (attempt {attempt}), "
                f"retrying in {delay}s: {e}"
            )
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff
```
**Why exponential backoff**:
- Transient network issues often resolve quickly (first retry)
- Persistent issues need longer wait (later retries)
- Prevents overwhelming failed service with retry storm
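The delay schedule produced by the retry loop can be computed directly; a sketch (with the defaults `TRANSACTION_RETRY_ATTEMPTS=3` and `TRANSACTION_RETRY_DELAY=5`, the processor sleeps 5 s and then 10 s before the final attempt):

```python
def backoff_delays(attempts, initial_delay):
    """Delays slept between attempts: initial, initial*2, initial*4, ..."""
    # One sleep before each retry; the last failure is routed to the
    # dead-letter queue instead of sleeping again.
    return [initial_delay * (2 ** i) for i in range(attempts - 1)]
```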
### Failed Transaction Handling
If all retries fail:
```python
# Move to dead-letter queue
await redis.lpush(
    f"site:{site_name}:failed_transactions",
    json.dumps(transaction),
)

# Log for investigation
logger.error(
    f"Transaction {tx_id} failed after {attempts} attempts: {error}"
)

# Alert monitoring system
await alert_monitoring(
    severity="error",
    message=f"Transaction {tx_id} failed permanently",
)
```
**Human intervention needed**:
- Investigate root cause (network? database? bug?)
- Manually replay failed transaction
- Or mark as resolved if no longer needed
See {doc}`../development/debugging-buffering` for debugging failed transactions.
## Eventual Consistency: Accept the Lag
### Why Eventual Is Good Enough
**Strong consistency** would require:
```python
async def create_observation(data):
    # Write to main DB
    await main_db.insert(data)

    # Wait for all replicas to confirm
    for replica in replicas:
        await replica.wait_for_replication()  # Blocks!
        # If any replica is unreachable: FAIL

    return success
```
**This doesn't work at observatory** because network to Cologne can be down.
**Eventual consistency** means:
```python
async def create_observation(data):
    # Buffer locally
    await buffer_transaction(data)  # Never fails!

    # Return immediately
    return success  # Client continues

# Background: sync when possible.
# Eventually all sites converge to the same data.
```
**The guarantee**: Data *will* become consistent; we just can't say exactly *when*.
### How Long is "Eventually"?
**Best case** (good network):
- Buffered: < 1ms (Redis write)
- Executed on main: 1-5 seconds (background processor)
- Replicated to observatory: 1-10 seconds (depending on lag)
- **Total**: Seconds
**Worst case** (network down):
- Buffered: < 1ms (Redis write)
- Executed on main: Hours to days (when network restored)
- Replicated to observatory: Hours to days + replication time
- **Total**: Until network restoration + sync time
**Acceptable because**:
- Observatory operations don't block (buffering works)
- Scientists query data hours/days later (lag doesn't matter)
- Real-time monitoring uses WebSockets (immediate local updates)
### Smart Query Manager: Consistent View
Even during replication lag, reads are consistent:
```python
async def get_observations(obs_unit_id):
    # Query the local replica
    db_records = await db.query(Observation).filter(
        Observation.obs_unit_id == obs_unit_id
    )

    # Query the transaction buffer
    buffered_records = await get_buffered_records(
        "Observation",
        {"obs_unit_id": obs_unit_id},
    )

    # Query the read buffer (updates to buffered records)
    read_buffer_updates = await get_read_buffer_updates(
        "Observation",
        {"obs_unit_id": obs_unit_id},
    )

    # Merge: buffer + updates override DB records
    return merge_records(
        db_records,
        buffered_records,
        read_buffer_updates,
    )
```
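`merge_records` is not shown in the snippet; conceptually it reduces to a precedence merge keyed by record ID, with later sources overriding earlier ones. A hypothetical sketch over plain dicts (the real manager works on ORM rows):

```python
def merge_records(db_records, buffered_records, read_buffer_updates):
    """Merge the three record sources; later sources take precedence."""
    merged = {}

    # Lowest precedence first: replica rows, then buffered inserts.
    for record in list(db_records) + list(buffered_records):
        merged[record["id"]] = dict(record)

    # Read-buffer updates are partial: overlay only the changed fields.
    for update in read_buffer_updates:
        merged.setdefault(update["id"], {}).update(update)

    return list(merged.values())
```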
**Result**: Client always sees latest state, even if not yet in database.
See {doc}`../deep-dive/transaction-buffering/smart-query-manager` for details.
## Read Buffer: Mutable Buffered Records
### The Problem
Buffered records sometimes need updates before replication:
```python
# Create observation (buffered)
obs_id = await create_observation({
    "status": "running",
    "start_time": "2025-01-01T00:00:00Z",
})

# 10 seconds later the observation finishes, but the original
# record is still in the buffer (not in the DB yet).
# How do we update it?
```
- **Can't modify the database**: the record isn't there yet
- **Can't modify the buffer**: it would break transaction atomicity
### The Solution: Read Buffer
Separate buffer for tracking updates to buffered records:
```python
# Update the buffered observation
await update_read_buffer(
    model_class="Observation",
    record_id=obs_id,
    updates={
        "status": "completed",
        "end_time": "2025-01-01T00:10:00Z",
    },
    transaction_id=original_tx_id,
)
```
**Read buffer tracks**:
- Which buffered record to update
- What fields changed
- When update happened
- Which transaction created original record
**Smart query manager merges**:
1. Database record (if replicated)
2. Buffered record (if still in buffer)
3. Read buffer updates (latest changes)
**Cleanup happens** when LSN confirms replication.
See {doc}`../deep-dive/transaction-buffering/read-buffer-manager` for implementation.
## UI Can Tolerate Staleness
### Why UI Users Don't Mind
The web UI (ops-db-ui) queries the API for data visualization:
**Typical queries**:
- Transfer overview dashboard
- Observation history for past week
- Source visibility calculations
- Raw data package listings
**Staleness tolerance**:
- **Seconds of lag**: Imperceptible to human users
- **Minutes of lag**: Acceptable for most dashboards
- **Hours of lag**: Only matters for real-time monitoring
**Real-time needs** (WebSockets):
For truly real-time data (active transfers, current observations):
- WebSockets provide immediate updates
- Redis pub/sub broadcasts changes
- UI updates instantly regardless of replication
**Example**: Transfer monitoring
```text
# REST API: may be seconds behind replication
GET /api/transfer/overview

# WebSocket: immediate updates
WS /api/transfer/ws/overview
# Receives { "transfer_id": 123, "status": "completed" } instantly
```
## Health Monitoring
The API exposes health information about the buffering system:
### Health Endpoint
```bash
curl http://localhost:8000/health
```
Response:
```json
{
"status": "healthy",
"database": "connected",
"redis": "connected",
"transaction_buffer": {
"size": 12,
"failed": 0,
"oldest_pending_seconds": 5.2
},
"background_processor": {
"status": "running",
"last_run": "2025-01-01T00:00:00Z",
"transactions_processed": 150,
"success_rate": 0.98
},
"replication_lag": {
"current_lag_seconds": 2.5,
"main_lsn": "0/12345678",
"replica_lsn": "0/12345600"
}
}
```
### Buffer Statistics
```bash
curl http://localhost:8000/buffer-stats
```
Response:
```json
{
"pending_transactions": 12,
"failed_transactions": 0,
"processing_rate": 10.5,
"average_execution_time_ms": 125,
"buffer_size_mb": 2.3
}
```
### Monitoring Alerts
Set up monitoring for:
**High buffer size**:
- Alert if buffer > 1000 transactions
- Indicates network issue or main DB problem
**Failed transactions**:
- Alert on any failed transaction
- Requires human investigation
**Replication lag**:
- Alert if lag > 5 minutes
- Indicates replica performance issue
**Background processor stopped**:
- Alert if processor not running
- Critical - buffer won't drain
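These conditions can be folded into a single check that consumes the `/health` and `/buffer-stats` payloads shown earlier; a sketch with hypothetical threshold constants matching the list above:

```python
def evaluate_alerts(health, stats):
    """Return alert messages for the monitoring conditions listed above."""
    alerts = []
    if stats["pending_transactions"] > 1000:
        alerts.append("buffer backlog high: network or main-DB problem likely")
    if stats["failed_transactions"] > 0:
        alerts.append("failed transactions present: human investigation needed")
    if health["replication_lag"]["current_lag_seconds"] > 300:
        alerts.append("replication lag over 5 minutes")
    if health["background_processor"]["status"] != "running":
        alerts.append("background processor stopped: buffer will not drain")
    return alerts
```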
## Summary
Reliability is achieved through:
**Transaction Buffering**:
- Immediate buffering to Redis
- Never blocks client
- Automatic async execution
- Pre-generated IDs for immediate reference
**LSN Tracking**:
- Precise replication state knowledge
- Smart cache management
- No guesswork about consistency
**Automatic Retry**:
- Exponential backoff
- Failed transaction queue
- Human oversight for permanent failures
**Eventual Consistency**:
- Accept lag in exchange for reliability
- Smart queries merge buffered + persisted
- Read buffer tracks updates to buffered records
**Monitoring**:
- Health endpoints
- Buffer statistics
- Alerting for anomalies
**The result**: Observatory operations continue regardless of network state, data eventually becomes consistent, and we know precisely when that has happened.
## Next Steps
- {doc}`../deep-dive/transaction-buffering/overview` - Technical implementation details
- {doc}`../development/debugging-buffering` - Troubleshooting reliability issues