Reliability First#

Documentation verified. Last checked: 2025-11-12. Reviewer: Christof Buchbender.

The ops-db-api is designed with one overriding principle: observatory operations must never fail due to network issues. This document explains how we achieve this through transaction buffering, LSN tracking, and eventual consistency.

The Core Requirement#

During telescope observations at 5600m altitude in Chile:

What must work (operations):

  • Recording observation start/end times

  • Logging instrument configurations

  • Registering raw data files

  • Tracking observation status changes

What can fail:

  • Network connection to main database (Cologne)

  • Replication lag to local replica

  • Remote API calls for auxiliary services

The guarantee: Observatory operations never block or fail due to network issues.

Transaction Buffering: The Safety Net#

How It Works#

When operating at a secondary site (observatory):

  1. Write request arrives at critical endpoint (e.g., create observation)

  2. Transaction is built with pre-generated IDs

  3. Buffer to Redis immediately (sub-millisecond)

  4. Return success to client with pre-generated ID

  5. Background processor executes transaction on main DB asynchronously

  6. LSN tracking confirms when data reaches replica

  7. Cache cleanup occurs after replication confirmed

sequenceDiagram
    participant Client
    participant API
    participant Redis
    participant BG as Background<br/>Processor
    participant MainDB as Main DB<br/>(Cologne)
    participant Replica as Local<br/>Replica

    Client->>API: POST /executed_obs_units/start
    Note over API: Build transaction<br/>Pre-generate UUID
    API->>Redis: LPUSH to buffer (instant)
    Redis-->>API: OK
    API-->>Client: 201 Created (UUID: abc123)

    Note over Client: Client continues<br/>operation immediately

    loop Background Processing
        BG->>Redis: RPOP from buffer
        BG->>MainDB: Execute transaction
        MainDB-->>BG: LSN: 0/12345678
        BG->>Replica: Poll replay LSN
        alt Replicated
            Replica-->>BG: LSN: 0/12345678 (caught up)
            BG->>Redis: Cleanup cache
        else Not Yet
            Replica-->>BG: LSN: 0/12345600 (behind)
            BG->>Redis: Extend cache TTL
        end
    end
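
The background side of this flow (steps 5-7) can be pictured as a small drain loop. The sketch below is illustrative only: the helper functions and the per-site buffer key are assumptions, not the actual implementation.

import asyncio
import json

# Sketch of the background drain loop; execute_on_main_db,
# replica_has_replayed, cleanup_transaction_cache and extend_cache_ttl
# are hypothetical helpers, and the buffer key name is an assumption
async def drain_buffer(redis, site_name: str):
    while True:
        raw = await redis.rpop(f"site:{site_name}:transactions")
        if raw is None:
            await asyncio.sleep(1)  # buffer empty, poll again shortly
            continue
        transaction = json.loads(raw)
        lsn = await execute_on_main_db(transaction)             # step 5
        if await replica_has_replayed(lsn):                     # step 6
            await cleanup_transaction_cache(transaction["id"])  # step 7
        else:
            await extend_cache_ttl(transaction["id"], 60)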

Benefits#

Immediate Response:

  • Client never waits for remote database

  • Sub-millisecond latency for buffered operations

  • No difference in experience during network issues

Guaranteed Execution:

  • Redis persistence (AOF) ensures durability

  • Automatic retry with exponential backoff

  • Failed transactions move to dead-letter queue for investigation

Operational Continuity:

  • Telescope observations never pause

  • Operators get immediate feedback

  • Data accumulates in buffer until network restored
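
A minimal sketch of the buffering step itself, assuming a JSON-serializable transaction dict and the same per-site key as in the drain loop above:

import json

# Sketch: push a transaction onto the Redis buffer (key name assumed).
# LPUSH returns in well under a millisecond against a local Redis, and
# AOF persistence lets the entry survive a Redis restart.
async def buffer_transaction(redis, site_name: str, transaction: dict) -> None:
    await redis.lpush(
        f"site:{site_name}:transactions",
        json.dumps(transaction),
    )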

Pre-Generated IDs#

A critical feature: IDs are generated before database insertion:

# Traditional approach (doesn't work for buffering)
def create_observation(data):
    obs = Observation(**data)
    db.add(obs)
    db.commit()
    return obs.id  # ID only available after DB commit!

# Our approach (works with buffering)
def create_observation(data):
    obs_id = uuid.uuid4()  # Generate immediately
    transaction = builder.create(
        model_class=Observation,
        data={"id": obs_id, **data}
    )
    buffer_transaction(transaction)
    return obs_id  # Available immediately!

Why this matters:

  • Client can reference the observation immediately

  • Can create related records before DB commit (e.g., files linked to observation)

  • Status queries work immediately (check buffer + DB)
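
For example, a dependent record can reference the observation before the observation itself has reached the database (register_raw_file is a hypothetical helper):

# The UUID is usable the moment create_observation returns
obs_id = create_observation({"status": "running"})

# Hypothetical helper: link a raw file to the still-buffered observation
register_raw_file({"observation_id": obs_id, "path": "raw/scan_001.fits"})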

See Transaction Builder for implementation.

LSN Tracking: Precision Over Guesswork#

The Problem with Guessing#

Without LSN tracking, we’d have to guess when data has replicated:

# Naive approach (unreliable)
async def wait_for_replication(record_id):
    await asyncio.sleep(5)  # Hope it's replicated by now?
    return await check_replica(record_id)  # Might not be there yet!

Problems:

  • Hard-coded delays waste time (too long) or miss replication (too short)

  • No way to know actual replication state

  • Can’t adjust to varying network conditions

  • Cache management is guesswork

LSN: The Solution#

PostgreSQL Log Sequence Numbers (LSN) provide precision:

After write to main database:

-- Get LSN after transaction commit
SELECT pg_current_wal_lsn();
-- Returns: '0/12345678'

Check replica progress:

-- Get LSN that replica has replayed
SELECT pg_last_wal_replay_lsn();
-- Returns: '0/12345600' (slightly behind)

Compare:

def parse_lsn(lsn: str) -> int:
    """Convert an 'X/Y' LSN string to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

main_lsn = "0/12345678"
replica_lsn = "0/12345600"

# Compare byte positions, not strings (lexical comparison breaks
# once the hex segments differ in length)
if parse_lsn(replica_lsn) >= parse_lsn(main_lsn):
    print("Replica has our data!")
else:
    bytes_behind = parse_lsn(main_lsn) - parse_lsn(replica_lsn)
    print(f"Replica is {bytes_behind} bytes behind")

How We Use LSN#

1. Track Transaction LSN:

When background processor executes a buffered transaction:

from sqlalchemy import text

# Execute transaction
await executor.execute_transaction(transaction, session)
await session.commit()

# Capture LSN on the main database
result = await session.execute(text("SELECT pg_current_wal_lsn()"))
lsn = result.scalar()

# Store with the transaction (setex args: key, TTL in seconds, value)
await redis.setex(f"transaction:{tx_id}:lsn", 3600, lsn)

2. Poll Replica Status:

Periodically check if replica has caught up:

async def check_replication(transaction_lsn):
    # Ask the replica how far it has replayed
    result = await replica_session.execute(
        text("SELECT pg_last_wal_replay_lsn()")
    )
    replica_lsn = result.scalar()

    # Compare byte positions via parse_lsn (see above)
    return parse_lsn(replica_lsn) >= parse_lsn(transaction_lsn)

3. Smart Cache Management:

Adjust cache TTL based on replication state:

if replicated:
    # Data is in replica, can cleanup buffer
    await cleanup_transaction_cache(tx_id)
else:
    # Still replicating, keep cache longer
    await extend_cache_ttl(tx_id, additional_seconds=60)

4. Timeout Handling:

If replication takes too long:

if time_waiting > LSN_TIMEOUT:
    # Replication very slow or stuck
    logger.warning(
        f"LSN {transaction_lsn} not replicated after {LSN_TIMEOUT}s"
    )
    # Keep cache indefinitely (or until manual cleanup)
    await set_cache_ttl(tx_id, ttl=None)

See LSN Tracking for full implementation.

Automatic Retry with Exponential Backoff#

When transaction execution fails, we retry intelligently:

Retry Configuration#

# .env
TRANSACTION_RETRY_ATTEMPTS=3
TRANSACTION_RETRY_DELAY=5  # Initial delay in seconds

Retry Strategy#

async def execute_with_retry(transaction):
    attempt = 0
    delay = TRANSACTION_RETRY_DELAY

    while attempt < TRANSACTION_RETRY_ATTEMPTS:
        try:
            result = await execute_transaction(transaction)
            return result
        except NetworkError as e:
            attempt += 1
            if attempt >= TRANSACTION_RETRY_ATTEMPTS:
                # Move to failed queue
                await move_to_failed_queue(transaction)
                raise

            # Exponential backoff
            logger.warning(
                f"Transaction failed (attempt {attempt}), "
                f"retrying in {delay}s: {e}"
            )
            await asyncio.sleep(delay)
            delay *= 2  # Exponential backoff

Why exponential backoff:

  • Transient network issues often resolve quickly (first retry)

  • Persistent issues need longer wait (later retries)

  • Prevents overwhelming failed service with retry storm
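
With the defaults above (3 attempts, 5 s initial delay), a failing transaction is retried after 5 s and again after 10 s; if the third attempt also fails, it moves to the failed queue.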

Failed Transaction Handling#

If all retries fail:

# Move to dead-letter queue
await redis.lpush(
    f"site:{site_name}:failed_transactions",
    json.dumps(transaction)
)

# Log for investigation
logger.error(
    f"Transaction {tx_id} failed after {attempts} attempts: {error}"
)

# Alert monitoring system
await alert_monitoring(
    severity="error",
    message=f"Transaction {tx_id} failed permanently"
)

Human intervention needed:

  • Investigate root cause (network? database? bug?)

  • Manually replay failed transaction

  • Or mark as resolved if no longer needed
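
A manual replay could look like the following sketch; the live buffer key and the simple requeue are assumptions, not the actual recovery procedure:

import json

# Sketch: pop the oldest failed transaction, inspect it, requeue it
async def replay_failed_transaction(redis, site_name: str):
    raw = await redis.rpop(f"site:{site_name}:failed_transactions")
    if raw is None:
        return None  # nothing to replay
    transaction = json.loads(raw)
    # After fixing the root cause, push it back onto the live buffer
    await redis.lpush(
        f"site:{site_name}:transactions",
        json.dumps(transaction),
    )
    return transaction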

See Debugging Transaction Buffering for how to investigate and replay failed transactions.

Eventual Consistency: Accept the Lag#

Why Eventual Is Good Enough#

Strong consistency would require:

async def create_observation(data):
    # Write to main DB
    await main_db.insert(data)

    # Wait for all replicas to confirm
    for replica in replicas:
        await replica.wait_for_replication()  # Blocks!

    # If any replica is unreachable: FAIL
    return success

This doesn’t work at the observatory because the network link to Cologne can be down.

Eventual consistency means:

async def create_observation(data):
    # Buffer locally
    await buffer_transaction(data)  # Never fails!

    # Return immediately
    return success  # Client continues

    # Background: sync when possible
    # Eventually all sites will have consistent data

The guarantee: Data will become consistent; we just can’t say exactly when.

How Long is “Eventually”?#

Best case (good network):

  • Buffered: < 1ms (Redis write)

  • Executed on main: 1-5 seconds (background processor)

  • Replicated to observatory: 1-10 seconds (depending on lag)

  • Total: Seconds

Worst case (network down):

  • Buffered: < 1ms (Redis write)

  • Executed on main: Hours to days (when network restored)

  • Replicated to observatory: Hours to days + replication time

  • Total: Until network restoration + sync time

Acceptable because:

  • Observatory operations don’t block (buffering works)

  • Scientists query data hours/days later (lag doesn’t matter)

  • Real-time monitoring uses WebSockets (immediate local updates)

Smart Query Manager: Consistent View#

Even during replication lag, reads are consistent:

from sqlalchemy import select

async def get_observations(obs_unit_id):
    # Query local replica
    result = await db.execute(
        select(Observation).where(Observation.obs_unit_id == obs_unit_id)
    )
    db_records = result.scalars().all()

    # Query buffer
    buffered_records = await get_buffered_records(
        "Observation",
        {"obs_unit_id": obs_unit_id}
    )

    # Query read buffer (updates to buffered records)
    read_buffer_updates = await get_read_buffer_updates(
        "Observation",
        {"obs_unit_id": obs_unit_id}
    )

    # Merge: buffer + updates override DB
    return merge_records(
        db_records,
        buffered_records,
        read_buffer_updates
    )

Result: The client always sees the latest state, even if it is not yet in the database.
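
One way merge_records could work, shown as a sketch that assumes dict-shaped records keyed by id and field-level updates from the read buffer:

def merge_records(db_records, buffered_records, read_buffer_updates):
    # Index by primary key; later sources override earlier ones
    merged = {record["id"]: record for record in db_records}
    for record in buffered_records:
        merged[record["id"]] = record  # buffered rows override the replica
    for record_id, updates in read_buffer_updates.items():
        if record_id in merged:
            # Field-level changes from the read buffer win last
            merged[record_id] = {**merged[record_id], **updates}
    return list(merged.values())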

See Smart Query Manager for details.

Read Buffer: Mutable Buffered Records#

The Problem#

Buffered records sometimes need updates before replication:

# Create observation (buffered)
obs_id = await create_observation({
    "status": "running",
    "start_time": "2025-01-01T00:00:00Z"
})

# 10 seconds later: observation finishes
# But original record still in buffer (not in DB yet)
# How to update it?

  • Can’t modify database (record not there yet)

  • Can’t modify buffer (breaks transaction atomicity)

The Solution: Read Buffer#

Separate buffer for tracking updates to buffered records:

# Update buffered observation
await update_read_buffer(
    model_class="Observation",
    record_id=obs_id,
    updates={
        "status": "completed",
        "end_time": "2025-01-01T00:10:00Z"
    },
    transaction_id=original_tx_id
)

Read buffer tracks:

  • Which buffered record to update

  • What fields changed

  • When update happened

  • Which transaction created original record

Smart query manager merges:

  1. Database record (if replicated)

  2. Buffered record (if still in buffer)

  3. Read buffer updates (latest changes)

Cleanup happens when LSN confirms replication.

See Read Buffer Manager for implementation.

UI Can Tolerate Staleness#

Why UI Users Don’t Mind#

The web UI (ops-db-ui) queries the API for data visualization:

Typical queries:

  • Transfer overview dashboard

  • Observation history for past week

  • Source visibility calculations

  • Raw data package listings

Staleness tolerance:

  • Seconds of lag: Imperceptible to human users

  • Minutes of lag: Acceptable for most dashboards

  • Hours of lag: Only matters for real-time monitoring

Real-time needs (WebSockets):

For truly real-time data (active transfers, current observations):

  • WebSockets provide immediate updates

  • Redis pub/sub broadcasts changes

  • UI updates instantly regardless of replication

Example: Transfer monitoring

# REST API: May be seconds behind
GET /api/transfer/overview

# WebSocket: Immediate updates
WS /api/transfer/ws/overview
# Receives: { "transfer_id": 123, "status": "completed" }
# Instantly!
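
A client can subscribe with any WebSocket library. A minimal sketch using the third-party websockets package (host and port assumed, matching the health examples below):

import asyncio
import json
import websockets

async def watch_transfers():
    uri = "ws://localhost:8000/api/transfer/ws/overview"
    async with websockets.connect(uri) as ws:
        async for message in ws:  # one JSON message per update
            update = json.loads(message)
            print(f"Transfer {update['transfer_id']}: {update['status']}")

asyncio.run(watch_transfers())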

Health Monitoring#

The API exposes health information about the buffering system:

Health Endpoint#

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "transaction_buffer": {
    "size": 12,
    "failed": 0,
    "oldest_pending_seconds": 5.2
  },
  "background_processor": {
    "status": "running",
    "last_run": "2025-01-01T00:00:00Z",
    "transactions_processed": 150,
    "success_rate": 0.98
  },
  "replication_lag": {
    "current_lag_seconds": 2.5,
    "main_lsn": "0/12345678",
    "replica_lsn": "0/12345600"
  }
}

Buffer Statistics#

curl http://localhost:8000/buffer-stats

Response:

{
  "pending_transactions": 12,
  "failed_transactions": 0,
  "processing_rate": 10.5,
  "average_execution_time_ms": 125,
  "buffer_size_mb": 2.3
}

Monitoring Alerts#

Set up monitoring for:

High buffer size:

  • Alert if buffer > 1000 transactions

  • Indicates network issue or main DB problem

Failed transactions:

  • Alert on any failed transaction

  • Requires human investigation

Replication lag:

  • Alert if lag > 5 minutes

  • Indicates replica performance issue

Background processor stopped:

  • Alert if processor not running

  • Critical - buffer won’t drain
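
A minimal poller for these conditions, sketched against the /health response shown above (httpx and the alert callable are assumptions):

import httpx

async def check_alerts(alert):
    async with httpx.AsyncClient() as client:
        health = (await client.get("http://localhost:8000/health")).json()

    if health["transaction_buffer"]["size"] > 1000:
        await alert("Buffer above 1000 transactions: network or main DB issue?")
    if health["transaction_buffer"]["failed"] > 0:
        await alert("Failed transactions present: human investigation needed")
    if health["replication_lag"]["current_lag_seconds"] > 300:
        await alert("Replication lag above 5 minutes: check replica performance")
    if health["background_processor"]["status"] != "running":
        await alert("Background processor stopped: buffer will not drain")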

Summary#

Reliability is achieved through:

Transaction Buffering:

  • Immediate buffering to Redis

  • Never blocks client

  • Automatic async execution

  • Pre-generated IDs for immediate reference

LSN Tracking:

  • Precise replication state knowledge

  • Smart cache management

  • No guesswork about consistency

Automatic Retry:

  • Exponential backoff

  • Failed transaction queue

  • Human oversight for permanent failures

Eventual Consistency:

  • Accept lag in exchange for reliability

  • Smart queries merge buffered + persisted

  • Read buffer tracks updates to buffered records

Monitoring:

  • Health endpoints

  • Buffer statistics

  • Alerting for anomalies

The result: Observatory operations continue regardless of network state, data eventually becomes consistent, and we know precisely when that has happened.

Next Steps#