Distributed Architecture#

Documentation verified. Last checked: 2025-11-12. Reviewer: Christof Buchbender

The ops-db-api operates in a distributed environment with a single authoritative database and multiple read-only replicas. Understanding this topology is essential for reasoning about the system’s behavior.

Database Topology#

The system consists of:

  • One authoritative database (main site - Cologne, Germany)

  • One or more read-only replicas (secondary sites - observatory in Chile, potentially others)

  • One API instance per site (each connecting to its local database)

    graph TB
    subgraph "Main Site (Cologne)"
        API_Main[API Server<br/>SITE_TYPE=main]
        DB_Main[(Main Database<br/>Read + Write)]
        Redis_Main[Redis<br/>Caching only]
    end

    subgraph "Observatory Site (Chile)"
        API_Obs[API Server<br/>SITE_TYPE=secondary]
        DB_Replica[(Replica Database<br/>Read-only)]
        Redis_Obs[Redis<br/>Transaction Buffer]
    end

    API_Main -->|Direct Writes| DB_Main
    API_Main -->|Direct Reads| DB_Main
    API_Main -->|Cache| Redis_Main

    API_Obs -->|Buffered Writes| Redis_Obs
    Redis_Obs -->|Background Process| DB_Main
    API_Obs -->|Reads| DB_Replica
    API_Obs -->|Merge Buffer| Redis_Obs

    DB_Main -.->|Streaming Replication<br/>WAL| DB_Replica

    style DB_Main fill:#90EE90
    style DB_Replica fill:#FFB6C1
    style Redis_Obs fill:#FFD700
    

Single Authoritative Source#

The Foundation: One Source of Truth#

Main Database (Cologne) is the only database that accepts writes:

  • All data modifications must ultimately go through this database

  • It is the authoritative source for all data

  • Other databases are read-only copies (replicas)

This design choice has profound implications:

Advantages:

  • No conflict resolution needed (impossible for two sites to write same record)

  • Predictable data flow (all writes flow one direction)

  • Simple consistency model (main DB is always “correct”)

  • Easy rollback and backup (one source to protect)

Trade-offs:

  • Network partition blocks writes from secondary sites (mitigated by buffering)

  • Replica can be slightly behind main (mitigated by LSN tracking)

  • Main site is single point of failure for writes (mitigated by high availability)

Why Not Multi-Master?#

Multi-Master means multiple databases can accept writes:

    graph LR
    DB_A[(Database A)] <-.->|Bidirectional<br/>Sync| DB_B[(Database B)]
    Client_A[Client] -->|Writes| DB_A
    Client_B[Client] -->|Writes| DB_B
    

Sounds great, but:

  1. Conflict Resolution Required:

    -- Site A: Update observation status
    UPDATE executed_obs_unit
    SET status = 'completed', end_time = '2025-01-01T01:00:00'
    WHERE id = 123;

    -- Site B: Simultaneously update the same observation
    UPDATE executed_obs_unit
    SET status = 'failed', error_message = 'Weather abort'
    WHERE id = 123;

    -- When they sync: which status is correct?
    -- Last-write-wins? Merge fields? Manual resolution?
    
  2. Cascading Complexity:

    • Foreign key constraints become complex

    • Transaction atomicity across sites is difficult

    • Schema migrations must work on all sites simultaneously

  3. Conflict Types:

    • Update-update: Both sites modify same record

    • Insert-insert: Both sites create record with same ID

    • Delete-update: One site deletes what another updates

    • Referential integrity: Foreign keys pointing to deleted records

Our Single-Master Approach avoids all these issues at the cost of requiring buffering at secondary sites.

PostgreSQL Streaming Replication#

How It Works#

PostgreSQL replication uses the Write-Ahead Log (WAL):

  1. Main database writes all changes to WAL before applying them

  2. WAL segments are shipped to replicas (continuously or in batches)

  3. Replicas replay WAL segments to stay synchronized

  4. Replication lag is the time between main write and replica apply

    sequenceDiagram
    participant Client
    participant Main DB
    participant WAL
    participant Replica

    Client->>Main DB: INSERT observation
    Main DB->>WAL: Write WAL record (LSN: 0/12345678)
    Main DB->>Main DB: Apply to database
    Main DB-->>Client: Success

    Note over WAL,Replica: Continuous streaming

    WAL->>Replica: Stream WAL record
    Replica->>Replica: Replay WAL (apply changes)
    Replica->>Replica: Update replay LSN

    Note over Replica: Now has data at LSN 0/12345678
    
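
As a generic PostgreSQL illustration (not specific to ops-db-api), the main database can report the state of each attached replica through the standard pg_stat_replication view; the connection string below is a placeholder:

import asyncio

import asyncpg

async def show_replication_status(main_dsn: str) -> None:
    # Run against the main (primary) database.
    conn = await asyncpg.connect(main_dsn)
    try:
        rows = await conn.fetch(
            """
            SELECT application_name,
                   state,
                   sent_lsn::text   AS sent_lsn,
                   replay_lsn::text AS replay_lsn,
                   pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_gap_bytes
            FROM pg_stat_replication
            """
        )
        for row in rows:
            print(dict(row))
    finally:
        await conn.close()

asyncio.run(show_replication_status("postgresql://ccat_ops_user@main-db.example.com/ccat_ops_db"))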

WAL (Write-Ahead Log)#

The WAL is PostgreSQL’s transaction log:

What it contains:

  • Every INSERT, UPDATE, DELETE operation

  • Transaction commits and aborts

  • Schema changes (DDL)

  • Sequence increments

Why it’s powerful:

  • Guaranteed durability (written before data changes)

  • Point-in-time recovery possible

  • Replication uses the same mechanism as crash recovery

  • Binary format - efficient to transmit
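
As a small, generic illustration (standard PostgreSQL functions; the connection string is a placeholder), the current WAL write position and the segment file it lives in can be inspected directly:

import asyncio

import asyncpg

async def show_wal_position(dsn: str) -> None:
    # Connect to the main (writable) database.
    conn = await asyncpg.connect(dsn)
    try:
        lsn = await conn.fetchval("SELECT pg_current_wal_lsn()::text")
        segment = await conn.fetchval("SELECT pg_walfile_name(pg_current_wal_lsn())")
        print(f"Current WAL position {lsn} in segment file {segment}")
    finally:
        await conn.close()

asyncio.run(show_wal_position("postgresql://ccat_ops_user@localhost/ccat_ops_db"))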

LSN (Log Sequence Number)#

The LSN is a unique identifier for each WAL record:

Format: two hexadecimal numbers separated by a slash, encoding the high and low 32 bits of the 64-bit WAL byte position (e.g., 0/12345678)

Properties:

  • Monotonically increasing (never decreases)

  • Unique per WAL record (a single transaction can produce many records)

  • Consistent across main and replicas

How we use it:

# Convert a PostgreSQL LSN string such as '0/12345678' into an absolute byte position
def lsn_to_int(lsn: str) -> int:
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

# After writing to the main database
lsn_after_write = execute_query("SELECT pg_current_wal_lsn()")
# Returns: '0/12345678'

# Check if the replica has caught up (queried on the replica)
replica_lsn = execute_query("SELECT pg_last_wal_replay_lsn()")
# Returns: '0/12345600' (slightly behind)

if lsn_to_int(replica_lsn) >= lsn_to_int(lsn_after_write):
    print("Replica has our data!")
else:
    print(f"Replica is {lsn_to_int(lsn_after_write) - lsn_to_int(replica_lsn)} bytes behind")

See LSN Tracking for implementation details.

Replication Lag#

Lag is the time between a write on main and its availability on replica:

Factors affecting lag:

  • Network bandwidth (WAL transmission speed)

  • Network latency (round-trip time)

  • Replica load (CPU, disk I/O)

  • WAL volume (many writes = more to replay)

Typical lag:

  • Good network: Milliseconds to seconds

  • Slow network: Seconds to minutes

  • Network down: Infinite (accumulates until reconnection)
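
As a hedged sketch (standard PostgreSQL functions, run on the replica; the project’s actual monitoring may differ), the current lag can be estimated like this:

import asyncio

import asyncpg

async def estimate_replica_lag(replica_dsn: str) -> None:
    # Run on the replica; the DSN passed in is a placeholder.
    conn = await asyncpg.connect(replica_dsn)
    try:
        row = await conn.fetchrow(
            """
            SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                   pg_last_wal_replay_lsn()) AS replay_backlog_bytes,
                   now() - pg_last_xact_replay_timestamp()   AS approx_time_lag
            """
        )
        # approx_time_lag only reflects true lag while writes are flowing on the main DB.
        print(f"Replay backlog: {row['replay_backlog_bytes']} bytes, "
              f"approximate time lag: {row['approx_time_lag']}")
    finally:
        await conn.close()

asyncio.run(estimate_replica_lag("postgresql://ccat_ops_user@localhost/ccat_ops_db"))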

How we handle it:

  • Buffer writes locally (observatory doesn’t block on lag)

  • Track LSN precisely (know when data has replicated)

  • Merge buffered + replicated data on reads (consistent view)

  • Extend cache TTL if replication takes longer than expected

Site Types: MAIN vs SECONDARY#

The API behavior changes dramatically based on site type:

MAIN Site (Cologne)#

Configuration:

SITE_TYPE=main
MAIN_DB_HOST=localhost  # Same as LOCAL_DB_HOST
LOCAL_DB_HOST=localhost

Behavior:

  • Writes: Direct to database (no buffering)

  • Reads: Direct from database

  • Redis: Used only for caching, not buffering

  • Background Processor: Disabled or minimal activity

  • LSN Tracking: Not needed (no replication to track)

Why:

  • Main database is local (low latency)

  • No replication lag (reads immediately consistent)

  • Buffering overhead unnecessary

When to use:

  • Production main site (Cologne)

  • Development when testing direct database operations

  • Any site where main database is accessible with low latency

SECONDARY Site (Observatory)#

Configuration:

SITE_TYPE=secondary
MAIN_DB_HOST=main-db.example.com  # Remote main database
LOCAL_DB_HOST=localhost            # Local read-only replica

Behavior:

  • Critical Writes: Buffered in Redis, processed asynchronously

  • Non-Critical Writes: Direct to local database (fails if read-only)

  • Reads: From local replica + merge with buffered data

  • Redis: Transaction buffer + read buffer + caching

  • Background Processor: Active, processes buffered transactions

  • LSN Tracking: Active, monitors replication state

Why:

  • Main database is remote (high latency or unavailable)

  • Replication lag expected (merge buffer with DB)

  • Operations must never block (buffer ensures reliability)

When to use:

  • Production observatory site

  • Development when testing transaction buffering

  • Any site where main database is unreliable
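
To make the difference concrete, here is a rough, hypothetical sketch of how a critical write could branch on SITE_TYPE; the helper names, Redis key, and payload format are illustrative, not the actual ops-db-api implementation:

import json
import os

import asyncpg
import redis.asyncio as redis

SITE_TYPE = os.getenv("SITE_TYPE", "main")
BUFFER_KEY = "ops:transaction_buffer"  # illustrative Redis key, not necessarily the real one

async def submit_critical_write(sql: str, params: list,
                                main_pool: asyncpg.Pool,
                                buffer: redis.Redis) -> None:
    # Illustrative routing of a critical write depending on the site type.
    if SITE_TYPE == "main":
        # Main site: the authoritative database is local, so write directly.
        async with main_pool.acquire() as conn:
            await conn.execute(sql, *params)
    else:
        # Secondary site: enqueue the statement in the Redis transaction buffer;
        # the background processor later replays it against the remote main database.
        await buffer.rpush(BUFFER_KEY, json.dumps({"sql": sql, "params": params}))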

Connection Configuration#

The API maintains two database connections:

Main Database Connection#

Purpose: Execute write operations (INSERT, UPDATE, DELETE)

Configuration:

MAIN_DB_HOST=main-db.example.com
MAIN_DB_PORT=5432
MAIN_DB_USER=ccat_ops_user
MAIN_DB_PASSWORD=***
MAIN_DB_NAME=ccat_ops_db

Used by:

  • Background processor (executing buffered transactions)

  • Transaction executor (at main site for direct writes)

Connection pool:

  • Async connection (asyncpg)

  • Separate pool for background processing

  • Configurable pool size (default: 10 connections)

Local Database Connection#

Purpose: Execute read operations (SELECT)

Configuration:

LOCAL_DB_HOST=localhost  # Replica at secondary, main at main site
LOCAL_DB_PORT=5432
LOCAL_DB_USER=ccat_ops_user
LOCAL_DB_PASSWORD=***
LOCAL_DB_NAME=ccat_ops_db

Used by:

  • All read endpoints

  • Smart query manager (merge with buffered data)

  • Non-critical operations (at main site)

Connection pool:

  • Async connection (asyncpg)

  • Shared across all API requests

  • Configurable pool size (default: 20 connections)
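
A minimal sketch of how the two asyncpg pools could be created from the environment variables above (pool sizes follow the defaults mentioned; the exact wiring in ops-db-api may differ):

import os

import asyncpg

async def create_pools():
    # Write pool: remote main database (authoritative source).
    main_pool = await asyncpg.create_pool(
        host=os.environ["MAIN_DB_HOST"],
        port=int(os.environ.get("MAIN_DB_PORT", 5432)),
        user=os.environ["MAIN_DB_USER"],
        password=os.environ["MAIN_DB_PASSWORD"],
        database=os.environ["MAIN_DB_NAME"],
        max_size=10,  # default mentioned above
    )
    # Read pool: local database (replica at a secondary site, main DB at the main site).
    local_pool = await asyncpg.create_pool(
        host=os.environ["LOCAL_DB_HOST"],
        port=int(os.environ.get("LOCAL_DB_PORT", 5432)),
        user=os.environ["LOCAL_DB_USER"],
        password=os.environ["LOCAL_DB_PASSWORD"],
        database=os.environ["LOCAL_DB_NAME"],
        max_size=20,  # default mentioned above
    )
    return main_pool, local_pool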

Why Two Connections?#

Separation of concerns:

  • Reads can go to local replica (fast, no network)

  • Writes go to main database (authoritative)

  • Background processor doesn’t interfere with read traffic

Flexibility:

  • Main site: Both connections point to same database

  • Secondary site: Different databases (main remote, local replica)

  • Testing: Can point to different databases for verification

Network Partition Scenarios#

Understanding how the system behaves during network issues:

Scenario 1: Observatory Loses Connection#

What happens:

  1. The API keeps operating in buffering mode (buffering is always active at secondary sites)

  2. Critical operations buffer in Redis

  3. Reads continue from local replica (slightly stale data)

  4. Buffer accumulates in Redis (monitored for size)

  5. Background processor retries periodically

  6. When network restored: buffer drains automatically

Data consistency:

  • Buffered writes not visible in local reads (until replicated)

  • Smart query manager merges buffer + DB (consistent view)

  • Read buffer manager handles updates to buffered records

Recovery:

  • Automatic - no manual intervention needed

  • Background processor processes entire buffer

  • LSN tracking confirms when data reaches replica

  • Cache cleanup occurs after replication confirmed
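
The real background processor is more involved, but a hypothetical sketch of the drain step looks roughly like this (the Redis key and payload format are illustrative):

import asyncio
import json

import redis.asyncio as redis

BUFFER_KEY = "ops:transaction_buffer"  # illustrative key name

async def drain_buffer(buffer: redis.Redis, main_pool) -> None:
    # Replay buffered transactions against the main database, oldest first.
    while True:
        raw = await buffer.lpop(BUFFER_KEY)
        if raw is None:
            break  # buffer empty, nothing left to drain
        txn = json.loads(raw)
        try:
            async with main_pool.acquire() as conn:
                await conn.execute(txn["sql"], *txn.get("params", []))
        except Exception:
            # Main DB still unreachable: put the item back at the front and retry later.
            await buffer.lpush(BUFFER_KEY, raw)
            await asyncio.sleep(5)
            break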

Scenario 2: Main Site Database Down#

What happens:

  1. Background processor can’t execute buffered transactions

  2. Buffer continues to accumulate in Redis

  3. Observatory operations continue unaffected

  4. When database restored: buffer drains automatically

Data loss risk:

  • Redis with AOF: Minimal risk (persisted to disk)

  • Redis memory-only: Risk if Redis also fails

  • Recommendation: Always use Redis AOF at observatory
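
A quick, hedged way to check that AOF persistence is actually enabled on the observatory Redis instance (redis-py asyncio; the URL is a placeholder):

import asyncio

import redis.asyncio as redis

async def check_aof(url: str = "redis://localhost:6379/0") -> None:
    r = redis.from_url(url, decode_responses=True)
    try:
        conf = await r.config_get("appendonly")
        if conf.get("appendonly") == "yes":
            print("AOF enabled: buffered transactions survive a Redis restart")
        else:
            print("WARNING: AOF disabled - buffered transactions are lost if Redis restarts")
    finally:
        await r.close()

asyncio.run(check_aof())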

Scenario 3: Replica Falls Behind#

What happens:

  1. Local reads become increasingly stale

  2. Smart query manager still merges buffer (partially mitigates)

  3. LSN tracking detects lag

  4. Cache TTL extended (preserve buffered data longer)

When critical:

  • Rare - replicas usually catch up quickly

  • If persistent: Check replica performance, network bandwidth

  • Can temporarily point reads to main database (high latency)

Why Not Application-Level Sync?#

Alternative approach: Application manages synchronization

# Application-level sync (we don't do this)
def sync_observations():
    # Get observations from local DB modified since last sync
    local_obs = db_local.query(
        Observation
    ).filter(
        Observation.modified_at > last_sync_time
    )

    # Upload to main database
    for obs in local_obs:
        db_main.merge(obs)  # What if already modified at main?

    # Download observations from main DB
    main_obs = db_main.query(
        Observation
    ).filter(
        Observation.modified_at > last_sync_time
    )

    # Apply to local DB
    for obs in main_obs:
        db_local.merge(obs)  # Conflict resolution needed!

Problems with this approach:

  1. Conflict Resolution: If both sites modify same record, how to merge?

  2. Foreign Keys: Need to track and resolve cross-table relationships

  3. Partial Sync: Some tables sync, others don’t - inconsistent state

  4. Transaction Atomicity: Multi-table transactions can sync partially

  5. Schema Changes: Every migration needs sync logic update

  6. Performance: Full table scans or complex change tracking needed

PostgreSQL streaming replication solves all these at the database engine level.

Summary#

The distributed architecture provides:

  • Single source of truth: Main database (Cologne)

  • Local replicas: Fast reads at each site (PostgreSQL streaming replication)

  • Site-aware behavior: MAIN (direct writes) vs SECONDARY (buffered writes)

  • Automatic replication: PostgreSQL handles data synchronization

  • Precise tracking: LSN tells us exactly when data has replicated

  • Network resilience: Operations continue during network issues

This architecture is optimized for reliability at the cost of:

  • Additional complexity (buffering, LSN tracking, smart queries)

  • Eventual consistency (seconds to minutes lag)

  • Single point of write failure (main database)

These trade-offs are acceptable for the observatory use case where reliability trumps immediate consistency.

Next Steps#