Transaction Buffering#

The transaction buffering system is the core innovation that enables reliable observatory operations despite unreliable network connectivity.

Introduction#

The Problem: Observatory operations must never block or fail due to network issues.

The Solution: Buffer write operations locally in Redis and execute them asynchronously against the main database when the network is available.

The Innovation: LSN-based replication tracking provides precise knowledge of when buffered data has reached the local replica, enabling smart cache management.

System Components#

The buffering system consists of seven interconnected components:

  1. Site Configuration: Determines buffering behavior based on site type

  2. Transaction Builder: Constructs multi-step atomic operations with dependencies

  3. Transaction Manager: Buffers transactions to Redis and manages state

  4. Background Processor: Continuously processes buffered transactions

  5. Transaction Executor: Executes transactions against the main database

  6. LSN Tracker: Monitors PostgreSQL replication progress

  7. Smart Query Manager: Merges database + buffered + read buffer data

  8. Read Buffer Manager: Tracks mutable updates to buffered records

        graph TB
    Client[Client Request]
    Decorator[@critical_operation]
    Builder[Transaction Builder]
    Manager[Transaction Manager]
    Redis[(Redis Buffer)]
    BG[Background Processor]
    Executor[Transaction Executor]
    MainDB[(Main Database)]
    LSN[LSN Tracker]
    Replica[(Local Replica)]
    Smart[Smart Query Manager]
    ReadBuf[Read Buffer Manager]

    Client --> Decorator
    Decorator --> Builder
    Builder --> Manager
    Manager --> Redis
    Redis --> BG
    BG --> Executor
    Executor --> MainDB
    Executor --> LSN
    LSN --> Replica
    LSN --> Manager
    MainDB -.Replication.-> Replica
    Smart --> Replica
    Smart --> Redis
    Smart --> ReadBuf
    Client -.Query.-> Smart

    style Redis fill:#FFD700
    style MainDB fill:#90EE90
    style Replica fill:#FFB6C1
    

Complete Lifecycle#

From client request to replicated data:

        sequenceDiagram
    participant Client
    participant API
    participant Builder
    participant Manager
    participant Redis
    participant BG as Background<br/>Processor
    participant Main as Main DB
    participant LSN as LSN Tracker
    participant Replica

    Note over Client,Replica: Phase 1: Buffering
    Client->>API: POST /executed_obs_units/start
    API->>Builder: Build transaction
    Builder->>Builder: Generate UUID
    Builder-->>API: Transaction
    API->>Manager: Buffer transaction
    Manager->>Redis: LPUSH to buffer
    Manager->>Redis: Cache generated IDs
    Manager->>Redis: Cache buffered data
    Redis-->>Manager: OK
    Manager-->>API: Transaction ID
    API-->>Client: 201 Created (UUID)

    Note over Client,Replica: Phase 2: Processing
    loop Every 1 second
        BG->>Redis: RPOP from buffer
        Redis-->>BG: Transaction
        BG->>Main: Execute transaction
        Main-->>BG: Success
        BG->>Main: SELECT pg_current_wal_lsn()
        Main-->>BG: LSN: 0/12345678
    end

    Note over Client,Replica: Phase 3: Replication
    Main->>Replica: Stream WAL
    Replica->>Replica: Replay WAL

    Note over Client,Replica: Phase 4: Confirmation
    loop Every 0.1 seconds
        LSN->>Replica: SELECT pg_last_wal_replay_lsn()
        Replica-->>LSN: Current LSN
        LSN->>LSN: Compare with target LSN
    end

    Note over Client,Replica: Phase 5: Cleanup
    LSN-->>BG: Replicated
    BG->>Redis: DELETE cached data
    BG->>Redis: DELETE from write-through cache
    

Data Structures#

Key Redis Data Structures#

Transaction Buffer (List):

Key: site:{site_name}:transaction_buffer
Type: List (LPUSH/RPOP)
Content: JSON-serialized SQLAlchemyBufferedTransaction

Transaction Status (String with TTL):

Key: site:{site_name}:transaction:{transaction_id}
Type: String
TTL: 3600 seconds (1 hour)
Content: JSON-serialized transaction status

Write-Through Cache (Hash):

Key: site:{site_name}:cache:ids:{model}:{id}
Type: String
TTL: Varies (extended until replicated)
Content: Pre-generated ID

Buffered Data Cache (String):

Key: site:{site_name}:buffered:{model}:{id}
Type: String
TTL: Varies (deleted when replicated)
Content: JSON-serialized record data

Read Buffer (Hash):

Key: site:{site_name}:read_buffer:{model}:{id}
Type: Hash
Fields: Updated field values
TTL: Varies (deleted when replicated)

Transaction Structure#

SQLAlchemyBufferedTransaction model:

class SQLAlchemyBufferedTransaction:
    transaction_id: str  # UUID
    endpoint: str  # e.g., "create_observation"
    site: str  # e.g., "observatory"
    timestamp: datetime
    status: str  # "pending", "processing", "completed", "failed"
    retry_count: int
    steps: List[SQLAlchemyTransactionStep]

SQLAlchemyTransactionStep model:

class SQLAlchemyTransactionStep:
    step_id: str
    operation: str  # "create", "update", "delete", "bulk_create"
    model_class: str  # "ExecutedObsUnit"
    data: Dict[str, Any]  # Record data
    conditions: Dict[str, Any]  # For updates/deletes
    dependencies: List[str]  # step_ids this depends on

Performance Characteristics#

Latency#

Buffering operation (at secondary site):

  • Transaction building: < 10ms

  • Redis LPUSH: < 1ms

  • Cache writes: < 5ms

  • Response generation: < 1ms

  • Total: < 20ms (typical)

Background processing:

  • Transaction execution: 50-500ms (depends on complexity)

  • LSN capture: < 5ms

  • LSN comparison: < 10ms

  • Total: 100ms - 1s per transaction

Replication lag:

  • Good network: 1-10 seconds

  • Slow network: 10-60 seconds

  • Network down: Accumulates

Throughput#

Single background processor:

  • Simple transactions: ~10-20/second

  • Complex transactions: ~2-5/second

  • Bulk operations: ~1-2/second

Bottlenecks:

  • Main database connection latency

  • Transaction complexity

  • Network bandwidth (replication)

Reliability#

Buffer persistence:

  • Redis AOF (Append-Only File) enabled

  • Survives process restart

  • Data loss risk: Last 1 second of writes (fsync policy)

Retry mechanism:

  • Configurable attempts (default: 3)

  • Exponential backoff (5s, 10s, 20s)

  • Failed queue for manual intervention

Monitoring:

  • Buffer size tracking

  • Processing rate metrics

  • Failed transaction alerts

Section Contents#

Explore each component in detail: