# Database Topology

```{eval-rst}
.. verified:: 2025-11-12
   :reviewer: Christof Buchbender
```

The ops-db-api uses a distributed PostgreSQL architecture with a single authoritative database and read-only replicas at remote sites.

```{contents} Table of Contents
:depth: 2
:local: true
```

## Overview

The database topology consists of:

- **Main Database**: Cologne, Germany (write target)
- **Replica Database(s)**: Observatory in Chile, potentially other sites (read-only)
- **Streaming Replication**: PostgreSQL WAL-based continuous sync

```{eval-rst}
.. mermaid::

   graph LR
       Main[(Main Database<br/>Cologne<br/>Read + Write)]
       Replica1[(Replica<br/>Observatory<br/>Read Only)]
       Replica2[(Replica<br/>Institute<br/>Read Only)]

       Main -.->|WAL Stream| Replica1
       Main -.->|WAL Stream| Replica2

       API_Main[API @ Main] -->|Writes| Main
       API_Main -->|Reads| Main

       API_Obs[API @ Observatory] -->|Buffered Writes| Main
       API_Obs -->|Reads| Replica1

       style Main fill:#90EE90
       style Replica1 fill:#FFB6C1
       style Replica2 fill:#FFB6C1
```

## Main Database (Cologne)

### Characteristics

**Role**: Single authoritative source of truth

**Location**: Cologne, Germany (data center)

**Access**:
- Direct write access (low latency)
- Direct read access (low latency)
- Public endpoint for remote API instances

**Configuration**:

```ini
# postgresql.conf
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
synchronous_commit = on
```

**Replication Settings**:

```ini
# Enable WAL archiving
archive_mode = on
archive_command = 'cp %p /archive/%f'

# Replication slots (one per replica)
max_replication_slots = 10
```

### Schema Management

Database schema is managed by the `ops-db` package:

```bash
# Install ops-db package
pip install git+https://github.com/ccatobs/ops-db.git
```

```python
# Models are imported from ccat_ops_db
from ccat_ops_db import models
```

**Key tables**:
- `executed_obs_unit` - Observatory operation records
- `raw_data_package` - Data file groupings
- `raw_data_file` - Individual file metadata
- `observing_program` - Planned observations
- `source` - Astronomical sources
- `user` - System users
- `api_token` - Authentication tokens

### Write Operations

All writes go through the main database:

```python
# At main site: Direct write
async with main_db_session() as session:
    obs = models.ExecutedObsUnit(**data)
    session.add(obs)
    await session.commit()

# At secondary site: Buffered write
transaction = builder.create(
    model_class=models.ExecutedObsUnit,
    data=data
)
await transaction_manager.buffer_transaction(transaction)
# Background processor later writes to main database
```

## Replica Databases

### Observatory Replica (Chile)

**Role**: Local read-only copy for observatory operations

**Location**: CCAT
observatory, Chile (5600 m altitude)

**Purpose**:
- Fast local reads (no network latency)
- Continue operations during network outages
- Merge with buffered data for consistent views

**Connection from API**:

```bash
# Site configuration for secondary site
LOCAL_DB_HOST=localhost  # Points to local replica
LOCAL_DB_PORT=5432
```

**Replication Lag**: Typical lag depends on network conditions:
- Good network: 1-10 seconds
- Slow network: 10-60 seconds
- Network down: Accumulates until reconnection

### Institute Replicas (Optional)

Additional replicas can be deployed at partner institutions:

**Purpose**:
- Distributed data access for scientists
- Load balancing for read queries
- Disaster recovery backup

**Configuration**: Same as observatory replica

## PostgreSQL Streaming Replication

### How It Works

PostgreSQL uses Write-Ahead Logging (WAL) for replication:

1. **Main database writes to WAL** before applying changes
2. **WAL records are sent to replicas** continuously or in batches
3. **Replicas replay the WAL** to apply the same changes
4. **Replication lag** is the time between the write on the main database and its replay on the replica

### WAL (Write-Ahead Log)

The WAL contains:
- All data modifications (INSERT, UPDATE, DELETE)
- Transaction commits and rollbacks
- Schema changes (DDL)
- Checkpoint records

**WAL segment size**: 16 MB per file by default

**WAL retention**: Configured to retain enough for replicas to catch up

**WAL format**: Binary, PostgreSQL-specific

### LSN (Log Sequence Number)

Each WAL record has a unique LSN:

**Format**: Two hexadecimal numbers separated by a slash (high 32 bits `/` low 32 bits)

Example: `0/12345678`

**Properties**:
- Monotonically increasing
- Unique per WAL record
- Identical across main and replicas
- Used to track replication progress

**Querying LSN**:

```sql
-- On main database: Current write position
SELECT pg_current_wal_lsn();
-- Returns: 0/12345678

-- On replica: Last replayed position
SELECT pg_last_wal_replay_lsn();
-- Returns: 0/12345600 (slightly behind)
```

**Comparing LSNs**:

```python
def parse_lsn(lsn_string: str) -> int:
    """Convert LSN string to integer for comparison"""
    file_part, offset_part = lsn_string.split('/')
    return (int(file_part, 16) << 32) + int(offset_part, 16)

main_lsn = parse_lsn("0/12345678")
replica_lsn = parse_lsn("0/12345600")

if replica_lsn >= main_lsn:
    print("Replica has caught up")
else:
    bytes_behind = main_lsn - replica_lsn
    print(f"Replica is {bytes_behind} bytes behind")
```

### Replication Slots

**Purpose**: Ensure the main database retains WAL until the replica has consumed it

**Creating a slot**:

```sql
SELECT pg_create_physical_replication_slot('observatory_slot');
```

**Monitoring slots**:

```sql
SELECT
    slot_name,
    active,
    restart_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS bytes_behind
FROM pg_replication_slots;
```

**Danger**: Inactive slots cause WAL to accumulate on the main database and can fill its disk

## Connection Configuration

### API Configuration

The API uses separate connection strings for main and local databases:

**At Main Site** (Cologne):

```bash
SITE_TYPE=main
MAIN_DB_HOST=localhost
MAIN_DB_PORT=5432
LOCAL_DB_HOST=localhost  # Same as main
LOCAL_DB_PORT=5432
```

**At Secondary Site** (Observatory):

```bash
SITE_TYPE=secondary
MAIN_DB_HOST=main-db.example.com  # Remote
MAIN_DB_PORT=5432
LOCAL_DB_HOST=localhost  # Local replica
LOCAL_DB_PORT=5432
```

## Replication Monitoring

### Checking Replication Status

**On main database**:

```sql
-- Active replication connections
SELECT
    client_addr,
    state,
    sync_state,
    replay_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication;
```

**On replica**:

```sql
-- Replication status
SELECT
    pg_is_in_recovery(),            -- Should be true
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn(),
    pg_last_xact_replay_timestamp();
```

### Replication Lag Metrics

**Time-based lag**:

```sql
-- On replica
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```

**Byte-based lag**:

```sql
-- On main database
SELECT
    client_addr,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication;
```

### LSN Tracking in API

The API actively tracks replication using LSN:

```{literalinclude} ../../ccat_ops_db_api/transaction_buffering/lsn_tracker.py
:emphasize-lines: 11-12, 27-30
:language: python
:lines: 18-52
```

## Failover and Recovery

### Network Partition Scenarios

**Scenario 1: Observatory loses connection to main**

- Buffering continues (Redis local)
- Reads from local replica (may be stale)
- Background processor retries periodically
- When reconnected: Buffer drains automatically

**Scenario 2: Main database goes down**

- At main site: API becomes unavailable
- At secondary site: Buffering continues
- Reads work from replica
- When main restored: Buffer processes

**Scenario 3: Replica falls far behind**

- Reads become increasingly stale
- Smart query manager still merges buffer
- LSN tracking detects lag
- Alert monitoring triggers
- Manual intervention may be needed

## Database Migrations

Schema changes must be coordinated across sites:

### Migration Process

1. **Develop migration** in the ops-db package
2. **Test on a development database**
3. **Apply to the main database** (Cologne)
4. **Replicas receive the change automatically** via WAL
5. **Update the API** to use the new schema
6. **Deploy API changes** to all sites

### Example Migration

```python
# In ops-db package
# alembic/versions/xxx_add_observation_status.py
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.add_column(
        'executed_obs_unit',
        sa.Column('detailed_status', sa.String(50))
    )

def downgrade():
    op.drop_column('executed_obs_unit', 'detailed_status')
```

**Apply migration**:

```bash
# On main database
alembic upgrade head
```

**Replicas** receive the schema change automatically through WAL.

## Backup and Restore

### Backup Strategy

**Main database**:
- Continuous WAL archiving to S3/backup storage
- Daily full backups with pg_dump
- Point-in-time recovery capability
- Retention: 30 days

**Replicas**:
- No backups needed (can be rebuilt from the main database)
- Optional: Snapshot for faster rebuild

### Restore Procedure

```bash
# Restore from backup
pg_restore -d ccat_ops_db backup_file.dump

# Or rebuild replica from main
pg_basebackup -h main-db.example.com -D /var/lib/postgresql/data \
    -U replication -v -P
```

## Performance Considerations

### Replication Impact on Main

**WAL generation**: Minimal overhead (typically < 5%)

**Network bandwidth**: Depends on write volume
- Light writes: < 1 MB/s
- Heavy writes: 10-50 MB/s

**Replication slots**: Can accumulate WAL if a replica is down (monitor!)
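The slot warning above lends itself to a simple automated check. The sketch below combines the LSN arithmetic shown earlier with a retention threshold; the function names and the 1 GiB default are illustrative choices, not part of ops-db-api:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN string (e.g. '0/12345678') to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slot_needs_attention(current_wal_lsn: str, restart_lsn: str,
                         max_retained_bytes: int = 1 << 30) -> bool:
    """Return True if a replication slot retains more WAL than allowed.

    current_wal_lsn: result of pg_current_wal_lsn() on the main database
    restart_lsn:     restart_lsn of the slot from pg_replication_slots
    """
    retained = parse_lsn(current_wal_lsn) - parse_lsn(restart_lsn)
    return retained > max_retained_bytes


# A slot ~16 MiB behind is fine against the default 1 GiB threshold...
print(slot_needs_attention("1/00000000", "0/FF000000"))  # → False
# ...but trips a tight 1 MiB threshold
print(slot_needs_attention("1/00000000", "0/FF000000",
                           max_retained_bytes=1 << 20))  # → True
```

The two LSNs would come from `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`, queried periodically by whatever monitoring job runs against the main database.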
### Replica Query Performance

**Read-only**: Replicas cannot have indexes that do not exist on the main database

**Hot standby feedback**: `hot_standby_feedback` prevents long-running replica queries from being cancelled, at the cost of delayed vacuum on the main database

**Connection pooling**: Same as main database

**Query routing**: The API automatically queries the local replica

## Summary

The database topology provides:

- **Single source of truth**: Main database in Cologne
- **Distributed reads**: Replicas at each site
- **Automatic synchronization**: PostgreSQL streaming replication
- **Precise tracking**: LSN-based replication monitoring
- **Network resilience**: Operations continue during outages

Key characteristics:

- All writes → main database
- Reads from local replica (with buffer merge)
- WAL-based replication (continuous or near-continuous)
- LSN tracking for cache management
- Monitoring for replication health

## Next Steps

- {doc}`site-configuration` - How sites are configured
- {doc}`../deep-dive/transaction-buffering/lsn-tracking` - LSN tracking implementation
- {doc}`../development/debugging-buffering` - Troubleshooting replication issues