# Database Topology
```{eval-rst}
.. verified:: 2025-11-12
   :reviewer: Christof Buchbender
```
The ops-db-api uses a distributed PostgreSQL architecture with a single authoritative database and read-only replicas at remote sites.
```{contents} Table of Contents
:depth: 2
:local: true
```
## Overview
The database topology consists of:
- **Main Database**: Cologne, Germany (write target)
- **Replica Database(s)**: Observatory in Chile, potentially other sites (read-only)
- **Streaming Replication**: PostgreSQL WAL-based continuous sync
```{eval-rst}
.. mermaid::

   graph LR
       Main[("Main Database<br/>Cologne<br/>Read + Write")]
       Replica1[("Replica<br/>Observatory<br/>Read Only")]
       Replica2[("Replica<br/>Institute<br/>Read Only")]
       Main -.->|WAL Stream| Replica1
       Main -.->|WAL Stream| Replica2
       API_Main["API @ Main"] -->|Writes| Main
       API_Main -->|Reads| Main
       API_Obs["API @ Observatory"] -->|Buffered Writes| Main
       API_Obs -->|Reads| Replica1
       style Main fill:#90EE90
       style Replica1 fill:#FFB6C1
       style Replica2 fill:#FFB6C1
```
## Main Database (Cologne)
### Characteristics
**Role**: Single authoritative source of truth
**Location**: Cologne, Germany (data center)
**Access**:
- Direct write access (low latency)
- Direct read access (low latency)
- Public endpoint for remote API instances
**Configuration**:
```ini
# postgresql.conf
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
synchronous_commit = on
```
**Replication Settings**:
```ini
# Enable WAL archiving
archive_mode = on
archive_command = 'cp %p /archive/%f'
# Replication slots (one per replica)
max_replication_slots = 10
```
### Schema Management
Database schema is managed by the `ops-db` package:
```bash
# Install ops-db package
pip install git+https://github.com/ccatobs/ops-db.git
```

```python
# Models are imported from ccat_ops_db
from ccat_ops_db import models
```
**Key tables**:
- `executed_obs_unit` - Observatory operation records
- `raw_data_package` - Data file groupings
- `raw_data_file` - Individual file metadata
- `observing_program` - Planned observations
- `source` - Astronomical sources
- `user` - System users
- `api_token` - Authentication tokens
### Write Operations
All writes go through the main database:
```python
# At main site: Direct write
async with main_db_session() as session:
    obs = models.ExecutedObsUnit(**data)
    session.add(obs)
    await session.commit()

# At secondary site: Buffered write
transaction = builder.create(
    model_class=models.ExecutedObsUnit,
    data=data,
)
await transaction_manager.buffer_transaction(transaction)
# Background processor later writes to main database
```
## Replica Databases
### Observatory Replica (Chile)
**Role**: Local read-only copy for observatory operations
**Location**: CCAT observatory, Chile (5600m altitude)
**Purpose**:
- Fast local reads (no network latency)
- Continue operations during network outages
- Merge with buffered data for consistent views
**Connection from API**:
```bash
# Site configuration for secondary site
LOCAL_DB_HOST=localhost  # Points to local replica
LOCAL_DB_PORT=5432
```
**Replication Lag**:
Typical lag depends on network conditions:
- Good network: 1-10 seconds
- Slow network: 10-60 seconds
- Network down: Accumulates until reconnection
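The lag regimes above can be encoded as a simple classifier, e.g. for dashboards or health checks. This is a hypothetical helper (not part of ops-db-api), using the rough thresholds listed above; `None` stands in for "no lag sample received":

```python
from typing import Optional


def classify_lag(lag_seconds: Optional[float]) -> str:
    """Classify replication lag using rough operational thresholds.

    Hypothetical illustration; thresholds mirror the typical ranges above.
    `None` means no lag sample could be obtained (network down).
    """
    if lag_seconds is None:
        return "disconnected"
    if lag_seconds <= 10:
        return "good"       # healthy network
    if lag_seconds <= 60:
        return "slow"       # degraded but tolerable
    return "degraded"       # lag is accumulating


print(classify_lag(3))     # good
print(classify_lag(45))    # slow
print(classify_lag(None))  # disconnected
```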
### Institute Replicas (Optional)
Additional replicas can be deployed at partner institutions:
**Purpose**:
- Distributed data access for scientists
- Load balancing for read queries
- Disaster recovery backup
**Configuration**: Same as observatory replica
## PostgreSQL Streaming Replication
### How It Works
PostgreSQL uses Write-Ahead Logging (WAL) for replication:
1. **Main database writes to WAL** before applying changes
2. **WAL records sent to replicas** continuously or in batches
3. **Replicas replay WAL** to apply the same changes
4. **Replication lag** is the time between main write and replica apply
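The four steps above can be modeled in a few lines of Python: a primary appends records to a log with monotonically increasing LSNs, and a replica replays any records newer than its last replayed LSN. This toy model is purely illustrative; real WAL records are binary page-level changes, not strings:

```python
class ToyPrimary:
    """Appends changes to a log and assigns increasing LSNs."""

    def __init__(self):
        self.wal = []       # list of (lsn, change) records
        self.next_lsn = 0

    def write(self, change):
        self.wal.append((self.next_lsn, change))
        self.next_lsn += 1


class ToyReplica:
    """Replays WAL records in order, tracking its replay position."""

    def __init__(self):
        self.applied = []
        self.replay_lsn = -1

    def replay(self, wal):
        # Apply only records newer than the last replayed LSN.
        for lsn, change in wal:
            if lsn > self.replay_lsn:
                self.applied.append(change)
                self.replay_lsn = lsn


primary = ToyPrimary()
replica = ToyReplica()
primary.write("INSERT a")
primary.write("INSERT b")
replica.replay(primary.wal)   # replica catches up
primary.write("UPDATE a")     # a write the replica has not seen yet
lag = primary.next_lsn - 1 - replica.replay_lsn
print(lag)  # 1 record behind until the next replay
```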
### WAL (Write-Ahead Log)
The WAL contains:
- All data modifications (INSERT, UPDATE, DELETE)
- Transaction commits and rollbacks
- Schema changes (DDL)
- Checkpoint records
**WAL segment size**: Typically 16MB per file
**WAL retention**: Configured to retain enough for replicas to catch up
**WAL format**: Binary, PostgreSQL-specific
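Retention can be sized from first principles: the retained WAL must at least cover the write volume generated during the longest outage a replica should survive. A hypothetical sizing helper (the safety factor is an assumption, not an ops-db-api setting):

```python
def wal_retention_mb(write_rate_mb_per_s: float,
                     outage_seconds: float,
                     safety_factor: float = 2.0) -> float:
    """Rough WAL retention needed for a replica to catch up after an outage.

    Hypothetical sizing helper: retention = write rate x outage duration,
    padded by a safety factor for bursts.
    """
    return write_rate_mb_per_s * outage_seconds * safety_factor


# e.g. 1 MB/s of writes and a 1-hour network outage at the observatory:
print(wal_retention_mb(1.0, 3600))  # 7200.0 MB
```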
### LSN (Log Sequence Number)
Each WAL record has a unique LSN:
**Format**: two hexadecimal numbers separated by a slash, encoding the high and low 32 bits of a 64-bit position in the WAL stream
Example: `0/12345678`
**Properties**:
- Monotonically increasing
- Unique per WAL record
- Same across main and replicas
- Used to track replication progress
**Querying LSN**:
```sql
-- On main database: Current write position
SELECT pg_current_wal_lsn();
-- Returns: 0/12345678
-- On replica: Last replayed position
SELECT pg_last_wal_replay_lsn();
-- Returns: 0/12345600 (slightly behind)
```
**Comparing LSNs**:
```python
def parse_lsn(lsn_string: str) -> int:
    """Convert an LSN string to an integer for comparison."""
    high, low = lsn_string.split('/')
    return (int(high, 16) << 32) + int(low, 16)


main_lsn = parse_lsn("0/12345678")
replica_lsn = parse_lsn("0/12345600")

if replica_lsn >= main_lsn:
    print("Replica has caught up")
else:
    bytes_behind = main_lsn - replica_lsn
    print(f"Replica is {bytes_behind} bytes behind")
```
### Replication Slots
**Purpose**: Ensure main database retains WAL until replica has consumed it
**Creating a slot**:
```sql
SELECT pg_create_physical_replication_slot('observatory_slot');
```
**Monitoring slots**:
```sql
SELECT
    slot_name,
    active,
    restart_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS bytes_behind
FROM pg_replication_slots;
```
**Danger**: Inactive slots can cause WAL accumulation and disk space issues
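A monitoring job can turn the query above into alerts. Here is a hypothetical sketch: `slots` mirrors rows from `pg_replication_slots` as `(slot_name, active, bytes_behind)` tuples, and the 1 GiB threshold matches the `wal_keep_size` configured earlier:

```python
def check_slots(slots, max_bytes_behind: int = 1 << 30):
    """Flag replication slots that are inactive or retaining too much WAL.

    Hypothetical monitoring helper; `slots` holds
    (slot_name, active, bytes_behind) tuples from pg_replication_slots.
    """
    warnings = []
    for name, active, bytes_behind in slots:
        if not active:
            # An inactive slot pins WAL forever -- the main danger case.
            warnings.append(f"{name}: slot inactive, WAL will accumulate")
        elif bytes_behind > max_bytes_behind:
            warnings.append(f"{name}: {bytes_behind} bytes behind")
    return warnings


print(check_slots([("observatory_slot", False, 0)]))
```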
## Connection Configuration
### API Configuration
The API uses separate connection strings for main and local databases:
**At Main Site** (Cologne):
```bash
SITE_TYPE=main
MAIN_DB_HOST=localhost
MAIN_DB_PORT=5432
LOCAL_DB_HOST=localhost # Same as main
LOCAL_DB_PORT=5432
```
**At Secondary Site** (Observatory):
```bash
SITE_TYPE=secondary
MAIN_DB_HOST=main-db.example.com # Remote
MAIN_DB_PORT=5432
LOCAL_DB_HOST=localhost # Local replica
LOCAL_DB_PORT=5432
```
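The two env files above boil down to one routing rule: writes always target the main host, and reads target the local replica only at a secondary site. A hypothetical helper (not the actual ops-db-api settings code) that resolves both hosts from such a configuration:

```python
def database_hosts(env: dict) -> tuple:
    """Resolve (main_host, local_host) from site configuration.

    Hypothetical helper mirroring the env files above: at the main site
    both connections point at the same server; at a secondary site reads
    go to the local replica.
    """
    main = env.get("MAIN_DB_HOST", "localhost")
    if env.get("SITE_TYPE") == "secondary":
        return main, env.get("LOCAL_DB_HOST", "localhost")
    return main, main


print(database_hosts({
    "SITE_TYPE": "secondary",
    "MAIN_DB_HOST": "main-db.example.com",
    "LOCAL_DB_HOST": "localhost",
}))
```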
## Replication Monitoring
### Checking Replication Status
**On main database**:
```sql
-- Active replication connections
SELECT
    client_addr,
    state,
    sync_state,
    replay_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication;
```
**On replica**:
```sql
-- Replication status
SELECT
pg_is_in_recovery(), -- Should be true
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn(),
pg_last_xact_replay_timestamp();
```
### Replication Lag Metrics
**Time-based lag**:
```sql
-- On replica (pg_stat_replication only has rows on the main database)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```
**Byte-based lag**:
```sql
-- On main database
SELECT
client_addr,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication;
```
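The results of the two queries above (time-based lag from the replica, byte-based lag from the main) can feed a single alert decision. A hypothetical sketch with illustrative thresholds:

```python
def lag_alert(seconds_behind: float, bytes_behind: int,
              max_seconds: float = 60,
              max_bytes: int = 100 * 1024 * 1024):
    """Decide whether measured replication lag warrants an alert.

    Hypothetical helper; thresholds are illustrative, not ops-db-api
    defaults. Returns a list of reasons (empty when healthy).
    """
    reasons = []
    if seconds_behind > max_seconds:
        reasons.append(f"time lag {seconds_behind:.0f}s > {max_seconds}s")
    if bytes_behind > max_bytes:
        reasons.append(f"byte lag {bytes_behind} > {max_bytes} bytes")
    return reasons


print(lag_alert(5, 1024))      # [] -- healthy
print(lag_alert(120, 1024))    # time lag exceeded
```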
### LSN Tracking in API
The API actively tracks replication using LSN:
```{literalinclude} ../../ccat_ops_db_api/transaction_buffering/lsn_tracker.py
:emphasize-lines: 11-12, 27-30
:language: python
:lines: 18-52
```
## Failover and Recovery
### Network Partition Scenarios
**Scenario 1: Observatory loses connection to main**
- Buffering continues (Redis local)
- Reads from local replica (may be stale)
- Background processor retries periodically
- When reconnected: Buffer drains automatically
**Scenario 2: Main database goes down**
- At main site: API becomes unavailable
- At secondary site: Buffering continues
- Reads work from replica
- When main restored: Buffer processes
**Scenario 3: Replica falls very behind**
- Reads become increasingly stale
- Smart query manager still merges buffer
- LSN tracking detects lag
- Alert monitoring triggers
- Manual intervention may be needed
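In scenarios 1 and 2, the background processor has to pace its retries while the main database is unreachable. A sketch of an exponential backoff schedule with a cap, assuming nothing about the real processor's policy:

```python
def retry_delays(attempts: int, base: float = 1.0, cap: float = 300.0):
    """Exponential backoff schedule for draining the write buffer.

    Hypothetical sketch: delay doubles each attempt, capped at `cap`
    seconds so retries never stop entirely during a long outage.
    """
    return [min(base * 2 ** i, cap) for i in range(attempts)]


print(retry_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```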
## Database Migrations
Schema changes must be coordinated:
### Migration Process
1. **Develop migration** in ops-db package
2. **Test on development database**
3. **Apply to main database** (Cologne)
4. **Replicas receive automatically** via WAL
5. **Update API** to use new schema
6. **Deploy API changes** to all sites
### Example Migration
```python
# In ops-db package
# alembic/versions/xxx_add_observation_status.py
def upgrade():
    op.add_column(
        'executed_obs_unit',
        sa.Column('detailed_status', sa.String(50)),
    )


def downgrade():
    op.drop_column('executed_obs_unit', 'detailed_status')
```
**Apply migration**:
```bash
# On main database
alembic upgrade head
```
**Replicas** receive the schema change automatically through WAL.
## Backup and Restore
### Backup Strategy
**Main database**:
- Continuous WAL archiving to S3/backup storage
- Daily full backups with pg_dump
- Point-in-time recovery capability
- Retention: 30 days
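A pruning job for the 30-day retention window can be reduced to one pure function. Hypothetical helper, not an ops-db-api component:

```python
from datetime import date, timedelta


def backups_to_delete(backup_dates, today, retention_days: int = 30):
    """Select daily backups that fall outside the retention window.

    Hypothetical pruning helper matching the 30-day retention above:
    anything strictly older than `today - retention_days` is returned.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff)


today = date(2025, 11, 12)
print(backups_to_delete([date(2025, 10, 1), date(2025, 11, 1)], today))
# [datetime.date(2025, 10, 1)]
```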
**Replicas**:
- No backups needed (can rebuild from main)
- Optional: Snapshot for faster rebuild
### Restore Procedure
```bash
# Restore from backup
pg_restore -d ccat_ops_db backup_file.dump
# Or rebuild replica from main
pg_basebackup -h main-db.example.com -D /var/lib/postgresql/data -U replication -v -P
```
## Performance Considerations
### Replication Impact on Main
**WAL generation**: Minimal overhead (< 5%)
**Network bandwidth**: Depends on write volume
- Light writes: < 1 MB/s
- Heavy writes: 10-50 MB/s
**Replication slots**: Can accumulate WAL if replica is down (monitor!)
### Replica Query Performance
**Read-only**: Indexes can only be created on the main database and arrive via replication; replicas cannot add their own
**Hot standby feedback**: Prevents query cancellation, may delay vacuum
**Connection pooling**: Same as main database
**Query routing**: API automatically queries local replica
## Summary
The database topology provides:
- **Single source of truth**: Main database in Cologne
- **Distributed reads**: Replicas at each site
- **Automatic synchronization**: PostgreSQL streaming replication
- **Precise tracking**: LSN-based replication monitoring
- **Network resilience**: Operations continue during outages
Key characteristics:
- All writes → main database
- Reads from local replica (with buffer merge)
- WAL-based replication (continuous or near-continuous)
- LSN tracking for cache management
- Monitoring for replication health
## Next Steps
- {doc}`site-configuration` - How sites are configured
- {doc}`../deep-dive/transaction-buffering/lsn-tracking` - LSN tracking implementation
- {doc}`../development/debugging-buffering` - Troubleshooting replication issues