Monitoring & Failure Recovery
=============================

.. verified:: 2025-11-06
   :reviewer: Christof Buchbender

The Data Transfer System implements comprehensive monitoring and multi-layered
failure recovery to ensure reliable data management across distributed sites.

Philosophy
----------

The system is designed to:

* Detect failures quickly through multiple mechanisms
* Log extensively for debugging
* Recover automatically from transient failures
* Alert humans only for issues requiring intervention
* Never silently skip work or lose data

Multiple overlapping systems ensure no failure goes unnoticed:

* Celery's built-in retry logic (immediate failures)
* Task heartbeat monitoring (stalled tasks)
* Service health checks (dead services)
* Database state tracking (operation results)
* Metrics collection (performance monitoring)

Health Monitoring
-----------------

Health Check Service
~~~~~~~~~~~~~~~~~~~~

:py:class:`ccat_data_transfer.health_check.HealthCheck`

Every manager and worker registers with a health check service:

.. literalinclude:: ../../ccat_data_transfer/cli.py
   :language: python
   :start-after: def deletion_manager_(verbose):
   :end-before: @ccat_data_transfer_.command("raw-data-package-manager")
   :dedent: 4

**Purpose**:

* Track which services are running
* Detect service crashes
* Enable external monitoring
* Provide quick health status

**Implementation**:

Health state is stored in Redis with a simple heartbeat mechanism:

.. code-block:: text

   Key: health:{service_type}:{service_name}
   Value: "alive"
   TTL: 90 seconds

The health check runs in a background thread that updates the key every
30 seconds. If a service stops updating, the key expires and the service is
detected as dead.

.. literalinclude:: ../../ccat_data_transfer/health_check.py
   :pyobject: HealthCheck._update_health_status
   :language: python

Service Status Query
~~~~~~~~~~~~~~~~~~~~

Check service health programmatically:

.. literalinclude:: ../../ccat_data_transfer/health_check.py
   :pyobject: HealthCheck.check_service_health
   :language: python

To check all services of a type:

.. literalinclude:: ../../ccat_data_transfer/health_check.py
   :pyobject: HealthCheck.check_services_health
   :language: python

.. note::

   Currently there is no CLI command to display service health status. This
   functionality must be accessed programmatically or through custom scripts.
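Until such a command exists, the health keys can be inspected directly in
Redis. A minimal sketch, assuming the key layout shown above and a Redis
instance reachable on ``localhost`` (host, port, and database are
deployment-specific):

.. code-block:: python

   import redis

   # Assumed connection settings; adjust to your deployment.
   r = redis.Redis(host="localhost", port=6379, decode_responses=True)

   # Every live service owns one "health:{service_type}:{service_name}" key.
   for key in sorted(r.scan_iter("health:*")):
       _, service_type, service_name = key.split(":", 2)
       ttl = r.ttl(key)  # seconds until the key expires if no heartbeat arrives
       print(f"{service_type}/{service_name}: alive (expires in {ttl}s)")

A dead service simply has no key, so comparing this listing against the set of
services you expect to be running is enough for a quick check.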
Task-Level Monitoring
~~~~~~~~~~~~~~~~~~~~~

Individual Celery tasks send heartbeats during execution to detect stalled
tasks.

:py:func:`ccat_data_transfer.setup_celery_app.make_celery_task`

Tasks are created using the
:py:func:`ccat_data_transfer.setup_celery_app.make_celery_task` factory
function, which returns an enhanced base task class. Each task type inherits
from this base class:

.. code-block:: python

   # Example from transfer_manager.py
   class DataTransferTask(make_celery_task()):
       """Base class for data transfer tasks."""

       def __init__(self):
           super().__init__()
           self.operation_type = "transfer"

   # Example from deletion_manager.py
   class DeletionTask(make_celery_task()):
       """Base class for deletion tasks."""

       def __init__(self):
           super().__init__()
           self.operation_type = "delete"

   # Example from archive_manager.py
   class LongTermArchiveTask(make_celery_task()):
       """Base class for long term archive tasks."""

       def __init__(self):
           super().__init__()
           self.operation_type = "long_term_archive"

The base class returned by
:py:func:`ccat_data_transfer.setup_celery_app.make_celery_task` is
``CCATEnhancedSQLAlchemyTask``, which implements heartbeat tracking and error
handling:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :pyobject: make_celery_task
   :language: python

The ``__call__`` method of this base class implements the heartbeat mechanism:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def __call__(self, *args, **kwargs):
   :end-before: # Set up periodic heartbeat with proper cleanup
   :dedent: 8

**Heartbeat Worker**:

The heartbeat runs in a separate thread and updates Redis every 60 seconds
(checking every 10 seconds for stop signals):

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def heartbeat_worker():
   :end-before: try:
   :dedent: 12

**Storage**:

:py:class:`ccat_data_transfer.task_state_manager.TaskStateManager`

.. literalinclude:: ../../ccat_data_transfer/task_state_manager.py
   :pyobject: TaskStateManager.register_task
   :language: python

If a task stops updating, the recovery service detects it as stalled based on
the configured heartbeat timeout.

Disk Usage Monitoring
~~~~~~~~~~~~~~~~~~~~~

:py:mod:`ccat_data_transfer.disk_monitor`

Continuously monitors disk usage at all ``DiskDataLocation`` instances:

.. literalinclude:: ../../ccat_data_transfer/disk_monitor.py
   :pyobject: monitor_disk_location
   :language: python

**Scheduling**:

Disk monitoring runs as Celery Beat periodic tasks. The system uses both
location-based and legacy site-based monitoring:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: app.conf.beat_schedule = {
   :end-before: def start_celery_beat():

**Thresholds**:

Configurable thresholds from :py:mod:`ccat_data_transfer.config.config`:

* **< 70%**: Normal operation (``BUFFER_WARNING_THRESHOLD_PERCENT``)
* **70-85%**: Warning - monitor closely
* **85-95%**: Critical - aggressive deletion (``BUFFER_CRITICAL_THRESHOLD_PERCENT``)
* **> 95%**: Emergency - immediate action required (``BUFFER_EMERGENCY_THRESHOLD_PERCENT``)

Default configuration:

.. code-block:: toml

   BUFFER_WARNING_THRESHOLD_PERCENT = 70
   BUFFER_CRITICAL_THRESHOLD_PERCENT = 85
   BUFFER_EMERGENCY_THRESHOLD_PERCENT = 95
   BUFFER_RECOVERY_THRESHOLD_PERCENT = 60

Metrics Collection
~~~~~~~~~~~~~~~~~~

:py:mod:`ccat_data_transfer.metrics`

The system sends operational metrics to InfluxDB for analysis and monitoring.

**Metrics Class**:

.. literalinclude:: ../../ccat_data_transfer/metrics.py
   :pyobject: HousekeepingMetrics
   :language: python

**Available Metrics**:

Transfer Metrics:

.. literalinclude:: ../../ccat_data_transfer/metrics.py
   :pyobject: HousekeepingMetrics.send_transfer_metrics
   :language: python

Function Metrics:

.. literalinclude:: ../../ccat_data_transfer/metrics.py
   :pyobject: HousekeepingMetrics.send_function_metrics
   :language: python

**Configuration**:

InfluxDB connection settings from config:

.. code-block:: toml

   INFLUXDB_URL = "http://localhost:8086"
   INFLUXDB_TOKEN = "myadmintoken123"
   INFLUXDB_ORG = "myorg"
   INFLUXDB_BUCKET = "mybucket"

.. note::

   Metrics collection is implemented but not systematically integrated
   throughout all pipeline stages. Individual operations can send metrics
   using the ``HousekeepingMetrics`` class, but this must be done explicitly
   in each component.
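Components send metrics by calling the ``HousekeepingMetrics`` methods shown
above, which ultimately write points to InfluxDB. A minimal sketch of that
kind of write using the ``influxdb-client`` package directly; the measurement,
tag, and field names here are illustrative and the actual names used by
``HousekeepingMetrics`` may differ:

.. code-block:: python

   from influxdb_client import InfluxDBClient, Point
   from influxdb_client.client.write_api import SYNCHRONOUS

   # Connection settings mirror the INFLUXDB_* values above.
   client = InfluxDBClient(
       url="http://localhost:8086",
       token="myadmintoken123",
       org="myorg",
   )
   write_api = client.write_api(write_options=SYNCHRONOUS)

   # Illustrative transfer metric: one point per completed transfer.
   point = (
       Point("data_transfer")  # measurement name (illustrative)
       .tag("source_site", "ccat")
       .tag("dest_site", "cologne")
       .field("duration_seconds", 120.5)
       .field("throughput_mbps", 450.2)
   )
   write_api.write(bucket="mybucket", record=point)
   client.close()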
Grafana Dashboards
~~~~~~~~~~~~~~~~~~

Metrics can be visualized in Grafana by connecting to the InfluxDB instance.
Dashboard configuration is deployment-specific and not included in the
data-transfer package. See :doc:`/operations/monitoring` for information on
the broader monitoring infrastructure.

Failure Detection & Recovery System
-----------------------------------

The CCAT Data Transfer system implements a robust two-tier failure recovery
mechanism to handle the various types of failures that can occur during data
transfer operations.

**Recovery Architecture Overview**:

The system consists of two complementary recovery mechanisms:

1. **Immediate Task Recovery** (Celery-based) - Handles expected failures with
   automatic retries
2. **Stalled Task Recovery** (Monitor-based) - Handles unexpected interruptions
   and deadlocks

This dual approach ensures that both expected failures (handled by Celery) and
unexpected interruptions (handled by the monitor) can be properly managed and
recovered from.

Immediate Failure Detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Celery detects failures when tasks raise exceptions.

**Automatic Retry**:

Tasks automatically retry based on exception type and task configuration:

.. code-block:: python

   @app.task(
       autoretry_for=(NetworkError, TemporaryError),
       retry_kwargs={'max_retries': 5, 'countdown': 60},
       retry_backoff=True,  # Exponential backoff
   )
   def retriable_operation():
       # Operation implementation
       pass

**Custom Retry Decisions**:

Bound tasks can also implement their own retry logic per exception type:

.. code-block:: python

   @app.task(bind=True)
   def smart_retry_task(self):
       try:
           return do_work()
       except NetworkError as e:
           # Network errors: retry quickly
           raise self.retry(exc=e, countdown=30)
       except ChecksumError as e:
           # Data corruption: longer delay, fewer retries
           raise self.retry(exc=e, countdown=300, max_retries=2)
       except PermissionError:
           # Permission errors: don't retry
           mark_permanent_failure()
           raise

Stalled Task Detection & Recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:py:mod:`ccat_data_transfer.recovery_service_runner`

The stalled task recovery system operates independently of Celery and handles
cases where tasks are interrupted unexpectedly.

**Components**:

1. **Task Monitor Service** -
   :py:class:`ccat_data_transfer.task_monitor_service.TaskMonitorService`

   - Monitors task heartbeats
   - Detects stalled tasks
   - Initiates recovery procedures
   - Implements circuit breaker pattern

2. **Recovery Service Runner** -
   :py:func:`ccat_data_transfer.recovery_service_runner.run_task_recovery_service`

   - Runs as a standalone process
   - Manages the monitoring loop
   - Handles service lifecycle
   - Sends status notifications

**Recovery Process**:

.. literalinclude:: ../../ccat_data_transfer/recovery_service_runner.py
   :pyobject: run_task_recovery_service
   :language: python

**Task State Tracking**:

1. Tasks send periodic heartbeats to Redis (every 60 seconds)
2. The monitor service checks for stale heartbeats (every 60 seconds by default)
3. Stalled tasks are identified based on the configured timeout (default
   300 seconds); see the sketch at the end of this subsection

**Configuration**:

Recovery service configuration from
:py:class:`ccat_data_transfer.config.config.TaskRecoverySettings`:

.. literalinclude:: ../../ccat_data_transfer/config/config.py
   :pyobject: TaskRecoverySettings
   :language: python

**Recovery Actions**:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._recover_stalled_task
   :language: python
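The detection side of this loop reduces to comparing each task's last
heartbeat against ``heartbeat_timeout``. A minimal sketch of that check,
assuming heartbeats are stored as UNIX timestamps under keys of the form
``task:{task_id}`` (the actual layout is managed by
:py:class:`ccat_data_transfer.task_state_manager.TaskStateManager` and may
differ):

.. code-block:: python

   import time

   import redis

   # Mirrors TaskRecoverySettings.heartbeat_timeout (default: 300 seconds).
   HEARTBEAT_TIMEOUT = 300

   r = redis.Redis(host="localhost", port=6379, decode_responses=True)

   stalled = []
   for key in r.scan_iter("task:*"):
       raw = r.get(key)
       if raw is None:
           continue  # key expired between scan and get
       last_heartbeat = float(raw)  # assumed: value is a UNIX timestamp
       if time.time() - last_heartbeat > HEARTBEAT_TIMEOUT:
           stalled.append(key)

   print(f"{len(stalled)} stalled task(s): {stalled}")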
Circuit Breaker Pattern
~~~~~~~~~~~~~~~~~~~~~~~

The recovery system implements a circuit breaker to prevent infinite retry
loops.

**Circuit Breaker Logic**:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._check_circuit_breaker
   :language: python

**Circuit Breaker States**:

1. **Closed** (Normal): Tasks retry normally when they stall
2. **Open** (Tripped): After ``max_stall_count`` consecutive stalls
   (default: 3), the breaker opens and blocks retries
3. **Automatic Reset**: After ``circuit_breaker_timeout`` (default: 3600
   seconds), the breaker automatically closes and allows retries again

**Manual Reset**:

Administrators can manually reset the circuit breaker:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService.reset_circuit_breaker
   :language: python

**Force Retry**:

To force a retry of a stalled task:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService.force_retry_stalled_task
   :language: python

Service Death Detection
~~~~~~~~~~~~~~~~~~~~~~~

If a manager service dies, the health check detects it:

1. The service stops sending heartbeats
2. The Redis key expires after its TTL (90 seconds)
3. The monitoring system detects the missing service
4. Alerts can be configured through external monitoring tools

**But operations continue**:

* Already-submitted tasks still execute
* Workers continue processing queues
* No data is lost (database state is preserved)
* Restart the manager to resume scheduling new work

Silent Failure Prevention
~~~~~~~~~~~~~~~~~~~~~~~~~

Several mechanisms prevent silent failures:

**Database Constraints**:

* Foreign keys ensure referential integrity
* Check constraints validate state transitions
* Unique constraints prevent duplicates

**Transaction Safety**:

.. code-block:: python

   @app.task
   def operation_with_safety(op_id):
       with session_scope() as session:
           try:
               # Do work
               perform_operation(session, op_id)

               # Update state
               operation = session.query(Operation).get(op_id)
               operation.status = Status.COMPLETED
               session.commit()
           except Exception:
               session.rollback()
               raise

If the task fails, the transaction rolls back, the database is left unchanged,
and the operation is retried.

**State Verification**:

Before proceeding, workers verify the current state:

.. code-block:: python

   def transfer_file(transfer_id):
       transfer = session.query(DataTransfer).get(transfer_id)

       # Verify preconditions
       if transfer.status != Status.PENDING:
           logger.info("Transfer already processed", transfer_id=transfer_id)
           return  # Idempotent

       if not os.path.exists(transfer.source_path):
           raise FileNotFoundError("Source file missing")

       # Proceed with transfer
       # ...

This prevents:

* Processing the same operation twice
* Operating on stale data
* Cascading failures from bad state

Recovery Mechanisms
-------------------

Layer 1: Celery Retry Logic
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Automatic Retries**:

Tasks automatically retry for transient errors through the base class returned
by :py:func:`ccat_data_transfer.setup_celery_app.make_celery_task`. Each task
class that inherits from this base implements custom retry logic through the
``should_retry`` method:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def should_retry(self, exc, operation_id, retry_count):
   :end-before: def on_failure(self, exc, task_id, args, kwargs, einfo):
   :dedent: 12

The base class provides default retry behavior that subclasses can customize
based on their specific operation requirements.

**Custom Retry Logic**:

Each operation type can customize retry behavior by overriding the
``should_retry`` method. The base implementation checks for non-retryable
exceptions and respects retry count limits.
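As an illustration, such an override might look roughly like the following
sketch. ``ExampleTransferTask`` and ``ChecksumMismatchError`` are hypothetical
names, not the actual classes shipped in the package:

.. code-block:: python

   from ccat_data_transfer.setup_celery_app import make_celery_task


   class ChecksumMismatchError(Exception):
       """Placeholder exception for illustration only."""


   class ExampleTransferTask(make_celery_task()):
       """Hypothetical subclass narrowing retry behaviour for transfers."""

       def __init__(self):
           super().__init__()
           self.operation_type = "transfer"

       def should_retry(self, exc, operation_id, retry_count):
           # Permission problems will not fix themselves: never retry.
           if isinstance(exc, PermissionError):
               return False
           # Possible data corruption: allow only a couple of retries.
           if isinstance(exc, ChecksumMismatchError):
               return retry_count < 2
           # Everything else: defer to the base class policy.
           return super().should_retry(exc, operation_id, retry_count)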
**Error Classification**:

The system defines a hierarchy of custom exceptions in
:py:mod:`ccat_data_transfer.exceptions`:

.. literalinclude:: ../../ccat_data_transfer/exceptions.py
   :pyobject: CCATDataOperationError
   :language: python

Each error type specifies:

- Whether it is retryable
- The maximum number of retries allowed
- Operation-specific context

**Recovery Implementation**:

Each pipeline component implements custom recovery logic through two key
methods defined in the base class returned by ``make_celery_task()``:

- ``reset_state_on_failure``: Handles retryable errors by resetting operation
  state for retry
- ``mark_permanent_failure``: Handles non-retryable errors that require
  intervention

Example from ``DataTransferTask``:

.. literalinclude:: ../../ccat_data_transfer/transfer_manager.py
   :pyobject: DataTransferTask.reset_state_on_failure
   :language: python

.. literalinclude:: ../../ccat_data_transfer/transfer_manager.py
   :pyobject: DataTransferTask.mark_permanent_failure
   :language: python

These methods are called automatically by the base task class when errors
occur:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def on_failure(self, exc, task_id, args, kwargs, einfo):
   :end-before: def on_success(self, retval, task_id, args, kwargs):

Layer 2: Task State Reset
~~~~~~~~~~~~~~~~~~~~~~~~~

The recovery service handles tasks that stall through operation-specific
handlers.

**Operation-Specific Recovery**:

Each operation type has a recovery handler in
:py:class:`ccat_data_transfer.task_monitor_service.TaskMonitorService`:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :language: python
   :start-after: # Define recovery methods for each operation type
   :end-before: def _check_circuit_breaker
   :dedent: 8

**Example Package Recovery Handler**:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._recover_package
   :language: python

Layer 3: Manager Re-scanning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Managers continuously scan for work:

* Even if a task submission failed, the next manager cycle finds it
* The database is the source of truth for what needs doing
* Missing work is eventually discovered and scheduled

**Example**:

The :py:func:`ccat_data_transfer.raw_data_package_manager.create_raw_data_packages`
function continuously scans for unpackaged files and creates packages:

.. literalinclude:: ../../ccat_data_transfer/raw_data_package_manager.py
   :pyobject: create_raw_data_packages
   :language: python

Even if a previous submission was lost, the next cycle will find and resubmit
the work.

Layer 4: Human Intervention
~~~~~~~~~~~~~~~~~~~~~~~~~~~

When automatic recovery fails, the system alerts operators through
notifications. Operations requiring intervention are visible in ops-db-ui with
clear indicators, and the notification system sends alerts to configured
recipients.

Notification System
-------------------

:py:mod:`ccat_data_transfer.notification_service`

The notification system provides alerts for critical system events.

Notification Service
~~~~~~~~~~~~~~~~~~~~

The notification service is implemented in
:py:class:`ccat_data_transfer.notification_service.NotificationService` and
handles email notifications for critical system events.

Notification Channels
~~~~~~~~~~~~~~~~~~~~~

**Redis Queue-Based System**:

:py:class:`ccat_data_transfer.notification_service.NotificationService`

The notification service processes messages from Redis queues:

.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationService.start
   :language: python

**Email Notifications**:

Email is the primary notification channel:
.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationService._send_email
   :language: python

**Configuration**:

.. code-block:: toml

   [default.SMTP_CONFIG]
   SERVER = "smtp.uni-koeln.de"
   PORT = 25
   USE_TLS = false
   USER = false
   FROM_ADDRESS = "ccat-data-transfer@uni-koeln.de"
   RECIPIENTS = ["ccat-data-transfer@uni-koeln.de"]

**Sending Notifications**:

:py:class:`ccat_data_transfer.notification_service.NotificationClient`

Components send notifications by pushing to the Redis queue:

.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationClient.send_notification
   :language: python

.. note::

   Additional notification channels (Slack, Discord, database logging) are not
   currently implemented. The system only supports email via SMTP and Redis
   pub/sub for real-time updates to ops-db-ui.

Cooldown Management
~~~~~~~~~~~~~~~~~~~

An automatic cooldown prevents notification spam.

**Implementation**:

The :py:class:`ccat_data_transfer.task_monitor_service.TaskMonitorService`
tracks recent notifications and applies a cooldown period (default: 1 hour) to
prevent duplicate alerts:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._should_send_notification
   :language: python

**Retry Logic**:

Failed email deliveries are automatically retried with exponential backoff:

.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationService._process_message
   :language: python

Configuration
-------------

The monitoring and recovery system can be configured through various settings:

**Health Check Settings**:

In :py:class:`ccat_data_transfer.health_check.HealthCheck`:

- ``update_interval``: How often to update health status (default: 30 seconds)
- ``ttl``: How long health status remains valid (default: 90 seconds)

**Task Recovery Settings**:

In :py:class:`ccat_data_transfer.config.config.TaskRecoverySettings`:

- ``heartbeat_timeout``: Time before a task is considered stalled (default: 300 seconds)
- ``max_stall_count``: Maximum stalls before the circuit breaker opens (default: 3)
- ``circuit_breaker_timeout``: Time before the circuit breaker resets (default: 3600 seconds)
- ``LOOP_INTERVAL``: How often to check for stalled tasks (default: 60 seconds)

**Notification Settings**:

- ``notification_cooldown``: Time between duplicate notifications (default: 3600 seconds)
- ``max_retries``: Maximum email retry attempts (default: 5)
- ``retry_delay``: Base delay for exponential backoff (default: 60 seconds)

**Disk Monitoring Settings**:

- ``BUFFER_WARNING_THRESHOLD_PERCENT``: Warning level (default: 70)
- ``BUFFER_CRITICAL_THRESHOLD_PERCENT``: Critical level (default: 85)
- ``BUFFER_EMERGENCY_THRESHOLD_PERCENT``: Emergency level (default: 95)
- ``BUFFER_RECOVERY_THRESHOLD_PERCENT``: Recovery target (default: 60)

Observability Best Practices
----------------------------

Structured Logging
~~~~~~~~~~~~~~~~~~

:py:func:`ccat_data_transfer.logging_utils.get_structured_logger`

All logging uses a structured format:

.. literalinclude:: ../../ccat_data_transfer/logging_utils.py
   :pyobject: get_structured_logger
   :language: python
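In application code a component typically obtains a logger once and attaches
context as keyword arguments on every call. A brief usage sketch, assuming
``get_structured_logger`` takes the logger name (check ``logging_utils.py``
for the exact signature):

.. code-block:: python

   from ccat_data_transfer.logging_utils import get_structured_logger

   logger = get_structured_logger("ccat_data_transfer.transfer_manager")

   # Keyword arguments become fields in the structured log record.
   logger.info(
       "Transfer completed",
       transfer_id=456,
       source_site="ccat",
       dest_site="cologne",
       duration_seconds=120.5,
       throughput_mbps=450.2,
       file_size_bytes=54321098765,
   )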
Output (JSON):

.. code-block:: json

   {
     "timestamp": "2024-11-27T10:30:00.123Z",
     "level": "INFO",
     "logger": "ccat_data_transfer.transfer_manager",
     "message": "Transfer completed",
     "transfer_id": 456,
     "source_site": "ccat",
     "dest_site": "cologne",
     "duration_seconds": 120.5,
     "throughput_mbps": 450.2,
     "file_size_bytes": 54321098765
   }

**Benefits**:

* Machine-parseable for log aggregation
* Easy to query specific fields
* Consistent format across all services
* Rich context for debugging

Correlation IDs
~~~~~~~~~~~~~~~

Track operations across services:

.. code-block:: python

   # Manager creates the operation
   operation = DataTransfer(...)
   session.add(operation)
   session.commit()

   logger.info(
       "Submitting transfer task",
       operation_id=operation.id,  # Correlation ID
       source=source.name,
       destination=dest.name,
   )

   # Worker logs with the same ID
   logger.info(
       "Executing transfer",
       operation_id=operation.id,  # Same ID
       task_id=self.request.id,
   )

   # Later stages reference the same ID
   logger.info(
       "Unpacking transfer",
       operation_id=operation.id,  # Traceable!
   )

Query logs by ``operation_id`` to see the complete lifecycle of an operation.

Error Context
~~~~~~~~~~~~~

Include rich context in error logs:

.. code-block:: python

   try:
       transfer_file(source, dest)
   except Exception as e:
       logger.error(
           "Transfer failed",
           exc_info=e,  # Full traceback
           transfer_id=transfer.id,
           source_path=source_path,
           dest_path=dest_path,
           retry_count=retry_count,
           file_size=file_size,
           network_conditions={
               "latency_ms": latency,
               "packet_loss": packet_loss,
           },
       )

This makes debugging vastly easier.

Troubleshooting Guide
---------------------

Common Issues and Solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Issue**: Task stuck in "IN_PROGRESS" forever

**Diagnosis**:

1. Check if the worker is still running: ``celery inspect active``
2. Check the heartbeat: ``redis-cli GET task:{task_id}``
3. Check the worker logs for errors
4. Verify the recovery service is running

**Solution**:

* The recovery service should detect and reset the task automatically
* Check the circuit breaker state:
  ``redis-cli HGETALL circuit_breaker:{operation_type}:{operation_id}``
* If the circuit breaker is open, wait for the timeout or reset it manually
* If the task is not reset automatically, reset it manually:
  ``UPDATE operation SET status='PENDING' WHERE id=X``
* Restart the worker if it crashed

**Issue**: Transfer failing with network errors

**Diagnosis**:

1. Test the network: ``ping destination_host``
2. Test BBCP manually: ``bbcp source dest``
3. Check firewall rules
4. Examine the transfer logs for error details
5. Check the retry count in the database

**Solution**:

* Transient: automatic retry will handle it
* Persistent: check network configuration and firewalls
* If the circuit breaker is open: investigate the underlying issue before
  resetting
* Workaround: use an alternative route if available

**Issue**: Disk usage alert but deletion not working

**Diagnosis**:

1. Check that the deletion manager is running: verify its health check key exists
2. Check which packages are eligible for deletion with a database query (see
   the sketch below)
3. Check the deletion manager logs
4. Verify the retention policies
5. Check the disk thresholds in the configuration

**Solution**:

* Ensure packages are ARCHIVED before deletion
* Check that retention periods aren't too long
* Verify the threshold configuration matches expectations
* Manually trigger deletion if needed
* You may need to adjust the thresholds
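For diagnosis step 2 above, a direct query against the operations database is
usually the quickest check. A minimal sketch using the project's
``session_scope`` helper; the table and column names here are illustrative
placeholders, so adapt them to the actual schema:

.. code-block:: python

   from sqlalchemy import text

   # Placeholder table/column names: check the real model definitions first.
   with session_scope() as session:
       rows = session.execute(
           text(
               "SELECT id, status, updated_at "
               "FROM raw_data_package "
               "WHERE status = 'ARCHIVED' "
               "ORDER BY updated_at"
           )
       ).fetchall()

   print(f"{len(rows)} package(s) currently eligible for deletion")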
**Issue**: Circuit breaker preventing recovery

**Diagnosis**:

1. Check the circuit breaker state:
   ``redis-cli HGETALL circuit_breaker:{operation_type}:{operation_id}``
2. Review the stall count and the last stall time
3. Examine the logs for the underlying failure cause
4. Verify the operation configuration

**Solution**:

* Wait for the automatic reset after the timeout period
* Fix the underlying issue (network, permissions, configuration)
* Manually reset the circuit breaker if the issue is resolved
* Use ``force_retry_stalled_task`` for an immediate retry

**Issue**: Notification service not sending emails

**Diagnosis**:

1. Check that the notification service is running
2. Verify the SMTP configuration
3. Check the notification queue: ``redis-cli LLEN ccat:notifications:queue``
4. Check the retry queue: ``redis-cli LLEN ccat:notifications:retry:queue``
5. Examine the notification service logs

**Solution**:

* Verify SMTP server accessibility
* Check the FROM_ADDRESS configuration
* Ensure the RECIPIENTS list is valid
* Restart the notification service if needed
* Check for messages stuck in the retry queue

Best Practices
--------------

Error Classification
~~~~~~~~~~~~~~~~~~~~

- Use appropriate error types for different failure scenarios
- Set retryability based on whether the error is transient
- Include relevant context in error messages for debugging

Recovery Implementation
~~~~~~~~~~~~~~~~~~~~~~~

- Implement both recovery methods (``reset_state_on_failure`` and
  ``mark_permanent_failure``)
- Handle database state properly with transaction safety
- Log recovery actions with structured logging
- Ensure operations are idempotent to prevent double-processing

Monitoring
~~~~~~~~~~

- Monitor recovery success rates through metrics
- Track retry counts to identify problematic operations
- Review notification patterns to detect systemic issues
- Set up alerts for high failure rates
- Monitor circuit breaker state for frequently failing operations

Maintenance
~~~~~~~~~~~

- Regularly review error patterns to identify common issues
- Update recovery strategies based on observed failure modes
- Adjust timeouts as needed based on operational experience
- Keep configuration in sync with actual system behavior
- Periodically review and clean up old circuit breaker states

Next Steps
----------

* :doc:`lifecycle` - Detailed deletion policies and retention management
* :doc:`philosophy` - Why monitoring is designed this way
* :doc:`pipeline` - See where monitoring integrates with pipeline stages
* :doc:`/operations/monitoring` - Broader monitoring infrastructure (Grafana,
  Loki, Promtail)