Monitoring & Failure Recovery
=============================

.. verified:: 2025-11-06
   :reviewer: Christof Buchbender

The Data Transfer System implements comprehensive monitoring and multi-layered
failure recovery to ensure reliable data management across distributed sites.

Philosophy
----------

The system is designed to:

* Detect failures quickly through multiple mechanisms
* Log extensively for debugging
* Recover automatically from transient failures
* Alert humans only for issues requiring intervention
* Never silently skip work or lose data

Multiple overlapping systems ensure no failure goes unnoticed:

* Celery's built-in retry logic (immediate failures)
* Task heartbeat monitoring (stalled tasks)
* Service health checks (dead services)
* Database state tracking (operation results)
* Metrics collection (performance monitoring)

Health Monitoring
-----------------

Health Check Service
~~~~~~~~~~~~~~~~~~~~

:py:class:`ccat_data_transfer.health_check.HealthCheck`

Every manager and worker registers with a health check service:

.. literalinclude:: ../../ccat_data_transfer/cli.py
   :language: python
   :start-after: def deletion_manager_(verbose):
   :end-before: @ccat_data_transfer_.command("raw-data-package-manager")
   :dedent: 4

**Purpose**:

* Track which services are running
* Detect service crashes
* Enable external monitoring
* Provide quick health status

**Implementation**:

Health state is stored in Redis with a simple heartbeat mechanism:

.. code-block:: text

   Key: health:{service_type}:{service_name}
   Value: "alive"
   TTL: 90 seconds

The health check runs in a background thread that updates the key every
30 seconds. If a service stops updating, the key expires and the service is
detected as dead.

.. literalinclude:: ../../ccat_data_transfer/health_check.py
   :pyobject: HealthCheck._update_health_status
   :language: python

Service Status Query
~~~~~~~~~~~~~~~~~~~~

Check service health programmatically:

.. literalinclude:: ../../ccat_data_transfer/health_check.py
   :pyobject: HealthCheck.check_service_health
   :language: python

To check all services of a type:

.. literalinclude:: ../../ccat_data_transfer/health_check.py
   :pyobject: HealthCheck.check_services_health
   :language: python

.. note::

   Currently there is no CLI command to display service health status. This
   functionality must be accessed programmatically or through custom scripts.
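Until such a command exists, the health keys can be inspected directly in
Redis. A minimal sketch, assuming the key layout shown above and a Redis
instance reachable on ``localhost`` (host, port, and database are
deployment-specific):

.. code-block:: python

   import redis

   # Assumed connection settings; adjust to your deployment.
   r = redis.Redis(host="localhost", port=6379, decode_responses=True)

   # Every live service owns one "health:{service_type}:{service_name}" key.
   for key in sorted(r.scan_iter("health:*")):
       _, service_type, service_name = key.split(":", 2)
       ttl = r.ttl(key)  # seconds until the key expires if no heartbeat arrives
       print(f"{service_type}/{service_name}: alive (expires in {ttl}s)")

A dead service simply has no key, so comparing this listing against the set of
services you expect to be running is enough for a quick check.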
Task-Level Monitoring
~~~~~~~~~~~~~~~~~~~~~

Individual Celery tasks send heartbeats during execution to detect stalled
tasks.

:py:func:`ccat_data_transfer.setup_celery_app.make_celery_task`

Tasks are created using the
:py:func:`ccat_data_transfer.setup_celery_app.make_celery_task` factory
function, which returns an enhanced base task class. Each task type inherits
from this base class:

.. code-block:: python

   # Example from transfer_manager.py
   class DataTransferTask(make_celery_task()):
       """Base class for data transfer tasks."""

       def __init__(self):
           super().__init__()
           self.operation_type = "transfer"

   # Example from deletion_manager.py
   class DeletionTask(make_celery_task()):
       """Base class for deletion tasks."""

       def __init__(self):
           super().__init__()
           self.operation_type = "delete"

   # Example from archive_manager.py
   class LongTermArchiveTask(make_celery_task()):
       """Base class for long term archive tasks."""

       def __init__(self):
           super().__init__()
           self.operation_type = "long_term_archive"

The base class returned by
:py:func:`ccat_data_transfer.setup_celery_app.make_celery_task` is
``CCATEnhancedSQLAlchemyTask``, which implements heartbeat tracking and error
handling:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :pyobject: make_celery_task
   :language: python

The ``__call__`` method of this base class implements the heartbeat mechanism:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def __call__(self, *args, **kwargs):
   :end-before: # Set up periodic heartbeat with proper cleanup
   :dedent: 8

**Heartbeat Worker**:

The heartbeat runs in a separate thread and updates Redis every 60 seconds
(checking every 10 seconds for stop signals):

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def heartbeat_worker():
   :end-before: try:
   :dedent: 12

**Storage**:

:py:class:`ccat_data_transfer.task_state_manager.TaskStateManager`

.. literalinclude:: ../../ccat_data_transfer/task_state_manager.py
   :pyobject: TaskStateManager.register_task
   :language: python

If a task stops updating, the recovery service detects it as stalled based on
the configured heartbeat timeout.

Disk Usage Monitoring
~~~~~~~~~~~~~~~~~~~~~

:py:mod:`ccat_data_transfer.disk_monitor`

Continuously monitors disk usage at all ``DiskDataLocation`` instances:

.. literalinclude:: ../../ccat_data_transfer/disk_monitor.py
   :pyobject: monitor_disk_location
   :language: python

**Scheduling**:

Disk monitoring runs as Celery Beat periodic tasks. The system uses both
location-based and legacy site-based monitoring:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: app.conf.beat_schedule = {
   :end-before: def start_celery_beat():

**Thresholds**:

Configurable thresholds from :py:mod:`ccat_data_transfer.config.config`:

* **< 70%**: Normal operation (``BUFFER_WARNING_THRESHOLD_PERCENT``)
* **70-85%**: Warning - monitor closely
* **85-95%**: Critical - aggressive deletion (``BUFFER_CRITICAL_THRESHOLD_PERCENT``)
* **> 95%**: Emergency - immediate action required (``BUFFER_EMERGENCY_THRESHOLD_PERCENT``)

Default configuration:

.. code-block:: toml

   BUFFER_WARNING_THRESHOLD_PERCENT = 70
   BUFFER_CRITICAL_THRESHOLD_PERCENT = 85
   BUFFER_EMERGENCY_THRESHOLD_PERCENT = 95
   BUFFER_RECOVERY_THRESHOLD_PERCENT = 60

Metrics Collection
~~~~~~~~~~~~~~~~~~

:py:mod:`ccat_data_transfer.metrics`

The system sends operational metrics to InfluxDB for analysis and monitoring.

**Metrics Class**:

.. literalinclude:: ../../ccat_data_transfer/metrics.py
   :pyobject: HousekeepingMetrics
   :language: python

**Available Metrics**:

Transfer Metrics:

.. literalinclude:: ../../ccat_data_transfer/metrics.py
   :pyobject: HousekeepingMetrics.send_transfer_metrics
   :language: python

Function Metrics:

.. literalinclude:: ../../ccat_data_transfer/metrics.py
   :pyobject: HousekeepingMetrics.send_function_metrics
   :language: python

**Configuration**:

InfluxDB connection settings from config:

.. code-block:: toml

   INFLUXDB_URL = "http://localhost:8086"
   INFLUXDB_TOKEN = "myadmintoken123"
   INFLUXDB_ORG = "myorg"
   INFLUXDB_BUCKET = "mybucket"

.. note::

   Metrics collection is implemented but not systematically integrated
   throughout all pipeline stages. Individual operations can send metrics
   using the ``HousekeepingMetrics`` class, but this must be done explicitly
   in each component.
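Components send metrics by calling the ``HousekeepingMetrics`` methods shown
above, which ultimately write points to InfluxDB. A minimal sketch of that
kind of write using the ``influxdb-client`` package directly; the measurement,
tag, and field names here are illustrative and the actual names used by
``HousekeepingMetrics`` may differ:

.. code-block:: python

   from influxdb_client import InfluxDBClient, Point
   from influxdb_client.client.write_api import SYNCHRONOUS

   # Connection settings mirror the INFLUXDB_* values above.
   client = InfluxDBClient(
       url="http://localhost:8086",
       token="myadmintoken123",
       org="myorg",
   )
   write_api = client.write_api(write_options=SYNCHRONOUS)

   # Illustrative transfer metric: one point per completed transfer.
   point = (
       Point("data_transfer")  # measurement name (illustrative)
       .tag("source_site", "ccat")
       .tag("dest_site", "cologne")
       .field("duration_seconds", 120.5)
       .field("throughput_mbps", 450.2)
   )
   write_api.write(bucket="mybucket", record=point)
   client.close()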
Grafana Dashboards
~~~~~~~~~~~~~~~~~~

Metrics can be visualized in Grafana by connecting to the InfluxDB instance.
Dashboard configuration is deployment-specific and not included in the
data-transfer package. See :doc:`/operations/monitoring` for information on
the broader monitoring infrastructure.

Failure Detection & Recovery System
-----------------------------------

The CCAT Data Transfer system implements a robust two-tier failure recovery
mechanism to handle the various types of failures that can occur during data
transfer operations.

**Recovery Architecture Overview**:

The system consists of two complementary recovery mechanisms:

1. **Immediate Task Recovery** (Celery-based) - Handles expected failures with
   automatic retries
2. **Stalled Task Recovery** (Monitor-based) - Handles unexpected interruptions
   and deadlocks

This dual approach ensures that both expected failures (handled by Celery) and
unexpected interruptions (handled by the monitor) can be properly managed and
recovered from.

Immediate Failure Detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Celery detects failures when tasks raise exceptions.

**Automatic Retry**:

Tasks automatically retry based on exception type and task configuration:

.. code-block:: python

   @app.task(
       autoretry_for=(NetworkError, TemporaryError),
       retry_kwargs={'max_retries': 5, 'countdown': 60},
       retry_backoff=True,  # Exponential backoff
   )
   def retriable_operation():
       # Operation implementation
       pass

**Custom Retry Decisions**:

Bound tasks can also implement their own retry logic per exception type:

.. code-block:: python

   @app.task(bind=True)
   def smart_retry_task(self):
       try:
           return do_work()
       except NetworkError as e:
           # Network errors: retry quickly
           raise self.retry(exc=e, countdown=30)
       except ChecksumError as e:
           # Data corruption: longer delay, fewer retries
           raise self.retry(exc=e, countdown=300, max_retries=2)
       except PermissionError:
           # Permission errors: don't retry
           mark_permanent_failure()
           raise

Stalled Task Detection & Recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:py:mod:`ccat_data_transfer.recovery_service_runner`

The stalled task recovery system operates independently of Celery and handles
cases where tasks are interrupted unexpectedly.

**Components**:

1. **Task Monitor Service** -
   :py:class:`ccat_data_transfer.task_monitor_service.TaskMonitorService`

   - Monitors task heartbeats
   - Detects stalled tasks
   - Initiates recovery procedures
   - Implements circuit breaker pattern

2. **Recovery Service Runner** -
   :py:func:`ccat_data_transfer.recovery_service_runner.run_task_recovery_service`

   - Runs as a standalone process
   - Manages the monitoring loop
   - Handles service lifecycle
   - Sends status notifications

**Recovery Process**:

.. literalinclude:: ../../ccat_data_transfer/recovery_service_runner.py
   :pyobject: run_task_recovery_service
   :language: python

**Task State Tracking**:

1. Tasks send periodic heartbeats to Redis (every 60 seconds)
2. The monitor service checks for stale heartbeats (every 60 seconds by default)
3. Stalled tasks are identified based on the configured timeout (default
   300 seconds); see the sketch at the end of this subsection

**Configuration**:

Recovery service configuration from
:py:class:`ccat_data_transfer.config.config.TaskRecoverySettings`:

.. literalinclude:: ../../ccat_data_transfer/config/config.py
   :pyobject: TaskRecoverySettings
   :language: python

**Recovery Actions**:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._recover_stalled_task
   :language: python
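The detection side of this loop reduces to comparing each task's last
heartbeat against ``heartbeat_timeout``. A minimal sketch of that check,
assuming heartbeats are stored as UNIX timestamps under keys of the form
``task:{task_id}`` (the actual layout is managed by
:py:class:`ccat_data_transfer.task_state_manager.TaskStateManager` and may
differ):

.. code-block:: python

   import time

   import redis

   # Mirrors TaskRecoverySettings.heartbeat_timeout (default: 300 seconds).
   HEARTBEAT_TIMEOUT = 300

   r = redis.Redis(host="localhost", port=6379, decode_responses=True)

   stalled = []
   for key in r.scan_iter("task:*"):
       raw = r.get(key)
       if raw is None:
           continue  # key expired between scan and get
       last_heartbeat = float(raw)  # assumed: value is a UNIX timestamp
       if time.time() - last_heartbeat > HEARTBEAT_TIMEOUT:
           stalled.append(key)

   print(f"{len(stalled)} stalled task(s): {stalled}")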
Circuit Breaker Pattern
~~~~~~~~~~~~~~~~~~~~~~~

The recovery system implements a circuit breaker to prevent infinite retry
loops.

**Circuit Breaker Logic**:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._check_circuit_breaker
   :language: python

**Circuit Breaker States**:

1. **Closed** (Normal): Tasks retry normally when they stall
2. **Open** (Tripped): After ``max_stall_count`` consecutive stalls
   (default: 3), the breaker opens and blocks retries
3. **Automatic Reset**: After ``circuit_breaker_timeout`` (default: 3600
   seconds), the breaker automatically closes and allows retries again

**Manual Reset**:

Administrators can manually reset the circuit breaker:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService.reset_circuit_breaker
   :language: python

**Force Retry**:

To force a retry of a stalled task:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService.force_retry_stalled_task
   :language: python

Service Death Detection
~~~~~~~~~~~~~~~~~~~~~~~

If a manager service dies, the health check detects it:

1. The service stops sending heartbeats
2. The Redis key expires after its TTL (90 seconds)
3. The monitoring system detects the missing service
4. Alerts can be configured through external monitoring tools

**But operations continue**:

* Already-submitted tasks still execute
* Workers continue processing queues
* No data is lost (database state is preserved)
* Restart the manager to resume scheduling new work

Silent Failure Prevention
~~~~~~~~~~~~~~~~~~~~~~~~~

Several mechanisms prevent silent failures:

**Database Constraints**:

* Foreign keys ensure referential integrity
* Check constraints validate state transitions
* Unique constraints prevent duplicates

**Transaction Safety**:

.. code-block:: python

   @app.task
   def operation_with_safety(op_id):
       with session_scope() as session:
           try:
               # Do work
               perform_operation(session, op_id)

               # Update state
               operation = session.query(Operation).get(op_id)
               operation.status = Status.COMPLETED
               session.commit()
           except Exception:
               session.rollback()
               raise

If the task fails, the transaction rolls back, the database is left unchanged,
and the operation is retried.

**State Verification**:

Before proceeding, workers verify the current state:

.. code-block:: python

   def transfer_file(transfer_id):
       transfer = session.query(DataTransfer).get(transfer_id)

       # Verify preconditions
       if transfer.status != Status.PENDING:
           logger.info("Transfer already processed", transfer_id=transfer_id)
           return  # Idempotent

       if not os.path.exists(transfer.source_path):
           raise FileNotFoundError("Source file missing")

       # Proceed with transfer
       # ...

This prevents:

* Processing the same operation twice
* Operating on stale data
* Cascading failures from bad state

Recovery Mechanisms
-------------------

Layer 1: Celery Retry Logic
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Automatic Retries**:

Tasks automatically retry for transient errors through the base class returned
by :py:func:`ccat_data_transfer.setup_celery_app.make_celery_task`. Each task
class that inherits from this base implements custom retry logic through the
``should_retry`` method:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def should_retry(self, exc, operation_id, retry_count):
   :end-before: def on_failure(self, exc, task_id, args, kwargs, einfo):
   :dedent: 12

The base class provides default retry behavior that subclasses can customize
based on their specific operation requirements.

**Custom Retry Logic**:

Each operation type can customize retry behavior by overriding the
``should_retry`` method. The base implementation checks for non-retryable
exceptions and respects retry count limits.
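As an illustration, such an override might look roughly like the following
sketch. ``ExampleTransferTask`` and ``ChecksumMismatchError`` are hypothetical
names, not the actual classes shipped in the package:

.. code-block:: python

   from ccat_data_transfer.setup_celery_app import make_celery_task


   class ChecksumMismatchError(Exception):
       """Placeholder exception for illustration only."""


   class ExampleTransferTask(make_celery_task()):
       """Hypothetical subclass narrowing retry behaviour for transfers."""

       def __init__(self):
           super().__init__()
           self.operation_type = "transfer"

       def should_retry(self, exc, operation_id, retry_count):
           # Permission problems will not fix themselves: never retry.
           if isinstance(exc, PermissionError):
               return False
           # Possible data corruption: allow only a couple of retries.
           if isinstance(exc, ChecksumMismatchError):
               return retry_count < 2
           # Everything else: defer to the base class policy.
           return super().should_retry(exc, operation_id, retry_count)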
**Error Classification**:

The system defines a hierarchy of custom exceptions in
:py:mod:`ccat_data_transfer.exceptions`:

.. literalinclude:: ../../ccat_data_transfer/exceptions.py
   :pyobject: CCATDataOperationError
   :language: python

Each error type specifies:

- Whether it is retryable
- The maximum number of retries allowed
- Operation-specific context

**Recovery Implementation**:

Each pipeline component implements custom recovery logic through two key
methods defined in the base class returned by ``make_celery_task()``:

- ``reset_state_on_failure``: Handles retryable errors by resetting operation
  state for retry
- ``mark_permanent_failure``: Handles non-retryable errors that require
  intervention

Example from ``DataTransferTask``:

.. literalinclude:: ../../ccat_data_transfer/transfer_manager.py
   :pyobject: DataTransferTask.reset_state_on_failure
   :language: python

.. literalinclude:: ../../ccat_data_transfer/transfer_manager.py
   :pyobject: DataTransferTask.mark_permanent_failure
   :language: python

These methods are called automatically by the base task class when errors
occur:

.. literalinclude:: ../../ccat_data_transfer/setup_celery_app.py
   :language: python
   :start-after: def on_failure(self, exc, task_id, args, kwargs, einfo):
   :end-before: def on_success(self, retval, task_id, args, kwargs):

Layer 2: Task State Reset
~~~~~~~~~~~~~~~~~~~~~~~~~

The recovery service handles tasks that stall through operation-specific
handlers.

**Operation-Specific Recovery**:

Each operation type has a recovery handler in
:py:class:`ccat_data_transfer.task_monitor_service.TaskMonitorService`:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :language: python
   :start-after: # Define recovery methods for each operation type
   :end-before: def _check_circuit_breaker
   :dedent: 8

**Example Package Recovery Handler**:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._recover_package
   :language: python

Layer 3: Manager Re-scanning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Managers continuously scan for work:

* Even if a task submission failed, the next manager cycle finds it
* The database is the source of truth for what needs doing
* Missing work is eventually discovered and scheduled

**Example**:

The :py:func:`ccat_data_transfer.raw_data_package_manager.create_raw_data_packages`
function continuously scans for unpackaged files and creates packages:

.. literalinclude:: ../../ccat_data_transfer/raw_data_package_manager.py
   :pyobject: create_raw_data_packages
   :language: python

Even if a previous submission was lost, the next cycle will find and resubmit
the work.

Layer 4: Human Intervention
~~~~~~~~~~~~~~~~~~~~~~~~~~~

When automatic recovery fails, the system alerts operators through
notifications. Operations requiring intervention are visible in ops-db-ui with
clear indicators, and the notification system sends alerts to configured
recipients.

Notification System
-------------------

:py:mod:`ccat_data_transfer.notification_service`

The notification system provides alerts for critical system events.

Notification Service
~~~~~~~~~~~~~~~~~~~~

The notification service is implemented in
:py:class:`ccat_data_transfer.notification_service.NotificationService` and
handles email notifications for critical system events.

Notification Channels
~~~~~~~~~~~~~~~~~~~~~

**Redis Queue-Based System**:

:py:class:`ccat_data_transfer.notification_service.NotificationService`

The notification service processes messages from Redis queues:

.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationService.start
   :language: python

**Email Notifications**:

Email is the primary notification channel:
.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationService._send_email
   :language: python

**Configuration**:

.. code-block:: toml

   [default.SMTP_CONFIG]
   SERVER = "smtp.uni-koeln.de"
   PORT = 25
   USE_TLS = false
   USER = false
   FROM_ADDRESS = "ccat-data-transfer@uni-koeln.de"
   RECIPIENTS = ["ccat-data-transfer@uni-koeln.de"]

**Sending Notifications**:

:py:class:`ccat_data_transfer.notification_service.NotificationClient`

Components send notifications by pushing to the Redis queue:

.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationClient.send_notification
   :language: python

.. note::

   Additional notification channels (Slack, Discord, database logging) are not
   currently implemented. The system only supports email via SMTP and Redis
   pub/sub for real-time updates to ops-db-ui.

Cooldown Management
~~~~~~~~~~~~~~~~~~~

An automatic cooldown prevents notification spam.

**Implementation**:

The :py:class:`ccat_data_transfer.task_monitor_service.TaskMonitorService`
tracks recent notifications and applies a cooldown period (default: 1 hour) to
prevent duplicate alerts:

.. literalinclude:: ../../ccat_data_transfer/task_monitor_service.py
   :pyobject: TaskMonitorService._should_send_notification
   :language: python

**Retry Logic**:

Failed email deliveries are automatically retried with exponential backoff:

.. literalinclude:: ../../ccat_data_transfer/notification_service.py
   :pyobject: NotificationService._process_message
   :language: python

Configuration
-------------

The monitoring and recovery system can be configured through various settings:

**Health Check Settings**:

In :py:class:`ccat_data_transfer.health_check.HealthCheck`:

- ``update_interval``: How often to update health status (default: 30 seconds)
- ``ttl``: How long health status remains valid (default: 90 seconds)

**Task Recovery Settings**:

In :py:class:`ccat_data_transfer.config.config.TaskRecoverySettings`:

- ``heartbeat_timeout``: Time before a task is considered stalled (default: 300 seconds)
- ``max_stall_count``: Maximum stalls before the circuit breaker opens (default: 3)
- ``circuit_breaker_timeout``: Time before the circuit breaker resets (default: 3600 seconds)
- ``LOOP_INTERVAL``: How often to check for stalled tasks (default: 60 seconds)

**Notification Settings**:

- ``notification_cooldown``: Time between duplicate notifications (default: 3600 seconds)
- ``max_retries``: Maximum email retry attempts (default: 5)
- ``retry_delay``: Base delay for exponential backoff (default: 60 seconds)

**Disk Monitoring Settings**:

- ``BUFFER_WARNING_THRESHOLD_PERCENT``: Warning level (default: 70)
- ``BUFFER_CRITICAL_THRESHOLD_PERCENT``: Critical level (default: 85)
- ``BUFFER_EMERGENCY_THRESHOLD_PERCENT``: Emergency level (default: 95)
- ``BUFFER_RECOVERY_THRESHOLD_PERCENT``: Recovery target (default: 60)

Observability Best Practices
----------------------------

Structured Logging
~~~~~~~~~~~~~~~~~~

:py:func:`ccat_data_transfer.logging_utils.get_structured_logger`

All logging uses a structured format:

.. literalinclude:: ../../ccat_data_transfer/logging_utils.py
   :pyobject: get_structured_logger
   :language: python
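In application code a component typically obtains a logger once and attaches
context as keyword arguments on every call. A brief usage sketch, assuming
``get_structured_logger`` takes the logger name (check ``logging_utils.py``
for the exact signature):

.. code-block:: python

   from ccat_data_transfer.logging_utils import get_structured_logger

   logger = get_structured_logger("ccat_data_transfer.transfer_manager")

   # Keyword arguments become fields in the structured log record.
   logger.info(
       "Transfer completed",
       transfer_id=456,
       source_site="ccat",
       dest_site="cologne",
       duration_seconds=120.5,
       throughput_mbps=450.2,
       file_size_bytes=54321098765,
   )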
Output (JSON):

.. code-block:: json

   {
     "timestamp": "2024-11-27T10:30:00.123Z",
     "level": "INFO",
     "logger": "ccat_data_transfer.transfer_manager",
     "message": "Transfer completed",
     "transfer_id": 456,
     "source_site": "ccat",
     "dest_site": "cologne",
     "duration_seconds": 120.5,
     "throughput_mbps": 450.2,
     "file_size_bytes": 54321098765
   }

**Benefits**:

* Machine-parseable for log aggregation
* Easy to query specific fields
* Consistent format across all services
* Rich context for debugging

Correlation IDs
~~~~~~~~~~~~~~~

Track operations across services:

.. code-block:: python

   # Manager creates the operation
   operation = DataTransfer(...)
   session.add(operation)
   session.commit()

   logger.info(
       "Submitting transfer task",
       operation_id=operation.id,  # Correlation ID
       source=source.name,
       destination=dest.name,
   )

   # Worker logs with the same ID
   logger.info(
       "Executing transfer",
       operation_id=operation.id,  # Same ID
       task_id=self.request.id,
   )

   # Later stages reference the same ID
   logger.info(
       "Unpacking transfer",
       operation_id=operation.id,  # Traceable!
   )

Query logs by ``operation_id`` to see the complete lifecycle of an operation.

Error Context
~~~~~~~~~~~~~

Include rich context in error logs:

.. code-block:: python

   try:
       transfer_file(source, dest)
   except Exception as e:
       logger.error(
           "Transfer failed",
           exc_info=e,  # Full traceback
           transfer_id=transfer.id,
           source_path=source_path,
           dest_path=dest_path,
           retry_count=retry_count,
           file_size=file_size,
           network_conditions={
               "latency_ms": latency,
               "packet_loss": packet_loss,
           },
       )

This makes debugging vastly easier.

Troubleshooting Guide
---------------------

Common Issues and Solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Issue**: Task stuck in "IN_PROGRESS" forever

**Diagnosis**:

1. Check if the worker is still running: ``celery inspect active``
2. Check the heartbeat: ``redis-cli GET task:{task_id}``
3. Check the worker logs for errors
4. Verify the recovery service is running

**Solution**:

* The recovery service should detect and reset the task automatically
* Check the circuit breaker state:
  ``redis-cli HGETALL circuit_breaker:{operation_type}:{operation_id}``
* If the circuit breaker is open, wait for the timeout or reset it manually
* If the task is not reset automatically, reset it manually:
  ``UPDATE operation SET status='PENDING' WHERE id=X``
* Restart the worker if it crashed

**Issue**: Transfer failing with network errors

**Diagnosis**:

1. Test the network: ``ping destination_host``
2. Test BBCP manually: ``bbcp source dest``
3. Check firewall rules
4. Examine the transfer logs for error details
5. Check the retry count in the database

**Solution**:

* Transient: automatic retry will handle it
* Persistent: check network configuration and firewalls
* If the circuit breaker is open: investigate the underlying issue before
  resetting
* Workaround: use an alternative route if available

**Issue**: Disk usage alert but deletion not working

**Diagnosis**:

1. Check that the deletion manager is running: verify its health check key exists
2. Check which packages are eligible for deletion with a database query (see
   the sketch below)
3. Check the deletion manager logs
4. Verify the retention policies
5. Check the disk thresholds in the configuration

**Solution**:

* Ensure packages are ARCHIVED before deletion
* Check that retention periods aren't too long
* Verify the threshold configuration matches expectations
* Manually trigger deletion if needed
* You may need to adjust the thresholds
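For diagnosis step 2 above, a direct query against the operations database is
usually the quickest check. A minimal sketch using the project's
``session_scope`` helper; the table and column names here are illustrative
placeholders, so adapt them to the actual schema:

.. code-block:: python

   from sqlalchemy import text

   # Placeholder table/column names: check the real model definitions first.
   with session_scope() as session:
       rows = session.execute(
           text(
               "SELECT id, status, updated_at "
               "FROM raw_data_package "
               "WHERE status = 'ARCHIVED' "
               "ORDER BY updated_at"
           )
       ).fetchall()

   print(f"{len(rows)} package(s) currently eligible for deletion")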
**Issue**: Circuit breaker preventing recovery

**Diagnosis**:

1. Check the circuit breaker state:
   ``redis-cli HGETALL circuit_breaker:{operation_type}:{operation_id}``
2. Review the stall count and the last stall time
3. Examine the logs for the underlying failure cause
4. Verify the operation configuration

**Solution**:

* Wait for the automatic reset after the timeout period
* Fix the underlying issue (network, permissions, configuration)
* Manually reset the circuit breaker if the issue is resolved
* Use ``force_retry_stalled_task`` for an immediate retry

**Issue**: Notification service not sending emails

**Diagnosis**:

1. Check that the notification service is running
2. Verify the SMTP configuration
3. Check the notification queue: ``redis-cli LLEN ccat:notifications:queue``
4. Check the retry queue: ``redis-cli LLEN ccat:notifications:retry:queue``
5. Examine the notification service logs

**Solution**:

* Verify SMTP server accessibility
* Check the FROM_ADDRESS configuration
* Ensure the RECIPIENTS list is valid
* Restart the notification service if needed
* Check for messages stuck in the retry queue

Best Practices
--------------

Error Classification
~~~~~~~~~~~~~~~~~~~~

- Use appropriate error types for different failure scenarios
- Set retryability based on whether the error is transient
- Include relevant context in error messages for debugging

Recovery Implementation
~~~~~~~~~~~~~~~~~~~~~~~

- Implement both recovery methods (``reset_state_on_failure`` and
  ``mark_permanent_failure``)
- Handle database state properly with transaction safety
- Log recovery actions with structured logging
- Ensure operations are idempotent to prevent double-processing

Monitoring
~~~~~~~~~~

- Monitor recovery success rates through metrics
- Track retry counts to identify problematic operations
- Review notification patterns to detect systemic issues
- Set up alerts for high failure rates
- Monitor circuit breaker state for frequently failing operations

Maintenance
~~~~~~~~~~~

- Regularly review error patterns to identify common issues
- Update recovery strategies based on observed failure modes
- Adjust timeouts as needed based on operational experience
- Keep configuration in sync with actual system behavior
- Periodically review and clean up old circuit breaker states

Next Steps
----------

* :doc:`lifecycle` - Detailed deletion policies and retention management
* :doc:`philosophy` - Why monitoring is designed this way
* :doc:`pipeline` - See where monitoring integrates with pipeline stages
* :doc:`/operations/monitoring` - Broader monitoring infrastructure (Grafana,
  Loki, Promtail)