# Execution Flow

```{eval-rst}
.. verified:: 2026-03-07
   :reviewer: Christof Buchbender
```

This page describes the lifecycle of a pipeline run, from trigger evaluation through completion. Three independent manager processes drive runs through their statuses.

## Run Status Lifecycle

```{eval-rst}
.. mermaid::

   stateDiagram-v2
       [*] --> PENDING: trigger-manager creates run
       PENDING --> STAGING_DATA: workflow-manager picks up
       STAGING_DATA --> SUBMITTED: job dispatched to HPC
       SUBMITTED --> RUNNING: HPC confirms execution started
       RUNNING --> COLLECTING_RESULTS: job completed
       COLLECTING_RESULTS --> COMPLETED: outputs registered
       RUNNING --> FAILED: job failed
       COLLECTING_RESULTS --> FAILED: output collection failed
       FAILED --> PENDING: retry (if retries remain)
       PENDING --> CANCELLED: user cancellation
       SUBMITTED --> CANCELLED: user cancellation
       RUNNING --> CANCELLED: user cancellation
```

**Status descriptions:**

`PENDING`
: Run created by trigger-manager. Waiting for workflow-manager to pick it up.

`STAGING_DATA`
: Workflow-manager is preparing input data. Creates staging job, builds execution command, writes input manifest.

`SUBMITTED`
: Job has been dispatched to the HPC backend. The `hpc_job_id` is recorded.

`RUNNING`
: HPC backend confirms the job is executing. Transition detected by result-manager polling the backend.

`COLLECTING_RESULTS`
: Job completed successfully. Result-manager is scanning output directories and creating DataProduct records.

`COMPLETED`
: All outputs registered, provenance linked, metrics collected. Terminal state.

`FAILED`
: Job failed or output collection failed. If `retry_count < max_retries`, the result-manager resets the run to `PENDING` for automatic retry.

`CANCELLED`
: User-initiated cancellation via API. Terminal state. If the job is running on HPC, a best-effort cancellation is attempted.

## The Three Managers

Each manager is an independent Python process running a poll-dispatch-sleep loop.

```{eval-rst}
.. mermaid::

   graph LR
       subgraph TM["trigger-manager"]
           T1["Evaluate enabled Pipelines"]
           T2["Resolve DataGrouping"]
           T3["Compute gap per step"]
           T4["Create PENDING runs"]
           T1 --> T2 --> T3 --> T4
       end
       subgraph WM["workflow-manager"]
           W1["Pick up PENDING runs"]
           W2["Build Apptainer command"]
           W3["Write manifest.json"]
           W4["Submit to HPC backend"]
           W1 --> W2 --> W3 --> W4
       end
       subgraph RM["result-manager"]
           R1["Poll SUBMITTED/RUNNING"]
           R2["Check HPC job status"]
           R3["Scan output directories"]
           R4["Create DataProducts"]
           R1 --> R2 --> R3 --> R4
       end
       TM -->|PENDING| WM -->|SUBMITTED| RM
       style TM fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
       style WM fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
       style RM fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```

### Trigger Manager

The trigger-manager evaluates enabled Pipelines on each poll cycle:

1. Query all enabled Pipelines for the local `PROCESSING_LOCATION_ID`
2. For each Pipeline, iterate ReductionSteps in `step_order`
3. For each step, resolve the DataGrouping into sub-groups via the filter engine
4. Compute the **gap**: which sub-groups have unprocessed data?
5. Check **cooldown**: has enough time passed since the latest input change?
6. Check **DAG dependencies**: are upstream steps completed?
7. If all checks pass, create an `ExecutedReductionStep` with status `PENDING`

**Gap evaluation:**

- **First step** (raw data input): compare RawDataPackages matched by the DataGrouping against existing successful runs. Packages without a completed run are the gap.
- **Subsequent steps** (intermediate input): check for new DataProducts from the upstream step's workspace that haven't been consumed by a successful run of this step.

**Cooldown check:** After finding a non-empty gap, verify that `cooldown_seconds` has elapsed since the most recent input change. If not, skip this cycle.

### Workflow Manager

The workflow-manager picks up PENDING runs and dispatches them to HPC:

1. Query `ExecutedReductionStep` records with status `PENDING`
2. For each run:
   1. Determine the software version (pinned or latest)
   2. Create the directory structure on disk
   3. Build the input manifest (`manifest.json`)
   4. Build the full `apptainer exec` command via the CommandBuilder
   5. Store the `execution_command` on the run record
   6. Submit to the HPC backend (Kubernetes, SLURM, or Local)
   7. Record the `hpc_job_id` and set status to `SUBMITTED`

### Result Manager

The result-manager monitors running jobs and collects outputs:

1. Query runs with status `SUBMITTED` or `RUNNING`
2. For each run, check the HPC backend for job status
3. On completion:
   1. Set status to `COLLECTING_RESULTS`
   2. Scan output directories using convention-based discovery
   3. Create `DataProduct` records with type, checksum, and size
   4. Load optional `.metadata.json` sidecar files
   5. Link provenance (DataProduct → RawDataPackages)
   6. Collect execution metrics (`processing_time_s`, `peak_memory_gb`, `cpu_hours`)
   7. Set status to `COMPLETED`
4. On failure:
   1. Capture logs from the HPC backend
   2. If `retry_count < max_retries`: reset to `PENDING`
   3. Otherwise: set status to `FAILED` with error message

## Trigger Types

| Type | Behavior |
|------|----------|
| `CONTINUOUS` | Evaluate gap every poll cycle. Fire when unprocessed data exists and cooldown has elapsed. Standing pipeline mode. |
| `CRON` | Evaluate gap on a cron schedule (from `trigger_config` or per-step `schedule`). Same gap logic, coarser timing. |
| `MANUAL` | Never auto-fires. Scientist triggers via API (`POST /pipelines/{id}/trigger`). |

`CONTINUOUS` subsumes the earlier `ON_NEW_DATA` and `DAILY` modes --- the difference is just cooldown configuration (60 seconds vs. 86400 seconds).
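The poll-dispatch-sleep loop shared by all three managers can be sketched in a few lines. This is an illustrative sketch only; `poll_once`, `poll_interval_s`, and `max_cycles` are hypothetical names, and the real manager processes presumably also handle signals, logging, and error recovery.

```python
import time
from typing import Callable, Optional


def manager_loop(poll_once: Callable[[], None],
                 poll_interval_s: float,
                 max_cycles: Optional[int] = None) -> None:
    """Generic poll-dispatch-sleep loop (sketch).

    Each manager runs one of these: trigger-manager evaluates pipelines,
    workflow-manager dispatches PENDING runs, result-manager polls jobs.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        poll_once()                  # one poll-and-dispatch cycle
        time.sleep(poll_interval_s)  # then sleep until the next cycle
        cycles += 1
```

In production such a loop would run with `max_cycles=None` under a process supervisor; the bounded variant here exists only to make the sketch testable.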
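The trigger-manager's gap and cooldown checks described above can be sketched as set arithmetic plus a timestamp comparison. This is a sketch under assumptions, not the actual implementation: `matched_packages`, `completed_run_inputs`, and `last_input_change` are hypothetical stand-ins for whatever the filter engine and database queries return.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Set


def evaluate_gap(matched_packages: Set[str],
                 completed_run_inputs: Set[str]) -> Set[str]:
    """First-step gap: packages matched by the DataGrouping that
    have no completed run yet."""
    return matched_packages - completed_run_inputs


def cooldown_elapsed(last_input_change: datetime,
                     cooldown_seconds: int,
                     now: Optional[datetime] = None) -> bool:
    """True once cooldown_seconds have passed since the latest input change."""
    now = now or datetime.now(timezone.utc)
    return now - last_input_change >= timedelta(seconds=cooldown_seconds)


def should_fire(matched: Set[str], completed: Set[str],
                last_change: datetime, cooldown_s: int,
                now: Optional[datetime] = None) -> bool:
    """A step fires only when the gap is non-empty AND the cooldown elapsed."""
    gap = evaluate_gap(matched, completed)
    return bool(gap) and cooldown_elapsed(last_change, cooldown_s, now)
```

Under this sketch, the only difference between the old `ON_NEW_DATA` and `DAILY` modes is the `cooldown_s` value passed in (60 vs. 86400), matching the `CONTINUOUS` consolidation described above.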
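The result-manager's failure handling (immediate reset to `PENDING` until retries are exhausted) reduces to a small decision. The sketch below assumes a simplified `Run` dataclass standing in for the real `ExecutedReductionStep` record; the field and function names are illustrative, not the actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """Illustrative stand-in for an ExecutedReductionStep record."""
    status: str = "RUNNING"
    retry_count: int = 0
    max_retries: int = 3  # default per ReductionStep
    error_message: Optional[str] = None


def handle_failure(run: Run, error: str) -> Run:
    """Immediate retry (no exponential backoff) until retries run out."""
    if run.retry_count < run.max_retries:
        run.retry_count += 1
        run.status = "PENDING"  # workflow-manager will pick it up again
    else:
        run.status = "FAILED"
        run.error_message = error
    return run
```

With the default of 3, a run is retried three times and only the fourth failure is terminal; a manual `POST /pipelines/runs/{id}/retry` can still resubmit it afterwards.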
## Error Handling & Retry

- `max_retries` per ReductionStep (default 3)
- Immediate retry (no exponential backoff --- compute jobs fail for deterministic reasons)
- After exhausting retries: status `FAILED`
- Standing pipelines: failed sub-groups don't block other sub-groups
- Failed runs visible in dashboard for manual intervention
- Retry can also be triggered manually via `POST /pipelines/runs/{id}/retry`

## Related Documentation

- {doc}`pipeline_hierarchy` - Model relationships
- {doc}`container_contract` - What containers read and write
- {doc}`/source/architecture/manager_worker` - Manager/worker design pattern
- {doc}`/source/architecture/hpc_backends` - HPC backend implementations