# Overview

```{eval-rst}
.. verified:: 2026-03-07
   :reviewer: Christof Buchbender
```

The Workflow Manager is the science pipeline orchestration system for the CCAT Data Center. It takes archived raw observation data, runs reduction software against it on HPC infrastructure, and produces calibrated data products with full provenance tracking.

## What Problem Does It Solve?

After the data-transfer system archives raw observations, scientists need to:

1. Select which data to process (by source, instrument, observation configuration, etc.)
2. Run calibration, baseline subtraction, gridding, and map-making software
3. Track which raw data produced which outputs (provenance)
4. Re-process when new data arrives or software improves
5. Manage intermediate products and disk space

The Workflow Manager automates all of this. Scientists define **what** to process and **how** to process it. The system handles the **when** and **where** automatically.

## Key Abstractions

The system is built around a small number of core concepts:

**Pipeline**
: A DAG (directed acyclic graph) of reduction steps. The top-level object that defines what data to process, where to process it, and when to trigger. See {doc}`pipeline_hierarchy`.

**ReductionStep**
: One execution unit within a Pipeline. Ties together a piece of reduction software with configuration and scheduling parameters. Steps are ordered and can depend on upstream steps.

**DataGrouping**
: Defines "what data" to process using declarative filter rules. Shared across pipelines --- different pipelines can use the same grouping with different execution granularity. See {doc}`data_grouping`.

**ExecutedReductionStep**
: A concrete run of a ReductionStep for a specific sub-group of data. Tracks status, HPC job ID, execution metrics, and links to input/output data products.

**DataProduct**
: An output file produced by a run.
Carries type classification (SCIENCE, QA, PLOT, etc.), checksums, quality metrics, and provenance lineage back to raw data.

## Two Operational Modes

**Standing Pipelines** --- always-on, autonomous, incremental:

- Broad DataGrouping (e.g., "all CHAI data, state=archived")
- Trigger: CONTINUOUS --- system continuously evaluates gaps per step
- Per-step cooldown: calibration fires fast, map-making fires daily
- Intermediate products cached in workspace; purged under disk pressure

**Ad-hoc Pipelines** --- scientist one-offs:

- Specific DataGrouping (e.g., "NGC253, CO43, specific obs dates")
- Trigger: MANUAL
- Custom config tweaks per ReductionStep
- Run once, done

## How Data Flows

```{eval-rst}
.. mermaid::

   sequenceDiagram
       participant Sci as Scientist
       participant API as ops-db-api
       participant TM as trigger-manager
       participant WM as workflow-manager
       participant HPC as HPC Backend
       participant RM as result-manager
       participant DB as ops-db

       Sci->>API: Create Pipeline + Steps
       Note over TM: Poll loop
       TM->>DB: Resolve DataGrouping
       TM->>DB: Evaluate gap per step
       TM->>DB: Create ExecutedReductionStep (PENDING)
       Note over WM: Poll loop
       WM->>DB: Pick up PENDING runs
       WM->>HPC: Submit Apptainer job
       WM->>DB: Update status (SUBMITTED)
       Note over RM: Poll loop
       RM->>HPC: Check job status
       RM->>DB: Scan output directory
       RM->>DB: Create DataProduct records
       RM->>DB: Update status (COMPLETED)
```

## Related Documentation

- {doc}`pipeline_hierarchy` - Pipeline, ReductionStep, and model relationships
- {doc}`execution_flow` - Status lifecycle and the three manager processes
- {doc}`data_grouping` - Filter engine and sub-group resolution
- {doc}`container_contract` - How to build pipeline containers
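To make the relationships between the core abstractions concrete, the following is a minimal, purely illustrative Python sketch. The class shapes, field names, and container-image references are assumptions for illustration only --- they are not the actual ops-db models or API; see {doc}`pipeline_hierarchy` for the real definitions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Trigger(Enum):
    CONTINUOUS = "continuous"  # standing pipelines: system evaluates gaps per step
    MANUAL = "manual"          # ad-hoc pipelines: run once on request

@dataclass
class DataGrouping:
    """Declarative filter rules selecting which archived data to process."""
    filters: dict  # e.g. {"instrument": "CHAI", "state": "archived"}

@dataclass
class ReductionStep:
    """One execution unit: reduction software plus config and scheduling."""
    name: str
    software: str                      # container image reference
    config: dict = field(default_factory=dict)
    depends_on: Optional[str] = None   # upstream step name, forming the DAG

@dataclass
class Pipeline:
    """Top-level object: what data, which steps, and when to trigger."""
    name: str
    grouping: DataGrouping
    trigger: Trigger
    steps: list[ReductionStep] = field(default_factory=list)

# An ad-hoc pipeline for a single source (all names hypothetical):
adhoc = Pipeline(
    name="ngc253-co43",
    grouping=DataGrouping(filters={"source": "NGC253", "line": "CO43"}),
    trigger=Trigger.MANUAL,
    steps=[
        ReductionStep("calibrate", "chai-calib:1.2"),
        ReductionStep("grid", "chai-grid:1.0", depends_on="calibrate"),
    ],
)
```

The same DataGrouping could be reused by a standing pipeline with `Trigger.CONTINUOUS`; only the trigger and per-step configuration would differ.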
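The trigger-manager's "evaluate gap per step" stage in the sequence above can be sketched as a set difference: for each step, find sub-groups of the resolved DataGrouping that have no run yet and create PENDING ExecutedReductionStep records for them. The function and field names below are hypothetical, chosen only to mirror the terminology of this page; cooldown handling and failure retries are omitted (see {doc}`execution_flow`).

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    SUBMITTED = "submitted"
    COMPLETED = "completed"

@dataclass
class ExecutedReductionStep:
    """A concrete run of one step for one sub-group of data."""
    step_name: str
    subgroup_id: str
    status: Status = Status.PENDING

def evaluate_gaps(step_name: str,
                  subgroups: list[str],
                  existing_runs: list[ExecutedReductionStep]
                  ) -> list[ExecutedReductionStep]:
    """Create PENDING runs for sub-groups not yet covered by this step."""
    covered = {r.subgroup_id for r in existing_runs if r.step_name == step_name}
    return [ExecutedReductionStep(step_name, sg)
            for sg in subgroups if sg not in covered]
```

In a standing pipeline this evaluation repeats on every poll of the trigger-manager, which is what makes processing incremental: newly archived sub-groups show up as gaps and are picked up automatically.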