# Overview

```{eval-rst}
.. verified:: 2026-03-07
   :reviewer: Christof Buchbender
```

The Workflow Manager is the science pipeline orchestration system for the CCAT Data Center. It takes archived raw observation data, runs reduction software against it on HPC infrastructure, and produces calibrated data products with full provenance tracking.

## What Problem Does It Solve?

After the data-transfer system archives raw observations, scientists need to:

1. Select which data to process (by source, instrument, observation configuration, etc.)
2. Run calibration, baseline subtraction, gridding, and map-making software
3. Track which raw data produced which outputs (provenance)
4. Re-process when new data arrives or software improves
5. Manage intermediate products and disk space

The Workflow Manager automates all of this. Scientists define **what** to process and **how** to process it. The system handles the **when** and **where** automatically.

## Key Abstractions

The system is built around a small number of core concepts:

**Pipeline**
: A DAG (directed acyclic graph) of reduction steps. The top-level object that defines what data to process, where to process it, and when to trigger. See {doc}`pipeline_hierarchy`.

**ReductionStep**
: One execution unit within a Pipeline. Ties together a piece of reduction software with configuration and scheduling parameters. Steps are ordered and can depend on upstream steps.

**DataGrouping**
: Defines "what data" to process using declarative filter rules. Shared across pipelines --- different pipelines can use the same grouping with different execution granularity. See {doc}`data_grouping`.

**ExecutedReductionStep**
: A concrete run of a ReductionStep for a specific sub-group of data. Tracks status, HPC job ID, execution metrics, and links to input/output data products.

**DataProduct**
: An output file produced by a run.
Carries type classification (SCIENCE, QA, PLOT, etc.), checksums, quality metrics, and provenance lineage back to raw data.

## Two Operational Modes

**Standing Pipelines** --- always-on, autonomous, incremental:

- Broad DataGrouping (e.g., "all CHAI data, state=archived")
- Trigger: CONTINUOUS --- system continuously evaluates gaps per step
- Per-step cooldown: calibration fires fast, map-making fires daily
- Intermediate products cached in workspace; purged under disk pressure

**Ad-hoc Pipelines** --- scientist one-offs:

- Specific DataGrouping (e.g., "NGC253, CO43, specific obs dates")
- Trigger: MANUAL
- Custom config tweaks per ReductionStep
- Run once, done

## How Data Flows

```{eval-rst}
.. mermaid::

   sequenceDiagram
       participant Sci as Scientist
       participant API as ops-db-api
       participant TM as trigger-manager
       participant WM as workflow-manager
       participant HPC as HPC Backend
       participant RM as result-manager
       participant DB as ops-db

       Sci->>API: Create Pipeline + Steps
       Note over TM: Poll loop
       TM->>DB: Resolve DataGrouping
       TM->>DB: Evaluate gap per step
       TM->>DB: Create ExecutedReductionStep (PENDING)
       Note over WM: Poll loop
       WM->>DB: Pick up PENDING runs
       WM->>HPC: Submit Apptainer job
       WM->>DB: Update status (SUBMITTED)
       Note over RM: Poll loop
       RM->>HPC: Check job status
       RM->>DB: Scan output directory
       RM->>DB: Create DataProduct records
       RM->>DB: Update status (COMPLETED)
```

## Related Documentation

- {doc}`pipeline_hierarchy` - Pipeline, ReductionStep, and model relationships
- {doc}`execution_flow` - Status lifecycle and the three manager processes
- {doc}`data_grouping` - Filter engine and sub-group resolution
- {doc}`container_contract` - How to build pipeline containers
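To make the relationships between the core abstractions concrete, the following is a minimal, purely illustrative Python sketch. The class shapes, field names, and container-image references are assumptions for illustration only --- they are not the actual ops-db models or API; see {doc}`pipeline_hierarchy` for the real definitions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Trigger(Enum):
    CONTINUOUS = "continuous"  # standing pipelines: system evaluates gaps per step
    MANUAL = "manual"          # ad-hoc pipelines: run once on request

@dataclass
class DataGrouping:
    """Declarative filter rules selecting which archived data to process."""
    filters: dict  # e.g. {"instrument": "CHAI", "state": "archived"}

@dataclass
class ReductionStep:
    """One execution unit: reduction software plus config and scheduling."""
    name: str
    software: str                      # container image reference
    config: dict = field(default_factory=dict)
    depends_on: Optional[str] = None   # upstream step name, forming the DAG

@dataclass
class Pipeline:
    """Top-level object: what data, which steps, and when to trigger."""
    name: str
    grouping: DataGrouping
    trigger: Trigger
    steps: list[ReductionStep] = field(default_factory=list)

# An ad-hoc pipeline for a single source (all names hypothetical):
adhoc = Pipeline(
    name="ngc253-co43",
    grouping=DataGrouping(filters={"source": "NGC253", "line": "CO43"}),
    trigger=Trigger.MANUAL,
    steps=[
        ReductionStep("calibrate", "chai-calib:1.2"),
        ReductionStep("grid", "chai-grid:1.0", depends_on="calibrate"),
    ],
)
```

The same DataGrouping could be reused by a standing pipeline with `Trigger.CONTINUOUS`; only the trigger and per-step configuration would differ.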
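The trigger-manager's "evaluate gap per step" stage in the sequence above can be sketched as a set difference: for each step, find sub-groups of the resolved DataGrouping that have no run yet and create PENDING ExecutedReductionStep records for them. The function and field names below are hypothetical, chosen only to mirror the terminology of this page; cooldown handling and failure retries are omitted (see {doc}`execution_flow`).

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    SUBMITTED = "submitted"
    COMPLETED = "completed"

@dataclass
class ExecutedReductionStep:
    """A concrete run of one step for one sub-group of data."""
    step_name: str
    subgroup_id: str
    status: Status = Status.PENDING

def evaluate_gaps(step_name: str,
                  subgroups: list[str],
                  existing_runs: list[ExecutedReductionStep]
                  ) -> list[ExecutedReductionStep]:
    """Create PENDING runs for sub-groups not yet covered by this step."""
    covered = {r.subgroup_id for r in existing_runs if r.step_name == step_name}
    return [ExecutedReductionStep(step_name, sg)
            for sg in subgroups if sg not in covered]
```

In a standing pipeline this evaluation repeats on every poll of the trigger-manager, which is what makes processing incremental: newly archived sub-groups show up as gaps and are picked up automatically.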