Overview

Documentation verified. Last checked: 2026-03-07. Reviewer: Christof Buchbender.

The Workflow Manager is the science pipeline orchestration system for the CCAT Data Center. It takes archived raw observation data, runs reduction software against it on HPC infrastructure, and produces calibrated data products with full provenance tracking.

What Problem Does It Solve?

After the data-transfer system archives raw observations, scientists need to:

  1. Select which data to process (by source, instrument, observation configuration, etc.)

  2. Run calibration, baseline subtraction, gridding, and map-making software

  3. Track which raw data produced which outputs (provenance)

  4. Re-process when new data arrives or software improves

  5. Manage intermediate products and disk space

The Workflow Manager automates all of this. Scientists define what to process and how to process it. The system handles the when and where automatically.

Key Abstractions

The system is built around a small number of core concepts:

Pipeline

A DAG (directed acyclic graph) of reduction steps. The top-level object that defines what data to process, where to process it, and when to trigger. See Pipeline Hierarchy.

ReductionStep

One execution unit within a Pipeline. Ties together a piece of reduction software with configuration and scheduling parameters. Steps are ordered and can depend on upstream steps.

DataGrouping

Defines “what data” to process using declarative filter rules. Shared across pipelines — different pipelines can use the same grouping with different execution granularity. See Data Grouping & Filter Engine.

ExecutedReductionStep

A concrete run of a ReductionStep for a specific sub-group of data. Tracks status, HPC job ID, execution metrics, and links to input/output data products.

DataProduct

An output file produced by a run. Carries type classification (SCIENCE, QA, PLOT, etc.), checksums, quality metrics, and provenance lineage back to raw data.
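The relationships between these abstractions can be sketched as data classes. This is a minimal illustration of the hierarchy described above, not the actual schema; all field names beyond the concept names themselves are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProductType(Enum):
    """Type classification carried by a DataProduct."""
    SCIENCE = "SCIENCE"
    QA = "QA"
    PLOT = "PLOT"


@dataclass
class DataGrouping:
    """Declarative 'what data' selection, shareable across pipelines."""
    name: str
    filters: dict  # e.g. {"instrument": "CHAI", "state": "archived"}


@dataclass
class ReductionStep:
    """One execution unit: software + config + ordering."""
    name: str
    software: str
    config: dict
    depends_on: list = field(default_factory=list)  # upstream step names


@dataclass
class Pipeline:
    """Top-level DAG: what to process, how, and when to trigger."""
    name: str
    grouping: DataGrouping
    steps: list          # ordered ReductionSteps
    trigger: str         # "CONTINUOUS" or "MANUAL"
```

A map-making step that depends on calibration would then list `"calibrate"` in its `depends_on`, giving the DAG edge the Pipeline definition implies.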

Two Operational Modes

Standing Pipelines — always-on, autonomous, incremental:

  • Broad DataGrouping (e.g., “all CHAI data, state=archived”)

  • Trigger: CONTINUOUS — system continuously evaluates gaps per step

  • Per-step cooldowns: calibration steps fire frequently; map-making fires at most daily

  • Intermediate products cached in workspace; purged under disk pressure

Ad-hoc Pipelines — scientist one-offs:

  • Specific DataGrouping (e.g., “NGC253, CO43, specific obs dates”)

  • Trigger: MANUAL

  • Custom config tweaks per ReductionStep

  • Run once, done

How Data Flows

```mermaid
sequenceDiagram
    participant Sci as Scientist
    participant API as ops-db-api
    participant TM as trigger-manager
    participant WM as workflow-manager
    participant HPC as HPC Backend
    participant RM as result-manager
    participant DB as ops-db

    Sci->>API: Create Pipeline + Steps
    Note over TM: Poll loop
    TM->>DB: Resolve DataGrouping
    TM->>DB: Evaluate gap per step
    TM->>DB: Create ExecutedReductionStep (PENDING)
    Note over WM: Poll loop
    WM->>DB: Pick up PENDING runs
    WM->>HPC: Submit Apptainer job
    WM->>DB: Update status (SUBMITTED)
    Note over RM: Poll loop
    RM->>HPC: Check job status
    RM->>DB: Scan output directory
    RM->>DB: Create DataProduct records
    RM->>DB: Update status (COMPLETED)
```
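The three poll loops coordinate purely through the run's status field in ops-db: trigger-manager creates PENDING runs, workflow-manager advances them to SUBMITTED, and result-manager finishes them as COMPLETED. A minimal sketch of that handoff, with the transition table and `advance` helper as illustrative assumptions:

```python
# Each service only acts on runs in the status it "owns", so the three
# independent poll loops never race each other on the same run.
TRANSITIONS = {
    "trigger-manager": (None, "PENDING"),        # creates the run
    "workflow-manager": ("PENDING", "SUBMITTED"),
    "result-manager": ("SUBMITTED", "COMPLETED"),
}


def advance(run: dict, actor: str) -> dict:
    """Apply the status transition a given service is allowed to make."""
    expected, new_status = TRANSITIONS[actor]
    if run.get("status") != expected:
        raise ValueError(f"{actor} cannot act on status {run.get('status')!r}")
    return {**run, "status": new_status}
```

In the real system each service discovers its work by polling ops-db for runs in its input status, rather than being handed the run directly.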