Pipeline Hierarchy#

Documentation verified · Last checked: 2026-03-07 · Reviewer: Christof Buchbender

The Workflow Manager uses a hierarchical model to describe science reduction workflows. Understanding this hierarchy is essential for creating and managing pipelines.

Model Hierarchy#

Pipeline                        (DAG container — "what, where, when")
  ├── DataGrouping              (shared — "what data", filter_rules)
  ├── processing_location_id    (FK → DataLocation — where to run)
  ├── trigger_type              (CONTINUOUS, CRON, MANUAL)
  ├── enabled, priority
  │
  ├── ReductionStep 1           (calibrate)
  │     ├── group_by            ([executed_obs_unit] — per scan)
  │     ├── cooldown_seconds    (60 — fire quickly)
  │     ├── ReductionSoftware   (chai-calibration)
  │     ├── ReductionStepConfig ({calibration_mode: "hot_cold"})
  │     └── version_policy      ("compatible")
  │
  ├── ReductionStep 2           (baseline, depends_on: Step 1)
  │     ├── group_by            ([executed_obs_unit] — per scan)
  │     ├── cooldown_seconds    (60)
  │     └── ReductionSoftware   (chai-baseline)
  │
  └── ReductionStep 3           (grid+map, depends_on: Step 2)
        ├── group_by            ([] — all data combined)
        ├── cooldown_seconds    (86400 — daily)
        └── ReductionSoftware   (chai-gridding)

Pipeline#

The top-level object. A Pipeline is a DAG of ReductionSteps that share a common DataGrouping and processing location.

Key attributes:

  • data_grouping_id — what data to process (FK → DataGrouping)

  • processing_location_id — where to run (FK → DataLocation of type PROCESSING)

  • trigger_type — when to trigger: CONTINUOUS, CRON, or MANUAL

  • trigger_config — JSON for trigger-specific settings (e.g., cron expression)

  • enabled — whether the trigger-manager evaluates this pipeline

  • priority — scheduling priority (higher = more important)

  • created_by_user_id — authority tracking

Database model: ccat_ops_db.models.Pipeline
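
The attributes above can be sketched as a plain dataclass. This is an illustrative model only: the field names follow the documented columns, but it is not the real ORM class in `ccat_ops_db.models`, and the `"cron"` key inside `trigger_config` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of the documented Pipeline columns (not the real ORM model).
@dataclass
class PipelineSketch:
    data_grouping_id: int          # FK -> DataGrouping: what data to process
    processing_location_id: int    # FK -> DataLocation (type PROCESSING): where to run
    trigger_type: str              # "CONTINUOUS", "CRON", or "MANUAL"
    trigger_config: dict = field(default_factory=dict)  # trigger-specific settings
    enabled: bool = True           # evaluated by the trigger-manager when True
    priority: int = 0              # higher = more important
    created_by_user_id: Optional[int] = None  # authority tracking

# A CRON-triggered pipeline carries its schedule in trigger_config:
nightly = PipelineSketch(
    data_grouping_id=1,
    processing_location_id=2,
    trigger_type="CRON",
    trigger_config={"cron": "0 3 * * *"},  # assumed key name for the cron expression
    priority=10,
)
```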

ReductionStep#

One execution unit within a Pipeline. Each step runs a piece of reduction software against a subset of the data, determined by the group_by dimensions.

Key attributes:

  • pipeline_id — belongs to a Pipeline

  • step_order — ordering within the DAG

  • reduction_software_id — which container to run

  • reduction_step_config_id — configuration parameters

  • group_by — JSON list of dimensions for sub-grouping

  • cooldown_seconds — minimum wait after latest input change before firing

  • schedule — optional per-step cron expression

  • max_concurrent_runs — parallelism control

  • pinned_version_id — optional: pin to a specific software version

  • version_policy — "compatible" (use latest) or "exact" (use pinned)

  • max_retries — how many times to retry on failure

Database model: ccat_ops_db.models.ReductionStep
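
To make the group_by semantics concrete, here is an illustrative sketch (not the trigger-manager's actual code) of how a group_by list partitions input records into sub-groups. An empty group_by yields a single sub-group covering all data, which is how Step 3 in the hierarchy above combines everything into one map.

```python
from collections import defaultdict

def sub_groups(records, group_by):
    """Partition records by the group_by dimensions; [] yields one group."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[dim] for dim in group_by)  # () when group_by == []
        groups[key].append(rec)
    return dict(groups)

scans = [
    {"executed_obs_unit": "scan-001", "file": "a.fits"},
    {"executed_obs_unit": "scan-001", "file": "b.fits"},
    {"executed_obs_unit": "scan-002", "file": "c.fits"},
]

per_scan = sub_groups(scans, ["executed_obs_unit"])  # 2 sub-groups, one per scan
combined = sub_groups(scans, [])                     # 1 sub-group with all 3 files
```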

Step Dependencies#

Steps within a Pipeline form a DAG via the reduction_step_dependency association table. Each step can declare upstream dependencies. The trigger-manager enforces that downstream steps only fire after upstream steps complete for the same sub-group scope.
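
The readiness rule can be sketched as follows. The function and data shapes here are illustrative, assuming completed runs are tracked per (step, sub-group) pair; the real trigger-manager logic may differ.

```python
def is_ready(step_id, sub_group, depends_on, completed):
    """True if every upstream step has completed for the same sub-group.

    depends_on: {step_id: [upstream_step_ids]}
    completed:  set of (step_id, sub_group) pairs
    """
    return all((up, sub_group) in completed for up in depends_on.get(step_id, []))

# Mirroring the 3-step pipeline above: 2 depends on 1, 3 depends on 2.
deps = {2: [1], 3: [2]}
completed = {(1, "scan-001")}  # calibrate finished for scan-001 only

is_ready(2, "scan-001", deps, completed)  # True: upstream done for this scan
is_ready(2, "scan-002", deps, completed)  # False: scan-002 not calibrated yet
is_ready(3, "scan-001", deps, completed)  # False: step 2 has not completed
```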

ReductionSoftware#

A registered container image in GHCR. Represents a piece of reduction software as a black box — could be a single tool or an entire reduction pipeline internally.

Key attributes:

  • name — human-readable name (e.g., “chai-calibration”)

  • ghcr_url — GitHub Container Registry URL

  • github_repo_url — link to source code

  • description — what the software does

Database model: ccat_ops_db.models.ReductionSoftware

Versions are tracked separately in ReductionSoftwareVersion, which records the image tag, digest, and is_latest flag.

ReductionStepConfig#

Configuration for a specific step: parameters, environment variables, command template, and resource requirements. No polymorphism — all instrument-specific parameters go in the config_parameters JSON column.

Key attributes:

  • name — descriptive name

  • config_parameters — JSON dict of parameters passed to the container

  • environment_variables — JSON dict of extra env vars

  • command_template — the command to run inside the container

  • resource_requirements — JSON dict (cpu, memory_gb, time_hours, gpu)

Database model: ccat_ops_db.models.ReductionStepConfig

Example resource requirements:

{"cpu": 4, "memory_gb": 16, "time_hours": 2, "gpu": 0}

These are translated to platform-specific flags by the HPC backend:

  • SLURM: --cpus-per-task=4 --mem=16G --time=02:00:00

  • Kubernetes: resource requests/limits

  • Local: informational only
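
A minimal sketch of the SLURM translation, matching the example flags above. The exact mapping lives in the HPC backend; this function is illustrative, and the `--gres` handling for GPUs is an assumption.

```python
def slurm_flags(req):
    """Translate a resource_requirements dict into SLURM sbatch flags."""
    hours = int(req["time_hours"])
    minutes = int(round((req["time_hours"] - hours) * 60))
    flags = [
        f"--cpus-per-task={req['cpu']}",
        f"--mem={req['memory_gb']}G",
        f"--time={hours:02d}:{minutes:02d}:00",
    ]
    if req.get("gpu", 0):
        flags.append(f"--gres=gpu:{req['gpu']}")  # assumed GPU flag convention
    return flags

slurm_flags({"cpu": 4, "memory_gb": 16, "time_hours": 2, "gpu": 0})
# → ['--cpus-per-task=4', '--mem=16G', '--time=02:00:00']
```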

ExecutedReductionStep#

A concrete run of a ReductionStep for a specific sub-group. Created by the trigger-manager, progressed through statuses by the workflow-manager and result-manager.

Key attributes:

  • reduction_step_id — which step this is a run of

  • reduction_software_version_id — exact version used

  • status — current lifecycle state (see Execution Flow)

  • sub_group_key — identifies the data subset (e.g., source=NGC253|line=CO43)

  • hpc_job_id — backend-specific job identifier

  • execution_command — full Apptainer command (stored for reproducibility)

  • processing_time_s, peak_memory_gb, cpu_hours — execution metrics

Database model: ccat_ops_db.models.ExecutedReductionStep
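
A sub_group_key like `source=NGC253|line=CO43` could be built from the resolved group_by dimensions as sketched below. The actual encoding is whatever ExecutedReductionStep stores; this sketch just reproduces the documented example format.

```python
def make_sub_group_key(dims: dict) -> str:
    """Serialize resolved group_by dimensions into a key string."""
    return "|".join(f"{k}={v}" for k, v in dims.items())

make_sub_group_key({"source": "NGC253", "line": "CO43"})
# → 'source=NGC253|line=CO43'
```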

DataProduct#

An output file produced by a run. Classified by type based on the output directory convention (see Container Contract).

Types: SCIENCE, INTERMEDIATE, QA_METRIC, PLOT, STATISTICS, LOG
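
The directory-based classification might look like the sketch below. The actual directory names are defined by the Container Contract, which this page does not spell out; the mapping and the fallback type here are assumptions for illustration only.

```python
from pathlib import PurePosixPath

# Assumed directory -> type mapping; the real names come from the Container Contract.
TYPE_BY_DIR = {
    "science": "SCIENCE",
    "intermediate": "INTERMEDIATE",
    "qa": "QA_METRIC",
    "plots": "PLOT",
    "statistics": "STATISTICS",
    "logs": "LOG",
}

def classify(relative_path: str) -> str:
    """Classify an output file by its top-level output directory."""
    top = PurePosixPath(relative_path).parts[0]
    return TYPE_BY_DIR.get(top, "INTERMEDIATE")  # assumed fallback
```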

Key attributes:

  • executed_reduction_step_id — which run produced this

  • data_product_type — classification

  • file_path — path to the file on disk

  • file_size — in bytes

  • checksum — integrity verification

  • metadata — JSON from optional sidecar files

Provenance lineage is tracked via the data_product_lineage association table, linking each DataProduct back to the RawDataPackages that contributed to it.

Database model: ccat_ops_db.models.DataProduct

Version Management#

Two modes for software version binding:

Compatible (default)

Use the latest version of the ReductionSoftware. Intermediates from older versions are still valid. Non-breaking version bumps don’t trigger re-processing.

Exact

Use exactly the pinned version. Only intermediates from this exact version count as valid. Changing the pinned version invalidates old intermediates, causing the gap evaluation to trigger full re-processing.
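
The two binding modes can be sketched as a small resolver. The function and record shapes are illustrative (not the workflow-manager's real API), using the `is_latest` flag that ReductionSoftwareVersion is documented to track.

```python
def resolve_version(versions, policy, pinned_version_id=None):
    """Pick a software version per the step's version_policy.

    versions: list of dicts with 'id', 'tag', and 'is_latest'
              (as tracked in ReductionSoftwareVersion).
    """
    if policy == "exact":
        # Only the pinned version is acceptable.
        return next(v for v in versions if v["id"] == pinned_version_id)
    # "compatible": use the version flagged as latest.
    return next(v for v in versions if v["is_latest"])

versions = [
    {"id": 1, "tag": "v1.0", "is_latest": False},
    {"id": 2, "tag": "v1.1", "is_latest": True},
]
resolve_version(versions, "compatible")["tag"]                  # 'v1.1'
resolve_version(versions, "exact", pinned_version_id=1)["tag"]  # 'v1.0'
```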