# Pipeline Hierarchy
The Workflow Manager uses a hierarchical model to describe science reduction workflows. Understanding this hierarchy is essential for creating and managing pipelines.
## Model Hierarchy
```
Pipeline (DAG container — "what, where, when")
├── DataGrouping (shared — "what data", filter_rules)
├── processing_location_id (FK → DataLocation — where to run)
├── trigger_type (CONTINUOUS, CRON, MANUAL)
├── enabled, priority
│
├── ReductionStep 1 (calibrate)
│   ├── group_by ([executed_obs_unit] — per scan)
│   ├── cooldown_seconds (60 — fire quickly)
│   ├── ReductionSoftware (chai-calibration)
│   ├── ReductionStepConfig ({calibration_mode: "hot_cold"})
│   └── version_policy ("compatible")
│
├── ReductionStep 2 (baseline, depends_on: Step 1)
│   ├── group_by ([executed_obs_unit] — per scan)
│   ├── cooldown_seconds (60)
│   └── ReductionSoftware (chai-baseline)
│
└── ReductionStep 3 (grid+map, depends_on: Step 2)
    ├── group_by ([] — all data combined)
    ├── cooldown_seconds (86400 — daily)
    └── ReductionSoftware (chai-gridding)
```
## Pipeline
The top-level object. A Pipeline is a DAG of ReductionSteps that share a common DataGrouping and processing location.
Key attributes:
- `data_grouping_id` — what data to process (FK → DataGrouping)
- `processing_location_id` — where to run (FK → DataLocation of type PROCESSING)
- `trigger_type` — when to trigger: CONTINUOUS, CRON, or MANUAL
- `trigger_config` — JSON for trigger-specific settings (e.g., cron expression)
- `enabled` — whether the trigger-manager evaluates this pipeline
- `priority` — scheduling priority (higher = more important)
- `created_by_user_id` — authority tracking

Database model: `ccat_ops_db.models.Pipeline`
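To make the shape of these attributes concrete, here is a minimal sketch as a plain dataclass. It is an illustrative stand-in for the ORM model, not the real `ccat_ops_db.models.Pipeline`, and the `cron` key inside `trigger_config` is an assumed example rather than a documented schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TriggerType(Enum):
    CONTINUOUS = "CONTINUOUS"
    CRON = "CRON"
    MANUAL = "MANUAL"


@dataclass
class PipelineSketch:
    """Illustrative stand-in for the Pipeline model (not the actual ORM class)."""
    data_grouping_id: int          # FK -> DataGrouping: what data to process
    processing_location_id: int    # FK -> DataLocation of type PROCESSING: where to run
    trigger_type: TriggerType      # when to trigger
    trigger_config: dict = field(default_factory=dict)  # trigger-specific JSON settings
    enabled: bool = True           # whether the trigger-manager evaluates this pipeline
    priority: int = 0              # higher = more important
    created_by_user_id: Optional[int] = None


# A CRON-triggered pipeline; the cron expression and config key are made-up examples.
nightly = PipelineSketch(
    data_grouping_id=1,
    processing_location_id=2,
    trigger_type=TriggerType.CRON,
    trigger_config={"cron": "0 3 * * *"},
    priority=10,
)
```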
## ReductionStep
One execution unit within a Pipeline. Each step runs a piece of reduction software
against a subset of the data, determined by the group_by dimensions.
Key attributes:
- `pipeline_id` — belongs to a Pipeline
- `step_order` — ordering within the DAG
- `reduction_software_id` — which container to run
- `reduction_step_config_id` — configuration parameters
- `group_by` — JSON list of dimensions for sub-grouping
- `cooldown_seconds` — minimum wait after latest input change before firing
- `schedule` — optional per-step cron expression
- `max_concurrent_runs` — parallelism control
- `pinned_version_id` — optional: pin to a specific software version
- `version_policy` — "compatible" (use latest) or "exact" (use pinned)
- `max_retries` — how many times to retry on failure

Database model: `ccat_ops_db.models.ReductionStep`
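The `group_by` sub-grouping can be sketched as a simple partitioning step. This is a hand-rolled illustration, not trigger-manager code; the `source=NGC253|line=CO43` key format is taken from the `sub_group_key` attribute of ExecutedReductionStep:

```python
from collections import defaultdict


def sub_group_key(record: dict, group_by: list) -> str:
    """Build a key like 'source=NGC253|line=CO43' from the group_by dimensions.
    An empty group_by yields the empty key, i.e. a single combined group."""
    return "|".join("{}={}".format(dim, record[dim]) for dim in group_by)


def partition(records: list, group_by: list) -> dict:
    """Split records into sub-groups, one per distinct sub-group key."""
    groups = defaultdict(list)
    for rec in records:
        groups[sub_group_key(rec, group_by)].append(rec)
    return dict(groups)


records = [
    {"executed_obs_unit": "scan-001", "source": "NGC253", "line": "CO43"},
    {"executed_obs_unit": "scan-002", "source": "NGC253", "line": "CO43"},
]
per_scan = partition(records, ["executed_obs_unit"])  # one sub-group per scan
combined = partition(records, [])                     # group_by [] -> everything together
```

This mirrors the example hierarchy: the calibrate and baseline steps run once per scan, while the grid+map step with `group_by: []` sees all data as a single sub-group.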
## Step Dependencies
Steps within a Pipeline form a DAG via the reduction_step_dependency association
table. Each step can declare upstream dependencies. The trigger-manager enforces that
downstream steps only fire after upstream steps complete for the same sub-group scope.
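The firing order implied by such a DAG can be sketched with Python's standard `graphlib`; the step names come from the example hierarchy above, and the dict simply maps each step to its upstream dependencies:

```python
from graphlib import TopologicalSorter

# Each step mapped to the set of upstream steps it depends on,
# as the reduction_step_dependency table would record them.
deps = {
    "calibrate": set(),
    "baseline": {"calibrate"},
    "grid+map": {"baseline"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

For this linear chain the only valid order is calibrate, then baseline, then grid+map; with a wider DAG, independent steps may fire in parallel once their upstreams complete.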
## ReductionSoftware
A registered container image in GHCR. Represents a piece of reduction software as a black box — could be a single tool or an entire reduction pipeline internally.
Key attributes:
- `name` — human-readable name (e.g., "chai-calibration")
- `ghcr_url` — GitHub Container Registry URL
- `github_repo_url` — link to source code
- `description` — what the software does

Database model: `ccat_ops_db.models.ReductionSoftware`
Versions are tracked separately in ReductionSoftwareVersion, which records the
image tag, digest, and is_latest flag.
## ReductionStepConfig
Configuration for a specific step: parameters, environment variables, command template,
and resource requirements. No polymorphism — all instrument-specific parameters go
in the config_parameters JSON column.
Key attributes:
- `name` — descriptive name
- `config_parameters` — JSON dict of parameters passed to the container
- `environment_variables` — JSON dict of extra env vars
- `command_template` — the command to run inside the container
- `resource_requirements` — JSON dict (cpu, memory_gb, time_hours, gpu)

Database model: `ccat_ops_db.models.ReductionStepConfig`
Example resource requirements:
```json
{"cpu": 4, "memory_gb": 16, "time_hours": 2, "gpu": 0}
```
These are translated to platform-specific flags by the HPC backend:

- SLURM: `--cpus-per-task=4 --mem=16G --time=02:00:00`
- Kubernetes: resource requests/limits
- Local: informational only
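A sketch of the SLURM translation, assuming the flag mapping shown above; the GPU handling via `--gres` is an assumption, since the example only covers the gpu-free case:

```python
def slurm_flags(req: dict) -> list:
    """Translate a resource_requirements dict into SLURM sbatch flags.
    Mirrors the documented example; GPU flag convention is assumed."""
    hours = int(req["time_hours"])
    minutes = int(round((req["time_hours"] - hours) * 60))
    flags = [
        "--cpus-per-task={}".format(req["cpu"]),
        "--mem={}G".format(req["memory_gb"]),
        "--time={:02d}:{:02d}:00".format(hours, minutes),
    ]
    if req.get("gpu", 0) > 0:
        flags.append("--gres=gpu:{}".format(req["gpu"]))  # assumed convention
    return flags


slurm_flags({"cpu": 4, "memory_gb": 16, "time_hours": 2, "gpu": 0})
# -> ['--cpus-per-task=4', '--mem=16G', '--time=02:00:00']
```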
## ExecutedReductionStep
A concrete run of a ReductionStep for a specific sub-group. Created by the trigger-manager, progressed through statuses by the workflow-manager and result-manager.
Key attributes:
- `reduction_step_id` — which step this is a run of
- `reduction_software_version_id` — exact version used
- `status` — current lifecycle state (see Execution Flow)
- `sub_group_key` — identifies the data subset (e.g., `source=NGC253|line=CO43`)
- `hpc_job_id` — backend-specific job identifier
- `execution_command` — full Apptainer command (stored for reproducibility)
- `processing_time_s`, `peak_memory_gb`, `cpu_hours` — execution metrics

Database model: `ccat_ops_db.models.ExecutedReductionStep`
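Before creating a run, the trigger-manager waits out the parent step's `cooldown_seconds`. The gate reduces to a single time comparison; this is an illustrative sketch of that check, not the actual trigger-manager code:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cooldown_elapsed(latest_input_change: datetime,
                     cooldown_seconds: int,
                     now: Optional[datetime] = None) -> bool:
    """A step may fire only once cooldown_seconds have passed
    since the latest change to its input data."""
    now = now or datetime.now(timezone.utc)
    return now - latest_input_change >= timedelta(seconds=cooldown_seconds)


t0 = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
cooldown_elapsed(t0, 60, now=t0 + timedelta(seconds=30))  # False: still cooling down
cooldown_elapsed(t0, 60, now=t0 + timedelta(seconds=90))  # True: safe to fire
```

This is why the per-scan steps in the example use a 60-second cooldown (fire quickly after each scan) while the combined grid+map step uses 86400 seconds (fire at most daily).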
## DataProduct
An output file produced by a run. Classified by type based on the output directory convention (see Container Contract).
Types: SCIENCE, INTERMEDIATE, QA_METRIC, PLOT, STATISTICS, LOG
Key attributes:
- `executed_reduction_step_id` — which run produced this
- `data_product_type` — classification
- `file_path` — path to the file on disk
- `file_size` — in bytes
- `checksum` — integrity verification
- `metadata` — JSON from optional sidecar files
Provenance lineage is tracked via the data_product_lineage association table,
linking each DataProduct back to the RawDataPackages that contributed to it.
Database model: `ccat_ops_db.models.DataProduct`
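The `checksum` column can be populated with a streaming file digest along these lines; the choice of SHA-256 here is an assumption, as the source does not specify which algorithm the result-manager uses:

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256") -> str:
    """Compute a hex digest for the checksum column, reading the file
    in 1 MiB chunks so large data products don't need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

Re-running the digest on the stored `file_path` and comparing against `checksum` then verifies integrity.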
## Version Management
Two modes for software version binding:
- Compatible (default): use the latest version of the ReductionSoftware. Intermediates from older versions are still valid; non-breaking version bumps don't trigger re-processing.
- Exact: use exactly the pinned version. Only intermediates from this exact version count as valid. Changing the pinned version invalidates old intermediates, causing gap evaluation to trigger full re-processing.
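The resolution logic for the two modes can be sketched as follows; version records are shown as plain dicts carrying the `is_latest` flag from ReductionSoftwareVersion, not the ORM objects:

```python
from typing import Optional


def resolve_version(versions: list, version_policy: str,
                    pinned_version_id: Optional[int] = None) -> dict:
    """Pick the ReductionSoftwareVersion a step should run with.
    'compatible' takes the latest registered version; 'exact' takes
    the pinned one and fails loudly if no pin was set."""
    if version_policy == "exact":
        if pinned_version_id is None:
            raise ValueError("'exact' policy requires pinned_version_id")
        return next(v for v in versions if v["id"] == pinned_version_id)
    # "compatible": follow the is_latest flag
    return next(v for v in versions if v["is_latest"])


versions = [
    {"id": 1, "tag": "v1.0.0", "is_latest": False},
    {"id": 2, "tag": "v1.1.0", "is_latest": True},
]
resolve_version(versions, "compatible")                   # tag v1.1.0
resolve_version(versions, "exact", pinned_version_id=1)   # tag v1.0.0
```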