Data Grouping & Filter Engine#
DataGroupings define “what data” a pipeline should process. The generic filter engine replaces the earlier polymorphic approach (CHAI/PrimeCam-specific resolvers) with a declarative, data-driven system that works for any instrument.
DataGrouping#
A DataGrouping defines a coherent dataset using declarative filter rules. For example, “all archived CHAI data for NGC253, CO(4-3)”.
Key attributes:
name— human-readable labeldescription— what this grouping selectsfilter_rules— JSON list of filter rules (see below)instrument_module_id— optional FK to scope to a specific instrument
Database model: ccat_ops_db.models.DataGrouping
Filter Rules#
Filter rules are JSON objects that describe conditions on the join graph. The filter engine translates these into SQLAlchemy queries at runtime.
Rule structure:
{
"table": "Source",
"column": "name",
"operator": "eq",
"value": "NGC253"
}
Supported operators:
Operator |
SQL equivalent |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
JSON path drilling: For nested JSON/JSONB columns, use the json_path field to
drill into the structure:
{
"table": "RawDataPackage",
"column": "metadata",
"json_path": "observation.frequency_ghz",
"operator": "gt",
"value": 400
}
Example: CHAI data for NGC253, CO(4-3):
[
{"table": "Source", "column": "name", "operator": "eq", "value": "NGC253"},
{"table": "ObsUnit", "column": "line_id", "operator": "eq", "value": "CO43"},
{"table": "InstrumentModule", "column": "name", "operator": "eq", "value": "CHAI"},
{"table": "RawDataPackage", "column": "state", "operator": "eq", "value": "archived"}
]
Join Graph#
The filter engine maintains a declarative map of how database tables connect. When filter rules reference multiple tables, the engine automatically builds the necessary joins.
The join graph includes:
RawDataPackage<->ObsUnit(via executed_obs_unit)ObsUnit<->SourceRawDataPackage<->InstrumentModuleAnd other paths through the observation model
This is defined in ccat_workflow_manager.grouping.engine.JOIN_GRAPH.
Sub-Group Resolution#
The group_by parameter (on ReductionStep, not DataGrouping) controls how matched
data is split into sub-groups for parallel execution.
How it works:
The filter engine applies
filter_rulesto find all matching RawDataPackagesThe
group_bydimensions determine how to partition the resultsEach unique combination of group_by values becomes a sub-group
Each sub-group gets its own
ExecutedReductionStep
Examples:
group_by=["Source.name", "ObsUnit.line_id"]One run per (source, line) combination. E.g.,
source=NGC253|line=CO43.group_by=["ExecutedObsUnit.id"]One run per individual scan.
group_by=[]One run for everything (aggregation step). All matched data in a single run.
Different granularities in one Pipeline:
DataGrouping: source=NGC253, line=CO43, state=archived
│
├── Step 1 (calibrate): group_by=[ExecutedObsUnit.id] → N runs (per scan)
├── Step 2 (baseline): group_by=[ExecutedObsUnit.id] → N runs (per scan)
└── Step 3 (grid+map): group_by=[] → 1 run (all data)
Steps 1→2 have matching sub-group keys (1:1). Step 3 aggregates: it collects ALL intermediates from Step 2 across all sub-groups.
Presets#
Curated filter/group_by templates are available for common workflows. Scientists can select a preset in the UI and customize from there.
Available presets:
chai_by_source_line — CHAI data grouped by source and spectral line
chai_by_source_line_obsconfig — CHAI data with observation configuration
primecam_by_source_obsmode — PrimeCam data grouped by source and obs mode
Presets are listed via GET /pipelines/groupings/presets and defined in
ccat_workflow_manager.grouping.presets.
Frontend Workflow#
Scientist selects a preset or builds custom filters in the UI
Frontend calls
GET /pipelines/groupings/{id}/resolve?group_by=Source.name,ObsUnit.line_idto preview sub-groupsScientist sees what data would be included, adjusts filters
Saves the DataGrouping
Creates Pipeline(s) attached to it, each with their own
group_byand trigger