Data Grouping & Filter Engine#

✓

Documentation Verified Last checked: 2026-03-07 Reviewer: Christof Buchbender

DataGroupings define “what data” a pipeline should process. The generic filter engine replaces the earlier polymorphic approach (CHAI/PrimeCam-specific resolvers) with a declarative, data-driven system that works for any instrument.

DataGrouping#

A DataGrouping defines a coherent dataset using declarative filter rules. For example, “all archived CHAI data for NGC253, CO(4-3)”.

Key attributes:

name — human-readable label
description — what this grouping selects
filter_rules — JSON list of filter rules (see below)
instrument_module_id — optional FK to scope to a specific instrument

Database model: ccat_ops_db.models.DataGrouping

Filter Rules#

Filter rules are JSON objects that describe conditions on the join graph. The filter engine translates these into SQLAlchemy queries at runtime.

Rule structure:

{
  "table": "Source",
  "column": "name",
  "operator": "eq",
  "value": "NGC253"
}

Supported operators:

Operator	SQL equivalent
`eq`	`column = value`
`neq`	`column != value`
`in`	`column IN (values)`
`not_in`	`column NOT IN (values)`
`gt`	`column > value`
`gte`	`column >= value`
`lt`	`column < value`
`lte`	`column <= value`
`like`	`column LIKE value`

JSON path drilling: For nested JSON/JSONB columns, use the json_path field to drill into the structure:

{
  "table": "RawDataPackage",
  "column": "metadata",
  "json_path": "observation.frequency_ghz",
  "operator": "gt",
  "value": 400
}

Example: CHAI data for NGC253, CO(4-3):

[
  {"table": "Source", "column": "name", "operator": "eq", "value": "NGC253"},
  {"table": "ObsUnit", "column": "line_id", "operator": "eq", "value": "CO43"},
  {"table": "InstrumentModule", "column": "name", "operator": "eq", "value": "CHAI"},
  {"table": "RawDataPackage", "column": "state", "operator": "eq", "value": "archived"}
]

Join Graph#

The filter engine maintains a declarative map of how database tables connect. When filter rules reference multiple tables, the engine automatically builds the necessary joins.

The join graph includes:

RawDataPackage <-> ObsUnit (via executed_obs_unit)
ObsUnit <-> Source
RawDataPackage <-> InstrumentModule
And other paths through the observation model

This is defined in ccat_workflow_manager.grouping.engine.JOIN_GRAPH.

Sub-Group Resolution#

The group_by parameter (on ReductionStep, not DataGrouping) controls how matched data is split into sub-groups for parallel execution.

How it works:

The filter engine applies filter_rules to find all matching RawDataPackages
The group_by dimensions determine how to partition the results
Each unique combination of group_by values becomes a sub-group
Each sub-group gets its own ExecutedReductionStep

Examples:

group_by=["Source.name", "ObsUnit.line_id"]: One run per (source, line) combination. E.g., source=NGC253|line=CO43.
group_by=["ExecutedObsUnit.id"]: One run per individual scan.
group_by=[]: One run for everything (aggregation step). All matched data in a single run.

Different granularities in one Pipeline:

DataGrouping: source=NGC253, line=CO43, state=archived
  │
  ├── Step 1 (calibrate):  group_by=[ExecutedObsUnit.id]  → N runs (per scan)
  ├── Step 2 (baseline):   group_by=[ExecutedObsUnit.id]  → N runs (per scan)
  └── Step 3 (grid+map):   group_by=[]                    → 1 run  (all data)

Steps 1→2 have matching sub-group keys (1:1). Step 3 aggregates: it collects ALL intermediates from Step 2 across all sub-groups.

Presets#

Curated filter/group_by templates are available for common workflows. Scientists can select a preset in the UI and customize from there.

Available presets:

chai_by_source_line — CHAI data grouped by source and spectral line
chai_by_source_line_obsconfig — CHAI data with observation configuration
primecam_by_source_obsmode — PrimeCam data grouped by source and obs mode

Presets are listed via GET /pipelines/groupings/presets and defined in ccat_workflow_manager.grouping.presets.

Frontend Workflow#

Scientist selects a preset or builds custom filters in the UI
Frontend calls GET /pipelines/groupings/{id}/resolve?group_by=Source.name,ObsUnit.line_id to preview sub-groups
Scientist sees what data would be included, adjusts filters
Saves the DataGrouping
Creates Pipeline(s) attached to it, each with their own group_by and trigger