Data Grouping & Filter Engine#

Documentation Verified Last checked: 2026-03-07 Reviewer: Christof Buchbender

DataGroupings define “what data” a pipeline should process. The generic filter engine replaces the earlier polymorphic approach (CHAI/PrimeCam-specific resolvers) with a declarative, data-driven system that works for any instrument.

DataGrouping#

A DataGrouping defines a coherent dataset using declarative filter rules. For example, “all archived CHAI data for NGC253, CO(4-3)”.

Key attributes:

  • name — human-readable label

  • description — what this grouping selects

  • filter_rules — JSON list of filter rules (see below)

  • instrument_module_id — optional FK to scope to a specific instrument

Database model: ccat_ops_db.models.DataGrouping

Filter Rules#

Filter rules are JSON objects that describe conditions on the join graph. The filter engine translates these into SQLAlchemy queries at runtime.

Rule structure:

{
  "table": "Source",
  "column": "name",
  "operator": "eq",
  "value": "NGC253"
}

Supported operators:

Operator

SQL equivalent

eq

column = value

neq

column != value

in

column IN (values)

not_in

column NOT IN (values)

gt

column > value

gte

column >= value

lt

column < value

lte

column <= value

like

column LIKE value

JSON path drilling: For nested JSON/JSONB columns, use the json_path field to drill into the structure:

{
  "table": "RawDataPackage",
  "column": "metadata",
  "json_path": "observation.frequency_ghz",
  "operator": "gt",
  "value": 400
}

Example: CHAI data for NGC253, CO(4-3):

[
  {"table": "Source", "column": "name", "operator": "eq", "value": "NGC253"},
  {"table": "ObsUnit", "column": "line_id", "operator": "eq", "value": "CO43"},
  {"table": "InstrumentModule", "column": "name", "operator": "eq", "value": "CHAI"},
  {"table": "RawDataPackage", "column": "state", "operator": "eq", "value": "archived"}
]

Join Graph#

The filter engine maintains a declarative map of how database tables connect. When filter rules reference multiple tables, the engine automatically builds the necessary joins.

The join graph includes:

  • RawDataPackage <-> ObsUnit (via executed_obs_unit)

  • ObsUnit <-> Source

  • RawDataPackage <-> InstrumentModule

  • And other paths through the observation model

This is defined in ccat_workflow_manager.grouping.engine.JOIN_GRAPH.

Sub-Group Resolution#

The group_by parameter (on ReductionStep, not DataGrouping) controls how matched data is split into sub-groups for parallel execution.

How it works:

  1. The filter engine applies filter_rules to find all matching RawDataPackages

  2. The group_by dimensions determine how to partition the results

  3. Each unique combination of group_by values becomes a sub-group

  4. Each sub-group gets its own ExecutedReductionStep

Examples:

group_by=["Source.name", "ObsUnit.line_id"]

One run per (source, line) combination. E.g., source=NGC253|line=CO43.

group_by=["ExecutedObsUnit.id"]

One run per individual scan.

group_by=[]

One run for everything (aggregation step). All matched data in a single run.

Different granularities in one Pipeline:

DataGrouping: source=NGC253, line=CO43, state=archived
  │
  ├── Step 1 (calibrate):  group_by=[ExecutedObsUnit.id]  → N runs (per scan)
  ├── Step 2 (baseline):   group_by=[ExecutedObsUnit.id]  → N runs (per scan)
  └── Step 3 (grid+map):   group_by=[]                    → 1 run  (all data)

Steps 1→2 have matching sub-group keys (1:1). Step 3 aggregates: it collects ALL intermediates from Step 2 across all sub-groups.

Presets#

Curated filter/group_by templates are available for common workflows. Scientists can select a preset in the UI and customize from there.

Available presets:

  • chai_by_source_line — CHAI data grouped by source and spectral line

  • chai_by_source_line_obsconfig — CHAI data with observation configuration

  • primecam_by_source_obsmode — PrimeCam data grouped by source and obs mode

Presets are listed via GET /pipelines/groupings/presets and defined in ccat_workflow_manager.grouping.presets.

Frontend Workflow#

  1. Scientist selects a preset or builds custom filters in the UI

  2. Frontend calls GET /pipelines/groupings/{id}/resolve?group_by=Source.name,ObsUnit.line_id to preview sub-groups

  3. Scientist sees what data would be included, adjusts filters

  4. Saves the DataGrouping

  5. Creates Pipeline(s) attached to it, each with their own group_by and trigger