# Data Grouping & Filter Engine

```{eval-rst}
.. verified:: 2026-03-07
   :reviewer: Christof Buchbender
```

DataGroupings define "what data" a pipeline should process. The generic filter engine
replaces the earlier polymorphic approach (CHAI/PrimeCam-specific resolvers) with a
declarative, data-driven system that works for any instrument.

## DataGrouping

A DataGrouping defines a coherent dataset using declarative filter rules. For example,
"all archived CHAI data for NGC253, CO(4-3)".

**Key attributes:**

- `name` --- human-readable label
- `description` --- what this grouping selects
- `filter_rules` --- JSON list of filter rules (see below)
- `instrument_module_id` --- optional FK to scope to a specific instrument

**Database model**: {py:class}`ccat_ops_db.models.DataGrouping`

## Filter Rules

Filter rules are JSON objects that describe conditions on the join graph. The filter
engine translates these into SQLAlchemy queries at runtime.

**Rule structure:**

```json
{
  "table": "Source",
  "column": "name",
  "operator": "eq",
  "value": "NGC253"
}
```

**Supported operators:**

| Operator | SQL equivalent           |
| -------- | ------------------------ |
| `eq`     | `column = value`         |
| `neq`    | `column != value`        |
| `in`     | `column IN (values)`     |
| `not_in` | `column NOT IN (values)` |
| `gt`     | `column > value`         |
| `gte`    | `column >= value`        |
| `lt`     | `column < value`         |
| `lte`    | `column <= value`        |
| `like`   | `column LIKE value`      |

**JSON path drilling:** For nested JSON/JSONB columns, use the `json_path` field to
drill into the structure:

```json
{
  "table": "RawDataPackage",
  "column": "metadata",
  "json_path": "observation.frequency_ghz",
  "operator": "gt",
  "value": 400
}
```

**Example: CHAI data for NGC253, CO(4-3):**

```json
[
  {"table": "Source", "column": "name", "operator": "eq", "value": "NGC253"},
  {"table": "ObsUnit", "column": "line_id", "operator": "eq", "value": "CO43"},
  {"table": "InstrumentModule", "column": "name", "operator": "eq", "value": "CHAI"},
  {"table": "RawDataPackage", "column": "state", "operator": "eq", "value": "archived"}
]
```

## Join Graph

The filter engine maintains a declarative map of how database tables connect. When
filter rules reference multiple tables, the engine automatically builds the necessary
joins.

The join graph includes:

- `RawDataPackage` \<-> `ObsUnit` (via executed_obs_unit)
- `ObsUnit` \<-> `Source`
- `RawDataPackage` \<-> `InstrumentModule`
- And other paths through the observation model

This is defined in `ccat_workflow_manager.grouping.engine.JOIN_GRAPH`.

## Sub-Group Resolution

The `group_by` parameter (on ReductionStep, not DataGrouping) controls how matched
data is split into sub-groups for parallel execution.

**How it works:**

1. The filter engine applies `filter_rules` to find all matching RawDataPackages
2. The `group_by` dimensions determine how to partition the results
3. Each unique combination of group_by values becomes a sub-group
4. Each sub-group gets its own `ExecutedReductionStep`

**Examples:**

`group_by=["Source.name", "ObsUnit.line_id"]`

: One run per (source, line) combination. E.g., `source=NGC253|line=CO43`.

`group_by=["ExecutedObsUnit.id"]`

: One run per individual scan.

`group_by=[]`

: One run for everything (aggregation step). All matched data in a single run.

**Different granularities in one Pipeline:**

```text
DataGrouping: source=NGC253, line=CO43, state=archived
  │
  ├── Step 1 (calibrate):  group_by=[ExecutedObsUnit.id]  → N runs (per scan)
  ├── Step 2 (baseline):   group_by=[ExecutedObsUnit.id]  → N runs (per scan)
  └── Step 3 (grid+map):   group_by=[]                    → 1 run  (all data)
```

Steps 1→2 have matching sub-group keys (1:1). Step 3 aggregates: it collects ALL
intermediates from Step 2 across all sub-groups.

## Presets

Curated filter/group_by templates are available for common workflows. Scientists can
select a preset in the UI and customize from there.

Available presets:

- **chai_by_source_line** --- CHAI data grouped by source and spectral line
- **chai_by_source_line_obsconfig** --- CHAI data with observation configuration
- **primecam_by_source_obsmode** --- PrimeCam data grouped by source and obs mode

Presets are listed via `GET /pipelines/groupings/presets` and defined in
`ccat_workflow_manager.grouping.presets`.

## Frontend Workflow

1. Scientist selects a preset or builds custom filters in the UI
2. Frontend calls `GET /pipelines/groupings/{id}/resolve?group_by=Source.name,ObsUnit.line_id`
   to preview sub-groups
3. Scientist sees what data would be included, adjusts filters
4. Saves the DataGrouping
5. Creates Pipeline(s) attached to it, each with their own `group_by` and trigger

## Related Documentation

- {doc}`pipeline_hierarchy` - How DataGrouping fits in the model hierarchy
- {doc}`/source/architecture/filter_engine` - Technical details of the filter engine