# Data Grouping & Filter Engine

```{eval-rst}
.. verified:: 2026-03-07
   :reviewer: Christof Buchbender
```

DataGroupings define "what data" a pipeline should process. The generic filter engine replaces the earlier polymorphic approach (CHAI/PrimeCam-specific resolvers) with a declarative, data-driven system that works for any instrument.

## DataGrouping

A DataGrouping defines a coherent dataset using declarative filter rules. For example, "all archived CHAI data for NGC253, CO(4-3)".

**Key attributes:**

- `name` --- human-readable label
- `description` --- what this grouping selects
- `filter_rules` --- JSON list of filter rules (see below)
- `instrument_module_id` --- optional FK to scope to a specific instrument

**Database model**: {py:class}`ccat_ops_db.models.DataGrouping`

## Filter Rules

Filter rules are JSON objects that describe conditions on the join graph. The filter engine translates these into SQLAlchemy queries at runtime.

**Rule structure:**

```json
{
  "table": "Source",
  "column": "name",
  "operator": "eq",
  "value": "NGC253"
}
```

**Supported operators:**

| Operator | SQL equivalent           |
| -------- | ------------------------ |
| `eq`     | `column = value`         |
| `neq`    | `column != value`        |
| `in`     | `column IN (values)`     |
| `not_in` | `column NOT IN (values)` |
| `gt`     | `column > value`         |
| `gte`    | `column >= value`        |
| `lt`     | `column < value`         |
| `lte`    | `column <= value`        |
| `like`   | `column LIKE value`      |

**JSON path drilling:**

For nested JSON/JSONB columns, use the `json_path` field to drill into the structure:

```json
{
  "table": "RawDataPackage",
  "column": "metadata",
  "json_path": "observation.frequency_ghz",
  "operator": "gt",
  "value": 400
}
```

**Example: CHAI data for NGC253, CO(4-3):**

```json
[
  {"table": "Source", "column": "name", "operator": "eq", "value": "NGC253"},
  {"table": "ObsUnit", "column": "line_id", "operator": "eq", "value": "CO43"},
  {"table": "InstrumentModule", "column": "name", "operator": "eq", "value": "CHAI"},
  {"table": "RawDataPackage", "column": "state", "operator": "eq", "value": "archived"}
]
```

## Join Graph

The filter engine maintains a declarative map of how database tables connect. When filter rules reference multiple tables, the engine automatically builds the necessary joins.

The join graph includes:

- `RawDataPackage` \<-> `ObsUnit` (via executed_obs_unit)
- `ObsUnit` \<-> `Source`
- `RawDataPackage` \<-> `InstrumentModule`
- And other paths through the observation model

This is defined in `ccat_workflow_manager.grouping.engine.JOIN_GRAPH`.

## Sub-Group Resolution

The `group_by` parameter (on ReductionStep, not DataGrouping) controls how matched data is split into sub-groups for parallel execution.

**How it works:**

1. The filter engine applies `filter_rules` to find all matching RawDataPackages
2. The `group_by` dimensions determine how to partition the results
3. Each unique combination of group_by values becomes a sub-group
4. Each sub-group gets its own `ExecutedReductionStep`

**Examples:**

`group_by=["Source.name", "ObsUnit.line_id"]`
: One run per (source, line) combination. E.g., `source=NGC253|line=CO43`.

`group_by=["ExecutedObsUnit.id"]`
: One run per individual scan.

`group_by=[]`
: One run for everything (aggregation step). All matched data in a single run.

**Different granularities in one Pipeline:**

```text
DataGrouping: source=NGC253, line=CO43, state=archived
│
├── Step 1 (calibrate):  group_by=[ExecutedObsUnit.id]  → N runs (per scan)
├── Step 2 (baseline):   group_by=[ExecutedObsUnit.id]  → N runs (per scan)
└── Step 3 (grid+map):   group_by=[]                    → 1 run (all data)
```

Steps 1→2 have matching sub-group keys (1:1). Step 3 aggregates: it collects ALL intermediates from Step 2 across all sub-groups.

## Presets

Curated filter/group_by templates are available for common workflows. Scientists can select a preset in the UI and customize from there.

Available presets:

- **chai_by_source_line** --- CHAI data grouped by source and spectral line
- **chai_by_source_line_obsconfig** --- CHAI data with observation configuration
- **primecam_by_source_obsmode** --- PrimeCam data grouped by source and obs mode

Presets are listed via `GET /pipelines/groupings/presets` and defined in `ccat_workflow_manager.grouping.presets`.

## Frontend Workflow

1. Scientist selects a preset or builds custom filters in the UI
2. Frontend calls `GET /pipelines/groupings/{id}/resolve?group_by=Source.name,ObsUnit.line_id` to preview sub-groups
3. Scientist sees what data would be included, adjusts filters
4. Saves the DataGrouping
5. Creates Pipeline(s) attached to it, each with their own `group_by` and trigger

## Related Documentation

- {doc}`pipeline_hierarchy` - How DataGrouping fits in the model hierarchy
- {doc}`/source/architecture/filter_engine` - Technical details of the filter engine