# ADR-0003 — Monitoring alert substrate

**Status.** Proposed, 2026-05-12. Revised 2026-05-13: Tier 1 deferred to monitoring framework PRD (#158; see "Revision: 2026-05-13" below).

**Related.** ccatobs/system-integration#112 (scoping issue), ADR-0002 (parent substrate decision), ccatobs/system-integration#95 (PRD; Phase E gating), ccatobs/system-integration#155 (closed — Tier 1 provisioning deferred), ccatobs/system-integration#158 (monitoring framework PRD scoping — where the Tier 1 channel decision lands).

**Supersedes (partial).** ADR-0002 §"Decision: Alert substrate" Tier 1 wiring — see "Decision: Tier 1 — deferred" below. ADR-0002's Grafana-driven T1 is rejected; the replacement is deferred rather than committed.

## Revision: 2026-05-13

After #154 landed (Tier 2 mail), the team re-evaluated the cost/benefit of immediately committing to a Tier 1 channel pre-Phase-E. Conclusion: defer.

- **Pre-production state.** No on-call rotation; no 24/7 expectation. Dual-path redundancy is over-engineered for current operations.
- **Tier 2 mail is itself TLS-independent.** Each host exits via postfix → `smtp.uni-koeln.de` (smarthost), outside the cert plane this substrate monitors. The load-bearing property "substrate alerts survive cert-plane failure" is satisfied by mail alone.
- **Premature substrate lock-in.** The monitoring framework PRD scoping issue (#158) is already open. Provisioning a Matrix bot and rooms now risks becoming sunk cost if that PRD selects a different paging channel (PagerDuty, Healthchecks.io, Grafana-native Matrix when it ships, etc.).
- **Single-path risk is acknowledged and bounded.** Until #158's PRD lands, any mail-flow anomaly is treated as substrate-critical. The risk window narrows naturally as the (week-old) postfix install proves out.

The sections §"Tier 1", §"Severity model + routing", §"Validation gate", and §"Implementation" are adjusted accordingly. Tier 2 (MTA) and the broader two-layer model are unchanged.

## Context

Phase A of PRD #95 is code-complete but Check 8 (page-path E2E sign-off) is blocked. ADR-0002 specified a tiered alert substrate (T1 Telegraf→InfluxDB→Grafana→Matrix, T2 cron+mailx, T3 journald audit). Two issues block Check 8:

- **The T1 path stops at Grafana.** `grafana/provisioning//` has only `datasources/` + `dashboards/` — no alert rules, no contact points, no notification policies provisioned anywhere.
- **The T2 path is broken.** `roles/system_setup/tasks/sendmail.yml` configures `/etc/aliases` only; no MTA is started, so `mailx` fails with `postdrop: unable to look up public/pickup` on every input-* host. Mail never leaves.

A deeper issue surfaces on closer reading: ADR-0002's T1 path is shared-fate with the very TLS chain it monitors. Telegraf → InfluxDB → Grafana run on input-b and depend on TLS. If the cert plane fails, T1 likely fails with it. The substrate's load-bearing property — TLS-independent paging — is not actually satisfied by Grafana-driven T1.

This ADR addresses both gaps. Scope is intentionally tight: substrate-layer (cert plane) alerts only. Non-substrate signals (disk, containers, application health) ride on a Grafana alerting layer designed in a separate follow-up.

## Decision

Two committed decisions plus one deferral:

1. **Tier 1 is deferred to the monitoring framework PRD (#158).** Source-push (systemd `OnFailure=` → direct Matrix POST) was designed and remains the leading candidate, but channel choice is re-evaluated as part of #158's broader scope. For Phase E cutover, substrate-layer alerts ride on Tier 2 mail alone. Acceptable given pre-production state and the fact that Tier 2 mail is itself TLS-independent (smarthost exit, outside the cert plane). Design captured below as "Decision: Tier 1 — deferred" for future reference.
2. **MTA is postfix null-client + `smtp.uni-koeln.de` smarthost.** Extend `roles/system_setup/tasks/sendmail.yml` with package install, service enable, minimal `main.cf`, `newaliases`.
   Each host runs its own MTA — no central relay (preserves "no shared infra dependency" property).
3. **Two-layer model.** This ADR establishes the substrate-layer pattern. Follow-up #158 scopes the full monitoring framework (Tier 1 channel decision + Grafana alerting layer for non-substrate signals).

---

## Decision: Tier 1 — deferred

### Status

**Deferred** to the monitoring framework PRD (#158, 2026-05-13 revision). For Phase E, substrate alerts ride on Tier 2 mail alone. The design captured below documents what was considered and remains the leading candidate when #158 revisits the question; it is **not** currently implemented, and the issues that would have implemented it (B and C in the implementation graph) are closed/descoped.

### Design considered (not implemented)

Each renewal service (`step-renew-x509@%i.service` and its SSH-cert-plane equivalent) would gain `OnFailure=ccat-cert-matrix@%i.service` alongside the existing `OnFailure=ccat-cert-mail@%i.service`. The new unit would invoke `/usr/local/bin/ccat-cert-matrix.sh %i`, which sends the alert to Matrix using the proven `ci/notify_matrix.py` pattern: `PUT /_matrix/client/v3/rooms/{room}/send/m.room.message/{txn_id}` with `Authorization: Bearer ` and an `m.notice` body containing host, service, unit, and last journalctl excerpt.

### Why source-push was preferred over Grafana-driven (retained for future PRD)

- **TLS-independent end-to-end.** T1 path is `systemd → script → one HTTP call`. No Telegraf, no InfluxDB, no Grafana hop. Survives catastrophic failure of the entire observability stack.
- **Symmetric with T2 mail.** Both fire from the same `OnFailure=` trigger via the same script-template pattern. One mental model, two destinations.
- **Zero Grafana templating complexity.** No webhook contact point, no body template, no txn_id derivation in YAML.
- **`:latest` Grafana future-compat.** If Grafana ever ships native Matrix support, it's adopted at the Grafana-alerting-layer, not retrofitted into the substrate.

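
The single HTTP call described under "Design considered" can be sketched in Python. This is a minimal illustration under stated assumptions, not the contents of `ci/notify_matrix.py` or of the unimplemented `ccat-cert-matrix.sh`: the function name, its parameters, the message layout, and the use of a random UUID as `txn_id` are all hypothetical. Only the endpoint shape, the bearer header, and the `m.notice` message type come from the design text.

```python
import json
import urllib.parse
import urllib.request
import uuid


def build_matrix_notice(homeserver, room_id, token, host, service, unit,
                        journal_excerpt, txn_id=None):
    """Build (but do not send) the PUT request for one m.notice message.

    All parameter names are illustrative assumptions. The txn_id only has
    to be unique per access token; a random UUID is one simple choice.
    """
    txn_id = txn_id or uuid.uuid4().hex
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        # Room IDs contain '!' and ':', so percent-encode the path segments.
        urllib.parse.quote(room_id, safe=""),
        urllib.parse.quote(txn_id, safe=""),
    )
    body = {
        "msgtype": "m.notice",
        "body": f"[{host}] {service} failed (unit {unit})\n{journal_excerpt}",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending is then one call, for example `urllib.request.urlopen(req, timeout=10)`. Keeping request construction separate from sending lets the payload be checked without network access, which matters for a path meant to work when the rest of the observability stack is down.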
### Alternatives considered (retained for future PRD)

- **Grafana webhook → Matrix client-server API.** Non-trivial body template in YAML; txn_id-as-URL-segment has a small theoretical collision window; T1 inherits Grafana's shared-fate with TLS infrastructure. Appropriate for a future Grafana-alerting-layer for non-substrate signals.
- **Thin webhook shim service.** Adds a new service to docker-compose with its own monitoring concerns ("who monitors the alert shim?"). Source-push has fewer moving parts.
- **matrix-hookshot bridge.** Full bot infrastructure; overkill for substrate alerts.
- **Non-Matrix channels** (added 2026-05-13). PagerDuty, Healthchecks.io dead-man's-switch, or a future Grafana-native Matrix channel are open candidates for #158.

### Consequences of deferral

- No new template files in cert roles; `ccat-cert-mail@.service.j2` / `ccat-cert-mail.sh.j2` remain the sole `OnFailure=` wiring.
- No monitoring bot user, no new Matrix rooms, no `vault_matrix_monitoring_*` schema entries.
- **Single-path risk.** Substrate alerts depend solely on the Tier 2 mail path. Operational rule (see §"Operational notes"): any mail-flow anomaly is substrate-critical until #158's PRD lands a Tier 1.

---

## Decision: Tier 2 — postfix null-client MTA

### Decision

Extend `roles/system_setup/tasks/sendmail.yml`:

- Install `postfix`, `s-nail` (provides `/usr/bin/mailx`), and on RHEL 9 `postfix-lmdb` (RHEL 10's base postfix already ships lmdb).
- Drop minimal `/etc/postfix/main.cf` from a Jinja template:
  - `relayhost = {{ smtp_relayhost }}`
  - `inet_interfaces = loopback-only`
  - `mydestination = $myhostname, localhost.$mydomain, localhost`
  - `myhostname = {{ ansible_host }}`
  - `myorigin = $myhostname`
  - `alias_database = lmdb:/etc/aliases`
  - `alias_maps = lmdb:/etc/aliases`
- Enable + start `postfix.service`.
- Route system service mail to root via `/etc/aliases`: `logcheck`, `logwatch`, `postmaster`, `mailer-daemon`, `abuse` → `root`; then `root` → `admin_email_addresses`.
- Rebuild `/etc/aliases.lmdb` via `newaliases` as an idempotent task (stat-driven, not a handler) so the role self-heals from any stuck state.

Add group var `smtp_relayhost: smtp.uni-koeln.de` at `group_vars/all/`, overridable per-env or per-host. No auth required for ITCC-internal mail submission from on-campus hosts; if that changes, a `vault_smtp_relay_password` vault entry can be added in a future change.

### Implementation notes (post-landing, 2026-05-12)

The directives above are the corrected forms after running the role against both environments. The original prescription (`mydestination =` empty, `myhostname = {{ inventory_hostname }}.uni-koeln.de`, hash-based aliases DB, postfix-only package list) did not survive contact with the live hosts:

- **`mydestination =` (empty) bypasses `/etc/aliases`.** With no domain considered local, `local(8)` never runs, so root → admins rewriting is silently skipped and mail ships to a non-existent destination domain.
- **`{{ inventory_hostname }}.uni-koeln.de` is not the host's real FQDN.** Inventory shorthand (`input-c.staging`) plus `.uni-koeln.de` produces a string that does not resolve in DNS. Pulls from `ansible_host` (the real inventory FQDN) instead. `ansible_fqdn` is not safe to use either — kernel hostnames on production input nodes are intentionally short (per host_vars), so `hostname --fqdn` returns garbage. Convention for this role and any future ones: use `ansible_host`, never `ansible_fqdn`.
- **`hash` is gone on RHEL 10.** RHEL 10's postfix dropped the Berkeley DB hash map driver entirely (postfix's own diagnostic falsely suggests a `postfix-hash` subpackage; that package does not exist). Switched to `lmdb`. RHEL 9 supports both but ships `lmdb` as a separate `postfix-lmdb` subpackage; the role installs it conditionally on RHEL major ≤ 9.
- **`s-nail` is the mailx provider.** Production was a barer install than staging and lacked `mailx`. Added to the role package list so the dependency is explicit and colocated with its consumer.
- **System service aliases.** `logcheck`, `logwatch`, `postmaster`, `mailer-daemon`, `abuse` are all aliased to `root` so their output folds into the same root → admin chain. Before these aliases were added, ~218 days of stranded logcheck reports flushed to the relay the moment postfix started.

### Why per-host MTA, not central relay

The substrate's entire purpose is "no shared infrastructure dependency". A central relay host would re-introduce a single point of failure that the very alert path depends on.

### Consequences

- `roles/system_setup` gains a postfix dependency on every host. Cost: ~5 MB memory, one daemon. Acceptable.
- `ccat-cert-mail.sh` and `ccat-cert-heartbeat.sh` start delivering mail immediately on every host after the role runs.
- Operational rule "absence of daily mail ≥36h on any host = problem" becomes enforceable for the first time.
- REUNA is dropped from scope; it was a deprecated test VM. Future observatory site MTA hookup is a separate concern when that infrastructure lands.

---

## Decision: Severity model + routing

### Decision

- **Severity = tier subscription, not a label.** Substrate signals currently subscribe to T2 only (mail). When #158's PRD lands a Tier 1 channel, substrate signals will subscribe to both T1+T2. Non-substrate signals (future) may subscribe to one or both depending on urgency.
- **Unified mail alias** (`admin_email_addresses` — current: `buchbend@ph1.uni-koeln.de, ngo@ph1.uni-koeln.de`). Staging mail uses `[STAGING]` subject prefix for inbox disambiguation.
- **Tier 1 routing deferred.** The original section enumerated two new Matrix rooms (`#ccat-monitoring-prod`, `#ccat-monitoring-staging`) and three vault entries (`vault_matrix_monitoring_bot_token`, `vault_matrix_monitoring_room_id_prod`, `vault_matrix_monitoring_room_id_staging`).
  All are deferred to #158, where channel choice is re-evaluated. No vault schema changes land with this ADR.

### Why no severity label

The team has no on-call rotation today. A 3-level taxonomy (critical/warning/info) implies a hierarchy that doesn't exist. Tier subscription expresses what we actually mean: "is this paging or FYI?". Layer in labels when an on-call rotation is decided in the follow-up framework PRD.

### Why unified mail alias

Two humans on the alias; subject-line prefix is sufficient disambiguation. Splitting aliases is overhead until the alias grows past 4 humans.

### On separating staging vs prod Tier 1 rooms (retained for future PRD)

When Tier 1 lands, staging noise should not mix into the production paging room (desensitisation risk within weeks). The two-room split remains the recommended shape; only the *channel substrate* is open.

---

## Decision: Grafana provisioning approach

### Decision

Grafana provisioning continues with the existing declarative YAML pattern: `grafana/provisioning//{datasources,dashboards}/`. **No `alerting/` subdir is created in this ADR's scope**: Tier 1 is deferred, and its leading candidate is source-push rather than Grafana-driven. The Grafana alerting layer for non-substrate signals is designed in the follow-up framework PRD; if/when it lands, it will use the same `grafana/provisioning//alerting/` declarative-YAML shape.

### Authoring workflow (informational)

Authors draft dashboards/alerts in the Grafana UI against provisioned datasources, then formalise the JSON/YAML back into the provisioning tree before committing. Drift between UI state and YAML is accepted at authoring time and resolved at commit time.

---

## Decision: Validation gate

Check 8 (PRD #95 / runbook) signs off page-path E2E on the canary cert on `input-c.staging`:

- **Induce primary (network partition):** `sudo iptables -I OUTPUT -d -p tcp --dport 9000 -j DROP` + `sudo systemctl start step-renew-x509-x509-canary.service`. Renewal should fail within ~30s.
- **Expected:** Within ~60s, `admin_email_addresses` receives mail with `[STAGING]` subject prefix.
- **Reset:** remove the iptables rule.
- **Sanity sibling (non-network):** relocate the canary key file (`sudo mv /opt/x509-canary/canary.key /opt/x509-canary/canary.key.bak`), re-trigger renewal, confirm mail fires on the script error path. Reset: `mv` back.

The Tier 1 / Matrix half of dual-path validation is deferred along with the Tier 1 channel decision; Check 8 currently verifies the mail path only.

**Procedure caveat (observed 2026-05-13 during #157 sign-off).** `step ca renew` short-circuits when the cert is not yet inside its renewal window, so the primary induce method (iptables DROP + `systemctl start`) does **not** fire `OnFailure=` on a freshly-issued canary: the renewal exits successfully before reaching the network call. The sanity-sibling test is the load-bearing evidence until the procedure is refined. The refinement (mirror `--force` from #151) is tracked in #168.

---

## Out of scope

Deferred to the follow-up scoping issue #158 (*"Scope monitoring framework PRD for non-substrate signals"*):

- Inventory of all current telemetry (Telegraf inputs, container metrics, log sources).
- Grafana alerting layer design for non-substrate signals.
- Severity granularity beyond tier subscription.
- On-call hand-off contract (acks, escalation, MTTR).
- Concrete non-substrate anchor signals (disk fill, container restart loops, replication lag, etc.).
- Tier 3 issuance-audit alerting (Loki rules per ADR-0002 §Monitoring).
- Dashboards-as-code maturity (UI/JSON export workflow formalisation).

## Operational notes

- The role task-ordering bug from finding 3 of #112's 2026-05-08 verification comment is resolved separately in **PR #152**. This ADR does not re-fix it.
- **Mail is single-path for substrate alerts** until #158's PRD lands a Tier 1. Treat any mail-flow anomaly (queue stall, ≥36h absence of daily mail on any host, SMTP submission failure logs) as substrate-critical.
  Re-evaluate this rule once Tier 1 lands.
- If `smtp.uni-koeln.de` requires auth in the future, add `vault_smtp_relay_password` to the schema and `smtp_sasl_password_maps` to the postfix template. Backward-compat: empty/missing vault var = no auth (current behaviour).
- When #158's PRD revisits Tier 1: if the chosen channel is Matrix, bot-token rotation will be a schema-driven `ccat secrets set` + `ccat secrets provision` flow; if a different substrate is chosen, the rotation surface attaches to whatever the PRD defines.

## Implementation

These are the issues originally spun off from this ADR. The original graph was `A ∥ B → C → D`; the revised graph is `A → D`:

- **A** — MTA install (`system_setup/sendmail.yml` extension). **Landed via #154.**
- **B** — Matrix monitoring bot + rooms provisioning. **Closed #155 (2026-05-13)**; re-emerges in #158.
- **C** — Matrix push wire-up for `step_ca_service_cert` and `ssh_service_cert` roles. **Descoped**; existing `OnFailure=ccat-cert-mail@%i.service` is sufficient for Phase E.
- **D** — Check 8 page-path E2E sign-off on `input-c.staging` (mail-only).

Phase E of PRD #95 unblocks on **D** landing.
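
The "≥36h absence of daily mail on any host" rule from the operational notes can be expressed as a small check. This is a sketch under assumptions, not part of the role: the function name, the shape of the input mapping, and how last-seen timestamps would be gathered (for example from the admin inbox) are all hypothetical. Only the 36-hour threshold and the host names come from this ADR.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the operational rule; everything else is illustrative.
MAIL_SILENCE_THRESHOLD = timedelta(hours=36)


def silent_hosts(last_mail_seen, now, threshold=MAIL_SILENCE_THRESHOLD):
    """Return, sorted, the hosts whose latest daily mail is at least `threshold` old.

    `last_mail_seen` maps host name -> timezone-aware datetime of the most
    recent daily mail received from that host (hypothetical input shape).
    """
    return sorted(
        host for host, seen in last_mail_seen.items()
        if now - seen >= threshold
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    observed = {
        "input-b": now - timedelta(hours=40),
        "input-c.staging": now - timedelta(hours=3),
    }
    for host in silent_hosts(observed, now):
        print(f"substrate-critical: no daily mail from {host} for >= 36h")
```

Because mail is the only path until Tier 1 lands, a check like this cannot report its findings by mail alone; it would have to be read by a human or run from somewhere that does not depend on the host's own postfix.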