# ADR-0003 — Monitoring alert substrate

**Status.** Proposed, 2026-05-12. Revised 2026-05-13: Tier 1 deferred to monitoring framework PRD (#158; see "Revision: 2026-05-13" below).

**Related.** ccatobs/system-integration#112 (scoping issue), ADR-0002 (parent substrate decision), ccatobs/system-integration#95 (PRD; Phase E gating), ccatobs/system-integration#155 (closed — Tier 1 provisioning deferred), ccatobs/system-integration#158 (monitoring framework PRD scoping — where the Tier 1 channel decision lands).

**Supersedes (partial).** ADR-0002 §"Decision: Alert substrate" Tier 1 wiring — see "Decision: Tier 1 — deferred" below. ADR-0002's Grafana-driven T1 is rejected; the replacement is deferred rather than committed.

## Revision: 2026-05-13

After #154 landed (Tier 2 mail), the team re-evaluated the cost/benefit of immediately committing to a Tier 1 channel pre-Phase-E. Conclusion: defer.

- **Pre-production state.** No on-call rotation; no 24/7 expectation. Dual-path redundancy is over-engineered for current operations.
- **Tier 2 mail is itself TLS-independent.** Each host exits via postfix → `smtp.uni-koeln.de` (smarthost), outside the cert plane this substrate monitors. The load-bearing property "substrate alerts survive cert-plane failure" is satisfied by mail alone.
- **Premature substrate lock-in.** The monitoring framework PRD scoping issue (#158) is already open. Provisioning a Matrix bot and rooms now risks becoming a sunk cost if that PRD selects a different paging channel (PagerDuty, Healthchecks.io, Grafana-native Matrix when it ships, etc.).
- **Single-path risk is acknowledged and bounded.** Until #158's PRD lands, any mail-flow anomaly is treated as substrate-critical. The risk window narrows naturally as the (week-old) postfix install proves out.

Decisions §"Tier 1", §"Severity model + routing", §"Validation gate", and §"Implementation" are adjusted accordingly. Tier 2 (MTA) and the broader two-layer model are unchanged.

## Context

Phase A of PRD #95 is code-complete but Check 8 (page-path E2E sign-off) is blocked. ADR-0002 specified a tiered alert substrate (T1 Telegraf→InfluxDB→Grafana→Matrix, T2 cron+mailx, T3 journald audit). Two issues block Check 8:

- **The T1 path stops at Grafana.** `grafana/provisioning/<env>/` has only `datasources/` + `dashboards/` — no alert rules, no contact points, no notification policies provisioned anywhere.
- **The T2 path is broken.** `roles/system_setup/tasks/sendmail.yml` configures `/etc/aliases` only; no MTA is started, so `mailx` fails with `postdrop: unable to look up public/pickup` on every input-* host. Mail never leaves.

A deeper issue surfaces on closer reading: ADR-0002's T1 path is shared-fate with the very TLS chain it monitors. Telegraf → InfluxDB → Grafana run on input-b and depend on TLS. If the cert plane fails, T1 likely fails with it. The substrate's load-bearing property — TLS-independent paging — is not actually satisfied by Grafana-driven T1.

This ADR addresses both gaps. Scope is intentionally tight: substrate-layer (cert plane) alerts only. Non-substrate signals (disk, containers, application health) ride on a Grafana alerting layer designed in a separate follow-up.

## Decision

Two committed decisions plus one deferral:

1. **Tier 1 is deferred to the monitoring framework PRD (#158).** Source-push (systemd `OnFailure=` → direct Matrix POST) was designed and remains the leading candidate, but channel choice is re-evaluated as part of #158's broader scope. For Phase E cutover, substrate-layer alerts ride on Tier 2 mail alone. This is acceptable given the pre-production state and because Tier 2 mail is itself TLS-independent (smarthost exit, outside the cert plane). The design is captured below as "Decision: Tier 1 — deferred" for future reference.
2. **MTA is postfix null-client + `smtp.uni-koeln.de` smarthost.** Extend `roles/system_setup/tasks/sendmail.yml` with package install, service enable, minimal `main.cf`, `newaliases`. Each host runs its own MTA — no central relay (preserves "no shared infra dependency" property).
3. **Two-layer model.** This ADR establishes the substrate-layer pattern. Follow-up #158 scopes the full monitoring framework (Tier 1 channel decision + Grafana alerting layer for non-substrate signals).

---

## Decision: Tier 1 — deferred

### Status

**Deferred** to the monitoring framework PRD (#158, 2026-05-13 revision). For Phase E, substrate alerts ride on Tier 2 mail alone. The design captured below documents what was considered and remains the leading candidate when #158 revisits the question; it is **not** currently implemented and the issues that would have implemented it (B and C in the implementation graph) are closed/descoped.

### Design considered (not implemented)

Each renewal unit (`step-renew-x509@%i.service` and its SSH-cert-plane equivalent) would gain `OnFailure=ccat-cert-matrix@%i.service` alongside the existing `OnFailure=ccat-cert-mail@%i.service`. The new unit would invoke `/usr/local/bin/ccat-cert-matrix.sh %i`, which POSTs to Matrix using the proven `ci/notify_matrix.py` pattern: `PUT <homeserver>/_matrix/client/v3/rooms/{room}/send/m.room.message/{txn_id}` with `Authorization: Bearer <bot_token>` and an `m.notice` body containing host, service, unit, and the most recent `journalctl` excerpt.
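
For reference when #158 revisits this design, the request shape described above can be sketched in Python. This is a minimal illustration, not the contents of `ci/notify_matrix.py`: the function name, argument names, and message wording are hypothetical, and the request is only built, never sent.

```python
import json
import urllib.parse
import urllib.request


def build_matrix_notice(homeserver, room_id, bot_token, txn_id,
                        host, service, unit, excerpt):
    """Build (but do not send) the PUT request for one m.notice event.

    Hypothetical helper: names and message wording are illustrative only.
    """
    # Room IDs look like "!abc:example.org"; "!" and ":" must be
    # percent-encoded when the ID is used as a URL path segment.
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),
        urllib.parse.quote(txn_id, safe=""),
    )
    body = {
        "msgtype": "m.notice",
        "body": "[{}] {} ({}) failed\n{}".format(host, service, unit, excerpt),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": "Bearer " + bot_token,
            "Content-Type": "application/json",
        },
    )

# Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=10).
```

The `txn_id` is part of the URL, so a caller would need a value that is unique per event for the same access token; how that value is derived is left open here.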

### Why source-push was preferred over Grafana-driven (retained for future PRD)

- **TLS-independent end-to-end.** T1 path is `systemd → script → one HTTP call`. No Telegraf, no InfluxDB, no Grafana hop. Survives catastrophic failure of the entire observability stack.
- **Symmetric with T2 mail.** Both fire from the same `OnFailure=` trigger via the same script-template pattern. One mental model, two destinations.
- **Zero Grafana templating complexity.** No webhook contact point, no body template, no txn_id derivation in YAML.
- **`:latest` Grafana future-compat.** If Grafana ever ships native Matrix support, it's adopted at the Grafana-alerting-layer, not retrofitted into the substrate.

### Alternatives considered (retained for future PRD)

- **Grafana webhook → Matrix client-server API.** Non-trivial body template in YAML; txn_id-as-URL-segment has a small theoretical collision window; T1 inherits Grafana's shared-fate with TLS infrastructure. Appropriate for a future Grafana-alerting-layer for non-substrate signals.
- **Thin webhook shim service.** Adds a new service to docker-compose with its own monitoring concerns ("who monitors the alert shim?"). Fewer moving parts via source-push.
- **matrix-hookshot bridge.** Full bot infrastructure; overkill for substrate alerts.
- **Non-Matrix channels** (added 2026-05-13). PagerDuty, Healthchecks.io dead-man's-switch, or a future Grafana-native Matrix channel are open candidates for #158.

### Consequences of deferral

- No new template files in cert roles; `ccat-cert-mail@.service.j2` / `ccat-cert-mail.sh.j2` remain the sole `OnFailure=` wiring.
- No monitoring bot user, no new Matrix rooms, no `vault_matrix_monitoring_*` schema entries.
- **Single-path risk.** Substrate alerts depend solely on the Tier 2 mail path. Operational rule (see ADR §"Operational notes"): any mail-flow anomaly is substrate-critical until #158's PRD lands a Tier 1.

---

## Decision: Tier 2 — postfix null-client MTA

### Decision

Extend `roles/system_setup/tasks/sendmail.yml`:

- Install `postfix`, `s-nail` (provides `/usr/bin/mailx`), and on RHEL 9 `postfix-lmdb` (RHEL 10's base postfix already ships lmdb).
- Drop minimal `/etc/postfix/main.cf` from a Jinja template:
  - `relayhost = {{ smtp_relayhost }}`
  - `inet_interfaces = loopback-only`
  - `mydestination = $myhostname, localhost.$mydomain, localhost`
  - `myhostname = {{ ansible_host }}`
  - `myorigin = $myhostname`
  - `alias_database = lmdb:/etc/aliases`
  - `alias_maps = lmdb:/etc/aliases`
- Enable + start `postfix.service`.
- Route system service mail to root via `/etc/aliases`: `logcheck`, `logwatch`, `postmaster`, `mailer-daemon`, `abuse` → `root`; then `root` → `admin_email_addresses`.
- Rebuild `/etc/aliases.lmdb` via `newaliases` as an idempotent task (stat-driven, not a handler) so the role self-heals from any stuck state.

Add group var `smtp_relayhost: smtp.uni-koeln.de` at `group_vars/all/`, overridable per-env or per-host. No auth required for ITCC-internal mail submission from on-campus hosts; if that changes, a `vault_smtp_relay_password` vault entry can be added in a future change.
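
The two-stage alias chain from the task list above (service accounts → `root`, then `root` → admins) can be sketched as follows. This is an illustrative Python rendering only; the role itself manages `/etc/aliases` through Ansible, and `render_aliases` and the example addresses are hypothetical.

```python
SERVICE_ACCOUNTS = ["logcheck", "logwatch", "postmaster", "mailer-daemon", "abuse"]


def render_aliases(service_accounts, admin_addresses):
    """Return /etc/aliases text: service accounts -> root, then root -> admins."""
    if not admin_addresses:
        # Without a final hop, mail for root would stay on the host.
        raise ValueError("root must forward to at least one admin address")
    lines = ["{}: root".format(name) for name in service_accounts]
    lines.append("root: " + ", ".join(admin_addresses))
    return "\n".join(lines) + "\n"
```

Calling `render_aliases(SERVICE_ACCOUNTS, ["a@example.org", "b@example.org"])` yields five `<account>: root` lines followed by a single `root:` line listing both addresses, which is the shape `newaliases` then compiles into `/etc/aliases.lmdb`.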

### Implementation notes (post-landing, 2026-05-12)

The directives above are the corrected forms after running the role against both environments. The original prescription (`mydestination =` empty, `myhostname = {{ inventory_hostname }}.uni-koeln.de`, hash-based aliases DB, postfix-only package list) did not survive contact with the live hosts:

- **`mydestination =` (empty) bypasses `/etc/aliases`.** With no domain considered local, `local(8)` never runs, so root → admins rewriting is silently skipped and mail ships to a non-existent destination domain.
- **`{{ inventory_hostname }}.uni-koeln.de` is not the host's real FQDN.** Inventory shorthand (`input-c.staging`) plus `.uni-koeln.de` produces a string that does not resolve in DNS. The template pulls from `ansible_host` (the real inventory FQDN) instead. `ansible_fqdn` is not safe to use either: kernel hostnames on production input nodes are intentionally short (per host_vars), so `hostname --fqdn` does not return a usable FQDN there. Convention for this role and any future ones: use `ansible_host`, never `ansible_fqdn`.
- **`hash` is gone on RHEL 10.** RHEL 10's postfix dropped the Berkeley DB hash map driver entirely (postfix's own diagnostic falsely suggests a `postfix-hash` subpackage; that package does not exist). Switched to `lmdb`. RHEL 9 supports both but ships `lmdb` as a separate `postfix-lmdb` subpackage; the role installs it conditionally on RHEL major ≤ 9.
- **`s-nail` is the mailx provider.** Production was a barer install than staging and lacked `mailx`. Added to the role package list so the dependency is explicit and colocated with its consumer.
- **System service aliases.** `logcheck`, `logwatch`, `postmaster`, `mailer-daemon`, `abuse` are all aliased to `root` so their output folds into the same root → admin chain. Before these aliases were added, ~218 days of stranded logcheck reports were flushed to the relay the moment postfix started.
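
The `main.cf` pitfalls listed above lend themselves to a small lint. The following is a sketch under stated assumptions, not part of the role: `parse_main_cf` and `lint_main_cf` are hypothetical helpers, and the parser handles only plain `name = value` lines plus whitespace-led continuation lines.

```python
def parse_main_cf(text):
    """Parse 'name = value' lines; indented lines continue the previous value."""
    params = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # blank line or comment
        if raw[0] in " \t" and current is not None:
            params[current] = (params[current] + " " + raw.strip()).strip()
            continue
        name, sep, value = raw.partition("=")
        if not sep:
            continue  # not a parameter line; ignored in this sketch
        current = name.strip()
        params[current] = value.strip()
    return params


def lint_main_cf(text):
    """Return a list of findings for the pitfalls recorded in this ADR."""
    params = parse_main_cf(text)
    problems = []
    if not params.get("mydestination"):
        problems.append("mydestination is empty: /etc/aliases is bypassed")
    for name in ("alias_maps", "alias_database"):
        if params.get(name, "").startswith("hash:"):
            problems.append(name + " uses hash:, unavailable on RHEL 10; use lmdb:")
    if not params.get("relayhost"):
        problems.append("relayhost is unset: mail would not use the smarthost")
    if params.get("inet_interfaces") != "loopback-only":
        problems.append("inet_interfaces is not loopback-only")
    return problems
```

A check like this could run against the rendered template in CI or as a post-task assertion; whether that is worth wiring in is left to the role's maintainers.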

### Why per-host MTA, not central relay

The substrate's entire purpose is "no shared infrastructure dependency". A central relay host would re-introduce a single point of failure that the very alert path depends on.

### Consequences

- `roles/system_setup` gains a postfix dependency on every host. Cost: ~5 MB memory, one daemon. Acceptable.
- `ccat-cert-mail.sh` and `ccat-cert-heartbeat.sh` start delivering mail immediately on every host after the role runs.
- Operational rule "absence of daily mail ≥36h on any host = problem" becomes enforceable for the first time.
- REUNA dropped from scope — was a deprecated test VM. Future observatory site MTA hookup is a separate concern when that infrastructure lands.
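
The "absence of daily mail ≥36h" rule above reduces to a simple staleness check. The sketch below assumes some hypothetical source (for example, a scan of the admin mailbox) that yields the timestamp of the last mail seen per host; only the 36-hour threshold comes from this ADR.

```python
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(hours=36)


def silent_hosts(last_mail_seen, now, max_silence=MAX_SILENCE):
    """Return hosts whose most recent daily mail is at least max_silence old.

    last_mail_seen maps host name -> timezone-aware datetime of the last
    mail, or None if no mail has ever been seen from that host.
    """
    stale = []
    for host, seen in sorted(last_mail_seen.items()):
        if seen is None or now - seen >= max_silence:
            stale.append(host)
    return stale
```

A host that has never sent mail is treated as silent, which matches the intent that a missing heartbeat is itself the signal.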

---

## Decision: Severity model + routing

### Decision

- **Severity = tier subscription, not a label.** Substrate signals currently subscribe to T2 only (mail). When #158's PRD lands a Tier 1 channel, substrate signals will subscribe to both T1 and T2. Non-substrate signals (future) may subscribe to one or both depending on urgency.
- **Unified mail alias** (`admin_email_addresses` — current: `buchbend@ph1.uni-koeln.de, ngo@ph1.uni-koeln.de`). Staging mail uses `[STAGING]` subject prefix for inbox disambiguation.
- **Tier 1 routing deferred.** The original section enumerated two new Matrix rooms (`#ccat-monitoring-prod`, `#ccat-monitoring-staging`) and three vault entries (`vault_matrix_monitoring_bot_token`, `vault_matrix_monitoring_room_id_prod`, `vault_matrix_monitoring_room_id_staging`). All deferred to #158 where channel choice is re-evaluated. No vault schema changes land with this ADR.
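
One way to picture "severity = tier subscription" is as a set intersection: a signal declares the tiers it wants, and only the tiers that exist today actually fire. The sketch below is illustrative; the `Signal` type, `active_tiers`, and the subject wording after the `[STAGING]` prefix are hypothetical and not taken from the existing scripts.

```python
from dataclasses import dataclass

AVAILABLE_TIERS = frozenset({"T2"})  # T1 is deferred to #158


@dataclass(frozen=True)
class Signal:
    name: str
    tiers: frozenset  # tiers this signal subscribes to, e.g. {"T1", "T2"}


def active_tiers(signal, available=AVAILABLE_TIERS):
    """Tiers that fire today: the subscription intersected with what exists."""
    return sorted(signal.tiers & available)


def mail_subject(env, host, unit):
    """Prefix staging mail so both environments can share one alias."""
    prefix = "[STAGING] " if env == "staging" else ""
    return "{}{}: {} failed".format(prefix, host, unit)
```

Under this model, a substrate signal declared with `{"T1", "T2"}` resolves to `["T2"]` today and picks up T1 once a channel lands, without any change to the signal definition.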

### Why no severity label

The team has no on-call rotation today. A 3-level taxonomy (critical/warning/info) implies a hierarchy that doesn't exist. Tier subscription expresses what we actually mean: "is this paging or FYI?". Layer in labels when an on-call rotation is decided in the follow-up framework PRD.

### Why unified mail alias

Two humans on the alias; subject-line prefix is sufficient disambiguation. Splitting aliases is overhead until the alias grows past 4 humans.

### On separating staging vs prod Tier 1 rooms (retained for future PRD)

When Tier 1 lands, staging noise should not mix into the production paging room (desensitisation risk within weeks). The two-room split remains the recommended shape; only the *channel substrate* is open.

---

## Decision: Grafana provisioning approach

### Decision

Grafana provisioning continues with the existing declarative YAML pattern: `grafana/provisioning/<env>/{datasources,dashboards}/`. **No `alerting/` subdir is created in this ADR's scope.** Tier 1 is deferred to #158, and its leading candidate is source-push rather than Grafana-driven, so nothing in this ADR requires Grafana alert provisioning. The Grafana alerting layer for non-substrate signals is designed in the follow-up framework PRD; if/when it lands, it will use the same `grafana/provisioning/<env>/alerting/` declarative-YAML shape.

### Authoring workflow (informational)

Authors draft dashboards/alerts in the Grafana UI against provisioned datasources, then formalise the JSON/YAML back into the provisioning tree before committing. Drift between UI state and YAML is accepted at authoring time and resolved at commit time.

---

## Decision: Validation gate

Check 8 (PRD #95 / runbook) signs off page-path E2E on the canary cert on `input-c.staging`:

- **Induce primary (network partition):** `sudo iptables -I OUTPUT -d <step-ca-IP> -p tcp --dport 9000 -j DROP` + `sudo systemctl start step-renew-x509-x509-canary.service`. Renewal should fail within ~30s.
- **Expected:** Within ~60s, `admin_email_addresses` receives mail with `[STAGING]` subject prefix.
- **Reset:** remove iptables rule.
- **Sanity sibling (non-network):** relocate canary key file (`sudo mv /opt/x509-canary/canary.key /opt/x509-canary/canary.key.bak`), re-trigger renewal, confirm mail fires on the script error path. Reset: `mv` back.

Tier 1 / Matrix half of dual-path validation is deferred along with the Tier 1 channel decision; Check 8 currently verifies the mail path only.

**Procedure caveat (observed 2026-05-13 during #157 sign-off).** `step ca renew` short-circuits when the cert is not yet inside its renewal window, so the primary induce method (iptables DROP + `systemctl start`) does **not** fire `OnFailure=` on a freshly-issued canary — the renewal exits success before reaching the network call. The sanity-sibling test is the load-bearing evidence until the procedure is refined. Refinement (mirror `--force` from #151) tracked in #168.
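
The caveat can be modelled as a guard that runs before any network call. The sketch below is an assumption-laden illustration: it treats the renewal window as a fixed duration before expiry (the actual rule used by the renewal unit is not stated in this ADR), and `renewal_attempts_network` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def renewal_attempts_network(not_after, now, renew_window, force=False):
    """True if the renewal would reach the CA, and so could fail.

    Assumption: the cert is inside its renewal window once the time left
    before not_after is at most renew_window. force skips that guard.
    """
    if force:
        return True
    return not_after - now <= renew_window
```

With a freshly issued canary the function returns `False`, so a blocked CA port is never exercised and `OnFailure=` has nothing to react to; forcing the renewal is what makes the network-partition test meaningful.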

---

## Out of scope

Deferred to the follow-up scoping issue #158 (*"Scope monitoring framework PRD for non-substrate signals"*):

- Inventory of all current telemetry (Telegraf inputs, container metrics, log sources).
- Grafana alerting layer design for non-substrate signals.
- Severity granularity beyond tier subscription.
- On-call hand-off contract (acks, escalation, MTTR).
- Concrete non-substrate anchor signals (disk fill, container restart loops, replication lag, etc.).
- Tier 3 issuance-audit alerting (Loki rules per ADR-0002 §Monitoring).
- Dashboards-as-code maturity (UI/JSON export workflow formalisation).

## Operational notes

- The role task-ordering bug from finding 3 of #112's 2026-05-08 verification comment is resolved separately in **PR #152**. This ADR does not re-fix it.
- **Mail is single-path for substrate alerts** until #158's PRD lands a Tier 1. Treat any mail-flow anomaly (queue stall, ≥36h absence of daily mail on any host, SMTP submission failure logs) as substrate-critical. Re-evaluate this rule once Tier 1 lands.
- If `smtp.uni-koeln.de` requires auth in the future, add `vault_smtp_relay_password` to the schema and `smtp_sasl_password_maps` to the postfix template. Backward-compat: empty/missing vault var = no auth (current behaviour).
- When #158's PRD revisits Tier 1: if the chosen channel is Matrix, bot-token rotation will be a schema-driven `ccat secrets set` + `ccat secrets provision` flow; if a different substrate is chosen, the rotation surface attaches to whatever the PRD defines.
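
The backward-compatibility rule for relay auth noted above (empty or missing vault var means no auth) could look like the following sketch. The helper name and the password-map path are hypothetical; the three `smtp_sasl_*` parameters are standard postfix settings, but the exact set the template would need is an assumption.

```python
def sasl_directives(relay_password, password_map="lmdb:/etc/postfix/sasl_passwd"):
    """Return extra main.cf lines; empty when no relay password is configured."""
    if not relay_password:
        return []  # current behaviour: unauthenticated submission
    return [
        "smtp_sasl_auth_enable = yes",
        "smtp_sasl_password_maps = " + password_map,
        "smtp_sasl_security_options = noanonymous",
    ]
```

Returning an empty list for both `None` and `""` keeps hosts without the vault entry on exactly the configuration they have today.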

## Implementation

Issues that originally spun off this ADR. Original graph was `A ∥ B → C → D`; revised graph is `A → D`:

- **A** — MTA install (`system_setup/sendmail.yml` extension). **Landed via #154.**
- **B** — Matrix monitoring bot + rooms provisioning. **Closed #155 (2026-05-13)**; re-emerges in #158.
- **C** — Matrix push wire-up for `step_ca_service_cert` and `ssh_service_cert` roles. **Descoped**; existing `OnFailure=ccat-cert-mail@%i.service` is sufficient for Phase E.
- **D** — Check 8 page-path E2E sign-off on `input-c.staging` (mail-only).

Phase E of PRD #95 unblocks on **D** landing.