ADR-0003 — Monitoring alert substrate
Status. Proposed, 2026-05-12. Revised 2026-05-13: Tier 1 deferred to monitoring framework PRD (#158; see “Revision: 2026-05-13” below).
Related. ccatobs/system-integration#112 (scoping issue), ADR-0002 (parent substrate decision), ccatobs/system-integration#95 (PRD; Phase E gating), ccatobs/system-integration#155 (closed — Tier 1 provisioning deferred), ccatobs/system-integration#158 (monitoring framework PRD scoping — where the Tier 1 channel decision lands).
Supersedes (partial). ADR-0002 §“Decision: Alert substrate” Tier 1 wiring — see “Decision: Tier 1 — deferred” below. ADR-0002’s Grafana-driven T1 is rejected; the replacement is deferred rather than committed.
Revision: 2026-05-13
After #154 landed (Tier 2 mail), the team re-evaluated the cost/benefit of immediately committing to a Tier 1 channel pre-Phase-E. Conclusion: defer.
Pre-production state. No on-call rotation; no 24/7 expectation. Dual-path redundancy is over-engineered for current operations.
Tier 2 mail is itself TLS-independent. Each host exits via postfix → smtp.uni-koeln.de (smarthost), outside the cert plane this substrate monitors. The load-bearing property “substrate alerts survive cert-plane failure” is satisfied by mail alone.
Premature substrate lock-in. The monitoring framework PRD scoping issue (#158) is already open. Provisioning Matrix bot + rooms now risks being sunk cost if that PRD selects a different paging channel (PagerDuty, Healthchecks.io, Grafana-native Matrix when it ships, etc.).
Single-path risk is acknowledged and bounded. Until #158’s PRD lands, any mail-flow anomaly is treated as substrate-critical. The risk window narrows naturally as the (week-old) postfix install proves out.
Decisions §“Tier 1”, §“Severity model + routing”, §“Validation gate”, and §“Implementation” are adjusted accordingly. Tier 2 (MTA) and the broader two-layer model are unchanged.
Context
Phase A of PRD #95 is code-complete but Check 8 (page-path E2E sign-off) is blocked. ADR-0002 specified a tiered alert substrate (T1 Telegraf→InfluxDB→Grafana→Matrix, T2 cron+mailx, T3 journald audit). Two issues block Check 8:
The T1 path stops at Grafana. grafana/provisioning/<env>/ has only datasources/ + dashboards/ — no alert rules, no contact points, no notification policies provisioned anywhere.
The T2 path is broken. roles/system_setup/tasks/sendmail.yml configures /etc/aliases only; no MTA is started, so mailx fails with postdrop: unable to look up public/pickup on every input-* host. Mail never leaves.
A deeper issue surfaces on closer reading: ADR-0002’s T1 path is shared-fate with the very TLS chain it monitors. Telegraf → InfluxDB → Grafana run on input-b and depend on TLS. If the cert plane fails, T1 likely fails with it. The substrate’s load-bearing property — TLS-independent paging — is not actually satisfied by Grafana-driven T1.
This ADR addresses both gaps. Scope is intentionally tight: substrate-layer (cert plane) alerts only. Non-substrate signals (disk, containers, application health) ride on a Grafana alerting layer designed in a separate follow-up.
Decision
Two committed decisions plus one deferral:
Tier 1 is deferred to the monitoring framework PRD (#158). Source-push (systemd OnFailure= → direct Matrix POST) was designed and remains the leading candidate, but channel choice is re-evaluated as part of #158’s broader scope. For Phase E cutover, substrate-layer alerts ride on Tier 2 mail alone. Acceptable given pre-production state and the fact that Tier 2 mail is itself TLS-independent (smarthost exit, outside the cert plane). Design captured below as “Decision: Tier 1 — deferred” for future reference.
MTA is postfix null-client + smtp.uni-koeln.de smarthost. Extend roles/system_setup/tasks/sendmail.yml with package install, service enable, minimal main.cf, newaliases. Each host runs its own MTA — no central relay (preserves “no shared infra dependency” property).
Two-layer model. This ADR establishes the substrate-layer pattern. Follow-up #158 scopes the full monitoring framework (Tier 1 channel decision + Grafana alerting layer for non-substrate signals).
Decision: Tier 1 — deferred
Status
Deferred to the monitoring framework PRD (#158, 2026-05-13 revision). For Phase E, substrate alerts ride on Tier 2 mail alone. The design captured below documents what was considered and remains the leading candidate when #158 revisits the question; it is not currently implemented and the issues that would have implemented it (B and C in the implementation graph) are closed/descoped.
Design considered (not implemented)
Each renewal unit (step-renew-x509@%i.service and the SSH-cert-plane equivalent) would gain OnFailure=ccat-cert-matrix@%i.service alongside the existing OnFailure=ccat-cert-mail@%i.service. The new unit would invoke /usr/local/bin/ccat-cert-matrix.sh %i, which POSTs to Matrix using the proven ci/notify_matrix.py pattern: PUT <homeserver>/_matrix/client/v3/rooms/{room}/send/m.room.message/{txn_id} with Authorization: Bearer <bot_token> and an m.notice body containing host, service, unit, and the latest journalctl excerpt.
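The request shape described above can be sketched in Python. This is an illustration only, not the contents of ci/notify_matrix.py or of the unimplemented ccat-cert-matrix.sh: the function name, its parameters, and the message layout are assumptions, and the function only builds the request without sending it.

```python
import json
import urllib.parse
import urllib.request
import uuid


def build_matrix_notice(homeserver, room_id, bot_token, host, unit,
                        journal_excerpt, txn_id=None):
    """Build (but do not send) the PUT request for an m.notice message.

    All parameter names are hypothetical; only the URL shape, the Bearer
    header, and the m.notice msgtype come from the design text.
    """
    # A fresh transaction ID per message keeps the PUT idempotent on retry.
    txn_id = txn_id or uuid.uuid4().hex
    # Room IDs contain '!' and ':' and must be percent-encoded in the path.
    room = urllib.parse.quote(room_id, safe="")
    url = (f"{homeserver.rstrip('/')}/_matrix/client/v3/rooms/{room}"
           f"/send/m.room.message/{txn_id}")
    body = {
        "msgtype": "m.notice",
        "body": f"[{host}] {unit} failed\n{journal_excerpt}",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {bot_token}",
            "Content-Type": "application/json",
        },
    )
```

Sending would be a single urllib.request.urlopen(request, timeout=10) call, which is the "one HTTP call" property the design relies on.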
Why source-push was preferred over Grafana-driven (retained for future PRD)
TLS-independent end-to-end. T1 path is systemd → script → one HTTP call. No Telegraf, no InfluxDB, no Grafana hop. Survives catastrophic failure of the entire observability stack.
Symmetric with T2 mail. Both fire from the same OnFailure= trigger via the same script-template pattern. One mental model, two destinations.
Zero Grafana templating complexity. No webhook contact point, no body template, no txn_id derivation in YAML.
:latest Grafana future-compat. If Grafana ever ships native Matrix support, it’s adopted at the Grafana-alerting-layer, not retrofitted into the substrate.
Alternatives considered (retained for future PRD)
Grafana webhook → Matrix client-server API. Non-trivial body template in YAML; txn_id-as-URL-segment has a small theoretical collision window; T1 inherits Grafana’s shared-fate with TLS infrastructure. Appropriate for a future Grafana-alerting-layer for non-substrate signals.
Thin webhook shim service. Adds a new service to docker-compose with its own monitoring concerns (“who monitors the alert shim?”). Source-push has fewer moving parts.
matrix-hookshot bridge. Full bot infrastructure; overkill for substrate alerts.
Non-Matrix channels (added 2026-05-13). PagerDuty, Healthchecks.io dead-man’s-switch, or a future Grafana-native Matrix channel are open candidates for #158.
Consequences of deferral
No new template files in cert roles; ccat-cert-mail@.service.j2 / ccat-cert-mail.sh.j2 remain the sole OnFailure= wiring.
No monitoring bot user, no new Matrix rooms, no vault_matrix_monitoring_* schema entries.
Single-path risk. Substrate alerts depend solely on the Tier 2 mail path. Operational rule (see ADR §“Operational notes”): any mail-flow anomaly is substrate-critical until #158’s PRD lands a Tier 1.
Decision: Tier 2 — postfix null-client MTA
Decision
Extend roles/system_setup/tasks/sendmail.yml:
Install postfix, s-nail (provides /usr/bin/mailx), and on RHEL 9 postfix-lmdb (RHEL 10’s base postfix already ships lmdb).
Drop minimal /etc/postfix/main.cf from a Jinja template:
    relayhost = {{ smtp_relayhost }}
    inet_interfaces = loopback-only
    mydestination = $myhostname, localhost.$mydomain, localhost
    myhostname = {{ ansible_host }}
    myorigin = $myhostname
    alias_database = lmdb:/etc/aliases
    alias_maps = lmdb:/etc/aliases
Enable + start postfix.service.
Route system service mail to root via /etc/aliases: logcheck, logwatch, postmaster, mailer-daemon, abuse → root; then root → admin_email_addresses.
Rebuild /etc/aliases.lmdb via newaliases as an idempotent task (stat-driven, not a handler) so the role self-heals from any stuck state.
Add group var smtp_relayhost: smtp.uni-koeln.de at group_vars/all/, overridable per-env or per-host. No auth required for ITCC-internal mail submission from on-campus hosts; if that changes, a vault_smtp_relay_password vault entry can be added in a future change.
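To make the effect of the two template variables and the optional relay authentication concrete, the following Python sketch renders the same main.cf directives. It is not the role's Jinja template: the function name, the relay_auth flag, the example FQDN, and the /etc/postfix/sasl_passwd map path are assumptions chosen for illustration.

```python
def render_main_cf(smtp_relayhost, ansible_host, relay_auth=False):
    """Render a null-client main.cf as text.

    relay_auth=False mirrors the current no-auth behaviour; the SASL
    block is only a guess at what a future authenticated relay would need.
    """
    settings = [
        ("relayhost", smtp_relayhost),
        ("inet_interfaces", "loopback-only"),
        ("mydestination", "$myhostname, localhost.$mydomain, localhost"),
        ("myhostname", ansible_host),
        ("myorigin", "$myhostname"),
        ("alias_database", "lmdb:/etc/aliases"),
        ("alias_maps", "lmdb:/etc/aliases"),
    ]
    if relay_auth:
        settings += [
            ("smtp_sasl_auth_enable", "yes"),
            # Hypothetical map location; the ADR does not name one.
            ("smtp_sasl_password_maps", "lmdb:/etc/postfix/sasl_passwd"),
            ("smtp_sasl_security_options", "noanonymous"),
        ]
    return "".join(f"{key} = {value}\n" for key, value in settings)


# "input-c.example.org" stands in for a real inventory FQDN.
print(render_main_cf("smtp.uni-koeln.de", "input-c.example.org"))
```

Keeping the authenticated variant behind a flag that defaults to off matches the stated backward-compatibility goal: hosts without a vault entry render exactly the directives listed above.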
Implementation notes (post-landing, 2026-05-12)
The directives above are the corrected forms after running the role against both environments. The original prescription (mydestination = empty, myhostname = {{ inventory_hostname }}.uni-koeln.de, hash-based aliases DB, postfix-only package list) did not survive contact with the live hosts:
mydestination = (empty) bypasses /etc/aliases. With no domain considered local, local(8) never runs, so root → admins rewriting is silently skipped and mail ships to a non-existent destination domain.
{{ inventory_hostname }}.uni-koeln.de is not the host’s real FQDN. Inventory shorthand (input-c.staging) plus .uni-koeln.de produces a string that does not resolve in DNS. The role pulls from ansible_host (the real inventory FQDN) instead. ansible_fqdn is not safe to use either — kernel hostnames on production input nodes are intentionally short (per host_vars), so hostname --fqdn does not return a usable FQDN. Convention for this role and any future ones: use ansible_host, never ansible_fqdn.
hash is gone on RHEL 10. RHEL 10’s postfix dropped the Berkeley DB hash map driver entirely (postfix’s own diagnostic falsely suggests a postfix-hash subpackage; that package does not exist). Switched to lmdb. RHEL 9 supports both but ships lmdb as a separate postfix-lmdb subpackage; the role installs it conditionally on RHEL major ≤ 9.
s-nail is the mailx provider. Production was a barer install than staging and lacked mailx. Added to the role package list so the dependency is explicit and colocated with its consumer.
System service aliases. logcheck, logwatch, postmaster, mailer-daemon, abuse are all aliased to root so their output folds into the same root → admin chain. Without these aliases, ~218 days of stranded logcheck reports were flushed to the relay the moment postfix started.
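The alias chain in the last note (service account → root → admin_email_addresses) can be modelled with a small resolver. This is a simplified sketch of alias expansion, not postfix's local(8) implementation; the function name and the example.org addresses are placeholders.

```python
def resolve_alias(name, aliases, _path=()):
    """Expand an alias to its final recipients, following chained entries."""
    if name in _path:
        raise ValueError(f"alias loop: {' -> '.join(_path + (name,))}")
    targets = aliases.get(name)
    if targets is None:
        # Not an alias: treat it as a final recipient.
        return [name]
    resolved = []
    for target in targets:
        for recipient in resolve_alias(target, aliases, _path + (name,)):
            if recipient not in resolved:
                resolved.append(recipient)
    return resolved


# Placeholder addresses standing in for admin_email_addresses.
ADMIN_EMAIL_ADDRESSES = ["admin-one@example.org", "admin-two@example.org"]

ALIASES = {
    "logcheck": ["root"],
    "logwatch": ["root"],
    "postmaster": ["root"],
    "mailer-daemon": ["root"],
    "abuse": ["root"],
    "root": ADMIN_EMAIL_ADDRESSES,
}

print(resolve_alias("logcheck", ALIASES))
```

Every system account resolves to the same recipient list as root, which is the "folds into the same chain" property the note describes.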
Why per-host MTA, not central relay
The substrate’s entire purpose is “no shared infrastructure dependency”. A central relay host would re-introduce a single point of failure that the very alert path depends on.
Consequences
roles/system_setup gains a postfix dependency on every host. Cost: ~5MB memory, one daemon. Acceptable.
ccat-cert-mail.sh and ccat-cert-heartbeat.sh start delivering mail immediately on every host after the role runs.
Operational rule “absence of daily mail ≥36h on any host = problem” becomes enforceable for the first time.
REUNA dropped from scope — was a deprecated test VM. Future observatory site MTA hookup is a separate concern when that infrastructure lands.
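The ≥36h rule lends itself to a mechanical check. The sketch below assumes that the timestamp of the last heartbeat mail per host is available from somewhere (for example, a mailbox scan); that data source, the function name, and the host list are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=36)


def stale_hosts(last_mail_by_host, now, threshold=STALE_AFTER):
    """Return hosts whose last daily mail is missing or at least 36h old."""
    return sorted(
        host
        for host, last_seen in last_mail_by_host.items()
        if last_seen is None or now - last_seen >= threshold
    )


now = datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)
last_mail = {
    "input-a": now - timedelta(hours=20),  # healthy
    "input-b": now - timedelta(hours=36),  # exactly at the threshold
    "input-c": None,                       # never seen
}
print(stale_hosts(last_mail, now))
```

A host that has never mailed is treated as stale on purpose, so a host where the role never ran is reported rather than silently skipped.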
Decision: Severity model + routing
Decision
Severity = tier subscription, not a label. Substrate signals currently subscribe to T2 only (mail). When #158’s PRD lands a Tier 1 channel, substrate signals will subscribe to both T1+T2. Non-substrate signals (future) may subscribe to one or both depending on urgency.
Unified mail alias (admin_email_addresses — current: buchbend@ph1.uni-koeln.de, ngo@ph1.uni-koeln.de). Staging mail uses [STAGING] subject prefix for inbox disambiguation.
Tier 1 routing deferred. The original section enumerated two new Matrix rooms (#ccat-monitoring-prod, #ccat-monitoring-staging) and three vault entries (vault_matrix_monitoring_bot_token, vault_matrix_monitoring_room_id_prod, vault_matrix_monitoring_room_id_staging). All deferred to #158 where channel choice is re-evaluated. No vault schema changes land with this ADR.
Why no severity label
The team has no on-call rotation today. A 3-level taxonomy (critical/warning/info) implies a hierarchy that doesn’t exist. Tier subscription expresses what we actually mean: “is this paging or FYI?”. Layer in labels when an on-call rotation is decided in the follow-up framework PRD.
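One way to picture "severity = tier subscription" is as a mapping from signal class to the set of tiers it subscribes to, intersected with the tiers that actually exist today. The names below (Tier, INTENDED, AVAILABLE, effective_tiers, mail_subject) are illustrative and do not correspond to anything in the repository.

```python
from enum import Enum


class Tier(Enum):
    T1 = "page"  # channel deferred to #158
    T2 = "mail"


# What each signal class is meant to subscribe to once all tiers exist.
INTENDED = {"substrate": {Tier.T1, Tier.T2}}

# Tiers that are actually wired today: mail only.
AVAILABLE = {Tier.T2}


def effective_tiers(signal_class, available=AVAILABLE):
    """Tiers a signal class reaches given what is currently provisioned."""
    return INTENDED.get(signal_class, set()) & set(available)


def mail_subject(summary, env):
    """Prefix staging mail so it can be told apart in a shared inbox."""
    return f"[STAGING] {summary}" if env == "staging" else summary


print(effective_tiers("substrate"))
print(mail_subject("step-renew-x509 failed", "staging"))
```

Under this model, landing a Tier 1 channel later changes only the available set; the substrate signals themselves need no new label.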
Why unified mail alias
Two humans on the alias; subject-line prefix is sufficient disambiguation. Splitting aliases is overhead until the alias grows past 4 humans.
On separating staging vs prod Tier 1 rooms (retained for future PRD)
When Tier 1 lands, staging noise should not mix into the production paging room (desensitisation risk within weeks). The two-room split remains the recommended shape; only the channel substrate is open.
Decision: Grafana provisioning approach
Decision
Grafana provisioning continues with the existing declarative YAML pattern: grafana/provisioning/<env>/{datasources,dashboards}/. No alerting/ subdir is created in this ADR’s scope: Tier 1 is deferred, and its leading candidate is source-push rather than Grafana-driven. The Grafana alerting layer for non-substrate signals is designed in the follow-up framework PRD; if/when it lands, it will use the same grafana/provisioning/<env>/alerting/ declarative-YAML shape.
Decision: Validation gate
Check 8 (PRD #95 / runbook) signs off page-path E2E on the canary cert on input-c.staging:
Induce primary (network partition): sudo iptables -I OUTPUT -d <step-ca-IP> -p tcp --dport 9000 -j DROP + sudo systemctl start step-renew-x509-x509-canary.service. Renewal should fail within ~30s.
Expected: Within ~60s, admin_email_addresses receives mail with [STAGING] subject prefix.
Reset: remove iptables rule.
Sanity sibling (non-network): relocate canary key file (sudo mv /opt/x509-canary/canary.key /opt/x509-canary/canary.key.bak), re-trigger renewal, confirm mail fires on the script error path. Reset: mv back.
Tier 1 / Matrix half of dual-path validation is deferred along with the Tier 1 channel decision; Check 8 currently verifies the mail path only.
Procedure caveat (observed 2026-05-13 during #157 sign-off). step ca renew short-circuits when the cert is not yet inside its renewal window, so the primary induce method (iptables DROP + systemctl start) does not fire OnFailure= on a freshly-issued canary: the renewal exits successfully before reaching the network call. The sanity-sibling test is the load-bearing evidence until the procedure is refined. Refinement (mirror --force from #151) tracked in #168.
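The caveat can be reasoned about with a toy model of the renewal decision. The sketch assumes the window is expressed as "time remaining before expiry"; the 24h lifetime, the 8h window, and both function names are made-up values for illustration, not the canary's real configuration.

```python
from datetime import datetime, timedelta, timezone


def renewal_attempted(not_after, now, renew_window, force=False):
    """True if the renewal would actually contact the CA."""
    return force or (not_after - now) <= renew_window


def fires_onfailure(not_after, now, renew_window, ca_reachable, force=False):
    """True if the unit would exit non-zero and trigger OnFailure=."""
    if not renewal_attempted(not_after, now, renew_window, force):
        # Short-circuit: exits successfully before any network call.
        return False
    return not ca_reachable


issued = datetime(2026, 5, 13, 9, 0, tzinfo=timezone.utc)
not_after = issued + timedelta(hours=24)  # hypothetical lifetime
window = timedelta(hours=8)               # hypothetical renewal window

# Freshly issued cert, CA blocked by iptables: no failure is observed.
print(fires_onfailure(not_after, issued, window, ca_reachable=False))
# Same situation with a forced renewal: the failure path fires.
print(fires_onfailure(not_after, issued, window, ca_reachable=False, force=True))
```

In this model the network-partition test only becomes meaningful once the cert is inside its window or the renewal is forced, which is the refinement the caveat defers to #168.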
Out of scope
Deferred to the follow-up scoping issue #158 (“Scope monitoring framework PRD for non-substrate signals”):
Inventory of all current telemetry (Telegraf inputs, container metrics, log sources).
Grafana alerting layer design for non-substrate signals.
Severity granularity beyond tier subscription.
On-call hand-off contract (acks, escalation, MTTR).
Concrete non-substrate anchor signals (disk fill, container restart loops, replication lag, etc.).
Tier 3 issuance-audit alerting (Loki rules per ADR-0002 §Monitoring).
Dashboards-as-code maturity (UI/JSON export workflow formalisation).
Operational notes
The role task-ordering bug from finding 3 of #112’s 2026-05-08 verification comment is resolved separately in PR #152. This ADR does not re-fix it.
Mail is single-path for substrate alerts until #158’s PRD lands a Tier 1. Treat any mail-flow anomaly (queue stall, ≥36h absence of daily mail on any host, SMTP submission failure logs) as substrate-critical. Re-evaluate this rule once Tier 1 lands.
If smtp.uni-koeln.de requires auth in the future, add vault_smtp_relay_password to the schema and smtp_sasl_password_maps to the postfix template. Backward-compat: empty/missing vault var = no auth (current behaviour).
When #158’s PRD revisits Tier 1: if the chosen channel is Matrix, bot-token rotation will be a schema-driven ccat secrets set + ccat secrets provision flow; if a different substrate is chosen, the rotation surface attaches to whatever the PRD defines.
Implementation
Issues that originally spun off this ADR. Original graph was A ∥ B → C → D; revised graph is A → D:
A — MTA install (system_setup/sendmail.yml extension). Landed via #154.
B — Matrix monitoring bot + rooms provisioning. Closed #155 (2026-05-13); re-emerges in #158.
C — Matrix push wire-up for step_ca_service_cert and ssh_service_cert roles. Descoped; existing OnFailure=ccat-cert-mail@%i.service is sufficient for Phase E.
D — Check 8 page-path E2E sign-off on input-c.staging (mail-only).
Phase E of PRD #95 unblocks on D landing.