CCAT Certificate Authority — Threat Model & Attack Surface
This document is the security-review companion to
CCAT Certificate Authority — Operations Guide. Where that document explains how the
CCAT step-ca deployment is built and operated, this one explains what
it is exposed to, what defends it, and what to do when something goes
wrong. It is intended as reference material for security review,
incident response, and the Phase 3 go/no-go decision on worldwide
exposure of ca.ccat.uni-koeln.de:9000.
For PKI fundamentals (what a cert is, how chains work, why short lifetimes matter), see TLS, Certificates, and Public Key Infrastructure. For OS and container patch cadence — which directly affects how fast we close CVEs in step-ca itself — see Patch Management, Container Security & Supply Chains.
Scope
What this document covers. The public and auth-gated attack surface of the deployed step-ca instance, the Dex OIDC provider it relies on, realistic attacker scenarios, the defense-in-depth layers that stop them, and an incident-response sketch per failure mode.
What it does not cover. Cryptographic primitives and PKI theory (see TLS, Certificates, and Public Key Infrastructure), day-to-day operational runbooks such as adding provisioners or rotating the intermediate (see CCAT Certificate Authority — Operations Guide), and the broader supply-chain story for OS and container updates (see Patch Management, Container Security & Supply Chains).
Deployment phases
The threat model differs sharply between the rollout phases:
| Phase | Root key | Production trust | Threat-model posture |
|---|---|---|---|
| Phase 1: dry-run | File-based, throwaway | No production service trusts the root yet | Everything issued is a dress rehearsal. A full compromise = wipe and redo the ceremony. Low stakes. |
| Phase 2: HSM rehearsal | HSM-backed, real | Still no production trust; rotation drills only | Keys are extraction-resistant. Signing key never leaves the HSM. |
| Phase 3: steady state | HSM-backed, real | All CCAT hosts trust the root; SSH, mTLS, internal web UIs chain to it | Real blast radius. The operational hardening checklist below must be complete before entering this phase. |
Scale assumption
CCAT is a small, university-hosted, internal-use PKI: roughly 20 trusted humans (the ccatobs GitHub org) plus a handful of service accounts and hosts. The threat model reflects this — we are not defending a public CA with millions of subscribers. We do not need WebPKI-grade transparency logs, CRLs, or staffed SOCs. We do need the discipline that comes with being an internal trust anchor for real infrastructure.
Attack surface
Two DNS names are relevant:
ca.ccat.uni-koeln.de: step-ca’s API. In Phase 1 it rides behind nginx-proxy on port 443 (the “trust-bundle hack”); the steady-state plan is to expose step-ca directly on TCP 9000, worldwide, once Uni Köln IT opens the firewall.
auth.ccat.uni-koeln.de: Dex’s OIDC endpoints and GitHub login redirect page, on port 443 with Let’s Encrypt termination via acme-companion. Dex has no admin UI and no path that needs IP gating.
Public-by-design endpoints
These endpoints are meant to be reachable by anyone. Serving them to the entire internet is not a security regression — it is how the protocols are specified. The equivalent endpoints on Let’s Encrypt and every other public CA are worldwide-reachable by design.
| Endpoint | Purpose | What an attacker learns |
|---|---|---|
| /health | Liveness probe | The CA is up |
| /roots.pem | Public root certificate | The public half of our trust anchor: exactly the material every client has to fetch anyway |
| /provisioners | Provisioner discovery | Provisioner names, types, and public config: OIDC issuer URL, OIDC client ID, allowed group claims, ACME directory URL |
| ACME directory | RFC 8555 discovery | Standard ACME endpoints |
| Dex OIDC discovery | OIDC metadata | Issuer URL, JWKS URI, supported flows |
What is not in those responses:
JWK provisioner passwords (these are held in the Ansible vault and never served)
OIDC client secrets (held by Dex via its static clients config, never emitted)
Signing keys or any secret material — the CA emits only public certs
User identities, SSH principal lists, or issued-cert history
Publishing provisioner metadata is intentional. A client that wants to request a cert needs to know which provisioners exist and how to authenticate to them. This is the same discovery model Let’s Encrypt uses via its ACME directory.
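Auth-gated endpoints: where the real security lives
Everything that actually issues a cert sits behind one of four authentication mechanisms (OIDC, JWK, ACME, or SSHPOP), each with its own strength profile. JWK backs two of the entries below, so five endpoint descriptions follow.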
POST /1.0/ssh/sign — OIDC (CCAT-GitHub provisioner)
Requires a Dex-issued OIDC token whose groups claim contains
the ccatobs/datacenter GitHub team slug. Strong by design: an
attacker has to clear three independent gates — a valid GitHub
identity, successful GitHub OAuth with read:org scope, and
actual membership in the ccatobs/datacenter team at the moment
Dex calls GitHub’s team-membership endpoint. Membership is checked
live against GitHub on every authentication, so a user removed
from the team cannot authenticate even if their browser session
is still warm. Output is a 16-hour SSH user cert with the user’s
principal only.
POST /1.0/sign — JWK (prod-services, staging-services)
Requires knowledge of the provisioner password (STEP_CA_PASSWORD,
held in the Ansible vault as vault_step_ca_password). Medium
strength: it is a single long-lived secret, but it is high-entropy,
scoped to a single provisioner, and never leaves the vault except
during ccat secrets provision. An attacker with vault access already
has a much larger problem. Output is 30- or 90-day x509 certs with
principal restrictions enforced by the provisioner template.
POST /1.0/ssh/sign — JWK (service-accounts)
Same mechanism as above, but issues 24-hour SSH service certs, so requests go to the SSH signing path rather than /1.0/sign. The certs are auto-renewed every 6 hours by the target host. Blast radius on compromise is small because the certs expire quickly on their own.
POST /acme/... — ACME challenge response
Strong by protocol design: the attacker must prove control of the
hostname they are requesting a cert for, via HTTP-01, DNS-01, or
TLS-ALPN-01. An attacker who does not control example.ccat.uni-koeln.de
cannot satisfy a challenge for it, full stop. Output is 90-day x509.
POST /1.0/ssh/renew — SSHPOP
Requires possession of a currently-valid SSH host cert. There is no bootstrap path through this endpoint — it only renews an existing cert, it never issues the first one. An attacker who already has a valid host cert is an attacker who already has the host.
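The OIDC gate on POST /1.0/ssh/sign described above reduces to a membership test on the token's groups claim. The following is a minimal sketch of that test, assuming the claims have already been signature-verified and decoded into a dict. The function name `is_authorized` and the exact string form of the team inside the claim are illustrative assumptions, not code taken from step-ca or Dex.

```python
from typing import Any, Mapping

# Assumption: the team appears in the claim exactly as it is written in this document.
REQUIRED_GROUP = "ccatobs/datacenter"


def is_authorized(claims: Mapping[str, Any], required_group: str = REQUIRED_GROUP) -> bool:
    """Return True only if the decoded token's groups claim lists the required team."""
    groups = claims.get("groups")
    # A missing or malformed claim fails closed. Rejecting plain strings also
    # avoids an accidental substring match.
    if not isinstance(groups, (list, tuple)):
        return False
    return required_group in groups


print(is_authorized({"groups": ["ccatobs/datacenter"]}))
print(is_authorized({"groups": ["ccatobs/other-team"]}))
```

The check fails closed on purpose: a token without a groups claim is treated the same as a token for a user outside the team.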
Realistic attack scenarios
Ordered roughly from “happens every day” to “we hope this is hypothetical.”
1. Random reconnaissance scans
Botnets sweep the internet continuously. Exposing port 9000 worldwide means we will see constant, low-grade scan traffic.
What they find. /health, /roots.pem, /provisioners. Everything public-by-design.
What’s actionable for them. Nothing. The information is equivalent to what Let’s Encrypt publishes about itself.
What stops them. Nothing needs to — there’s nothing to steal.
What monitoring catches it. Loki + Grafana will show steady low-rate 200s on the public endpoints. Useful as baseline.
2. Vulnerability scanning against known step-ca CVEs
Scanners try CVEs indiscriminately. step-ca is open source and actively maintained by Smallstep.
What they gain. If we’re patched, nothing. If we’re not, it depends on the CVE.
What stops them. Prompt patching. Subscribe to Smallstep’s security advisories. See Patch Management, Container Security & Supply Chains for the general upgrade cadence story.
What monitoring catches it. 4xx/5xx spikes, odd user-agent strings in Loki, Grafana alerts on error-rate anomalies.
3. Brute-force against JWK provisioner passwords
The prod-services / staging-services / service-accounts
provisioners authenticate with a password. An attacker who knows the
provisioner name could attempt to guess the password by repeatedly
POSTing sign requests.
What they gain. Arithmetically bounded to nothing: the password is high-entropy (generated via ccat secrets rotate), and step-ca enforces request rate limits. A 128-bit password against a rate-limited endpoint is not brute-forceable on any human timescale.
What stops them. Password entropy + step-ca rate limits + nginx-proxy rate limiting if needed + fail2ban on repeated 401s.
What monitoring catches it. Repeated auth failures from the same source IP in the step-ca log.
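The "arithmetically bounded to nothing" claim can be checked with a few lines of arithmetic. This sketch assumes a uniformly random 128-bit secret and a deliberately generous attacker rate of 1,000 guesses per second. Both the rate and the function name are illustrative assumptions; a rate-limited endpoint would allow far fewer attempts.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_years_to_guess(entropy_bits: int, guesses_per_second: float) -> float:
    """Expected time to hit a uniformly random secret: half the keyspace on average."""
    expected_guesses = 2 ** (entropy_bits - 1)
    return expected_guesses / guesses_per_second / SECONDS_PER_YEAR


# Assumed attacker rate: 1,000 guesses per second.
years = expected_years_to_guess(128, 1_000)
print(f"{years:.2e} years")  # on the order of 10**27 years
```

Even if the assumed rate were a million times higher, the result would still exceed the age of the universe by many orders of magnitude, which is why entropy is doing the real work here and rate limiting mostly reduces log noise.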
4. Denial of service / cert spamming
Flood the sign endpoint to exhaust resources or fill the DB with issued certs.
What they gain. Degraded availability for legitimate issuance.
What stops them. step-ca has built-in per-provisioner rate limits. nginx-proxy can add an outer rate limit. Docker port binding to a specific interface limits the exposed surface. Host iptables provides a final layer.
What monitoring catches it. Request-rate dashboards in Grafana; alerts on sustained high throughput.
5. Targeted phishing of the OIDC flow
A social-engineering attack against a ccatobs/datacenter team member that tricks them into completing a Dex OIDC flow the attacker initiated.
Key observation. This attack works identically against a localhost-only CA. Exposing port 9000 worldwide neither helps nor hurts the attacker here — the flow is in the browser, not on the network.
What they gain. A 16-hour SSH user cert for the victim’s principal.
What stops them. GitHub 2FA on the upstream identity (mandatory on ccatobs org); the team-membership check happens on every login, so any user not currently in ccatobs/datacenter is rejected at Dex; user awareness training; the 16-hour lifetime bounds the blast window; removing the victim from the GitHub team immediately blocks any new authentication attempts.
What monitoring catches it. Anomalous issuance patterns for a user (odd hours, unexpected source IP), cross-checked against the user’s usual behavior.
Defense layers
The exposure above is safe because no single layer is load-bearing — each attack scenario is stopped by multiple independent defenses.
1. Network layer. Optional and cumulative:
   Uni Köln firewall (outermost; currently closed on TCP 9000, opening is the Phase 3 request)
   Host iptables on input-b
   Docker port binding (can bind 9000 to a specific interface only)
   nginx-proxy IP allowlists (available via proxy/data/vhost.d/ drop-in files if ever needed; not currently used on auth.ccat.uni-koeln.de because Dex has no admin UI to gate)
2. Application layer. step-ca’s own auth model: every endpoint that issues a cert requires one of the auth mechanisms above. There is no unauthenticated path to issuance.
3. Provisioner layer. Each provisioner has its own independent auth gate. Compromising one provisioner does not compromise the others.
4. Authorization gate. Role-based or challenge-based checks on top of authentication: ccatobs/datacenter GitHub team membership enforced by Dex for OIDC, password secrecy for JWK, challenge response for ACME, cert possession for SSHPOP.
5. Issued-cert constraints. Short lifetimes are the single largest blast-radius reducer:
   16h human SSH user certs
   24h service-account SSH certs (renewed every 6h)
   7d SSH host certs
   30–90d TLS certs
   Signing key never leaves the HSM (Phase 2+)
6. Target-host opt-in. A cert is only useful against hosts that have been told to trust the CCAT CA, via /etc/ssh/trusted_user_ca_keys deployed by the ca_trust Ansible role. A leaked cert against a host that doesn’t trust us is a leaked cert against a host that doesn’t care.
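The pairing of a 24-hour lifetime with a 6-hour renewal interval for service-account certs leaves some slack for failed renewals. A minimal sketch of that arithmetic follows, assuming renewal attempts happen at fixed intervals measured from the last successful issuance. The function name is illustrative, and the real renewal scheduling on the hosts may differ.

```python
import math


def renewal_attempts_before_expiry(lifetime_hours: float, renew_every_hours: float) -> int:
    """Count scheduled renewal attempts that fall strictly before the cert expires."""
    return math.ceil(lifetime_hours / renew_every_hours) - 1


# Figures from this document: 24h service-account certs, renewed every 6h.
attempts = renewal_attempts_before_expiry(24, 6)
print(attempts)  # 3 attempts (at 6h, 12h, 18h)
```

Under these assumptions a host can miss two consecutive renewals and still recover on the third before its cert lapses.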
| Scenario | Stopped by layer(s) |
|---|---|
| Recon scans | 1, 2 |
| Known-CVE scanning | Patch cadence + 1, 2 |
| JWK brute force | 1, 2, 3, 4 |
| DoS / cert spam | 1, 2, 5 |
| OIDC phishing | 4, 5, 6 |
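The claim that no single layer is load-bearing can be checked mechanically against the table above. This sketch encodes the table as a dict and flags any scenario that relies on fewer than two layers. It assumes the layers are numbered in the order they are listed in this section; the names `STOPPED_BY` and `single_layer_scenarios` are illustrative.

```python
# Assumed numbering, following the order of the defense-layer list:
# 1 network, 2 application, 3 provisioner, 4 authorization gate,
# 5 issued-cert constraints, 6 target-host opt-in.
STOPPED_BY = {
    "Recon scans": {1, 2},
    "Known-CVE scanning": {1, 2},  # plus patch cadence, which is not a numbered layer
    "JWK brute force": {1, 2, 3, 4},
    "DoS / cert spam": {1, 2, 5},
    "OIDC phishing": {4, 5, 6},
}


def single_layer_scenarios(table: dict[str, set[int]]) -> list[str]:
    """Return the scenarios that depend on fewer than two defense layers."""
    return [name for name, layers in table.items() if len(layers) < 2]


print(single_layer_scenarios(STOPPED_BY))  # []
```

Keeping the mapping in a reviewable structure like this makes it easy to re-run the check whenever a scenario or layer is added.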
The Phase 3 decision: worldwide port 9000
The team’s current direction is yes: request worldwide opening of
TCP 9000 inbound on ca.ccat.uni-koeln.de, and replace the Phase 1
trust-bundle hack behind nginx-proxy with direct exposure.
Why it is defensible.
step-ca is designed for public internet exposure. Smallstep’s own commercial offering runs this way, as do many third-party hosted step-ca deployments. The protocols it speaks (ACME, OIDC) require public reachability for large parts of the client base.
The auth gates do the work. Nothing in the threat model above gets easier for an attacker when the endpoint moves from “reachable from Uni Köln networks” to “reachable worldwide.” The network-layer restriction is not the security boundary.
Let’s Encrypt publishes the same kind of metadata to the entire internet and issues certs that far more people trust than anything CCAT will ever sign.
What it buys operationally.
Developers can SSH to CCAT hosts from anywhere (home, conferences, observing runs) without needing a VPN or Uni Köln network access. This matches the “GitHub identity + 16h SSH cert” model we chose in the first place.
ACME clients for *.ccat.uni-koeln.de endpoints can reach us from wherever they need to.
SSHPOP renewal works for hosts at remote sites (Chile, US) without complicated site-to-site tunneling.
What it does not change.
The attack surface in “public-by-design endpoints” above is identical whether the network range is “Uni Köln” or “the world.” Auth gates are what stop issuance, not source IP.
Incident-response procedures are unchanged.
What mitigations remain available even with worldwide exposure.
Docker port binding can restrict which host interface 9000 binds to.
Host iptables can drop or rate-limit traffic from specific ranges.
fail2ban watching the step-ca log can block sources with repeated auth failures.
nginx-proxy with SNI passthrough + ACLs remains an option for Phase 2+ if we ever decide to narrow exposure again.
Uni Köln firewall can be reclosed at any time — the opening is reversible.
Phase 3 operational hardening checklist
These items are not blockers for Phase 1 dry-run or Phase 2 HSM rehearsal. They are blockers for Phase 3, when real services start trusting the CA.
step-ca logs shipped to Loki via promtail and visible in Grafana.
Grafana dashboard: cert issuance per provisioner per hour, with baseline annotations.
Alert: 4xx error rate spikes above baseline (sign of abuse, misconfig, or scanning).
Alert: repeated auth failures from a single source IP above a threshold in a rolling window.
fail2ban (or equivalent) watching the step-ca log and temporarily blocking sources that trip the repeated-failure threshold.
Subscribed to Smallstep security advisories; upgrade procedure for step-ca documented in CCAT Certificate Authority — Operations Guide and tested on staging.
Weekly “who got certs, who tried and failed” review as part of the security hygiene rhythm.
JWK provisioner password rotation procedure documented and rehearsed end-to-end.
Backup verification: step-ca-data volume backed up, and a restore rehearsed into a scratch environment. Dex state is regenerated from step-ca/dex/config.yaml in git; no separate backup needed.
HSM #1 access procedure (the offline root ceremony for intermediate rotation) documented and walked through by at least two operators.
Dex GitHub OAuth App audit: client ID + secret in vault, app restricted to the ccatobs org, scope is read:org, no other apps share the secret.
GitHub team ccatobs/datacenter membership reviewed: everyone in the team should have a current operational need for SSH access to CCAT Data Center hosts. Leavers pruned.
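The "repeated auth failures from a single source IP above a threshold in a rolling window" item in the checklist above is a sliding-window counter. The sketch below shows the idea in isolation. It is not how fail2ban or Grafana alerting is implemented; the class name, the threshold of 3, and the 60-second window are illustrative assumptions. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class FailureWindow:
    """Track auth failures per source IP within a rolling time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: dict[str, deque[float]] = defaultdict(deque)

    def record_failure(self, source_ip: str, timestamp: float) -> bool:
        """Record one failure; return True if the source is at or over the threshold."""
        events = self._events[source_ip]
        events.append(timestamp)
        # Drop failures that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


window = FailureWindow(threshold=3, window_seconds=60)
for second in (0, 10, 20):
    tripped = window.record_failure("203.0.113.7", second)
print(tripped)  # True: three failures within 60 seconds
```

Because old events are discarded as new ones arrive, a source that fails occasionally over a long period never trips the threshold, while a burst does.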
Incident response sketch
Response procedures by suspected-failure mode. In every case the short cert lifetimes mean most remediation is “revoke the mechanism that issues, and wait” rather than “chase down every issued cert.”
JWK provisioner password leaked
Rotate the password. In-flight short-lived certs expire on their own; no client re-bootstrap is needed because clients authenticate to the CA with the password, not to each other.
ccat secrets rotate vault_step_ca_password --env production && ccat secrets provision --host input-b
After provisioning, restart step-ca on input-b and confirm new issuance works. Audit the step-ca log for any issuance during the exposure window and revoke suspicious certs.
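Auditing "any issuance during the exposure window" amounts to filtering log entries by timestamp. The following is a minimal sketch, assuming the step-ca log is available as JSON lines with an ISO 8601 timestamp carrying a UTC offset. The field names `time` and `msg` and the function name are assumptions for illustration and should be checked against the actual log output.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator


def entries_in_window(lines: Iterable[str], start: datetime, end: datetime,
                      time_field: str = "time") -> Iterator[dict]:
    """Yield JSON log entries whose timestamp lies within [start, end]."""
    for line in lines:
        try:
            entry = json.loads(line)
            # Older Python versions reject a trailing "Z", so normalize it to an offset.
            stamp = datetime.fromisoformat(str(entry[time_field]).replace("Z", "+00:00"))
            in_window = start <= stamp <= end
        except (ValueError, KeyError, TypeError):
            continue  # skip non-JSON lines and entries without a usable timestamp
        if in_window:
            yield entry


# Hypothetical log lines, for illustration only.
sample = [
    '{"time": "2024-05-01T10:00:00Z", "msg": "issued"}',
    "plain text line",
    '{"time": "2024-05-03T09:00:00Z", "msg": "issued later"}',
]
start = datetime(2024, 5, 1, tzinfo=timezone.utc)
end = datetime(2024, 5, 2, tzinfo=timezone.utc)
hits = list(entries_in_window(sample, start, end))
print(len(hits))  # 1
```

Parsing each line as JSON is more tolerant of whitespace and field ordering than a fixed-string match, at the cost of silently skipping lines that do not parse.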
An SSH user cert has been used maliciously
Identify the principal from the step-ca issuance log. Remove the
user from the ccatobs/datacenter GitHub team — Dex checks team
membership on every authentication, so the next step ssh login
from that user fails immediately. The cert itself expires within
16 hours; no host-side action is required unless the principal
shows ongoing activity.
# Find issuance events for a given principal
docker compose logs step-ca | grep '"principal":"alice"'
Dex static client secret leaked
The secret that step-ca uses to authenticate to Dex
(vault_dex_stepca_client_secret) could, if leaked, let an attacker
exchange OIDC codes for tokens on step-ca’s behalf — useful only in
combination with an already-valid user authentication flow, so the
risk is bounded. Rotate:
ccat secrets rotate vault_dex_stepca_client_secret --env production
ccat secrets provision --host input-b
ccat ca down && ccat ca up # reload Dex with new secret
ccat ca provisioner remove CCAT-GitHub
ccat ca provisioner sync # re-add with new secret
A ccatobs/datacenter team member has gone rogue
Remove them from the ccatobs/datacenter GitHub team. Dex checks
team membership live on every authentication, so any new
step ssh login will fail at the Dex layer. Their existing SSH
user cert expires within 16 hours. For faster eviction from active
sessions, force-terminate their SSH connections on the target
hosts.
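The "expires within 16 hours" bound can be turned into a concrete deadline once the last issuance time is known from the log. A minimal sketch, assuming timezone-aware timestamps; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

USER_CERT_LIFETIME = timedelta(hours=16)  # lifetime stated in this document


def latest_expiry(last_issued_at: datetime,
                  lifetime: timedelta = USER_CERT_LIFETIME) -> datetime:
    """Latest moment a cert issued at `last_issued_at` can still be valid."""
    return last_issued_at + lifetime


def remaining_validity(last_issued_at: datetime, now: datetime) -> timedelta:
    """Time left on the cert, floored at zero once it has expired."""
    return max(latest_expiry(last_issued_at) - now, timedelta(0))


issued = datetime(2024, 5, 1, 8, 0, tzinfo=timezone.utc)
now = datetime(2024, 5, 1, 20, 0, tzinfo=timezone.utc)
print(remaining_validity(issued, now))  # 4:00:00
```

If the remaining validity is longer than the team is willing to wait, force-terminating sessions on the target hosts closes the gap.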
input-b has been compromised
The blast radius depends on the phase.
Phase 1 (file-based intermediate key). The intermediate signing key must be assumed compromised, because it sits on disk. Response: full intermediate rotation via an offline root ceremony and redeployment of trust bundles. Because Phase 1 material is throwaway, the simpler answer is to wipe and restart the ceremony from scratch.
Phase 2+ (HSM-backed intermediate key). The key itself cannot be extracted from the HSM, but the attacker could have used it while they had access to input-b. Response: rotate the intermediate (ceremony with HSM #1, no root rotation needed), audit the step-ca issuance log for anything signed during the exposure window, and revoke or actively expire any suspicious certs. The root stays intact; clients do not need to re-bootstrap.
In both phases, Dex’s state on input-b is also within blast radius, but Dex has no user database to leak — its entire config is in git and the only secrets it holds are the GitHub OAuth client secret and the static step-ca client secret, both in the Ansible vault. Rotate both as part of the same response.
Further reading
TLS, Certificates, and Public Key Infrastructure — PKI fundamentals, cert chains, key material
CCAT Certificate Authority — Operations Guide — the operational guide for the CCAT CA: provisioner config, rotation runbooks, client bootstrap
Patch Management, Container Security & Supply Chains — the upgrade cadence and supply-chain story that keeps step-ca itself patched
Smallstep step-ca documentation — upstream reference
RFC 8555 — Automatic Certificate Management Environment (ACME)