CCAT Certificate Authority — Threat Model & Attack Surface

This document is the security-review companion to CCAT Certificate Authority — Operations Guide. Where that document explains how the CCAT step-ca deployment is built and operated, this one explains what it is exposed to, what defends it, and what to do when something goes wrong. It is intended as reference material for security review, incident response, and the Phase 3 go/no-go decision on worldwide exposure of ca.ccat.uni-koeln.de:9000.

For PKI fundamentals (what a cert is, how chains work, why short lifetimes matter), see TLS, Certificates, and Public Key Infrastructure. For OS and container patch cadence — which directly affects how fast we close CVEs in step-ca itself — see Patch Management, Container Security & Supply Chains.

Scope

What this document covers. The public and auth-gated attack surface of the deployed step-ca instance, the Dex OIDC provider it relies on, realistic attacker scenarios, the defense-in-depth layers that stop them, and an incident-response sketch per failure mode.

What it does not cover. Cryptographic primitives and PKI theory (see TLS, Certificates, and Public Key Infrastructure), day-to-day operational runbooks such as adding provisioners or rotating the intermediate (see CCAT Certificate Authority — Operations Guide), and the broader supply-chain story for OS and container updates (see Patch Management, Container Security & Supply Chains).

Deployment phases

The threat model differs sharply between the rollout phases:

| Phase | Root key | Production trust | Threat-model posture |
| --- | --- | --- | --- |
| Phase 1 — dry-run | File-based, throwaway | No production service trusts the root yet | Everything issued is a dress rehearsal. A full compromise = wipe and redo the ceremony. Low stakes. |
| Phase 2 — HSM rehearsal | HSM-backed, real | Still no production trust; rotation drills only | Keys are extraction-resistant. Signing key never leaves the HSM. |
| Phase 3 — steady state | HSM-backed, real | All CCAT hosts trust the root; SSH, mTLS, internal web UIs chain to it | Real blast radius. The operational hardening checklist below must be complete before entering this phase. |

Scale assumption

CCAT is a small, university-hosted, internal-use PKI: roughly 20 trusted humans (the ccatobs GitHub org) plus a handful of service accounts and hosts. The threat model reflects this — we are not defending a public CA with millions of subscribers. We do not need WebPKI-grade transparency logs, CRLs, or staffed SOCs. We do need the discipline that comes with being an internal trust anchor for real infrastructure.

Attack surface

Two DNS names are relevant:

  • ca.ccat.uni-koeln.de — step-ca’s API. In Phase 1 it rides behind nginx-proxy on port 443 (the “trust-bundle hack”); the steady-state plan is to expose step-ca directly on TCP 9000, worldwide, once Uni Köln IT opens the firewall.

  • auth.ccat.uni-koeln.de — Dex’s OIDC endpoints and GitHub login redirect page, on port 443 with Let’s Encrypt termination via acme-companion. Dex has no admin UI and no path that needs IP gating.

Public-by-design endpoints

These endpoints are meant to be reachable by anyone. Serving them to the entire internet is not a security regression — it is how the protocols are specified. The equivalent endpoints on Let’s Encrypt and every other public CA are worldwide-reachable by design.

| Endpoint | Purpose | What an attacker learns |
| --- | --- | --- |
| GET /health | Liveness probe | The CA is up |
| GET /roots.pem | Public root certificate | The public half of our trust anchor — exactly the material every client has to fetch anyway |
| GET /provisioners | Provisioner discovery | Provisioner names, types, and public config: OIDC issuer URL, OIDC client ID, allowed group claims, ACME directory URL |
| ACME directory (/acme/acme/directory) | RFC 8555 discovery | Standard ACME endpoints |
| Dex OIDC discovery (/.well-known/openid-configuration) | OIDC metadata | Issuer URL, JWKS URI, supported flows |

What is not in those responses:

  • JWK provisioner passwords (these are held in the Ansible vault and never served)

  • OIDC client secrets (held by Dex via its static clients config, never emitted)

  • Signing keys or any secret material — the CA emits only public certs

  • User identities, SSH principal lists, or issued-cert history

Publishing provisioner metadata is intentional. A client that wants to request a cert needs to know which provisioners exist and how to authenticate to them. This is the same discovery model Let’s Encrypt uses via its ACME directory.

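Auth-gated endpoints — where the real security lives

Everything that actually issues or renews a cert sits behind one of four authentication mechanisms (OIDC, JWK, ACME, SSHPOP), each with its own strength profile. The JWK mechanism backs two groups of provisioners, so five gates are listed below.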

POST /1.0/ssh/sign — OIDC (CCAT-GitHub provisioner)

Requires a Dex-issued OIDC token whose groups claim contains the ccatobs/datacenter GitHub team slug. Strong by design: an attacker has to clear three independent gates — a valid GitHub identity, successful GitHub OAuth with read:org scope, and actual membership in the ccatobs/datacenter team at the moment Dex calls GitHub’s team-membership endpoint. Membership is checked live against GitHub on every authentication, so a user removed from the team cannot authenticate even if their browser session is still warm. Output is a 16-hour SSH user cert with the user’s principal only.
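The group gate described above can be pictured as a check on the decoded token claims. The following is a conceptual sketch only, not how Dex or step-ca implement it: the function name is hypothetical, and the claim value `ccatobs:datacenter` is a placeholder, since the exact string Dex places in the groups claim depends on the connector configuration.

```python
from typing import Any, Mapping

# Placeholder value (assumption): the real string depends on how the
# Dex GitHub connector is configured to render org and team names.
REQUIRED_GROUP = "ccatobs:datacenter"


def has_required_group(claims: Mapping[str, Any],
                       required: str = REQUIRED_GROUP) -> bool:
    """Return True only if the token's groups claim lists the required team."""
    groups = claims.get("groups")
    # Insist on a list: if groups were a plain string, `in` would do a
    # substring match and "ccatobs:datacenter-evil" would wrongly pass.
    if not isinstance(groups, list):
        return False
    return required in groups
```

A token with no groups claim, or with the claim in an unexpected shape, is rejected rather than accepted by default.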

POST /1.0/sign — JWK (prod-services, staging-services)

Requires knowledge of the provisioner password (STEP_CA_PASSWORD, held in the Ansible vault as vault_step_ca_password). Medium strength: it is a single long-lived secret, but it is high-entropy, scoped to a single provisioner, and never leaves the vault except during ccat secrets provision. An attacker with vault access already has a much larger problem. Output is 30- or 90-day X.509 certs with principal restrictions enforced by the provisioner template.

POST /1.0/ssh/sign — JWK (service-accounts)

Same JWK password mechanism as above, but these are SSH certs, so requests go to the SSH signing endpoint rather than /1.0/sign. It issues 24-hour SSH service certs that the target host auto-renews every 6 hours. Blast radius on compromise is small because the certs expire quickly on their own.

POST /acme/... — ACME challenge response

Strong by protocol design: the attacker must prove control of the hostname they are requesting a cert for, via HTTP-01, DNS-01, or TLS-ALPN-01. An attacker who does not control example.ccat.uni-koeln.de cannot satisfy a challenge for it, full stop. Output is a 90-day X.509 cert.

POST /1.0/ssh/renew — SSHPOP

Requires possession of a currently-valid SSH host cert. There is no bootstrap path through this endpoint — it only renews an existing cert, it never issues the first one. An attacker who already has a valid host cert is an attacker who already has the host.

Realistic attack scenarios

Ordered roughly from “happens every day” to “we hope this is hypothetical.”

1. Random reconnaissance scans

Botnets sweep the internet continuously. Exposing port 9000 worldwide means we will see constant, low-grade scan traffic.

  • What they find. /health, /roots.pem, /provisioners. Everything public-by-design.

  • What’s actionable for them. Nothing. The information is equivalent to what Let’s Encrypt publishes about itself.

  • What stops them. Nothing needs to — there’s nothing to steal.

  • What monitoring catches it. Loki + Grafana will show steady low-rate 200s on the public endpoints. Useful as baseline.

2. Vulnerability scanning against known step-ca CVEs

Scanners try CVEs indiscriminately. step-ca is open source and actively maintained by Smallstep.

  • What they gain. If we’re patched, nothing. If we’re not, it depends on the CVE.

  • What stops them. Prompt patching. Subscribe to Smallstep’s security advisories. See Patch Management, Container Security & Supply Chains for the general upgrade cadence story.

  • What monitoring catches it. 4xx/5xx spikes, odd user-agent strings in Loki, Grafana alerts on error-rate anomalies.

3. Brute-force against JWK provisioner passwords

The prod-services / staging-services / service-accounts provisioners authenticate with a password. An attacker who knows the provisioner name could attempt to guess the password by repeatedly POSTing sign requests.

  • What they gain. Effectively nothing: the password is high-entropy (generated via ccat secrets rotate), and step-ca enforces request rate limits. A 128-bit password against a rate-limited endpoint cannot be brute-forced on any human timescale.

  • What stops them. Password entropy + step-ca rate limits + nginx-proxy rate limiting if needed + fail2ban on repeated 401s.

  • What monitoring catches it. Repeated auth failures from the same source IP in the step-ca log.
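The "not brute-forceable" claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes the 128 bits of entropy stated above and a guess rate of 1,000 requests per second; that rate is an assumption chosen to be generous to the attacker, not a measured limit of this deployment.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_years(entropy_bits: int, guesses_per_second: float) -> float:
    """Average time to guess a uniformly random secret, in years.

    On average the attacker has to search half of the keyspace.
    """
    expected_guesses = 2 ** (entropy_bits - 1)
    return expected_guesses / guesses_per_second / SECONDS_PER_YEAR


# 1,000 guesses/s is an assumed rate, far above what a rate-limited
# endpoint would plausibly allow a single source.
years = expected_years(128, 1_000)
print(f"{years:.1e} years")
```

Under these assumptions the expected search time is on the order of 10^27 years, so rate limiting matters here for log hygiene and availability rather than for protecting the secret itself.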

4. Denial of service / cert spamming

An attacker floods the sign endpoint to exhaust resources or to fill the DB with issued certs.

  • What they gain. Degraded availability for legitimate issuance.

  • What stops them. step-ca has built-in per-provisioner rate limits. nginx-proxy can add an outer rate limit. Docker port binding to a specific interface limits the exposed surface. Host iptables provides a final layer.

  • What monitoring catches it. Request-rate dashboards in Grafana; alerts on sustained high throughput.

5. Targeted phishing of the OIDC flow

A social-engineering attack against a ccatobs/datacenter team member that tricks them into completing a Dex OIDC flow the attacker initiated.

  • Key observation. This attack works identically against a localhost-only CA. Exposing port 9000 worldwide neither helps nor hurts the attacker here — the flow is in the browser, not on the network.

  • What they gain. A 16-hour SSH user cert for the victim’s principal.

  • What stops them. GitHub 2FA on the upstream identity (mandatory on ccatobs org); the team-membership check happens on every login, so any user not currently in ccatobs/datacenter is rejected at Dex; user awareness training; the 16-hour lifetime bounds the blast window; removing the victim from the GitHub team immediately blocks any new authentication attempts.

  • What monitoring catches it. Anomalous issuance patterns for a user (odd hours, unexpected source IP), cross-checked against the user’s usual behavior.

Defense layers

The exposure above is safe because no single layer is load-bearing — each attack scenario is stopped by multiple independent defenses.

  1. Network layer. Optional and cumulative:

    • Uni Köln firewall (outermost — currently closed on TCP 9000, opening is the Phase 3 request)

    • Host iptables on input-b

    • Docker port binding (can bind 9000 to a specific interface only)

    • nginx-proxy IP allowlists (available via proxy/data/vhost.d/ drop-in files if ever needed; not currently used on auth.ccat.uni-koeln.de because Dex has no admin UI to gate)

  2. Application layer. step-ca’s own auth model: every endpoint that issues a cert requires one of the auth mechanisms above. There is no unauthenticated path to issuance.

  3. Provisioner layer. Each provisioner has its own independent auth gate. Compromising one provisioner does not compromise the others.

  4. Authorization gate. Role-based or challenge-based checks on top of authentication: ccatobs/datacenter GitHub team membership enforced by Dex for OIDC, password secrecy for JWK, challenge response for ACME, cert possession for SSHPOP.

  5. Issued-cert constraints. Short lifetimes are the single largest blast-radius reducer:

    • 16h human SSH user certs

    • 24h service-account SSH certs (renewed every 6h)

    • 7d SSH host certs

    • 30–90d TLS certs

    • Signing key never leaves the HSM (Phase 2+)

  6. Target-host opt-in. A cert is only useful against hosts that have been told to trust the CCAT CA, via /etc/ssh/trusted_user_ca_keys deployed by the ca_trust Ansible role. A leaked cert against a host that doesn’t trust us is a leaked cert against a host that doesn’t care.

| Scenario | Stopped by layer(s) |
| --- | --- |
| Recon scans | 1, 2 |
| Known-CVE scanning | Patch cadence + 1, 2 |
| JWK brute force | 1, 2, 3, 4 |
| DoS / cert spam | 1, 2, 5 |
| OIDC phishing | 4, 5, 6 |
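The claim that no single layer is load-bearing can be checked mechanically against the scenario-to-layer mapping above. This is a minimal sketch that transcribes the mapping into a dictionary; the function name is hypothetical, and patch cadence is left out because it is not one of the six numbered layers.

```python
# Scenario -> numbered defense layers, transcribed from the mapping above.
STOPPED_BY = {
    "Recon scans": {1, 2},
    "Known-CVE scanning": {1, 2},  # plus patch cadence (not a numbered layer)
    "JWK brute force": {1, 2, 3, 4},
    "DoS / cert spam": {1, 2, 5},
    "OIDC phishing": {4, 5, 6},
}


def uncovered_if_removed(table: dict, layer: int) -> list:
    """Scenarios left with no defense at all if `layer` failed completely."""
    return [name for name, layers in table.items() if not (layers - {layer})]


for layer in range(1, 7):
    print(layer, uncovered_if_removed(STOPPED_BY, layer))
```

An empty list for every layer means each scenario still has at least one remaining defense when any one layer is taken away.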

The Phase 3 decision: worldwide port 9000

The team’s current direction is yes: request that TCP 9000 inbound on ca.ccat.uni-koeln.de be opened worldwide, and replace the Phase 1 trust-bundle hack behind nginx-proxy with direct exposure.

Why it is defensible.

  • step-ca is designed for public internet exposure. Smallstep’s own commercial offering runs this way, as do many third-party hosted step-ca deployments. The protocols it speaks (ACME, OIDC) require public reachability for large parts of the client base.

  • The auth gates do the work. Nothing in the threat model above gets easier for an attacker when the endpoint moves from “reachable from Uni Köln networks” to “reachable worldwide.” The network-layer restriction is not the security boundary.

  • Let’s Encrypt publishes the same kind of metadata to the entire internet and issues certs that far more people trust than anything CCAT will ever sign.

What it buys operationally.

  • Developers can SSH to CCAT hosts from anywhere (home, conferences, observing runs) without needing a VPN or Uni Köln network access. This matches the “GitHub identity + 16h SSH cert” model we chose in the first place.

  • ACME clients for *.ccat.uni-koeln.de endpoints can reach us from wherever they need to.

  • SSHPOP renewal works for hosts at remote sites (Chile, US) without complicated site-to-site tunneling.

What it does not change.

  • The attack surface in “public-by-design endpoints” above is identical whether the network range is “Uni Köln” or “the world.” Auth gates are what stop issuance, not source IP.

  • Incident-response procedures are unchanged.

What mitigations remain available even with worldwide exposure.

  • Docker port binding can restrict which host interface 9000 binds to.

  • Host iptables can drop or rate-limit traffic from specific ranges.

  • fail2ban watching the step-ca log can block sources with repeated auth failures.

  • nginx-proxy with SNI passthrough + ACLs remains an option for Phase 2+ if we ever decide to narrow exposure again.

  • Uni Köln firewall can be reclosed at any time — the opening is reversible.

Phase 3 operational hardening checklist

These items are not blockers for Phase 1 dry-run or Phase 2 HSM rehearsal. They are blockers for Phase 3, when real services start trusting the CA.

  • step-ca logs shipped to Loki via promtail and visible in Grafana.

  • Grafana dashboard: cert issuance per provisioner per hour, with baseline annotations.

  • Alert: 4xx error rate spikes above baseline (sign of abuse, misconfig, or scanning).

  • Alert: repeated auth failures from a single source IP above a threshold in a rolling window.

  • fail2ban (or equivalent) watching the step-ca log and temporarily blocking sources that trip the repeated-failure threshold.

  • Subscribed to Smallstep security advisories; upgrade procedure for step-ca documented in CCAT Certificate Authority — Operations Guide and tested on staging.

  • Weekly “who got certs, who tried and failed” review as part of the security hygiene rhythm.

  • JWK provisioner password rotation procedure documented and rehearsed end-to-end.

  • Backup verification: step-ca-data volume backed up, and a restore rehearsed into a scratch environment. Dex state is regenerated from step-ca/dex/config.yaml in git; no separate backup needed.

  • HSM #1 access procedure (the offline root ceremony for intermediate rotation) documented and walked through by at least two operators.

  • Dex GitHub OAuth App audit: client ID + secret in vault, app restricted to the ccatobs org, scope is read:org, no other apps share the secret.

  • GitHub team ccatobs/datacenter membership reviewed — everyone in the team should have a current operational need for SSH access to CCAT Data Center hosts. Leavers pruned.
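The "repeated auth failures from a single source IP in a rolling window" item in the checklist above can be sketched as a small sliding-window counter. This illustrates the logic only: the class name is hypothetical, the threshold and window values are placeholders rather than tuned settings, and extracting the source IP and failure events from the step-ca log is left out.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class FailureWindow:
    """Track auth failures per source IP inside a rolling time window.

    Assumes events are recorded in chronological order.
    """

    def __init__(self, threshold: int, window: timedelta):
        self.threshold = threshold
        self.window = window
        self._events = defaultdict(deque)

    def record(self, source_ip: str, when: datetime) -> bool:
        """Record one failure; return True if the source is at or over threshold."""
        events = self._events[source_ip]
        events.append(when)
        # Drop failures that have fallen out of the rolling window.
        cutoff = when - self.window
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) >= self.threshold


# Placeholder settings (assumptions): 3 failures within 60 seconds.
detector = FailureWindow(threshold=3, window=timedelta(seconds=60))
t0 = datetime(2025, 1, 1, 12, 0, 0)
burst = [detector.record("203.0.113.7", t0 + timedelta(seconds=s)) for s in (0, 10, 20)]
print(burst)
```

A source that fails slowly enough for old events to age out never trips the threshold, which is the behaviour a rolling window is meant to have.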

Incident response sketch

Response procedures by suspected failure mode. In every case the short cert lifetimes mean most remediation is “rotate or disable the mechanism that issues, and wait” rather than “chase down every issued cert.”
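How long "wait" means depends on the cert class. A rough way to bound it: a cert issued the instant before the issuing mechanism was disabled expires one full lifetime later. The sketch below uses the lifetimes listed under defense layers, taking 90 days as the upper end of the TLS range; the function and dictionary names are hypothetical, and clock skew or backdated validity is ignored.

```python
from datetime import datetime, timedelta, timezone

# Maximum lifetimes per cert class, taken from the defense-layers list.
MAX_LIFETIME = {
    "SSH user cert": timedelta(hours=16),
    "SSH service-account cert": timedelta(hours=24),
    "SSH host cert": timedelta(days=7),
    "TLS cert": timedelta(days=90),
}


def latest_expiry(disabled_at: datetime) -> dict:
    """Latest moment a cert issued before `disabled_at` can still be valid."""
    return {kind: disabled_at + life for kind, life in MAX_LIFETIME.items()}


disabled = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
for kind, expiry in latest_expiry(disabled).items():
    print(f"{kind}: {expiry.isoformat()}")
```

The output makes the trade-off visible: waiting is a realistic strategy for the 16-hour and 24-hour SSH certs, while the multi-day and multi-month certs are the ones where the log audit and targeted revocation described below carry the weight.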

JWK provisioner password leaked

Rotate the password. In-flight short-lived certs expire on their own; no client re-bootstrap is needed because clients authenticate to the CA with the password, not to each other.

ccat secrets rotate vault_step_ca_password --env production && ccat secrets provision --host input-b

After provisioning, restart step-ca on input-b and confirm new issuance works. Audit the step-ca log for any issuance during the exposure window and revoke suspicious certs.
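Auditing the exposure window amounts to filtering log lines by timestamp. The following is a minimal sketch under stated assumptions: it assumes step-ca is logging JSON objects with an ISO 8601 `time` field, and that lines may carry a container-name prefix as in docker compose log output. The function names are hypothetical, and the field name should be checked against the actual log format before relying on this.

```python
import json
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating a missing zone as UTC."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    when = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return when


def audit_window(lines, start: datetime, end: datetime):
    """Return (entries inside [start, end], count of lines that could not be parsed)."""
    matches, skipped = [], 0
    for line in lines:
        brace = line.find("{")  # skip any "container | " prefix
        try:
            entry = json.loads(line[brace:]) if brace != -1 else None
            when = parse_time(entry["time"])  # "time" field name is an assumption
        except (ValueError, KeyError, TypeError, AttributeError):
            skipped += 1
            continue
        if start <= when <= end:
            matches.append(entry)
    return matches, skipped
```

Returning the skipped count alongside the matches is deliberate: in an audit, lines that silently failed to parse are themselves something to look at.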

An SSH user cert has been used maliciously

Identify the principal from the step-ca issuance log. Remove the user from the ccatobs/datacenter GitHub team — Dex checks team membership on every authentication, so the next step ssh login from that user fails immediately. The cert itself expires within 16 hours; no host-side action is required unless the principal shows ongoing activity.

# Find issuance events for a given principal
docker compose logs step-ca | grep '"principal":"alice"'

Dex static client secret leaked

The secret that step-ca uses to authenticate to Dex (vault_dex_stepca_client_secret) could, if leaked, let an attacker exchange OIDC codes for tokens on step-ca’s behalf — useful only in combination with an already-valid user authentication flow, so the risk is bounded. Rotate:

ccat secrets rotate vault_dex_stepca_client_secret --env production
ccat secrets provision --host input-b
ccat ca down && ccat ca up        # reload Dex with new secret
ccat ca provisioner remove CCAT-GitHub
ccat ca provisioner sync          # re-add with new secret

A ccatobs/datacenter team member has gone rogue

Remove them from the ccatobs/datacenter GitHub team. Dex checks team membership live on every authentication, so any new step ssh login will fail at the Dex layer. Their existing SSH user cert expires within 16 hours. For faster eviction from active sessions, force-terminate their SSH connections on the target hosts.

input-b has been compromised

The blast radius depends on the phase.

Phase 1 (file-based intermediate key). The intermediate signing key must be assumed compromised, because it sits on disk. Response: in principle, a full intermediate rotation via an offline root ceremony followed by redeploying trust bundles. Because Phase 1 is throwaway, the simpler answer is to wipe everything and restart the ceremony from scratch.

Phase 2+ (HSM-backed intermediate key). The key itself cannot be extracted from the HSM, but the attacker could have used it while they had access to input-b. Response: rotate the intermediate (ceremony with HSM #1, no root rotation needed), audit the step-ca issuance log for anything signed during the exposure window, and revoke or actively expire any suspicious certs. The root stays intact; clients do not need to re-bootstrap.

In both phases, Dex’s state on input-b is also within blast radius, but Dex has no user database to leak — its entire config is in git and the only secrets it holds are the GitHub OAuth client secret and the static step-ca client secret, both in the Ansible vault. Rotate both as part of the same response.

Further reading