CCAT Certificate Authority — Threat Model & Attack Surface

This document is the security-review companion to CCAT Certificate Authority — Architecture and Design. Where that document explains how the CCAT step-ca deployment is built and operated, this one explains what it is exposed to, what defends it, and what to do when something goes wrong. It is intended as reference material for security review, incident response, and the Phase 3 go/no-go decision on worldwide exposure of ca.ccat.uni-koeln.de:9000.

For PKI fundamentals (what a cert is, how chains work, why short lifetimes matter), see TLS, Certificates, and Public Key Infrastructure. For OS and container patch cadence — which directly affects how fast we close CVEs in step-ca itself — see Patch Management, Container Security & Supply Chains.

Scope

What this document covers. The public and auth-gated attack surface of the deployed step-ca instance, the Dex OIDC provider it relies on, realistic attacker scenarios, the defense-in-depth layers that stop them, and an incident-response sketch per failure mode.

What it does not cover. Cryptographic primitives and PKI theory (see TLS, Certificates, and Public Key Infrastructure), day-to-day operational runbooks such as adding provisioners or rotating the intermediate (see CA provisioner management and CA rotation and disaster recovery), and the broader supply-chain story for OS and container updates (see Patch Management, Container Security & Supply Chains).

Deployment phases

The threat model differs sharply between the rollout phases:

| Phase | Root key | Production trust | Threat-model posture |
| --- | --- | --- | --- |
| Phase 1 (dry-run) | File-based, throwaway | No production service trusts the root yet | Everything issued is a dress rehearsal. A full compromise = wipe and redo the ceremony. Low stakes. |
| Phase 2 (HSM rehearsal) | HSM-backed, real | Still no production trust; rotation drills only | Keys are extraction-resistant. Signing key never leaves the HSM. |
| Phase 3 (steady state) | HSM-backed, real | All CCAT hosts trust the root; SSH, mTLS, internal web UIs chain to it | Real blast radius. The operational hardening checklist below must be complete before entering this phase. |

Scale assumption

CCAT is a small, university-hosted, internal-use PKI: roughly 20 trusted humans (the ccatobs GitHub org) plus a handful of service accounts and hosts. The threat model reflects this — we are not defending a public CA with millions of subscribers. We do not need WebPKI-grade transparency logs, CRLs, or staffed SOCs. We do need the discipline that comes with being an internal trust anchor for real infrastructure.

Attack surface

Two DNS names are relevant:

- ca.ccat.uni-koeln.de: step-ca’s API. Fronted by nginx-proxy on port 443 with a CCAT-rooted TLS cert (issued by step-ca itself, renewed via systemd timer), so step-cli’s RootCAs-only verification works without any client-side trust-bundle plumbing. step-ca’s native :9000 is also published on input-b’s host for same-host workflows but is firewalled to the Uni Köln /16 (and dropped between subnets by Uni IT regardless), so :443 is the universal client path.
- auth.ccat.uni-koeln.de: Dex’s OIDC endpoints and GitHub login redirect page, on port 443 with Let’s Encrypt termination via acme-companion. Dex has no admin UI and no path that needs IP gating.

Public-by-design endpoints

These endpoints are meant to be reachable by anyone. Serving them to the entire internet is not a security regression — it is how the protocols are specified. The equivalent endpoints on Let’s Encrypt and every other public CA are worldwide-reachable by design.

| Endpoint | Purpose | What an attacker learns |
| --- | --- | --- |
| `GET /health` | Liveness probe | The CA is up |
| `GET /roots.pem` | Public root certificate | The public half of our trust anchor, which is exactly the material every client has to fetch anyway |
| `GET /provisioners` | Provisioner discovery | Provisioner names, types, and public config: OIDC issuer URL, OIDC client ID, allowed group claims, ACME directory URL |
| ACME directory (`/acme/acme/directory`) | RFC 8555 discovery | Standard ACME endpoints |
| Dex OIDC discovery (`/.well-known/openid-configuration`) | OIDC metadata | Issuer URL, JWKS URI, supported flows |

What is not in those responses:

- JWK provisioner passwords (these are held in the Ansible vault and never served)
- OIDC client secrets (held by Dex via its static clients config, never emitted)
- Signing keys or any secret material; the CA emits only public certs
- User identities, SSH principal lists, or issued-cert history

Publishing provisioner metadata is intentional. A client that wants to request a cert needs to know which provisioners exist and how to authenticate to them. This is the same discovery model Let’s Encrypt uses via its ACME directory.
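
As a rough illustration of how small this public surface is, the sketch below lists the CA's public-by-design URLs from the table and probes each one with curl. The `CA_URL` default and both function names are assumptions made for this example; only the paths come from the table above. The Dex discovery document lives on the other hostname and is left out here.

```shell
# Base URL of the CA API; override via the environment if needed (assumed default).
CA_URL="${CA_URL:-https://ca.ccat.uni-koeln.de}"

# Print one public-by-design URL per line (paths from the endpoint table).
public_endpoints() {
  for path in /health /roots.pem /provisioners /acme/acme/directory; do
    printf '%s%s\n' "$CA_URL" "$path"
  done
}

# Probe each URL and print "<http status> <url>".
# $1: PEM file with the CCAT root certificate, used to verify the TLS chain.
probe_public_endpoints() {
  public_endpoints | while read -r url; do
    status=$(curl --silent --output /dev/null --write-out '%{http_code}' \
      --cacert "$1" "$url")
    printf '%s %s\n' "$status" "$url"
  done
}
```

A call such as `probe_public_endpoints ~/.step/certs/root_ca.crt` from a client would be expected to report a 200 for each URL when the endpoints are reachable. Any response body containing secret material would be a finding.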

Auth-gated endpoints — where the real security lives

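Everything that actually issues a cert sits behind one of four authentication mechanisms (OIDC, JWK, ACME, or SSHPOP), each with its own strength profile. The JWK mechanism backs two groups of provisioners, so five entries are listed below.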

POST /1.0/ssh/sign — OIDC (CCAT-GitHub provisioner)

Requires a Dex-issued OIDC token whose groups claim contains the ccatobs/datacenter GitHub team slug. Strong by design: an attacker has to clear three independent gates — a valid GitHub identity, successful GitHub OAuth with read:org scope, and actual membership in the ccatobs/datacenter team at the moment Dex calls GitHub’s team-membership endpoint. Membership is checked live against GitHub on every authentication, so a user removed from the team cannot authenticate even if their browser session is still warm. Output is a 16-hour SSH user cert with the user’s principal only.
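
To make the output constraint concrete, the following sketch extracts the validity window and the principals from the text that `ssh-keygen -L` prints for an SSH certificate. The function name and the certificate path in the usage line are hypothetical, and the parsing assumes OpenSSH's usual `Valid: from ... to ...` and `Principals:` layout.

```shell
# Summarise `ssh-keygen -L` output read from stdin:
# prints the validity window, then one line per principal.
ssh_cert_summary() {
  awk '
    $1 == "Valid:"      { print "valid_from=" $3; print "valid_to=" $5; next }
    $1 == "Principals:" { in_principals = 1; next }
    # Principal lines are single indented tokens that follow "Principals:".
    in_principals && NF == 1 && $1 !~ /:$/ { print "principal=" $1; next }
    { in_principals = 0 }
  '
}
```

For a cert stored on disk this could be run as `ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub | ssh_cert_summary`. A cert from this provisioner would be expected to show a window of about 16 hours and a single principal.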

POST /1.0/sign — JWK (prod-services, staging-services)

Requires knowledge of the provisioner password (STEP_CA_PASSWORD, held in the Ansible vault as vault_step_ca_password). Medium strength: it is a single long-lived secret, but it is high-entropy, scoped to a single provisioner, and never leaves the vault except during ccat secrets provision. If an attacker has vault access, we already have a much larger problem. Output is 30- or 90-day x509 certs with principal restrictions enforced by the provisioner template.

POST /1.0/ssh/sign — JWK (service-accounts)

Same mechanism as above, but issues 24-hour SSH service certs that are auto-renewed every 6 hours by the target host. Blast radius on compromise is small because the certs expire quickly on their own.

POST /acme/... — ACME challenge response

Strong by protocol design: the attacker must prove control of the hostname they are requesting a cert for, via HTTP-01, DNS-01, or TLS-ALPN-01. An attacker who does not control example.ccat.uni-koeln.de cannot satisfy a challenge for it, full stop. Output is 90-day x509.

POST /1.0/ssh/renew — SSHPOP

Requires possession of a currently-valid SSH host cert. There is no bootstrap path through this endpoint — it only renews an existing cert, it never issues the first one. An attacker who already has a valid host cert is an attacker who already has the host.

Realistic attack scenarios

Ordered roughly from “happens every day” to “we hope this is hypothetical.”

1. Random reconnaissance scans

Botnets sweep the internet continuously. Exposing the CA endpoint worldwide means we will see constant, low-grade scan traffic.

What they find. /health, /roots.pem, /provisioners. Everything public-by-design.
What’s actionable for them. Nothing. The information is equivalent to what Let’s Encrypt publishes about itself.
What stops them. Nothing needs to — there’s nothing to steal.
What monitoring catches it. Loki + Grafana will show steady low-rate 200s on the public endpoints. Useful as a baseline.

2. Vulnerability scanning against known step-ca CVEs

Scanners try CVEs indiscriminately. step-ca is open source and actively maintained by Smallstep.

What they gain. If we’re patched, nothing. If we’re not, it depends on the CVE.
What stops them. Prompt patching. Subscribe to Smallstep’s security advisories. See Patch Management, Container Security & Supply Chains for the general upgrade cadence story.
What monitoring catches it. 4xx/5xx spikes, odd user-agent strings in Loki, Grafana alerts on error-rate anomalies.

3. Brute-force against JWK provisioner passwords

The prod-services / staging-services / service-accounts provisioners authenticate with a password. An attacker who knows the provisioner name could attempt to guess the password by repeatedly POSTing sign requests.

What they gain. Effectively nothing: the password is high-entropy (generated via ccat secrets rotate), and step-ca enforces request rate limits. A 128-bit password has 2^128 possible values, so an online guessing attack against a rate-limited endpoint would not finish on any human timescale.
What stops them. Password entropy + step-ca rate limits + nginx-proxy rate limiting if needed + fail2ban on repeated 401s.
What monitoring catches it. Repeated auth failures from the same source IP in the step-ca log.
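
The arithmetic behind that claim can be sketched as follows. The guess rate of 1000 requests per second is a deliberately generous, made-up figure for illustration, not a measured or configured limit, and the function name is hypothetical.

```shell
# Expected years to brute-force a secret of the given bit length at a given
# guess rate (guesses per second). On average an attacker has to try half
# the keyspace, i.e. 2^(bits-1) guesses.
bruteforce_years() {
  awk -v bits="$1" -v rate="$2" 'BEGIN {
    seconds_per_year = 365.25 * 24 * 3600
    printf "%.2e\n", 2 ^ (bits - 1) / rate / seconds_per_year
  }'
}

bruteforce_years 128 1000   # prints 5.39e+27
```

Even if the real rate were many orders of magnitude higher, the result would remain far beyond any practical horizon, which is why entropy rather than rate limiting carries this defense.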

4. Denial of service / cert spamming

Flood the sign endpoint to exhaust resources or fill the DB with issued certs.

What they gain. Degraded availability for legitimate issuance.
What stops them. step-ca has built-in per-provisioner rate limits. nginx-proxy can add an outer rate limit. Docker port binding to a specific interface limits the exposed surface. Host iptables provides a final layer.
What monitoring catches it. Request-rate dashboards in Grafana; alerts on sustained high throughput.

5. Targeted phishing of the OIDC flow

A social-engineering attack against a ccatobs/datacenter team member that tricks them into completing a Dex OIDC flow the attacker initiated.

Key observation. This attack works identically against a localhost-only CA. Exposing the CA endpoint to a wider network neither helps nor hurts the attacker here, because the flow runs in the victim’s browser, not against the CA’s network listener.
What they gain. A 16-hour SSH user cert for the victim’s principal.
What stops them. GitHub 2FA on the upstream identity (mandatory for the ccatobs org); the team-membership check on every login, which rejects at Dex any user not currently in ccatobs/datacenter; user awareness training; the 16-hour lifetime, which bounds the blast window; and removal of the victim from the GitHub team, which immediately blocks any new authentication attempts.
What monitoring catches it. Anomalous issuance patterns for a user (odd hours, unexpected source IP), cross-checked against the user’s usual behavior.

Defense layers

The exposure above is safe because no single layer is load-bearing — each attack scenario is stopped by multiple independent defenses.

1. Network layer. Optional and cumulative:
   - Uni Köln firewall (outermost; currently closed on TCP 9000, opening is the Phase 3 request)
   - Host iptables on input-b
   - Docker port binding (can bind 9000 to a specific interface only)
   - nginx-proxy IP allowlists (available via proxy/data/vhost.d/ drop-in files if ever needed; not currently used on auth.ccat.uni-koeln.de because Dex has no admin UI to gate)
2. Application layer. step-ca’s own auth model: every endpoint that issues a cert requires one of the auth mechanisms above. There is no unauthenticated path to issuance.
3. Provisioner layer. Each provisioner has its own independent auth gate. Compromising one provisioner does not compromise the others.
4. Authorization gate. Role-based or challenge-based checks on top of authentication: ccatobs/datacenter GitHub team membership enforced by Dex for OIDC, password secrecy for JWK, challenge response for ACME, cert possession for SSHPOP.
5. Issued-cert constraints. Short lifetimes are the single largest blast-radius reducer:
   - 16h human SSH user certs
   - 24h service-account SSH certs (renewed every 6h)
   - 7d SSH host certs
   - 30–90d TLS certs
   - Signing key never leaves the HSM (Phase 2+)
6. Target-host opt-in. A cert is only useful against hosts that have been told to trust the CCAT CA, via /etc/ssh/trusted_user_ca_keys deployed by the ca_trust Ansible role. A leaked cert against a host that doesn’t trust us is a leaked cert against a host that doesn’t care.

| Scenario | Stopped by layer(s) |
| --- | --- |
| Recon scans | 1, 2 |
| Known-CVE scanning | Patch cadence + 1, 2 |
| JWK brute force | 1, 2, 3, 4 |
| DoS / cert spam | 1, 2, 5 |
| OIDC phishing | 4, 5, 6 |
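
The target-host opt-in layer can be spot-checked on any host. The sketch below reads an sshd configuration file and prints the path given to OpenSSH's `TrustedUserCAKeys` directive, if one is set. The function name is hypothetical, and the check is simplified: it does not follow `Include` directives or `Match` blocks.

```shell
# Print the TrustedUserCAKeys path configured in an sshd_config file, if any.
# $1: config file to inspect (defaults to /etc/ssh/sshd_config).
# sshd keywords are case-insensitive, so the comparison is lowercased.
# Commented-out lines do not match because their first field starts with "#".
trusted_user_ca_path() {
  awk 'tolower($1) == "trustedusercakeys" { print $2; exit }' \
    "${1:-/etc/ssh/sshd_config}"
}
```

On a host managed by the ca_trust role this would be expected to print /etc/ssh/trusted_user_ca_keys, while an empty result suggests the host does not accept CCAT-signed user certs. Running `sshd -T` as root shows the effective configuration and is the more reliable check where it is available.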

The Phase 3 decision: per-vhost cert split on :443

The team explored two designs for exposing the CA API to the internal client population (and eventually the wider observatory team):

Plan B: open TCP 9000 worldwide and have step-cli talk directly to step-ca’s native TLS endpoint, bypassing nginx-proxy.
Plan B-revised (chosen): keep nginx-proxy but give the ca.ccat.uni-koeln.de vhost a CCAT-rooted cert (issued by step-ca itself via the prod-services JWK provisioner, renewed by a systemd timer); other vhosts in the same proxy stack keep Let’s Encrypt because they serve browsers and OAuth callbacks.

Plan B failed in practice because Uni Köln IT drops :9000 between subnets — the only hosts that could reach :9000 directly were on input-b’s own /24, which excludes essentially all clients. Plan B-revised sidesteps the firewall constraint entirely (since :443 is already permitted everywhere) without sacrificing the CCAT trust chain that step-cli’s RootCAs-only verification requires.

Why it is defensible from a threat-model standpoint.

step-ca is designed for public internet exposure. Smallstep’s own commercial offering runs this way, as do many third-party hosted step-ca deployments. The protocols it speaks (ACME, OIDC) require public reachability for large parts of the client base.
The auth gates do the work. Nothing in the threat model above gets easier for an attacker when the endpoint moves from “reachable from one /24” to “reachable from the whole campus” or “the world.” The network-layer restriction is not the security boundary.
Per-vhost cert split also dodges the operational fragility of the Phase 1 trust-bundle workaround (SSL_CERT_FILE + appended PEMs in ~/.step/certs/root_ca.crt broke step ssh certificate with multi-PEM errors). The CA presents a CCAT-rooted chain at the wire and clients have no client-side bundle plumbing to maintain.

What it buys operationally.

Developers can SSH to CCAT hosts from anywhere on the Uni Köln network with the same cert flow. Worldwide reach is an incremental step from here: today the Uni IT firewall restricts who can reach :443 (and the :9000 same-host rule), and it can be opened later without architectural change.
SSHPOP renewal works for hosts at remote sites that can reach :443.

What it does not change.

The attack surface in “public-by-design endpoints” above is identical whether the network range is “input-b /24”, “Uni Köln”, or “the world.” Issuance is stopped by the auth gates, not by source-IP filtering.
Incident-response procedures are unchanged.

Mitigations available now and later.

nginx-proxy can apply per-vhost ACLs (already used to lock the CA vhost to the Uni Köln /16; see proxy/data/vhost.d/).
step-ca’s native :9000 is gated by firewalld via the hsm_host role’s ca_allowed_source_cidrs variable.
fail2ban watching the step-ca log can block sources with repeated auth failures.
Uni Köln firewall opening is reversible; the per-vhost cert posture works regardless of whether the subnet is widened, narrowed, or worldwide.
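
The fail2ban-style detection mentioned above comes down to counting failed authentications per source address. The sketch below does that over log lines read from stdin. It assumes step-ca is logging JSON objects that carry `status` and `remote-address` fields, and that assumption should be checked against the actual log output before relying on it. The function name is hypothetical.

```shell
# Count HTTP 401 responses per source address, busiest source first.
# Reads log lines from stdin and prints "<count> <address>" per source.
# Assumed log shape: JSON objects with "status" and "remote-address" fields.
count_auth_failures() {
  grep -E '"status":401[,}]' \
    | sed -n 's/.*"remote-address":"\([^"]*\)".*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

Piping the existing log command into it, for example `docker compose logs step-ca | count_auth_failures`, would give a quick ranking of noisy sources during triage.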

Incident response sketch

Response procedures by suspected-failure mode. In every case the short cert lifetimes mean most remediation is “revoke the mechanism that issues, and wait” rather than “chase down every issued cert.”

JWK provisioner password leaked

Rotate the password. In-flight short-lived certs expire on their own; no client re-bootstrap is needed because clients authenticate to the CA with the password, not to each other.

ccat secrets rotate vault_step_ca_password --env production && ccat secrets provision --host input-b

After provisioning, restart step-ca on input-b and confirm new issuance works. Audit the step-ca log for any issuance during the exposure window and revoke suspicious certs.
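
A minimal sketch of the log audit step, assuming each step-ca log line is a JSON object with a `time` field holding an RFC 3339 UTC timestamp in a uniform format (such timestamps compare correctly as plain strings). The function name and the timestamps in the usage line are hypothetical.

```shell
# Print log lines whose "time" field falls within [start, end], inclusive.
# $1: window start, $2: window end, both in the same format as the log field.
logs_in_window() {
  awk -v start="$1" -v end="$2" '
    match($0, /"time":"[^"]*"/) {
      # Skip the 8 characters of the leading "time":" and drop the closing quote.
      ts = substr($0, RSTART + 8, RLENGTH - 9)
      if (ts >= start && ts <= end) print
    }'
}
```

For example, `docker compose logs step-ca | logs_in_window 2024-05-01T12:00:00Z 2024-05-01T23:59:59Z` would narrow the log to the suspected exposure window before issuance events are reviewed by hand.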

An SSH user cert has been used maliciously

Identify the principal from the step-ca issuance log. Remove the user from the ccatobs/datacenter GitHub team — Dex checks team membership on every authentication, so the next step ssh login from that user fails immediately. The cert itself expires within 16 hours; no host-side action is required unless the principal shows ongoing activity.

# Find issuance events for a given principal
docker compose logs step-ca | grep '"principal":"alice"'

Dex static client secret leaked

The secret that step-ca uses to authenticate to Dex (vault_dex_stepca_client_secret) could, if leaked, let an attacker exchange OIDC codes for tokens on step-ca’s behalf — useful only in combination with an already-valid user authentication flow, so the risk is bounded. Rotate:

ccat secrets rotate vault_dex_stepca_client_secret --env production
ccat secrets provision --host input-b
ccat ca down && ccat ca up        # reload Dex with new secret
ccat ca provisioner remove CCAT-GitHub
ccat ca provisioner sync          # re-add with new secret

A ccatobs/datacenter team member has gone rogue

Remove them from the ccatobs/datacenter GitHub team. Dex checks team membership live on every authentication, so any new step ssh login will fail at the Dex layer. Their existing SSH user cert expires within 16 hours. For faster eviction from active sessions, force-terminate their SSH connections on the target hosts.

input-b has been compromised

The blast radius depends on the phase.

Phase 1 (file-based intermediate key). The intermediate signing key must be assumed compromised because it sits on disk. Response: in principle, a full intermediate rotation via an offline root ceremony, followed by redeploying trust bundles. Because Phase 1 keys are throwaway, the simpler answer is to wipe and restart the ceremony from scratch.

Phase 2+ (HSM-backed intermediate key). The key itself cannot be extracted from the HSM, but the attacker could have used it while they had access to input-b. Response: rotate the intermediate (ceremony with HSM #1, no root rotation needed), audit the step-ca issuance log for anything signed during the exposure window, and revoke or actively expire any suspicious certs. The root stays intact; clients do not need to re-bootstrap.

In both phases, Dex’s state on input-b is also within blast radius, but Dex has no user database to leak — its entire config is in git and the only secrets it holds are the GitHub OAuth client secret and the static step-ca client secret, both in the Ansible vault. Rotate both as part of the same response.