CCAT Certificate Authority — Operations Guide#

This document describes how the CCAT Observatory’s private Certificate Authority is designed, commissioned, and operated. It is the working reference for anyone touching the CA — whether adding a new provisioner, rotating the intermediate, recovering from a failure, or just trying to understand why a Redis client can’t validate a cert.

For fundamentals (what is a cert, how TLS handshakes work, the role of public vs private keys), see TLS, Certificates, and Public Key Infrastructure. This document assumes you already know those concepts and focuses on our specific setup, the reasons for it, and the runbooks that keep it healthy.

What the CA is for#

Before the CA existed, every TLS or mTLS need across the CCAT stack was solved with a hand-rolled OpenSSL script. Redis mTLS has a ccat redis-certs generate workflow that runs eight openssl commands to produce a per-variant CA and a set of client/server certs. Postgres replication uses its own cert pair. Developer SSH access uses static ~/.ssh/authorized_keys files per machine. Every service reinvents the cert layer at a slightly different angle.

The CA replaces that pattern. Instead of “each service has its own trust root and its own cert-generation script,” every service gets its certs from one central authority that issues short-lived certs on demand. The benefits compound:

  • One trust root. Clients bootstrap against the CCAT root once and trust everything it signs — Redis, Postgres, SSH hosts, internal web UIs, bbcp endpoints.

  • Short-lived certs. 16-hour SSH user certs, 30–90 day TLS certs, 7-day SSH host certs — no more “rotate these 12-month certs once a year and hope nothing breaks.” Expiry becomes a health property, not a calendar event.

  • Identity-aware issuance. SSH user certs come from GitHub OAuth via Dex, with ccatobs GitHub team membership as the authorization gate. No more adding public keys to authorized_keys files across machines — if you’re in the ccatobs/datacenter team, you run step ssh login, get a fresh cert valid for 16 hours, and SSH in. Off-boarding is “remove from the GitHub team” — certs expire on their own, no key-removal ceremony.

  • Automation paths. ACME and SSHPOP provisioners let services auto-renew their own certs without human touch.

The CA is not a goal in itself — it is how we stop writing cert-management code.

Trust architecture: the two-tier, two-HSM model#

The CA is private — nothing publicly trusted on the wider internet connects to it. It is used only by CCAT hosts, containers, and developer laptops that have explicitly bootstrapped against our root. This shapes the threat model and the choices below.

Two tiers: root and intermediate#

Every X.509 PKI has a notion of a trust hierarchy. We use the standard two-tier layout:

  • Root CA: the ultimate trust anchor. All CCAT certs chain back to it. Used only to sign the intermediate. Rotated ~never (the ceremony below issues it with a ~27.5-year lifetime).

  • Intermediate CA: the working signing key. Signs every cert step-ca issues day-to-day. Rotated every 5–10 years or immediately upon suspected compromise.

The reason this separation exists: if the intermediate is compromised, you can revoke and replace it by pulling the root out of the safe and doing a controlled ceremony — clients keep trusting the (unchanged) root. If the root is compromised, every client must re-bootstrap against a new root, which for CCAT means touching every server and every developer laptop. We optimize for recovery from intermediate compromise, not for root compromise never happening.
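The recovery asymmetry is easy to demonstrate with a toy chain. The sketch below builds a throwaway two-tier hierarchy with openssl (illustration only; the real CCAT keys never exist as files) and shows that a client holding only the root cert validates a leaf signed by the intermediate:

```shell
# Toy two-tier chain in a scratch dir; throwaway keys only.
cd "$(mktemp -d)"

# 1) Self-signed "root" (CA:TRUE comes from the default req -x509 extensions).
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 -nodes \
  -subj "/CN=Toy Root" -days 30 -keyout root.key -out root.crt 2>/dev/null

# 2) "Intermediate": CSR signed by the root, marked as a CA.
openssl req -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 -nodes \
  -subj "/CN=Toy Intermediate" -keyout int.key -out int.csr 2>/dev/null
printf 'basicConstraints=critical,CA:TRUE\n' > int.ext
openssl x509 -req -in int.csr -CA root.crt -CAkey root.key -CAcreateserial \
  -extfile int.ext -days 30 -out int.crt 2>/dev/null

# 3) Leaf signed by the intermediate.
openssl req -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 -nodes \
  -subj "/CN=service.local" -keyout leaf.key -out leaf.csr 2>/dev/null
openssl x509 -req -in leaf.csr -CA int.crt -CAkey int.key -CAcreateserial \
  -days 7 -out leaf.crt 2>/dev/null

# The client holds ONLY root.crt as a trust anchor; the intermediate
# travels alongside the leaf as "untrusted" chain material.
openssl verify -CAfile root.crt -untrusted int.crt leaf.crt   # → leaf.crt: OK
```

Because verification needs only root.crt, re-issuing int.crt (and re-signing leaves) changes nothing on the client side, while replacing root.crt invalidates every verifier's trust anchor.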

Two HSMs: offline root, online intermediate#

We use two Nitrokey HSM 2 dongles:

| Role | HSM | Physical location | Online? |
|---|---|---|---|
| Root signing key | HSM #1 | Locked safe, ideally off-site | Never |
| Intermediate signing key | HSM #2 | R640 internal USB on input-b | Always |

HSM #1 comes out of the safe only during signing ceremonies (once at commissioning, then once per intermediate rotation). It is plugged into an air-gapped laptop for those ceremonies, never into input-b.

HSM #2 lives inside the R640 chassis permanently, in the internal USB-A port that sits on the motherboard. Getting physical access to it requires pulling the server out of its rack and opening the lid — a bar high enough that the “someone unplugs the dongle” threat is effectively closed. This is only a partial protection though: see the next section.

What the HSMs actually protect#

An HSM protects against key extraction, not against key use. The private key is generated on the device and is physically impossible to export — even root on the host cannot read the key bytes. But anyone with:

  1. access to the machine the HSM is plugged into, and

  2. the HSM user PIN

can invoke signing operations via PKCS#11. The HSM will happily comply — that’s its job.

This distinction drives several design choices:

  • The intermediate PIN lives in the CCAT vault (encrypted with ansible-vault) and is rendered into /opt/data-center/system-integration/.env on input-b as STEP_CA_HSM_PIN. step-ca’s ca.json references it via pin-source=/run/secrets/hsm-pin mounted from a tmpfs. The PIN is therefore reachable by any process running as root on input-b. An attacker who owns input-b can ask HSM #2 to sign arbitrary certs for the duration of their access.

  • Accepting that risk is fine because when you kick the attacker out, the key is still on the dongle. Recovery is “rotate the intermediate” — pull HSM #1 from the safe, do a ceremony, install the new intermediate on HSM #2 (or a fresh dongle), restart step-ca. Clients do not notice because the root is unchanged. Total downtime: ~1 hour, mostly ceremony overhead. Compare to a file-on-disk intermediate, where the attacker walks off with the key and can continue issuing certs even after being expelled — that requires a root rotation, which is catastrophic.

  • The root HSM is never plugged into anything networked. Period. If you need the root to sign something, you do it on a freshly wiped laptop, offline, with pre-printed procedures. The root user PIN is never typed on input-b or saved in any vault. It is kept on paper in the safe, in a sealed envelope alongside the dongle.

  • HSM #1 failure = emergency root rotation. We do not maintain DKEK backup shares for Nitrokey key recovery. If HSM #1 physically fails (electronics die in the safe), we have no way to recover the root key, and every CCAT client must re-bootstrap. This is a deliberate simplification for our scale (~20 clients) — the recovery pain is bounded, and DKEK introduces its own operational complexity. Revisit if the observatory grows past ~50 clients.
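For reference, the pin-source arrangement described above maps to a small ca.json fragment. This is a sketch with illustrative values, not the deployed config; the exact PKCS#11 URIs are fixed at Phase 2 cutover:

```json
{
  "key": "pkcs11:token=ccat-intermediate;object=ccat-intermediate-key?pin-source=/run/secrets/hsm-pin",
  "kms": {
    "type": "pkcs11",
    "uri": "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so;token=ccat-intermediate"
  }
}
```

step-ca resolves the key through the kms module at startup; the PIN file on the tmpfs is read at that point, which is why the entrypoint must write it before step-ca starts.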

Why step-ca is NOT behind nginx-proxy#

During Phase 1 commissioning we initially put step-ca behind nginx-proxy with Let’s Encrypt TLS termination, for the same reason Dex uses it: it’s easy and browsers like LE. This was a mistake for step-ca specifically, and we fixed it.

The problem: step-cli clients verify the CA’s TLS cert against the CCAT root they downloaded during step ca bootstrap. When nginx-proxy terminates TLS with an LE cert, the cert presented to step-cli is signed by Let’s Encrypt, not by the CCAT root. Verification fails with x509: certificate signed by unknown authority, every bootstrapped client breaks.

The fix is to expose step-ca’s native TLS on port 9000, bypassing nginx-proxy for the CA API entirely. step-ca always listens on port 9000 internally with its own TLS cert, signed by the CCAT intermediate and therefore chaining to the CCAT root. Clients hit https://ca.ccat.uni-koeln.de:9000 and the trust chain works as step-cli expects.

Dex stays behind nginx-proxy on port 443 with LE, because browsers (and GitHub’s OAuth callback) genuinely need a publicly-trusted cert on that endpoint.

So the CA stack has two distinct TLS endpoints with two different trust models:

| Endpoint | Port | TLS signed by | Trusted by |
|---|---|---|---|
| ca.ccat.uni-koeln.de | 9000 | CCAT intermediate (chains to CCAT root) | step-cli clients via step ca bootstrap |
| auth.ccat.uni-koeln.de | 443 | Let’s Encrypt | browsers (Dex login flow), GitHub OAuth |

Firewall requirement: port 9000 inbound to input-b must be open from all sites where developers or services need to reach the CA. Until the firewall rule is in place, off-network clients can use the trust-bundle workaround documented in the Troubleshooting section.

Let’s Encrypt layering#

The CA has two public-facing HTTPS endpoints: ca.ccat.uni-koeln.de (step-ca itself) and auth.ccat.uni-koeln.de (Dex). These are the URLs that developer laptops, step ca bootstrap, and GitHub’s OAuth callback hit over the public internet.

These endpoints are served by the existing nginx-proxy + acme-companion stack on input-b, with certs from Let’s Encrypt — not from our own CCAT CA. The reasons:

  1. Browsers and GitHub’s OAuth callback will not follow redirects to a TLS endpoint signed by an untrusted CA. Using the CCAT root for ca.ccat.uni-koeln.de would mean every step ca bootstrap needs a manually-verified pre-shared fingerprint — a chicken-and-egg problem.

  2. Let’s Encrypt is free, automated, and trusted by every OS. We get working HTTPS on both domains with zero additional infrastructure and zero per-client trust configuration.

Two ACME endpoints exist on input-b once commissioning is done, and they are not the same thing:

  • Let’s Encrypt ACME at acme-v02.api.letsencrypt.org — used by acme-companion to obtain public TLS certs for ca.ccat.uni-koeln.de and auth.ccat.uni-koeln.de. Renewed automatically every ~60 days.

  • step-ca ACME at https://ca.ccat.uni-koeln.de/acme/acme/directory — used by internal services (future: cert-manager in K8s, or a simple step command on each host) to obtain CCAT-issued internal TLS certs. Provisioner is added to step-ca during commissioning.

Both speak the same ACME protocol but serve different trust domains. Don’t confuse them.

Physical and network preconditions#

Before commissioning, these must all be true:

  1. input-b is physical (R640 in a locked HA hall). The internal USB port on the motherboard is accessible. Physical access to the room is gated.

  2. DNS records exist for ca.ccat.uni-koeln.de and auth.ccat.uni-koeln.de, both pointing at input-b’s public IP.

  3. Firewall: ports 80 and 443 are reachable from the public internet (for Let’s Encrypt HTTP-01 challenges and client traffic).

  4. nginx-proxy + acme-companion is already running on input-b via docker-compose.proxy.yml (ccat proxy status).

  5. GitHub OAuth App has been created in the ccatobs organization with callback URL https://auth.ccat.uni-koeln.de/callback and read:org scope (Dex needs it to check team membership).

  6. GitHub team ccatobs/datacenter exists and contains the people who should get SSH access via step ssh login. Dex rejects everyone outside this team at the authentication step.

Commissioning strategy — phases#

We deliberately commission the CA in phases, using the HSM arrival as a built-in rehearsal of the most dangerous operation in the CA’s lifetime (root rotation). The phases are:

| Phase | What | When | Outcome |
|---|---|---|---|
| Phase 1 | Dry-run commissioning with a throwaway auto-init root | Now, without HSMs | Working CA, used by a small test cohort, ca_trust role proven in production |
| Phase 2 | Offline root ceremony + HSM cutover = rotation rehearsal | When both HSMs arrive | CA migrated to the intended HSM-backed steady state, test cohort re-bootstraps |
| Phase 3 | Rollout to real services (Redis mTLS, Postgres TLS, SSH host certs, etc.) | After Phase 2 has been stable for ~1 week | CA is trusted by production services |

Why this ordering:

  • Phase 1 de-risks everything that isn’t HSM-specific. DNS, Let’s Encrypt issuance, nginx-proxy wiring, Dex + GitHub team enforcement, OIDC redirect URIs, step-ca provisioner syntax, the step ca bootstrap → step ssh login flow, the ca_trust Ansible role end-to-end — all verified in a low-stakes setting before hardware arrives.

  • Phase 2 exercises the root rotation procedure. Root rotation is the one operation the team otherwise never practices; it is also the catastrophic disaster-recovery path. Doing it once intentionally, with throwaway clients and zero production impact, is the best rehearsal possible. If it fails, you learn while stakes are zero.

  • Phase 3 is gated on Phase 2 success. Nothing outside the small Phase 1 test cohort bootstraps against the CA until after the HSM cutover. This discipline is non-negotiable: if production clients trusted the Phase 1 throwaway root, Phase 2 would require re-bootstrapping them for real, defeating the rehearsal framing.

The one rule that makes Phase 1 safe#

Nothing production-critical bootstraps against the Phase 1 CA. The test cohort is 2–3 people who know they’re on a test CA and have agreed to re-bootstrap at Phase 2 cutover. No Redis, no Postgres, no SSH hosts, no CI systems, no automation.

The Phase 1 CA’s blast radius is therefore near-zero: even if someone compromised input-b during the dry-run window and stole the auto-init root key from the docker volume, the certs they could sign would be trusted only by the test cohort’s laptops — which are going to be re-bootstrapped in Phase 2 anyway. The Phase 1 root goes in the bin regardless.

Phase 1: dry-run commissioning with a throwaway root#

Phase 1 uses the DOCKER_STEPCA_INIT_* environment variables in docker-compose.ca.yml to auto-generate root, intermediate, and SSH CA keys inside the step-ca container on first boot. All keys are encrypted with STEP_CA_PASSWORD and live in the step-ca-data docker volume. This is not the intended steady-state configuration — it exists specifically to make Phase 1 fast.

Before starting#

  • GitHub OAuth App created in ccatobs org (callback: https://auth.ccat.uni-koeln.de/callback, scope read:org). Save Client ID and Secret in a personal password manager.

  • GitHub team ccatobs/datacenter exists and has the initial operator(s) as members. Dex rejects non-members at the auth step.

  • Vault secrets populated: vault_step_ca_password, vault_dex_github_client_id, vault_dex_github_client_secret, vault_dex_stepca_client_secret (use ccat secrets set for the GitHub values, ccat secrets rotate for the step-ca client secret, then ccat secrets provision --host input-b).

  • ccat proxy status confirms nginx-proxy + acme-companion are running.

Bring up the stack#

ccat ca up        # on input-b, or: ccat ca up --remote from a laptop
ccat ca logs      # watch for step-ca "server listening" + dex "listening (http)"

Verify that the public HTTPS endpoints serve LE certs and that both services respond (allow ~60 s after first bring-up for issuance):

# step-ca: hits step-ca's own /health endpoint, should return 200
curl -sI https://ca.ccat.uni-koeln.de/health

# Dex: fetch the OIDC discovery document. Should return JSON with an
# `issuer` field matching the public URL.
curl -s https://auth.ccat.uni-koeln.de/.well-known/openid-configuration | jq .issuer
# → "https://auth.ccat.uni-koeln.de"

Dex has no admin UI to visit. If the discovery endpoint returns valid JSON, the service is alive and correctly wired through nginx-proxy. The only user-facing Dex page is the GitHub login redirect, which you’ll exercise during the smoke test below.

Export throwaway trust artifacts#

The auto-init path generates everything ca_trust needs. Pull the public files out of the running container and commit them:

docker exec $(docker ps -qf name=step-ca) \
  cat /home/step/certs/root_ca.crt > ansible/roles/ca_trust/files/root_ca.crt
docker exec $(docker ps -qf name=step-ca) \
  cat /home/step/certs/ssh_user_ca_key.pub > ansible/roles/ca_trust/files/ssh_user_ca.pub
docker exec $(docker ps -qf name=step-ca) \
  cat /home/step/certs/ssh_host_ca_key.pub > ansible/roles/ca_trust/files/ssh_host_ca.pub

# Compute the fingerprint — this is what Phase 1 test clients will verify
step certificate fingerprint ansible/roles/ca_trust/files/root_ca.crt

Commit the three files with a message that loudly marks them as throwaway:

ca_trust: commit THROWAWAY root artifacts from Phase 1 dry-run

To be replaced by HSM-backed ceremony outputs in Phase 2.
Fingerprint: <from step certificate fingerprint above>

Distribute via ca_trust#

cd ansible
ansible-playbook playbook_setup_vms.yml --tags ca_trust

This is the first real-world test of the ca_trust role. It should deploy the throwaway root cert to every managed host’s system trust store, install the SSH user CA, and register the SSH host CA. Watch for any errors — this is exactly what you want to find now, not during the Phase 2 rotation.
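For orientation, the role's effect is roughly the following manual steps. This sketch stages everything under a scratch prefix so it is side-effect-free, and the file names and paths are assumptions (the role itself is authoritative):

```shell
# Approximate manual equivalent of ca_trust, staged under a scratch prefix.
# On a managed host the destinations are the real /etc paths, followed by
# `update-ca-trust`.
prefix=$(mktemp -d)
mkdir -p "$prefix/etc/pki/ca-trust/source/anchors" "$prefix/etc/ssh"

# Stand-ins for the committed artifacts in ansible/roles/ca_trust/files/.
ssh-keygen -q -t ed25519 -N '' -C ccat-ssh-user-ca -f "$prefix/ssh_user_ca"
ssh-keygen -q -t ed25519 -N '' -C ccat-ssh-host-ca -f "$prefix/ssh_host_ca"

# 1) X.509 trust: root_ca.crt goes into the anchors dir, then update-ca-trust.
: # cp root_ca.crt "$prefix/etc/pki/ca-trust/source/anchors/ccat_root_ca.crt"

# 2) sshd accepts user certs signed by the SSH user CA.
cp "$prefix/ssh_user_ca.pub" "$prefix/etc/ssh/ccat_ssh_user_ca.pub"
echo "TrustedUserCAKeys /etc/ssh/ccat_ssh_user_ca.pub" >> "$prefix/etc/ssh/sshd_config"

# 3) SSH clients accept host certs signed by the SSH host CA.
printf '@cert-authority *.ccat.uni-koeln.de %s\n' \
  "$(cat "$prefix/ssh_host_ca.pub")" >> "$prefix/known_hosts"
```

The three channels are independent: a host can trust the X.509 root without the SSH CAs and vice versa, which is why the role deploys all three explicitly.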

Configure step-ca provisioners#

Dex needs no runtime configuration — its entire config lives in step-ca/dex/config.yaml in this repo and is bind-mounted into the container read-only. Bringing the stack up is enough.

What does need commissioning is the step-ca provisioner set. Run the idempotent installer:

ccat ca provisioner sync
# Prompts for the Dex step-ca client secret (from the vault) and
# the OIDC admin email. Uses step-ca/provisioners-add.sh under
# the hood.

This adds the six CCAT provisioners (CCAT-GitHub, prod-services, staging-services, service-accounts, acme, sshpop) and restarts step-ca so ca.json takes effect. All this state lives in the step-ca-data volume and is wiped during the Phase 2 HSM cutover, which is why ccat ca provisioner sync is designed to be re-run: it’s the same script both at Phase 1 commissioning and at Phase 2 cutover. Dex has no persistent state that matters across the Phase 2 transition — the sqlite3 dex-data volume just caches signing keys + session state, and its config is regenerated from git on every restart.
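For context, provisioners-add.sh presumably reduces to a series of step ca provisioner add calls along these lines. The names come from the provisioner list above; the flags are illustrative, and the script is authoritative:

```shell
# OIDC provisioner backed by Dex; the client secret comes from the vault.
step ca provisioner add CCAT-GitHub --type OIDC \
  --client-id step-ca --client-secret "$DEX_STEPCA_CLIENT_SECRET" \
  --configuration-endpoint https://auth.ccat.uni-koeln.de/.well-known/openid-configuration

# JWK provisioner for operator-issued service certs.
step ca provisioner add prod-services --create

# Automation provisioners.
step ca provisioner add acme --type ACME
step ca provisioner add sshpop --type SSHPOP
```

Each call edits ca.json, which is why the provisioner set lives and dies with the step-ca-data volume.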

Smoke test#

step ca bootstrap \
  --ca-url https://ca.ccat.uni-koeln.de \
  --fingerprint <THROWAWAY-FINGERPRINT>
step ssh login
# Browser → Dex → GitHub OAuth → (ccatobs/datacenter team check) → cert in ssh-agent
ssh input-a.data.ccat.uni-koeln.de   # should Just Work

Use the CA for real SSH for 1–2 weeks. Keep notes on anything weird. If something fails catastrophically, ccat ca down, docker volume rm ccat-ca_step-ca-data, fix the problem, ccat ca up, re-export the new root cert, re-run ca_trust. The test cohort absorbs the disruption.

Phase 1 exit criteria#

  • Test cohort (2–3 humans) has been using step ssh login for at least a week without issues

  • ca_trust role runs cleanly and idempotently against all managed hosts

  • LE auto-renewal has ticked over at least once (or you’ve inspected acme-companion logs to confirm it will)

  • Dex survives a container restart without losing signing keys (sqlite3 dex-data volume persists them; if it were wiped, step-ca’s cached JWKS would go stale until the next refetch)

  • You are confident in the step-ca provisioner configuration

When these are all true, Phase 2 is ready.

Phase 2: offline root ceremony#

This is the highest-stakes operation in the CA’s lifetime. It must be done on an air-gapped machine with two people present for review.

Note on framing: Phase 2 is simultaneously the first real root-creation event and a rehearsal of the root rotation procedure. The commands you run here, the ceremony checklist you follow, and the final step ca bootstrap --force that the test cohort runs — these are exactly what a future “HSM #1 died, we need emergency root rotation” day looks like. Treat Phase 2 as a practice run for that day, not as a one-off.

The ceremony creates:

  1. A root private key that will never touch the network.

  2. A self-signed root certificate (public).

  3. An intermediate private key on HSM #2.

  4. An intermediate certificate signed by the root (public).

Preparation (day before)#

  • Both Nitrokey HSM 2 dongles in hand, unopened or verified genuine.

  • Spare laptop with a wipeable SSD (not your daily driver).

  • Fresh Ubuntu LTS Live USB.

  • Second USB stick, empty, labeled “CCAT CA ceremony export”.

  • Printed copy of this ceremony procedure.

  • Paper, pen, envelope for recording PINs and fingerprints.

  • Cached offline packages: opensc, opensc-pkcs11, step-cli, step-kms-plugin (.deb files downloaded on a networked machine and copied to a USB).

  • Four PIN values chosen, written on paper:

    • Root HSM user PIN (6–8 digits, memorable but not guessable)

    • Root HSM SO PIN (same)

    • Intermediate HSM user PIN (this will go in the vault)

    • Intermediate HSM SO PIN (stays on paper, in the safe)

  • Reviewer/witness on-site.

Ceremony execution#

These steps happen in sequence on the air-gapped laptop. Do not deviate. If something goes wrong, stop and re-plan — do not improvise with a networked machine.

Warning

Before starting: physically disconnect ethernet, disable wifi in BIOS if possible, remove the wifi card if you are paranoid. Verify no interfaces are up with ip link show. A briefly-networked laptop during this ceremony is how every post-mortem “our root key was on a compromised machine” starts.

  1. Boot the laptop from the Live USB. Open a terminal.

  2. Install offline packages from the packages USB:

    sudo dpkg -i /media/usb/packages/*.deb
    
  3. Confirm no network:

    ip link show | grep -E 'state UP' && echo "STOP — network is up"
    
  4. Plug in both HSMs. Confirm they are visible:

    pkcs11-tool --list-slots
    

    You should see two slots. Note which is which by unplugging HSM #1 and running --list-slots again to verify identification.

  5. Initialize HSM #1 as the root:

    sc-hsm-tool --initialize --label "ccat-root" \
      --so-pin <ROOT-SO-PIN> --pin <ROOT-USER-PIN>
    
  6. Initialize HSM #2 as the intermediate:

    sc-hsm-tool --initialize --label "ccat-intermediate" \
      --so-pin <INT-SO-PIN> --pin <INT-USER-PIN>
    
  7. Generate the root key on HSM #1 (elliptic curve, 384-bit):

    pkcs11-tool --module /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so \
      --token-label ccat-root --login --pin <ROOT-USER-PIN> \
      --keypairgen --key-type EC:secp384r1 \
      --label "ccat-root-key" --id 01
    
  8. Generate the intermediate key on HSM #2:

    pkcs11-tool --module /usr/lib/x86_64-linux-gnu/opensc-pkcs11.so \
      --token-label ccat-intermediate --login --pin <INT-USER-PIN> \
      --keypairgen --key-type EC:secp384r1 \
      --label "ccat-intermediate-key" --id 01
    
  9. Create the self-signed root cert, referring to the HSM-stored key via a PKCS#11 URI:

    step certificate create "CCAT Observatory Root CA" \
      root_ca.crt root_ca_key_ref \
      --kms "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so;token=ccat-root" \
      --kms-key-id "pkcs11:object=ccat-root-key;id=%01" \
      --profile root-ca \
      --not-after 240960h   # ~27.5 years
    

    Note: root_ca_key_ref is a reference to the HSM key, not the key bytes. The real key never leaves HSM #1.

  10. Create the intermediate CSR referring to the HSM #2 key:

    step certificate create "CCAT Observatory Intermediate CA" \
      intermediate.csr intermediate_key_ref \
      --csr \
      --kms "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so;token=ccat-intermediate" \
      --kms-key-id "pkcs11:object=ccat-intermediate-key;id=%01"
    
  11. Sign the intermediate CSR with the root:

    step certificate sign intermediate.csr root_ca.crt root_ca_key_ref \
      --profile intermediate-ca \
      --not-after 87600h \
      --kms "pkcs11:module-path=/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so;token=ccat-root" \
      > intermediate_ca.crt   # 87600h ≈ 10 years
    
  12. Critically: compute and record the root cert SHA-256 fingerprint on paper. This is what every client verifies during step ca bootstrap:

    step certificate fingerprint root_ca.crt
    

    Write the full fingerprint on paper. Put the paper in the envelope with the root HSM PINs.

  13. step-ca also uses separate SSH CA keys (different from the X.509 root/intermediate). Generate those on HSM #2 too so step-ca can sign SSH certs:

    pkcs11-tool --module ... --token-label ccat-intermediate \
      --login --pin <INT-USER-PIN> \
      --keypairgen --key-type EC:secp384r1 \
      --label "ccat-ssh-user-ca-key" --id 02
    pkcs11-tool --module ... --token-label ccat-intermediate \
      --login --pin <INT-USER-PIN> \
      --keypairgen --key-type EC:secp384r1 \
      --label "ccat-ssh-host-ca-key" --id 03
    

    Export the SSH public keys in OpenSSH format (via a step-ca helper or ssh-keygen conversion) and save them as ssh_user_ca.pub and ssh_host_ca.pub.

  14. Copy the public artifacts to the export USB:

    • root_ca.crt

    • intermediate_ca.crt

    • ssh_user_ca.pub

    • ssh_host_ca.pub

    • A text file FINGERPRINT.txt containing the SHA-256 fingerprint.

    Do not copy any private files. There are no private files to copy — the keys are inside the HSMs.

  15. Unplug both HSMs. Physically label them: one sticker “CCAT root” on HSM #1, “CCAT intermediate” on HSM #2.

  16. Power off the laptop. Remove and physically destroy the SSD (or securely wipe it if destruction isn’t feasible).
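The ssh-keygen conversion mentioned in step 13 can be rehearsed locally. The sketch below substitutes a throwaway software key for the HSM public key; in the real ceremony the public key would first be read off HSM #2 (e.g. with pkcs11-tool --read-object) before conversion:

```shell
# Stand-in for the public half of ccat-ssh-user-ca-key: a local P-384 keypair.
openssl ecparam -name secp384r1 -genkey -noout -out demo_ca.key
openssl ec -in demo_ca.key -pubout -out demo_ca_pub.pem 2>/dev/null

# Convert the PEM (SubjectPublicKeyInfo) public key to OpenSSH one-line format.
ssh-keygen -f demo_ca_pub.pem -i -m PKCS8 > ssh_user_ca.pub
cat ssh_user_ca.pub   # starts with "ecdsa-sha2-nistp384 ..."
```

The resulting one-liner is exactly what ca_trust distributes as the TrustedUserCAKeys / @cert-authority material.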

Post-ceremony distribution#

  • HSM #1 → sealed envelope with root PINs and fingerprint paper → the safe. Does not enter input-b. Ever.

  • HSM #2 → carried to the server room → installed in the R640 internal USB port → chassis closed → server returned to rack.

  • Export USB → mounted on a developer machine → public artifacts (root_ca.crt, ssh_user_ca.pub, ssh_host_ca.pub) copied into ansible/roles/ca_trust/files/ → committed to git with a clear commit message (“ca: commit public trust material from root ceremony 2026-XX-XX, fingerprint …”).

The public artifacts are safe to commit — they contain no secret material, and every client needs to be able to fetch them. The fingerprint in the commit message is the cross-check: any future developer inspecting history can verify the committed root cert matches the ceremony fingerprint on paper.

Phase 2: HSM cutover on input-b#

Once ceremony artifacts are in the repo and HSM #2 is in the server, execute the cutover. This replaces the Phase 1 throwaway step-ca state with HSM-backed steady state. Dex needs no special handling — its config is in git and its dex-data volume holds only signing keys + session state, both of which are regenerated cleanly on restart.

Before you start this cutover, verify:

  • Ceremony fingerprint recorded on paper matches the committed root_ca.crt in ansible/roles/ca_trust/files/.

  • The test cohort is ready to re-bootstrap (this is the practiced rotation event for them — they will run step ca bootstrap --force --fingerprint <new> right after cutover).

Cutover procedure:

  1. Install HSM #2 in input-b. Power down, open chassis, insert HSM into the internal USB port, close chassis, power up. Verify:

    sudo pkcs11-tool --list-slots  # expect ccat-intermediate token
    
  2. Add the HSM PIN to the vault schema + populate:

    ccat secrets add vault_step_ca_hsm_pin --env production
    # Prompts for env_name (STEP_CA_HSM_PIN), description, value.
    # Paste the intermediate HSM user PIN from the ceremony.
    ccat secrets provision --host input-b
    
  3. Run the Ansible hsm_host role to finalize OpenSC + udev rule:

    cd ansible
    ansible-playbook playbook_setup_vms.yml --tags hsm_host -l input-b
    
  4. Overwrite the ca_trust artifacts with ceremony outputs. Replace the Phase 1 throwaway files with the real ones from the ceremony export USB:

    cp /media/ceremony-usb/root_ca.crt       ansible/roles/ca_trust/files/
    cp /media/ceremony-usb/ssh_user_ca.pub   ansible/roles/ca_trust/files/
    cp /media/ceremony-usb/ssh_host_ca.pub   ansible/roles/ca_trust/files/
    git add ansible/roles/ca_trust/files/
    git commit -m "ca_trust: rotate to HSM-backed root (Phase 2 cutover)
    
    Replaces Phase 1 throwaway. New fingerprint: <paper>"
    git push origin main
    
  5. Re-run ca_trust to distribute the new root. The file paths don’t change, so update-ca-trust automatically rehashes the trust store and the SSH lineinfile tasks update the CA lines in place:

    ansible-playbook playbook_setup_vms.yml --tags ca_trust
    

    No host is offline during this — the new root and old root coexist in each host’s trust store briefly, and existing certs issued by the Phase 1 intermediate continue to validate against the Phase 1 root until step 8 below tears down step-ca.

  6. Edit docker-compose.ca.yml for HSM-backed mode. Changes needed:

    • Remove all DOCKER_STEPCA_INIT_* env vars from the step-ca service (no more auto-init).

    • Add devices: ["/dev/bus/usb/<bus>/<dev>"] pointing at the HSM device node (find bus/dev with lsusb — vendor 20a0, product 4230).

    • Add a tmpfs mount for the PIN file: tmpfs: ["/run/secrets:mode=0400,size=1M"].

    • Add an entrypoint wrapper or init container that writes $STEP_CA_HSM_PIN to /run/secrets/hsm-pin.

    • Build step-ca from a thin derived Dockerfile that adds opensc-pkcs11 — the stock smallstep/step-ca image does not include a PKCS#11 module.

    Commit and push these compose changes before the next step.

  7. Announce the cutover to the test cohort. They should close open SSH sessions cleanly and be ready to re-bootstrap when you signal.

  8. Cut over the step-ca half of the stack. Stop the stack, wipe only step-ca-data, pre-populate with ceremony outputs + new ca.json, bring back up:

    ccat ca down
    docker volume rm ccat-ca_step-ca-data
    docker volume create ccat-ca_step-ca-data
    
    # Pre-populate via a short-lived alpine container
    TMP=$(docker run -d --rm -v ccat-ca_step-ca-data:/home/step alpine sleep 300)
    docker exec $TMP mkdir -p /home/step/certs /home/step/config
    docker cp /media/ceremony-usb/root_ca.crt         $TMP:/home/step/certs/root_ca.crt
    docker cp /media/ceremony-usb/intermediate_ca.crt $TMP:/home/step/certs/intermediate_ca.crt
    docker cp ./step-ca/ca.json.hsm                   $TMP:/home/step/config/ca.json
    docker kill $TMP
    
    ccat ca up
    ccat ca logs step-ca   # expect "Loaded key from PKCS#11 URI" + "server listening"
    

    Dex keeps running throughout — its config file and dex-data volume are not touched, so the OIDC issuer URL, JWKS, and static step-ca client secret are all unchanged. step-ca’s CCAT-GitHub provisioner reconnects to the same Dex discovery endpoint it was using before the wipe.

  9. Verify Let’s Encrypt certs still serve. /opt/proxy/certs was untouched, so both public domains should keep their valid LE certs through the cutover:

    curl -sI https://ca.ccat.uni-koeln.de/health
    curl -s https://auth.ccat.uni-koeln.de/.well-known/openid-configuration | jq .issuer
    
  10. Re-add step-ca provisioners. The Phase 1 provisioner list lived inside the wiped ca.json, so it needs to be re-created. Run the same commissioning command:

    ccat ca provisioner sync
    

    The set is unchanged from Phase 1: CCAT-GitHub, prod-services, staging-services, service-accounts, acme, sshpop. The CCAT-GitHub OIDC provisioner reuses the Dex step-ca client secret from the vault — nothing in Dex’s state has changed, so the old secret keeps working.

  11. Test cohort re-bootstraps with the new fingerprint. This is the rehearsed step — the one command every future root-rotation event depends on:

    step ca bootstrap --force \
      --ca-url https://ca.ccat.uni-koeln.de \
      --fingerprint <NEW-FINGERPRINT-FROM-CEREMONY>
    step ssh login
    ssh input-a.data.ccat.uni-koeln.de
    

    Every test-cohort member must verify the new fingerprint against the paper copy from the ceremony before accepting it. This is the critical integrity checkpoint. If the fingerprint doesn’t match, something is wrong — stop and investigate. Do not click through.

At this point the CA is in its intended HSM-backed steady state and Phase 2 is done. Do not roll out to production services yet — let the HSM-backed config run for a week to shake out any PKCS#11 / udev / container device mount issues before putting real services behind it. The Phase 3 rollout checklist lives in step-ca/COMMISSIONING-TODO.md.

Day-to-day operations#

Bringing the stack up, down, restart#

All via the ccat ca CLI, which wraps docker compose -f docker-compose.ca.yml:

ccat ca status             # show container status
ccat ca logs               # tail all services
ccat ca logs step-ca       # tail a specific service
ccat ca restart step-ca    # restart without image pull
ccat ca update             # git pull → image pull → up -d
ccat ca down               # stop, preserve volumes (always)

ccat ca down never passes -v. This is deliberate. The step-ca-data volume is irreplaceable in Phase 2+ — losing it means re-doing the root ceremony and re-bootstrapping every client. The dex-data volume is safe to wipe in principle (Dex regenerates signing keys on startup), but a fresh JWKS briefly invalidates step-ca’s cached discovery until it refetches, so there’s no reason to do it during a normal restart. If you need to truly wipe the CA, do it by hand with docker volume rm and think three times.

Issuing new certs#

Humans use the step CLI after step ca bootstrap:

# SSH user cert (opens browser for GitHub OAuth)
step ssh login

# x509 cert for a service (JWK provisioner)
step ca certificate service.local service.crt service.key \
  --provisioner prod-services

# ACME cert (automatic, for internal services)
step ca certificate service.local service.crt service.key \
  --acme

Monitoring cert expiry#

step-ca’s internal database tracks issued certs. For CCAT operational visibility, expiry should be surfaced in Grafana via InfluxDB. The pattern (to be implemented in Phase 2):

  • A systemd timer on each managed host runs step certificate inspect --format json <cert> periodically and pushes a cert_expiry_days metric to InfluxDB.

  • Grafana alerts on cert_expiry_days < 7 for any service.
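The payload of that timer can be sketched as follows, assuming GNU date and the validity.end field that step certificate inspect --format json emits; the InfluxDB URL and measurement name are placeholders, not the implemented pipeline:

```shell
#!/bin/sh
# Sketch only: compute whole days until a cert's notAfter, for the
# cert_expiry_days metric. Assumes GNU date (Linux hosts).

days_until() {
  end_epoch=$(date -d "$1" +%s)
  now_epoch=$(date +%s)
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# In the real timer, $end would come from the issued cert:
#   end=$(step certificate inspect --format json "$CERT" | jq -r .validity.end)
# and the metric would be pushed in InfluxDB line protocol, e.g.:
#   curl -XPOST "$INFLUX_URL/write?db=ccat" --data-binary \
#     "cert_expiry_days,host=$(hostname),cert=$CERT value=$(days_until "$end")"
days_until "2099-01-01T00:00:00Z"
```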

Backup#

Two Docker volumes on input-b must be backed up:

  • ccat-ca_step-ca-data — step-ca config, db, intermediate public cert (not the key — that’s on HSM #2).

  • ccat-proxy_html + /opt/proxy/certs — LE certs (cheap to re-issue, but backing them up saves a round-trip on DR).

The ccat-ca_dex-data sqlite3 volume does not need backup: Dex’s entire config is in git (step-ca/dex/config.yaml), and the volume holds only ephemeral session state + signing keys that are safely regenerated on first start.

The CCAT backup pipeline should cover these paths (see Backup and Restore for the backup architecture).
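As an illustration of the mechanics only (the actual pipeline is documented in Backup and Restore, and in practice a volume is reached via a throwaway container), a directory snapshot reduces to plain tar:

```shell
# Illustrative snapshot helper, not the actual CCAT backup pipeline.
# For a Docker volume the same tar runs inside a throwaway container, e.g.:
#   docker run --rm -v ccat-ca_step-ca-data:/src -v /backup:/dst alpine \
#     tar -czf /dst/step-ca-data-$(date +%F).tar.gz -C /src .
snapshot() {
  src=$1; dest=$2
  tar -czf "$dest/$(basename "$src")-$(date +%F).tar.gz" -C "$src" .
}
```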

The HSM keys themselves are not in any backup — they cannot be. This is acceptable because:

  • HSM #1 failure is a planned-for disaster with a documented recovery procedure (emergency root rotation, re-bootstrap all clients).

  • HSM #2 failure is a routine rotation (root ceremony, new intermediate, swap in fresh HSM).

Provisioner management#

Provisioners are the entry points through which clients request certs from step-ca. Each provisioner has a type (OIDC, JWK, ACME, SSHPOP, …), an authentication mechanism, and a set of claims that govern what kinds of certs it can issue and with what lifetimes. They live in ca.json inside the step-ca-data volume.

CCAT runs six provisioners. They are installed by the script step-ca/provisioners-add.sh.

The CCAT provisioner set#

| Name | Type | Purpose | Default lifetime | Max lifetime |
|------|------|---------|------------------|--------------|
| CCAT-GitHub | OIDC | Interactive SSH user certs via Dex + GitHub (team-gated) | 16 h | 16 h |
| prod-services | JWK | Production x509 / TLS certs | 90 d (2160 h) | 90 d |
| staging-services | JWK | Staging x509 / TLS certs | 30 d (720 h) | 30 d |
| service-accounts | JWK | SSH certs for automated services (Jenkins, ccat_transfer/bbcp, CI) | 24 h | 24 h |
| acme | ACME | Auto-TLS for internal services via ACME protocol | 90 d | 90 d |
| sshpop | SSHPOP | SSH host cert auto-renewal (host re-proves by signing with old cert) | 7 d (168 h) | 7 d |

Authorization model — we trust GitHub, not email domains#

A common pitfall when wiring step-ca’s OIDC provisioner is to use the --domain flag to restrict which users can get certs. That flag checks the email claim of the OIDC token against an allowlist of domains. For a tenant that uses a single corporate email domain (Google Workspace, Microsoft 365), it’s a reasonable coarse gate.

For CCAT, it is the wrong model. Our trust chain is:

  1. Dex federates GitHub as the identity provider.

  2. Authorization is membership in the ccatobs/datacenter GitHub team, not email domain membership.

  3. Team members have wildly different email domains — uni-koeln.de, ph1.uni-koeln.de, cornell.edu, fyst.org, personal addresses. None of these reflect CCAT membership in any structural way.

Filtering by email domain is simultaneously too strict (rejects valid ccatobs members whose GitHub primary email isn’t a uni address) and too loose (accepts anyone with a uni-koeln.de email regardless of whether they’re in ccatobs — that’s a huge public domain). The script therefore omits --domain by default.

What actually provides the authorization gate:

Dex enforces ccatobs/datacenter team membership directly in its GitHub connector, before step-ca ever sees a token. Config is in step-ca/dex/config.yaml:

connectors:
  - type: github
    id: github
    config:
      orgs:
        - name: ccatobs
          teams:
            - datacenter

How it works end-to-end:

  1. User runs step ssh login. step-cli opens a browser to the CCAT-GitHub provisioner’s configured OIDC issuer (Dex).

  2. Dex redirects the browser to GitHub for OAuth.

  3. GitHub authenticates the user and returns an OAuth token with read:org scope.

  4. Dex calls GitHub’s /user/teams endpoint with that token and checks whether the user is a member of ccatobs/datacenter.

  5. If yes: Dex issues an OIDC ID token with a groups claim containing the team slug, redirects back to step-cli, step-ca validates the token, issues a 16h SSH cert. Done.

  6. If no: Dex returns an “access denied” page, no token is issued, step-cli errors out with “OIDC flow failed.” The user never reaches step-ca.
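For orientation, the ID token Dex hands back in step 5 carries claims shaped roughly like this (values are illustrative; Dex's GitHub connector renders team membership as org:team in the groups claim, and the audience is the step-ca client registered in Dex):

```json
{
  "iss": "https://auth.ccat.uni-koeln.de",
  "aud": "step-ca",
  "email": "operator@example.org",
  "email_verified": true,
  "groups": ["ccatobs:datacenter"],
  "exp": 1700000000
}
```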

Onboarding a new operator: add them to the ccatobs/datacenter team on github.com. Their next step ssh login succeeds. No CCAT-side configuration change, no admin UI to click through, no secret to rotate.

Offboarding: remove them from the team. Their current 16h cert expires within the day, no new certs can be issued. Any existing SSH sessions keep working until the cert underlying them expires, and then they’re locked out. No cert revocation needed in the common case.

This is a fully automatic model: both authentication and authorization are delegated to GitHub’s team management. CCAT writes zero identity code. A GitHub outage makes new step ssh login flows unavailable until GitHub recovers (existing 16h certs keep working), which is an acceptable trade for the operational simplicity — and in practice GitHub has dramatically better uptime than any identity layer CCAT would run itself.

Why we moved off Keycloak. The prior Phase 1 setup used Keycloak as an IdP in front of GitHub. Keycloak’s built-in GitHub broker does not call the teams endpoint, only /user, so authorization had to be enforced by a manual “assign the ccatobs-member realm role” step in the Keycloak admin UI after each new user’s first login. That’s one manual onboarding step too many, and it doesn’t age well — if a user leaves the GitHub team, their Keycloak role stays assigned unless an admin remembers to clean up. Switching to Dex collapses three moving parts (Keycloak, Keycloak-db, manual role assignment) into one declarative YAML block and tracks GitHub team membership automatically.

The --domain flag remains available in the script via the ALLOWED_DOMAINS env var for cases where domain is genuinely the right gate (e.g. you’re bootstrapping a CA for a specific org that does use a uniform email domain). For CCAT, leave it unset.

SSH access tiers — who gets what#

The Dex team gate answers “who may authenticate.” The separate question of “which Linux user may they become, and what happens if the IdP is down” is answered by a three-tier access model, implemented via a mix of Ansible-managed local users, AuthorizedPrincipalsFile, and the existing Nitrokey FIDO2 SSH keys.

This section documents the intended steady-state model. The implementation (an Ansible role deploying auth_principals/%u files) is Phase 3 work; Phase 1 hosts are currently using the legacy static-authorized_keys path.

Tier 1 — Hard-core admins (2–3 people)

Full root access to every CCAT-managed host, with a physical second factor as the fallback for when the IdP layer is unavailable.

  • Personal Linux user on every host (e.g. buchbend), managed by Ansible users.yml, member of the wheel/sudo group.

  • Static SSH authorized_keys entry for their Nitrokey FIDO2 resident key (sk-ecdsa-sha2-nistp256@openssh.com). This key is physically bound to the dongle and cannot be cloned without the device. It is the break-glass path: if Dex is down, if GitHub is unreachable, if step-ca won’t issue, the admin still SSHes in with their dongle.

  • Also a full member of ccatobs/datacenter on GitHub, so the normal step ssh login flow works day-to-day. The Nitrokey path is the backup, not the primary.

  • Sudo permissions are granted through group membership, not through anything the SSH cert carries. A Tier 1 admin who logs in with a step-ca cert lands in the same local account and gets the same sudo rights as one who logs in with the Nitrokey — the cert/key choice is just the door, not the privilege level.

Tier 2 — Operational staff

Regular contributors who need SSH access for legitimate operational work but are not the people you wake up at 3am. The Nitrokey dependency is explicitly not required — adding hardware to every new contributor is friction that scales badly.

  • Personal Linux user on managed hosts, created by Ansible from users.yml. No wheel/sudo membership unless there’s a specific operational need.

  • No static SSH authorized_keys entry. The only path to logging in is a valid step-ca-issued SSH user cert, which requires authenticating through Dex + the GitHub team check.

  • sshd_config has AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u and TrustedUserCAKeys /etc/ssh/trusted_user_ca_keys. Each staff member gets a one-line file /etc/ssh/auth_principals/<username> containing their own username as a principal. Rendered from users.yml by a Phase 3 Ansible role (working name: ssh_access_principals).

  • Off-boarding is GitHub-side: remove them from ccatobs/datacenter, their next step ssh login fails at Dex, their current cert expires within 16h, they’re out. No Ansible rerun, no manual authorized_keys surgery.

  • Rolling back a staff member to “no SSH at all” can be done either by removing them from the GitHub team (preferred, fast, no CCAT-side action) or by deleting their auth_principals/<username> file via Ansible (slower, but also denies access if they somehow got a cert from another route).

Tier 3 — Break-glass / emergency-only accounts

For scenarios where even Tier 1’s Nitrokey path is insufficient — the local sshd is broken, the machine is in single-user mode, the network is down — there must be a path that bypasses SSH entirely.

  • A named local user (e.g. breakglass) exists on each managed host, created by Ansible but with:

    • No password (! in /etc/shadow).

    • No authorized_keys and no entry in any auth_principals file. Cannot be reached via SSH by design.

    • Full wheel/sudo rights, so once you are them, you can recover anything.

  • Access is via the iLO/DRAC out-of-band management console on the R640, reached from the Uni Köln management VLAN. The iLO gives you a virtual keyboard at a physical login prompt, which is the one interface that works when every network service is gone. An admin with iLO credentials types the break-glass account's name plus a password, supplied either via iLO root-recovery or from the physically printed emergency password kept in the safe alongside the root HSM.

  • The break-glass path is tested during commissioning and then left alone. Using it is an incident in itself; any use should generate a postmortem.

Putting the tiers together

| Tier | Primary auth | Backup auth | Linux account | Sudo |
|------|--------------|-------------|---------------|------|
| 1 — Admin | step ssh login (Dex → GitHub → step-ca cert) | Nitrokey FIDO2 key in static authorized_keys | Personal user, wheel | Yes |
| 2 — Staff | step ssh login → cert | (none — off by design) | Personal user | No, unless opted in case-by-case |
| 3 — Break-glass | (none — not reachable via SSH) | iLO console + password | breakglass local user | Yes |

The key property: the failure modes are orthogonal. A Dex outage takes out Tier 2 but leaves Tiers 1 and 3 intact. A GitHub outage takes out the step ssh login path for everyone, but Tier 1 falls back to Nitrokey and Tier 3 is untouched. A full network outage on input-b takes out step-ca entirely, but Tier 1’s Nitrokey path still works on every other host (their FIDO2 key is in each host’s local authorized_keys) and Tier 3 recovers the unreachable machine via iLO. No single failure, including a compromise of input-b, locks the operators out of their fleet.

Phase 3 work item: implement the ssh_access_principals role, populate users.yml with the Tier 2 staff list, and remove legacy static authorized_keys entries as each user migrates.

Why these lifetimes#

The numbers above are deliberate and worth understanding, because “cert lifetime” often gets conflated with “security strength” when it’s really about compromise recovery time vs operational resilience.

  • Human SSH (16h) — long enough to cover a full workday across time zones, short enough that daily re-authentication is routine. Off-boarding someone from the ccatobs GitHub org effectively revokes their SSH access within 16 hours with zero extra work: their next step ssh login fails at the GitHub OAuth step, their previous cert expires, they’re out. No authorized_keys surgery required.

  • Service SSH (24h, auto-renewed every 6h) — the service-accounts provisioner is designed for the Pattern A renewal flow described below: services run a systemd timer that calls step ssh renew every 6h, so the cert is continuously refreshed without ever touching the provisioner password again after bootstrap. A stolen cert is valid for at most 24h (and the timer would be trying to replace it during that window anyway). Rotation = rotate the provisioner password centrally, all downstream certs expire naturally within a day. Compare to classic SSH keys where compromise means “find and rotate keys on every deployed host.”

  • Service x509 (30–90d) — TLS certs for Redis, Postgres, internal APIs etc. run 30d in staging and 90d in production. Production is longer for operational resilience (a week-long CA outage doesn’t cascade into service outages); staging is shorter to exercise the renewal flow and surface any regressions before they bite prod. Services renew weekly via a short script or cert-manager-style controller.

  • SSH host certs (7d via SSHPOP) — See the detailed SSHPOP explanation below. 7 days gives plenty of slack; no reason to go longer when renewal is free.

  • ACME (90d) — matches LE convention. Any internal service that speaks ACME (cert-manager in k8s, certbot-like tools on hosts) gets the standard public-CA-equivalent lifetime.

The one non-obvious choice is service-accounts at 24h instead of 30d. A longer cert would mean fewer renewals and less operational friction, but it would also mean a compromise window measured in weeks instead of hours, and a stolen cert could quietly self-renew via step ssh renew until someone notices. 24h is the sweet spot where auto-renewal is cheap (every 6h, trivial load) and compromise is self-healing within a day.

What SSHPOP is and why it’s clever#

SSHPOP = SSH Proof Of Possession. It’s a step-ca provisioner type specifically designed for renewing SSH host certs with zero credentials stored on the host after initial bootstrap. Understanding it matters because it’s the foundation of the “SSH host certs rotate themselves forever” story in Phase 3.

The mechanism: when a host wants to renew its cert, it signs the renewal request with the private key of its currently-valid cert (which is the sshd host key — already on disk, already required for sshd to work). step-ca verifies the signature against the submitted cert, checks the cert hasn’t expired, checks it was originally issued by this CA, and issues a fresh one with the same principal.

Host                               step-ca
  │                                    │
  │ (current cert is 5 days old,       │
  │  systemd timer fires)              │
  │                                    │
  │──── step ssh renew request ───────>│
  │     (signed with current cert's    │
  │      private key, includes current │
  │      cert in the request)          │
  │                                    │
  │                                    │ SSHPOP provisioner:
  │                                    │   - Extract pubkey from current cert
  │                                    │   - Verify signature
  │                                    │   - Check not expired
  │                                    │   - Check issued-by-us
  │                                    │
  │<───── new cert, 7 days valid ──────│
  │                                    │
  │ Write to disk, SIGHUP sshd         │

Zero new secrets were used. The host proved its identity by possessing the private key that matches the current cert. Hence “Proof of Possession”. No password, no token, no provisioner credential on the host — just the sshd key which has to be there anyway.

Why only host certs? Host certs are associated with a single long-lived key (the sshd host key), so “prove possession of the current cert’s key” has a natural answer. User certs are per-session (fresh key each step ssh login), so there’s no stable key to prove possession of.

Natural forcing function: if a host falls out of rotation long enough for its cert to fully expire, SSHPOP cannot rescue it. The host has no valid cert to sign with, so renewal fails. You'd have to re-bootstrap the host with a fresh cert via a different provisioner (the JWK service-accounts provisioner). This is a feature, not a bug — it surfaces hosts that have silently fallen offline. Classic SSH host keys are forever and silently trust stale hosts; SSHPOP reflects liveness.

Phase 3 usage (not yet in place):

  1. Bootstrap host cert via the JWK service-accounts provisioner, one-time, during host provisioning (requires the password briefly, then delete it).

  2. Configure sshd: HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub.

  3. systemd timer on each host, daily:

    step ssh renew --force /etc/ssh/ssh_host_ed25519_key-cert.pub
    
  4. Cert rotates forever, no credentials on the host after bootstrap.

  5. Clients that have ca_trust deployed (the @cert-authority line in ssh_known_hosts) automatically trust the renewed certs.
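The step-3 timer could be realized as a unit pair like this (a sketch; unit names are assumptions, the renew command and daily cadence match the list above, and the sshd reload mirrors the SIGHUP in the diagram):

```ini
# /etc/systemd/system/ssh-host-cert-renew.service (sketch)
[Unit]
Description=Renew SSH host certificate via step-ca SSHPOP

[Service]
Type=oneshot
ExecStart=/usr/bin/step ssh renew --force /etc/ssh/ssh_host_ed25519_key-cert.pub
ExecStartPost=/usr/bin/systemctl reload sshd

# /etc/systemd/system/ssh-host-cert-renew.timer (sketch)
[Unit]
Description=Daily SSH host cert renewal

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```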

Service-account SSH patterns#

There are two deployment patterns for machine SSH identities on CCAT, and knowing which is which keeps the threat model clear.

Pattern A — long-lived cert with auto-renewal. A service bootstraps once, gets a 24h cert, and runs a systemd timer that calls step ssh renew every 6 hours. After the one-time bootstrap, the provisioner password is no longer stored on the host — the cert is the authentication for future renewals (step ssh renew uses the current cert’s private key to authenticate to step-ca). This is the right pattern for:

  • Jenkins running on input-b — long-running daemon, lots of small SSH operations, trusted host

  • ccat_transfer (bbcp) on every input node — same profile, high-volume transfers between internal machines

  • cron-based backup scripts and similar daemons

Pattern B — per-task short-lived cert. A service has no standing SSH identity. When it needs to SSH, it calls step ssh certificate with a 5–60 minute lifetime, uses the cert for the task, discards it. The provisioner password lives in a tightly-scoped secret readable only by the job runner. Each cert issuance is a logged event in step-ca. This is the right pattern for:

  • CI runners on untrusted execution environments (cloud runners, contractor machines, shared infrastructure)

  • Rarely-run one-off jobs where maintaining a renewal timer adds more ceremony than it saves

  • Compliance-sensitive operations that need an audit entry per execution

Both patterns use the same service-accounts provisioner — the difference is how the service uses it. CCAT’s current setup (Jenkins + ccat_transfer, all on trusted hardware in a locked hall) maps cleanly to Pattern A everywhere.

Wiring Pattern A — concrete (Phase 3)#

This is not yet in place — it’s the Phase 3 SSH-cert migration work. Sketched here so the target state is clear.

A new Ansible role ssh_service_cert (Phase 3) would:

  1. Install step-cli on the host from the upstream release binary.

  2. Bootstrap the host against the CCAT CA (idempotent):

    step ca bootstrap --force \
      --ca-url https://ca.ccat.uni-koeln.de \
      --fingerprint <from ca_trust/files/root_ca.crt>
    
  3. As the target user (jenkins, ccat_transfer, …), issue the initial cert using the provisioner password from the vault:

    step ssh certificate <user>-<host> \
      ~<user>/.ssh/id_ed25519 \
      --provisioner service-accounts \
      --password-file <tmp path, deleted after> \
      --principal <user> \
      --not-after 24h
    
  4. Install a systemd user timer that runs every 6h:

    step ssh renew --force ~<user>/.ssh/id_ed25519-cert.pub
    
  5. Remove the provisioner password from the host. From now on, the cert renews itself — the running cert is the authentication for its own renewal.

On the target side (hosts the service SSHes into), the ca_trust role already deploys the SSH user CA via /etc/ssh/trusted_user_ca_keys. The Phase 3 addition is a second piece: AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u in sshd_config, with files like /etc/ssh/auth_principals/jenkins containing jenkins on a line. This says “any cert with principal jenkins signed by a trusted user CA may log in as the jenkins user” — no authorized_keys entries needed, ever.
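Concretely, the target-side state is two tiny files (a sketch of the Phase 3 end state; the paths match the directives above):

```
# /etc/ssh/sshd_config (additions)
TrustedUserCAKeys /etc/ssh/trusted_user_ca_keys
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u

# /etc/ssh/auth_principals/jenkins (one principal per line)
jenkins
```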

The whole thing is ~100 lines of Ansible and replaces the current static-SSH-key model on the managed hosts.

When to run provisioners-add.sh#

The script is idempotent: each provisioner is checked against step ca provisioner list before being added, and existing entries are left alone. You run it:

  1. Once during Phase 1 commissioning, right after populating the Dex step-ca client secret in the vault (vault_dex_stepca_client_secret). See the Phase 1 checklist in step-ca/COMMISSIONING-TODO.md.

  2. Once during Phase 2 cutover, after the step-ca-data volume has been wiped and pre-populated with ceremony outputs. The new ca.json starts fresh with no provisioners — you re-run the script to restore the set.

  3. Any time you want to add a new provisioner. Edit the script to append a new add_provisioner block, commit, run. Existing ones are skipped; only the new one gets added.
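The skip-if-exists logic each add_provisioner block relies on reduces to a name check against the current list. A self-contained sketch of the assumed shape, where EXISTING stands in for the names from step ca provisioner list and the echo stands in for the real step ca provisioner add call:

```shell
# Sketch of the skip-if-exists logic in provisioners-add.sh (assumed shape,
# not the verbatim script). EXISTING stands in for the output of
# `step ca provisioner list`; the echo stands in for the real add call.
add_provisioner() {
  name=$1
  case " $EXISTING " in
    *" $name "*)
      echo "skip: $name already exists"
      return 0
      ;;
  esac
  echo "add: $name"
}

EXISTING="CCAT-GitHub prod-services acme"
add_provisioner prod-services   # prints "skip: prod-services already exists"
add_provisioner sshpop          # prints "add: sshpop"
```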

How to run it#

# On input-b (or via ssh from a laptop if you prefer)
cd /opt/data-center/system-integration

# Recommended: use the `ccat ca provisioner sync` wrapper, which
# prompts for the client secret (hidden), reads it from your
# terminal, and runs the script with the right env for you.
ccat ca provisioner sync

# Or run the script directly:
DEX_STEPCA_CLIENT_SECRET="$(ccat secrets show vault_dex_stepca_client_secret --reveal 2>/dev/null | tail -1)" \
OIDC_ADMIN_EMAIL="you@uni-koeln.de" \
./step-ca/provisioners-add.sh

# Then apply the changes
ccat ca restart step-ca

The script:

  • Aborts cleanly if required env vars are missing (DEX_STEPCA_CLIENT_SECRET, OIDC_ADMIN_EMAIL).

  • Pre-flight checks that the target container is running and that the password file is readable inside it.

  • Adds each provisioner via docker exec ... step ca provisioner add, reusing /home/step/secrets/password (which contains STEP_CA_PASSWORD) for both JWK encryption and admin API auth.

  • Prints a summary of the final provisioner list.

Updating lifetimes on existing provisioners#

The script does not modify existing provisioners. If you want to change a lifetime — say, loosen prod-services from 90d to 180d — use step ca provisioner update directly inside the container:

docker exec -it ccat-ca-step-ca-1 step ca provisioner update prod-services \
  --x509-default-dur 4320h \
  --x509-max-dur 4320h

ccat ca restart step-ca

Valid lifetime flags (pass only the ones you want to change):

| Flag | Applies to | Example |
|------|------------|---------|
| --x509-default-dur | x509 certs | 2160h (90 days) |
| --x509-max-dur | x509 certs | 4320h (180 days) |
| --ssh-user-default-dur | SSH user certs | 16h |
| --ssh-user-max-dur | SSH user certs | 24h |
| --ssh-host-default-dur | SSH host certs | 168h (7 days) |
| --ssh-host-max-dur | SSH host certs | 336h (14 days) |

All durations are passed as Go time.Duration strings — h for hours, m for minutes. Don’t use d or w (not supported).
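Since d isn't accepted, the conversion is just days × 24; a trivial helper avoids arithmetic slips when writing the flags:

```shell
# Go durations take h/m/s only, so express days as hours: days * 24.
days_to_dur() { echo "$(( $1 * 24 ))h"; }

days_to_dur 90    # prints 2160h
days_to_dur 180   # prints 4320h
```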

After any provisioner update, always restart step-ca so it re-reads ca.json:

ccat ca restart step-ca

You can also update the script’s default values and re-commit, so that a future DR re-install gets the new defaults. But the live provisioners won’t change until you also run step ca provisioner update.

Removing a provisioner#

docker exec -it ccat-ca-step-ca-1 step ca provisioner remove <name>
ccat ca restart step-ca

Be careful: removing a provisioner does NOT revoke the certs it previously issued. Those keep validating until their own expiry. If you need to actually invalidate issued certs, rotate the intermediate or add them to the CRL.

Troubleshooting: x509: certificate signed by unknown authority#

Symptom: after step ca bootstrap succeeds, any subsequent command that talks to the CA (step ssh login, step ca provisioner list, etc.) fails with:

client GET https://ca.ccat.uni-koeln.de/... failed:
tls: failed to verify certificate: x509: certificate signed
by unknown authority

Root cause: you’re hitting ca.ccat.uni-koeln.de on port 443, where nginx-proxy terminates TLS with a Let’s Encrypt cert that doesn’t chain to the CCAT root stored in ~/.step/certs/root_ca.crt. The compose file currently exposes step-ca’s native TLS on port 9000 — clients should hit port 9000 explicitly.

Fix 1 — re-bootstrap with the explicit port (correct, requires firewall to allow port 9000 inbound):

step ca bootstrap --force \
  --ca-url https://ca.ccat.uni-koeln.de:9000 \
  --fingerprint <from ceremony / throwaway root>
step ssh login

Fix 2 — trust-bundle workaround (for off-network clients before the firewall opens 9000): append the system CA bundle to step’s trust file so step-cli trusts both LE and the CCAT root:

cat /etc/ssl/certs/ca-certificates.crt >> ~/.step/certs/root_ca.crt
step ssh login

This lets step-cli talk to the LE-fronted ca.ccat.uni-koeln.de:443 endpoint while still trusting the CCAT root for issued certs. It’s a hack and should be unnecessary once port 9000 is open, but it unblocks bootstrap during firewall coordination. Every laptop that uses this workaround needs to re-append after any step ca bootstrap --force, since bootstrap rewrites the root file.

Troubleshooting provisioner setup#

“Live API shows fewer provisioners than ca.json” (the split-brain we hit during Phase 1). Root cause: enableAdmin: true in ca.json puts step-ca into “remote management” mode, where the runtime uses an internal BoltDB for provisioners and reads ca.json only at first-ever boot when the DB is empty. Since the init path auto-creates admin + sshpop, the DB is never empty, so subsequent offline edits to ca.json (the mode step ca provisioner add uses when it has filesystem access) are invisible to the running CA.

We intentionally do not enable remote management on CCAT’s step-ca. The docker-compose.ca.yml omits DOCKER_STEPCA_INIT_REMOTE_MANAGEMENT so that ca.json stays the single source of truth for provisioners and offline-mode step ca provisioner add calls take effect on restart.

If you somehow end up with enableAdmin: true in an existing ca.json (e.g., legacy volume from before we fixed the compose file), flip it back:

ccat ca down
docker run --rm -v ccat-ca_step-ca-data:/home/step busybox sh -c '
  sed -i "s/\"enableAdmin\": true/\"enableAdmin\": false/" /home/step/config/ca.json
  grep enableAdmin /home/step/config/ca.json
'
ccat ca up

Then step ca provisioner list should show everything that was in ca.json.

“error getting admin:” or HTTP 401 from the admin API — the remote management layer is enabled (enableAdmin: true in ca.json) and your step ca provisioner add call is not authenticating as an admin. See the split-brain troubleshoot above — disabling remote management is the right fix. If for some reason you need to keep remote management on, the script’s --password-file /home/step/secrets/password pattern should work because the auto-init admin provisioner is created with STEP_CA_PASSWORD. If that still fails:

docker exec ccat-ca-step-ca-1 step ca admin list

This lists the current admins and their provisioner. If the auto-init admin is not present (unusual), you can fall back to editing ca.json directly:

# 1. Stop step-ca
ccat ca down

# 2. Copy ca.json out of the volume
docker run --rm -v ccat-ca_step-ca-data:/src -v "$PWD":/dst alpine \
  cp /src/config/ca.json /dst/ca.json.backup

# 3. Edit ca.json.backup by hand: set "authority": { "enableAdmin": false }
# 4. Write it back:
docker run --rm -v ccat-ca_step-ca-data:/dst -v "$PWD":/src alpine \
  cp /src/ca.json.backup /dst/config/ca.json

# 5. Start step-ca, re-run the provisioner script (which now edits
#    ca.json directly without admin auth), then re-enable admin:
ccat ca up
./step-ca/provisioners-add.sh
# re-edit ca.json to flip enableAdmin back to true
ccat ca restart step-ca

This fallback is ugly but deterministic. Report back if you hit it so we can improve the script.

“OIDC configuration endpoint not reachable” — step-ca tries to fetch https://auth.ccat.uni-koeln.de/.well-known/openid-configuration on add. If Dex is down, or if the TLS cert is not trusted by the step-ca container’s OS trust store, this fails. Check:

# From inside the step-ca container
docker exec ccat-ca-step-ca-1 wget -qO- \
  https://auth.ccat.uni-koeln.de/.well-known/openid-configuration

Should return JSON with an issuer field equal to https://auth.ccat.uni-koeln.de. If it returns a TLS error, the step-ca image’s trust store doesn’t have Let’s Encrypt — unusual but possible. If it returns 404, Dex isn’t actually running behind the nginx-proxy vhost: check ccat ca status and ccat ca logs dex.

“provisioner already exists” — the script should handle this, but if you’re running step ca provisioner add manually without the existence check, you hit this. Use step ca provisioner update instead, or remove then add.

Rotation procedures#

Intermediate rotation (planned, every ~5 years)#

Schedule during a low-activity window. Procedure mirrors a shortened ceremony:

  1. Retrieve HSM #1 from the safe.

  2. Unplug HSM #2 from input-b (or use a fresh dongle to avoid downtime during the ceremony).

  3. Air-gapped laptop, both dongles plugged in.

  4. Generate a new intermediate key on the target dongle.

  5. Sign a new intermediate cert with HSM #1.

  6. Return HSM #1 to the safe.

  7. Install the new intermediate cert on input-b, update ca.json to reference the new dongle (if you swapped), restart step-ca.

  8. Previously-issued certs continue to validate (they chain to the unchanged root). New certs are signed by the new intermediate.

Downtime: 10–30 minutes depending on HSM swap logistics.

Intermediate rotation (emergency, after suspected compromise)#

Same procedure, but revoke the old intermediate first by removing its cert from step-ca’s config and forcing clients to refresh their trust chain. Any cert the compromised intermediate issued remains a concern until expiry (30–90 days) — monitor for anomalies.

Root rotation#

This is the catastrophic case. It involves re-bootstrapping every client against a new root. Procedure:

  1. Generate a new root ceremony-style with a spare HSM.

  2. Distribute the new root_ca.crt to every managed host via the ca_trust role (commit the new cert, run the playbook).

  3. Every developer runs step ca bootstrap --force with the new fingerprint.

  4. Every service that was configured with a hard-coded root (Settings.REDIS_CA_CERT_PATH etc.) needs its config rotated.

  5. Dispose of the old root HSM if compromised (or retain it if the rotation was planned ahead of lifetime expiry).

For CCAT-scale this is roughly a half-day of coordinated work. Not nothing, but recoverable. The entire offline-root architecture exists to make this the rare case rather than the routine case.
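Step 3, from a developer’s perspective, is a single command. The CA URL below is illustrative — use the real one, plus the fingerprint announced for the new root:

```shell
# Overwrite the stale root and CA config in ~/.step with the new trust anchor
step ca bootstrap \
  --ca-url https://input-b.data.ccat.uni-koeln.de \
  --fingerprint <new-root-fingerprint> \
  --force
```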

Disaster recovery#

“step-ca container won’t start”#

Usual causes: HSM not visible, PIN mismatch, ca.json syntax error. Diagnose in order:

ccat ca logs step-ca                              # read the error
docker exec -it <step-ca> pkcs11-tool --list-slots  # HSM visible in container?
docker exec -it <step-ca> cat /run/secrets/hsm-pin  # PIN file mounted?
docker exec -it <step-ca> jq . /home/step/config/ca.json  # JSON valid?

If the HSM isn’t visible inside the container but is visible on the host (pkcs11-tool --list-slots from the host SSH session), the devices: mount in docker-compose.ca.yml is wrong — udev may have renumbered the USB bus after a reboot. Update the device path and restart.
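To find the post-reboot device path, a sketch — the USB ID and bus/device numbers shown are examples, yours will differ:

```shell
# Locate the Nitrokey on the host bus
lsusb | grep -i nitrokey
# -> Bus 001 Device 004: ID 20a0:4230 Clay Logic Nitrokey HSM
# That maps to the device node /dev/bus/usb/001/004; point the
# devices: entry in docker-compose.ca.yml at it, then restart:
ccat ca up
```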

“input-b is down, CA is unreachable”#

No action required for existing clients. SSH user certs are valid for 16 hours, TLS certs for 30–90 days. Existing sessions keep working. New certs cannot be issued until input-b is back.

Recovery: bring input-b back online. If the server itself is lost:

  1. Provision a replacement R640 or equivalent.

  2. Restore the three Docker volumes from backup.

  3. Move HSM #2 from the old chassis to the new one’s internal USB.

  4. Re-point DNS if the IP changed.

  5. ccat ca up. Clients do not notice — the CA URL and root fingerprint are unchanged.
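Before closing the incident, confirm the replacement actually answers (CA URL is a placeholder, as above):

```shell
# Liveness of the CA endpoint; prints "ok" when healthy
step ca health --ca-url https://input-b.data.ccat.uni-koeln.de

# Confirm the served root still matches the published fingerprint
step certificate fingerprint root_ca.crt
```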

“HSM #2 has failed”#

HSM #2 contains the intermediate key. If it dies:

  1. Retrieve HSM #1 from the safe.

  2. Buy a new Nitrokey HSM 2 dongle (same model as the failed unit).

  3. Do a rotation ceremony (see above) to produce a new intermediate on the fresh dongle.

  4. Install the new HSM in input-b, update ca.json, restart step-ca.

  5. Existing certs chain to the same root and remain valid.

Downtime: ~1 hour ceremony + recovery time.

“HSM #1 has failed”#

HSM #1 contains the root key. If it dies:

  1. Procure a new Nitrokey HSM 2 dongle to replace it.

  2. Do a full commissioning ceremony to generate a new root.

  3. Produce a new intermediate signed by the new root.

  4. Distribute the new root_ca.crt to every managed host via ca_trust.

  5. Every developer runs step ca bootstrap --force with the new fingerprint.

  6. Every internal service config that hard-codes the root path is rotated.

This is a worst-case event we plan never to experience. It takes roughly half a day of coordinated work for the team.

Ansible roles for trust distribution#

Two Ansible roles support the CA:

ca_trust — distribute public trust material#

Applied to: all managed hosts (input_ccat, input_staging, ccat, eventually travel_hosts).

Responsibilities:

  • Copy root_ca.crt into the system trust store (/etc/pki/ca-trust/source/anchors/ on RHEL, /usr/local/share/ca-certificates/ on Debian) and run update-ca-trust or update-ca-certificates, respectively.

  • Copy ssh_user_ca.pub to /etc/ssh/trusted_user_ca_keys and set TrustedUserCAKeys in sshd_config.

  • Register ssh_host_ca.pub in /etc/ssh/ssh_known_hosts with a @cert-authority *.data.ccat.uni-koeln.de line.
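On a managed host, the end state looks roughly like this — the key type and blob are illustrative:

```shell
# /etc/ssh/sshd_config: accept user certs signed by the SSH user CA
TrustedUserCAKeys /etc/ssh/trusted_user_ca_keys

# /etc/ssh/ssh_known_hosts: trust host certs for the data domain
@cert-authority *.data.ccat.uni-koeln.de ssh-ed25519 AAAAC3Nza... ccat-host-ca
```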

Source files live in ansible/roles/ca_trust/files/. The role is a safe no-op until those files exist — it guards every task on file presence on the controller and warns if nothing is found. This means the role can be merged and wired into playbook_setup_vms.yml before the ceremony without affecting any running host.

Run it standalone:

cd ansible
ansible-playbook playbook_setup_vms.yml --tags ca_trust

Sub-tags ca_trust_x509 and ca_trust_ssh limit the run to just the X.509 trust store or just the SSH trust pieces.

hsm_host — prepare input-b for the HSM#

Applied to: input-b only (inside the - hosts: input-b play in playbook_setup_vms.yml).

Responsibilities:

  • Install the opensc and opensc-tools packages on the host.

  • Deploy 99-nitrokey-hsm.rules udev rule granting the plugdev group access to the Nitrokey HSM 2 device node.

  • Run pkcs11-tool --list-slots and either warn (default) or fail (when _hsm_enforce_verify: true in host_vars) if the HSM is not detected.
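The rule itself is a one-liner. Verify the USB IDs with lsusb before trusting this sketch — 20a0:4230 is the commonly listed ID for the Nitrokey HSM:

```shell
# /etc/udev/rules.d/99-nitrokey-hsm.rules (illustrative)
SUBSYSTEM=="usb", ATTR{idVendor}=="20a0", ATTR{idProduct}=="4230", GROUP="plugdev", MODE="0660"
```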

Run standalone:

cd ansible
ansible-playbook playbook_setup_vms.yml --tags hsm_host -l input-b

Sub-tags: hsm_host_pkg, hsm_host_udev, hsm_host_verify.

To make verification strict once commissioning is done (so that an accidentally-unplugged HSM fails the playbook loudly), create ansible/host_vars/input-b/hsm.yml with:

_hsm_enforce_verify: true


Appendix: Why not Let’s Encrypt for everything?#

A fair question: if Let’s Encrypt already works for our public endpoints, why run our own CA for internal stuff?

  • Let’s Encrypt only works for publicly-resolvable DNS names and reachable HTTP(S) endpoints. Our internal Redis, Postgres, SSH host certs, and service mTLS all run on hostnames like redis.data.ccat.uni-koeln.de that are reachable only from inside our network — Let’s Encrypt cannot validate them.

  • Let’s Encrypt does not issue SSH certs. SSH certs are a completely different format from X.509 TLS certs. step-ca handles both; LE only does TLS.

  • Let’s Encrypt rate limits (50 certs per week per registered domain) would be hit fast if every internal service renewed against the public CA. Our own CA has no such limit.

  • Short-lived internal TLS certs (30 days, renewed weekly) with LE would mean constantly hammering a third-party. With our own CA the operation is free and internal.

LE is the right tool for the outer boundary (the CA’s own public face). step-ca is the right tool for everything behind it.