# Patch Management, Container Security & Supply Chains

```{contents} On this page
:depth: 2
:local: true
```

This document explains why and how we keep our infrastructure patched and our container images secure. It covers the threat landscape, the mechanics of OS and container updates, and the defenses against supply chain attacks, using real-world incidents (including one that hit a sister observatory) as concrete examples.

## Why This Matters — The ALMA Incident

On 29 October 2022, the Atacama Large Millimeter/Submillimeter Array (ALMA) observatory in Chile was hit by a ransomware attack. Attackers gained access through a compromised VPN credential. The result:

- **48 days of lost observations.** No science operations from late October through mid-December 2022.
- **~\$250,000/day in operational cost** while the array sat idle.
- All email, public web presence, and internal IT services taken offline.
- A multinational team (ESO, NAOJ, NRAO) had to rebuild the network from scratch, including replacing the compromised VPN infrastructure.

The Hive ransomware group was suspected. The attack did not compromise ALMA's antennas or archived science data, but the operational and reputational damage was severe.

**ALMA and CCAT share key characteristics**: remote observatory infrastructure, university/consortium governance, small IT teams relative to the complexity of the systems, and scientific data pipelines that cannot tolerate extended downtime.

The lesson is not "be afraid"; it is "patch promptly, limit blast radius, and design for recovery." This document explains how.

## The Two Layers: OS Packages and Container Images

Our infrastructure has two distinct software layers that need independent update strategies:

```text
┌──────────────────────────────────────────────────┐
│  Container Layer (Docker images)                 │
│  ┌──────────┐ ┌──────────┐ ┌──────────────────┐  │
│  │ ops-db   │ │ ops-db-  │ │ data-transfer    │  │
│  │ -api     │ │ ui       │ │                  │  │
│  └────┬─────┘ └────┬─────┘ └────────┬─────────┘  │
│       │            │                │            │
│  python:3.12-slim, node:22-slim (base images)    │
├──────────────────────────────────────────────────┤
│  Host OS Layer (RHEL 9 / Ubuntu)                 │
│  kernel, glibc, openssl, systemd, docker-ce ...  │
└──────────────────────────────────────────────────┘
```

A vulnerability can exist at either layer:

- **OS layer**: A kernel privilege escalation, an OpenSSL buffer overflow, a systemd bug. Exploitable by anything running on the host, including containers (which share the host kernel).
- **Container layer**: A CVE in the Python interpreter, a vulnerable version of `libexpat` baked into the base image, a compromised pip/npm dependency. Only affects that specific container, but can lead to remote code execution within it.

Both layers must be patched. Patching only the OS leaves container images running known-vulnerable libraries. Patching only images leaves the shared kernel exposed.

## OS-Level Patching

### Concepts

**Security-only vs full updates**

Package managers can apply *all* available updates or only those classified as security fixes:

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Mode
     - What it does
     - Trade-off
   * - **Security-only**
     - Applies patches tagged with security advisories (CVE fixes).
       ``dnf upgrade --security`` (RHEL), ``unattended-upgrades`` with
       security origins (Ubuntu).
     - Minimal risk of breaking changes. Leaves feature/bugfix updates for
       planned windows.
   * - **Full update**
     - Applies all available package updates including feature changes,
       new minor versions, and bug fixes.
     - More comprehensive but higher risk of unexpected behavior changes.
       A PostgreSQL minor version bump might change query behavior; a new
       systemd version might change service defaults.
```

**Recommendation**: Security-only for automated/frequent updates. Full updates in planned maintenance windows (monthly or quarterly).

### RHEL 9: `dnf-automatic`

Red Hat Enterprise Linux provides `dnf-automatic`, a package whose systemd timer periodically checks for and optionally applies updates:

```ini
# /etc/dnf/automatic.conf
[commands]
# only security-classified updates
upgrade_type = security
# actually install them (vs just download)
apply_updates = yes
# jitter in seconds, to avoid all hosts hitting mirrors at once
random_sleep = 3600

[emitters]
# send notification email after each run
emit_via = email,stdio
```

Once enabled (`systemctl enable --now dnf-automatic.timer`), the timer runs daily. With `upgrade_type = security`, it only applies patches that have an associated Red Hat Security Advisory (RHSA).

:::{important}
`dnf upgrade --security` requires an active RHEL subscription. The security classification metadata comes from Red Hat's advisory system. Without a subscription, `dnf updateinfo list --security` returns nothing and security-only mode silently does nothing. Verify your subscription is working:

```bash
subscription-manager status
dnf updateinfo list --security
```
:::

### Ubuntu: `unattended-upgrades`

Ubuntu's equivalent is the `unattended-upgrades` package, configured via `/etc/apt/apt.conf.d/50unattended-upgrades`:

```text
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
    // "${distro_id}:${distro_codename}-updates";  ← commented out = security only
};
Unattended-Upgrade::Automatic-Reboot "false";  // don't auto-reboot
Unattended-Upgrade::Mail "ops@example.com";
```

### Reboot Management

Some updates (especially kernel updates) require a reboot to take effect. Until you restart, the system keeps running the old kernel.

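Whether a host is still waiting for that restart can be checked from a script. The sketch below is a minimal illustration, assuming the two standard signals: the exit code of `needs-restarting -r` on RHEL and the `/var/run/reboot-required` marker file on Ubuntu. The function name, the `os_family` labels, and the injectable `run` and `marker` parameters are hypothetical choices made so the logic can be exercised without touching a real host.

```python
import subprocess
from pathlib import Path

# Marker file that Ubuntu packages create when a reboot is pending.
UBUNTU_MARKER = Path("/var/run/reboot-required")


def reboot_required(os_family, run=subprocess.run, marker=UBUNTU_MARKER):
    """Return True if the host is waiting for a reboot.

    os_family is a hypothetical label ("rhel" or "ubuntu") supplied by the caller.
    """
    if os_family == "rhel":
        # needs-restarting -r exits 1 when a reboot is needed, 0 otherwise.
        result = run(["needs-restarting", "-r"], capture_output=True, check=False)
        if result.returncode not in (0, 1):
            raise RuntimeError(
                f"needs-restarting failed with exit code {result.returncode}"
            )
        return result.returncode == 1
    if os_family == "ubuntu":
        # The presence of the marker file is the whole signal.
        return Path(marker).exists()
    raise ValueError(f"unsupported OS family: {os_family!r}")
```

A patching job could call `reboot_required("rhel")` after the update step and only schedule a restart when it returns `True`.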
**Detection**:

- RHEL: `needs-restarting -r` exits with code 1 if a reboot is needed
- Ubuntu: check for `/var/run/reboot-required`

**Options**:

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 20 35 45

   * - Approach
     - How it works
     - When to use
   * - **Scheduled reboot**
     - Reboot at a fixed window (e.g., Sunday 03:00) if needed
     - Staging environments, services that tolerate brief downtime
   * - **Kernel livepatch**
     - Patches the running kernel in memory without rebooting.
       ``kpatch`` (RHEL) or Canonical Livepatch (Ubuntu).
     - 24/7 services where even brief downtime is unacceptable.
       Adds operational complexity.
   * - **Serial rolling reboot**
     - Ansible reboots one host at a time (``serial: 1``), verifying
       health before proceeding to the next
     - Production fleets where you can tolerate one host being down
       temporarily
```

For our fleet, **serial rolling reboots** are the right balance. Our services can tolerate one host being down briefly, and the simplicity of "reboot and verify" beats the operational burden of livepatch management.

### Docker Engine — The Special Case

A Docker engine update restarts the daemon, which by default restarts **all containers** on the host. This is qualitatively different from a library update.

:::{warning}
**Always exclude Docker from automatic updates.**

```ini
# /etc/dnf/dnf.conf
[main]
excludepkgs=docker-ce*,containerd*
```

Docker engine updates should be deliberate, scheduled, and tested on staging first. Quarterly cadence is appropriate unless a critical CVE drops.
:::

Pin the Docker version in Ansible and bump it explicitly:

```yaml
# roles/docker/defaults/main.yml
docker_version: "27.5.1"
```

This ensures you know exactly what version is running and when it changes.

## Container Image Security

### The Base Image Problem

When you write a Dockerfile starting with `FROM python:3.12-slim`, you inherit everything in that base image: the Debian libraries, the Python interpreter, system utilities.
If `libexpat` in that Debian layer has a CVE, your image is vulnerable, even if your application code is perfect.

The challenge: base image maintainers publish updated images regularly, but **the tag stays the same**. `python:3.12-slim` today has different contents than `python:3.12-slim` three months ago. If you don't rebuild, you're frozen on the old (vulnerable) version.

### Tag Pinning vs Digest Pinning

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Strategy
     - Example
     - Properties
   * - **Tag only**
     - ``FROM python:3.12-slim``
     - Mutable: the registry can change what this points to at any time.
       You get whatever was latest when you last built. Builds are not
       reproducible.
   * - **Tag + digest**
     - ``FROM python:3.12-slim@sha256:abc123...``
     - Immutable: the digest is a content hash. Even if the tag is
       updated, your build uses exactly this image. Builds are
       reproducible. You must explicitly bump the digest to get updates.
```

**Digest pinning is the secure choice.** It makes your builds reproducible and prevents surprise changes. The trade-off is that you need a mechanism to detect when the upstream image has been updated and bump the digest.

### Image Scanning

Image scanners inspect a built container image and compare its packages against vulnerability databases (NVD, Red Hat OVAL, Debian Security Tracker, GitHub Advisory Database, etc.).

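The comparison step can be illustrated with a toy matcher. This is a sketch under simplifying assumptions, not how any particular scanner is implemented: versions are plain dotted integers, the `Advisory` record and `scan` function are hypothetical names, and the package list and advisories are made-up data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    cve: str
    package: str
    fixed_in: tuple  # first version containing the fix, e.g. (1, 3, 0)
    severity: str


def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0). Real scanners apply distro-specific rules."""
    return tuple(int(part) for part in text.split("."))


def scan(installed, advisories):
    """Return advisories whose package is installed at a version below the fix."""
    findings = []
    for adv in advisories:
        version = installed.get(adv.package)
        if version is not None and parse_version(version) < adv.fixed_in:
            findings.append(adv)
    return findings


# Hypothetical inputs: a package list as it might be read from dpkg or pip,
# and two invented advisories.
installed = {"libfoo": "1.2.0", "libbar": "3.1.4"}
advisories = [
    Advisory("CVE-X", "libfoo", (1, 3, 0), "HIGH"),
    Advisory("CVE-Y", "libbar", (3, 0, 0), "MEDIUM"),
]

for finding in scan(installed, advisories):
    print(
        f"{finding.cve} in {finding.package} "
        f"{installed[finding.package]} severity: {finding.severity}"
    )
```

Only `libfoo` is reported here, because the installed `libbar` is already newer than the version that fixed its advisory. Production scanners add the hard parts this sketch skips: reading package metadata out of image layers and handling each ecosystem's version ordering.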
**How it works**:

```text
┌──────────────┐      ┌─────────────────┐      ┌──────────────────┐
│ Built image  │ ───> │ Scanner reads   │ ───> │ Compare against  │
│ (OCI layers) │      │ package lists   │      │ vuln databases   │
└──────────────┘      │ (dpkg, rpm,     │      │ (NVD, GHSA, etc) │
                      │ pip, npm, etc)  │      └────────┬─────────┘
                      └─────────────────┘               │
                                                ┌───────▼────────┐
                                                │ Report: CVE-X  │
                                                │ in libfoo 1.2  │
                                                │ severity: HIGH │
                                                └────────────────┘
```

**Common scanners**:

- **Trivy** (Aqua Security): widely used, broad vulnerability database
- **Grype** (Anchore): fast, similar CLI model, different supply chain
- **Docker Scout**: built into Docker CLI (`docker scout cves`)
- **Clair** (Red Hat/Quay): runs as a service, mature

:::{important}
Scanning only helps if it **blocks deployment**. A scanner that produces reports nobody reads is security theater. Configure your CI to fail the build on CRITICAL and HIGH findings:

```bash
grype ghcr.io/ccatobs/ops-db-api:latest --fail-on high
```

Use Grype's `--only-fixed` (or the equivalent in your scanner, such as Trivy's `--ignore-unfixed`) to suppress CVEs that have no available fix yet. You cannot act on these, and they create alert fatigue.
:::

### SBOM — Software Bill of Materials

An SBOM is a machine-readable inventory of every component in your image: packages, libraries, versions. Think of it as a "nutrition label" for software.

```bash
# Generate an SBOM in CycloneDX format
grype ghcr.io/ccatobs/ops-db-api:latest -o cyclonedx-json > sbom.json
```

**Why generate SBOMs?** When a new CVE drops (e.g., "all versions of libexpat < 2.6.0 are vulnerable"), you can query your SBOMs to answer "which of our images contain libexpat, and which version?" in seconds, instead of rebuilding and scanning everything.

Store SBOMs as CI artifacts alongside each image build. They cost almost nothing to generate and are invaluable during incident response.

## Supply Chain Attacks

A supply chain attack compromises software **before it reaches you**: in the build system, the package registry, or the distribution channel.
You download and run malicious code believing it to be legitimate.

### Anatomy of a Supply Chain Attack

```text
Legitimate flow:

  Developer → Source Code → CI Build → Registry → Your Server
                                                  (you trust this)

Supply chain attack (multiple vectors):

  Developer → [Compromised credentials] → Malicious CI Build → Registry → Your Server
                                          (you trust this but it's poisoned)

  Developer → Source Code → CI Build → [Compromised registry tag] → Your Server
                                       (tag now points to malicious image)

  Developer → Source Code → [Compromised CI Action] → Secrets exfiltrated
                            (third-party action replaced with malware)
```

### Case Study: The Trivy/CanisterWorm Attack (March 2026)

On 19 March 2026, Trivy, one of the most widely used open-source vulnerability scanners, was itself compromised. The irony is sharp: the tool teams relied on to *detect* vulnerabilities became the *vector* for attack.

**What happened**: A threat actor group called "TeamPCP" exploited incompletely revoked credentials from a prior security incident at Aqua Security (Trivy's maintainer). They launched a multi-vector attack:

1. **GitHub Actions poisoning**: Force-pushed 76 of 77 version tags in the `aquasecurity/trivy-action` repository, redirecting trusted version references (like `@v0.35`) to malicious commits.
2. **Malicious binary release**: Triggered Trivy's release automation to publish an infected binary (v0.69.4) to GitHub Releases, Docker Hub, and container registries.
3. **CanisterWorm propagation**: The payload included a self-propagating npm worm that spread to 47+ packages. It used an Internet Computer blockchain canister as a C2 dead-drop, the first documented abuse of this technique.
4. **Credential harvesting**: The malware exfiltrated CI/CD secrets, cloud credentials, SSH keys, and Docker configurations, all while Trivy scans appeared to complete successfully.

**Impact**: CVE-2026-33634 (CVSS 9.4). Exposure window: 19–23 March 2026. The attacker also defaced all 44 Aqua Security internal repositories.

**Root cause**: Credentials from a *prior* incident were not fully revoked. The attacker retained residual access to release infrastructure.

**Key lessons**:

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Lesson
     - Implication
   * - **Mutable tags are dangerous**
     - ``@v0.35`` was force-pushed to point to malicious code. The tag
       looked the same; the content was completely different.
   * - **Pin by commit SHA**
     - ``@d43cc1dfea034b8e4e523b399d14fd25f7535bc5`` cannot be
       force-pushed. It is a content-addressed reference.
   * - **Don't be first-wave adopters**
     - The community detected the compromise within ~4 days. A one-week
       delay before adopting new CI tool releases would have skipped the
       entire exposure window; a 72-hour delay would have covered most
       of it.
   * - **Multi-vendor scanning**
     - If you run two scanners from different organizations, an attacker
       must compromise both to hide a finding from you.
   * - **The scanner is part of the supply chain**
     - Security tools are software too. They need the same scrutiny as
       any other dependency.
```

### Other Notable Supply Chain Incidents

- **SolarWinds (2020)**: Attackers compromised the build system of SolarWinds Orion, inserting a backdoor into signed updates distributed to ~18,000 organizations including US government agencies. Detected after ~9 months.
- **Codecov (2021)**: Attackers modified the Codecov bash uploader script to exfiltrate environment variables (including CI secrets) from customer CI pipelines. The compromised script was served for ~2 months.
- **ua-parser-js (2021)**: A popular npm package (7M weekly downloads) was hijacked via compromised maintainer credentials. Malicious versions mined cryptocurrency and stole passwords.
- **xz-utils (2024)**: A long-running social engineering campaign in which an attacker gained maintainer trust over 2+ years, then inserted a backdoor into the xz compression library, targeting SSH authentication on Linux systems. Caught by accident days before widespread distribution.

The pattern is consistent: compromise the build/release pipeline, poison the artifact, rely on trust in the distribution channel.

## Defense Patterns

No single defense stops supply chain attacks. The strategy is **defense in depth**: multiple independent layers so that any single failure is contained.

### Pin Everything by Content Hash

Mutable references (tags, version ranges, branch names) can be redirected. Content-addressed references (SHA digests, commit SHAs) cannot.

```dockerfile
# BAD: mutable tag, can be replaced silently
FROM python:3.12-slim

# GOOD: digest-pinned, immutable
FROM python:3.12-slim@sha256:abc123def456...
```

```yaml
# BAD: tag can be force-pushed (exactly what happened with Trivy)
- uses: anchore/scan-action@v4

# GOOD: commit SHA, immutable
- uses: anchore/scan-action@d43cc1dfea034b8e4e523b399d14fd25f7535bc5
```

This applies to Docker base images, GitHub Actions, Ansible Galaxy roles, and pip/npm dependencies with lock files: anywhere a reference can be replaced without changing its name.

### Delay Adoption — Never Be First-Wave

Many compromises of public packages and actions are detected by the community within hours to days, although some (SolarWinds, Codecov) went unnoticed for months. A deliberate delay before adopting new releases keeps you out of the first wave of exposure:

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Update type
     - Recommended delay
     - Rationale
   * - Base image digest bump
     - 72 hours
     - Security rebuild of existing version; low risk but worth the wait
   * - Minor/major version bump
     - 1 week + human review
     - Behavioral changes possible; needs testing
   * - CI tool updates
     - 1 week minimum
     - CI tools run with elevated privileges (secrets access); highest
       supply chain risk
   * - OS security patches
     - 7 days (staging → production)
     - Staging acts as canary for the full bake period
```

### Multi-Vendor Scanning

Running two scanners from different organizations means an attacker must compromise both supply chains to suppress a finding:

```bash
# Scanner 1: Grype (Anchore)
grype "$IMAGE" --fail-on high

# Scanner 2: Docker Scout (Docker Inc.)
docker scout cves "$IMAGE" --only-severity critical,high --exit-code
```

Different vulnerability databases, different organizations, different build pipelines. If they both say clean, confidence is higher. If they disagree, investigate.

### Staging-First Deployment

Every change flows through staging before production. This applies to OS patches, container images, and infrastructure changes:

```text
┌──────────┐     7-day bake     ┌───────────────┐
│ Staging  │ ─────────────────> │ Production    │
│ (auto)   │                    │ (manual gate) │
└──────────┘                    └───────────────┘

If staging breaks during the bake period,
production is unaffected and you fix forward.
```

This pattern catches not just compromised software, but also legitimate updates that happen to break your specific configuration.

### Run the Scanner Locally, Not as a Third-Party Action

The Trivy attack worked because teams delegated scanning to a third-party GitHub Action. That Action ran with access to repository secrets. When the Action was compromised, so were the secrets.

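One way to gauge exposure to this class of attack is to list every `uses:` reference in a workflow that is not pinned to a full commit SHA. The following is a minimal sketch, assuming workflow files are read as plain text; the function name and regular expressions are illustrative, and the sample workflow is invented for the demonstration.

```python
import re

# Matches `uses: owner/repo@ref` (optionally as a list item) in workflow text.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)", re.MULTILINE)
# A full git commit SHA is 40 lowercase hex characters.
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return (action, ref) pairs whose ref is not a full 40-character commit SHA."""
    return [
        (action, ref)
        for action, ref in USES_RE.findall(workflow_text)
        if not FULL_SHA_RE.match(ref)
    ]


# Invented sample workflow with two tag references and one SHA reference.
workflow = """
steps:
  - uses: actions/checkout@v4
  - uses: anchore/scan-action@d43cc1dfea034b8e4e523b399d14fd25f7535bc5
  - name: Scan
    uses: aquasecurity/trivy-action@v0.35
"""

for action, ref in unpinned_actions(workflow):
    print(f"not pinned by SHA: {action}@{ref}")
```

Run over the files in `.github/workflows/`, a check like this could fail CI whenever a mutable tag slips in. It is a text-level heuristic rather than a YAML parser, so unusual formatting may need extra handling.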
A safer pattern: download the scanner binary yourself, verify its checksum, and run it directly:

```yaml
# Download specific version, verify integrity, then scan
- name: Install Grype
  run: |
    # Note: install.sh comes from the mutable main branch; pin the URL to a
    # commit SHA to stay consistent with the pinning rules in this document.
    curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh \
      | sh -s -- -b /usr/local/bin v0.87.0
    # EXPECTED_SHA is a placeholder for the SHA-256 you recorded for this binary
    echo "EXPECTED_SHA  /usr/local/bin/grype" | sha256sum -c -

- name: Scan image
  run: grype "$IMAGE" --fail-on high
```

This removes the GitHub Actions supply chain from the scanner's trust path. The remaining trust relationships are with the install script and the binary you verified by checksum.

## Audit and Compliance

### Standards and Timelines

Two frameworks are most relevant to our infrastructure:

**NIST SP 800-40 Rev 4** (Guide to Enterprise Patch Management Planning) frames patching as preventive maintenance. The guide does not itself prescribe fixed deadlines; the targets we adopt within that framework (close to the 15-day and 30-day windows of CISA BOD 19-02) are:

- Critical vulnerabilities: remediate within **14 days**
- High vulnerabilities: remediate within **30 days**
- Maintain an inventory of all software assets (SBOMs help here)
- Automate patch deployment where possible

**CIS Benchmarks** (Center for Internet Security):

- CIS RHEL 9 Benchmark §1.9: "Ensure updates, patches, and additional security software are installed"
- CIS Docker Benchmark §1.1.2: "Ensure Docker is up to date"
- CIS recommends automated patching with verification

Our update cadence (nightly staging → weekly production for security patches) meets these remediation targets for critical and high vulnerabilities and follows the CIS recommendations.

### What an Auditor Looks For

A third-party security audit of patch management typically examines:

1. **Policy**: Is there a documented patch management policy with defined timelines?
2. **Automation**: Are patches applied automatically or on a documented schedule?
3. **Testing**: Is there a staging/test environment where patches are validated before production?
4. **Coverage**: Are both OS and application layers covered?
5. **Monitoring**: Is there alerting for failed patches or missing updates?
6. **Audit trail**: Can you show when each patch was applied and by whom?
7. **Rollback**: Is there a documented rollback procedure?
8. **Exceptions**: Is there a process for documenting and tracking exceptions (patches that cannot be applied immediately)?

Jenkins job logs, Ansible output, and Grafana annotations provide the audit trail. Documented procedures (the workflow document) provide policy evidence. The staging environment provides testing evidence.

## Glossary

:::{glossary}
:sorted: true

CVE (Common Vulnerabilities and Exposures)
: A unique identifier for a publicly known security vulnerability. Format: `CVE-YYYY-NNNNN`. Assigned through the CVE Program and listed in the National Vulnerability Database (NVD).

CVSS (Common Vulnerability Scoring System)
: A numerical score (0.0–10.0) rating the severity of a vulnerability. Critical: 9.0–10.0, High: 7.0–8.9, Medium: 4.0–6.9, Low: 0.1–3.9.

RHSA (Red Hat Security Advisory)
: A Red Hat advisory documenting a security fix and the affected packages. `dnf upgrade --security` uses these to classify which updates are security-relevant.

SBOM (Software Bill of Materials)
: A machine-readable inventory of all components in a software artifact. Formats include CycloneDX and SPDX. Enables rapid impact assessment when new CVEs are published.

Supply Chain Attack
: An attack that compromises software during its build, packaging, or distribution, before it reaches the end user. The end user trusts the artifact because it comes from a legitimate source.

Digest (Image Digest)
: A SHA-256 hash that identifies a container image by its contents (computed over the image manifest, which references each layer by hash). Immutable and content-addressed: if a single byte changes, the digest changes. Format: `sha256:abc123...`.

Mutable Tag
: A container image tag (like `v1.0` or `latest`) that can be reassigned to point to a different image digest at any time. Tags are convenient but provide no integrity guarantee.

`dnf-automatic`
: A RHEL/Fedora service that periodically checks for and optionally applies package updates. Configured via `/etc/dnf/automatic.conf`. Controlled by a systemd timer.

`unattended-upgrades`
: The Ubuntu equivalent of `dnf-automatic`. Applies security updates automatically. Configured via files in `/etc/apt/apt.conf.d/`.

`needs-restarting`
: A RHEL utility (from `yum-utils`) that checks whether a reboot is required after package updates. Exit code 1 means a reboot is needed.

Image Scanning
: The process of inspecting a container image's installed packages against vulnerability databases to identify known CVEs.

Content-Addressed Reference
: A reference to a software artifact based on its cryptographic hash, making it immutable. Examples: Docker image digests, git commit SHAs. Contrast with mutable references like tags or branch names.
:::

## Further Reading

- {doc}`tls-and-pki`: TLS certificates and PKI concepts in the CCAT Data Center
- {doc}`../secrets-management`: Operational guide for managing secrets
- [NIST SP 800-40 Rev 4](https://csrc.nist.gov/publications/detail/sp/800-40/rev-4/final): Guide to Enterprise Patch Management Planning
- [CIS Benchmarks](https://www.cisecurity.org/cis-benchmarks): Security configuration guides for RHEL, Docker, and Kubernetes
- [ALMA Cyberattack Recovery](https://almascience.eso.org/news/alma-update-on-the-recovery-from-october-29-cyberattack): ALMA's official account of the 2022 ransomware incident
- [Trivy Supply Chain Incident (CVE-2026-33634)](https://github.com/aquasecurity/trivy/discussions/10425): Aqua Security's incident discussion and timeline
- [Grype](https://github.com/anchore/grype): Anchore's open-source vulnerability scanner
- [Docker Scout](https://docs.docker.com/scout/): Docker's built-in image analysis tool