Patch Management, Container Security & Supply Chains#

This document explains why and how we keep our infrastructure patched and our container images secure. It covers the threat landscape, the mechanics of OS and container updates, and the defenses against supply chain attacks — using real-world incidents (including one that hit a sister observatory) as concrete examples.

Why This Matters — The ALMA Incident#

On 29 October 2022, the Atacama Large Millimeter/Submillimeter Array (ALMA) observatory in Chile was hit by a ransomware attack. Attackers gained access through a compromised VPN credential. The result:

  • 48 days of lost observations. No science operations from late October through mid-December 2022.

  • ~$250,000/day in operational cost while the array sat idle.

  • All email, public web presence, and internal IT services taken offline.

  • A multinational team (ESO, NAOJ, NRAO) had to rebuild the network from scratch, including replacing the compromised VPN infrastructure.

The Hive ransomware group was suspected. The attack did not compromise ALMA’s antennas or archived science data, but the operational and reputational damage was severe.

ALMA and CCAT share key characteristics: remote observatory infrastructure, university/consortium governance, small IT teams relative to the complexity of the systems, and scientific data pipelines that cannot tolerate extended downtime.

The lesson is not “be afraid” — it is “patch promptly, limit blast radius, and design for recovery.” This document explains how.

The Two Layers: OS Packages and Container Images#

Our infrastructure has two distinct software layers that need independent update strategies:

┌──────────────────────────────────────────────────┐
│  Container Layer (Docker images)                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────────────┐  │
│  │ ops-db   │ │ ops-db-  │ │ data-transfer    │  │
│  │ -api     │ │ ui       │ │                  │  │
│  └────┬─────┘ └────┬─────┘ └───────┬──────────┘  │
│       │ python:3.12-slim    node:22-slim          │
│       │ (base image)        (base image)          │
│  ─────┴────────────────────────────┴──────────────│
│                                                    │
│  Host OS Layer (RHEL 9 / Ubuntu)                  │
│  kernel, glibc, openssl, systemd, docker-ce ...   │
└──────────────────────────────────────────────────┘

A vulnerability can exist at either layer:

  • OS layer: A kernel privilege escalation, an OpenSSL buffer overflow, a systemd bug. Exploitable by anything running on the host, including containers (which share the host kernel).

  • Container layer: A CVE in the Python interpreter, a vulnerable version of libexpat baked into the base image, a compromised pip/npm dependency. Only affects that specific container, but can lead to remote code execution within it.

Both layers must be patched. Patching only the OS leaves container images running known-vulnerable libraries. Patching only images leaves the shared kernel exposed.

OS-Level Patching#

Concepts#

Security-only vs full updates

Package managers can apply all available updates or only those classified as security fixes:

| Mode | What it does | Trade-off |
|------|--------------|-----------|
| Security-only | Applies patches tagged with security advisories (CVE fixes). dnf upgrade --security (RHEL), unattended-upgrades with security origins (Ubuntu). | Minimal risk of breaking changes. Leaves feature/bugfix updates for planned windows. |
| Full update | Applies all available package updates, including feature changes, new minor versions, and bug fixes. | More comprehensive but higher risk of unexpected behavior changes. A PostgreSQL minor version bump might change query behavior; a new systemd version might change service defaults. |

Recommendation: Security-only for automated/frequent updates. Full updates in planned maintenance windows (monthly or quarterly).
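
The commands behind each mode, for reference:

# RHEL: security-only (patches with an associated advisory)
sudo dnf upgrade --security

# RHEL: full update (planned maintenance windows)
sudo dnf upgrade

# Ubuntu: full update in a maintenance window; security-only is
# handled by unattended-upgrades (next section)
sudo apt update && sudo apt upgrade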

RHEL 9: dnf-automatic#

Red Hat Enterprise Linux provides dnf-automatic, a systemd timer that periodically checks for and optionally applies updates:

# /etc/dnf/automatic.conf
[commands]
upgrade_type = security     # only security-classified updates
apply_updates = yes         # actually install them (vs just download)
random_sleep = 3600         # jitter to avoid all hosts hitting mirrors at once

[emitters]
emit_via = email,stdio      # send notification email after each run

The timer runs daily. With upgrade_type = security, it only applies patches that have an associated Red Hat Security Advisory (RHSA).
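
The timer is enabled like any other systemd unit:

# Enable and start the timer that drives dnf-automatic
sudo systemctl enable --now dnf-automatic.timer

# Confirm when it will next fire
systemctl list-timers dnf-automatic.timer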

Important

The --security flag requires an active RHEL subscription: the security classification metadata comes from Red Hat’s advisory system. Without a subscription, dnf updateinfo list --security returns nothing, and security-only mode silently does nothing.

Verify your subscription is working:

subscription-manager status
dnf updateinfo list --security

Ubuntu: unattended-upgrades#

Ubuntu’s equivalent is the unattended-upgrades package, configured via /etc/apt/apt.conf.d/50unattended-upgrades:

Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
    // "${distro_id}:${distro_codename}-updates";   ← commented out = security only
};

Unattended-Upgrade::Automatic-Reboot "false";       // don't auto-reboot
Unattended-Upgrade::Mail "ops@example.com";
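
To validate the configuration without touching the system, the package ships a dry-run mode:

# Show what would be upgraded and why (verbose origin matching)
sudo unattended-upgrade --dry-run --debug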

Reboot Management#

Some updates (especially kernel updates) require a reboot to take effect: the system keeps running the old, vulnerable kernel until you restart.

Detection:

  • RHEL: needs-restarting -r exits with code 1 if a reboot is needed

  • Ubuntu: check for /var/run/reboot-required
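
A combined check, as a sketch (it mirrors the semantics above: exit 0 means no reboot needed, nonzero means one is required):

#!/bin/bash
# reboot-needed.sh: exit 0 if no reboot is required, 1 if one is.
if command -v needs-restarting >/dev/null 2>&1; then
    # RHEL: needs-restarting -r exits nonzero when a reboot is needed
    needs-restarting -r >/dev/null 2>&1
    exit $?
elif [ -f /var/run/reboot-required ]; then
    # Ubuntu: marker file is created when a reboot is needed
    exit 1
else
    exit 0
fi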

Options:

| Approach | How it works | When to use |
|----------|--------------|-------------|
| Scheduled reboot | Reboot at a fixed window (e.g., Sunday 03:00) if needed | Staging environments, services that tolerate brief downtime |
| Kernel livepatch | Patches the running kernel in memory without rebooting. kpatch (RHEL) or Canonical Livepatch (Ubuntu). | 24/7 services where even brief downtime is unacceptable. Adds operational complexity. |
| Serial rolling reboot | Ansible reboots one host at a time (serial: 1), verifying health before proceeding to the next | Production fleets where you can tolerate one host being down temporarily |

For our fleet, serial rolling reboots are the right balance. Our services can tolerate one host being down briefly, and the simplicity of “reboot and verify” beats the operational burden of livepatch management.
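
A sketch of the pattern (playbook path, host group, and health endpoint are placeholders, not our real inventory):

# playbooks/rolling-reboot.yml (sketch)
- hosts: fleet                          # placeholder group name
  serial: 1                             # one host at a time
  tasks:
    - name: Reboot the host if required
      ansible.builtin.reboot:
        reboot_timeout: 600
      when: reboot_required | default(false)   # set by a prior detection task

    - name: Wait for the service to report healthy before moving on
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"   # placeholder endpoint
      register: health
      retries: 5
      delay: 10
      until: health.status == 200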

Docker Engine — The Special Case#

A Docker engine update restarts the daemon, which restarts all containers on the host. This is qualitatively different from a library update.

Warning

Always exclude Docker from automatic updates.

# /etc/dnf/dnf.conf
[main]
excludepkgs=docker-ce*,containerd*

Docker engine updates should be deliberate, scheduled, and tested on staging first. Quarterly cadence is appropriate unless a critical CVE drops.

Pin the Docker version in Ansible and bump it explicitly:

# roles/docker/defaults/main.yml
docker_version: "27.5.1"

This ensures you know exactly what version is running and when it changes.
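
A hedged sketch of how the role might consume that pin (task layout illustrative):

# roles/docker/tasks/main.yml (illustrative)
- name: Install the pinned Docker engine version
  ansible.builtin.dnf:
    name: "docker-ce-{{ docker_version }}"
    state: present
    allow_downgrade: true    # enforce the pin even if a newer build slipped in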

Container Image Security#

The Base Image Problem#

When you write a Dockerfile starting with FROM python:3.12-slim, you inherit everything in that base image: the Debian libraries, the Python interpreter, system utilities. If libexpat in that Debian layer has a CVE, your image is vulnerable — even if your application code is perfect.

The challenge: base image maintainers publish updated images regularly, but the tag stays the same. python:3.12-slim today has different contents than python:3.12-slim three months ago. If you don’t rebuild, you’re frozen on the old (vulnerable) version.

Tag Pinning vs Digest Pinning#

| Strategy | Example | Properties |
|----------|---------|------------|
| Tag only | FROM python:3.12-slim | Mutable — the registry can change what this points to at any time. You get whatever was latest when you last built. Builds are not reproducible. |
| Tag + digest | FROM python:3.12-slim@sha256:abc123... | Immutable — the digest is a content hash. Even if the tag is updated, your build uses exactly this image. Builds are reproducible. You must explicitly bump the digest to get updates. |

Digest pinning is the secure choice. It makes your builds reproducible and prevents surprise changes. The trade-off is that you need a mechanism to detect when the upstream image has been updated and bump the digest.
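
One way to build that detection, sketched with skopeo (the pinned digest below is a placeholder; in practice, read it from the Dockerfile):

# Digest the tag currently points to upstream
LIVE=$(skopeo inspect docker://docker.io/library/python:3.12-slim | jq -r '.Digest')

# Digest pinned in our Dockerfile
PINNED="sha256:abc123..."    # placeholder

if [ "$LIVE" != "$PINNED" ]; then
    echo "python:3.12-slim has been updated upstream: $LIVE"
    # open a PR or notify; this is where the digest bump gets proposed
fi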

Image Scanning#

Image scanners inspect a built container image and compare its packages against vulnerability databases (NVD, Red Hat OVAL, Debian Security Tracker, GitHub Advisory Database, etc.).

How it works:

┌─────────────┐      ┌────────────────┐      ┌─────────────────┐
│ Built image  │ ───> │ Scanner reads   │ ───> │ Compare against  │
│ (OCI layers) │      │ package lists   │      │ vuln databases   │
└─────────────┘      │ (dpkg, rpm,     │      │ (NVD, GHSA, etc) │
                      │  pip, npm, etc) │      └────────┬────────┘
                      └────────────────┘               │
                                               ┌───────▼───────┐
                                               │ Report: CVE-X  │
                                               │ in libfoo 1.2  │
                                               │ severity: HIGH │
                                               └───────────────┘

Common scanners:

  • Trivy (Aqua Security) — widely used, broad vulnerability database

  • Grype (Anchore) — fast, similar CLI model, different supply chain

  • Docker Scout — built into Docker CLI (docker scout cves)

  • Clair (Red Hat/Quay) — runs as a service, mature

Important

Scanning only helps if it blocks deployment. A scanner that produces reports nobody reads is security theater. Configure your CI to fail the build on CRITICAL and HIGH findings:

grype ghcr.io/ccatobs/ops-db-api:latest --fail-on high

Use grype’s --only-fixed flag (trivy’s equivalent is --ignore-unfixed) to suppress CVEs that have no available fix yet; you cannot act on these, and they create alert fatigue.

SBOM — Software Bill of Materials#

An SBOM is a machine-readable inventory of every component in your image: packages, libraries, versions. Think of it as a “nutrition label” for software.

# Generate an SBOM in CycloneDX format (syft is grype's companion SBOM tool)
syft ghcr.io/ccatobs/ops-db-api:latest -o cyclonedx-json > sbom.json

Why generate SBOMs?

When a new CVE drops (e.g., “all versions of libexpat < 2.6.0 are vulnerable”), you can query your SBOMs to answer “which of our images contain libexpat, and which version?” in seconds — instead of rebuilding and scanning everything.
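
For example, a CycloneDX SBOM stores components in a flat array, so a jq query answers this directly (the sboms/ directory layout is illustrative):

# Does this image contain libexpat, and which version?
jq -r '.components[] | select(.name == "libexpat") | "\(.name) \(.version)"' sbom.json

# Sweep every stored SBOM to find affected images
for f in sboms/*.json; do
    jq -r --arg img "$f" \
       '.components[] | select(.name == "libexpat") | "\($img): \(.version)"' "$f"
done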

Store SBOMs as CI artifacts alongside each image build. They cost almost nothing to generate and are invaluable during incident response.

Supply Chain Attacks#

A supply chain attack compromises software before it reaches you — in the build system, the package registry, or the distribution channel. You download and run malicious code believing it to be legitimate.

Anatomy of a Supply Chain Attack#

Legitimate flow:

Developer → Source Code → CI Build → Registry → Your Server
                                                 (you trust this)

Supply chain attack (multiple vectors):

Developer → [Compromised credentials] → Malicious CI Build → Registry → Your Server
                                                                         (you trust this
                                                                          but it's poisoned)

Developer → Source Code → CI Build → [Compromised registry tag] → Your Server
                                                                   (tag now points to
                                                                    malicious image)

Developer → Source Code → [Compromised CI Action] → Secrets exfiltrated
                           (third-party action
                            replaced with malware)

Case Study: The Trivy/CanisterWorm Attack (March 2026)#

On 19 March 2026, Trivy — one of the most widely used open-source vulnerability scanners — was itself compromised. The irony is sharp: the tool teams relied on to detect vulnerabilities became the vector for attack.

What happened:

A threat actor group called “TeamPCP” exploited incompletely revoked credentials from a prior security incident at Aqua Security (Trivy’s maintainer). They launched a multi-vector attack:

  1. GitHub Actions poisoning: Force-pushed 76 of 77 version tags in the aquasecurity/trivy-action repository, redirecting trusted version references (like @v0.35) to malicious commits.

  2. Malicious binary release: Triggered Trivy’s release automation to publish an infected binary (v0.69.4) to GitHub Releases, Docker Hub, and container registries.

  3. CanisterWorm propagation: The payload included a self-propagating npm worm that spread to 47+ packages. It used an Internet Computer blockchain canister as a C2 dead-drop — the first documented abuse of this technique.

  4. Credential harvesting: The malware exfiltrated CI/CD secrets, cloud credentials, SSH keys, and Docker configurations — all while Trivy scans appeared to complete successfully.

Impact: CVE-2026-33634 (CVSS 9.4). Exposure window: 19–23 March 2026. The attacker also defaced all 44 Aqua Security internal repositories.

Root cause: Credentials from a prior incident were not fully revoked. The attacker retained residual access to release infrastructure.

Key lessons:

| Lesson | Implication |
|--------|-------------|
| Mutable tags are dangerous | @v0.35 was force-pushed to point to malicious code. The tag looked the same; the content was completely different. |
| Pin by commit SHA | @d43cc1dfea034b8e4e523b399d14fd25f7535bc5 cannot be force-pushed. It is a content-addressed reference. |
| Don’t be first-wave adopters | A 72-hour delay before adopting new releases would have avoided the entire exposure window. The community detected the compromise within ~4 days. |
| Multi-vendor scanning | If you run two scanners from different organizations, an attacker must compromise both simultaneously. |
| The scanner is part of the supply chain | Security tools are software too. They need the same scrutiny as any other dependency. |

Other Notable Supply Chain Incidents#

  • SolarWinds (2020): Attackers compromised the build system of SolarWinds Orion, inserting a backdoor into signed updates distributed to ~18,000 organizations including US government agencies. Detected after ~9 months.

  • Codecov (2021): Attackers modified the Codecov bash uploader script to exfiltrate environment variables (including CI secrets) from customer CI pipelines. The compromised script was served for ~2 months.

  • ua-parser-js (2021): A popular npm package (7M weekly downloads) was hijacked via compromised maintainer credentials. Malicious versions mined cryptocurrency and stole passwords.

  • xz-utils (2024): A long-running social engineering campaign where an attacker gained maintainer trust over 2+ years, then inserted a backdoor into the xz compression library — targeting SSH authentication on Linux systems. Caught by accident days before widespread distribution.

The pattern is consistent: compromise the build/release pipeline, poison the artifact, rely on trust in the distribution channel.

Defense Patterns#

No single defense stops supply chain attacks. The strategy is defense in depth — multiple independent layers so that any single failure is contained.

Pin Everything by Content Hash#

Mutable references (tags, version ranges, branch names) can be redirected. Content-addressed references (SHA digests, commit SHAs) cannot.

# BAD — mutable tag, can be replaced silently
FROM python:3.12-slim

# GOOD — digest-pinned, immutable
FROM python:3.12-slim@sha256:abc123def456...

# BAD — tag can be force-pushed (exactly what happened with Trivy)
- uses: anchore/scan-action@v4

# GOOD — commit SHA, immutable
- uses: anchore/scan-action@d43cc1dfea034b8e4e523b399d14fd25f7535bc5

This applies to: Docker base images, GitHub Actions, Ansible Galaxy roles, pip/npm dependencies with lock files. Anywhere a reference can be replaced without changing the name.

Delay Adoption — Never Be First-Wave#

Most supply chain compromises are detected within hours to days by the community. A deliberate delay before adopting new releases means you are never in the blast radius of initial exposure:

| Update type | Recommended delay | Rationale |
|-------------|-------------------|-----------|
| Base image digest bump | 72 hours | Security rebuild of existing version; low risk but worth the wait |
| Minor/major version bump | 1 week + human review | Behavioral changes possible; needs testing |
| CI tool updates | 1 week minimum | CI tools run with elevated privileges (secrets access); highest supply chain risk |
| OS security patches | 7 days (staging → production) | Staging acts as canary for the full bake period |
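
This delay can be enforced mechanically rather than by habit. A sketch that gates CI on release age via the GitHub API (the repository and threshold are illustrative):

# Refuse to adopt a release younger than 72 hours
PUBLISHED=$(curl -s https://api.github.com/repos/anchore/grype/releases/latest \
    | jq -r '.published_at')
AGE_SECONDS=$(( $(date +%s) - $(date -d "$PUBLISHED" +%s) ))
if [ "$AGE_SECONDS" -lt $(( 72 * 3600 )) ]; then
    echo "Latest release is younger than 72h; waiting out the first-wave window"
    exit 1
fi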

Multi-Vendor Scanning#

Running two scanners from different organizations means an attacker must compromise both supply chains simultaneously:

# Scanner 1: Grype (Anchore)
grype $IMAGE --fail-on high

# Scanner 2: Docker Scout (Docker Inc.)
docker scout cves $IMAGE --only-severity critical,high --exit-code

Different vulnerability databases, different organizations, different build pipelines. If they both say clean, confidence is high. If they disagree, investigate.

Staging-First Deployment#

Every change flows through staging before production. This applies to OS patches, container images, and infrastructure changes:

┌──────────┐     7-day bake     ┌──────────────┐
│ Staging  │ ─────────────────> │ Production    │
│ (auto)   │                    │ (manual gate) │
└──────────┘                    └──────────────┘

If staging breaks during the bake period,
production is unaffected and you fix forward.

This pattern catches not just compromised software, but also legitimate updates that happen to break your specific configuration.

Run the Scanner Locally, Not as a Third-Party Action#

The Trivy attack worked because teams delegated scanning to a third-party GitHub Action. That Action ran with access to repository secrets. When the Action was compromised, so were the secrets.

A safer pattern: download the scanner binary yourself, verify its checksum, and run it directly:

# Download specific version, verify integrity, then scan
- name: Install Grype
  run: |
    # install.sh fetched from a pinned tag, not the mutable main branch
    curl -sSfL https://raw.githubusercontent.com/anchore/grype/v0.87.0/install.sh \
      | sh -s -- -b /usr/local/bin v0.87.0
    echo "EXPECTED_SHA  /usr/local/bin/grype" | sha256sum -c -

- name: Scan image
  run: grype $IMAGE --fail-on high

This removes the GitHub Actions supply chain from the scanner’s trust path. The only trust relationship is with the binary you verified by checksum.

Audit and Compliance#

Standards and Timelines#

Two frameworks are most relevant to our infrastructure:

NIST SP 800-40 Rev 4 (Guide to Enterprise Patch Management):

  • Critical vulnerabilities: remediate within 14 days

  • High vulnerabilities: remediate within 30 days

  • Maintain an inventory of all software assets (SBOMs help here)

  • Automate patch deployment where possible

CIS Benchmarks (Center for Internet Security):

  • CIS RHEL 9 Benchmark §1.9: “Ensure updates, patches, and additional security software are installed”

  • CIS Docker Benchmark §1.1.2: “Ensure Docker is up to date”

  • CIS recommends automated patching with verification

Our update cadence (nightly staging → weekly production for security patches) satisfies both NIST and CIS timelines for critical and high vulnerabilities.

What an Auditor Looks For#

A third-party security audit of patch management typically examines:

  1. Policy: Is there a documented patch management policy with defined timelines?

  2. Automation: Are patches applied automatically or on a documented schedule?

  3. Testing: Is there a staging/test environment where patches are validated before production?

  4. Coverage: Are both OS and application layers covered?

  5. Monitoring: Is there alerting for failed patches or missing updates?

  6. Audit trail: Can you show when each patch was applied and by whom?

  7. Rollback: Is there a documented rollback procedure?

  8. Exceptions: Is there a process for documenting and tracking exceptions (patches that cannot be applied immediately)?

Jenkins job logs, Ansible output, and Grafana annotations provide the audit trail. Documented procedures (the workflow document) provide the policy evidence. The staging environment provides the testing evidence.
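
On RHEL hosts, dnf’s transaction history supplements these sources: it records what changed, when, and via which command.

# Recent package transactions: ID, command line, date, action
dnf history list

# Full detail for one transaction (packages, versions, user)
dnf history info 42    # transaction ID taken from the list above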

Glossary#

Content-Addressed Reference#
: A reference to a software artifact based on its cryptographic hash, making it immutable. Examples: Docker image digests, git commit SHAs. Contrast with mutable references like tags or branch names.

CVE (Common Vulnerabilities and Exposures)#
: A unique identifier for a publicly known security vulnerability. Format: CVE-YYYY-NNNNN. Published in the National Vulnerability Database (NVD).

CVSS (Common Vulnerability Scoring System)#
: A numerical score (0.0–10.0) rating the severity of a vulnerability. Critical: 9.0–10.0, High: 7.0–8.9, Medium: 4.0–6.9, Low: 0.1–3.9.

Digest (Image Digest)#
: A SHA-256 hash of a container image’s contents. Immutable and content-addressed: if a single byte changes, the digest changes. Format: sha256:abc123....

dnf-automatic#
: A RHEL/Fedora service that periodically checks for and optionally applies package updates. Configured via /etc/dnf/automatic.conf. Controlled by a systemd timer.

Image Scanning#
: The process of inspecting a container image’s installed packages against vulnerability databases to identify known CVEs.

Mutable Tag#
: A container image tag (like v1.0 or latest) that can be reassigned to point to a different image digest at any time. Tags are convenient but provide no integrity guarantee.

needs-restarting#
: A RHEL utility (from yum-utils) that checks whether a reboot is required after package updates. Exit code 1 means a reboot is needed.

RHSA (Red Hat Security Advisory)#
: A Red Hat advisory documenting a security fix and the affected packages. dnf upgrade --security uses these to classify which updates are security-relevant.

SBOM (Software Bill of Materials)#
: A machine-readable inventory of all components in a software artifact. Formats include CycloneDX and SPDX. Enables rapid impact assessment when new CVEs are published.

Supply Chain Attack#
: An attack that compromises software during its build, packaging, or distribution — before it reaches the end user. The end user trusts the artifact because it comes from a legitimate source.

unattended-upgrades#
: The Ubuntu equivalent of dnf-automatic. Applies security updates automatically. Configured via files in /etc/apt/apt.conf.d/.

Further Reading#