# Patch Management, Container Security & Supply Chains

```{contents} On this page
:depth: 2
:local: true
```

This document explains why and how we keep our infrastructure patched and our
container images secure. It covers the threat landscape, the mechanics of OS
and container updates, and the defenses against supply chain attacks — using
real-world incidents (including one that hit a sister observatory) as concrete
examples.

## Why This Matters — The ALMA Incident

On 29 October 2022, the Atacama Large Millimeter/submillimeter Array (ALMA)
observatory in Chile was hit by a ransomware attack. Attackers gained access
through a compromised VPN credential. The result:

- **48 days of lost observations.** No science operations from late October
  through mid-December 2022.
- **~\$250,000/day in operational cost** while the array sat idle.
- All email, public web presence, and internal IT services taken offline.
- A multinational team (ESO, NAOJ, NRAO) had to rebuild the network from
  scratch, including replacing the compromised VPN infrastructure.

The Hive ransomware group was suspected. The attack did not compromise ALMA's
antennas or archived science data, but the operational and reputational damage
was severe.

**ALMA and CCAT share key characteristics**: remote observatory infrastructure,
university/consortium governance, small IT teams relative to the complexity of
the systems, and scientific data pipelines that cannot tolerate extended
downtime.

The lesson is not "be afraid" — it is "patch promptly, limit blast radius, and
design for recovery." This document explains how.

## The Two Layers: OS Packages and Container Images

Our infrastructure has two distinct software layers that need independent
update strategies:

```text
┌──────────────────────────────────────────────────┐
│  Container Layer (Docker images)                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────────────┐  │
│  │ ops-db   │ │ ops-db-  │ │ data-transfer    │  │
│  │ -api     │ │ ui       │ │                  │  │
│  └────┬─────┘ └────┬─────┘ └───────┬──────────┘  │
│       │ python:3.12-slim    node:22-slim          │
│       │ (base image)        (base image)          │
│  ─────┴────────────────────────────┴──────────────│
│                                                    │
│  Host OS Layer (RHEL 9 / Ubuntu)                  │
│  kernel, glibc, openssl, systemd, docker-ce ...   │
└──────────────────────────────────────────────────┘
```

A vulnerability can exist at either layer:

- **OS layer**: A kernel privilege escalation, an OpenSSL buffer overflow, a
  systemd bug. Exploitable by anything running on the host, including
  containers (which share the host kernel).
- **Container layer**: A CVE in the Python interpreter, a vulnerable version
  of `libexpat` baked into the base image, a compromised pip/npm dependency.
  Only affects that specific container, but can lead to remote code execution
  within it.

Both layers must be patched. Patching only the OS leaves container images
running known-vulnerable libraries. Patching only images leaves the shared
kernel exposed.

## OS-Level Patching

### Concepts

**Security-only vs full updates**

Package managers can apply *all* available updates or only those classified as
security fixes:

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Mode
     - What it does
     - Trade-off
   * - **Security-only**
     - Applies patches tagged with security advisories (CVE fixes).
       ``dnf upgrade --security`` (RHEL), ``unattended-upgrades`` with
       security origins (Ubuntu).
     - Minimal risk of breaking changes. Leaves feature/bugfix updates
       for planned windows.
   * - **Full update**
     - Applies all available package updates including feature changes,
       new minor versions, and bug fixes.
     - More comprehensive but higher risk of unexpected behavior changes.
       A PostgreSQL minor version bump might change query behavior; a new
       systemd version might change service defaults.
```

**Recommendation**: Security-only for automated/frequent updates. Full updates
in planned maintenance windows (monthly or quarterly).

### RHEL 9: `dnf-automatic`

Red Hat Enterprise Linux provides `dnf-automatic`, a systemd timer that
periodically checks for and optionally applies updates:

```ini
# /etc/dnf/automatic.conf
# Comments sit on their own lines; a trailing comment may be read as part of the value.
[commands]
# only security-classified updates
upgrade_type = security
# actually install them (vs just download)
apply_updates = yes
# jitter in seconds, to avoid all hosts hitting mirrors at once
random_sleep = 3600

[emitters]
# send a notification via email and stdout after each run
emit_via = email,stdio
```

The timer runs daily once it is enabled
(`systemctl enable --now dnf-automatic.timer`). With `upgrade_type = security`,
it only applies patches that have an associated Red Hat Security Advisory
(RHSA).

:::{important}
`dnf --security` requires an active RHEL subscription. The security
classification metadata comes from Red Hat's advisory system. Without a
subscription, `dnf updateinfo list --security` returns nothing and
security-only mode silently does nothing.

Verify your subscription is working:

```bash
subscription-manager status
dnf updateinfo list --security
```
:::

### Ubuntu: `unattended-upgrades`

Ubuntu's equivalent is the `unattended-upgrades` package, configured via
`/etc/apt/apt.conf.d/50unattended-upgrades`:

```text
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
    // "${distro_id}:${distro_codename}-updates";   ← commented out = security only
};

Unattended-Upgrade::Automatic-Reboot "false";       // don't auto-reboot
Unattended-Upgrade::Mail "ops@example.com";
```

### Reboot Management

Some updates (especially kernel updates) require a reboot to take effect. The
system is running the old kernel until you restart.

**Detection**:

- RHEL: `needs-restarting -r` exits with code 1 if a reboot is needed
- Ubuntu: check for `/var/run/reboot-required`
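
The two detection methods can be combined into one check. The following is a minimal sketch, not part of our tooling: the function name `reboot_required` and its injectable `run` parameter are illustrative assumptions, chosen so the logic can be exercised without a real host.

```python
import subprocess
from pathlib import Path


def reboot_required(
    flag_file: Path = Path("/var/run/reboot-required"),
    run=subprocess.run,
) -> bool:
    """Return True if the host reports that a reboot is pending."""
    # Ubuntu: the flag file exists when a reboot is pending.
    if flag_file.exists():
        return True
    # RHEL: needs-restarting -r exits 1 when a reboot is needed, 0 otherwise.
    try:
        result = run(["needs-restarting", "-r"], capture_output=True, check=False)
    except FileNotFoundError:
        # Tool not installed (for example on Ubuntu): nothing more to check.
        return False
    return result.returncode == 1
```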

**Options**:

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 20 35 45

   * - Approach
     - How it works
     - When to use
   * - **Scheduled reboot**
     - Reboot at a fixed window (e.g., Sunday 03:00) if needed
     - Staging environments, services that tolerate brief downtime
   * - **Kernel livepatch**
     - Patches the running kernel in memory without rebooting.
       ``kpatch`` (RHEL) or Canonical Livepatch (Ubuntu).
     - 24/7 services where even brief downtime is unacceptable.
       Adds operational complexity.
   * - **Serial rolling reboot**
     - Ansible reboots one host at a time (``serial: 1``), verifying
       health before proceeding to the next
     - Production fleets where you can tolerate one host being down
       temporarily
```

For our fleet, **serial rolling reboots** are the right balance. Our services
can tolerate one host being down briefly, and the simplicity of "reboot and
verify" beats the operational burden of livepatch management.
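
The "reboot and verify" loop is simple enough to sketch in a few lines. In practice Ansible's `serial: 1` does this for us; the version below only illustrates the control flow. The `reboot` and `is_healthy` callables are hypothetical placeholders for whatever mechanism performs the reboot and the health check.

```python
from typing import Callable, Iterable


def rolling_reboot(
    hosts: Iterable[str],
    reboot: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Reboot hosts one at a time; stop at the first host that fails its check.

    Returns the hosts that were rebooted and verified healthy.
    """
    done: list[str] = []
    for host in hosts:
        reboot(host)
        if not is_healthy(host):
            # Halt the rollout so that at most one host is down at any time.
            raise RuntimeError(
                f"{host} unhealthy after reboot; verified so far: {done}"
            )
        done.append(host)
    return done
```

Stopping at the first failure is the point of the pattern: a bad kernel or driver update takes out one host, not the fleet.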

### Docker Engine — The Special Case

A Docker engine update restarts the daemon, which by default restarts **all
containers** on the host (unless the daemon's `live-restore` option is
enabled). This is qualitatively different from a library update.

:::{warning}
**Always exclude Docker from automatic updates.**

```ini
# /etc/dnf/dnf.conf
[main]
excludepkgs=docker-ce*,containerd*
```

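Docker engine updates should be deliberate, scheduled, and tested on
staging first. Quarterly cadence is appropriate unless a critical CVE
drops. On Ubuntu hosts, `apt-mark hold docker-ce docker-ce-cli containerd.io`
achieves the same exclusion.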
:::

Pin the Docker version in Ansible and bump it explicitly:

```yaml
# roles/docker/defaults/main.yml
docker_version: "27.5.1"
```

This ensures you know exactly what version is running and when it changes.

## Container Image Security

### The Base Image Problem

When you write a Dockerfile starting with `FROM python:3.12-slim`, you
inherit everything in that base image: the Debian libraries, the Python
interpreter, system utilities. If `libexpat` in that Debian layer has a
CVE, your image is vulnerable — even if your application code is perfect.

The challenge: base image maintainers publish updated images regularly, but
**the tag stays the same**. `python:3.12-slim` today has different contents
than `python:3.12-slim` three months ago. If you don't rebuild, you're
frozen on the old (vulnerable) version.

### Tag Pinning vs Digest Pinning

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Strategy
     - Example
     - Properties
   * - **Tag only**
     - ``FROM python:3.12-slim``
     - Mutable — the registry can change what this points to at any time.
       You get whatever was latest when you last built. Builds are not
       reproducible.
   * - **Tag + digest**
     - ``FROM python:3.12-slim@sha256:abc123...``
     - Immutable — the digest is a content hash. Even if the tag is
       updated, your build uses exactly this image. Builds are
       reproducible. You must explicitly bump the digest to get updates.
```

**Digest pinning is the secure choice.** It makes your builds reproducible
and prevents surprise changes. The trade-off is that you need a mechanism to
detect when the upstream image has been updated and bump the digest.
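
A first step toward enforcing digest pinning is finding the `FROM` lines that lack a digest. The sketch below is an illustrative linter, not an existing tool in our pipeline; the function name is hypothetical. It treats any reference without `@sha256:` as unpinned, so an `ARG`-based reference such as `FROM ${BASE}` will also be reported.

```python
import re

# Matches "FROM [--flag=value ...] <image>" and captures the image reference.
FROM_RE = re.compile(r"^\s*FROM\s+(?:--\S+\s+)*(\S+)", re.IGNORECASE)
ALIAS_RE = re.compile(r"\s+AS\s+(\S+)", re.IGNORECASE)


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return base image references that lack an @sha256: digest."""
    stages: set[str] = set()
    unpinned: list[str] = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image = match.group(1)
        is_stage_or_scratch = image.lower() == "scratch" or image in stages
        if not is_stage_or_scratch and "@sha256:" not in image:
            unpinned.append(image)
        # Record stage aliases so a later "FROM builder" is not reported.
        alias = ALIAS_RE.search(line)
        if alias:
            stages.add(alias.group(1))
    return unpinned
```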

### Image Scanning

Image scanners inspect a built container image and compare its packages
against vulnerability databases (NVD, Red Hat OVAL, Debian Security Tracker,
GitHub Advisory Database, etc.).

**How it works**:

```text
┌─────────────┐      ┌────────────────┐      ┌─────────────────┐
│ Built image  │ ───> │ Scanner reads   │ ───> │ Compare against  │
│ (OCI layers) │      │ package lists   │      │ vuln databases   │
└─────────────┘      │ (dpkg, rpm,     │      │ (NVD, GHSA, etc) │
                      │  pip, npm, etc) │      └────────┬────────┘
                      └────────────────┘               │
                                               ┌───────▼───────┐
                                               │ Report: CVE-X  │
                                               │ in libfoo 1.2  │
                                               │ severity: HIGH │
                                               └───────────────┘
```

**Common scanners**:

- **Trivy** (Aqua Security): widely used, broad vulnerability database
- **Grype** (Anchore): fast, with a CLI model similar to Trivy's, but built
  and released by a different organization
- **Docker Scout**: built into the Docker CLI (`docker scout cves`)
- **Clair** (Red Hat/Quay): runs as a service, mature

:::{important}
Scanning only helps if it **blocks deployment**. A scanner that produces
reports nobody reads is security theater. Configure your CI to fail the
build on CRITICAL and HIGH findings:

```bash
grype ghcr.io/ccatobs/ops-db-api:latest --fail-on high
```

Use `--only-fixed` (Grype's flag; Trivy's equivalent is `--ignore-unfixed`)
to suppress CVEs that have no available fix yet. You cannot act on these, and
they create alert fatigue.
:::
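
The gating rule itself (fail on HIGH and CRITICAL, skip findings with no fix) is small enough to write out. This is a sketch over a hypothetical normalized record format: the keys `id`, `severity`, and `fixed_in` are assumptions for illustration and are not any scanner's actual output schema.

```python
SEVERITY_RANK = {"negligible": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def blocking_findings(findings, fail_on="high", ignore_unfixed=True):
    """Return the findings that should fail the build."""
    threshold = SEVERITY_RANK[fail_on]
    blocking = []
    for finding in findings:
        # Unknown severity labels rank lowest in this sketch.
        rank = SEVERITY_RANK.get(finding["severity"].lower(), 0)
        if rank < threshold:
            continue
        if ignore_unfixed and not finding.get("fixed_in"):
            continue  # no fix available yet, so nothing to act on
        blocking.append(finding)
    return blocking
```

A CI step would exit non-zero whenever the returned list is non-empty, which is what makes the scan a gate and not just a report.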

### SBOM — Software Bill of Materials

An SBOM is a machine-readable inventory of every component in your image:
packages, libraries, versions. Think of it as a "nutrition label" for
software.

```bash
# Generate an SBOM in CycloneDX format
grype ghcr.io/ccatobs/ops-db-api:latest -o cyclonedx-json > sbom.json
```

**Why generate SBOMs?**

When a new CVE drops (e.g., "all versions of libexpat < 2.6.0 are
vulnerable"), you can query your SBOMs to answer "which of our images contain
libexpat, and which version?" in seconds — instead of rebuilding and scanning
everything.

Store SBOMs as CI artifacts alongside each image build. They cost almost
nothing to generate and are invaluable during incident response.
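
As an illustration of that query, the sketch below scans a directory of CycloneDX JSON files for a package older than a fixed version. It assumes one SBOM file per image and relies only on the `components[].name` and `components[].version` fields. The version parser is deliberately naive (leading dotted numbers only), so distro versions with epochs would need a proper comparator.

```python
import json
from pathlib import Path


def parse_version(version: str) -> tuple[int, ...]:
    """Parse the leading dotted numbers of a version, e.g. '2.5.0-1' -> (2, 5, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def affected_images(sbom_dir: Path, package: str, fixed_version: str) -> dict[str, str]:
    """Map SBOM file name to installed version where `package` predates the fix."""
    fixed = parse_version(fixed_version)
    hits: dict[str, str] = {}
    for sbom_path in sorted(sbom_dir.glob("*.json")):
        sbom = json.loads(sbom_path.read_text())
        for component in sbom.get("components", []):
            version = component.get("version", "")
            if component.get("name") == package and parse_version(version) < fixed:
                hits[sbom_path.name] = version
    return hits
```

The package name must match how it appears in the SBOM (on Debian-based images, for example, the expat library is packaged as `libexpat1`).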

## Supply Chain Attacks

A supply chain attack compromises software **before it reaches you** — in the
build system, the package registry, or the distribution channel. You download
and run malicious code believing it to be legitimate.

### Anatomy of a Supply Chain Attack

```text
Legitimate flow:

Developer → Source Code → CI Build → Registry → Your Server
                                                 (you trust this)

Supply chain attack (multiple vectors):

Developer → [Compromised credentials] → Malicious CI Build → Registry → Your Server
                                                                         (you trust this
                                                                          but it's poisoned)

Developer → Source Code → CI Build → [Compromised registry tag] → Your Server
                                                                   (tag now points to
                                                                    malicious image)

Developer → Source Code → [Compromised CI Action] → Secrets exfiltrated
                           (third-party action
                            replaced with malware)
```

### Case Study: The Trivy/CanisterWorm Attack (March 2026)

On 19 March 2026, Trivy — one of the most widely used open-source
vulnerability scanners — was itself compromised. The irony is sharp: the tool
teams relied on to *detect* vulnerabilities became the *vector* for attack.

**What happened**:

A threat actor group called "TeamPCP" exploited incompletely revoked
credentials from a prior security incident at Aqua Security (Trivy's
maintainer). They launched a multi-vector attack:

1. **GitHub Actions poisoning**: Force-pushed 76 of 77 version tags in the
   `aquasecurity/trivy-action` repository, redirecting trusted version
   references (like `@v0.35`) to malicious commits.
2. **Malicious binary release**: Triggered Trivy's release automation to
   publish an infected binary (v0.69.4) to GitHub Releases, Docker Hub,
   and container registries.
3. **CanisterWorm propagation**: The payload included a self-propagating npm
   worm that spread to 47+ packages. It used an Internet Computer blockchain
   canister as a C2 dead-drop — the first documented abuse of this technique.
4. **Credential harvesting**: The malware exfiltrated CI/CD secrets, cloud
   credentials, SSH keys, and Docker configurations — all while Trivy scans
   appeared to complete successfully.

**Impact**: CVE-2026-33634 (CVSS 9.4). Exposure window: 19–23 March 2026.
The attacker also defaced all 44 Aqua Security internal repositories.

**Root cause**: Credentials from a *prior* incident were not fully revoked.
The attacker retained residual access to release infrastructure.

**Key lessons**:

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Lesson
     - Implication
   * - **Mutable tags are dangerous**
     - ``@v0.35`` was force-pushed to point to malicious code. The tag
       looked the same; the content was completely different.
   * - **Pin by commit SHA**
     - ``@d43cc1dfea034b8e4e523b399d14fd25f7535bc5`` cannot be force-pushed.
       It is a content-addressed reference.
   * - **Don't be first-wave adopters**
     - The exposure window lasted about four days (19–23 March). A
       one-week delay before adopting new CI tool releases would have
       covered it entirely; a 72-hour delay alone would not have.
   * - **Multi-vendor scanning**
     - If you run two scanners from different organizations, an attacker must
       compromise both to hide a finding from you. Each scanner still runs in
       your pipeline, so pin and verify both.
   * - **The scanner is part of the supply chain**
     - Security tools are software too. They need the same scrutiny as any
       other dependency.
```

### Other Notable Supply Chain Incidents

- **SolarWinds (2020)**: Attackers compromised the build system of SolarWinds
  Orion, inserting a backdoor into signed updates distributed to ~18,000
  organizations including US government agencies. Detected after ~9 months.
- **Codecov (2021)**: Attackers modified the Codecov bash uploader script to
  exfiltrate environment variables (including CI secrets) from customer CI
  pipelines. The compromised script was served for ~2 months.
- **ua-parser-js (2021)**: A popular npm package (7M weekly downloads) was
  hijacked via compromised maintainer credentials. Malicious versions mined
  cryptocurrency and stole passwords.
- **xz-utils (2024)**: A long-running social engineering campaign where an
  attacker gained maintainer trust over 2+ years, then inserted a backdoor
  into the xz compression library — targeting SSH authentication on Linux
  systems. Caught by accident days before widespread distribution.

The pattern is consistent: compromise the build/release pipeline, poison the
artifact, rely on trust in the distribution channel.

## Defense Patterns

No single defense stops supply chain attacks. The strategy is **defense in
depth** — multiple independent layers so that any single failure is contained.

### Pin Everything by Content Hash

Mutable references (tags, version ranges, branch names) can be redirected.
Content-addressed references (SHA digests, commit SHAs) cannot.

```dockerfile
# BAD — mutable tag, can be replaced silently
FROM python:3.12-slim

# GOOD — digest-pinned, immutable
FROM python:3.12-slim@sha256:abc123def456...
```

```yaml
# BAD — tag can be force-pushed (exactly what happened with Trivy)
- uses: anchore/scan-action@v4

# GOOD — commit SHA, immutable
- uses: anchore/scan-action@d43cc1dfea034b8e4e523b399d14fd25f7535bc5
```

This applies to Docker base images, GitHub Actions, Ansible Galaxy roles, and
pip/npm dependencies (via lock files with hashes): anywhere a reference can be
replaced without changing its name.
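
The same rule can be checked mechanically for workflow files. The following sketch flags `uses:` references that are not pinned to a full 40-character commit SHA. It is a line-based illustration with a hypothetical function name, not a YAML parser, and it skips local (`./`) and `docker://` references.

```python
import re

USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `uses:` references that are not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container references are out of scope
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.match(version):
            unpinned.append(ref)
    return unpinned
```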

### Delay Adoption — Never Be First-Wave

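Many compromises of public packages, images, and actions are detected by the
community within hours to days, although some (SolarWinds, for example) go
unnoticed for months. A deliberate delay before adopting new releases keeps
you out of the first wave of exposure for the fast-detected kind: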

```{eval-rst}
.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Update type
     - Recommended delay
     - Rationale
   * - Base image digest bump
     - 72 hours
     - Security rebuild of existing version; low risk but worth the wait
   * - Minor/major version bump
     - 1 week + human review
     - Behavioral changes possible; needs testing
   * - CI tool updates
     - 1 week minimum
     - CI tools run with elevated privileges (secrets access); highest
       supply chain risk
   * - OS security patches
     - 7 days (staging → production)
     - Staging acts as canary for the full bake period
```
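
The delay table can be encoded as a small gate that an update bot or CI job consults before opening a pull request. This is a sketch under stated assumptions: the update-type keys are hypothetical names for the table rows, and `released_at` is whatever timezone-aware publication timestamp the registry or release page provides.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Minimum age before adoption, mirroring the table above.
MIN_AGE = {
    "base_image_digest": timedelta(hours=72),
    "version_bump": timedelta(weeks=1),
    "ci_tool": timedelta(weeks=1),
    "os_security_patch": timedelta(days=7),
}


def old_enough(
    update_type: str, released_at: datetime, now: Optional[datetime] = None
) -> bool:
    """Return True once a release has aged past the delay for its update type."""
    now = now or datetime.now(timezone.utc)
    return now - released_at >= MIN_AGE[update_type]
```

The age check is only the automatic part; version bumps still need the human review listed in the table.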

### Multi-Vendor Scanning

Running two scanners from different organizations means an attacker who wants
to suppress a finding must compromise both supply chains. Each scanner is
still code running in your pipeline, so pin and verify both:

```bash
# Scanner 1: Grype (Anchore)
grype "$IMAGE" --fail-on high

# Scanner 2: Docker Scout (Docker Inc.)
docker scout cves "$IMAGE" --only-severity critical,high --exit-code
```

Different vulnerability databases, different organizations, different build
pipelines. If they both say clean, confidence is high. If they disagree,
investigate.
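
To make "if they disagree, investigate" actionable, the two result sets can be compared directly. A minimal sketch, assuming each scanner's findings have already been reduced to a set of CVE identifiers (that extraction step is tool-specific and not shown):

```python
def compare_scans(scanner_a: set[str], scanner_b: set[str]) -> dict[str, set[str]]:
    """Split CVE IDs into those both scanners report and those only one reports."""
    return {
        "both": scanner_a & scanner_b,
        "only_a": scanner_a - scanner_b,
        "only_b": scanner_b - scanner_a,
    }
```

Entries under `only_a` or `only_b` are the ones worth a manual look: they usually reflect database differences, but a finding that silently disappears from one scanner is also what a tampered tool would look like.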

### Staging-First Deployment

Every change flows through staging before production. This applies to OS
patches, container images, and infrastructure changes:

```text
┌──────────┐     7-day bake     ┌──────────────┐
│ Staging  │ ─────────────────> │ Production    │
│ (auto)   │                    │ (manual gate) │
└──────────┘                    └──────────────┘

If staging breaks during the bake period,
production is unaffected and you fix forward.
```

This pattern catches not just compromised software, but also legitimate
updates that happen to break your specific configuration.

### Run the Scanner Locally, Not as a Third-Party Action

The Trivy attack worked because teams delegated scanning to a third-party
GitHub Action. That Action ran with access to repository secrets. When the
Action was compromised, so were the secrets.

A safer pattern: download the scanner binary yourself, verify its checksum,
and run it directly:

```yaml
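# Download a pinned release, verify its checksum, then install and scan
- name: Install Grype
  env:
    GRYPE_VERSION: "0.87.0"
  run: |
    # Assumes the standard release asset naming; EXPECTED_SHA is the tarball's
    # SHA-256, recorded in this repo when the version was reviewed
    tarball="grype_${GRYPE_VERSION}_linux_amd64.tar.gz"
    curl -sSfLO "https://github.com/anchore/grype/releases/download/v${GRYPE_VERSION}/${tarball}"
    echo "EXPECTED_SHA  ${tarball}" | sha256sum -c -
    tar -xzf "${tarball}" -C /usr/local/bin grype

- name: Scan image
  run: grype "$IMAGE" --fail-on high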
```

This removes the GitHub Actions supply chain from the scanner's trust path.
The only trust relationship is with the binary you verified by checksum.
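
The checksum step is the part that carries the trust, so it is worth seeing what it does. The sketch below is a Python equivalent of `sha256sum -c`, using only the standard library; the function names are illustrative.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the pinned checksum."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(
            f"checksum mismatch for {path.name}: expected {expected_sha256}, got {actual}"
        )
```

The expected value should live in your own repository, recorded when the release was reviewed. A checksum fetched at build time from the same location as the binary protects against corruption but not against a compromised release.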

## Audit and Compliance

### Standards and Timelines

Two frameworks are most relevant to our infrastructure:

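**NIST SP 800-40 Rev 4** (Guide to Enterprise Patch Management Planning):

- Frames patching as preventive maintenance and asks each organization to
  define its own risk-based response timelines; it does not itself prescribe
  fixed deadlines
- Working targets used in this document: critical vulnerabilities remediated
  within **14 days**, high vulnerabilities within **30 days**
- Maintain an inventory of all software assets (SBOMs help here)
- Automate patch deployment where possible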

**CIS Benchmarks** (Center for Internet Security):

- CIS RHEL 9 Benchmark §1.9: "Ensure updates, patches, and additional
  security software are installed"
- CIS Docker Benchmark §1.1.2: "Ensure Docker is up to date"
- CIS recommends automated patching with verification

Our update cadence (nightly staging → weekly production for security patches)
remediates critical and high vulnerabilities well inside the 14- and 30-day
windows above, and meets the CIS expectation of automated, verified patching.
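
For tracking, the 14- and 30-day targets listed above can be turned into concrete due dates. The sketch below maps a CVSS score to a severity band (using the ranges from the glossary) and then to a deadline. It assumes the clock starts on the advisory's publication date, which is a policy choice the text does not fix.

```python
from datetime import date, timedelta
from typing import Optional

# Remediation targets in days, per severity band.
REMEDIATION_DAYS = {"critical": 14, "high": 30}


def severity_from_cvss(score: float) -> str:
    """Map a CVSS base score to its qualitative severity band."""
    if score >= 9.0:
        return "critical"
    if score >= 7.0:
        return "high"
    if score >= 4.0:
        return "medium"
    if score > 0.0:
        return "low"
    return "none"


def remediation_deadline(score: float, published: date) -> Optional[date]:
    """Return the due date for a finding, or None if no target applies."""
    days = REMEDIATION_DAYS.get(severity_from_cvss(score))
    return published + timedelta(days=days) if days is not None else None
```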

### What an Auditor Looks For

A third-party security audit of patch management typically examines:

1. **Policy**: Is there a documented patch management policy with defined
   timelines?
2. **Automation**: Are patches applied automatically or on a documented
   schedule?
3. **Testing**: Is there a staging/test environment where patches are
   validated before production?
4. **Coverage**: Are both OS and application layers covered?
5. **Monitoring**: Is there alerting for failed patches or missing updates?
6. **Audit trail**: Can you show when each patch was applied and by whom?
7. **Rollback**: Is there a documented rollback procedure?
8. **Exceptions**: Is there a process for documenting and tracking exceptions
   (patches that cannot be applied immediately)?

Jenkins job logs, Ansible output, and Grafana annotations provide the audit
trail. Documented procedures (the workflow document) provide the policy
evidence. The staging environment provides the testing evidence.

## Glossary

:::{glossary}
:sorted: true

CVE (Common Vulnerabilities and Exposures)

: A unique identifier for a publicly known security vulnerability.
  Format: `CVE-YYYY-NNNN`, where the sequence number has four or more
  digits. Entries are also listed, with severity scores, in the National
  Vulnerability Database (NVD).

CVSS (Common Vulnerability Scoring System)

: A numerical score (0.0–10.0) rating the severity of a vulnerability.
  Critical: 9.0–10.0, High: 7.0–8.9, Medium: 4.0–6.9, Low: 0.1–3.9.

RHSA (Red Hat Security Advisory)

: A Red Hat advisory documenting a security fix and the affected packages.
  `dnf --security` uses these to classify which updates are
  security-relevant.

SBOM (Software Bill of Materials)

: A machine-readable inventory of all components in a software artifact.
  Formats include CycloneDX and SPDX. Enables rapid impact assessment
  when new CVEs are published.

Supply Chain Attack

: An attack that compromises software during its build, packaging, or
  distribution — before it reaches the end user. The end user trusts the
  artifact because it comes from a legitimate source.

Digest (Image Digest)

: A SHA-256 hash of a container image's manifest, which in turn lists the
  hashes of the image's config and layers. Immutable and
  content-addressed: if a single byte changes, the digest changes. Format:
  `sha256:abc123...`.

Mutable Tag

: A container image tag (like `v1.0` or `latest`) that can be
  reassigned to point to a different image digest at any time. Tags are
  convenient but provide no integrity guarantee.

`dnf-automatic`

: A RHEL/Fedora service that periodically checks for and optionally
  applies package updates. Configured via `/etc/dnf/automatic.conf`.
  Controlled by a systemd timer.

`unattended-upgrades`

: The Ubuntu equivalent of `dnf-automatic`. Applies security updates
  automatically. Configured via files in `/etc/apt/apt.conf.d/`.

`needs-restarting`

: A RHEL utility (from `yum-utils`) that checks whether a reboot is
  required after package updates. Exit code 1 means a reboot is needed.

Image Scanning

: The process of inspecting a container image's installed packages against
  vulnerability databases to identify known CVEs.

Content-Addressed Reference

: A reference to a software artifact based on its cryptographic hash,
  making it immutable. Examples: Docker image digests, git commit SHAs.
  Contrast with mutable references like tags or branch names.
:::

## Further Reading

- {doc}`tls-and-pki` — TLS certificates and PKI concepts in the CCAT Data
  Center
- {doc}`../secrets-management` — Operational guide for managing secrets
- [NIST SP 800-40 Rev 4](https://csrc.nist.gov/publications/detail/sp/800-40/rev-4/final) —
  Guide to Enterprise Patch Management Planning
- [CIS Benchmarks](https://www.cisecurity.org/cis-benchmarks) —
  Security configuration guides for RHEL, Docker, and Kubernetes
- [ALMA Cyberattack Recovery](https://almascience.eso.org/news/alma-update-on-the-recovery-from-october-29-cyberattack) —
  ALMA's official account of the 2022 ransomware incident
- [Trivy Supply Chain Incident (CVE-2026-33634)](https://github.com/aquasecurity/trivy/discussions/10425) —
  Aqua Security's incident discussion and timeline
- [Grype](https://github.com/anchore/grype) — Anchore's open-source
  vulnerability scanner
- [Docker Scout](https://docs.docker.com/scout/) — Docker's built-in
  image analysis tool