Kubernetes on Ramses#

The Ramses environment hosts the Kubernetes workloads that keep the Cologne staging data-transfer pipeline running. This page captures how the cluster is laid out, which configuration artifacts live in this repository, and the common operational procedures the infrastructure team follows.

Overview#

  • Primary workload: a Celery worker that services the cologne_ramses_processing_staging queue of the data-transfer system.

  • Namespace: ccat on the Ramses cluster.

  • Helm chart: system-integration/cologne-staging-k8s (production/staging) with a matching cologne-staging-k8s-local variant for Kind-based testing.

  • Key dependencies: Redis (TLS enabled) and PostgreSQL on db.data.ccat.uni-koeln.de, Infisical-managed secrets, and access to the Coscine API for long-term archive uploads.

Access & Prerequisites#

  • Obtain a Ramses user account and add it to the hpc-ccat (and, if needed, ag-riechers) Unix groups. New accounts and group changes are handled by Roland Pabel (cluster administrator).

  • After the groups propagate, verify membership from a Ramses login node:

    id
    getent group hpc-ccat | grep "$USER"
    
  • Request namespace administration on ccat from Roland. Access is granted per user; once approved you can manage resources with kubectl and helm from the ramses1 or ramses4 login nodes.
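
    Once the grant is active, a quick, non-destructive way to confirm it from a login node:

    kubectl auth can-i list pods -n ccat
    kubectl auth can-i create deployments -n ccat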

  • Ensure you can log in to the Ramses bastion; see access_to_ramses for connectivity requirements once your account is active.

  • Install Helm in your home directory if it is not already available on the login nodes. The standard binary install works on Ramses:

    curl -fsSL https://get.helm.sh/helm-v3.14.4-linux-amd64.tar.gz -o helm.tar.gz
    tar -xzf helm.tar.gz linux-amd64/helm
    mkdir -p ~/bin
    install -m 0755 linux-amd64/helm ~/bin/helm
    rm -rf helm.tar.gz linux-amd64
    export PATH="$HOME/bin:$PATH"
    
  • kubectl and Helm must be installed on the host you use to manage the cluster. Confirm versions after logging in:

    kubectl version
    helm version
    
  • The Ramses kubeconfig is already present for the operations account. Verify you are targeting the Ramses context (named ramses in the shared kubeconfig):

    kubectl config get-contexts
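    # If another context is currently active, switch explicitly
    # (context name as given above for the shared kubeconfig):
    kubectl config use-context ramses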
    

Cluster Layout#

  • Namespace: all workloads run inside ccat.

  • Namespace concept: namespaces are Kubernetes’ tenancy boundary. Only users added as admins to ccat can see or modify workloads in this namespace, while cluster-level resources stay protected.

  • Persistent volumes: PVs for paths under /projects/ccat are created by the Ramses administrators. Coordinate PV provisioning (for example /projects/ccat/raw_data) before deploying PVCs.

  • Node placement: the cluster exposes a 10-node worker pool with direct access to the parallel file system. Pods in the ccat namespace typically land on these nodes so throughput is bounded by PV performance rather than ephemeral storage.

  • Release name: production automation uses ccat-raw for the Helm release. The default example in the chart README uses cologne-staging. Align with existing naming when upgrading.

  • Workload: the chart renders a single Deployment (*-data-managers) and an optional PersistentVolumeClaim (*-data). No DaemonSets or ingress resources are created.

  • Pods and containers: the deployment manages pods, each hosting one cologneStaging container that runs the Celery worker image. The pod is the schedulable unit; the container is the runtime that executes the Python CLI.

  • PVC vs PV: the chart defines a PVC (ccat-raw-data) that requests storage from a pre-created PV backing /projects/ccat/raw_data. Ramses administrators bind the PV; once bound, the PVC exposes the mount inside the pod at /data.
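
    Once the administrators have bound the PV, the claim and its backing volume can be inspected from a login node (PVC name as used by the production release):

    kubectl get pvc ccat-raw-data -n ccat
    kubectl describe pvc ccat-raw-data -n ccat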

Configuration Inputs#

All runtime settings live in system-integration/cologne-staging-k8s/values.yaml:

  • cologneStaging.image controls the container image (repository, tag, pull policy).

  • cologneStaging.command is the exact ccat_data_transfer invocation that starts the Celery worker. Update this when queue names change.

  • app.environment maps to ENV_FOR_DYNACONF inside the container and should stay production so the worker reads production settings.

  • redis and postgres sections define broker/database endpoints and credentials. Values must match the Infisical records for the Ramses staging environment.

  • resources and replicas gate resource requests/limits and scaling.

  • persistence provisions a 10 Gi PVC (default storage class) mounted at /data. Adjust size if staging workloads outgrow the default.

  • healthChecks controls readiness and liveness probes. They are disabled by default because Celery exits on unrecoverable errors; enable cautiously to avoid flapping deployments.

  • Override parameters temporarily by supplying an additional values file (for example, an overrides.yaml containing only the keys you want to change) during Helm upgrades; later -f files take precedence:

    helm upgrade --install ccat-raw . -f values.yaml -f overrides.yaml -n ccat
    
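Taken together, values.yaml has roughly the shape sketched below. Key names follow the list above; the nesting details and the concrete values shown (image reference, tag, replica count, sizes) are placeholders, so treat the checked-in file as authoritative:

cologneStaging:
  image:
    repository: ghcr.io/ccatobs/...    # placeholder; real repository and tag are tracked in git
    tag: "<tag>"
    pullPolicy: IfNotPresent
  command: "ccat_data_transfer ..."    # exact Celery worker invocation for the staging queue
app:
  environment: production             # mapped to ENV_FOR_DYNACONF in the container
redis: {}                             # broker endpoint, credentials, TLS settings (match Infisical)
postgres: {}                          # database endpoint and credentials (match Infisical)
replicas:
  dataManagers: 1
resources: {}                         # requests/limits for the worker container
persistence:
  enabled: true
  size: 10Gi                          # mounted at /data
healthChecks:
  enabled: false                      # disabled by default; see the note above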

Helm Chart Structure#

The Ramses deployment assets live under system-integration:

  • cologne-staging-k8s/Chart.yaml – chart metadata (name, version).

  • cologne-staging-k8s/values.yaml – default values consumed by the templates.

  • cologne-staging-k8s/templates/deployment.yaml – renders the Deployment manifest:

    • Sets hostAliases for Redis/PostgreSQL resolution.

    • Defines the cologneStaging container image, command, and environment variables.

    • Mounts redis-certs (hostPath) and optional data (PVC) volumes.

    • Injects resources and probes based on the values file.

  • cologne-staging-k8s/templates/pvc.yaml – renders the PersistentVolumeClaim when persistence.enabled is true using size and access mode from values.yaml.

  • cologne-staging-k8s-local – Kind-focused chart with the same structure but a different deployment.yaml tailored for host networking and local hostPath mounts.
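
To see exactly what one of these templates renders with the current defaults, helm can print a single manifest; run this from the system-integration directory:

helm template ccat-raw cologne-staging-k8s \
  -f cologne-staging-k8s/values.yaml \
  --show-only templates/deployment.yaml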

Secrets & Certificates#

  • Coscine API token: the deployment expects a Kubernetes secret named coscine-api-token-secret with the key CCAT_DATA_TRANSFER_COSCINE_API_TOKEN. Create or rotate it with:

    kubectl create secret generic coscine-api-token-secret \
      --from-literal=CCAT_DATA_TRANSFER_COSCINE_API_TOKEN='<token>' \
      -n ccat --dry-run=client -o yaml > coscine-api-token-secret.yaml
    kubectl apply -f coscine-api-token-secret.yaml -n ccat
    rm coscine-api-token-secret.yaml   # the rendered file contains the token in plain text
    

    Store the canonical token in Infisical (Infisical Secrets Management) and rotate via the CLI to avoid stale values.

  • Redis TLS material: pods mount /etc/redis/certs from the host path /projects/ccat/etc/certs (see the Helm template). Keep this directory up to date with the certificates issued for db.data.ccat.uni-koeln.de.

  • Verify mounts: after deployment, confirm the certs are visible inside the container:

    kubectl exec deploy/ccat-raw-data-managers -n ccat -- ls /etc/redis/certs
    
  • PostgreSQL credentials: shipped via environment variables in values.yaml. Keep them in sync with the credentials stored in Infisical and the actual database. If credentials need to be rotated without editing the main values file, create a Kubernetes secret and reference it via an overlay values file using envFrom.
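
    A minimal sketch of the secret half of that approach; the secret name and key names shown are illustrative, and the envFrom wiring depends on what templates/deployment.yaml actually exposes:

    kubectl create secret generic postgres-credentials \
      --from-literal=POSTGRES_USER='<user>' \
      --from-literal=POSTGRES_PASSWORD='<password>' \
      -n ccat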

Persistent Storage & Data Paths#

  • The worker writes intermediate processing files to the PVC mounted at /data. On Ramses the cluster must have a default storage class able to satisfy 10 Gi ReadWriteOnce claims.

  • The chart also defines a hostAliases block so that the pod can resolve redis and postgres to 134.95.40.103. Confirm that this address remains correct after infrastructure changes.
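
    Because Kubernetes writes hostAliases entries into the pod's /etc/hosts, the mapping can be checked directly in a running pod (assuming the image ships basic shell utilities):

    kubectl exec deploy/ccat-raw-data-managers -n ccat -- cat /etc/hosts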

Figure: StorageClass, PV/PVC, and pod relationships for the Ramses deployment.

Routine Operations#

Deploy or upgrade#

Use Helm directly or the helper target in system-integration/Makefile:

cd /projects/ccat/data-center/system-integration
make restart_ramses_processing

This expands to:

helm upgrade --install ccat-raw cologne-staging-k8s \
  -f cologne-staging-k8s/values.yaml -n ccat
kubectl rollout restart deployment/ccat-raw-data-managers -n ccat

Health checks#

  • Verify rollout status and pod health:

    kubectl get pods -n ccat
    kubectl rollout status deployment/ccat-raw-data-managers -n ccat
    
  • Inspect logs (stream when investigating incidents):

    kubectl logs deployment/ccat-raw-data-managers -n ccat -f
    
  • Confirm queue registration by running a one-off check inside the pod:

    kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
      ccat_data_transfer list-queues ramses_processing
    

Scaling#

  • Adjust replicas in values.yaml (replicas.dataManagers) and run helm upgrade. Horizontal scaling is limited by Redis/PostgreSQL and available CPU/memory on Ramses.
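
    For a temporary scale change without editing the committed file, the same key can be overridden on the command line; fold the change back into values.yaml afterwards (the replica count here is only an example):

    helm upgrade --install ccat-raw cologne-staging-k8s \
      -f cologne-staging-k8s/values.yaml \
      --set replicas.dataManagers=2 -n ccat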

  • Update resources.cologne-staging when the worker hits memory limits or throttling.

Troubleshooting#

  • CrashLoopBackOff: check secrets mounted correctly (especially coscine-api-token-secret) and that ENV_FOR_DYNACONF points to a valid configuration.
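
    Useful first checks in that situation:

    kubectl get secret coscine-api-token-secret -n ccat
    kubectl logs deployment/ccat-raw-data-managers -n ccat --previous
    kubectl get events -n ccat --sort-by=.lastTimestamp | tail -n 20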

  • Broker connectivity errors: validate TLS certificates under /projects/ccat/etc/certs and ensure firewall rules allow Redis (6379) and PostgreSQL (5432) traffic from the Ramses node to db.data.ccat.uni-koeln.de.

  • PVC issues: run kubectl describe pvc ccat-raw-data -n ccat to confirm storage class binding. Coordinate with infrastructure to expand the volume if Events show provisioning errors.

  • Helm divergence: helm status ccat-raw -n ccat and helm get values ccat-raw -n ccat highlight local overrides that need to be checked into values.yaml.

  • ImagePullBackOff: confirm the cluster image pull secret has access to ghcr.io/ccatobs. Re-run kubectl describe pod to review registry errors and coordinate with Roland if credentials need to be refreshed.

  • Network reachability: when transfers fail, run quick probes from inside the pod:

    kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
      nc -v db.data.ccat.uni-koeln.de 6379

    kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
      nc -v db.data.ccat.uni-koeln.de 5432
    

Local Replication#

Use the Kind-based chart when validating changes off-cluster:

cd system-integration/cologne-staging-k8s-local
kind delete cluster --name local-dev
kind create cluster --config kind-config.yaml
kubectl create ns monitor
helm upgrade --install scheduler-dummy . -f values.yaml -n monitor

When finished testing, clean up the local environment:

helm uninstall scheduler-dummy -n monitor
kind delete cluster --name local-dev

The local chart runs hostNetwork and maps host directories to mimic Ramses paths. Update values.yaml to mirror any queue or credential changes before promoting them upstream.
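
For reference, a minimal kind-config.yaml for this kind of setup looks roughly like the sketch below; the file committed next to the local chart is authoritative, and the host path shown is only an illustration of how Ramses paths are mimicked:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: local-dev
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /tmp/ccat-raw-data           # illustrative stand-in for the Ramses project paths
        containerPath: /projects/ccat/raw_data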

Change Management#

  • Update chart templates or values.yaml in git, run helm lint and helm template for quick validation, then open a pull request.
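
    For example, from the system-integration directory:

    helm lint cologne-staging-k8s
    helm template ccat-raw cologne-staging-k8s -f cologne-staging-k8s/values.yaml | less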

  • Document operational changes here and in Deployment Workflows when rollout steps change (new secrets, extra services, etc.).

  • Tag releases after successful upgrades so the image tag and values tracked in git match what is running in the cluster.

  • Record Helm history before and after large changes (helm history ccat-raw -n ccat) so rollback targets are obvious during an incident.