Kubernetes on Ramses#
The Ramses environment hosts the Kubernetes workloads that keep the Cologne staging data-transfer pipeline running. This page captures how the cluster is laid out, which configuration artifacts live in this repository, and the common operational procedures the infrastructure team follows.
Overview#
- Primary workload: a Celery worker that services the `cologne_ramses_processing_staging` queue of the data-transfer system.
- Namespace: `ccat` on the Ramses cluster.
- Helm chart: `system-integration/cologne-staging-k8s` (production/staging) with a matching `cologne-staging-k8s-local` variant for Kind-based testing.
- Key dependencies: Redis (TLS enabled) and PostgreSQL on `db.data.ccat.uni-koeln.de`, Infisical-managed secrets, and access to the Coscine API for long-term archive uploads.
Access & Prerequisites#
- Obtain a Ramses user account and add it to the `hpc-ccat` (and, if needed, `ag-riechers`) Unix groups. New accounts and group changes are handled by Roland Pabel (cluster administrator). After the groups propagate, verify membership from a Ramses login node:

  ```bash
  id
  getent group hpc-ccat | grep "$USER"
  ```
- Request namespace administration on `ccat` from Roland. Access is granted per user; once approved you can manage resources with `kubectl` and `helm` from the `ramses1` or `ramses4` login nodes.
- Ensure you can log in to the Ramses bastion; see access_to_rames for connectivity requirements once your account is active.
- Install Helm in your home directory if it is not already available on the login nodes. The standard binary install works on Ramses:

  ```bash
  mkdir -p ~/bin   # ensure the target directory exists before installing into it
  curl -fsSL https://get.helm.sh/helm-v3.14.4-linux-amd64.tar.gz -o helm.tar.gz
  tar -xzf helm.tar.gz linux-amd64/helm
  install -m 0755 linux-amd64/helm ~/bin/helm
  rm -rf helm.tar.gz linux-amd64
  export PATH="$HOME/bin:$PATH"
  ```
- `kubectl` and Helm must be installed on the host you use to manage the cluster. Confirm versions after logging in:

  ```bash
  kubectl version
  helm version
  ```

- The Ramses kubeconfig is already present for the operations account. Verify you are targeting the Ramses context (named `ramses` in the shared kubeconfig); a sketch of such a context entry follows this list:

  ```bash
  kubectl config get-contexts
  ```
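For orientation, a context entry in a kubeconfig typically looks like the sketch below; only the context name `ramses` is taken from this page, while the cluster and user entry names are placeholders rather than the actual values in the shared Ramses kubeconfig:

```yaml
# Illustrative kubeconfig excerpt; only the context name "ramses" comes from this page.
contexts:
  - name: ramses
    context:
      cluster: ramses-cluster      # placeholder cluster entry name
      user: ccat-operations        # placeholder user entry name
      namespace: ccat              # default namespace used by kubectl
current-context: ramses
```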
Cluster Layout#
- Namespace: all workloads run inside `ccat`.
- Namespace concept: namespaces are Kubernetes’ tenancy boundary. Only users added as admins to `ccat` can see or modify workloads in this namespace, while cluster-level resources stay protected.
- Persistent volumes: PVs for paths under `/projects/ccat` are created by the Ramses administrators. Coordinate PV provisioning (for example `/projects/ccat/raw_data`) before deploying PVCs.
- Node placement: the cluster exposes a 10-node worker pool with direct access to the parallel file system. Pods in the `ccat` namespace typically land on these nodes, so throughput is bounded by PV performance rather than ephemeral storage.
- Release name: production automation uses `ccat-raw` for the Helm release. The default example in the chart README uses `cologne-staging`. Align with existing naming when upgrading.
- Workload: the chart renders a single `Deployment` (`*-data-managers`) and an optional `PersistentVolumeClaim` (`*-data`). No DaemonSets or ingress resources are created.
- Pods and containers: the deployment manages pods, each hosting one `cologneStaging` container that runs the Celery worker image. The pod is the schedulable unit; the container is the runtime that executes the Python CLI.
- PVC vs PV: the chart defines a PVC (`ccat-raw-data`) that requests storage from a pre-created PV backing `/projects/ccat/raw_data`. Ramses administrators bind the PV; once bound, the PVC exposes the mount inside the pod at `/data` (see the sketch after this list).
Configuration Inputs#
All runtime settings live in `system-integration/cologne-staging-k8s/values.yaml`:
- `cologneStaging.image` controls the container image (repository, tag, pull policy).
- `cologneStaging.command` is the exact `ccat_data_transfer` invocation that starts the Celery worker. Update this when queue names change.
- `app.environment` maps to `ENV_FOR_DYNACONF` inside the container and should stay `production` so the worker reads production settings.
- The `redis` and `postgres` sections define broker/database endpoints and credentials. Values must match the Infisical records for the Ramses staging environment.
- `resources` and `replicas` gate resource requests/limits and scaling.
- `persistence` provisions a 10 Gi PVC (default storage class) mounted at `/data`. Adjust size if staging workloads outgrow the default.
- `healthChecks` controls readiness and liveness probes. They are disabled by default because Celery exits on unrecoverable errors; enable cautiously to avoid flapping deployments.

Override parameters temporarily by supplying an additional values file during Helm upgrades:
```bash
# overrides.yaml stands in for the additional values file with your temporary overrides
helm upgrade --install ccat-raw . -f values.yaml -f overrides.yaml -n ccat
```
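As an illustration, such an additional values file might pin a different image tag or replica count. The top-level keys follow the value names documented above; the concrete values (and the image sub-key name) are placeholders to check against the chart:

```yaml
# overrides.yaml (illustrative): layered on top of values.yaml during a helm upgrade
cologneStaging:
  image:
    tag: "v1.2.3"          # placeholder tag; assumes the image block uses a "tag" sub-key
replicas:
  dataManagers: 2          # placeholder; check broker/database headroom before scaling
```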
Helm Chart Structure#
The Ramses deployment assets live under `system-integration`:
- `cologne-staging-k8s/Chart.yaml` – chart metadata (name, version).
- `cologne-staging-k8s/values.yaml` – default values consumed by the templates.
- `cologne-staging-k8s/templates/deployment.yaml` – renders the `Deployment` manifest (a rendered-excerpt sketch follows this list):
  - Sets `hostAliases` for Redis/PostgreSQL resolution.
  - Defines the `cologneStaging` container image, command, and environment variables.
  - Mounts `redis-certs` (hostPath) and optional `data` (PVC) volumes.
  - Injects resources and probes based on the values file.
- `cologne-staging-k8s/templates/pvc.yaml` – renders the `PersistentVolumeClaim` when `persistence.enabled` is `true`, using size and access mode from `values.yaml`.
- `cologne-staging-k8s-local` – Kind-focused chart with the same structure but a different `deployment.yaml` tailored for host networking and local hostPath mounts.
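To make the volume wiring concrete, a rendered pod spec along the lines described on this page would contain roughly the excerpt below. It is a sketch assembled from the paths and names documented here (the `redis-certs` hostPath and the optional data PVC), not a verbatim copy of the template:

```yaml
# Illustrative excerpt of the rendered Deployment pod spec (labels, probes, and env omitted)
spec:
  containers:
    - name: cologne-staging            # illustrative container name; the values key is cologneStaging
      volumeMounts:
        - name: redis-certs
          mountPath: /etc/redis/certs
        - name: data                   # only present when persistence.enabled is true
          mountPath: /data
  volumes:
    - name: redis-certs
      hostPath:
        path: /projects/ccat/etc/certs
    - name: data
      persistentVolumeClaim:
        claimName: ccat-raw-data
```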
Secrets & Certificates#
- Coscine API token: the deployment expects a Kubernetes secret named `coscine-api-token-secret` with the key `CCAT_DATA_TRANSFER_COSCINE_API_TOKEN`. Create or rotate it with:

  ```bash
  kubectl create secret generic coscine-api-token-secret \
    --from-literal=CCAT_DATA_TRANSFER_COSCINE_API_TOKEN='<token>' \
    -n ccat --dry-run=client -o yaml > coscine-api-token-secret.yaml
  # review the generated manifest, then apply it to the cluster:
  kubectl apply -f coscine-api-token-secret.yaml -n ccat
  ```

  Store the canonical token in Infisical (Infisical Secrets Management) and rotate via the CLI to avoid stale values.
- Redis TLS material: pods mount `/etc/redis/certs` from the host path `/projects/ccat/etc/certs` (see the Helm template). Keep this directory up to date with the certificates issued for `db.data.ccat.uni-koeln.de`.
- Verify mounts: after deployment, confirm the certs are visible inside the container:

  ```bash
  kubectl exec deploy/ccat-raw-data-managers -n ccat -- ls /etc/redis/certs
  ```
- PostgreSQL credentials: shipped via environment variables in `values.yaml`. Keep them in sync with the credentials stored in Infisical and the actual database. If credentials need to be rotated without editing the main values file, create a Kubernetes secret and reference it via an overlay values file using `envFrom` (see the sketch after this list).
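A minimal sketch of that rotation pattern is shown below; it assumes the chart forwards an `envFrom` list to the worker container, and the secret name is a placeholder, so verify the exact value path in the chart before relying on it:

```yaml
# Illustrative overlay values file; assumes the chart exposes an envFrom list for the container.
cologneStaging:
  envFrom:
    - secretRef:
        name: postgres-credentials     # hypothetical secret holding the PostgreSQL variables
```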
Persistent Storage & Data Paths#
- The worker writes intermediate processing files to the PVC mounted at `/data`. On Ramses the cluster must have a default storage class able to satisfy 10 Gi `ReadWriteOnce` claims.
- The chart also defines a `hostAliases` block so that the pod can resolve `redis` and `postgres` to `134.95.40.103`. Confirm that this address remains correct after infrastructure changes (a sketch of the block follows the figure caption below).
*Figure: StorageClass, PV/PVC, and pod relationships for the Ramses deployment.*
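The `hostAliases` block itself follows the standard pod-spec shape; the sketch below uses the IP and hostnames as this page describes them, so check the deployment template for the exact names it renders:

```yaml
# Illustrative pod-spec excerpt; hostnames must match what the deployment template renders.
spec:
  hostAliases:
    - ip: "134.95.40.103"
      hostnames:
        - redis
        - postgres
```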
Routine Operations#
Deploy or upgrade#
Use Helm directly or the helper target in `system-integration/Makefile`:

```bash
cd /projects/ccat/data-center/system-integration
make restart_ramses_processing
```

This expands to:

```bash
helm upgrade --install ccat-raw cologne-staging-k8s \
  -f cologne-staging-k8s/values.yaml -n ccat
kubectl rollout restart deployment/ccat-raw-data-managers -n ccat
```
Health checks#
Verify rollout status and pod health:
```bash
kubectl get pods -n ccat
kubectl rollout status deployment/ccat-raw-data-managers -n ccat
```
Inspect logs (stream when investigating incidents):
```bash
kubectl logs deployment/ccat-raw-data-managers -n ccat -f
```
Confirm queue registration by running a one-off check inside the pod:
```bash
kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
  ccat_data_transfer list-queues ramses_processing
```
Scaling#
- Adjust replicas in `values.yaml` (`replicas.dataManagers`) and run `helm upgrade`. Horizontal scaling is limited by Redis/PostgreSQL and the available CPU/memory on Ramses.
- Update `resources.cologne-staging` when the worker hits memory limits or throttling; an illustrative values excerpt follows this list.
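The excerpt below illustrates what such a change in `values.yaml` could look like; the key names follow those cited above, while the replica count and request/limit figures are placeholders rather than tuned recommendations:

```yaml
# Illustrative values.yaml excerpt; numbers are placeholders, not recommendations.
replicas:
  dataManagers: 2
resources:
  cologne-staging:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
```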
Troubleshooting#
- CrashLoopBackOff: check that secrets are mounted correctly (especially `coscine-api-token-secret`) and that `ENV_FOR_DYNACONF` points to a valid configuration.
- Broker connectivity errors: validate TLS certificates under `/projects/ccat/etc/certs` and ensure firewall rules allow Redis (6379) and PostgreSQL (5432) traffic from the Ramses node to `db.data.ccat.uni-koeln.de`.
- PVC issues: run `kubectl describe pvc ccat-raw-data -n ccat` to confirm storage class binding. Coordinate with infrastructure to expand the volume if `Events` show provisioning errors.
- Helm divergence: `helm status ccat-raw -n ccat` and `helm get values ccat-raw -n ccat` highlight local overrides that need to be checked into `values.yaml`.
- ImagePullBackOff: confirm the cluster image pull secret has access to `ghcr.io/ccatobs`. Re-run `kubectl describe pod` to review registry errors and coordinate with Roland if credentials need to be refreshed (a sketch of a registry pull secret follows this list).
- Network reachability: when transfers fail, run quick probes from inside the pod:

  ```bash
  kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
    nc -v db.data.ccat.uni-koeln.de 6379
  kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
    nc -v db.data.ccat.uni-koeln.de 5432
  ```
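If the pull secret itself ever has to be recreated, a registry credential secret follows the standard `kubernetes.io/dockerconfigjson` shape; the name and credentials below are placeholders, and any change should be coordinated with Roland first:

```yaml
# Illustrative only: a ghcr.io pull secret; name and credentials are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: ghcr-pull-secret               # placeholder; reuse the name the cluster already references
  namespace: ccat
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded Docker config with ghcr.io/ccatobs credentials>
```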
Local Replication#
Use the Kind-based chart when validating changes off-cluster:
```bash
cd system-integration/cologne-staging-k8s-local
kind delete cluster --name local-dev
kind create cluster --config kind-config.yaml
kubectl create ns monitor
helm upgrade --install scheduler-dummy . -f values.yaml -n monitor
```
When finished testing, clean up the local environment:
```bash
helm uninstall scheduler-dummy -n monitor
kind delete cluster --name local-dev
```
The local chart runs with `hostNetwork` enabled and maps host directories to mimic Ramses paths; a minimal `kind-config.yaml` sketch follows. Update `values.yaml` to mirror any queue or credential changes before promoting them upstream.
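The `kind-config.yaml` referenced above is not reproduced on this page; as a hedged sketch, a configuration that maps a host directory into the Kind node (the mechanism local hostPath mounts rely on) could look like this, with placeholder paths:

```yaml
# Illustrative kind-config.yaml; adjust extraMounts to the Ramses-like paths you want to mimic.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /tmp/ccat/raw_data          # placeholder host directory
        containerPath: /projects/ccat/raw_data
```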
Change Management#
- Update chart templates or `values.yaml` in git, run `helm lint` and `helm template` for quick validation, then open a pull request.
- Document operational changes here and in Deployment Workflows when rollout steps change (new secrets, extra services, etc.).
- Tag releases after successful upgrades so the image tag and values tracked in git match what is running in the cluster.
- Record Helm history before and after large changes (`helm history ccat-raw -n ccat`) so rollback targets are obvious during an incident.