Kubernetes on Ramses
====================

The Ramses environment hosts the Kubernetes workloads that keep the Cologne staging
data-transfer pipeline running. This page captures how the cluster is laid out, which
configuration artifacts live in this repository, and the common operational procedures
the infrastructure team follows.

Overview
--------

- **Primary workload:** a Celery worker that services the ``cologne_ramses_processing_staging``
  queue of the data-transfer system.
- **Namespace:** ``ccat`` on the Ramses cluster.
- **Helm chart:** ``system-integration/cologne-staging-k8s`` (production/staging) with a matching
  ``cologne-staging-k8s-local`` variant for Kind-based testing.
- **Key dependencies:** Redis (TLS enabled) and PostgreSQL on ``db.data.ccat.uni-koeln.de``,
  Infisical-managed secrets, and access to the Coscine API for long-term archive uploads.

Access & Prerequisites
----------------------

- Obtain a Ramses user account and add it to the ``hpc-ccat`` (and, if needed, ``ag-riechers``)
  Unix groups. New accounts and group changes are handled by Roland Pabel (cluster administrator).
- After the groups propagate, verify membership from a Ramses login node:

  .. code-block:: bash

     id
     getent group hpc-ccat | grep "$USER"

- Request namespace administration on ``ccat`` from Roland. Access is granted per user; once
  approved you can manage resources with ``kubectl`` and ``helm`` from the ``ramses1`` or
  ``ramses4`` login nodes.
- Ensure you can log in to the Ramses bastion; see :doc:`access_to_rames` for connectivity
  requirements once your account is active.
- Install Helm in your home directory if it is not already available on the login nodes. The
  standard binary install works on Ramses:

  .. code-block:: bash

     curl -fsSL https://get.helm.sh/helm-v3.14.4-linux-amd64.tar.gz -o helm.tar.gz
     tar -xzf helm.tar.gz linux-amd64/helm
     mkdir -p ~/bin
     install -m 0755 linux-amd64/helm ~/bin/helm
     rm -rf helm.tar.gz linux-amd64
     export PATH="$HOME/bin:$PATH"

- ``kubectl`` and Helm must be installed on the host you use to manage the cluster. Confirm
  versions after logging in:

  .. code-block:: bash

     kubectl version
     helm version

- The Ramses kubeconfig is already present for the operations account. Verify you are targeting
  the Ramses context (named ``ramses`` in the shared kubeconfig):

  .. code-block:: bash

     kubectl config get-contexts

Cluster Layout
--------------

- **Namespace:** all workloads run inside ``ccat``.
- **Namespace concept:** namespaces are Kubernetes' tenancy boundary. Only users added as admins
  to ``ccat`` can see or modify workloads in this namespace, while cluster-level resources stay
  protected.
- **Persistent volumes:** PVs for paths under ``/projects/ccat`` are created by the Ramses
  administrators. Coordinate PV provisioning (for example ``/projects/ccat/raw_data``) before
  deploying PVCs.
- **Node placement:** the cluster exposes a 10-node worker pool with direct access to the parallel
  file system. Pods in the ``ccat`` namespace typically land on these nodes, so throughput is
  bounded by PV performance rather than ephemeral storage.
- **Release name:** production automation uses ``ccat-raw`` for the Helm release. The default
  example in the chart README uses ``cologne-staging``. Align with existing naming when upgrading.
- **Workload:** the chart renders a single ``Deployment`` (``*-data-managers``) and an optional
  ``PersistentVolumeClaim`` (``*-data``). No DaemonSets or ingress resources are created.
- **Pods and containers:** the deployment manages pods, each hosting one ``cologneStaging``
  container that runs the Celery worker image. The pod is the schedulable unit; the container is
  the runtime that executes the Python CLI.
- **PVC vs PV:** the chart defines a PVC (``ccat-raw-data``) that requests storage from a
  pre-created PV backing ``/projects/ccat/raw_data``. Ramses administrators bind the PV; once
  bound, the PVC exposes the mount inside the pod at ``/data``. A quick spot check of these
  resources follows this list.
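
To confirm that the live cluster matches this layout, list the rendered resources from a login
node. This is only a spot check, and the names below assume the ``ccat-raw`` release described
above:

.. code-block:: bash

   # Expect the ccat-raw-data-managers Deployment and the ccat-raw-data PVC.
   kubectl get deployment,pvc -n ccat

   # A healthy PVC reports STATUS "Bound" and shows the admin-provisioned PV
   # in the VOLUME column.
   kubectl get pvc ccat-raw-data -n ccat
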

Configuration Inputs
--------------------

All runtime settings live in ``system-integration/cologne-staging-k8s/values.yaml``:

- ``cologneStaging.image`` controls the container image (repository, tag, pull policy).
- ``cologneStaging.command`` is the exact ``ccat_data_transfer`` invocation that starts the Celery
  worker. Update this when queue names change.
- ``app.environment`` maps to ``ENV_FOR_DYNACONF`` inside the container and should stay
  ``production`` so the worker reads production settings.
- ``redis`` and ``postgres`` sections define broker/database endpoints and credentials. Values
  must match the Infisical records for the Ramses staging environment.
- ``resources`` and ``replicas`` gate resource requests/limits and scaling.
- ``persistence`` provisions a 10 Gi PVC (default storage class) mounted at ``/data``. Adjust the
  size if staging workloads outgrow the default.
- ``healthChecks`` controls readiness and liveness probes. They are disabled by default because
  Celery exits on unrecoverable errors; enable them cautiously to avoid flapping deployments.
- Override parameters temporarily by supplying an additional values file during Helm upgrades:

  .. code-block:: bash

     helm upgrade --install ccat-raw . -f values.yaml -n ccat

Helm Chart Structure
--------------------

The Ramses deployment assets live under ``system-integration``:

- ``cologne-staging-k8s/Chart.yaml`` – chart metadata (name, version).
- ``cologne-staging-k8s/values.yaml`` – default values consumed by the templates.
- ``cologne-staging-k8s/templates/deployment.yaml`` – renders the ``Deployment`` manifest:

  * Sets ``hostAliases`` for Redis/PostgreSQL resolution.
  * Defines the ``cologneStaging`` container image, command, and environment variables.
  * Mounts ``redis-certs`` (hostPath) and optional ``data`` (PVC) volumes.
  * Injects resources and probes based on the values file.

- ``cologne-staging-k8s/templates/pvc.yaml`` – renders the ``PersistentVolumeClaim`` when
  ``persistence.enabled`` is ``true``, using the size and access mode from ``values.yaml``.
- ``cologne-staging-k8s-local`` – Kind-focused chart with the same structure but a different
  ``deployment.yaml`` tailored for host networking and local hostPath mounts.

Secrets & Certificates
----------------------

- **Coscine API token:** the deployment expects a Kubernetes secret named
  ``coscine-api-token-secret`` with the key ``CCAT_DATA_TRANSFER_COSCINE_API_TOKEN``. Create or
  rotate it with:

  .. code-block:: bash

     kubectl create secret generic coscine-api-token-secret \
       --from-literal=CCAT_DATA_TRANSFER_COSCINE_API_TOKEN='' \
       -n ccat --dry-run=client -o yaml > coscine-api-token-secret.yaml

  Store the canonical token in Infisical (:doc:`infisical_secrets_management`) and rotate via the
  CLI to avoid stale values. A short verification sketch follows this list.
- **Redis TLS material:** pods mount ``/etc/redis/certs`` from the host path
  ``/projects/ccat/etc/certs`` (see the Helm template). Keep this directory up to date with the
  certificates issued for ``db.data.ccat.uni-koeln.de``.
- **Verify mounts:** after deployment, confirm the certs are visible inside the container:

  .. code-block:: bash

     kubectl exec deploy/ccat-raw-data-managers -n ccat -- ls /etc/redis/certs

- **PostgreSQL credentials:** shipped via environment variables in ``values.yaml``. Keep them in
  sync with the credentials stored in Infisical and the actual database. If credentials need to be
  rotated without editing the main values file, create a Kubernetes secret and reference it via an
  overlay values file using ``envFrom``.
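
After creating or rotating the token, a short sanity check avoids surprises at deploy time. The
second command assumes the secret key is exposed to the container as an environment variable of
the same name and that the image ships a POSIX shell; it counts characters instead of printing the
token so the value never lands in a terminal or log:

.. code-block:: bash

   # List the secret's keys and payload sizes without revealing the values.
   kubectl describe secret coscine-api-token-secret -n ccat

   # Assumption: the key is injected as an environment variable of the same name.
   kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
     sh -c 'printenv CCAT_DATA_TRANSFER_COSCINE_API_TOKEN | wc -c'
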

Persistent Storage & Data Paths
-------------------------------

- The worker writes intermediate processing files to the PVC mounted at ``/data``. On Ramses the
  cluster must have a default storage class able to satisfy 10 Gi ``ReadWriteOnce`` claims.
- The chart also defines a ``hostAliases`` block so that the pod can resolve ``redis`` and
  ``postgres`` to ``134.95.40.103``. Confirm that this address remains correct after
  infrastructure changes.

.. plantuml:: ../../_static/diagrams/ramses_pvc.puml
   :caption: StorageClass, PV/PVC, and pod relationships for the Ramses deployment.

Routine Operations
------------------

Deploy or upgrade
^^^^^^^^^^^^^^^^^

Use Helm directly or the helper target in ``system-integration/Makefile``:

.. code-block:: bash

   cd /projects/ccat/data-center/system-integration
   make restart_ramses_processing

This expands to:

.. code-block:: bash

   helm upgrade --install ccat-raw cologne-staging-k8s \
     -f cologne-staging-k8s/values.yaml -n ccat
   kubectl rollout restart deployment/ccat-raw-data-managers -n ccat

Health checks
^^^^^^^^^^^^^

- Verify rollout status and pod health:

  .. code-block:: bash

     kubectl get pods -n ccat
     kubectl rollout status deployment/ccat-raw-data-managers -n ccat

- Inspect logs (stream when investigating incidents):

  .. code-block:: bash

     kubectl logs deployment/ccat-raw-data-managers -n ccat -f

- Confirm queue registration by running a one-off check inside the pod:

  .. code-block:: bash

     kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
       ccat_data_transfer list-queues ramses_processing

Scaling
^^^^^^^

- Adjust replicas in ``values.yaml`` (``replicas.dataManagers``) and run ``helm upgrade``.
  Horizontal scaling is limited by Redis/PostgreSQL and the available CPU/memory on Ramses.
- Update ``resources.cologne-staging`` when the worker hits memory limits or throttling.

Troubleshooting
---------------

- **CrashLoopBackOff:** check that secrets are mounted correctly (especially
  ``coscine-api-token-secret``) and that ``ENV_FOR_DYNACONF`` points to a valid configuration.
- **Broker connectivity errors:** validate the TLS certificates under ``/projects/ccat/etc/certs``
  and ensure firewall rules allow Redis (6379) and PostgreSQL (5432) traffic from the Ramses node
  to ``db.data.ccat.uni-koeln.de``.
- **PVC issues:** run ``kubectl describe pvc ccat-raw-data -n ccat`` to confirm storage class
  binding. Coordinate with infrastructure to expand the volume if ``Events`` show provisioning
  errors.
- **Helm divergence:** ``helm status ccat-raw -n ccat`` and ``helm get values ccat-raw -n ccat``
  highlight local overrides that need to be checked into ``values.yaml``.
- **ImagePullBackOff:** confirm the cluster image pull secret has access to ``ghcr.io/ccatobs``.
  Re-run ``kubectl describe pod`` to review registry errors and coordinate with Roland if
  credentials need to be refreshed.
- **Network reachability:** when transfers fail, run quick probes from inside the pod:

  .. code-block:: bash

     kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
       nc -v db.data.ccat.uni-koeln.de 6379
     kubectl exec deploy/ccat-raw-data-managers -n ccat -- \
       nc -v db.data.ccat.uni-koeln.de 5432
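
When none of the entries above pinpoints the failure, the namespace events and the pod description
usually do. Both commands are standard ``kubectl`` and need no chart-specific knowledge:

.. code-block:: bash

   # Recent events in the namespace, newest last (image pull, scheduling, and probe errors).
   kubectl get events -n ccat --sort-by=.lastTimestamp

   # Pod description: image pull status, mounted volumes and secrets, and the last
   # termination reason all appear here.
   kubectl describe pod -n ccat
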

Local Replication
-----------------

Use the Kind-based chart when validating changes off-cluster:

.. code-block:: bash

   cd system-integration/cologne-staging-k8s-local
   kind delete cluster --name local-dev
   kind create cluster --config kind-config.yaml
   kubectl create ns monitor
   helm upgrade --install scheduler-dummy . -f values.yaml -n monitor

When finished testing, clean up the local environment:

.. code-block:: bash

   helm uninstall scheduler-dummy -n monitor
   kind delete cluster --name local-dev

The local chart runs ``hostNetwork`` and maps host directories to mimic Ramses paths. Update
``values.yaml`` to mirror any queue or credential changes before promoting them upstream.

Change Management
-----------------

- Update chart templates or ``values.yaml`` in git, run ``helm lint`` and ``helm template`` for
  quick validation, then open a pull request.
- Document operational changes here and in :doc:`deployment` when rollout steps change (new
  secrets, extra services, etc.).
- Tag releases after successful upgrades so the image tag and values tracked in git match what is
  running in the cluster.
- Record Helm history before and after large changes (``helm history ccat-raw -n ccat``) so
  rollback targets are obvious during an incident.
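
A typical pre-merge validation pass uses only the commands already mentioned above (``helm lint``,
``helm template``, ``helm history``); the rendered-output path below is just an example:

.. code-block:: bash

   cd system-integration/cologne-staging-k8s

   # Static checks: chart metadata plus the fully rendered manifests.
   helm lint .
   helm template ccat-raw . -f values.yaml -n ccat > /tmp/rendered.yaml

   # Record the release history before (and again after) the upgrade so the
   # rollback target is obvious during an incident.
   helm history ccat-raw -n ccat
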