# HPC Backends

```{eval-rst}
.. verified:: 2026-03-07
   :reviewer: Christof Buchbender
```

The Workflow Manager abstracts HPC job submission behind a pluggable backend interface. Three backends are available, selected via the `HPC_BACKEND` configuration setting.

## Backend Interface

All backends implement the `HPCBackend` abstract base class defined in `ccat_workflow_manager.hpc.base`:

```python
class HPCBackend(ABC):
    def submit(
        self,
        execution_command,
        image_ref,
        sif_path,
        input_dir,
        output_dir,
        workspace_dir,
        manifest_path,
        resource_requirements,
        environment_variables,
        job_name,
    ) -> str:
        """Submit a job, return the backend-specific job ID."""

    def get_status(self, job_id) -> HPCJobInfo:
        """Get current status and metrics for a job."""

    def cancel(self, job_id) -> None:
        """Cancel a running job."""

    def get_logs(self, job_id) -> str:
        """Retrieve logs for a job."""
```

The `HPCJobInfo` dataclass carries status and optional metrics:

```python
@dataclass
class HPCJobInfo:
    status: str  # "pending", "running", "completed", "failed"
    exit_code: int | None
    wall_time_seconds: float | None
    cpu_hours: float | None
    peak_memory_gb: float | None
```

## SLURM Backend

**Use case:** Production HPC batch jobs.

The SLURM backend writes an `sbatch` script that wraps the Apptainer execution command and submits it to the SLURM scheduler. Job metrics are extracted from `sacct` after completion.

**Configuration:**

- `HPC_BACKEND=slurm`
- `SLURM_PARTITION` --- SLURM partition name

**Resource mapping:**

```text
{"cpu": 4, "memory_gb": 16, "time_hours": 2}
→ --cpus-per-task=4 --mem=16G --time=02:00:00
```

## Kubernetes Backend

**Use case:** Service-style deployments (staging environment in Cologne).

The Kubernetes backend creates a Job resource that runs the Apptainer command in a container. Wall time is extracted from pod status timestamps.
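As an illustrative sketch of that wall-time extraction, the helper below derives a duration from two pod status timestamps. The function name and the RFC 3339 string format are assumptions for illustration, not part of the actual backend API:

```python
from datetime import datetime


def wall_time_seconds(start_iso: str, end_iso: str) -> float:
    """Compute wall time from two ISO-8601 / RFC 3339 timestamps.

    Hypothetical helper: Kubernetes reports pod phase transitions
    (e.g. ``status.startTime`` and a terminated container's
    ``finishedAt``) as RFC 3339 strings with a trailing ``Z``.
    """
    # fromisoformat() in older Python versions does not accept "Z",
    # so normalize it to an explicit UTC offset first.
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return (end - start).total_seconds()
```

A 30-minute pod run, for example, yields `wall_time_seconds("2026-03-07T10:00:00Z", "2026-03-07T10:30:00Z") == 1800.0`.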
**Configuration:**

- `HPC_BACKEND=kubernetes`
- `K8S_NAMESPACE` --- Kubernetes namespace for jobs

**Resource mapping:**

```text
{"cpu": 4, "memory_gb": 16}
→ resources.requests and resources.limits on the pod spec
```

## Local Backend

**Use case:** Local development, standalone servers, or sites without SLURM/K8s.

The Local backend runs `apptainer exec` directly via `subprocess.Popen`. Jobs are tracked in Redis with synthetic UUID job IDs. Logs are read from the filesystem.

**Configuration:**

- `HPC_BACKEND=local`

This backend is ideal for:

- Local development and testing
- Small-scale deployments on a single server
- Quick-look pipelines at the observatory site

## Backend Selection

The backend is selected via Dynaconf configuration:

```toml
# settings.toml
[staging]
HPC_BACKEND = "kubernetes"
K8S_NAMESPACE = "ccat-workflows-staging"

[production]
HPC_BACKEND = "slurm"
SLURM_PARTITION = "science"

[localdev]
HPC_BACKEND = "local"
```

The `CCAT_WORKFLOW_MANAGER_HPC_BACKEND` environment variable overrides the setting.

## Related Documentation

- {doc}`manager_worker` - How managers dispatch work
- {doc}`/source/concepts/execution_flow` - Status transitions driven by backend polling
- {doc}`/source/operations/configuration` - Full configuration reference
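The backend-selection logic described above can be sketched as a small factory. The class names, the plain-dict settings argument, and the `make_backend` function are assumptions for illustration, not the actual `ccat_workflow_manager` API:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the real backend classes.
@dataclass
class SlurmBackend:
    partition: str


@dataclass
class KubernetesBackend:
    namespace: str


@dataclass
class LocalBackend:
    pass


def make_backend(settings: dict):
    """Pick an HPC backend from configuration (sketch only).

    Mirrors the HPC_BACKEND / SLURM_PARTITION / K8S_NAMESPACE settings
    shown above; in the real system these come from Dynaconf.
    """
    name = settings.get("HPC_BACKEND", "local")
    if name == "slurm":
        return SlurmBackend(partition=settings["SLURM_PARTITION"])
    if name == "kubernetes":
        return KubernetesBackend(namespace=settings["K8S_NAMESPACE"])
    if name == "local":
        return LocalBackend()
    raise ValueError(f"Unknown HPC_BACKEND: {name!r}")
```

Because every backend implements the same `HPCBackend` interface, callers only ever depend on `submit`, `get_status`, `cancel`, and `get_logs`, regardless of which class the factory returns.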