# HPC Backends
The Workflow Manager abstracts HPC job submission behind a pluggable backend interface.
Three backends are available, selected via the `HPC_BACKEND` configuration setting.
## Backend Interface
All backends implement the `HPCBackend` abstract base class defined in
`ccat_workflow_manager.hpc.base`:

```python
class HPCBackend(ABC):
    def submit(self, execution_command, image_ref, sif_path, input_dir,
               output_dir, workspace_dir, manifest_path,
               resource_requirements, environment_variables,
               job_name) -> str:
        """Submit a job, return the backend-specific job ID."""

    def get_status(self, job_id) -> HPCJobInfo:
        """Get current status and metrics for a job."""

    def cancel(self, job_id) -> None:
        """Cancel a running job."""

    def get_logs(self, job_id) -> str:
        """Retrieve logs for a job."""
```
The `HPCJobInfo` dataclass carries status and optional metrics:

```python
@dataclass
class HPCJobInfo:
    status: str  # "pending", "running", "completed", "failed"
    exit_code: int | None
    wall_time_seconds: float | None
    cpu_hours: float | None
    peak_memory_gb: float | None
```
## SLURM Backend
Use case: Production HPC batch jobs.
The SLURM backend writes an `sbatch` script that wraps the Apptainer execution
command and submits it to the SLURM scheduler. Job metrics are extracted from
`sacct` after completion.
Configuration:

- `HPC_BACKEND=slurm`
- `SLURM_PARTITION`: SLURM partition name
Resource mapping:
```
{"cpu": 4, "memory_gb": 16, "time_hours": 2}
→ --cpus-per-task=4 --mem=16G --time=02:00:00
```
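The mapping above can be sketched as a small translation function. The `slurm_resource_flags` name and the handling of fractional hours are assumptions for illustration, not the package's actual implementation:

```python
def slurm_resource_flags(req: dict) -> list[str]:
    """Translate a resource-requirements dict into sbatch command-line flags."""
    flags = []
    if "cpu" in req:
        flags.append(f"--cpus-per-task={req['cpu']}")
    if "memory_gb" in req:
        flags.append(f"--mem={req['memory_gb']}G")
    if "time_hours" in req:
        # sbatch accepts HH:MM:SS; round fractional hours to whole minutes
        total_minutes = round(req["time_hours"] * 60)
        flags.append(f"--time={total_minutes // 60:02d}:{total_minutes % 60:02d}:00")
    return flags
```

For example, `slurm_resource_flags({"cpu": 4, "memory_gb": 16, "time_hours": 2})` produces the three flags shown above.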
## Kubernetes Backend
Use case: Service-style deployments (staging environment in Cologne).
The Kubernetes backend creates a `Job` resource that runs the Apptainer command in a container. Wall time is extracted from pod status timestamps.
Configuration:

- `HPC_BACKEND=kubernetes`
- `K8S_NAMESPACE`: Kubernetes namespace for jobs
Resource mapping:
```
{"cpu": 4, "memory_gb": 16}
→ resources.requests and resources.limits on the pod spec
```
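A minimal sketch of this mapping, assuming requests and limits are set to the same values (the `k8s_resources` helper name is hypothetical):

```python
def k8s_resources(req: dict) -> dict:
    """Translate a resource-requirements dict into a pod-spec resources block."""
    spec = {}
    if "cpu" in req:
        spec["cpu"] = str(req["cpu"])          # Kubernetes CPU quantity
    if "memory_gb" in req:
        spec["memory"] = f"{req['memory_gb']}Gi"  # binary gigabytes (Gi)
    # Setting requests == limits gives the pod the Guaranteed QoS class.
    return {"requests": spec, "limits": dict(spec)}
```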
## Local Backend
Use case: Local development, standalone servers, or sites without SLURM/K8s.
The Local backend runs `apptainer exec` directly via `subprocess.Popen`. Jobs are
tracked in Redis with synthetic UUID job IDs. Logs are read from the filesystem.
Configuration:

- `HPC_BACKEND=local`
This backend is ideal for:

- Local development and testing
- Small-scale deployments on a single server
- Quick-look pipelines at the observatory site
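The Popen-plus-UUID scheme can be sketched as follows. This is a simplified stand-in, not the real backend: an in-memory dict replaces Redis, and an arbitrary command replaces the `apptainer exec` invocation:

```python
import subprocess
import uuid


class LocalBackendSketch:
    """Sketch of the Local backend's core idea: Popen plus synthetic UUID job IDs."""

    def __init__(self):
        # The real backend persists job state in Redis; a dict stands in here.
        self._jobs: dict[str, subprocess.Popen] = {}

    def submit(self, execution_command: list[str]) -> str:
        job_id = str(uuid.uuid4())  # synthetic ID; no scheduler assigns one
        self._jobs[job_id] = subprocess.Popen(
            execution_command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        return job_id

    def get_status(self, job_id: str) -> str:
        proc = self._jobs[job_id]
        code = proc.poll()  # None while the process is still running
        if code is None:
            return "running"
        return "completed" if code == 0 else "failed"
```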
## Backend Selection
The backend is selected via Dynaconf configuration:
```toml
# settings.toml
[staging]
HPC_BACKEND = "kubernetes"
K8S_NAMESPACE = "ccat-workflows-staging"

[production]
HPC_BACKEND = "slurm"
SLURM_PARTITION = "science"

[localdev]
HPC_BACKEND = "local"
```
The `CCAT_WORKFLOW_MANAGER_HPC_BACKEND` environment variable overrides the setting.
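Dispatch on the setting amounts to a simple registry lookup. The factory below is a hypothetical sketch; the backend class names are assumptions standing in for the real implementations:

```python
# Placeholder classes; the real implementations live under ccat_workflow_manager.hpc.
class SlurmBackend: pass
class KubernetesBackend: pass
class LocalBackend: pass


def make_backend(settings: dict):
    """Hypothetical factory: map the HPC_BACKEND setting to a backend instance."""
    registry = {
        "slurm": SlurmBackend,
        "kubernetes": KubernetesBackend,
        "local": LocalBackend,
    }
    name = settings["HPC_BACKEND"]
    try:
        return registry[name]()
    except KeyError:
        raise ValueError(f"unknown HPC_BACKEND: {name!r}") from None
```

A registry dict keeps adding a fourth backend a one-line change, and an unknown value fails loudly at startup rather than deep inside job submission.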