HPC Backends#

Documentation verified. Last checked: 2026-03-07. Reviewer: Christof Buchbender

The Workflow Manager abstracts HPC job submission behind a pluggable backend interface. Three backends are available, selected via the HPC_BACKEND configuration setting.

Backend Interface#

All backends implement the HPCBackend abstract base class defined in ccat_workflow_manager.hpc.base:

from abc import ABC, abstractmethod

class HPCBackend(ABC):
    @abstractmethod
    def submit(self, execution_command, image_ref, sif_path, input_dir,
               output_dir, workspace_dir, manifest_path,
               resource_requirements, environment_variables,
               job_name) -> str:
        """Submit a job, return the backend-specific job ID."""

    @abstractmethod
    def get_status(self, job_id) -> HPCJobInfo:
        """Get current status and metrics for a job."""

    @abstractmethod
    def cancel(self, job_id) -> None:
        """Cancel a running job."""

    @abstractmethod
    def get_logs(self, job_id) -> str:
        """Retrieve logs for a job."""

The HPCJobInfo dataclass carries status and optional metrics:

from dataclasses import dataclass

@dataclass
class HPCJobInfo:
    status: str            # "pending", "running", "completed", "failed"
    exit_code: int | None
    wall_time_seconds: float | None
    cpu_hours: float | None
    peak_memory_gb: float | None

SLURM Backend#

Use case: Production HPC batch jobs.

The SLURM backend writes an sbatch script that wraps the Apptainer execution command and submits it to the SLURM scheduler. Job metrics are extracted from sacct after completion.

Configuration:

  • HPC_BACKEND=slurm

  • SLURM_PARTITION — SLURM partition name

Resource mapping:

{"cpu": 4, "memory_gb": 16, "time_hours": 2}
→ --cpus-per-task=4 --mem=16G --time=02:00:00
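That mapping can be sketched as a small translation function. The function name `slurm_resource_flags` and the handling of fractional hours are illustrative assumptions; the backend's actual implementation may differ.

```python
def slurm_resource_flags(req: dict) -> list[str]:
    """Translate a resource-requirements dict into sbatch flags.

    Illustrative sketch; fractional time_hours values are rounded
    to the nearest minute (an assumption, not documented behavior).
    """
    flags = []
    if "cpu" in req:
        flags.append(f"--cpus-per-task={req['cpu']}")
    if "memory_gb" in req:
        flags.append(f"--mem={req['memory_gb']}G")
    if "time_hours" in req:
        total_minutes = round(req["time_hours"] * 60)
        hours, minutes = divmod(total_minutes, 60)
        flags.append(f"--time={hours:02d}:{minutes:02d}:00")
    return flags

print(slurm_resource_flags({"cpu": 4, "memory_gb": 16, "time_hours": 2}))
# ['--cpus-per-task=4', '--mem=16G', '--time=02:00:00']
```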

Kubernetes Backend#

Use case: Service-style deployments (staging environment in Cologne).

The Kubernetes backend creates a Job resource that runs the Apptainer command in a container. Wall time is extracted from pod status timestamps.

Configuration:

  • HPC_BACKEND=kubernetes

  • K8S_NAMESPACE — Kubernetes namespace for jobs

Resource mapping:

{"cpu": 4, "memory_gb": 16}
→ resources.requests and resources.limits on the pod spec
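A sketch of that translation is below. Setting requests and limits to identical values (which yields Guaranteed QoS in Kubernetes) is an assumption of this sketch, not documented behavior of the backend.

```python
def k8s_resources(req: dict) -> dict:
    """Build the pod-spec resources block from a requirements dict.

    Illustrative sketch; requests and limits are set identically
    here (an assumption), giving the pod Guaranteed QoS.
    """
    spec = {}
    if "cpu" in req:
        spec["cpu"] = str(req["cpu"])
    if "memory_gb" in req:
        spec["memory"] = f"{req['memory_gb']}Gi"
    return {"requests": dict(spec), "limits": dict(spec)}

print(k8s_resources({"cpu": 4, "memory_gb": 16}))
# {'requests': {'cpu': '4', 'memory': '16Gi'},
#  'limits': {'cpu': '4', 'memory': '16Gi'}}
```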

Local Backend#

Use case: Local development, standalone servers, or sites without SLURM/K8s.

The Local backend runs apptainer exec directly via subprocess.Popen. Jobs are tracked in Redis with synthetic UUID job IDs. Logs are read from the filesystem.

Configuration:

  • HPC_BACKEND=local

This backend is ideal for:

  • Local development and testing

  • Small-scale deployments on a single server

  • Quick-look pipelines at the observatory site
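The submission path described above can be sketched with a plain subprocess and an in-memory job table. Note the substitutions this sketch makes: the real backend stores job state in Redis and wraps the command in `apptainer exec`, whereas here state lives in a dict and a plain Python command is run so the sketch works anywhere.

```python
import subprocess
import sys
import uuid
from pathlib import Path

def submit_local(command: list[str], log_dir: Path, jobs: dict) -> str:
    """Launch a command and track it under a synthetic UUID job ID.

    Sketch only: the real backend tracks jobs in Redis and runs
    `apptainer exec`; here state lives in a plain dict.
    """
    job_id = str(uuid.uuid4())
    log_path = log_dir / f"{job_id}.log"
    with open(log_path, "wb") as log_file:
        # The child process inherits the file descriptor, so it keeps
        # writing to the log after the parent closes its handle.
        proc = subprocess.Popen(command, stdout=log_file,
                                stderr=subprocess.STDOUT)
    jobs[job_id] = {"process": proc, "log_path": log_path}
    return job_id

def get_logs_local(job_id: str, jobs: dict) -> str:
    """Read a job's logs back from the filesystem."""
    return jobs[job_id]["log_path"].read_text()

jobs = {}
job_id = submit_local([sys.executable, "-c", "print('hello')"],
                      Path("."), jobs)
jobs[job_id]["process"].wait()
print(get_logs_local(job_id, jobs).strip())  # hello
```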

Backend Selection#

The backend is selected via Dynaconf configuration:

# settings.toml
[staging]
HPC_BACKEND = "kubernetes"
K8S_NAMESPACE = "ccat-workflows-staging"

[production]
HPC_BACKEND = "slurm"
SLURM_PARTITION = "science"

[localdev]
HPC_BACKEND = "local"

The CCAT_WORKFLOW_MANAGER_HPC_BACKEND environment variable overrides the setting.
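The precedence can be sketched as a plain environment-variable lookup: the prefixed variable wins over the value from `settings.toml`. The `resolve_hpc_backend` function is an illustrative stand-in; in the package itself this resolution is handled by Dynaconf.

```python
import os

def resolve_hpc_backend(settings: dict, env: str) -> str:
    """Resolve HPC_BACKEND: the prefixed environment variable
    overrides the per-environment value from settings.toml.

    Sketch of the precedence only; the package delegates this
    to Dynaconf.
    """
    override = os.environ.get("CCAT_WORKFLOW_MANAGER_HPC_BACKEND")
    if override is not None:
        return override
    return settings[env]["HPC_BACKEND"]

# The environments from settings.toml, as plain dicts for the sketch.
settings = {
    "staging": {"HPC_BACKEND": "kubernetes"},
    "production": {"HPC_BACKEND": "slurm"},
    "localdev": {"HPC_BACKEND": "local"},
}
print(resolve_hpc_backend(settings, "production"))  # slurm
os.environ["CCAT_WORKFLOW_MANAGER_HPC_BACKEND"] = "local"
print(resolve_hpc_backend(settings, "production"))  # local
```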