# HPC Backends

```{eval-rst}
.. verified:: 2026-03-07
   :reviewer: Christof Buchbender
```

The Workflow Manager abstracts HPC job submission behind a pluggable backend interface. Three backends are available, selected via the `HPC_BACKEND` configuration setting.

## Backend Interface

All backends implement the `HPCBackend` abstract base class defined in `ccat_workflow_manager.hpc.base`:

```python
class HPCBackend(ABC):
    def submit(
        self,
        execution_command,
        image_ref,
        sif_path,
        input_dir,
        output_dir,
        workspace_dir,
        manifest_path,
        resource_requirements,
        environment_variables,
        job_name,
    ) -> str:
        """Submit a job, return the backend-specific job ID."""

    def get_status(self, job_id) -> HPCJobInfo:
        """Get current status and metrics for a job."""

    def cancel(self, job_id) -> None:
        """Cancel a running job."""

    def get_logs(self, job_id) -> str:
        """Retrieve logs for a job."""
```

The `HPCJobInfo` dataclass carries status and optional metrics:

```python
@dataclass
class HPCJobInfo:
    status: str  # "pending", "running", "completed", "failed"
    exit_code: int | None
    wall_time_seconds: float | None
    cpu_hours: float | None
    peak_memory_gb: float | None
```

## SLURM Backend

**Use case:** Production HPC batch jobs.

The SLURM backend writes an `sbatch` script that wraps the Apptainer execution command and submits it to the SLURM scheduler. Job metrics are extracted from `sacct` after completion.

**Configuration:**

- `HPC_BACKEND=slurm`
- `SLURM_PARTITION` --- SLURM partition name

**Resource mapping:**

```text
{"cpu": 4, "memory_gb": 16, "time_hours": 2}
→ --cpus-per-task=4 --mem=16G --time=02:00:00
```

## Kubernetes Backend

**Use case:** Service-style deployments (staging environment in Cologne).

The Kubernetes backend creates a Job resource that runs the Apptainer command in a container. Wall time is extracted from pod status timestamps.
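As an illustrative sketch of that wall-time extraction, the helper below derives a duration from two pod status timestamps. The function name and the RFC 3339 string format are assumptions for illustration, not part of the actual backend API:

```python
from datetime import datetime


def wall_time_seconds(start_iso: str, end_iso: str) -> float:
    """Compute wall time from two ISO-8601 / RFC 3339 timestamps.

    Hypothetical helper: Kubernetes reports pod phase transitions
    (e.g. ``status.startTime`` and a terminated container's
    ``finishedAt``) as RFC 3339 strings with a trailing ``Z``.
    """
    # fromisoformat() in older Python versions does not accept "Z",
    # so normalize it to an explicit UTC offset first.
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return (end - start).total_seconds()
```

A 30-minute pod run, for example, yields `wall_time_seconds("2026-03-07T10:00:00Z", "2026-03-07T10:30:00Z") == 1800.0`.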
**Configuration:**

- `HPC_BACKEND=kubernetes`
- `K8S_NAMESPACE` --- Kubernetes namespace for jobs

**Resource mapping:**

```text
{"cpu": 4, "memory_gb": 16}
→ resources.requests and resources.limits on the pod spec
```

## Local Backend

**Use case:** Local development, standalone servers, or sites without SLURM/K8s.

The Local backend runs `apptainer exec` directly via `subprocess.Popen`. Jobs are tracked in Redis with synthetic UUID job IDs. Logs are read from the filesystem.

**Configuration:**

- `HPC_BACKEND=local`

This backend is ideal for:

- Local development and testing
- Small-scale deployments on a single server
- Quick-look pipelines at the observatory site

## Backend Selection

The backend is selected via Dynaconf configuration:

```toml
# settings.toml
[staging]
HPC_BACKEND = "kubernetes"
K8S_NAMESPACE = "ccat-workflows-staging"

[production]
HPC_BACKEND = "slurm"
SLURM_PARTITION = "science"

[localdev]
HPC_BACKEND = "local"
```

The `CCAT_WORKFLOW_MANAGER_HPC_BACKEND` environment variable overrides the setting.

## Related Documentation

- {doc}`manager_worker` - How managers dispatch work
- {doc}`/source/concepts/execution_flow` - Status transitions driven by backend polling
- {doc}`/source/operations/configuration` - Full configuration reference
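The backend-selection logic described above can be sketched as a small factory. The class names, the plain-dict settings argument, and the `make_backend` function are assumptions for illustration, not the actual `ccat_workflow_manager` API:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the real backend classes.
@dataclass
class SlurmBackend:
    partition: str


@dataclass
class KubernetesBackend:
    namespace: str


@dataclass
class LocalBackend:
    pass


def make_backend(settings: dict):
    """Pick an HPC backend from configuration (sketch only).

    Mirrors the HPC_BACKEND / SLURM_PARTITION / K8S_NAMESPACE settings
    shown above; in the real system these come from Dynaconf.
    """
    name = settings.get("HPC_BACKEND", "local")
    if name == "slurm":
        return SlurmBackend(partition=settings["SLURM_PARTITION"])
    if name == "kubernetes":
        return KubernetesBackend(namespace=settings["K8S_NAMESPACE"])
    if name == "local":
        return LocalBackend()
    raise ValueError(f"Unknown HPC_BACKEND: {name!r}")
```

Because every backend implements the same `HPCBackend` interface, callers only ever depend on `submit`, `get_status`, `cancel`, and `get_logs`, regardless of which class the factory returns.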