# Performance

The engine is optimised for HPC workloads that process thousands of
scans, each with up to 16384 channels and seven or more pixels.

## Key Optimisations

### Prefetch I/O Overlap

The pipeline overlaps disk reads with calibration compute using a
dedicated prefetch thread (see {doc}`/source/architecture/pipeline-internals`).
For compute-bound scans, the next scan finishes loading while the current
one is still being calibrated, so NVMe latency is effectively hidden.
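
The overlap can be illustrated with a bounded channel between a loader thread and the compute loop. This is a minimal sketch using only the standard library, not the engine's actual pipeline: `RawScan`, `load_scan`, `calibrate`, and `run_pipeline` are hypothetical names, and the "disk read" is a placeholder.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a scan read from disk.
pub struct RawScan {
    pub id: usize,
    pub samples: Vec<f32>,
}

/// Placeholder for the disk read; a real loader would open a file here.
fn load_scan(id: usize) -> RawScan {
    RawScan { id, samples: vec![id as f32; 4] }
}

/// Placeholder for the compute stage.
fn calibrate(scan: &RawScan) -> f32 {
    scan.samples.iter().sum()
}

/// Run `n_scans` through load → calibrate with loading on a prefetch thread.
pub fn run_pipeline(n_scans: usize, depth: usize) -> Vec<f32> {
    // Bounded channel: at most `depth` loaded scans wait in the queue,
    // which caps the memory held by prefetched data.
    let (tx, rx) = sync_channel::<RawScan>(depth);
    let loader = thread::spawn(move || {
        for id in 0..n_scans {
            if tx.send(load_scan(id)).is_err() {
                break; // the consumer has gone away
            }
        }
        // `tx` is dropped here, which ends the consumer's iteration.
    });
    // The compute loop runs while the loader is already reading ahead.
    let results: Vec<f32> = rx.iter().map(|scan| calibrate(&scan)).collect();
    loader.join().expect("prefetch thread panicked");
    results
}
```

A small `depth` (1 or 2) is usually enough when compute dominates, since the loader only needs to stay one scan ahead.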

### Vectorized Calibration

`calibrate_full()` precomputes a combined calibration factor
`gamma / (H - C) / tr_s` with shape `[C, R, A]` and broadcasts
it across the dump and subscan axes. The inner loop order
`C → R → A → D → S` ensures stride-1 access on the innermost
(subscan) dimension.
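
A minimal sketch of the precompute-and-broadcast pattern follows. It is not the body of `calibrate_full()`: the names `Dims`, `combined_factor`, and `apply_factor` are hypothetical, the data layout (row-major `[C, R, A, D, S]` in one flat buffer) is an assumption, and so is reading `H` and `C` in the formula as hot and cold load values.

```rust
/// Axis lengths, named after the letters used above.
#[derive(Clone, Copy)]
pub struct Dims {
    pub c: usize,
    pub r: usize,
    pub a: usize,
    pub d: usize,
    pub s: usize,
}

/// Combined factor gamma / (hot - cold) / tr_s, one value per [C, R, A] cell.
/// All four inputs are flat arrays of the same length (C * R * A).
pub fn combined_factor(gamma: &[f32], hot: &[f32], cold: &[f32], tr_s: &[f32]) -> Vec<f32> {
    gamma
        .iter()
        .zip(hot)
        .zip(cold)
        .zip(tr_s)
        .map(|(((g, h), c), t)| g / (h - c) / t)
        .collect()
}

/// Scale `data` (row-major [C, R, A, D, S]) in place by `factor` ([C, R, A]).
pub fn apply_factor(data: &mut [f32], factor: &[f32], dims: Dims) {
    let inner = dims.d * dims.s;
    assert_eq!(factor.len(), dims.c * dims.r * dims.a);
    assert_eq!(data.len(), factor.len() * inner);
    if inner == 0 {
        return; // nothing to scale
    }
    // Each [C, R, A] cell owns one contiguous D * S block, so the factor is
    // read once per block and broadcast over the dump and subscan axes.
    for (block, &f) in data.chunks_exact_mut(inner).zip(factor) {
        for dump in block.chunks_exact_mut(dims.s) {
            // Innermost loop walks the subscan axis with stride 1.
            for x in dump {
                *x *= f;
            }
        }
    }
}
```

Because the factor is computed once per `[C, R, A]` cell, the division work is independent of the number of dumps and subscans; the hot loop contains only a multiply.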

### Zero-Copy ON Data

`PreparedData` stores indices into `ScanData.data` rather than
copying ON-source arrays. For large OTF datasets this avoids a ~1 GB
allocation.
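
The idea of storing indices instead of copies can be sketched as below. The struct layouts are assumptions made for illustration: the real `ScanData` and `PreparedData` certainly carry more fields, and `n_chan`, `on_rows`, and `on_spectra` are hypothetical names.

```rust
/// Hypothetical stand-in for `ScanData`: one flat [row, channel] buffer.
pub struct ScanData {
    pub data: Vec<f32>,
    pub n_chan: usize,
}

/// Stand-in for `PreparedData`: borrows the scan and keeps only row indices.
pub struct PreparedData<'a> {
    scan: &'a ScanData,
    on_rows: Vec<usize>,
}

impl<'a> PreparedData<'a> {
    /// Record which rows are ON-source; no spectrum is copied.
    pub fn new(scan: &'a ScanData, is_on: &[bool]) -> Self {
        let on_rows = is_on
            .iter()
            .enumerate()
            .filter_map(|(i, &on)| if on { Some(i) } else { None })
            .collect();
        Self { scan, on_rows }
    }

    /// Iterate ON spectra as slices borrowed from the original buffer.
    pub fn on_spectra(&self) -> impl Iterator<Item = &[f32]> + '_ {
        let n = self.scan.n_chan;
        let data = &self.scan.data;
        self.on_rows.iter().map(move |&i| &data[i * n..(i + 1) * n])
    }
}
```

The index list costs one `usize` per ON row, whereas a copy would cost `n_chan` floats per row.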

### OFF Reference Deduplication

When multiple ON subscans share the same OFF reference (common in
OTF), the reference average is computed once and reused via a
`HashMap` cache.
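
A minimal sketch of such a cache is shown below. The key type (a numeric OFF-reference id), the flat `[row, channel]` layout, and the names `OffCache` and `average_rows` are assumptions for illustration; the engine's actual key and averaging routine may differ.

```rust
use std::collections::HashMap;

/// Average the given rows of a flat [row, channel] buffer.
fn average_rows(data: &[f32], n_chan: usize, rows: &[usize]) -> Vec<f32> {
    let mut mean = vec![0.0f32; n_chan];
    for &r in rows {
        for (m, x) in mean.iter_mut().zip(&data[r * n_chan..(r + 1) * n_chan]) {
            *m += x;
        }
    }
    let n = rows.len().max(1) as f32;
    mean.iter_mut().for_each(|m| *m /= n);
    mean
}

/// Cache of OFF averages keyed by a hypothetical OFF-reference id.
pub struct OffCache {
    cache: HashMap<u32, Vec<f32>>,
    /// Number of averages actually computed (cache misses).
    pub computed: usize,
}

impl OffCache {
    pub fn new() -> Self {
        Self { cache: HashMap::new(), computed: 0 }
    }

    /// Return the average for `off_id`, computing it only on first use.
    pub fn get(&mut self, off_id: u32, data: &[f32], n_chan: usize, rows: &[usize]) -> &[f32] {
        let computed = &mut self.computed;
        let avg = self.cache.entry(off_id).or_insert_with(|| {
            *computed += 1;
            average_rows(data, n_chan, rows)
        });
        avg.as_slice()
    }
}
```

With the entry API, the lookup and the insertion share a single hash computation, and the averaging closure runs only on a miss.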

### Frequency-Adaptive Sigma Weighting

The PWV fit weights each channel by a frequency-dependent sigma, so
channels with different noise characteristics contribute according to
their reliability. This improves convergence in the presence of strong
atmospheric lines.

### PWV Grid Search

`residual_ss_precomputed()` fuses transmission computation and
residual summation in one loop with pre-scaled ATM coefficients.
No heap allocation occurs per evaluation, and the fused loop is
5–10× faster than calling the two steps as separate functions.
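
The fusion can be sketched as follows. This is an illustration under stated assumptions, not the signature or model of `residual_ss_precomputed()`: the opacity model `dry + pwv * wet`, the direct comparison against an observed transmission, and the names `AtmCoeffs`, `residual_ss_fused`, and `grid_search` are all hypothetical.

```rust
/// Hypothetical pre-scaled coefficients: opacity per channel is
/// `dry[i] + pwv * wet[i]`, with any airmass scaling already folded in.
pub struct AtmCoeffs {
    pub dry: Vec<f64>,
    pub wet: Vec<f64>,
}

/// Sum of squared residuals between `observed` transmission and the model,
/// computed in a single pass with no temporary vectors.
pub fn residual_ss_fused(coeffs: &AtmCoeffs, observed: &[f64], pwv: f64) -> f64 {
    let mut ss = 0.0;
    for ((&dry, &wet), &obs) in coeffs.dry.iter().zip(&coeffs.wet).zip(observed) {
        // Transmission is computed and consumed immediately, never stored.
        let transmission = (-(dry + pwv * wet)).exp();
        let r = obs - transmission;
        ss += r * r;
    }
    ss
}

/// Evaluate the fused residual on a uniform grid and return the best PWV.
pub fn grid_search(coeffs: &AtmCoeffs, observed: &[f64], lo: f64, hi: f64, steps: usize) -> f64 {
    let mut best = (lo, f64::INFINITY);
    for k in 0..=steps {
        let pwv = lo + (hi - lo) * k as f64 / steps.max(1) as f64;
        let ss = residual_ss_fused(coeffs, observed, pwv);
        if ss < best.1 {
            best = (pwv, ss);
        }
    }
    best.0
}
```

The unfused alternative would build a transmission vector per grid point and then sum over it, which costs one allocation and one extra pass over memory for every evaluation.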

### Binary ATM Table

Converting the text ATM table (`.dat.gz`) to binary (`.catm`)
via `calibrate convert` reduces load time from ~15 s to under 100 ms
(memory-mapped I/O).

## Benchmarking

```bash
# Criterion micro-benchmarks
cargo bench -p cal-io

# Full pipeline benchmark (CHESTER dataset, SLURM)
sbatch bench_chester.slurm
```

Typical per-scan timing breakdown (HFAV OTF, 16384 channels, debug build):

```text
load_ms=753 resolve_ms=244 atm_ms=762
prepare_ms=2399 calibrate_ms=36876 write_ms=1175
```

The calibrate stage dominates this breakdown, at about 87% of the
42.2 s total. In release builds with LTO, the calibrate stage is ~5× faster.