# Performance

The engine is optimised for HPC workloads that process thousands of scans, each with up to 16384 channels and seven or more pixels.

## Key Optimisations

### Prefetch I/O Overlap

The pipeline overlaps disk reads with calibration compute using a dedicated prefetch thread (see /source/architecture/pipeline-internals). This hides NVMe latency entirely for compute-bound scans.
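
A minimal sketch of the overlap pattern, using a bounded channel and a producer thread. `Scan`, `load_scan`, and the channel depth are illustrative stand-ins, not the pipeline's actual types:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Simulated scan payload; the real pipeline reads these from disk.
struct Scan {
    samples: Vec<f64>,
}

fn load_scan(_id: usize) -> Scan {
    // Stand-in for a blocking disk read.
    Scan { samples: vec![1.0; 1024] }
}

fn calibrate(scan: &Scan) -> f64 {
    scan.samples.iter().sum::<f64>() / scan.samples.len() as f64
}

fn run_pipeline(n_scans: usize) -> Vec<f64> {
    // Bounded channel: the prefetcher stays at most one scan ahead,
    // so the next read overlaps the current calibration.
    let (tx, rx) = sync_channel::<Scan>(1);
    let producer = thread::spawn(move || {
        for id in 0..n_scans {
            tx.send(load_scan(id)).unwrap();
        }
        // tx dropped here, which ends the receiver loop below.
    });
    let mut out = Vec::with_capacity(n_scans);
    for scan in rx {
        out.push(calibrate(&scan));
    }
    producer.join().unwrap();
    out
}

fn main() {
    let results = run_pipeline(4);
    println!("calibrated {} scans", results.len());
}
```

With a channel depth of one, at most one scan is buffered in memory while the previous one is being calibrated.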

### Vectorized Calibration

calibrate_full() precomputes a combined calibration factor gamma / (H - C) / tr_s with shape [C, R, A] and broadcasts it across the dump and subscan axes. The loops run in C, R, A, D, S order, so the innermost (subscan) dimension is accessed with stride 1.
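
The broadcast can be sketched as follows, assuming a flat row-major buffer laid out as [C, R, A, D, S]. The signature, axis sizes, and scalar tr_s are illustrative, not calibrate_full()'s real interface:

```rust
// Apply a per-(c, r, a) calibration factor across the dump and
// subscan axes of a row-major [C, R, A, D, S] buffer.
fn calibrate(
    data: &mut [f64],
    gamma: &[f64], // shape [C, R, A], flattened
    hot: &[f64],   // H, same shape
    cold: &[f64],  // C, same shape
    tr_s: f64,
    (nc, nr, na, nd, ns): (usize, usize, usize, usize, usize),
) {
    for c in 0..nc {
        for r in 0..nr {
            for a in 0..na {
                let cra = (c * nr + r) * na + a;
                // Combined factor computed once per (c, r, a) cell...
                let factor = gamma[cra] / (hot[cra] - cold[cra]) / tr_s;
                for d in 0..nd {
                    let base = (cra * nd + d) * ns;
                    // ...then broadcast over dumps and subscans; the
                    // innermost loop walks contiguous (stride-1) memory.
                    for x in &mut data[base..base + ns] {
                        *x *= factor;
                    }
                }
            }
        }
    }
}

fn main() {
    let mut data = vec![1.0, 2.0];
    calibrate(&mut data, &[2.0], &[4.0], &[2.0], 0.5, (1, 1, 1, 1, 2));
    println!("{data:?}");
}
```

The stride-1 inner loop is what lets the compiler auto-vectorise the multiply.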

### Zero-Copy ON Data

PreparedData stores indices into ScanData.data rather than copying ON-source arrays. For large OTF datasets this avoids a ~1 GB allocation.
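
The idea can be sketched with index ranges into a shared buffer; these struct definitions are simplified stand-ins for the real ScanData and PreparedData:

```rust
use std::ops::Range;

// Simplified stand-in for the real ScanData.
struct ScanData {
    data: Vec<f32>,
}

// The prepared view records index ranges into ScanData.data
// instead of cloning each ON-source slice.
struct PreparedData<'a> {
    scan: &'a ScanData,
    on_ranges: Vec<Range<usize>>,
}

impl<'a> PreparedData<'a> {
    // Borrows from the underlying buffer; no per-subscan allocation.
    fn on_slice(&self, i: usize) -> &'a [f32] {
        &self.scan.data[self.on_ranges[i].clone()]
    }
}

fn main() {
    let scan = ScanData { data: (0..8).map(|x| x as f32).collect() };
    let prepared = PreparedData { scan: &scan, on_ranges: vec![0..4, 4..8] };
    println!("{:?}", prepared.on_slice(1));
}
```

Each prepared ON subscan costs two `usize` values instead of a cloned channel array.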

### OFF Reference Deduplication

When multiple ON subscans share the same OFF reference (common in OTF), the reference average is computed once and reused via a HashMap cache.
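
A sketch of the caching pattern, reduced to scalar averages for brevity; the data shapes and the (ON − OFF) step are illustrative, not the engine's actual calibration formula:

```rust
use std::collections::HashMap;

fn average(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

// Many ON subscans share an OFF reference: average each OFF only
// once and reuse it via a cache keyed by the OFF subscan id.
fn calibrate_ons(
    ons: &[(u32, Vec<f64>)],       // (off_id, ON samples)
    offs: &HashMap<u32, Vec<f64>>, // off_id -> OFF samples
) -> Vec<f64> {
    let mut cache: HashMap<u32, f64> = HashMap::new();
    ons.iter()
        .map(|(off_id, on)| {
            let off_avg = *cache
                .entry(*off_id)
                .or_insert_with(|| average(&offs[off_id]));
            average(on) - off_avg
        })
        .collect()
}

fn main() {
    let offs = HashMap::from([(1u32, vec![1.0, 3.0])]);
    let ons = vec![(1u32, vec![5.0, 5.0]), (1u32, vec![3.0, 3.0])];
    println!("{:?}", calibrate_ons(&ons, &offs));
}
```

With an OTF row of N ON subscans sharing one OFF, the reference average is computed once instead of N times.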

### Binary ATM Table

Converting the text ATM table (.dat.gz) to binary (.catm) via calibrate convert reduces load time from ~15 s to under 100 ms, since the binary table is memory-mapped rather than parsed.
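
The speed-up comes from reading fixed-width binary records instead of parsing text. A simplified round-trip sketch: the real .catm loader memory-maps the file, and the record layout here (a flat array of little-endian f64 values) is an assumption:

```rust
use std::fs;
use std::io;

// Write values as a flat array of little-endian f64 records.
fn write_catm(path: &str, values: &[f64]) -> io::Result<()> {
    let mut bytes = Vec::with_capacity(values.len() * 8);
    for v in values {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    fs::write(path, bytes)
}

// Read them back: fixed-width decode, no text parsing or allocation
// per field.
fn read_catm(path: &str) -> io::Result<Vec<f64>> {
    let bytes = fs::read(path)?;
    Ok(bytes
        .chunks_exact(8)
        .map(|c| f64::from_le_bytes(c.try_into().unwrap()))
        .collect())
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("example.catm");
    let path = path.to_str().unwrap();
    write_catm(path, &[1.0, 2.0, 3.0])?;
    println!("{:?}", read_catm(path)?);
    Ok(())
}
```

With memory-mapping, even the `fs::read` copy disappears: the table is used directly from the page cache.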

## Benchmarking

```shell
# Criterion micro-benchmarks
cargo bench -p cal-io

# Full pipeline benchmark (CHESTER dataset, SLURM)
sbatch bench_chester.slurm
```

Typical per-scan timing breakdown (HFAV OTF, 16384 channels, debug build):

```text
load_ms=753 resolve_ms=244 atm_ms=762
prepare_ms=2399 calibrate_ms=36876 write_ms=1175
```

In release builds with LTO, the calibrate stage is ~5× faster.
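
Assuming a standard Cargo setup, LTO is enabled with a release-profile entry like the following; the exact settings the project uses are an assumption:

```toml
[profile.release]
lto = "fat"        # whole-program link-time optimisation
codegen-units = 1  # larger optimisation scope, slower compile
```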