# Performance

The engine is optimised for HPC workloads processing thousands of scans with up to 16384 channels and 7+ pixels.

## Key Optimisations

### Prefetch I/O Overlap

The pipeline overlaps disk reads with calibration compute using a dedicated prefetch thread (see {doc}`/source/architecture/pipeline-internals`). This hides NVMe latency entirely for compute-bound scans.

### Vectorized Calibration

`calibrate_full()` precomputes a combined calibration factor `gamma / (H - C) / tr_s` with shape `[C, R, A]` and broadcasts it across the dump and subscan axes. The inner loop order `C → R → A → D → S` ensures stride-1 access on the innermost (subscan) dimension.

### Zero-Copy ON Data

`PreparedData` stores indices into `ScanData.data` rather than copying ON-source arrays. For large OTF datasets this avoids a ~1 GB allocation.

### OFF Reference Deduplication

When multiple ON subscans share the same OFF reference (common in OTF), the reference average is computed once and reused via a `HashMap` cache.

### Frequency-Adaptive Sigma Weighting

The PWV fit uses frequency-adaptive sigma weighting to improve convergence in the presence of strong atmospheric lines, giving appropriate weight to channels with different noise characteristics.

### PWV Grid Search

`residual_ss_precomputed()` fuses transmission computation and residual summation in one loop with pre-scaled ATM coefficients. This eliminates all heap allocation per evaluation and achieves 5–10× speedup over separate function calls.

### Binary ATM Table

Converting the text ATM table (`.dat.gz`) to binary (`.catm`) via `calibrate convert` reduces load time from ~15 s to <100 ms (memory-mapped I/O).
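As a rough illustration of the OFF reference deduplication described above, the following sketch caches one average per distinct OFF reference in a `HashMap`. It is not the engine's code: the `OffKey` type (a half-open dump range), the `Vec<Vec<f32>>` data layout, and the function names `average_off` and `dedup_off_averages` are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical key for an OFF reference: the half-open dump range
/// `[start, end)` it covers. The real engine may key on something else.
type OffKey = (usize, usize);

/// Channel-wise mean of the dumps in `range`.
/// Assumes the range is non-empty and within bounds.
fn average_off(dumps: &[Vec<f32>], range: OffKey) -> Vec<f32> {
    let (start, end) = range;
    let n = (end - start) as f32;
    let mut avg = vec![0.0f32; dumps[start].len()];
    for dump in &dumps[start..end] {
        for (acc, value) in avg.iter_mut().zip(dump) {
            *acc += *value;
        }
    }
    for acc in &mut avg {
        *acc /= n;
    }
    avg
}

/// For each ON subscan, look up (or compute once) the average of its OFF
/// reference. Returns the unique averages and, per ON subscan, the index
/// of the average it should use.
pub fn dedup_off_averages(
    dumps: &[Vec<f32>],
    off_for_on: &[OffKey],
) -> (Vec<Vec<f32>>, Vec<usize>) {
    let mut cache: HashMap<OffKey, usize> = HashMap::new();
    let mut averages: Vec<Vec<f32>> = Vec::new();
    let mut slot_for_on = Vec::with_capacity(off_for_on.len());

    for &key in off_for_on {
        // The closure only runs on a cache miss, so each distinct
        // reference is averaged exactly once.
        let slot = *cache.entry(key).or_insert_with(|| {
            averages.push(average_off(dumps, key));
            averages.len() - 1
        });
        slot_for_on.push(slot);
    }
    (averages, slot_for_on)
}
```

With three ON subscans whose references are `(0, 2)`, `(0, 2)` and `(2, 3)`, only two averages are computed and the first two subscans share slot 0. Storing slot indices rather than cloned vectors keeps the reuse allocation-free, in the same spirit as the zero-copy ON data.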
## Benchmarking

```bash
# Criterion micro-benchmarks
cargo bench -p cal-io

# Full pipeline benchmark (CHESTER dataset, SLURM)
sbatch bench_chester.slurm
```

Typical per-scan timing breakdown (HFAV OTF, 16384 channels, debug build):

```text
load_ms=753 resolve_ms=244 atm_ms=762 prepare_ms=2399 calibrate_ms=36876 write_ms=1175
```

In release builds with LTO, the calibrate stage is ~5× faster.
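To read a breakdown like the one above at a glance, it can help to compute each stage's share of the summed stage times. The sketch below parses one `name_ms=value` line and reports a stage's fraction of the total. The helper names `parse_timings` and `stage_share` are hypothetical and not part of the engine or its CLI. For the sample line, the stages sum to 42209 ms and `calibrate` accounts for roughly 87% of that.

```rust
/// Parse whitespace-separated `name_ms=value` fields from one timing line.
/// Fields that do not match that shape are skipped.
pub fn parse_timings(line: &str) -> Vec<(String, u64)> {
    line.split_whitespace()
        .filter_map(|field| {
            let (name, value) = field.split_once('=')?;
            let name = name.strip_suffix("_ms")?;
            let ms = value.parse::<u64>().ok()?;
            Some((name.to_string(), ms))
        })
        .collect()
}

/// Fraction of the summed stage times spent in `stage`.
/// Returns `None` if the stage is absent or the total is zero.
pub fn stage_share(timings: &[(String, u64)], stage: &str) -> Option<f64> {
    let total: u64 = timings.iter().map(|(_, ms)| ms).sum();
    let ms = timings.iter().find(|(name, _)| name == stage)?.1;
    if total == 0 {
        None
    } else {
        Some(ms as f64 / total as f64)
    }
}
```

Calling `stage_share(&parse_timings(line), "calibrate")` on the debug-build line shows why the calibrate stage is the natural target for optimisation.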