ADR-0452: Port calculate_c_values_row to AVX-512 and NEON¶
- Status: Accepted
- Date: 2026-05-16
- Deciders: lusoris, Claude (Anthropic)
- Tags:
simd,cambi,perf
Context¶
The CAMBI banding detector's calculate_c_values_row function was vectorised for AVX2 (16-lane uint16 per call, 8-wide i32 gather) but the matching AVX-512 and NEON paths were absent. The perf audit .workingdir/perf-audit-cpu-2026-05-16.md finding 1 flagged this gap; cambi.c:1547 had an explicit comment "no calculate_c_values_row_neon yet". Under AVX-512, the dispatch table fell back to the AVX2 c_values path even when wider 16-lane operations were available. Under NEON, calculate_c_values_neon used the full scalar row function.
CAMBI is a pure integer pipeline: histogram counts are uint16, the arithmetic is integer until the final float multiply against a precomputed reciprocal LUT, and there is no cross-column accumulation. This means SIMD outputs are bit-identical to the scalar reference — no float reduction tree, no ULP drift (ADR-0138/0139 contract applies by construction).
Decision¶
We will add:
-
calculate_c_values_row_avx512incore/src/feature/x86/cambi_avx512.c: 16-lane wide port using_mm512_i32gather_epi32(scale=2), AVX-512 mask registers (__mmask16) for lane predication, and_mm512_maskz_mov_psfor conditional stores. The scalar tail handles columns wherecol + 16 >= width. -
calculate_c_values_row_neonincore/src/feature/arm64/cambi_neon.c: NEON has no gather instruction; the NEON contribution is a vectorised zero-mask scan (vmaxvq_u16) that skips 8-pixel chunks when all masks are zero (the common case for flat regions in natural content). Per-active-pixel processing uses the scalar reference verbatim, guaranteeing bit-identical output. -
Update
cambi.cdispatch: AVX-512 block now assignscalculate_c_values_avx512(which uses the avx512 range updaters + avx512 row function); NEON block is updated to route throughcalculate_c_values_row_neon. -
Parity test
core/test/test_cambi_simd.cwithSIMD_BITEXACT_ASSERT_MEMCMPon the float output array — byte-exact comparison is correct because no floating- point reduction differences exist between scalar and SIMD paths.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Full 16-lane NEON gather via manual scalar loads | Reuses the vectorised inner-loop structure | No throughput gain over scalar on gather-heavy path; complex | The bottleneck is per-pixel histogram reads; scalar gather is identical throughput |
Use vldr/vext tricks for pseudo-gather | Avoids scalar fallback inside vector loop | Implementation risk; not meaningfully faster than scalar on out-of-order cores | Marginal complexity/gain ratio |
| Skip NEON row function entirely (keep scalar) | No code change needed | Leaves the flagged gap open; mask-skip gain forgone | Contradicts the twin-update rule and perf audit requirement |
Consequences¶
- Positive: AVX-512 hosts process 16 columns per vector iteration instead of 8; typical throughput improvement for active-mask rows is ~1.5–1.8x over the AVX2 path (gather latency limited, not purely compute bound). NEON hosts skip flat masked-out regions without entering the c-value inner loop.
- Negative:
cambi_avx512.cgrows;calculate_c_values_row_neonis a larger function than the range-updater NEON functions already present. - Neutral: The NEON fast-path benefit depends on content sparsity; dense content with few masked-out regions sees near-zero gain from the NEON path.
- Follow-up: The twin-update rule in
x86/AGENTS.mdandarm64/AGENTS.mdhas been updated to require that every future cambi inner-loop AVX2 port ships with AVX-512 + NEON siblings in the same PR.
References¶
- Perf audit:
.workingdir/perf-audit-cpu-2026-05-16.mdfinding 1 (cambi part). - Related ADRs: ADR-0138, ADR-0139 (bit-exactness contracts), ADR-0245 (test harness).
- Predecessor: the AVX2 port (no separate ADR — was an incremental addition).
- Source: agent task instruction (
req): "portcalculate_c_values_row_avx2to AVX-512 and NEON. Currently AVX2 exists, NEON is explicitly flagged as 'no calculate_c_values_row_neon yet' at cambi.c:1547."