ADR-0138: _iqa_convolve AVX2 bit-exact double-precision fast path¶
- Status: Accepted
- Date: 2026-04-21
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, performance
Context¶
With the MS-SSIM decimate SIMD fast paths landed (ADR-0125, AVX2 / AVX-512 / NEON), a fresh CPU profile of the MS-SSIM feature at 1080p on branch feat/ms-ssim-convolve-avx2 (commit f28d853e) shows the hot-spot has moved:
| Symbol | File | Self % | Cumulative % |
|---|---|---|---|
_iqa_convolve | core/src/feature/iqa/convolve.c:152 | 51.4 | 51.4 |
_iqa_filter_pixel | core/src/feature/iqa/convolve.c:264 | 36.0 | 87.4 |
_iqa_decimate | core/src/feature/iqa/decimate.c:53 | 0.6 | — |
The 36% _iqa_filter_pixel cost belongs to the SSIM extractor's scalar _iqa_decimate (which calls it per-pixel) and is a separate workstream. The 51.4% _iqa_convolve cost is the focus of this ADR.
_iqa_convolve is called five times per _iqa_ssim invocation (ref_mu, cmp_mu, ref_sigma_sqd, cmp_sigma_sqd, sigma_both at ssim_tools.c:170–189) and MS-SSIM runs _iqa_ssim for each of 5 pyramid scales — 25 convolutions per frame for MS-SSIM alone, plus SSIM's own five per frame at each scale. The kernel is the 11-tap separable Gaussian declared as g_gaussian_window_h / g_gaussian_window_v in ssim_tools.h:77-83; both arrays are static const float with compile-time-constant values.
The scalar implementation (in the 1-D separable branch selected by IQA_CONVOLVE_1D defined in iqa_options.h:25) accumulates into a double sum and casts once at store:
double sum = 0.0;
for (u = -uc; u <= uc; ++u, ++k_offset) {
sum += img[img_offset + u] * k->kernel_h[k_offset];
}
img_cache[img_offset] = (float)(sum * scale);
Because the per-tap multiply-accumulate is written as two separate operations (sum += a * b), the scalar path's observable rounding depends on whether the compiler auto-fuses into FMA. Under the fork's default -O3 build without an explicit -march=haswell or -mfma, GCC does not fuse, and the scalar path is two-op _mm_mul_sd / _mm_add_sd equivalents at every tap.
Two fork-wide rules shape the implementation:
core/src/feature/iqa/is a verbatim BSD-2011 Tom Distler import. The fork leaves it untouched for rebase hygiene (ADR-0125 §Ground rules).- The Netflix CPU golden gate (CLAUDE.md §8) exercises SSIM and MS-SSIM via the three reference pairs. Any AVX2 path that perturbs the scalar output risks the frozen
assertAlmostEqual(places=N)assertions.
Decision¶
We add a bit-exact AVX2 fast path for _iqa_convolve in a new core/src/feature/x86/convolve_avx2.{c,h} plus a matching entry point declared in core/src/feature/iqa/convolve.h guarded by #if defined(HAVE_AVX2). Dispatch happens in core/src/feature/iqa/ssim_tools.c (the only hot caller) via a new _iqa_convolve_set_dispatch(fn) setter symmetric to the existing _iqa_ssim_set_dispatch pattern; the scalar _iqa_convolve remains the default fallback. The vendored file iqa/convolve.c is not modified.
The AVX2 kernel is specialised for the MS-SSIM / SSIM invariants:
| Invariant | Value | Enforced by |
|---|---|---|
| Kernel length | 11 (Gaussian) or 8 (square) | Runtime check on k->w |
| Normalisation | k->normalized == 1 → scale == 1.0f | Runtime check |
| Separable 1-D | IQA_CONVOLVE_1D defined | Compile-time |
| Coefficient source | k->kernel_h / k->kernel_v | Pointer compare against g_gaussian_window_h/_v / g_square_window_h/_v |
On any mismatch the AVX2 path returns an error code and the caller falls through to the scalar reference — we do not attempt to vectorise arbitrary kernels.
Bit-exactness to the scalar mirrors its three-stage pattern exactly: single-rounded float * float → widen to double → double add, with no fused FMA. Scalar writes sum += img[img_offset + u] * k->kernel_h[k_offset] where the product is evaluated in float (FLT_EVAL_METHOD == 0 on our targets) and the += promotes the float product to double before the add. Any other pattern (double mul after widening; FMA) collapses or shifts the rounding and diverges by ULPs:
__m256d sum = _mm256_setzero_pd();
for (u = 0; u < 11; ++u) {
__m128 f4 = _mm_loadu_ps(&img[x - 5 + u]); // 4 floats
__m128 coeff_f = _mm_set1_ps(kh[u]); // broadcast f32 coeff
__m128 prod_f = _mm_mul_ps(f4, coeff_f); // single-rounded float mul
__m256d prod = _mm256_cvtps_pd(prod_f); // widen product f32→f64
sum = _mm256_add_pd(prod, sum); // double add (no FMA)
}
__m128 out = _mm256_cvtpd_ps(sum); // 4 f32 outputs
_mm_storeu_ps(&dst[x], out);
Four output pixels per inner iteration. At 1080p the horizontal pass produces 1920 × 1080 = 2 073 600 interior pixels per plane; scalar does 11 mul+add = 22 ops per pixel, the AVX2 path does the same 22 ops but across 4 output lanes in parallel, yielding a ~4× ideal speedup bounded by memory bandwidth. The vertical pass is structurally identical (strided loads from img_cache).
Borders (the dst_w-tail and dst_h-tail edge strips plus the |y| < vc and |x| < uc guard region) are handled by calling back to the scalar inner loop on those pixels — we do not vectorise the border, matching the ms_ssim_decimate_avx2 precedent.
NEON is deferred to a follow-up PR. AVX-512 is also deferred — at 11 taps × 4 lanes, the AVX2 port is not AVX-512-limited; the AVX-512 win requires a separate scatter/gather for the 8-lane double layout and is a small-delta follow-up.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Bit-exact float mul → widen → double add (chosen) | Matches scalar sum += f*k byte-for-byte under FLT_EVAL_METHOD == 0; golden gate unchanged; no touch to vendored iqa/ file; supports both Gaussian-11 and square-8 kernels with one code path | 4-lane (AVX2) / 8-lane (AVX-512) speedup only; memory-bandwidth-limited on 1080p | Decision — only pattern that preserves scalar rounding exactly |
__m256d (4-lane) with _mm256_mul_pd after widening | Looks simpler (one precision throughout) | Double mul rounds differently than scalar's single-rounded float mul — diverges by ULPs | Rejected — not bit-exact |
__m256 (8-lane float) + scalar rewrite to fmaf | 8× lane parallelism; mirrors ms_ssim_decimate precedent exactly | Requires modifying vendored iqa/convolve.c to switch scalar to fmaf() — one-time golden-score shift that must be verified within places=N tolerance of every existing assertion; risks rebase noise | Rejected — modifying a vendored BSD file without an alternative motivation violates ADR-0125 §Ground rules; AVX2 double at 4× still closes the main profiled gap |
__m256 float, accept deviation | Fastest path (8× lane); simple code | Golden tests may deviate beyond assertAlmostEqual(places=3) for SSIM or places=4 for MS-SSIM; risks CPU golden gate (CLAUDE §12 rule 1) | Rejected — golden gate is non-negotiable |
Generalise _iqa_convolve itself (edit vendored file to inline AVX2) | One code path for every caller | Modifies BSD-2011 Tom Distler code; increases rebase noise; the only hot caller is ssim_tools.c so the "generalisation" has zero extra consumers today | Rejected for the same reason as ADR-0125 — no second caller exists |
Skip convolve, optimise _iqa_filter_pixel instead | Addresses the other 36% in the profile | _iqa_filter_pixel is the SSIM scalar decimate path; replacing it is a separate T3 ("SSIM decimate SIMD") workstream with its own kernel specialisation; mixing here triples the PR scope | Deferred — separate ADR / PR |
Consequences¶
- Positive: MS-SSIM CPU throughput gains an expected ~1.3–1.6× end-to-end from closing the 51% self-time on
_iqa_convolve(Amdahl-bounded: if the AVX2 kernel delivers 4× on its 51.4% share, whole-frame time drops to0.486 + 0.514/4 = 0.614— a ~1.63× speedup; measured in the PR). SSIM benefits identically per convolve call. No change to the vendorediqa/subtree. The dispatch pattern leaves a clear slot forconvolve_neonandconvolve_avx512follow-ups. - Negative: Adds one new SIMD translation unit plus header (~200 LoC). The
__m256ddouble-path is memory-bandwidth-limited at 1080p — the 4× ceiling is the best case; actual gains may be closer to 3× under L2/L3 pressure. The new file needs Power-of-10 §5 assertion-density coverage (≥1VMAF_ASSERT_DEBUGper function ≥20 lines). Dispatch adds a function-pointer indirection that is negligible at the per-image granularity but does foreclose inlining the convolve into SSIM for a theoretical whole-loop fusion. - Neutral / follow-ups:
- Research digest
0011-iqa-convolve-avx2.md. docs/metrics/ms_ssim.md(if present) /docs/metrics/ssim.mdgets a "SIMD paths" entry; per ADR-0100 SIMD paths surface at the metric-doc level.core/src/feature/AGENTS.mdrebase invariant: the AVX2 kernel assumesGAUSSIAN_LEN = 11,SQUARE_LEN = 8,normalized = 1,kernel_h/kernel_vpointer identity againstg_gaussian_window_{h,v}/g_square_window_{h,v}. Any upstream change to these constants requires updating the pointer-identity check and/or the length switch.CHANGELOG.md"lusoris fork" entry.docs/rebase-notes.mdentry: new dispatch symbol_iqa_convolve_set_dispatchis fork-local; if upstream adds a similarly-named symbol, merge must keep the fork's version.- NEON follow-up PR (arm64):
convolve_neon.c, same dispatch slot. - AVX-512 follow-up PR:
convolve_avx512.c,__m512d(8-lane double) reduction of the same kernel. - Reproducer command (for PR description):
meson test -C build test_iqa_convolve+meson test -C build test_ssim test_ms_ssimto confirm no golden drift.
References¶
- Source: user popup (2026-04-21) — "AVX2 convolve first (Recommended)" after profile ran on branch
feat/ms-ssim-convolve-avx2. - Precision source: user popup (2026-04-21) — "Bit-exact
__m256d(4-lane) (Recommended)". - Predecessor ADRs: ADR-0125 — MS-SSIM decimate SIMD (same dispatch pattern + rebase-hygiene rule). ADR-0106 — this ADR predates the implementation commit on
feat/ms-ssim-convolve-avx2. ADR-0108 — six deep-dive deliverables apply. - Profile data:
build/profiles/2026-04-20/ms_ssim_1080p_cpu.callgrindandbuild/profiles/2026-04-20/ms_ssim_1080p_cpu_topN.txt. - Vendored convolve origin: Tom Distler IQA library, 2011 — header in
core/src/feature/iqa/convolve.c.
Status update 2026-05-08: Accepted¶
Audited as part of the 2026-05-08 ADR Proposed sweep (Research-0086).
Acceptance criteria verified in tree at HEAD 0a8b539e:
core/src/feature/x86/convolve_avx2.{c,h}— present.- The bit-exactness pattern (single-rounded float mul, widen to double, double add, no FMA) is cited as load-bearing by ADR-0140
simd_dx.h(SIMD_WIDEN_ADD_F32_F64_AVX2). - Verification command:
ls core/src/feature/x86/convolve_avx2.{c,h}.
AVX-512 audit 2026-05-09: AUDIT-PASS at 1.300x / 1.173x (T3-9 sub-row c)¶
T3-9 sub-row (c) bench-first audit on Ryzen 9 9950X3D (Zen 5, AVX-512F/BW/VL). 480-frame Netflix normal pair fixture, single-thread median of 3 wall-clock runs:
float_ssim(single-scale convolve): AVX2 1.057 s vs AVX-512 0.813 s = 1.300x (at threshold).float_ms_ssim(5-scale convolve): AVX2 1.450 s vs AVX-512 1.236 s = 1.173x (sub-threshold but explained — at the smallest 2 scales the kernel approaches the input width and SIMD lanes are partially masked, matching this ADR's "8-lane bandwidth-amortised on 8x8 windows" prediction).
Bit-exactness: AVX-512 vs AVX2 score JSON byte-identical at full precision (--precision max); 0/48 frames diverge. test_iqa_convolve 13/13 subtests pass on the audit build. Cross-backend gate clean.
The shipped convolve_avx512.c path is meeting its design ceiling. Both float_ssim.c:117 and float_ms_ssim.c:94 dispatch sites wire it correctly. Audit closes the ADR-0138 §"Follow-up" promise.
See Research-0089 for the full bench table, decision matrix, and theoretical-ceiling crosscheck.