Research-0011: _iqa_convolve AVX2 — bit-exactness via __m256d, kernel invariants, Amdahl¶
- Status: Active
- Workstream: ADR-0138
- Last updated: 2026-04-21
Question¶
After MS-SSIM decimate went SIMD (ADR-0125), a 2026-04-20 1080p CPU profile of MS-SSIM shows _iqa_convolve owning 51.4 % self time (+ _iqa_filter_pixel 36.0 %; decimate dropped to 0.6 %). Can we accelerate _iqa_convolve on x86 with AVX2 while remaining byte-identical to the scalar reference, without modifying the vendored BSD-2011 Tom Distler iqa/ subtree? What accumulator width and reduction order does bit-exactness require?
Sources¶
- Profile data:
build/profiles/2026-04-20/ms_ssim_1080p_cpu.callgrindbuild/profiles/2026-04-20/ms_ssim_1080p_cpu_topN.txtbuild/profiles/2026-04-20/ms_ssim_1080p_cpu.svg(flamegraph)core/src/feature/iqa/convolve.c— scalar reference. 1-D separable branch is active by default (seeIQA_CONVOLVE_1Diniqa/iqa_options.h:25).core/src/feature/iqa/ssim_tools.c— the only hot caller of_iqa_convolve: five calls per_iqa_ssim(lines 170, 171, 187, 188, 189).core/src/feature/iqa/ssim_tools.h— definesGAUSSIAN_LEN = 11,SQUARE_LEN = 8, and the staticconst float g_gaussian_window_{h,v}/g_square_window_{h,v}1-D coefficient arrays used by MS-SSIM.core/src/feature/ms_ssim.c— drives_iqa_ssimfor 5 pyramid scales, so 25_iqa_convolvecalls per MS-SSIM frame.core/src/feature/x86/ms_ssim_decimate_avx2.c— established dispatch + border-scalar precedent.core/src/feature/x86/ssim_avx2.c/ssim_avx512.c— existing SSIM inner-primitive SIMD paths wired through_iqa_ssim_set_dispatch; the same pattern applies here.- IEEE 754-2008 §4.3 — associativity of addition (not guaranteed).
Findings¶
The 36 % _iqa_filter_pixel self-time is not on the convolve path¶
The vmaf_bench profile runs both SSIM and MS-SSIM features. SSIM's scalar _iqa_decimate (in ssim.c:158) calls _iqa_filter_pixel per output pixel. MS-SSIM uses the already-SIMD ms_ssim_decimate path and does not call _iqa_filter_pixel. The 36 % cost therefore belongs to SSIM's decimate — a separate T3 workstream. This ADR targets only the 51.4 % _iqa_convolve gap.
Bit-exactness contract: __m256d with separate mul + add¶
The scalar loop at convolve.c:159-161 is:
double sum = 0.0;
for (u = -uc; u <= uc; ++u, ++k_offset) {
sum += img[img_offset + u] * k->kernel_h[k_offset];
}
img_cache[img_offset] = (float)(sum * scale); // scale == 1.0f for normalised kernels
Three observations drive the bit-exactness argument:
- The scalar multiply-add is unfused. The expression
sum += a * bis written as a*followed by a+=. Under the fork's default build flags (-O3, no explicit-march=haswell/-mfmain the default meson configuration), GCC does not contract to FMA. Each tap therefore rounds twice: once after the multiply, once after the add. The SIMD path must reproduce the same two-rounding pattern to stay bit-identical. - Associativity is only needed within a single output lane. Across-lane reductions (e.g.
_mm_hadd_pd) would change the summation order. We avoid them: each of the 4 output pixels in a__m256daccumulator only ever sums its own 11 taps in the same order as the scalar inner loop. - Promotion order matches. The scalar does
float * float → doubleat the multiply site (operator promotion) anddouble + doubleat the add site._mm256_mul_pd(_mm256_cvtps_pd(img4), _mm256_set1_pd((double)kh[u]))does the same: widen the 4 float inputs to doubles before multiply, then add doubles.
With (1), (2), (3) in place, lane i of the __m256d accumulator computes exactly the same sequence of IEEE-754 operations as the scalar inner loop for output column x + i. The store path (_mm256_cvtpd_ps followed by _mm_storeu_ps) produces the same (float)(sum * 1.0f) cast as the scalar `img_cache[...] = (float)(sum
- scale)`.
Bit-identical output by construction. No tolerance check required.
Why not __m256 (8-lane float)?¶
Single-precision FMA — _mm256_fmadd_ps(img, k, sum) — is one rounding per tap, not two, and uses float accumulation instead of double. This produces a different (and in a strict numerical sense better) answer from the scalar. To ship an 8-lane-float AVX2 path we would need to either:
- Modify the scalar in
iqa/convolve.cto usefmaf()and single-precision accumulation (thems_ssim_decimateapproach from ADR-0125). This mutates vendored BSD-2011 code and creates a one-time golden shift. - Accept the deviation and hope Netflix's
assertAlmostEqual(places=N)tolerance absorbs it. The SSIM golden atplaces=3(~5e-4) and MS-SSIM's informalplaces=4(~5e-5) might or might not cover it; a 11-tap single-vs-double accumulator can differ by ~2e-7 per tap, which cascades unpredictably through 5 convolves × 5 scales of MS-SSIM.
Keeping the vendored file untouched AND landing an AVX2 fast path at the same time requires the double-accumulator route. This is the decision in ADR-0138.
Kernel invariant set¶
The AVX2 specialisation assumes MS-SSIM / SSIM invariants and falls back to scalar otherwise. Enforced at dispatch:
k->w == k->h(square kernel)k->w ∈ { GAUSSIAN_LEN (11), SQUARE_LEN (8) }k->normalized == 1→_calc_scale(k) == 1.0f(no post-scale)k->kernel_h != NULL && k->kernel_v != NULL(separable form)IQA_CONVOLVE_1Dis defined at compile time (it is iniqa_options.h:25)w,h≥ 4 (for safe__m128-load scaffolding)img,img_cache,resultare 32-byte-aligned (guaranteed bymallocon 64-bit Linux;_mm_loadu_psis used anyway so misalignment is safe if the invariant slips)
Any violation short-circuits to the scalar _iqa_convolve.
Border handling — scalar fallback, matches the decimate precedent¶
The separable scalar path at convolve.c:152-179 only touches interior pixels — output columns 0..dst_w-1 = w - k->w + 1 on rows -vc..dst_h+vc-1. The border pixels (the first and last uc / vc columns / rows) are skipped; _iqa_convolve relies on the caller having already established that the kernel fits entirely inside the image. No KBND_SYMMETRIC reflection is involved on the _iqa_convolve hot path.
This is different from the ms_ssim_decimate case, which did need mirror reflection. Here the border handling is trivial: we simply skip the output tail columns that aren't a multiple of 4 output lanes wide, and call the scalar inner loop on the last few.
Matches the core/src/feature/x86/adm_avx2.c precedent (5-tap separable CSF filter) and ms_ssim_decimate_avx2.c (scalar-border, SIMD-inner).
FLOP accounting + expected speed-up¶
Per _iqa_convolve call on an w × h input with an N-tap separable kernel:
- Horizontal pass:
(h + N-1) × (w - N + 1) × NMACs. For 1080p + N=11:1090 × 1910 × 11 ≈ 22.9 MMACs. - Vertical pass:
(h - N + 1) × (w - N + 1) × NMACs. For 1080p + N=11:1070 × 1910 × 11 ≈ 22.5 MMACs. - Total per call: ≈ 45.4 M MACs at 1080p for N=11 Gaussian.
- Per MS-SSIM frame: 45.4 M × 5 convolves × Σ(scale-downsampled area) ≈ 80 M MACs (dominated by scale 0; scales 1–4 add only ~½ again).
- Scalar double-accumulator throughput on Zen / Skylake: ≈ 1 MAC / cycle in the critical path (one FMUL port + one FADD port, 4-cycle latency on the chain, unfused). At 4 GHz → 80 M MACs × 1 cycle / 4 GHz ≈ 20 ms / frame just on double-path MACs. Matches the 51 % of 43.6 fps = 11 ms roughly (discrepancy from the scalar actually being unfused: 2 rounding ops not 1, plus cache pressure).
- AVX2
__m256dthroughput: 4 lanes × 1 double-MAC / cycle ≈ 4 MACs / cycle — ideal 4× speed-up. At memory-bandwidth limit (~25 GB/s L3 on Zen, ~40 GB/s on Skylake server) the float-to-double expansion roughly halves the speed-up ceiling: real-world ~3×. - Amdahl on end-to-end MS-SSIM: closing 51 % of scalar time at 4× ideal yields
1 / (0.486 + 0.514/4) = 1.63×total speedup. At the 3× realistic ceiling:1 / (0.486 + 0.514/3) = 1.54×. End-to-end MS-SSIM should land at ~67 fps (vs. current 43.6 fps) at 1080p.
Prior art / existing SIMD convolution patterns in this tree¶
core/src/feature/x86/adm_avx2.c— separable 5-tap CSF filter. Structural template. Uses_mm256_fmadd_ps(float path) because ADM is not bit-exactness constrained — different rounding from the scalar is inside the documented golden tolerance.core/src/feature/x86/ms_ssim_decimate_avx2.c— most recent fork precedent. Scalar-border, SIMD-inner. Usesfmaf()for the scalar reference and_mm256_fmadd_psfor SIMD, both in single-precision. That path was written fresh — we had the freedom to choose single-precision end-to-end._iqa_convolveis vendored, so we do not.core/src/feature/common/convolution.c— integer-typed (uint16_t); not reusable.- Public Gaussian-blur SIMD in libjxl (
lib/jxl/gauss_blur.cc) uses 4-lane double recursion for the IIR variant. Not the same algorithm as the 11-tap FIR here but confirms double-precision lane parallelism is an established pattern for bit-exact Gaussian blur.
Alternatives explored¶
__m256(8-lane float) + scalar rewrite — rejected. Requires mutatingiqa/convolve.c, a vendored BSD-2011 file. The one-time golden-score shift would need empirical tolerance verification per existing SSIM + MS-SSIM assertion. Fork rule (ADR-0125 §Ground rules) is "don't touch vendored iqa/ unless we have to". Double-accumulator at 3× realistic gives most of the speedup without the risk.__m256float, accept deviation — rejected. Golden gate is non-negotiable (CLAUDE §12 rule 1). Not worth the risk for a 2× marginal speedup over the double path.__m512d(AVX-512, 8-lane double) in same PR — deferred. The vmaf_bench workload is already memory-bandwidth-limited; the extra 4 lanes on AVX-512 provide less than linear improvement and AVX-512 clock throttling on older Skylake-X could lose wall time. A standalone AVX-512 follow-up PR that benchmarks against the AVX2 path on specific hardware (Zen4, Sapphire Rapids) is the right granularity.- Fuse the 5 sequential convolves inside
_iqa_ssiminto one multi-output pass — rejected for scope. Would require refactoring_iqa_ssimto produce all 5 convolved outputs in a single cache-hot sweep. Big architectural change, big diff, hard to keep bit-exact. Potential follow-up if 3× realistic doesn't land. - Pre-allocate
img_cacheinto a per-extractor scratch slot to cut the per-framemalloc/free— deferred. The profile showedbrkat 1.1 %, so allocator cost is non-trivial but small. Separate S-effort follow-up if the measured AVX2 gain is memory-bound.
Open questions¶
- Does the hardware FMA port on Zen4 / Sapphire Rapids provide the same 4× ceiling as on Zen3 / Skylake for double-MAC? Tests on the benchmarking host should confirm.
- If the measured speedup is memory-bandwidth-limited rather than ALU-limited, the AVX-512 follow-up has diminishing returns and we should instead invest in the
img_cachepre-allocation follow-up to cut the 40–50 MB/frame allocator pressure. Settle this in the AVX2 PR's benchmark section. - SSIM's scalar
_iqa_decimateat 36 % self-time is a parallel T3 workstream — single-kernel, but with properKBND_SYMMETRICreflection (unlike_iqa_convolve's interior-only scope). The kernel specialisation here could potentially be reused.
Related¶
- ADR: ADR-0138
- Predecessor ADR: ADR-0125
- Predecessor digest: Research-0008
- PR: VMAFx/vmafx#TBD
- Profile artifacts:
build/profiles/2026-04-20/ms_ssim_1080p_cpu.callgrind(and.svg,_topN.txt).