Research-0008: MS-SSIM decimate SIMD — FLOP accounting, summation order, bit-exactness¶
- Status: Active
- Workstream: ADR-0125
- Last updated: 2026-04-20
Question¶
Can the MS-SSIM decimate step be accelerated on x86 (AVX2 + AVX-512) while still producing byte-identical float output to a scalar reference, without modifying the vendored BSD-2011 Tom Distler core/src/feature/iqa/ subtree? What summation order does the scalar reference need to use for that to hold?
Sources¶
core/src/feature/ms_ssim.c— defines the 9×9 2-D LPFg_lpf(lines 35–54) and the separable 1-D formsg_lpf_h/g_lpf_v(lines 56–60). The 1-D coefficients are already present but unused by_iqa_decimate.core/src/feature/iqa/decimate.c— scalar reference: calls_iqa_filter_pixelonce per destination pixel with the 2-D kernel; invokes the kernel-wide function-pointer boundary handlerKBND_SYMMETRIC.core/src/feature/iqa/convolve.c—_iqa_filter_pixelimplementation; confirms 81 multiply-adds per output pixel for a 9×9 kernel.core/src/feature/iqa/ssim_tools.c— existing AVX2 / AVX-512 dispatch pattern for SSIM primitives via_iqa_ssim_set_dispatch. This is the template the MS-SSIM decimate dispatch follows.- Existing
core/src/feature/x86/*_avx2.cfiles — established convention: one kernel per file,<feature>_avx{2,512}.hdeclares the entry point, dispatch lives in the caller. - Rouse, D. & Hemami, S. — Analyzing the Role of Visual Structure in the Recognition of Natural Image Content with Multi-Scale SSIM. Cited in the
ms_ssim.ccomment block next to the 9/7 wavelet LPF definition. - Wang, Simoncelli, Bovik — Multi-scale structural similarity for image quality assessment. The "Wang" path in
_compute_msssimthat the MS-SSIM* ("Rouse") path replaced.
Findings¶
FLOP accounting¶
- 2-D kernel (current path): 9 × 9 = 81 multiply-adds per destination pixel. At factor=2 on 1920 × 1080 input, the first scale produces 960 × 540 = 518 400 destination pixels × 81 MACs = 42 M FMAs per scale. Count in source-image-pixel units:
20.25 × w × hMACs. - Separable with stride-2 on horizontal pass (chosen form): horizontal pass produces
w_out × houtputs (factor-2 subsampling applied during the horizontal pass only), each with 9 MACs →w_out × h × 9 = 4.5 × w × hMACs. Vertical pass then producesw_out × h_outoutputs with 9 MACs each →w_out × h_out × 9 = 2.25 × w × hMACs. Total:6.75 × w × hMACs — a ~3× FLOP reduction vs. the 2-D form, not 9× as initially overstated (earlier draft double-counted the stride-2 savings). - Why not full stride-2 on both passes? The vertical pass needs all source rows, not just even-indexed rows, because the 9-tap vertical kernel at
y_out * 2reads rowsy_out*2 - 4 .. y_out*2 + 4, which spans both even and odd source rows. So horizontal pass must produce output for every source rowy, just subsampled inx. No further reduction is safe. - SIMD lane multiplier: AVX2 ymm (8-lane float32) and AVX-512 zmm (16-lane float32) parallelise across destination columns in both passes. Inner-region speedup vs. scalar-separable: ~8× (AVX2), ~16× (AVX-512). Total vs. scalar 2-D: ~24× (AVX2), ~48× (AVX-512) on the inner region.
- Amdahl: end-to-end MS-SSIM speedup depends on decimate's runtime share. The ~40% figure cited in the ADR is unverified; verified with
perf recordduring implementation and this digest gets updated. If decimate is only 20% of MS-SSIM wall time, the end-to-end gain caps near 1.24× even with AVX-512.
Summation-order problem (bit-exactness)¶
- The scalar 2-D path evaluates row-major:
sum = sum + g_lpf[ky][kx] * img[y+ky][x+kx]forky=0..8, kx=0..8. 81 accumulations in row-major order. - A separable scalar path evaluates
tmp[kx] = sum_over_kx(g_lpf_h[kx] * img[y+ky][x+kx])per row, thensum = sum_over_ky(g_lpf_v[ky] * tmp[ky]). Same coefficients mathematically (under the assumption thatg_lpf[i][j] == g_lpf_h[i] * g_lpf_v[j], which is true by construction for a separable wavelet — verified inline byms_ssim.cdefining both), but the summation order differs. - IEEE-754 addition is not associative. The two orders produce outputs that may differ by ≤ 1 ULP per pixel. That is a real numerical difference, not a bug — it just means the 2-D scalar output cannot be the oracle for the SIMD path.
- Resolution: bit-exactness is enforced between the SIMD path (AVX2 and AVX-512) and a new scalar-separable reference, not against the 2-D scalar. The scalar-separable reference replaces the 2-D path as the decimate oracle for MS-SSIM on CPU; MS-SSIM end-to-end scores in
testdata/scores_cpu_ms_ssim.jsonare refreshed once and then locked against all three implementations.
Invariant set (SIMD fast-path precondition)¶
The MS-SSIM call site is the only consumer of _iqa_decimate in tree. That call site's kernel configuration is:
factor == 2k->w == k->h == LPF_LEN == 9k->normalized == 1k->bnd_opt == KBND_SYMMETRICk->kernel_handk->kernel_vboth non-NULL (MS-SSIM sets them before calling)img,resultnon-overlapping (guaranteed by_alloc_buffersinms_ssim.c)
Any caller violating these invariants falls back to the scalar _iqa_decimate. The dispatching function checks the full invariant set before branching into SIMD; the SIMD kernel itself then asserts the invariants (VMAF_ASSERT_DEBUG) for defence in depth (Power-of-10 §5).
Boundary handling¶
KBND_SYMMETRICmirrors around the border: forx < 0, it returnsimg[y][-x]; forx >= w, it returnsimg[y][2w-2-x].- For the inner region where the 9-tap kernel fits entirely in-bounds (columns
4..w-5, rows4..h-5), no boundary handling is needed and the SIMD path runs at full speed. - For the 4-pixel border on each side, a boundary-aware variant of the horizontal pass is needed. Three options:
- Run the scalar-separable path on the border rows/columns, SIMD only on the inner region. Simplest; small border cost amortises to negligible on 1080p.
- Emit a mirrored scratch row with explicit
_mm256_blendv_psreflection. Higher complexity; marginal speed-up. - Pre-pad the input with 4 mirrored columns on each side in a scratch buffer. Simple but allocates.
- Chosen (pending implementation): option 1 — scalar border, SIMD inner. Matches the approach used in
core/src/feature/x86/adm_avx2.cand keeps the bit-exactness story simple: the border uses the same scalar-separable reference the tests compare against.
Prior art / existing SIMD convolution patterns in this tree¶
core/src/feature/x86/adm_avx2.cdoes a separable 5-tap horizontal + vertical convolution for the ADM CSF filter. Same pattern, shorter kernel. The horizontal pass uses_mm256_fmadd_pswith pre-broadcast coefficients; the vertical pass accumulates 8 rows at a time.core/src/feature/x86/integer_adm_avx512.chas the 16-lane AVX-512 variant of the same pattern. Direct structural template for our AVX-512 decimate.core/src/feature/common/convolution.chas a generic scalar separable convolution but it is integer-typed (uint16_tinputs). Not directly reusable for thefloatdecimate path.
Expected speed-up¶
Back-of-envelope (subject to benchmark verification):
- Scalar 2-D: baseline.
- Scalar separable: ~4× faster (FLOP reduction + better cache behaviour on the vertical pass).
- AVX2 separable: ~8× faster than scalar separable on the inner region (8-lane float32), ~30× vs. baseline.
- AVX-512 separable: ~14× faster than scalar separable inner region (16-lane), ~50× vs. baseline.
End-to-end MS-SSIM wall-time improvement depends on decimate-share of total runtime; profiling will update this digest.
Alternatives explored¶
- Modify
_iqa_decimateitself in the vendored iqa/ file. Rejected per ADR-0125: only one in-tree caller, and mixing SIMD intrinsics into BSD-2011 Tom Distler code creates rebase friction if Netflix ever re-syncs iqa/ from upstream. - Keep 2-D kernel, SIMD the 81-MAC inner loop directly. 81 is not a clean SIMD width; would need 9× 9-lane register tiles with masking. Same MAC count as scalar; no FLOP saving, just a lane multiplier. Strictly worse than the separable form.
- Precompute a 2-D kernel
_mm256broadcast table at module init. Saves the coefficient load but not the 81-MAC count. Not worth the extra static data if we switch to the separable form anyway. - Use
_mm256_dp_ps(horizontal dot-product intrinsic). High latency (~11 cycles Skylake), only 4-lane-wide, not available in AVX-512 in a useful form. Consistently slower than explicit FMA reductions on all measured μarchs.
Open questions¶
- Does the border-scalar, inner-SIMD split meet the Power-of-10 §5 assertion density bar (≥ 1
VMAF_ASSERT_DEBUGper function ≥ 20 LOC)? Expected yes — precondition asserts at dispatch entry already cover the invariants. Confirm withscripts/ci/assertion-density.shduring implementation. - Is the observed decimate-share actually ~40 % of MS-SSIM wall time on the fork-added benchmark YUVs, or is that figure from a different codebase? Verify with
perf record -g -- ./build/core/tools/vmaf_bench --feature ms_ssim …before/after and record numbers here. - Does AVX-512 deliver the expected 14× over scalar separable, or does the downclocking penalty (Skylake-X behaviour) flatten the curve on short kernels? May need to gate AVX-512 behind a runtime benchmark rather than pure
vmaf_get_cpu()detection.
Related¶
- ADR-0125 — the decision this digest supports.
- ADR-0106 — written-first ADR rule (this digest is iteration-time context for the ADR).
- ADR-0108 — the rule requiring this digest.
- ADR-0012 — Power-of-10 §5 assertion density, applied to new SIMD files.
- Sibling SIMD implementations in
core/src/feature/x86/— structural templates. - PR for this workstream: TBD (opened after scalar-separable, AVX2, and AVX-512 land on
feat/ms-ssim-decimate-simd).