Research-0008: MS-SSIM decimate SIMD — FLOP accounting, summation order, bit-exactness¶

Status: Active
Workstream: ADR-0125
Last updated: 2026-04-20

Question¶

Can the MS-SSIM decimate step be accelerated on x86 (AVX2 + AVX-512) while still producing byte-identical float output to a scalar reference, without modifying the vendored BSD-2011 Tom Distler core/src/feature/iqa/ subtree? What summation order does the scalar reference need to use for that to hold?

Sources¶

core/src/feature/ms_ssim.c — defines the 9×9 2-D LPF g_lpf (lines 35–54) and the separable 1-D forms g_lpf_h / g_lpf_v (lines 56–60). The 1-D coefficients are already present but unused by _iqa_decimate.
core/src/feature/iqa/decimate.c — scalar reference: calls _iqa_filter_pixel once per destination pixel with the 2-D kernel; invokes the kernel-wide function-pointer boundary handler KBND_SYMMETRIC.
core/src/feature/iqa/convolve.c — _iqa_filter_pixel implementation; confirms 81 multiply-adds per output pixel for a 9×9 kernel.
core/src/feature/iqa/ssim_tools.c — existing AVX2 / AVX-512 dispatch pattern for SSIM primitives via _iqa_ssim_set_dispatch. This is the template the MS-SSIM decimate dispatch follows.
Existing core/src/feature/x86/*_avx2.c files — established convention: one kernel per file, <feature>_avx{2,512}.h declares the entry point, dispatch lives in the caller.
Rouse, D. & Hemami, S. — Analyzing the Role of Visual Structure in the Recognition of Natural Image Content with Multi-Scale SSIM. Cited in the ms_ssim.c comment block next to the 9/7 wavelet LPF definition.
Wang, Simoncelli, Bovik — Multi-scale structural similarity for image quality assessment. The "Wang" path in _compute_msssim that the MS-SSIM* ("Rouse") path replaced.

Findings¶

FLOP accounting¶

2-D kernel (current path): 9 × 9 = 81 multiply-adds per destination pixel. At factor=2 on 1920 × 1080 input, the first scale produces 960 × 540 = 518 400 destination pixels × 81 MACs = 42 M FMAs per scale. Count in source-image-pixel units: 20.25 × w × h MACs.
Separable with stride-2 on horizontal pass (chosen form): horizontal pass produces w_out × h outputs (factor-2 subsampling applied during the horizontal pass only), each with 9 MACs → w_out × h × 9 = 4.5 × w × h MACs. Vertical pass then produces w_out × h_out outputs with 9 MACs each → w_out × h_out × 9 = 2.25 × w × h MACs. Total: 6.75 × w × h MACs — a ~3× FLOP reduction vs. the 2-D form, not 9× as initially overstated (earlier draft double-counted the stride-2 savings).
Why not full stride-2 on both passes? The vertical pass needs all source rows, not just even-indexed rows, because the 9-tap vertical kernel at y_out * 2 reads rows y_out*2 - 4 .. y_out*2 + 4, which spans both even and odd source rows. So horizontal pass must produce output for every source row y, just subsampled in x. No further reduction is safe.
SIMD lane multiplier: AVX2 ymm (8-lane float32) and AVX-512 zmm (16-lane float32) parallelise across destination columns in both passes. Inner-region speedup vs. scalar-separable: ~8× (AVX2), ~16× (AVX-512). Total vs. scalar 2-D: ~24× (AVX2), ~48× (AVX-512) on the inner region.
Amdahl: end-to-end MS-SSIM speedup depends on decimate's runtime share. The ~40% figure cited in the ADR is unverified; verified with perf record during implementation and this digest gets updated. If decimate is only 20% of MS-SSIM wall time, the end-to-end gain caps near 1.24× even with AVX-512.

Summation-order problem (bit-exactness)¶

The scalar 2-D path evaluates row-major: sum = sum + g_lpf[ky][kx] * img[y+ky][x+kx] for ky=0..8, kx=0..8. 81 accumulations in row-major order.
A separable scalar path evaluates tmp[kx] = sum_over_kx(g_lpf_h[kx] * img[y+ky][x+kx]) per row, then sum = sum_over_ky(g_lpf_v[ky] * tmp[ky]). Same coefficients mathematically (under the assumption that g_lpf[i][j] == g_lpf_h[i] * g_lpf_v[j], which is true by construction for a separable wavelet — verified inline by ms_ssim.c defining both), but the summation order differs.
IEEE-754 addition is not associative. The two orders produce outputs that may differ by ≤ 1 ULP per pixel. That is a real numerical difference, not a bug — it just means the 2-D scalar output cannot be the oracle for the SIMD path.
Resolution: bit-exactness is enforced between the SIMD path (AVX2 and AVX-512) and a new scalar-separable reference, not against the 2-D scalar. The scalar-separable reference replaces the 2-D path as the decimate oracle for MS-SSIM on CPU; MS-SSIM end-to-end scores in testdata/scores_cpu_ms_ssim.json are refreshed once and then locked against all three implementations.

Invariant set (SIMD fast-path precondition)¶

The MS-SSIM call site is the only consumer of _iqa_decimate in tree. That call site's kernel configuration is:

factor == 2
k->w == k->h == LPF_LEN == 9
k->normalized == 1
k->bnd_opt == KBND_SYMMETRIC
k->kernel_h and k->kernel_v both non-NULL (MS-SSIM sets them before calling)
img, result non-overlapping (guaranteed by _alloc_buffers in ms_ssim.c)

Any caller violating these invariants falls back to the scalar _iqa_decimate. The dispatching function checks the full invariant set before branching into SIMD; the SIMD kernel itself then asserts the invariants (VMAF_ASSERT_DEBUG) for defence in depth (Power-of-10 §5).

Boundary handling¶

KBND_SYMMETRIC mirrors around the border: for x < 0, it returns img[y][-x]; for x >= w, it returns img[y][2w-2-x].
For the inner region where the 9-tap kernel fits entirely in-bounds (columns 4..w-5, rows 4..h-5), no boundary handling is needed and the SIMD path runs at full speed.
For the 4-pixel border on each side, a boundary-aware variant of the horizontal pass is needed. Three options:
Run the scalar-separable path on the border rows/columns, SIMD only on the inner region. Simplest; small border cost amortises to negligible on 1080p.
Emit a mirrored scratch row with explicit _mm256_blendv_ps reflection. Higher complexity; marginal speed-up.
Pre-pad the input with 4 mirrored columns on each side in a scratch buffer. Simple but allocates.
Chosen (pending implementation): option 1 — scalar border, SIMD inner. Matches the approach used in core/src/feature/x86/adm_avx2.c and keeps the bit-exactness story simple: the border uses the same scalar-separable reference the tests compare against.

Prior art / existing SIMD convolution patterns in this tree¶

core/src/feature/x86/adm_avx2.c does a separable 5-tap horizontal + vertical convolution for the ADM CSF filter. Same pattern, shorter kernel. The horizontal pass uses _mm256_fmadd_ps with pre-broadcast coefficients; the vertical pass accumulates 8 rows at a time.
core/src/feature/x86/integer_adm_avx512.c has the 16-lane AVX-512 variant of the same pattern. Direct structural template for our AVX-512 decimate.
core/src/feature/common/convolution.c has a generic scalar separable convolution but it is integer-typed (uint16_t inputs). Not directly reusable for the float decimate path.

Expected speed-up¶

Back-of-envelope (subject to benchmark verification):

Scalar 2-D: baseline.
Scalar separable: ~4× faster (FLOP reduction + better cache behaviour on the vertical pass).
AVX2 separable: ~8× faster than scalar separable on the inner region (8-lane float32), ~30× vs. baseline.
AVX-512 separable: ~14× faster than scalar separable inner region (16-lane), ~50× vs. baseline.

End-to-end MS-SSIM wall-time improvement depends on decimate-share of total runtime; profiling will update this digest.

Alternatives explored¶

Modify _iqa_decimate itself in the vendored iqa/ file. Rejected per ADR-0125: only one in-tree caller, and mixing SIMD intrinsics into BSD-2011 Tom Distler code creates rebase friction if Netflix ever re-syncs iqa/ from upstream.
Keep 2-D kernel, SIMD the 81-MAC inner loop directly. 81 is not a clean SIMD width; would need 9× 9-lane register tiles with masking. Same MAC count as scalar; no FLOP saving, just a lane multiplier. Strictly worse than the separable form.
Precompute a 2-D kernel _mm256 broadcast table at module init. Saves the coefficient load but not the 81-MAC count. Not worth the extra static data if we switch to the separable form anyway.
Use _mm256_dp_ps (horizontal dot-product intrinsic). High latency (~11 cycles Skylake), only 4-lane-wide, not available in AVX-512 in a useful form. Consistently slower than explicit FMA reductions on all measured μarchs.

Open questions¶

Does the border-scalar, inner-SIMD split meet the Power-of-10 §5 assertion density bar (≥ 1 VMAF_ASSERT_DEBUG per function ≥ 20 LOC)? Expected yes — precondition asserts at dispatch entry already cover the invariants. Confirm with scripts/ci/assertion-density.sh during implementation.
Is the observed decimate-share actually ~40 % of MS-SSIM wall time on the fork-added benchmark YUVs, or is that figure from a different codebase? Verify with perf record -g -- ./build/core/tools/vmaf_bench --feature ms_ssim … before/after and record numbers here.
Does AVX-512 deliver the expected 14× over scalar separable, or does the downclocking penalty (Skylake-X behaviour) flatten the curve on short kernels? May need to gate AVX-512 behind a runtime benchmark rather than pure vmaf_get_cpu() detection.

ADR-0125 — the decision this digest supports.
ADR-0106 — written-first ADR rule (this digest is iteration-time context for the ADR).
ADR-0108 — the rule requiring this digest.
ADR-0012 — Power-of-10 §5 assertion density, applied to new SIMD files.
Sibling SIMD implementations in core/src/feature/x86/ — structural templates.
PR for this workstream: TBD (opened after scalar-separable, AVX2, and AVX-512 land on feat/ms-ssim-decimate-simd).