Research-0012: SSIM SIMD bit-exactness to scalar — where the ULP drifted¶
- Status: Active
- Workstream: ADR-0139
- Last updated: 2026-04-21
Question¶
The fork's AVX2 + AVX-512 ssim_accumulate_* (fork-local, commit 81fcd42e) were tightened in PR #18 (f082cfd3) with the stated goal of being bit-identical to the scalar reference. A CLI XML diff at --precision max on Netflix and checkerboard pairs surfaced a residual ~0.13 float-ULP divergence on float_ms_ssim (8th decimal) between scalar and SIMD. float_ssim, ssim_precompute_* and ssim_variance_* were already bit-exact; the drift is localised to ssim_accumulate_*. Where does the ULP come from, and what is the minimal rewrite to drive it to zero without giving up SIMD?
Sources¶
- Scalar reference:
ssim_accumulate_default_scalarinssim_tools.c(lines 174–205). - Prior SIMD:
ssim_accumulate_avx2/ssim_accumulate_avx512as of commitf28d853e. - Upstream baseline: Netflix/vmaf
origin/masterhas no SSIM SIMD at all — AVX2 / AVX-512 for VIF / ADM / motion / CAMBI only.git log upstream/master -- core/src/feature/x86/ssim_*→ empty. First SSIM SIMD touch is fork commit81fcd42e. - Measurement:
vmaf --cpumask 255 ...(all SIMD blocked — scalar path)vmaf --cpumask 16 ...(block AVX-512 bit; AVX2 active)vmaf(native — AVX-512 active)--precision maxforces%.17gIEEE-754 round-trip printing.- Relevant ADRs: ADR-0138 (companion convolve fix), ADR-0125 (bit-exactness ground rule for SIMD paths), CLAUDE.md §8 (Netflix golden gate).
Findings¶
Measured drift¶
| Fixture (1920×1080 checkerboard, frame 0) | Scalar | SIMD (AVX2 == AVX-512) | Δ |
|---|---|---|---|
float_ssim | 0.93335253000259399 | 0.93335253000259399 | 0 |
float_ms_ssim f0 | 0.89452544363435227 | 0.89452545888605872 | ≈ 1.525e-8 |
float_ms_ssim f1 | 0.93540067693522488 | 0.93540069286261140 | ≈ 1.593e-8 |
float_ssim is bit-identical; float_ms_ssim drifts ~1.5e-8 (about 0.13 float-ULP on a value near 1.0). The drift is present on AVX2 and AVX-512 identically — so it's a pattern problem, not a 16-lane-vs-8-lane reduction issue.
Why precompute / variance were already fine¶
// ssim_precompute_*_scalar
ref_sq[i] = ref[i] * ref[i];
cmp_sq[i] = cmp[i] * cmp[i];
ref_cmp[i] = ref[i] * cmp[i];
// ssim_variance_*_scalar
ref_sigma_sqd[i] -= ref_mu[i] * ref_mu[i];
cmp_sigma_sqd[i] -= cmp_mu[i] * cmp_mu[i];
ref_sigma_sqd[i] = MAX(0.0, ref_sigma_sqd[i]);
cmp_sigma_sqd[i] = MAX(0.0, cmp_sigma_sqd[i]);
sigma_both[i] -= ref_mu[i] * cmp_mu[i];
Both are pure elementwise float — no horizontal reduction, no cross-type mixing. _mm256_mul_ps / _mm512_mul_ps on the same inputs produce the same float result as scalar a*b under IEEE 754. _mm256_sub_ps same for subtraction. _mm256_max_ps is bit-exact. Nothing to fix.
Where accumulate went wrong¶
Scalar ssim_accumulate_default_scalar:
float srsc = sqrtf(ref_sigma_sqd[i] * cmp_sigma_sqd[i]);
double lv = (2.0 * ref_mu[i] * cmp_mu[i] + C1) / // <── DOUBLE numerator
(ref_mu[i] * ref_mu[i] + cmp_mu[i] * cmp_mu[i] + C1);
double cv = (2.0 * srsc + C2) / // <── DOUBLE numerator
(ref_sigma_sqd[i] + cmp_sigma_sqd[i] + C2);
float csb = (sigma_both[i] < 0.0f && srsc <= 0.0f) ? 0.0f : sigma_both[i];
double sv = (csb + C3) / (srsc + C3); // float → double on assign
ssim_sum += lv * cv * sv; // DOUBLE × DOUBLE × DOUBLE
Two C type-promotion rules are load-bearing:
2.0is a Cdoubleliteral.2.0 * ref_mu[i]promotes the floatref_mu[i]to double before the multiply. The full numerators oflvandcvare evaluated in double.double sv = <float expr>promotes the float division result to double at assignment.lv * cv * svthen runs as three double multiplies.
Prior SIMD computed everything as vector float:
// prior (drifted)
__m256 v2 = _mm256_set1_ps(2.0f); // <── FLOAT 2.0f
__m256 l_num = _mm256_mul_ps(v2, _mm256_mul_ps(rm, cm)) + vC1; // float numerator
__m256 l = _mm256_div_ps(l_num, l_den); // float l
... // same for c and s
__m256 ssim_val = _mm256_mul_ps(_mm256_mul_ps(l, c), s); // float × float × float
// store, then accumulate to double
for (k=0..7) local_ssim += (double)t_ssim[k];
Two bit-exactness breaks: the 2.0f numerator is single-precision, not double, and the l*c*s product runs in single-precision, not double. Both together are where the ~0.13 float-ULP comes from, amplified over ~10 M accumulations across 5 pyramid scales.
Fix pattern¶
Split the reduction: keep SIMD for the float-valued intermediates that are already bit-exact against scalar, and move the double-precision numerator + divide + triple product into a scalar inner loop over lanes. Per-lane reduction order 0..7 (or 0..15) matches scalar's for i in 0..n running sum byte-for-byte.
// SIMD (float — bit-exact vs scalar):
__m256 srsc = _mm256_sqrt_ps(rs * cs); // sqrtf(rs*cs)
__m256 l_den = ((rm*rm + cm*cm) + vC1); // float
__m256 c_den = ((rs + cs) + vC2); // float
__m256 sv_f = (clamped_sb + vC3) / (srsc + vC3);
// store rm, cm, srsc, l_den, c_den, sv_f to aligned temps.
// Scalar double per lane (bit-exact vs scalar):
for (int k = 0; k < LANES; k++) {
double lv = (2.0 * t_rm[k] * t_cm[k] + C1) / t_l_den[k];
double cv = (2.0 * t_srsc[k] + C2) / t_c_den[k];
double sv = t_sv[k]; // float → double
local_ssim += lv * cv * sv;
local_l += lv;
local_c += cv;
local_s += sv;
}
LANES = 8 for AVX2 (_Alignas(32) float[8] temps), 16 for AVX-512 (_Alignas(64) float[16] temps). Scalar tail for n % LANES unchanged.
Why not "full SIMD double" instead¶
Alternative considered: widen everything to __m256d / __m512d and do the 2.0 multiply + divide + triple product entirely in vector double. Dropped because:
- Every input needs
_mm256_cvtps_pdwidening at use — for2.0 * rm * cm, we'd still need to widenrmandcmseparately, then multiply in double. Thefloat → doublepromotion is free in scalar C; in SIMD it's an explicit widening op. - Effective lane count halves (
__m256d= 4 lanes,__m512d= 8 lanes) so throughput parity with the chosen per-lane-scalar design is a wash — both spend LANES scalar-equivalent cycles on the numerator/divide/product work per SIMD block. - Double divide is ~20-30 cycles latency and serialises per lane anyway on modern cores; gaining nothing from SIMD.
Why not FMA¶
Scalar 2.0 * ref_mu[i] * cmp_mu[i] + C1 has two rounding points: one at each * and one at the final +. FMA collapses the last mul + add into a single round — which is closer to the infinitely precise result, but different from scalar. Bit-exactness is not the same as "more accurate". Scalar wins by definition here.
Verification¶
After the rewrite:
$ diff <(grep -v '<fyi fps' /tmp/scalar.xml) \
<(grep -v '<fyi fps' /tmp/avx2.xml)
$ diff <(grep -v '<fyi fps' /tmp/scalar.xml) \
<(grep -v '<fyi fps' /tmp/avx512.xml)
$ # Both empty → bit-identical at --precision max.
Verified on:
- Netflix canonical pair
src01_hrc00_576x324↔src01_hrc01_576x324 - Checkerboard 1-pixel shift (
checkerboard_1920_1080_10_3_0_0↔..._1_0)
Both float_ssim and float_ms_ssim now match scalar byte-for-byte across scalar / AVX2 / AVX-512 dispatch.
Open questions / follow-ups¶
- NEON parity? The fork ships
ssim_neon.cwith the same three dispatch entry points. The same pattern applies — this research digest covers x86 only. Defer to a follow-up: confirm NEON is bit-exact at--precision maxon an Apple Silicon or aarch64 Linux host and apply the same fix if not. - Perf impact. 8 / 16 scalar doubles per SIMD block in the reduction inner loop. Hot ops (sqrt, divides, clamps) stay vectorised. Expected perf delta is small (single-digit percent) but should be measured before the next release.
- Assertion precision. The fork's bit-exactness guarantee at
--precision maxis stronger than Netflix's goldenplaces=Nasserts need. The golden gate was never in danger (0.13 ULP <places=3/places=4thresholds), but the--precision maxCLI tool is the surface where users actually see the drift.
References¶
- Source: user popup (2026-04-21) — "Fix to bit-exact now (extend this PR)".
- ADR: ADR-0139.
- Companion: ADR-0138 and Research-0011.
- Prior attempt: PR #18 (
f082cfd3) "SIMD bit-identical reductions CI fixes".