ADR-0139: SSIM SIMD accumulate bit-exact to scalar via per-lane scalar double¶
- Status: Accepted
- Date: 2026-04-21
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, performance, bit-exact
Context¶
The fork ships AVX2 and AVX-512 SIMD paths for SSIM and MS-SSIM at core/src/feature/x86/ssim_avx2.c and core/src/feature/x86/ssim_avx512.c, introduced in commit 81fcd42e. Upstream Netflix/vmaf has AVX2 / AVX-512 for VIF / ADM / motion / CAMBI but no SSIM SIMD at all — the SSIM SIMD surface is entirely fork-local. PR #18 (f082cfd3) tightened float accumulations across the fork and claimed SSIM SIMD was bit-identical to scalar.
During the ADR-0138 convolve work, a CLI XML diff at --precision max on Netflix src01_hrc00/01_576x324 and checkerboard_1920_1080_10_3_0_0/1_0 showed a residual divergence between scalar and SIMD at the 8th decimal on float_ms_ssim:
| Fixture | Scalar | AVX2 / AVX-512 | Δ |
|---|---|---|---|
| checkerboard frame 0 | 0.89452544363435227 | 0.89452545888605872 | ≈ 1.5e-8 |
| checkerboard frame 1 | 0.93540067693522488 | 0.93540069286261140 | ≈ 1.6e-8 |
float_ssim was already bit-identical across scalar / AVX2 / AVX-512 on these fixtures. Isolating the dispatch setters confirmed the divergence sits in ssim_accumulate_*, not in _iqa_convolve_* or ssim_precompute_* / ssim_variance_*:
ssim_precompute_*is three pure elementwise float multiplies (ref*ref,cmp*cmp,ref*cmp). Vector_mm256_mul_pson the same inputs is bit-identical to scalara*b. No reduction.ssim_variance_*is an elementwise subtract + clamp-to-zero. Again pure float, no reduction. Bit-identical.ssim_accumulate_*is where the divergence lives.
Reading scalar ssim_accumulate_default_scalar in ssim_tools.c:174-205:
float srsc = sqrtf(ref_sigma_sqd[i] * cmp_sigma_sqd[i]);
double lv = (2.0 * ref_mu[i] * cmp_mu[i] + C1) /
(ref_mu[i] * ref_mu[i] + cmp_mu[i] * cmp_mu[i] + C1);
double cv = (2.0 * srsc + C2) / (ref_sigma_sqd[i] + cmp_sigma_sqd[i] + C2);
float csb = (sigma_both[i] < 0.0f && srsc <= 0.0f) ? 0.0f : sigma_both[i];
double sv = (csb + C3) / (srsc + C3); /* float div → promoted */
ssim_sum += lv * cv * sv; /* double * double * double */
Two type-promotion rules govern bit-exactness here:
2.0 * ref_mu[i]is double. The literal2.0is adoubleunder C89/C99;double * floatpromotes the float to double before the multiply. Scalar thus computes thelandcnumerators in double precision.double sv = float_exprpromotes the float quotient to double at assignment, andlv * cv * svruns as three double multiplies.
The prior SIMD code computed l, c, and s all as __m256 / __m512 float (_mm256_mul_ps(v2, rm*cm), then _mm256_div_ps), then _mm256_mul_ps(l*c) * s before storing to an aligned float temp and accumulating into double. That collapses the double-path numerators and the double-path triple product into float-path operations — visible as ~0.13 float-ULP drift after summation over a 1920 × 1080 grid at 5 pyramid scales.
Two fork rules shape the fix:
- CLAUDE.md §8 freezes the Netflix CPU golden
assertAlmostEqual(places=N)asserts.places=3for SSIM andplaces=4for MS-SSIM left ~0.13 ULP below the assertion threshold, so the prior SIMD still passed the golden gate — but the drift was visible at--precision maxand violated the fork's stated "AVX2/AVX-512 float paths match scalar bit-for-bit" claim from PR #18. - ADR-0108 requires a documented rationale for any numerical fix that alters SIMD output.
Decision¶
Rewrite ssim_accumulate_avx2 and ssim_accumulate_avx512 to compute the float-valued intermediates (srsc, l_den, c_den, clamped_sb, sv_float) in SIMD vector float — these are trivially bit-exact against scalar — and then do the double-precision numerator, double division, and double triple product per-lane in scalar double inside an inner for (k = 0; k < LANES; k++) loop that matches scalar C's type promotions exactly:
/* SIMD (float): srsc, l_den, c_den, sv_float, clamp mask. */
const __m256 srsc = _mm256_sqrt_ps(_mm256_mul_ps(rs, cs));
const __m256 l_den = ((rm*rm + cm*cm) + vC1); /* float */
const __m256 c_den = ((rs + cs) + vC2); /* float */
const __m256 sv_f = _mm256_div_ps(clamped_sb + vC3, srsc + vC3);
/* store to aligned temps, then scalar double per lane */
for (int k = 0; k < 8; k++) {
const double lv = (2.0 * t_rm[k] * t_cm[k] + C1) / t_l_den[k]; /* double */
const double cv = (2.0 * t_srsc[k] + C2) / t_c_den[k]; /* double */
const double sv = t_sv[k]; /* f32→f64 */
local_ssim += lv * cv * sv;
local_l += lv;
local_c += cv;
local_s += sv;
}
Reduction order (lane 0 → lane LANES-1 within each block) matches scalar's for i in 0..n exactly, so the running-sum associativity is byte-identical. AVX-512 uses 16 lanes per block with _Alignas(64) 16-element temp buffers; AVX2 uses 8 lanes with _Alignas(32) 8-element temps. Both keep the same scalar tail for n % LANES.
ssim_precompute_* and ssim_variance_* keep their current pure SIMD float paths — they are already bit-exact by virtue of being non-reducing elementwise float ops.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Per-lane scalar double reduction (chosen) | Matches scalar C type promotions exactly; preserves SIMD speedup on the compute-heavy float ops (srsc, divisions, clamp); simple to reason about | Loses some parallelism on the double ops; 8 / 16 scalar divides per SIMD block | Decision — only pattern that preserves scalar rounding without redesigning the caller |
Full SIMD double path (__m256d / __m512d) with all ops in double | Theoretically preserves bit-exactness in-lane | Requires widening every input to double, doubling register pressure and cache pressure; the 2.0 * rm * cm + C1 promotion rule still needs float→double widening of rm, cm first before each use; effective lane count halves (4 / 8) so throughput is similar to the chosen design | Rejected — more complex, not obviously faster |
| FMA in double after widening | Fewer rounding points per tap | FMA collapses mul + add into one rounding, diverging from scalar's two-step sum += a*b | Rejected — bit-exactness lost |
| Revert: disable the SSIM SIMD dispatch entirely | Guaranteed bit-exact (falls to scalar); ~20-40% MS-SSIM CPU regression | Gives up the fork's SSIM SIMD feature | Rejected — user popup 2026-04-21 chose "fix bit-exact now" over "revert" and "keep as-is" |
| Keep as-is, document the ULP budget | Zero code churn | Contradicts PR #18's "bit-identical" claim; drift visible at --precision max; complicates future SIMD ports that want to compare against scalar | Rejected — once known, not fixing it is a regression of the stated invariant |
Consequences¶
- Positive:
float_ssimandfloat_ms_ssimare now byte-for-byte identical between scalar, AVX2, and AVX-512 at--precision maxon both the Netflix normal pair and the checkerboard pair. The fork's "SIMD float paths match scalar bit-for-bit" claim (PR #18) holds on the SSIM surface too. Netflix goldenassertAlmostEqual(places=N)asserts unchanged (they already passed before; they still pass). - Negative: The SIMD inner loop does 8 / 16 scalar doubles per block instead of vector float for the final reduction stage. Expected perf impact is small because the compute-heavy ops (srsc, divisions) stay vectorised, but the bench should confirm. Added ~30 LoC per SIMD file (temp buffers + scalar reduction loop) against ~20 LoC removed.
- Neutral / follow-ups:
core/src/feature/AGENTS.mdrebase invariant: any future upstream-port that changesssim_accumulate_default_scalarmust preserve the2.0 *double-precision literal and thefloat → doublepromotion ofsv— both are load-bearing for the SIMD bit-exactness match. If upstream changes the scalar, both SIMD variants need a matching change.CHANGELOG.md"lusoris fork" entry under "Fixed".docs/rebase-notes.mdentry documenting the per-lane scalar reduction pattern so rebase conflicts onssim_accumulate_*are resolved in favour of the fork's bit-exact pattern.-
Reproducer (for PR description):
vmaf --cpumask 255 ... --feature float_ms_ssim --precision max -o scalar.xml vmaf --cpumask 16 ... --feature float_ms_ssim --precision max -o avx2.xml vmaf ... --feature float_ms_ssim --precision max -o avx512.xml diff <(grep -v '<fyi fps' scalar.xml) <(grep -v '<fyi fps' avx2.xml) # empty diff <(grep -v '<fyi fps' scalar.xml) <(grep -v '<fyi fps' avx512.xml) # empty
References¶
- Source: user popup (2026-04-21) — "Fix to bit-exact now (extend this PR)" chosen over revert / dedicated-follow-up / keep-as-is after the convolve PR investigation surfaced the residual ~0.13 float-ULP drift on MS-SSIM.
- Related ADRs: ADR-0138 — companion bit-exact convolve fast path. ADR-0108 — six deep-dive deliverables apply.
- Scalar reference:
ssim_accumulate_default_scalarinssim_tools.c. - Prior attempt at bit-exactness: PR #18 (
f082cfd3) "SIMD bit-identical reductions + CI fixes".
Status update 2026-05-08: Accepted¶
Audited as part of the 2026-05-08 ADR Proposed sweep (Research-0086).
Acceptance criteria verified in tree at HEAD 0a8b539e:
core/src/feature/x86/ssim_avx2.{c,h}andssim_avx512.{c,h}carry the per-lane scalar-double reduction pattern.- ADR-0140 codifies the reduction macro (
SIMD_PER_LANE_SCALAR_DOUBLE_REDUCE_AVX2/AVX512) and cites this ADR as the load-bearing rationale for the inline form. - Verification command:
ls core/src/feature/x86/ssim_avx2.{c,h} core/src/feature/x86/ssim_avx512.{c,h}.