Research digest 0015: SSIMULACRA 2 SIMD ports (T3-1 + T3-2)
- Status: Active (captures the decision path for ADR-0161)
- Related ADRs: ADR-0130 (scalar port, prior art), ADR-0161 (this PR), ADR-0139 (per-lane double reduction), ADR-0141 (NOLINT scope)
The question
How do we vectorise the scalar SSIMULACRA 2 pipeline across AVX2 + AVX-512 + NEON under the fork's byte-for-byte bit-exactness contract, given that two of the six hot kernels call transcendentals (cbrtf, powf) that vector libm cannot match byte-exact, and one kernel (fast_gaussian_1d) is a serial IIR recurrence?
Scalar profile
Per-frame call counts in extract:
| Kernel | Calls / frame | Vectorizable cleanly? |
|---|---|---|
| picture_to_linear_rgb | 2 | Matmul yes, but srgb_to_linear uses powf |
| linear_rgb_to_xyb | 12 (2 × 6 scales) | Matmul yes; cbrtf per-lane scalar |
| multiply_3plane | 18 | Trivial |
| blur_3plane (IIR) | 30 | Horizontal pass: serial; vertical pass: per-column parallel |
| ssim_map | 6 | Yes (pointwise SIMD + scalar tail reduction) |
| edge_diff_map | 6 | Yes |
| downsample_2x2 | 10 | Yes |
The IIR blur dominates real frame-level cost (30 × 3 planes × two 1D passes). The pointwise kernels are individually cheaper but together take ~40% of the non-IIR frame time.
Key design axes
Axis 1: bit-exactness strategy for cbrtf
Options:
- Per-lane scalar libm inside SIMD loop (picked). Spill vector to aligned scratch, apply scalar cbrtf, reload. Byte-identical to scalar by construction. Cost: a few loads / stores per 8/16/4 lanes; the surrounding matmul + rescale still vectorises.
- Polynomial approximation in SIMD. ~3-8× the throughput, but typically 1-2 ULP drift vs scalar libm, which would require a tolerance ADR. Rejected: SSIMULACRA 2 doesn't have a Netflix golden pinning today (ADR-0130 deferred T3-3 snapshot gate); opening tolerance now compounds verification debt.
- Keep the whole kernel scalar. Simpler but forgoes the matmul + XYB-rescale speedup on 12 calls/frame.
Option 1 is clean, bit-exact, and still delivers a meaningful fraction of the SIMD win on linear_rgb_to_xyb (matmul is 3 dot-products per pixel; cbrtf is 3 libm calls — the SIMD vs scalar delta is on the matmul plumbing, not the transcendental).
Axis 2: summation order for matmul and downsample
IEEE-754 addition is not associative. Scalar expressions like kM00 * r + kM01 * g + kM02 * b + kOpsinBias parse left-to-right as ((a + b) + c) + d. Naïve SIMD pairing, (a + b) + (c + d), can round differently and drift by 1 ULP.
Fix: mirror the scalar chain exactly. Three sequential SIMD add_ps ops per 8/16/4-lane vector. Extra instructions, zero correctness risk. The mismatch was caught on the first integration run by the bit-exact test; rerunning the test once locally saved a CI round-trip.
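A small worked case makes the ordering issue concrete. The sketch below uses hypothetical function names and plain C floats in place of SIMD lanes; the input values are chosen for illustration and do not come from the SSIMULACRA 2 pipeline.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar order: ((a + b) + c) + d, one rounding after each add. */
float sum_chain(float a, float b, float c, float d)
{
    float t = a + b;
    t = t + c;
    t = t + d;
    return t;
}

/* Pairwise order: (a + b) + (c + d). Same operands, different rounding points. */
float sum_paired(float a, float b, float c, float d)
{
    float lo = a + b;
    float hi = c + d;
    return lo + hi;
}

/* Lane-wise mirror of the chain: three sequential adds per element, which is
 * what three back-to-back vector adds do across a register. */
void sum_chain_lanes(const float *a, const float *b, const float *c,
                     const float *d, float *out, size_t n)
{
    assert(a && b && c && d && out);
    for (size_t i = 0; i < n; i++) {
        float t = a[i] + b[i];
        t = t + c[i];
        out[i] = t + d[i];
    }
}
```

With a = 16777216.0f (2^24) and b = c = d = 1.0f, the chain returns 16777216.0f because each added 1 is lost to round-to-nearest-even, while the paired form returns 16777218.0f because 1 + 1 = 2 survives. At that magnitude the float spacing is 2, so the two orders differ by exactly 1 ULP.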
Axis 3: reduction pattern for ssim_map / edge_diff_map
Scalar reduces per-pixel double contributions into running double accumulators, using (double) casts at each float-to-double site. Reducing in float lanes first and lifting the partial sums to double afterwards changes the summation tree, so the result is no longer bit-exact.
Picked: per-lane scalar tail. SIMD computes the pointwise float arithmetic (mu1*mu1, mu2*mu2, etc.), spills the intermediate vectors to aligned scratch, then the scalar inner loop iterates over 8 (AVX2) / 16 (AVX-512) / 4 (NEON) lanes consuming the same (double) casts and accumulator variables as the scalar reference. Matches ADR-0139 verbatim.
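The difference between the picked pattern and the rejected float-lane reduction can be sketched as follows. This is a simplified stand-in, assuming a single float product per pixel rather than the real ssim_map arithmetic; the function names are hypothetical and an 8-float array models one AVX2 register.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 }; /* models one AVX2 register of floats */

/* Scalar reference: float product, widened, added to a running double. */
double sum_products_scalar(const float *a, const float *b, size_t n)
{
    assert(a && b);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        float p = a[i] * b[i];
        acc += (double)p;
    }
    return acc;
}

/* Picked pattern: pointwise float math per block (vector in a real port),
 * then a scalar loop over the spilled lanes, in index order, into the same
 * double accumulator. */
double sum_products_lane_tail(const float *a, const float *b, size_t n)
{
    assert(a && b);
    double acc = 0.0;
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        float scratch[LANES]; /* stands in for the aligned spill buffer */
        for (size_t l = 0; l < LANES; l++)
            scratch[l] = a[i + l] * b[i + l];
        for (size_t l = 0; l < LANES; l++)
            acc += (double)scratch[l];
    }
    for (; i < n; i++) {
        float p = a[i] * b[i];
        acc += (double)p;
    }
    return acc;
}

/* Rejected pattern: accumulate in float lanes, widen only at the end. */
double sum_products_float_lanes(const float *a, const float *b, size_t n)
{
    assert(a && b);
    float lane[LANES] = {0.0f};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            lane[l] += a[i + l] * b[i + l];
    double acc = 0.0;
    for (size_t l = 0; l < LANES; l++)
        acc += (double)lane[l];
    for (; i < n; i++) {
        float p = a[i] * b[i];
        acc += (double)p;
    }
    return acc;
}
```

The lane-tail version visits pixels in the same order and widens at the same point as the scalar loop, so the two agree bit for bit. The float-lane version adds pixel i to pixel i + 8 in single precision first, which is where the summation tree diverges.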
Axis 4: 2×2 downsample deinterleave
Scalar pairs up 4 input pixels per output; SIMD wants to process 8/16/4 output lanes per iter. Deinterleave even/odd positions via:
- AVX2: vshufps (imm 0x88 / 0xDD) + vpermpd (imm 0xD8) to straighten cross-lane.
- AVX-512: vpermt2ps with explicit index vectors (cleaner, fewer ops).
- NEON: vuzp1q_f32 / vuzp2q_f32 (direct deinterleave).
All three preserve scalar's ((r0_even + r0_odd) + r1_even) + r1_odd summation order.
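The deinterleave-then-chain structure can be sketched in portable C. This is an illustration under assumptions: the function names are hypothetical, a 4-float array models one NEON register of outputs, the input dimensions are taken to be even, and the final 0.25 scale is assumed rather than taken from the source.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 }; /* models one NEON float32x4_t of output pixels */

/* Scalar reference: ((r0_even + r0_odd) + r1_even) + r1_odd, then scale. */
void downsample_2x2_scalar(const float *in, size_t w, size_t h, float *out)
{
    assert(in && out && w % 2 == 0 && h % 2 == 0);
    const size_t ow = w / 2;
    for (size_t oy = 0; oy < h / 2; oy++) {
        const float *r0 = in + (2 * oy) * w;
        const float *r1 = r0 + w;
        for (size_t ox = 0; ox < ow; ox++) {
            float s = ((r0[2 * ox] + r0[2 * ox + 1]) + r1[2 * ox]) + r1[2 * ox + 1];
            out[oy * ow + ox] = s * 0.25f;
        }
    }
}

/* Blocked version: split each row into even / odd lanes (the job done by
 * vuzp1q_f32 / vuzp2q_f32 or the x86 shuffles), then add in scalar order. */
void downsample_2x2_blocked(const float *in, size_t w, size_t h, float *out)
{
    assert(in && out && w % 2 == 0 && h % 2 == 0);
    const size_t ow = w / 2;
    for (size_t oy = 0; oy < h / 2; oy++) {
        const float *r0 = in + (2 * oy) * w;
        const float *r1 = r0 + w;
        size_t ox = 0;
        for (; ox + LANES <= ow; ox += LANES) {
            float r0e[LANES], r0o[LANES], r1e[LANES], r1o[LANES];
            for (size_t l = 0; l < LANES; l++) { /* deinterleave */
                r0e[l] = r0[2 * (ox + l)];
                r0o[l] = r0[2 * (ox + l) + 1];
                r1e[l] = r1[2 * (ox + l)];
                r1o[l] = r1[2 * (ox + l) + 1];
            }
            for (size_t l = 0; l < LANES; l++) { /* three sequential adds */
                float s = r0e[l] + r0o[l];
                s = s + r1e[l];
                s = s + r1o[l];
                out[oy * ow + ox + l] = s * 0.25f;
            }
        }
        for (; ox < ow; ox++) { /* scalar tail, same expression */
            float s = ((r0[2 * ox] + r0[2 * ox + 1]) + r1[2 * ox]) + r1[2 * ox + 1];
            out[oy * ow + ox] = s * 0.25f;
        }
    }
}
```

A 14-wide input gives 7 outputs per row, so one full 4-lane block and a 3-pixel tail are both exercised.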
Axis 5: what NOT to vectorise in this PR
Deliberately scalar in this PR:
- fast_gaussian_1d + blur_plane: serial recurrence on horizontal pass; per-column IIR state on vertical pass (SIMD-able but requires a separate column-batch rewrite). Biggest single-PR scope bump if included.
- picture_to_linear_rgb: powf per lane would be the same spill/reload trick as cbrtf, but Y/U/V unpack + BT.709 matmul + clamp + sRGB EOTF is a 50-line vectorised loop; a follow-up PR buys better review granularity.
Follow-up PR targets: (1) IIR blur vertical-pass column batching — biggest wallclock ROI. (2) YUV → linear RGB. (3) Optional: fast_gaussian_1d per-pole parallelism via 3-wide state vector.
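To illustrate why the vertical pass is SIMD-able while the horizontal pass is not, the sketch below advances several columns' IIR state in lock-step per input row. It is a deliberately simplified stand-in: a one-pole recurrence y[n] = a * x[n] + b * y[n - 1] replaces the real fast_gaussian_1d filter, the function names are hypothetical, and an 8-element state array models one AVX2 register.

```c
#include <assert.h>
#include <stddef.h>

enum { COLS = 8 }; /* columns advanced together; models one AVX2 register */

/* Reference: walk each column top to bottom with its own scalar state. */
void iir_vertical_per_column(const float *in, size_t w, size_t h,
                             float a, float b, float *out)
{
    assert(in && out);
    for (size_t x = 0; x < w; x++) {
        float state = 0.0f;
        for (size_t y = 0; y < h; y++) {
            state = a * in[y * w + x] + b * state;
            out[y * w + x] = state;
        }
    }
}

/* Column-batched: the recurrence is still serial in y, but COLS independent
 * columns share each step, so the inner loop maps onto vector lanes. */
void iir_vertical_batched(const float *in, size_t w, size_t h,
                          float a, float b, float *out)
{
    assert(in && out);
    size_t x0 = 0;
    for (; x0 + COLS <= w; x0 += COLS) {
        float state[COLS] = {0.0f};
        for (size_t y = 0; y < h; y++) {
            for (size_t c = 0; c < COLS; c++) {
                state[c] = a * in[y * w + x0 + c] + b * state[c];
                out[y * w + x0 + c] = state[c];
            }
        }
    }
    /* Leftover columns fall back to the per-column walk. */
    for (; x0 < w; x0++) {
        float state = 0.0f;
        for (size_t y = 0; y < h; y++) {
            state = a * in[y * w + x0] + b * state;
            out[y * w + x0] = state;
        }
    }
}
```

Each column sees the same operations in the same order in both versions. A bit-exact port would additionally need the compiler to treat the multiply-add identically in both paths (for example, the same fused multiply-add contraction setting), which this sketch does not enforce.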
Verification plan
- Unit test test_ssimulacra2_simd: bit-exact vs inline scalar reference on 5 SIMD kernels with reproducible xorshift32 inputs at W=33 × H=21 (deliberately non-multiple of 8/16/4 to exercise tails).
- Runtime dispatch: AVX-512 overrides AVX2 on x86; NEON on aarch64; scalar fallback when no SIMD is available.
- Cross-backend check: planned for a follow-up with the IIR blur port since blur is where scalar drift would first compound.
- CI: Netflix CPU Golden Tests doesn't currently exercise ssimulacra2 (ADR-0130 is a fork extension), so the bit-exact test is the gate.
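The input generation and comparison side of such a test could look roughly like this. It is a sketch, not the contents of test_ssimulacra2_simd: the helper names fill_plane and planes_bit_exact are hypothetical, and mapping the generator output to [0, 1) is an assumption about a reasonable input range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Marsaglia xorshift32 (shifts 13, 17, 5). State must be non-zero. */
uint32_t xorshift32(uint32_t *state)
{
    assert(state && *state != 0u);
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill a plane with reproducible floats in [0, 1). The top 24 bits of each
 * draw convert to float exactly, so the same seed gives the same plane on
 * every platform. */
void fill_plane(float *buf, size_t n, uint32_t seed)
{
    assert(buf);
    uint32_t s = seed ? seed : 1u; /* zero is a fixed point of xorshift */
    for (size_t i = 0; i < n; i++)
        buf[i] = (float)(xorshift32(&s) >> 8) * (1.0f / 16777216.0f);
}

/* Bit-exact comparison: memcmp on the raw bytes, so even +0.0 vs -0.0 or
 * differing NaN payloads count as a mismatch. Returns 1 on exact match. */
int planes_bit_exact(const float *a, const float *b, size_t n)
{
    assert(a && b);
    return memcmp(a, b, n * sizeof *a) == 0;
}
```

A test would fill a 33 × 21 plane once, run the scalar reference and the SIMD kernel on it, and require planes_bit_exact to return 1 on the two outputs.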
Outcome
Shipped as simd/ssimulacra2-avx2-avx512-neon branch → PR (TBD). 15 new kernel functions across three SIMD TUs + 1 test harness + 2 meson edits + 1 dispatch helper in the scalar TU + 5 docs updates.
Open questions / follow-ups
- IIR blur vectorisation: plan a separate PR focused on blur_plane vertical-pass column batching (4 / 8 / 16 columns' IIR state advancing in lock-step per input row).
- powf per-lane in picture_to_linear_rgb: duplicates the cbrtf pattern; worth ~2% of frame time; not urgent.
- AVX-512 vpermt2ps vs AVX2 vshufps + vpermpd: the AVX-512 form is cleaner; consider refactoring the AVX2 code to match if a future kernel needs the same deinterleave (blur column gather, maybe).
- SVE / SVE2 port: same deferral as ADR-0160; revisit when native SVE2 CI hardware is routine.
- T3-3 snapshot gate: separate PR; gated on tools/ssimulacra2 availability.