ADR-0190: float_ms_ssim Vulkan kernel — 5-level pyramid + Wang product on host¶
- Status: Accepted
- Date: 2026-04-27
- Deciders: Lusoris, Claude (Anthropic)
- Tags: vulkan, gpu, feature-extractor, fork-local, places-4
Context¶
ADR-0188 scopes batch 2 as ssim → ms_ssim → psnr_hvs. ssim (Vulkan + CUDA + SYCL) landed in PR #139 + PR #140 with empirical places=4, max_abs = 1.0e-6. ms_ssim picks up where ssim left off.
The active CPU float_ms_ssim extractor (float_ms_ssim.c) runs the Wang multi-scale variant:
picture_copy→ float ref/cmp at[0, 255]per ssim's contract.- Build a 5-level pyramid with 2× decimation between levels via a 9-tap 9/7 biorthogonal separable low-pass filter (
ms_ssim_decimate_scalar/ms_ssim_decimate_avx2/ms_ssim_decimate_avx512/ms_ssim_decimate_neon). Coefficients are bit-exact across the SIMD paths — the rebase-sensitive invariant inms_ssim_decimate.cpins them tog_lpf_h/g_lpf_vinms_ssim.c. - Per scale (5×): run SSIM with the 11-tap Gaussian window, but emit three separate aggregates —
l_mean,c_mean,s_mean— instead of just the combined SSIM mean. - Wang combine (host-side):
MS-SSIM = ∏_{i=0..4} l[i]^α[i] · c[i]^β[i] · s[i]^γ[i]with the upstream-Netflix coefficients: α = {0.0000, 0.0000, 0.0000, 0.0000, 0.1333}β = {0.0448, 0.2856, 0.3001, 0.2363, 0.1333}γ = {0.0448, 0.2856, 0.3001, 0.2363, 0.1333}Note: α[0..3] = 0, sol[i]contributes only at scale 4.
Each scale's l, c, s formulas (from ssim_tools.c ssim_accumulate_default_scalar):
l = (2·μxμy + C1) / (μx² + μy² + C1)
c = (2·sqrt(σx²·σy²) + C2) / (σx² + σy² + C2)
s = (clamped(σxy) + C3) / (sqrt(σx²·σy²) + C3)
with C1 = (0.01·255)², C2 = (0.03·255)², C3 = C2/2. The clamped(σxy) step handles the ref == cmp flat-region edge case where σxy can drift slightly negative via float rounding while sqrt(σx²·σy²) is zero.
Decision¶
Ship float_ms_ssim_vulkan as two new GLSL shaders + a ~700-LOC host orchestrator:
Shaders¶
ms_ssim_decimate.comp— 9-tap 9/7 biorthogonal LPF + 2× downsample. Single 2D dispatch reading a 9×9 source neighbourhood per output pixel. Mirror boundary handling matchesms_ssim_decimate_scalar'sms_ssim_decimate_mirrorhelper (period-2n reflection that handles sub-kernel-radius inputs the iqa single-reflect form leaves out of bounds — per ADR-0125).
Output dimensions: w_out = (w/2) + (w&1), h_out = (h/2) + (h&1). Bit-exact constants: {0.026727, -0.016828, -0.078201, 0.266846, 0.602914, 0.266846, -0.078201, -0.016828, 0.026727}.
ms_ssim.comp— same shape asssim.comp(PR #139), but the vertical pass emits three per-WG partials instead of one:l_partials[wg_idx]— sum of per-pixellc_partials[wg_idx]— sum of per-pixelcs_partials[wg_idx]— sum of per-pixels
Horizontal pass is a verbatim copy of ssim.comp's pass 0; the only divergence is the vertical-pass per-pixel formula and the 3-output reduction. Per-pixel math matches ssim_accumulate_default_scalar line-for-line including the σxy clamp.
Host orchestrator: ms_ssim_vulkan.c¶
init:
picture_copy()-style float buffer for ref + cmp at full res
allocate pyramid: 5 ref + 5 cmp float buffers, sized by halving
allocate intermediate buffers for the SSIM compute (5 × float
× max_pyramid_W × max_pyramid_H)
allocate 3 × partials buffers for max_wg_count
extract per frame:
upload ref + cmp at level 0
for scale 0..4:
if scale > 0:
run decimate (level scale-1 → scale)
run ssim horizontal (level `scale`)
run ssim vertical + l/c/s combine (level `scale`)
readback l/c/s partials, sum on host in `double`
l[scale] = total_l / pixel_count
c[scale] = total_c / pixel_count
s[scale] = total_s / pixel_count
msssim = product over scales of pow(l, α) · pow(c, β) · pow(s, γ)
emit float_ms_ssim
Precision contract¶
Target places=4 on the 576×324 cross-backend gate fixture. The 5-level pyramid only goes 4 deep at this resolution (min_dim = 11 << 4 = 176; 576 / 16 = 36 ≥ 11), so the input satisfies the ADR-0153 floor.
Each scale's l, c, s reduction follows the same per-WG-float
- host-double pattern as ssim_vulkan. The Wang product combine runs entirely on the host in
double, so the per-pixelpow(x, α)calls don't accumulate float ULPs across scales.
If empirical exceeds places=4, relax to places=3 per the ciede / ssim precedent.
v1 enable_lcs deferred¶
The CPU's enable_lcs option emits 15 additional metrics (float_ms_ssim_l_scale0..4, _c_scale0..4, _s_scale0..4). v1 does not implement this — the GPU pipeline still computes the underlying l/c/s values per scale, so adding a flag-toggled vmaf_feature_collector_append per metric is mechanical. Defer to a focused follow-up.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Reuse ssim_vulkan's ssim.comp and add a 3-output combine pass that reads the same intermediates | Maximum code reuse | Requires a 4-pipeline / 4-pass design (horiz + ssim_combine + l_combine + c_combine + s_combine) — more memory bandwidth, more pipelines | A single 3-output combine pass is cleaner and the duplicated horizontal pass is ~50 LOC of GLSL |
| Compute MS-SSIM on the host by calling SSIM-on-GPU 5× | Zero new GPU code beyond ssim_vulkan | Per-scale needs l/c/s separately, not just combined SSIM. Would require either modifying ssim_vulkan's kernel (breaking its places=4 ssim_vulkan contract) or computing l/c/s analytically from SSIM (impossible — irreducible decomposition) | Not viable |
| Single 1-dispatch fused kernel (decimate + horiz + vert + combine) | Lowest dispatch count | One mega-kernel with bookkeeping for 5 scales; tile-based shared memory across scales is impractical (per-scale workgroup sizes differ) | 2 shaders × 5 scales = 11 dispatches per frame is fine at 576×324 / 1080p (compute-bound); fusion is profile-led |
| Defer to ms_ssim CUDA + SYCL bundle, skip Vulkan first | Smaller scope | Breaks the ADR-0188 ordering precedent | Vulkan first scaffolds the CUDA + SYCL ports; same as batch 1 + ssim |
Consequences¶
- Positive: matches CPU
float_ms_ssimatplaces=4on the gate fixture (subject to empirical measurement). Reuses the per-WG-float + host-double precision pattern from ssim. No new public C-API surface — the extractor registers asfloat_ms_ssim_vulkanunderHAVE_VULKAN. - Negative: 11 dispatches per frame (5 decimates + 5 horiz 5 vert+combine, rounded down because scale 0 skips decimate). Higher submission overhead than ssim's 2 dispatches; not noticeable at 1080p but worth a v2 fusion pass if a profiling pass shows submission overhead > 10% of per-frame time.
- Neutral / follow-ups:
- Batch 2 part 2b — DONE.
float_ms_ssim_cudashipped in the batch 2 parts 2b + 2c bundle (sibling PR). Three CUDA kernels (decimate, horiz, vert_lcs) mirror the GLSL shaders byte-for-byte modulo language differences. picture_copy normalisation runs host-side via a 2D D2H of the pitched device plane (cuMemcpy2DAsynchonouringsrcPitch = stride[0]) — surfaced a bring-up bug where the naïvecuMemcpyDtoHAsyncofwidth·height·bpcbytes mis-copied row N≥1 becausecuMemAllocPitchreturnsstride[0] ≥ width·bpc. Per-block float partials reduced on host indouble. Empirical: 48 frames at 576×324 on RTX 4090 →max_abs = 1.0e-6,0/48 places=4 mismatches. - Batch 2 part 2c — DONE.
float_ms_ssim_syclshipped alongside part 2b. Self-contained submit/collect (mirrorsssim_sycl). Host-pinned USM staging holds the picture_copy-normalised float planes;nd_range<2>vert+lcs kernel usessycl::reduce_over_group× 3 for per-WG float partials. fp64-free (Arc A380). Empirical: 48 frames at 576×324 on Arc A380 →max_abs = 1.0e-6,0/48 places=4 mismatches. enable_lcsmode (deferred) — 15 extra metrics. Trivial follow-up once the gate is green.
Verification¶
- 48 frames at 576×324 on Intel Arc A380 + Mesa anv vs CPU scalar: target
max_abs ≤ 5e-5(places=4), measured perscripts/ci/cross_backend_vif_diff.py --feature float_ms_ssim --backend vulkan --places 4. - New CI step
float_ms_ssim cross-backend diff (CPU vs Vulkan/lavapipe)in.github/workflows/tests-and-quality-gates.yml. - If empirical floor exceeds
places=4, relax toplaces=3with a note here (mirrors ciede / ssim precedents).
References¶
- Parent: ADR-0188 — batch 2 scope.
- Sibling: ADR-0189 — ssim Vulkan.
- Pyramid math: ADR-0125 — 9/7 biorthogonal LPF coefficients + mirror semantics.
- Min-dim guard: ADR-0153 — 176×176 minimum input for the 5-level pyramid + 11-tap Gaussian.
- CPU references:
float_ms_ssim.c,ms_ssim.c,ms_ssim_decimate.c,iqa/ssim_tools.c.