Skip to content

ADR-0243: enable_lcs MS-SSIM extras on CUDA + Vulkan

  • Status: Accepted
  • Date: 2026-04-29
  • Deciders: Lusoris (user), Claude (agent)
  • Tags: cuda, vulkan, gpu, metrics, ms-ssim, fork-local

Context

The float_ms_ssim extractor's enable_lcs option (defined in core/src/feature/float_ms_ssim.c) emits 15 extra per-scale metrics — float_ms_ssim_{l,c,s}_scale{0..4} — on top of the combined Wang-product score. The fork's GPU twins (float_ms_ssim_cuda from ADR-0190 / PR #157; float_ms_ssim_vulkan from ADR-0190 / PR #141) shipped the combined score only — both deferred enable_lcs to a follow-up (see the file-header comment at integer_ms_ssim_cuda.c:29 and the option help-text reading "(reserved; not yet implemented in the GPU path)" on the Vulkan side).

Per user direction 2026-04-28 (T7-35): implement, do not de-advertise. The kernels already produce the per-scale l_means[i], c_means[i], s_means[i] doubles — the vert_lcs CUDA kernel literally has "lcs" in its name; the Vulkan vertical pass emits 3 per-WG partials for the same triple. The combine step on host then forms the Wang product. The 15 metrics are therefore already computed; only the feature_collector_append calls were missing.

Decision

We extend the CUDA and Vulkan MS-SSIM extractors to honour enable_lcs by gating 15 additional vmaf_feature_collector_append calls on the existing per-scale L/C/S means. No kernel changes; no shader changes; no new device buffers. The default-path output (enable_lcs=false) stays bit-identical to the pre-T7-35 binary.

The Vulkan option help text is rewritten to drop the "(reserved; not yet implemented)" caveat. The CUDA file header comment is updated to remove the "v1 does NOT implement enable_lcs" deferral. A new pseudo-feature float_ms_ssim_lcs is added to the cross- backend gate (scripts/ci/cross_backend_vif_diff.py and the matrix gate in cross_backend_parity_gate.py); it pins the 16 emitted metrics at places=4 against the CPU reference per the existing float_ms_ssim contract from ADR-0190.

The SYCL MS-SSIM twin (integer_ms_ssim_sycl.cpp) does not currently expose enable_lcs (its options table is empty). It therefore stays out-of-scope for this ADR; if SYCL ever adopts the option-bool, the same wiring applies (the SYCL kernel already computes the same per-scale means).

Alternatives considered

Option Pros Cons Why not chosen
A. Gate 15 host-side feature_collector_append calls on the existing enable_lcs bool (chosen) No kernel changes; default path bit-identical; one ADR, ~30 LOC None of substance This is the trivial extension — the GPU vert kernel already emits L/C/S means; only the host-side emission was missing.
B. Add 15 separate device readback buffers (one per metric) for parity with the CPU's metric-by-metric compute_ms_ssim API Conceptual symmetry with CPU l_scores[] / c_scores[] / s_scores[] arrays 15× the D2H bandwidth; allocates ~60 MB of pinned host buffers at 4K; redundant — the per-WG-block partials already reduce to per-scale means at host Wasteful and not faster; the per-scale double accumulator already runs every frame.
C. Treat LCS as a separate feature extractor (float_ms_ssim_lcs_cuda / _vulkan) Cleaner registration; one extractor = one set of metrics Forces a second pyramid + intermediates allocation; doubles VRAM; breaks API parity (CPU is one extractor with an option, not two) API-parity with CPU is a hard constraint per ADR-0190.
D. Order the emitted metric names metric-wise ({l_scale0..4, c_scale0..4, s_scale0..4}) vs scale-wise ({l_scale0, c_scale0, s_scale0, l_scale1, ...}) Either ordering works Metric-wise matches CPU float_ms_ssim.c:189 — that's what consumers (pip install meson-python) see today We chose metric-wise to mirror the CPU emission order; downstream JSON consumers see identical key ordering across all backends.

Consequences

  • Positive:
  • The Vulkan option help text no longer lies — enable_lcs is now real on the GPU.
  • Cross-backend gate (per ADR-0214 / ADR-0125) now covers all 16 MS-SSIM metrics, not just the combined score; future LCS regressions on the GPU side surface immediately.
  • No measurable cost when enable_lcs=false (the bool is checked once per frame; the kernels are unchanged).
  • The CPU/CUDA/Vulkan triplet stays API-symmetric — one extractor, one option, same metric names.
  • Negative:
  • SYCL stays asymmetric (no enable_lcs exposed) until follow-up work; documented in features.md.
  • Cross-backend matrix gate gains one cell (float_ms_ssim_lcs × {vulkan, cuda, sycl-skipped}); CI cost is ~1 extra second per PR for the lavapipe lane.
  • Neutral / follow-ups:
  • SYCL integer_ms_ssim_sycl.cpp should grow the enable_lcs option in a follow-up PR (T7-35 SYCL coda); the kernel already has the per-scale means.
  • When the parity matrix gate gets a CUDA / hardware-Vulkan self-hosted runner, the float_ms_ssim_lcs cell becomes enforcing rather than advisory.

References