ADR-0589: Metal float_ssim option parity (enable_lcs, enable_db, clip_db, scale)
- Status: Accepted
- Date: 2026-05-16
- Deciders: lusoris, Claude (Anthropic)
- Tags: metal, ssim, option-parity, apple-silicon, kernel, fork-local
Context
The Metal float_ssim_metal extractor shipped in ADR-0421 with a hardwired two-pass Gaussian+SSIM kernel that produces a single per-frame SSIM score. The CPU reference (float_ssim.c) exposes four user-facing options:
- enable_lcs: emit per-frame luminance (float_ssim_l), contrast (float_ssim_c), and structure (float_ssim_s) sub-scores alongside the composite SSIM.
- enable_db: convert the SSIM score to decibels as -10·log10(1−SSIM).
- clip_db: clamp the dB output to a finite maximum derived from frame dimensions and bit depth.
- scale: decimation scale factor (0=auto, 1=no downscaling).
HIP (float_ssim_hip.c, ADR-0374) added scale only. Neither HIP nor Metal had enable_lcs, enable_db, or clip_db. An audit of the Metal extractor against the CPU reference identified this as a gap that makes the Metal path unsuitable as a drop-in for downstream consumers that rely on the float_ssim_l / float_ssim_c / float_ssim_s sub-scores or the dB output form.
Decision
Extend float_ssim_metal.mm and the companion float_ssim.metal kernel to reach full CPU option parity:
- Add enable_lcs, enable_db, clip_db, and scale to the VmafOption array in the host .mm.
- Extend float_ssim_vert_combine in the Metal shader to accept an lcs_flags uniform and an lcs_parts output buffer (3 × partials_count floats for L, C, S partial sums). When lcs_flags == 0 the kernel falls through the existing single-sum path with negligible extra ALU cost.
- Apply enable_db / clip_db host-side in collect_fex_metal after the partial-sum reduction, identical to the CPU convert_to_db() helper.
- Validate scale: v1 rejects scale > 1 at init time with -EINVAL and a log message, matching the HIP twin's posture.
- Extend provided_features[] to declare float_ssim_l, float_ssim_c, float_ssim_s (conditionally emitted when enable_lcs == true).
The kernel change adds one [[buffer(5)]] (lcs_parts) and one [[buffer(6)]] (lcs_flags) argument to float_ssim_vert_combine. When enable_lcs == false the host binds the dummy par_buf to slot 5 and passes lcs_flags = 0; the kernel short-circuits the LCS path with a branch on the uniform, adding zero output writes and negligible ALU overhead.
Alternatives considered
- Host-only LCS decomposition: decompose l/c/s from the aggregated SSIM partial sums post-collection. Not feasible: the reduction discards the intermediate per-pixel mu and sigma values needed for the sub-score formulae; the decomposition would be numerically different from the pixel-level computation.
- Separate LCS kernel pass: third dispatch writing per-pixel l/c/s, then a fourth reduction pass. Two extra dispatches per frame; the current approach adds zero dispatches (the LCS sums are accumulated in the same float_ssim_vert_combine threadgroup reduction that already runs).
- Match HIP (skip LCS, add scale only): keeps parity with HIP but leaves the Metal path below CPU parity. Rejected: the user request is full CPU parity, and the kernel change is contained.
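The case against host-only decomposition can be illustrated with a toy example in C. The values and names below are invented for illustration. They show two hypothetical frames whose summed composite scores match while their luminance sums differ, so the aggregated composite alone cannot determine the sub-scores.

```c
#include <stddef.h>

/* Hypothetical per-pixel sub-scores. */
typedef struct { double l, c, s; } Lcs;

/* Sum of per-pixel composites l * c * s: all a composite-only reduction keeps. */
static double composite_sum(const Lcs *px, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += px[i].l * px[i].c * px[i].s;
    return sum;
}

/* Sum of the per-pixel luminance term, which needs the per-pixel values. */
static double luminance_sum(const Lcs *px, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += px[i].l;
    return sum;
}
```

Taking frame A as {(1.0, 0.5, 1.0), (1.0, 1.0, 1.0)} and frame B as {(0.5, 1.0, 1.0), (1.0, 1.0, 1.0)}, both composite sums are 1.5, yet the luminance sums are 2.0 and 1.5.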
References
- req: "audit float_ssim_metal.mm; if missing features (enable_db/enable_lcs/clip_db etc per PR #969 pattern), add them"
- ADR-0421 — Metal first kernel batch (T8-1c through T8-1j)
- ADR-0374 — HIP float_ssim (scale-only parity precedent)
- ADR-0453 — PSNR enable_chroma GPU parity (same option-parity pattern)