Research 0090 — T3-15 GPU coverage long-tail batch 4: actual gap re-audit + per-kernel ordering¶
- Date: 2026-05-09
- Author: Lusoris, Claude (Opus 4.7)
- Tracking: backlog T3-15 (replaces T3-17 + T3-18)
- Companion ADR(s): ADR-0192 (long-tail terminus, Status: Accepted, frozen per ADR-0028); follow-up ADR for chroma-CUDA TBD when the first port lands.
- Tags:
gpu,cuda,sycl,vulkan,feature-extractor,chroma,cambi,motion3,audit
Why this digest exists¶
The backlog row for T3-15 in .workingdir2/BACKLOG.md (line 328) carries a 2026-04-28 description that is materially stale as of 2026-05-09. Two of its three sub-rows are already mostly closed, and a third ((c) motion3) was closed in its default mode by ADR-0219. Before opening any implementation PR we need an audit that re-states the real gap, otherwise the next agent will re-port kernels that already exist.
This digest is the source-of-truth gap re-audit; the BACKLOG row will be updated in the implementation PR(s) that close each sub-row.
What the BACKLOG row claims vs. what the tree actually has¶
| Sub-row claim (BACKLOG.md line 328) | Tree reality (verified 2026-05-09) |
|---|---|
| (a) "CUDA/SYCL ship integer adm/vif/motion only — 3 of ~19 metrics" | False. core/src/feature/cuda/ ships 16 kernels, core/src/feature/sycl/ ships 14. Per metrics-backends-matrix.md lines 66-91, every classical metric except cambi already has CUDA + SYCL twins. The actual remaining (a)-class gap is cambi × {CUDA, SYCL} = 2 kernels. |
| (b) "PSNR Vulkan v1 + CUDA both luma-only because picture_vulkan / picture_cuda upload paths can't carry chroma" | Half-stale. Vulkan PSNR went chroma-complete in PR #204 / ADR-0216 (chroma claim re: picture_vulkan was wrong on inspection — it's a generic byte allocator). CUDA picture_cuda.c already loops over all 3 planes (lines 44-48). Real gap: the per-extractor provided_features lists in integer_psnr_cuda.c:216 and integer_psnr_sycl.cpp:244 still claim {"psnr_y", NULL} and dispatch only luma. |
| (c) "motion3 (5-frame window) GPU coverage on Vulkan + CUDA + SYCL" | Half-stale. motion3_score in 3-frame mode (the default, the only mode any shipped model uses) is emitted by all three GPU backends — see integer_motion_cuda.c:354-360, integer_motion_sycl.cpp:496-502, motion_vulkan.c:23-29 and ADR-0219. What's deferred is motion_five_frame_window=true mode, which all three backends correctly reject with -ENOTSUP. ADR-0219's own §Decision explicitly accepts that deferral (no shipped model uses 5-frame; the option exists for ad-hoc CLI tuning only). |
Real gap inventory (corrected)¶
The actual open kernel count is 8, not 16. Ordered for ascending implementation risk (mirrors ADR-0192's "ascending complexity" pattern):
| # | Kernel | Backend | Closest existing twin (reference) | Est. LOC | Risk | Unblocks |
|---|---|---|---|---|---|---|
| 1 | psnr_cuda chroma extension | CUDA | psnr_vulkan.c (ADR-0216) — same algo, port host scaffolding | ~120 | Low | chroma-PSNR matrix cell + (3) + (5) ↓ |
| 2 | psnr_sycl chroma extension | SYCL | psnr_vulkan.c (ADR-0216) + integer_psnr_sycl.cpp v1 | ~120 | Low | chroma-PSNR matrix cell + (4) + (6) ↓ |
| 3 | ssim_cuda chroma extension | CUDA | float_ssim_cuda.c (luma) + chroma loop pattern from (1) | ~150 | Low | chroma-SSIM matrix cell |
| 4 | ssim_sycl chroma extension | SYCL | float_ssim_sycl.cpp (luma) + chroma loop pattern from (2) | ~150 | Low | chroma-SSIM matrix cell |
| 5 | ms_ssim_cuda chroma extension | CUDA | float_ms_ssim_cuda.c (luma) + (3) | ~180 | Med | chroma-MS-SSIM matrix cell |
| 6 | ms_ssim_sycl chroma extension | SYCL | float_ms_ssim_sycl.cpp (luma) + (4) | ~180 | Med | chroma-MS-SSIM matrix cell |
| 7 | cambi_cuda | CUDA | cambi_vulkan.c Strategy II hybrid host/GPU per ADR-0210 | ~900 | High | matrix terminus on cambi |
| 8 | cambi_sycl | SYCL | cambi_vulkan.c Strategy II + (7) | ~900 | High | matrix terminus on cambi |
Optional / deferred (no implementation in this batch):
motion_five_frame_window=trueon Vulkan + CUDA + SYCL — needs 5-deep device-side blur ring buffer in 3 kernel languages. ADR-0219 §Context: "no shipped VMAF model uses this mode; the option exists for ad-hoc CLI tuning". Re-open only if a model surfaces that consumes it.
Ordering rationale¶
- PSNR chroma first (CUDA → SYCL). Smallest diff (extend the
provided_featuresarray, allocate per-plane SSE accumulators, dispatch the existing kernel three times against per-plane buffers). Mirrors the Vulkan landed pattern in ADR-0216 exactly. Establishes the chroma-extension idiom every later (3)-(6) kernel reuses. The CPU integer_psnr.c reference at lines 1-end already emitspsnr_y / psnr_cb / psnr_crunconditionally (clamping to luma-only on YUV400); parity is the existing CPU contract, not a new specification. - SSIM chroma second. Reuses the chroma-iteration pattern from (1) on a more complex per-plane reduction. Vulkan ssim_vulkan already has chroma support (ADR-0188/0189 pattern); CUDA + SYCL twins inherit the host scaffolding.
- MS-SSIM chroma third. Same pattern at one further scale; the multi-scale loop is per-plane outside the existing per-scale loop.
cambi× {CUDA, SYCL} last. Highest risk by 5×. ADR-0210 chose Strategy II (hybrid host/GPU split) for Vulkan because the precision-sensitive sliding histogram incalculate_c_valuesresists straightforward GPU porting; CUDA + SYCL ports must mirror that split or pick a different one. ADR-0205 (cambi GPU feasibility) is the prerequisite reading. Defer to a separate planning round once chroma is in — the cambi backlog is large enough to warrant its own ADR(s).
Per-kernel risk notes¶
- (1)–(2) PSNR chroma: bit-exactness vs CPU is the gate. CPU
integer_psnr.cuses int64 SSE accumulator →mse = sse / (w*h)→10 * log10(peak² / mse). The Vulkan twin (ADR-0216) holds places=4 vs CPU. CUDA + SYCL kernels should hold the same: SSE is integer,log10is host-side double, no FMA / fast-math regression vectors on the device kernel. - (3)–(6) SSIM/MS-SSIM chroma: existing per-scale FMA-contract notes for
_vulkan/_cuda/_syclapply to each chroma plane unchanged. Place=4 gate per ADR-0214. - (7)–(8) cambi: the host-side sliding histogram in Strategy II is what holds precision in the Vulkan port; CUDA + SYCL ports inherit the same ULP envelope by design. Risk is integration plumbing, not numerics.
Why pick gap #1 (CUDA PSNR chroma) as the proof-of-concept first port¶
- Smallest diff: ~120 LOC across
integer_psnr_cuda.c+ a handful ofprovided_features/ SSE-array refactor lines. No new shader / kernel — thecalculate_psnr_kernel_8bpc/_16bpcentry points are plane-agnostic identical to Vulkan'spsnr.comp. - Direct precedent: ADR-0216 already wrote the design for the Vulkan twin. The CUDA port translates the per-plane state arrays and the per-plane dispatch loop into CUDA's async submit/collect model.
- Validates the chroma-extension pattern that gaps #2-#6 all reuse. If CUDA PSNR chroma holds places=4 vs CPU on Netflix golden inputs, the path is clear for SYCL PSNR + chroma SSIM + chroma MS-SSIM as follow-up PRs.
- Doesn't compete with cambi: gap #7-#8 (cambi CUDA + SYCL) need their own Strategy II ADR; this digest deliberately defers that planning.
Cross-backend gate plan for the first port¶
scripts/ci/cross_backend_vif_diff.py (the canonical places=4 gate per ADR-0214) supports --feature psnr --backend cuda. Post-port:
python3 scripts/ci/cross_backend_vif_diff.py \
--feature psnr --backend cuda --places 4 \
--ref python/test/resource/yuv/src01_hrc00_576x324.yuv \
--dis python/test/resource/yuv/src01_hrc01_576x324.yuv \
--width 576 --height 324 --bpc 8 --pix-fmt yuv420p
The gate checks psnr_y, psnr_cb, psnr_cr independently against the CPU integer_psnr reference at places=4. If any plane fails, the kernel doesn't ship. Per memory feedback_no_test_weakening, places=4 is non-negotiable.
Netflix golden gate impact¶
Per CLAUDE.md §8, the three CPU golden tests (src01_hrc00↔hrc01, checkerboard 1px, checkerboard 10px) measure CPU integer_psnr outputs and never see the CUDA backend. The psnr_cuda extractor extension changes only what provided_features claims and how many dispatches per frame run — CPU output is untouched, so the golden gate is structurally unaffected. Verify post-build by running make test-netflix-golden against the rebuilt tree.
Out of scope for this digest's first port¶
- All seven follow-up kernels (#2-#8). Each is a separate PR after #1 lands and proves the pattern.
- The
motion_five_frame_window=true5-frame mode. ADR-0219 explicitly defers it. - HIP backend coverage of any chroma metric (T7-10 audit-first, ADR-0212 — on a separate planning track).
- Updating
metrics-backends-matrix.mdwith chroma-PSNR cells. The matrix already implies chroma support via the row labels; if a future maintainer wants per-plane tracking, that's a docs-only ADR.
References¶
- BACKLOG.md row T3-15 (line 328) — the stale row this digest re-audits.
.workingdir2/analysis/metrics-backends-matrix.md— coverage source of truth (correct as of 2026-04-29; the BACKLOG row is what's stale, not the matrix).- ADR-0192 — long-tail terminus; frozen per ADR-0028. A
### Status update 2026-05-09: T3-15 first port landed (CUDA PSNR chroma)may append to References when the first port merges, per CLAUDE.md §12 r8 freeze rule. - ADR-0210 — Strategy II hybrid for cambi (prerequisite reading for gaps #7-#8).
- ADR-0214 — places=4 gate.
- ADR-0216 — Vulkan PSNR chroma; this digest's first-port direct precedent.
- ADR-0219 — motion3 GPU coverage in 3-frame mode + 5-frame deferral.
core/src/feature/cuda/integer_psnr_cuda.clines 40-237 — the file the first port edits.core/src/feature/vulkan/psnr_vulkan.clines 1-475 — the chroma-extension template the CUDA port mirrors.core/src/feature/integer_psnr.c— CPU reference; defines thepsnr_y / psnr_cb / psnr_crcontract.- Per-user-direction (popup-equivalent task description, 2026-05-09): "Phase A: research digest + one proof-of-concept kernel". Implementation of the first port follows this digest's gap #1 ordering.