ADR-0946: SYCL kernel coverage round 3 (float family + PSNR-HVS)
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris, Claude
- Tags: sycl, test, gpu, parity, kernel-coverage
Context
Rounds 1 (PR #351) and 2 (PR #376) closed the bulk of the integer SYCL parity gap surfaced by the 2026-05-30 GPU-backend kernel-coverage audit. Round 1 added test_sycl_psnr_parity + test_sycl_vif_parity. Round 2 added test_sycl_adm_parity + test_sycl_ciede_parity + test_sycl_ssim_parity + test_sycl_ms_ssim_parity + test_sycl_motion_v2_parity. Pre-PR coverage was 12 % (~2 of 17 SYCL extractors); after rounds 1+2 it stood at ~47 % (8 of 17). The remaining 9 uncovered SYCL extractors include the entire float family (float_adm_sycl, float_vif_sycl, float_psnr_sycl, float_motion_sycl, float_moment_sycl), the integer PSNR-HVS (psnr_hvs_sycl), the SpEED kernels (speed_chroma_sycl, speed_temporal_sycl), and ssimulacra2_sycl.
The float family is the highest-value remaining target: it carries independent kernel topologies, single-precision accumulators, and CSF lookup tables that are not exercised by the integer-family tests landed in rounds 1+2. A stride, sub-group-mask, or precision drift would silently corrupt every float-VMAF model's primary feature on Intel-Arc CHUG re-extracts.
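To make the precision-drift risk concrete, here is a minimal sketch of how reduction order alone changes a single-precision sum. It is not taken from the fork's kernels: `sum_sequential`, `sum_lanes`, and `MAX_LANES` are hypothetical names, and the lane-strided loop only loosely imitates a sub-group reduction. It assumes strict IEEE-754 single-precision evaluation (no `-ffast-math`, no x87 excess precision).

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_LANES = 32 }; /* illustrative upper bound, not a SYCL constant */

/* Sequential single-precision accumulation, as a scalar loop would do. */
float sum_sequential(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Lane-strided accumulation: element i goes to partial sum (i % lanes),
 * and the partial sums are combined at the end. The additions happen in
 * a different order than in sum_sequential, so rounding can differ. */
float sum_lanes(const float *v, size_t n, size_t lanes)
{
    assert(lanes > 0 && lanes <= MAX_LANES);
    float part[MAX_LANES] = {0.0f};
    for (size_t i = 0; i < n; i++)
        part[i % lanes] += v[i];
    float acc = 0.0f;
    for (size_t l = 0; l < lanes; l++)
        acc += part[l];
    return acc;
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}` the sequential sum absorbs the first `1.0f` into `1e8f` and loses it, while a two-lane sum pairs the large values so they cancel before the small ones are added. Both results are valid single-precision sums of the same data, which is why a float kernel needs its own parity gate and cannot inherit confidence from the integer-family tests.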
PSNR-HVS is the next-highest-value gap: it runs a per-8x8-block DCT with a 64-entry CSF lookup and a sub-group reduction — none of which is exercised by any other SYCL parity test.
Decision
Add five CPU vs. SYCL parity gates at ADR-0214 places=4 (1e-4) tolerance under core/test/, gated on enable_sycl:
| Kernel | New test | Headline score |
|---|---|---|
| float_psnr_sycl | test_sycl_float_psnr_parity.c | float_psnr |
| float_adm_sycl | test_sycl_float_adm_parity.c | VMAF_feature_adm2_score |
| float_vif_sycl | test_sycl_float_vif_parity.c | VMAF_feature_vif_scale0_score |
| float_motion_sycl | test_sycl_float_motion_parity.c | VMAF_feature_motion2_score (idx 1) |
| psnr_hvs_sycl | test_sycl_psnr_hvs_parity.c | psnr_hvs |
Each mirrors the round-1 / round-2 pattern: 256x144 synthetic YUV420P fixture, per-frame XOR pattern with frame-dependent salt, CPU + SYCL feature extractor through the public vmaf_use_feature API, parity assertion via fabs(cpu - sycl) <= 1e-4, skip-on-no-device via [skip: no SYCL device] printf.
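The fixture and assertion pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the code in the test files: `xor_sample`, `fill_xor_plane`, `scores_match`, and the salt constants are hypothetical, and the libvmaf calls that produce the CPU and SYCL scores are deliberately left out.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

enum { FIXTURE_W = 256, FIXTURE_H = 144 }; /* luma size from this ADR */
#define PARITY_TOL 1e-4                    /* places=4 per ADR-0214 */

/* Hypothetical pattern: XOR of the coordinates with a salt derived from
 * the frame index, so consecutive frames differ. The constants 37 and 11
 * are arbitrary illustrative choices. */
uint8_t xor_sample(unsigned x, unsigned y, unsigned frame_idx)
{
    const unsigned salt = frame_idx * 37u + 11u;
    return (uint8_t)((x ^ y ^ salt) & 0xFFu);
}

/* Fill one 8-bit plane. For YUV420P the caller would use the full size
 * for Y and half the width and height for U and V. */
void fill_xor_plane(uint8_t *plane, size_t w, size_t h, size_t stride,
                    unsigned frame_idx)
{
    assert(stride >= w);
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            plane[y * stride + x] =
                xor_sample((unsigned)x, (unsigned)y, frame_idx);
}

/* Parity check: returns 1 when the two scores agree within tolerance.
 * A NaN on either side compares false and therefore fails the gate. */
int scores_match(double cpu, double sycl)
{
    return fabs(cpu - sycl) <= PARITY_TOL;
}
```

A test built on this shape would fill both reference and distorted frames, run the CPU and SYCL extractors, and call `scores_match` on each per-frame headline score. When no SYCL device is present it would print the skip message and return success instead.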
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Five per-kernel test files (chosen) | Mirrors round-2 layout; one TU per kernel keeps build parallelism and failure-isolation; reviewers can bisect a single failing kernel | Five new TUs vs. one combined | Round-2 set the precedent; combining hides which kernel regressed |
| One combined test_sycl_round3_parity.c | Single TU of ~250 LOC vs. ~900 LOC across five files | Failure attribution becomes harder; meson test --run-only granularity is lost | Loses the round-2 invariant noted in core/src/feature/sycl/AGENTS.md |
| Cover only the float-VMAF family (4 tests) | Tighter PR; defer PSNR-HVS to a round 4 | PSNR-HVS is the next-highest-value single gap; bundling it costs ~200 LOC | A 5-test PR still fits in the 200-800 LOC bundle target |
| Defer to /cross-backend-diff skill only | No new test files; relies on the existing skill | Skill is operator-invoked; CI never gates on it; regressions slip until a developer happens to run it | Same rejection rationale as round 2 |
| Cover all 9 remaining SYCL kernels in one PR | One-and-done | ~1500 LOC; SpEED and SSIMULACRA2 carry per-extractor config-arg setup not yet templated; PR exceeds the 200-800 LOC target | Defer SpEED + SSIMULACRA2 + float_moment to round 4 |
Consequences
- Positive: SYCL parity coverage rises from ~47 % to ~76 % (13 of 17 extractors). The entire float-VMAF feature surface and the PSNR-HVS kernel now have CI-gated CPU↔SYCL parity at the same places=4 tolerance as the cross-backend gate (ADR-0214). CHUG re-extracts on Intel-Arc systems can be trusted against the CPU reference for every float-VMAF column.
- Negative: Five new test executables increase enable_sycl build time by ~30 s and the test runtime by ~10 s on systems with a SYCL device. Systems without a SYCL device run the skip-path and add ~0.5 s to the suite.
- Neutral / follow-ups: Four SYCL extractors remain uncovered: speed_chroma_sycl, speed_temporal_sycl, float_moment_sycl, ssimulacra2_sycl. Each carries per-extractor config-arg setup (speed_kernelscale, speed_prescale, yuv_matrix, etc.) that is not yet templated in the round-2 test scaffold. Tracked as a round-4 follow-up.
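As a quick check, the coverage figures quoted in this section follow directly from the extractor counts (8 covered after rounds 1+2, 5 added here, 17 in total):

```latex
\[
\frac{8}{17} \approx 47.1\,\%, \qquad
\frac{8 + 5}{17} = \frac{13}{17} \approx 76.5\,\%, \qquad
17 - 13 = 4 \ \text{extractors remaining}
\]
```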
References
- ADR-0214 — cross-backend numerical-parity gate (places=4 / 1e-4)
- ADR-0219 — CHUG re-extraction trusted-column invariant
- ADR-0868 — GPU-backend kernel-coverage gap audit
- ADR-0884 — SYCL kernel coverage round 2
- ADR-0108 — fork-local deep-dive deliverables rule
- PR #351 — SYCL kernel coverage round 1
- PR #376 — SYCL kernel coverage round 2
- PR #293 — SYCL init-failure cleanup leak fix
- Research digest: docs/research/0946-sycl-kernel-coverage-round3-2026-05-31.md
- Source: req, operator brief 2026-05-31 ("SYCL kernel coverage round 3 — extend beyond PRs #351 + #376").