ADR-0957: SYCL kernel coverage round 4 (float_moment + SpEED + SSIMULACRA2)
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris, Claude
- Tags: sycl, test, gpu, parity, kernel-coverage
Context
Rounds 1 (PR #351), 2 (PR #376), and 3 (PR #446) drove SYCL parity coverage from 12 % to 78 % (14 of 18 SYCL extractors). The round-3 ADR-0946 closed the float family + PSNR-HVS and explicitly deferred four extractors to a round-4 backlog:
- float_moment_sycl: shares its TU with integer_moment_sycl (core/src/feature/sycl/integer_moment_sycl.cpp); the extractor is registered under the name float_moment_sycl and emits four headline features (float_moment_ref1st/dis1st/ref2nd/dis2nd).
- speed_chroma_sycl: SpEED chroma family (ADR-0567). 752-LOC SYCL TU at core/src/feature/sycl/speed_chroma_sycl.cpp. Discovery during round-4 work: the TU exists but is not wired into sycl_feature_sources in core/src/meson.build, and the extractor symbol vmaf_fex_speed_chroma_sycl is not registered in core/src/feature/feature_extractor.c. The TU appears complete (no TODO/FIXME/stub markers) but ships as dormant scaffold.
- speed_temporal_sycl: SpEED temporal family (ADR-0567). 705-LOC SYCL TU at core/src/feature/sycl/speed_temporal_sycl.cpp. Same build-wiring + registration gap as speed_chroma_sycl. Carries the VMAF_FEATURE_EXTRACTOR_TEMPORAL flag: frame 0 always emits 0.0; the meaningful score would land at frame index 1.
- ssimulacra2_sycl: hybrid host + GPU port of the ADR-0201 Vulkan pipeline (ADR-0206). Multi-stage XYB + IIR + SSIM-combine pipeline whose float rounding accumulates past the places=4 baseline; the ADR-0214 FEATURE_TOLERANCE table assigns 5e-3 to it for the same reason.
The build-wiring + registration gap for the two SpEED twins is out of scope for a kernel-coverage PR (closing it would change the production extractor surface, not just test coverage). The two tests are added in dormant form and surface a [skip: ...not built into libvmaf] message that tracks the gap; they auto-activate as real parity gates the moment a follow-up PR wires the TUs into the build + registry.
Without a CPU↔SYCL parity gate, a stride / sub-group-mask / coefficient drift in any of these kernels would silently corrupt SpEED, float-moment, or SSIMULACRA2 columns on Intel-Arc CHUG re-extracts; the cross-backend gate (ADR-0214) covers some of these on Vulkan/lavapipe but not on the SYCL backend on Intel-Arc.
Decision
Add four CPU vs. SYCL parity gates under core/test/, gated on enable_sycl:
| Kernel | New test | Headline score(s) | Tolerance | Status |
|---|---|---|---|---|
| float_moment_sycl | test_sycl_float_moment_parity.c | float_moment_{ref,dis}{1st,2nd} | 1e-4 (ADR-0214 default) | Active: extractor built + registered |
| speed_chroma_sycl | test_sycl_speed_chroma_parity.c | Speed_chroma_feature_speed_chroma_{u,v,uv}_score | 1e-4 (ADR-0214 default) | Dormant skip: TU not yet wired into build/registry |
| speed_temporal_sycl | test_sycl_speed_temporal_parity.c | Speed_temporal_feature_speed_temporal_score @ idx 1 | 1e-4 (ADR-0214 default) | Dormant skip: TU not yet wired into build/registry |
| ssimulacra2_sycl | test_sycl_ssimulacra2_parity.c | ssimulacra2 | 5e-3 (ADR-0214 FEATURE_TOLERANCE['ssimulacra2']) | Active: extractor built + registered |
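
The per-feature tolerances in the table can be pictured as a default with a small override list. The following is a minimal sketch, not the tests' actual code: the names parity_tol, parity_tol_overrides, and parity_tolerance_for are hypothetical, and only the two numeric values (1e-4 default, 5e-3 for ssimulacra2) come from this ADR.

```c
#include <stddef.h>
#include <string.h>

/* Default places=4 tolerance quoted in the table (ADR-0214 default). */
#define PARITY_TOL_DEFAULT 1e-4

/* Hypothetical record type: one feature name and its tolerance override. */
struct parity_tol {
    const char *feature;
    double tol;
};

/* Hypothetical override list; only the ssimulacra2 value is from the ADR. */
static const struct parity_tol parity_tol_overrides[] = {
    { "ssimulacra2", 5e-3 },
};

/* Return the override for `feature` if one exists, else the default. */
static double parity_tolerance_for(const char *feature)
{
    const size_t n = sizeof(parity_tol_overrides) / sizeof(parity_tol_overrides[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(parity_tol_overrides[i].feature, feature) == 0)
            return parity_tol_overrides[i].tol;
    }
    return PARITY_TOL_DEFAULT;
}
```

Keeping the default tight and listing only the exceptions mirrors the reasoning in the alternatives table: a single loose constant would hide drift in the kernels that are expected to stay within 1e-4.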
Each mirrors the rounds 1–3 pattern: a 256x144 synthetic YUV420P fixture; a per-frame XOR pattern with frame-dependent salt; CPU + SYCL feature extraction through the public vmaf_use_feature API with a NULL options dict (defaults match between CPU and SYCL for all four kernels); a parity assertion via fabs(cpu - sycl) <= TOL; and skip-on-no-device via a [skip: no SYCL device] printf.
The SSIMULACRA2 fixture fills all three planes (not just luma) because the pipeline consumes YUV → linear-RGB → XYB and chroma matters for the headline score. The speed_temporal fixture submits two frames (the TEMPORAL flag means frame 0 emits 0.0) and asserts at index 1.
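
As an illustration of the fixture and assertion shape described above, here is a minimal sketch. The exact XOR formula and the helper names fill_xor_plane and scores_match are assumptions made for this example; the ADR only states that the pattern is XOR-based with a frame-dependent salt and that the check is fabs(cpu - sycl) <= TOL.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Fill one 8-bit plane with an XOR pattern. `salt` is expected to change
 * per frame (and may also differ per plane). The formula is an assumed
 * stand-in, not the pattern the real tests use. */
static void fill_xor_plane(uint8_t *plane, unsigned w, unsigned h,
                           size_t stride, unsigned salt)
{
    for (unsigned y = 0; y < h; y++) {
        for (unsigned x = 0; x < w; x++)
            plane[(size_t)y * stride + x] = (uint8_t)((x ^ y ^ salt) & 0xFFu);
    }
}

/* Parity assertion from the ADR: nonzero when |cpu - sycl| <= tol. */
static int scores_match(double cpu, double sycl, double tol)
{
    return fabs(cpu - sycl) <= tol;
}
```

For a 256x144 YUV420P frame, a luma-only fixture would call fill_xor_plane once with w=256, h=144; a three-plane fixture such as the SSIMULACRA2 one would call it twice more for the 128x72 chroma planes. For the temporal extractor, the comparison would be made on the score read at index 1 rather than index 0.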
The two SpEED tests additionally guard with vmaf_get_feature_extractor_by_name(...) and emit [skip: <name> not built into libvmaf] when the extractor is absent — see the Context section above for the build-wiring gap discovered during round-4 implementation. Each test auto-activates as a real parity gate the moment a follow-up PR wires the SpEED TUs into core/src/meson.build + core/src/feature/feature_extractor.c.
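
The dormant-skip guard can be sketched as follows. To keep the example self-contained, the registry lookup is passed in as a function pointer and backed by a stub; extractor_lookup_fn, stub_lookup, and extractor_available_or_skip are hypothetical names, and the stub's contents are illustrative only. In the real tests the lookup role is played by vmaf_get_feature_extractor_by_name.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup signature: returns NULL when the extractor is absent. */
typedef const void *(*extractor_lookup_fn)(const char *name);

/* Stand-in registry for illustration: only the two extractors this ADR
 * describes as built + registered are listed. */
static const char *const stub_registered[] = {
    "float_moment_sycl",
    "ssimulacra2_sycl",
};

static const void *stub_lookup(const char *name)
{
    const size_t n = sizeof(stub_registered) / sizeof(stub_registered[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(stub_registered[i], name) == 0)
            return stub_registered[i];
    }
    return NULL;
}

/* Return 1 if the parity test should run; otherwise print the skip line
 * in the format quoted in this ADR and return 0. */
static int extractor_available_or_skip(extractor_lookup_fn lookup, const char *name)
{
    if (lookup(name) == NULL) {
        printf("[skip: %s not built into libvmaf]\n", name);
        return 0;
    }
    return 1;
}
```

Because the guard depends only on the lookup result, a test written this way needs no edit when the extractor is later registered: the same binary moves from the skip path to the parity path.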
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Four per-kernel test files (chosen) | Mirrors rounds 2 + 3 layout; one TU per kernel keeps build parallelism and failure-isolation; reviewers can bisect a single failing kernel | Four new TUs vs. one combined | Round-2 / round-3 set the precedent; combining hides which kernel regressed |
| One combined test_sycl_round4_parity.c | Single TU = ~250 LOC vs. ~700 LOC across four files | Failure attribution becomes harder; meson test --run-only granularity is lost; breaks the AGENTS.md coverage matrix invariant (one row per parity test) | Loses the rounds 1–3 invariant from the AGENTS.md note in core/src/feature/sycl/AGENTS.md |
| Defer ssimulacra2 to round 5 (cover 3 kernels here) | Tighter PR; the ssimulacra2 gate uses a different tolerance, special-cases the fixture, and could land standalone | Leaves the headline SSIMULACRA2 SYCL gap open for another cycle; the per-kernel scaffold cost is the same whether bundled or split | Four kernels still fit comfortably in the 200–800 LOC bundle target |
| Drop places=4 and use 5e-3 for all four | Single tolerance constant; easier to maintain | Hides drift in float_moment_sycl, speed_chroma_sycl, speed_temporal_sycl whose kernels are bit-tight per ADR-0214 default | ADR-0214 is per-feature; matching its FEATURE_TOLERANCE is the invariant |
| Extract a shared test/sycl_parity_helpers.{h,c} first | Eliminates the run_cpu / run_sycl boilerplate duplicated across all 12 rounds 1–4 tests | Touches 12 existing tests + adds a new TU; out of scope for a kernel-coverage PR | Tracked as a follow-up refactor; the per-test divergence (config dicts, multi-score, multi-frame) is small enough that the boilerplate cost is acceptable |
Consequences
- Positive: All four SYCL extractors enumerated in the round-3 backlog now have a parity test (4 of 4), so SYCL parity test coverage rises from 78 % (14 of 18) to 100 % (18 of 18) of SYCL extractors, counting the two dormant tests. Every registered SYCL extractor (16 of 18, excluding the dormant SpEED scaffolds) now has an active CI-gated CPU↔SYCL parity test; the two unregistered SpEED twins carry dormant parity tests that activate automatically once the build wiring + registration follow-up lands. The round-4 backlog row in core/src/feature/sycl/AGENTS.md is closed.
- Negative: Four new test executables increase enable_sycl build time by ~25 s and the test runtime by ~8 s on systems with a SYCL device. Systems without a SYCL device run the skip-path and add ~0.4 s to the suite.
- Negative (discovered during round 4): speed_chroma_sycl (752 LOC) and speed_temporal_sycl (705 LOC) ship as dormant scaffold: full SYCL TUs that are neither compiled into libvmaf.{so,a} nor registered with the extractor registry. This PR documents the gap and tracks it via dormant tests; the follow-up to wire them in is left to a dedicated PR (it changes the production extractor surface, not just test coverage).
- Neutral / follow-ups: The run_cpu / run_sycl boilerplate is now duplicated across 12 SYCL parity tests (rounds 1 + 2 + 3 + 4). A future PR may extract test/sycl_parity_helpers.{h,c} once the shape stabilises (rounds 1–4 already converge on the same six steps: init, import_state, use_feature, feed, EOS, score_at_index). Out of scope for this PR.
References
- ADR-0214 — cross-backend numerical-parity gate (per-feature tolerance table)
- ADR-0219 — CHUG re-extraction trusted-column invariant
- ADR-0567 — SpEED SYCL kernels (chroma + temporal)
- ADR-0206 — SSIMULACRA2 CUDA + SYCL ports (hybrid host/GPU)
- ADR-0192 — GPU long-tail batch 3 (per-feature precision contracts)
- ADR-0868 — GPU-backend kernel-coverage gap audit
- ADR-0884 — SYCL kernel coverage round 2
- ADR-0946 — SYCL kernel coverage round 3
- ADR-0108 — fork-local deep-dive deliverables rule
- PR #351 — SYCL kernel coverage round 1
- PR #376 — SYCL kernel coverage round 2
- PR #446 — SYCL kernel coverage round 3
- Research digest: docs/research/0957-sycl-kernel-coverage-round4-2026-05-31.md
- Source: req, operator brief 2026-05-31 ("SYCL kernel coverage round 4 — close the last 4 uncovered SYCL kernels").