ADR-0202: float_adm CUDA + SYCL twins — sixth Group B float kernel finishes¶
- Status: Accepted
- Date: 2026-04-27
- Deciders: Lusoris, Claude (Anthropic)
- Tags: cuda, sycl, gpu, feature-extractor, fork-local, places-4
Context¶
ADR-0192 tracks the GPU long-tail batch 3 roadmap; ADR-0199 shipped the Vulkan kernel for float_adm as part 6a. This ADR closes parts 6b (CUDA) and 6c (SYCL) — the cross-backend twins of the same kernel, following the established two-PR pattern from earlier batches (motion_v2 #146 → #147, psnr_hvs #143 → #144, etc.).
CPU reference: float_adm.c (thin wrapper)
adm.c::compute_adm(4-scale orchestration) +adm_tools.c(the float_s-suffixed primitives). Vulkan structural reference:float_adm_vulkan.candfloat_adm.comp.
Decision¶
Ship float_adm_cuda and float_adm_sycl as direct ports of the Vulkan kernel — same four pipeline stages, same -1 mirror form on both axes, same fused stage 3 with cross-band CM threshold. The CPU extractor's places=4 precision contract is preserved by both backends.
CUDA twin —¶
float_adm_cuda.c (float_adm_score.cu)
- Single
.cufile with four__global__entry points (one per stage), same shape as the float_vif CUDA kernel from ADR-0197. Submit/collect async-stream pattern matchesmotion_cudaandfloat_vif_cuda. - Per-frame: 16 launches (4 stages × 4 scales) on the picture stream + a D2H copy of per-scale (csf, cm) partials on the secondary event stream. Reduction across WGs runs on the host in
double. - Stage 3 grid:
(3 × num_active_rows, 1, 1)workgroups, each with aWG_SIZE = 16 × 16block. Warp + cross-warp reduction via__shfl_down_syncmirrors the VulkansubgroupAddtree. - Per-scale
ref_band[]/dis_band[]allocations (one buffer per scale) so the next scale's stage 0 can read the parent LL band without overwriting it. Vulkan reuses one buffer serially; CUDA pays a tiny extra-allocation cost for clearer ownership and easier debugging.
SYCL twin —¶
- Single
.cppfile with fourlaunch_*templates overSCALE, same shape asfloat_vif_sycl.cpp. Self-contained submit/collect — does NOT register with the shared_frame preallocation model (the multi-scale band/csf layout doesn't fit, same rationale as float_vif). [[intel::reqd_sub_group_size(32)]]on the stage-3 reduction ensures portable warp behaviour across Intel + Nvidia SYCL back-ends.
-fmad=false for the CUDA fatbin¶
The Vulkan kernel uses the GLSL precise qualifier on the angle-flag dot product (ot_dp = oh*th + ov*tv) so the comparison lhs >= rhs does NOT depend on FMA contraction. NVCC's default -fmad=true fuses the same expression into FMA(ov, tv, oh*th), which cascades through the CSF / CM cube reductions and pushes scale-3 and adm2 past places=4 (max_abs_diff seen at 3.6e-4 on the Netflix normal pair before fixing).
meson.build now carries a small per-kernel flag dict (cuda_cu_extra_flags) and threads --fmad=false
-Xcompiler=-ffp-contract=offinto thefloat_adm_scorefatbin only — the integer ADM kernel usesint64accumulators for which FMA is irrelevant, so the existing FMA-on path is preserved for it. This is a precision contract guard, not a performance regression —-fmad=falseonly affects the fourfloat_adm_*kernels and leaves the rest of the CUDA build untouched.
Parent-LL dimension trap (load-bearing)¶
Stage 0 at scale > 0 reads the parent's LL band. The mirror/clamp clamp dimensions are the parent's LL output dimensions (= scale_w/h[scale], the input dims at the current scale), NOT the parent's full-resolution image dimensions (= scale_w/h[scale - 1]). The Vulkan kernel passes pc.cur_w/cur_h which match the former; the first cut of the CUDA + SYCL submit code passed scale_w/h[scale - 1], which clamped against the wrong bounds and let the parent reads wander into uninitialised memory at scale 1+. Symptom: max_abs_diff = 3.6e-4 at adm_scale3 and 1.4e-4 at adm2 on the Netflix normal pair. Fix: float_adm_cuda.c::submit_fex_cuda
float_adm_sycl.cpp::submit_fex_syclboth now passscale_w/h[scale]. Cited inline at the declaration so future refactors don't regress the bounds.
Alternatives considered¶
- Five-file split per integer ADM CUDA pattern (
adm_dwt2.cu,adm_decouple.cu,adm_csf.cu,adm_csf_den.cu,adm_cm.cu). Reasoning: matches the most structurally-similar precedent. Rejected because the float pipeline shares no header dependencies with the integer path, the float_vif precedent uses one file (float_vif/float_vif_score.cu), and meson's per-target compile flags machinery is simpler with a single-file target. The user instructions called the directory layout "you can reuse" — i.e. permissive, not mandatory. - Targeted
__fmul_rn/__fadd_rnintrinsics in the angle-flag and cube reductions instead of a TU-wide-fmad=false. Rejected because the affected expressions appear in five distinct device functions and each rewrite would need a paired comment chain explaining "this exact parens layout matters"; the per-kernel meson flag is one line and isolates the fix to the kernel that needs it. - Bundle this PR with the ssimulacra2 cuda/sycl twins (the only other batch-3 metric that hasn't fanned out). Rejected to keep the PR shape consistent with #144 / #147 / #150 / #151 — one pure-Vulkan PR, then one CUDA+SYCL twin PR per metric.
Consequences¶
Positive¶
float_admis now available on Vulkan + CUDA + SYCL (and CPU, AVX2, AVX-512, NEON), closing the sixth and final Group B float metric gap from ADR-0192.- All five output metrics (
adm2,adm_scale0..3) hitmax_abs_diff ≤ 6e-6on the Netflix normal pair — the same tolerance the Vulkan kernel achieves, and well inside theplaces=4contract.
Negative¶
- One new fatbin (
float_adm_score) compiled with--fmad=false. Affects only this kernel; the rest of the CUDA build keeps its default FMA behaviour. - The Vulkan host wrapper allocates one band buffer; CUDA allocates four (one per scale). Per-frame device memory delta: ~
4 × buf_stride × half_h0 × float = 4 × 288 × 162 × 4 = ~720 KiBper CUDA frame on the Netflix normal pair.
Reproducer¶
# CUDA + SYCL build inside libvmaf/.
PATH="/opt/intel/oneapi/compiler/latest/bin:/opt/cuda/bin:$PATH" \
CXX=/opt/intel/oneapi/compiler/latest/bin/icpx \
CC=/opt/intel/oneapi/compiler/latest/bin/icx \
meson setup core/build_cs --reconfigure \
-Denable_cuda=true -Denable_sycl=true -Denable_vulkan=enabled \
-Denable_float=true \
-Dsycl_compiler=/opt/intel/oneapi/compiler/latest/bin/icpx \
libvmaf
ninja -C core/build_cs
# Cross-backend gate, places=4.
python3 scripts/ci/cross_backend_vif_diff.py \
--vmaf-binary core/build_cs/tools/vmaf \
--reference python/test/resource/yuv/src01_hrc00_576x324.yuv \
--distorted python/test/resource/yuv/src01_hrc01_576x324.yuv \
--width 576 --height 324 --feature float_adm \
--backend cuda --places 4
# Expected: 0/48 mismatches across all 5 metrics, max_abs_diff ≤ 6e-6.
References¶
req— user task instruction: "Implementfloat_adm_cuda+float_adm_sycl— the CUDA and SYCL GPU twins of the just-shippedfloat_adm_vulkankernel".- ADR-0192 — batch-3 roadmap.
- ADR-0199 — Vulkan kernel parent.
- ADR-0197 — closest Group B float twin precedent (CUDA + SYCL pattern, fmad-off, mirror-trap notes).
- ADR-0178 — integer ADM Vulkan parent (algorithm shape, dispatch grid).
Status update 2026-05-08: SYCL DWT rewrite to group_load¶
Per ADR-0028 the body above is frozen. This appendix records a downstream investigation outcome that touches the same SYCL TU.
Research-0086 §A.4 emitted a GO recommendation to rewrite the ADM DWT vertical and horizontal passes in integer_adm_sycl.cpp on top of sycl::ext::oneapi::experimental::group_load. The rewrite was attempted on 2026-05-08 and deferred under ADR-0332. Two blockers forced the deferral:
- The vertical-pass tile (
TILE_H × WG_X = 18 × 32 = 576int32 elements,WG_SIZE = 256work-items) does not satisfy the SYCL ext contracttotal = WG_SIZE × ElementsPerWorkItemfor any integerElementsPerWorkItem. The general expression2 × (WG_Y + 1) / WG_Yis integer only forWG_Y ∈ {1, 2}, neither viable for the current 8-row output stride. group_loadrequires a contiguousInputIteratorT; the multi-row tile load is contiguous only within a single tile row (WG_X = 32ints), separated by fullin_stridebetween rows.
The horizontal pass at line 358 carries no SLM tile and was a non-target.
The Battlemage register-pressure delta that motivated the digest's GO recommendation is unverifiable on the dev host (Arc A380 Alchemist; no Xe2 available). The kernel remains bit-exact-untouched on this dimension; the cross-backend gate (scripts/ci/cross_backend_vif_diff.py --feature adm --backend sycl, places=4) continues to apply against the unchanged manual cooperative tile load. See ADR-0332 for the full alternatives matrix and re-open conditions.