Skip to content

ADR-0415: CAMBI SYCL port — closes last CUDA-to-SYCL parity gap

  • Status: Accepted
  • Date: 2026-05-10
  • Deciders: lusoris
  • Tags: sycl, gpu, cambi, feature-extractor, fork-local, t3-15

Status update 2026-05-12. Implementation shipped as core/src/feature/sycl/integer_cambi_sycl.cpp (≈ 37 kB, three SYCL kernels + host fallback for calculate_c_values and top-K spatial pooling per the hybrid design below). Marks the SYCL feature-extractor parity matrix as 17 of 17 versus CUDA.

Context

The fork ships SYCL ports for 16 of 17 CUDA feature extractors. The sole remaining gap is integer_cambi — CAMBI (Contrast Aware Multiscale Banding Index), a banding-detection metric. The CUDA twin landed in ADR-0360 using a "Strategy II hybrid" design: three GPU kernels for the embarrassingly parallel stages (spatial mask, 2× decimate, 3-tap mode filter) with the precision-sensitive sliding-histogram calculate_c_values pass and top-K spatial pooling running on the host CPU via cambi_internal.h wrappers. The same design is carried forward to SYCL here, closing the CUDA-to-SYCL feature parity gap.

The fork must also support both Intel oneAPI (icpx -fsycl) and AdaptiveCpp (acpp --acpp-targets=…) since ADR-0335 landed AdaptiveCpp support. Since all arithmetic in the CAMBI GPU stages is integer-only, the strict-FP flag difference between the two toolchains (-fp-model=precise vs -ffp-contract=off) has no effect on CAMBI's kernels.

Decision

We implement integer_cambi_sycl.cpp as a direct SYCL port of the CUDA twin (ADR-0360), using the same Strategy II hybrid structure:

  • Three SYCL kernels submitted via sycl::queue::submit + nd_range parallel_for:
  • launch_spatial_mask — derivative + 7×7 box-sum + threshold (port of cambi_spatial_mask_kernel).
  • launch_decimate — strict 2× stride-2 subsample (port of cambi_decimate_kernel).
  • launch_filter_mode — separable 3-tap mode filter, horizontal then vertical (port of cambi_filter_mode_kernel).
  • USM device pointers (uint16_t *) via vmaf_sycl_malloc_device / vmaf_sycl_malloc_host; no shared allocations.
  • Synchronous per-scale loop in submit() (matching the CUDA v1 posture in ADR-0360): q.wait() between GPU and CPU stages. collect() only emits the pre-computed score.
  • Does not use vmaf_sycl_graph_register — the per-frame CPU residual serialises frames and is incompatible with the graph-replay model.
  • Registered in feature_extractor.c under #if HAVE_SYCL before the CUDA block, so SYCL is preferred when both backends are compiled in.

Precision contract: places=4 (ULP=0 on emitted score). All GPU stages are integer-only and bit-exact with the CPU scalar extractor (vmaf_fex_cambi). The host residual runs the exact CPU code from cambi_internal.h.

Alternatives considered

Option Pros Cons Why not chosen
Full GPU CAMBI (histogram on GPU too) Eliminates DtoH readback per scale; potentially higher throughput Sliding histogram is hard to parallelise bit-exactly; would break places=4 contract; deferred even for CUDA in ADR-0360 Deferred — Strategy II already acceptable
Reuse CUDA .cu kernels via SYCL compatibility layer Less new code SYCL compat layer is not in the fork's toolchain; complexity risk with AdaptiveCpp Not available in this toolchain stack
Vulkan compute shaders for CAMBI stages Maximum portability Three new GLSL shaders + specialisation constants; the Vulkan twin (ADR-0210) already provides this path Vulkan twin already covers Vulkan; SYCL twin covers Intel/AMD/NVIDIA SYCL path

Consequences

  • Positive: SYCL backend reaches full feature parity with the CUDA backend (17 of 17 extractors). CAMBI scoring on Intel Arc / iGPU / AMD-via-SYCL no longer falls back silently to the CPU path when SYCL is active.
  • Negative: The synchronous per-scale loop in submit() means no frame pipelining for CAMBI on the SYCL backend (same limitation as CUDA v1 in ADR-0360). A v2 async design would need per-scale pinned readback buffers.
  • Neutral / follow-ups: Smoke test test_integer_cambi_sycl verifies registration and a non-crash end-to-end run. Full bit-exactness (places=4) is verified by the cross-backend scoring gate.

References

  • ADR-0360: CAMBI CUDA twin (Strategy II hybrid).
  • ADR-0205 / ADR-0210: CAMBI Vulkan twin (original Strategy II precedent).
  • ADR-0335: AdaptiveCpp dual-toolchain support.
  • ADR-0138 / ADR-0139: Numerical correctness invariants for GPU ports.
  • core/src/feature/cuda/integer_cambi_cuda.c — CUDA reference.
  • core/src/feature/cuda/integer_cambi/cambi_score.cu — CUDA kernels.
  • core/src/feature/sycl/integer_cambi_sycl.cpp — this PR's SYCL port.
  • Source: per user direction (agent task brief 2026-05-10).