
ADR-0587: Real Metal Compute Kernels for CAMBI

  • Status: Accepted
  • Date: 2026-05-16
  • Deciders: lusoris, Claude (Anthropic)
  • Tags: metal, cambi, gpu, build

Context

The CAMBI banding-detection feature extractor already has a CUDA implementation that follows the Strategy II hybrid (ADR-0360). Three GPU kernels handle the embarrassingly parallel stages (spatial mask, 2x decimate, separable 3-tap mode filter), while the precision-sensitive calculate_c_values sliding-histogram pass and the top-K spatial pooling run on the host CPU via the cambi_internal.h wrappers.
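To make "embarrassingly parallel" concrete, here is a minimal scalar sketch of two of the three GPU stages in C++. It is not taken from cambi_score.cu: the function names (mode3, decimate2x, filter_mode3), the tie-breaking rule, and the border handling are assumptions chosen for illustration. The point is that every output sample depends only on a fixed, small neighbourhood of the input, so one GPU thread per output sample needs no cross-thread communication.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mode of three samples. Assumption: when all three differ, the smallest
// value is returned; the real kernels may break ties differently.
inline uint16_t mode3(uint16_t a, uint16_t b, uint16_t c)
{
    if (a == b || a == c) return a;
    if (b == c) return b;
    return std::min(a, std::min(b, c));
}

// 2x decimation: keep every second sample in both dimensions.
// Assumption: the output is (w / 2) x (h / 2), so an odd trailing
// row or column is dropped.
std::vector<uint16_t> decimate2x(const std::vector<uint16_t>& src,
                                 std::size_t w, std::size_t h)
{
    assert(src.size() == w * h);
    const std::size_t ow = w / 2, oh = h / 2;
    std::vector<uint16_t> dst(ow * oh);
    for (std::size_t y = 0; y < oh; ++y)
        for (std::size_t x = 0; x < ow; ++x)
            dst[y * ow + x] = src[(2 * y) * w + 2 * x];
    return dst;
}

// Separable 3-tap mode filter: a horizontal pass followed by a vertical pass.
// Assumption: border samples are copied through unchanged.
std::vector<uint16_t> filter_mode3(const std::vector<uint16_t>& src,
                                   std::size_t w, std::size_t h)
{
    assert(src.size() == w * h);
    std::vector<uint16_t> tmp(src);
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x)
            tmp[y * w + x] = mode3(src[y * w + x - 1], src[y * w + x], src[y * w + x + 1]);

    std::vector<uint16_t> dst(tmp);
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            dst[y * w + x] = mode3(tmp[(y - 1) * w + x], tmp[y * w + x], tmp[(y + 1) * w + x]);
    return dst;
}
```

Both loops translate directly into a 2D thread grid where each thread computes one dst sample. The sliding-histogram pass does not have this shape, because neighbouring outputs share and update histogram state.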

The Metal backend had no real CAMBI kernel. This ADR records the decision to port the three CUDA kernels to MSL and wire them into the existing Metal dispatch infrastructure, using the same Strategy II hybrid as CUDA.
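One way to picture the Strategy II split is a per-scale loop in which the GPU stages finish before the host stages start. The sketch below is illustrative only: HybridHooks, run_hybrid, and the three callbacks are hypothetical names rather than anything in the Metal or CUDA dispatch code, and the per-scale ordering is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical hooks standing in for the real dispatch code.
struct HybridHooks {
    std::function<void(int)> run_gpu_stages;    // mask / decimate / mode filter
    std::function<void(int)> wait_for_gpu;      // block until the GPU work is done
    std::function<double(int)> run_host_stages; // c-values + top-K pooling on the CPU
};

// Runs every scale to completion before starting the next one, so only one
// set of intermediate buffers is live at any time.
std::vector<double> run_hybrid(const HybridHooks& hooks, int num_scales)
{
    assert(hooks.run_gpu_stages && hooks.wait_for_gpu && hooks.run_host_stages);
    assert(num_scales >= 0);
    std::vector<double> per_scale;
    per_scale.reserve(static_cast<std::size_t>(num_scales));
    for (int s = 0; s < num_scales; ++s) {
        hooks.run_gpu_stages(s);                        // enqueue the GPU stages
        hooks.wait_for_gpu(s);                          // synchronous hand-off to the host
        per_scale.push_back(hooks.run_host_stages(s));  // host consumes the GPU output
    }
    return per_scale;
}
```

The blocking wait is the price of the hybrid: the host stages cannot start until the GPU output for that scale is visible to the CPU.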

Decision

We implement three MSL compute kernels in integer_cambi.metal (cambi_spatial_mask_kernel, cambi_decimate_kernel, cambi_filter_mode_kernel) as bit-exact ports of the CUDA kernels in cambi_score.cu. The host dispatch in integer_cambi_metal.mm mirrors the CUDA host code in integer_cambi_cuda.c: submit() runs synchronously per scale (the same approach as CUDA v1), and collect() emits the pre-computed score. Because all integer arithmetic is identical to the CUDA path, the precision target is ULP=0 against the CPU scalar implementation. The enforced check remains the places=4 cross-backend gate from ADR-0214, which a ULP=0 result satisfies by construction.
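The two precision notions in this decision can be sketched as two small checks. This is a hedged illustration, not the gate's actual code: bit_exact and agrees_to_places are hypothetical helpers, and the reading of "places=4" as "the difference rounds to zero at four decimals" (as in Python's unittest assertAlmostEqual) is an assumption. ADR-0214 holds the authoritative definition.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// ULP=0 for integer buffers: every sample must match exactly.
inline bool bit_exact(const std::vector<uint16_t>& a, const std::vector<uint16_t>& b)
{
    return a == b;
}

// Assumed meaning of "places=N": |a - b| rounds to zero at N decimals.
inline bool agrees_to_places(double a, double b, int places)
{
    assert(places >= 0);
    const double scale = std::pow(10.0, places);
    return std::round(std::fabs(a - b) * scale) == 0.0;
}
```

If the GPU stages are bit-exact and the host stages are shared with the CPU path, the final scores are identical, so the looser places=4 comparison passes without relying on its tolerance.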

Alternatives considered

| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Full GPU (all 5 stages on Metal) | No DtoH round-trips | calculate_c_values is a complex sliding-histogram; porting it to MSL is high risk and not needed for places=4 parity | Not chosen for v1; mirrors CUDA Strategy II rationale |
| Strategy I (all CPU, Metal no-op) | Zero risk | No GPU acceleration | Does not deliver the requested real kernel |
| Async per-scale (multiple command buffers in flight) | Better throughput | Requires per-scale MTLBuffer ping-pong and synchronisation complexity | Deferred to v2; CUDA v1 is synchronous for the same reason |

References

  • req: "IMPLEMENT REAL Metal cambi kernel (NOT a stub). 20 min budget." (paraphrased)
  • ADR-0360: CUDA CAMBI Strategy II hybrid
  • ADR-0214: cross-backend ULP gate (places=4)
  • core/src/feature/cuda/integer_cambi/cambi_score.cu — CUDA kernel reference
  • core/src/feature/metal/integer_cambi.metal — MSL port
  • core/src/feature/metal/integer_cambi_metal.mm — host dispatch