ADR-0574: CUDA Twins for HDR-Model Features — Phase 1 (aim, adm3)

  • Status: Accepted
  • Date: 2026-05-18
  • Deciders: lusoris, Claude (Anthropic)
  • Tags: cuda, feature, hdr, adm

Context

The Netflix HDR VMAF model consumes five C-side sub-features that were already present in the CPU code but had no CUDA twin: aim (Anchored Integer Motion), adm3, motion3, chroma_from_luma, and cambi_eotf/effective_eotf. Without CUDA twins, --backend cuda silently fell back to CPU for these sub-features, defeating the purpose of GPU selection.

An audit (documented in docs/research/netflix-upstream-feature-additions-since-sync-2026-05-18.md) found:

  • motion3 — already ported to CUDA in integer_motion_cuda.c (emits VMAF_integer_feature_motion3_score).
  • cambi_eotf / effective_eotf — already ported in integer_cambi_cuda.c.
  • chroma_from_luma — this is a predictor field in the model (model.h), not a feature extractor; no kernel is needed.
  • aim and adm3 — genuinely missing from float_adm_cuda.c.

This ADR covers the aim and adm3 additions (Phase 1).

Decision

We will extend float_adm_cuda.c and float_adm/float_adm_score.cu to compute and emit VMAF_feature_aim_score and VMAF_feature_adm3_score from the CUDA float_adm extractor, matching the CPU float_adm.c implementation to places=4.

The implementation adds two new kernel stages per scale:

  • Stage 2b (float_adm_csf_r): Computes CSF of decouple_r (the remodulated component), writing csf_a_aim and csf_f_aim. This mirrors the CPU adm_csf(&decouple_r, ...) call.
  • Stage 3b (float_adm_aim_cm): Computes the AIM CM numerator using decouple_a masked by the csf_a_aim/csf_f_aim threshold, with noise_weight = 0. Results land in accumulator slots 6..8.

The FADM_ACCUM_SLOTS constant is extended from 6 to 9: [0..2] = csf_den, [3..5] = cm_num, [6..8] = aim_cm per band.
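The slot layout above can be sketched in C as follows. This is only an illustration, not the actual source: FADM_ACCUM_SLOTS and FADM_NUM_BANDS are names used in this ADR, but the value 3 for FADM_NUM_BANDS is inferred from the three-slot groups, and the enum names, the helper fadm_sum_aim_band, and the assumption that the host buffer stores one 9-float record per work group back to back are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Names taken from this ADR; the value of FADM_NUM_BANDS is an assumption. */
#define FADM_NUM_BANDS 3
#define FADM_ACCUM_SLOTS 9

/* Hypothetical names for the first slot of each three-band group. */
enum fadm_slot_base {
    FADM_SLOT_CSF_DEN = 0, /* slots [0..2] */
    FADM_SLOT_CM_NUM = 3,  /* slots [3..5] */
    FADM_SLOT_AIM_CM = 6   /* slots [6..8], new in this ADR */
};

/* Compile-time check that the AIM group ends exactly at the slot count. */
static_assert(FADM_SLOT_AIM_CM + FADM_NUM_BANDS == FADM_ACCUM_SLOTS,
              "AIM slots must fill the tail of the accumulator record");

/* Sum the AIM CM partial for one band across all work groups.
 * Assumes accum holds num_wg consecutive records of FADM_ACCUM_SLOTS floats. */
static double fadm_sum_aim_band(const float *accum, size_t num_wg, int band)
{
    assert(band >= 0 && band < FADM_NUM_BANDS);
    double total = 0.0;
    for (size_t wg = 0; wg < num_wg; wg++)
        total += accum[wg * FADM_ACCUM_SLOTS + FADM_SLOT_AIM_CM + band];
    return total;
}
```

Keeping the new group at the tail means the existing offsets for csf_den and cm_num are unchanged; only the record stride grows from 6 to 9.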

The host-side collect() reads slots 6..8 and computes:

  • score_aim = MIN(aim_num / aim_den, 1.0)
  • score_adm3, via a harmonic mean when adm_adm3_apply_hm is set, otherwise via a linear blend weighted by adm_dlm_weight
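A minimal sketch of the two host-side computations, under stated assumptions. Only the clamp MIN(aim_num / aim_den, 1.0) is taken directly from this ADR. The ADR does not say which two scores feed the adm3 combination, so the operands a and b below are placeholders, and the function names (aim_score, harmonic_mean2, linear_blend, adm3_combine) are hypothetical. The sketch also does not handle a zero denominator, since the ADR does not describe that case.

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Clamped ratio, as stated in the ADR: score_aim = MIN(aim_num / aim_den, 1.0). */
static double aim_score(double aim_num, double aim_den)
{
    return MIN(aim_num / aim_den, 1.0);
}

/* Harmonic mean of two values; undefined when a + b == 0. */
static double harmonic_mean2(double a, double b)
{
    return 2.0 * a * b / (a + b);
}

/* Weighted linear blend: w * a + (1 - w) * b, with w in [0, 1]. */
static double linear_blend(double a, double b, double w)
{
    assert(w >= 0.0 && w <= 1.0);
    return w * a + (1.0 - w) * b;
}

/* Assumed selection logic: apply_hm picks the harmonic mean,
 * otherwise dlm_weight weights the linear blend. */
static double adm3_combine(double a, double b, int apply_hm, double dlm_weight)
{
    return apply_hm ? harmonic_mean2(a, b) : linear_blend(a, b, dlm_weight);
}
```

The harmonic mean is pulled toward the smaller of its two inputs, whereas the linear blend lets adm_dlm_weight set the balance explicitly.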

Six new options are exposed with the same defaults as CPU float_adm.c: adm_bypass_cm, adm_adm3_apply_hm, adm_p_norm, adm_dlm_weight, adm_min_val, adm_skip_aim_scale.

The --fmad=false nvcc flag already applied to float_adm_score.cu (required for places=4 parity, per AGENTS.md) covers the new kernels.

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Separate .cu file for AIM kernels | Clean file boundaries | Requires second PTX load, extra meson entry, duplicated DWT band reads | Cost exceeds benefit; kernels share helper functions |
| Store decouple_r in a scratch buffer between stages 2 and 3 | Avoids recomputation in stage 3b | Extra buffer allocation (same size as csf_a), extra write+read per pixel | Recomputation is cheap (few flops); bandwidth trade is unfavourable |
| Emit AIM only when explicitly requested via option flag | Saves 2 kernel launches when unused | Complicates dispatch logic; HDR model needs AIM unconditionally | Unconditional is simpler; 2 extra launches per scale are inexpensive |

Consequences

  • Positive: --backend cuda now produces VMAF_feature_aim_score and VMAF_feature_adm3_score without CPU fallback; HDR VMAF model runs fully on GPU.
  • Positive: Stages 2b and 3b reuse DWT bands already resident in L2 from stages 2 and 3, keeping the marginal latency low.
  • Negative: Per-frame launch count increases from 16 (4 stages × 4 scales) to at most 24 (6 stages × 4 scales); it is lower when adm_skip_aim_scale causes stage 3b to be skipped.
  • Negative: FADM_ACCUM_SLOTS = 9 increases the pinned D2H copy size by 50% (6 to 9 floats per WG). At 1080p the accum buffer is a few KB per scale — negligible.
  • Negative: Two new device buffers (csf_a_aim, csf_f_aim) each sized at FADM_NUM_BANDS * buf_stride * scale_half_h[0] * 4 bytes (same as csf_a/csf_f). At 1080p this is approximately 3.5 MB additional VRAM.
  • Neutral / follow-ups: SYCL and Vulkan float_adm twins do not yet emit aim_score / adm3_score; those are Phase 2.

References

  • req: user task message 2026-05-18: "port CUDA twins for aim, adm3, motion3, chroma_from_luma, cambi_eotf"
  • docs/research/netflix-upstream-feature-additions-since-sync-2026-05-18.md
  • ADR-0192 / ADR-0202 — original float_adm CUDA twin specification
  • AGENTS.md in core/src/feature/cuda/: --fmad=false invariant and cuLaunchKernel arg-pointer rule
  • ADR-0535: ADR atomic allocator