ADR-0574: CUDA Twins for HDR-Model Features, Phase 1 (aim, adm3)
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, Claude (Anthropic)
- Tags: cuda, feature, hdr, adm
Context
The Netflix HDR VMAF model consumes five C-side sub-features that exist in the CPU code and were believed to have no CUDA twin: `aim` (Anchored Integer Motion), `adm3`, `motion3`, `chroma_from_luma`, and `cambi_eotf`/`effective_eotf`. Where a CUDA twin is missing, `--backend cuda` silently falls back to CPU for that sub-feature, defeating the purpose of GPU selection.
An audit (documented in `docs/research/netflix-upstream-feature-additions-since-sync-2026-05-18.md`) found:
- `motion3`: already ported to CUDA in `integer_motion_cuda.c` (emits `VMAF_integer_feature_motion3_score`).
- `cambi_eotf`/`effective_eotf`: already ported in `integer_cambi_cuda.c`.
- `chroma_from_luma`: this is a predictor field in the model (`model.h`), not a feature extractor; no kernel is needed.
- `aim` and `adm3`: genuinely missing from `float_adm_cuda.c`.
This ADR covers the aim and adm3 additions (Phase 1).
Decision
We will extend `float_adm_cuda.c` and `float_adm/float_adm_score.cu` to compute and emit `VMAF_feature_aim_score` and `VMAF_feature_adm3_score` from the CUDA `float_adm` extractor, matching the CPU `float_adm.c` implementation to `places=4`.
The implementation adds two new kernel stages per scale:
- Stage 2b (`float_adm_csf_r`): Computes the CSF of `decouple_r` (the remodulated component), writing `csf_a_aim` and `csf_f_aim`. This mirrors the CPU `adm_csf(&decouple_r, ...)` call.
- Stage 3b (`float_adm_aim_cm`): Computes the AIM CM numerator using `decouple_a` masked by the `csf_a_aim`/`csf_f_aim` threshold, with `noise_weight = 0`. Results land in accumulator slots 6..8.
The `FADM_ACCUM_SLOTS` constant is extended from 6 to 9: `[0..2]` = `csf_den`, `[3..5]` = `cm_num`, `[6..8]` = `aim_cm` per band.
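The slot layout can be sketched on the host side as follows. This is a minimal illustration, not the code in `float_adm_cuda.c`: the `FADM_SLOT_*` names, the function `fadm_reduce_slots`, the value `FADM_NUM_BANDS = 3` (inferred from the three-slot groups), and the workgroup-major memory layout are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum {
    FADM_NUM_BANDS    = 3, /* assumed: one slot per band in each group */
    FADM_ACCUM_SLOTS  = 9, /* 6 before this ADR, 9 after */
    FADM_SLOT_CSF_DEN = 0, /* slots [0..2] (hypothetical name) */
    FADM_SLOT_CM_NUM  = 3, /* slots [3..5] (hypothetical name) */
    FADM_SLOT_AIM_CM  = 6  /* slots [6..8], new in this ADR (hypothetical name) */
};

/* Sum the per-workgroup partials into one total per slot.
 * Assumed layout: accum[wg * FADM_ACCUM_SLOTS + slot]. */
static void fadm_reduce_slots(const float *accum, size_t wg_count,
                              double totals[FADM_ACCUM_SLOTS])
{
    assert(accum != NULL && totals != NULL);
    for (int s = 0; s < FADM_ACCUM_SLOTS; s++)
        totals[s] = 0.0;
    for (size_t wg = 0; wg < wg_count; wg++)
        for (int s = 0; s < FADM_ACCUM_SLOTS; s++)
            totals[s] += accum[wg * FADM_ACCUM_SLOTS + s];
}
```

Under these assumptions, the AIM partial for band `b` is `totals[FADM_SLOT_AIM_CM + b]`, and the existing groups keep their original offsets, so readers of slots 0..5 are unaffected.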
The host-side `collect()` reads slots 6..8 and computes:
- `score_aim = MIN(aim_num / aim_den, 1.0)`
- `score_adm3` via harmonic mean or linear blend, depending on `adm_adm3_apply_hm` / `adm_dlm_weight`
Six new options are exposed with the same defaults as CPU `float_adm.c`: `adm_bypass_cm`, `adm_adm3_apply_hm`, `adm_p_norm`, `adm_dlm_weight`, `adm_min_val`, `adm_skip_aim_scale`.
The `--fmad=false` nvcc flag is already applied to `float_adm_score.cu` (required for `places=4` parity, per AGENTS.md), so it also covers the new kernels added to that file.
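The host-side score step can be sketched as below. Only the clamp in `fadm_score_aim` comes from this ADR. The ADR names the two `adm3` modes but not their formulas, so the body of `fadm_score_adm3_sketch` is a hypothetical placeholder (it treats `1 - aim` as a quality term and combines it with an ADM score); the authoritative formulas are the ones in CPU `float_adm.c`. All function and parameter names here are invented for the example.

```c
#include <assert.h>

/* From the ADR: score_aim = MIN(aim_num / aim_den, 1.0).
 * The positive-denominator precondition is a simplification of this sketch. */
static double fadm_score_aim(double aim_num, double aim_den)
{
    assert(aim_den > 0.0);
    double ratio = aim_num / aim_den;
    return ratio < 1.0 ? ratio : 1.0;
}

/* HYPOTHETICAL combination, shown only to illustrate the mode switch.
 * apply_hm   stands in for the adm_adm3_apply_hm option,
 * dlm_weight stands in for the adm_dlm_weight option. */
static double fadm_score_adm3_sketch(double score_adm, double score_aim,
                                     int apply_hm, double dlm_weight)
{
    double aim_quality = 1.0 - score_aim; /* assumption, not from the ADR */
    if (apply_hm) {
        /* Harmonic mean of the two terms. */
        double sum = score_adm + aim_quality;
        return sum > 0.0 ? 2.0 * score_adm * aim_quality / sum : 0.0;
    }
    /* Linear blend weighted by dlm_weight. */
    return dlm_weight * score_adm + (1.0 - dlm_weight) * aim_quality;
}
```

Keeping the clamp and the mode switch in separate functions makes it straightforward to compare each intermediate value against the CPU path when checking `places=4` parity.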
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Separate `.cu` file for AIM kernels | Clean file boundaries | Requires second PTX load, extra meson entry, duplicated DWT band reads | Cost exceeds benefit; kernels share helper functions |
| Store `decouple_r` in a scratch buffer between stages 2 and 3 | Avoids recomputation in stage 3b | Extra buffer allocation (same size as `csf_a`), extra write+read per pixel | Recomputation is cheap (few flops); bandwidth trade is unfavourable |
| Emit AIM only when explicitly requested via option flag | Saves 2 kernel launches when unused | Complicates dispatch logic; HDR model needs AIM unconditionally | Unconditional is simpler; 2 extra launches per scale are inexpensive |
Consequences
- Positive: `--backend cuda` now produces `VMAF_feature_aim_score` and `VMAF_feature_adm3_score` without CPU fallback; the HDR VMAF model runs fully on GPU.
- Positive: Stages 2b and 3b reuse DWT bands already resident in L2 from stages 2 and 3, keeping the marginal latency low.
- Negative: Per-frame launch count increases from 16 to up to 24 (6 stages × 4 scales; stage 3b is skipped for `adm_skip_aim_scale`).
- Negative: `FADM_ACCUM_SLOTS = 9` increases the pinned D2H copy size by 50% (6 to 9 floats per WG). At 1080p the accum buffer is a few KB per scale, which is negligible.
- Negative: Two new device buffers (`csf_a_aim`, `csf_f_aim`), each sized at `FADM_NUM_BANDS * buf_stride * scale_half_h[0] * 4` bytes (same as `csf_a`/`csf_f`). At 1080p this is approximately 3.5 MB additional VRAM.
- Neutral / follow-ups: SYCL and Vulkan `float_adm` twins do not yet emit `aim_score`/`adm3_score`; those are Phase 2.
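
The launch-count figures above can be reproduced with a small calculation. This is a sketch under stated assumptions: the baseline of 4 stages per scale is inferred from 16 launches over 4 scales, and `skipped_aim_scales` is a hypothetical parameter for the number of scales on which stage 3b does not run; the ADR does not spell out the exact semantics of `adm_skip_aim_scale`.

```c
#include <assert.h>

enum {
    FADM_NUM_SCALES  = 4, /* from "6 stages x 4 scales" */
    FADM_BASE_STAGES = 4  /* inferred: 16 launches / 4 scales */
};

/* Kernel launches per frame after adding stages 2b and 3b. */
static int fadm_launches_per_frame(int skipped_aim_scales)
{
    assert(skipped_aim_scales >= 0 && skipped_aim_scales <= FADM_NUM_SCALES);
    int base     = FADM_BASE_STAGES * FADM_NUM_SCALES;   /* existing stages */
    int stage_2b = FADM_NUM_SCALES;                      /* runs on every scale */
    int stage_3b = FADM_NUM_SCALES - skipped_aim_scales; /* may be skipped */
    return base + stage_2b + stage_3b;
}
```

With no scale skipped this gives the upper bound of 24; each skipped stage 3b launch lowers the count by one.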
References
- req: user task message 2026-05-18: "port CUDA twins for aim, adm3, motion3, chroma_from_luma, cambi_eotf"
- `docs/research/netflix-upstream-feature-additions-since-sync-2026-05-18.md`
- ADR-0192 / ADR-0202: original `float_adm` CUDA twin specification
- AGENTS.md in `core/src/feature/cuda/`: `--fmad=false` invariant and `cuLaunchKernel` arg-pointer rule
- ADR-0535: ADR atomic allocator