
ADR-0746: integer_adm_cuda — emit integer_adm3 + integer_aim (parity with CPU)

  • Status: Accepted
  • Date: 2026-05-28
  • Deciders: lusoris, Claude (Anthropic)
  • Tags: cuda, integer_adm, aim, adm3, parity

Context

integer_adm_cuda.c is the default CUDA ADM extractor (the integer path is faster than the float path). The PR #75 cross-backend baseline audit found that it did not emit VMAF_integer_feature_aim_score or VMAF_integer_feature_adm3_score, although the CPU integer_adm.c extractor does. Consumers of pooled_metrics who requested these features with --backend cuda received NaN or missing values.

ADR-0574 added aim_score and adm3_score to float_adm_cuda on 2026-05-18, but the integer path (the CUDA default) was not updated.

The integer ADM pipeline is distinct from the float path:

  • DWT bands are stored as int16_t (scale 0) / int32_t (scales 1-3).
  • CSF and CM are fully inlined into fused kernels; there are no separate decouple_r, decouple_a, csf_a, csf_f device buffers.
  • Accumulation uses fixed-point int64_t with cube-root reduction.

Decision

Extend integer_adm_cuda.c and integer_adm/adm_cm.cu to compute and emit VMAF_integer_feature_aim_score and VMAF_integer_feature_adm3_score, ULP-equal to the CPU integer_adm.c path at places=4.

Implementation strategy: fully-inlined AIM CM kernels — no new device buffers.

The AIM CM pass in the CPU code swaps the roles of the two decouple outputs: decouple_a becomes the signal and decouple_r becomes the threshold. Because the CUDA integer kernels already inline both decouple computations, the AIM pass is implemented as two new kernel entry points:

  • i4_adm_cm_aim_line_kernel_fused (scales 1-3, int32 path): same structure as i4_adm_cm_line_kernel_fused but signal = a_val (inline csf_a) and threshold neighbourhood = inline 1/30 * |rfactor * r_val| for each of the 9 neighbourhood pixels.
  • adm_cm_aim_line_kernel_8 (scale 0, int16 path): same structure as adm_cm_line_kernel but threshold uses inline r_val-based filter values and signal = inline csf_a (rfactor * a_val).

Both kernels set noise_weight = 0, matching the CPU i4_adm_cm(..., noise_weight=0.0, measure_aim=true) call.
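
The role swap described above can be illustrated with a deliberately simplified scalar sketch. It is not the kernel code: it uses double instead of fixed-point arithmetic, handles a single orientation, and the names cm_masked, adm_cm_pixel and aim_cm_pixel are hypothetical. The only details taken from this ADR are the 1/30 weight per neighbourhood pixel, the rfactor scaling, and which decouple output plays which role.

```c
#include <assert.h>

/* Absolute value helper (avoids a math.h dependency in this sketch). */
double abs_d(double v) { return v < 0.0 ? -v : v; }

/*
 * Toy contrast-masking step (hypothetical helper, not the real kernel):
 * subtract a threshold built from the 3x3 neighbourhood of the *other*
 * decouple output from the CSF-weighted centre value, then clamp at zero.
 */
double cm_masked(double signal_center, const double thr_src[9], double rfactor)
{
    double sum = 0.0;
    for (int i = 0; i < 9; i++)
        sum += abs_d(rfactor * thr_src[i]);
    double thr = sum / 30.0; /* 1/30 weight per neighbourhood pixel, per the ADR */
    double x = abs_d(rfactor * signal_center) - thr;
    return x > 0.0 ? x : 0.0;
}

/* ADM-style pass: decouple_r is the signal, decouple_a builds the threshold. */
double adm_cm_pixel(double r_center, const double a_nb[9], double rfactor)
{
    return cm_masked(r_center, a_nb, rfactor);
}

/* AIM-style pass: roles swapped, decouple_a is the signal. */
double aim_cm_pixel(double a_center, const double r_nb[9], double rfactor)
{
    return cm_masked(a_center, r_nb, rfactor);
}
```

Both passes share the same masking arithmetic and differ only in which inputs are wired to the signal and threshold arguments, which is why the swap needs no extra device buffers when both decouple values are already computed inline.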

The RES_BUFFER_SIZE constant is extended from 24 to 36 (adding 12 AIM CM accumulator slots) so the existing D2H copy and host-side conclude_adm_cm can be reused for the AIM accumulators.
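
As a rough check of the buffer arithmetic, the sketch below recomputes the slot counts and byte sizes. It assumes 8-byte int64_t accumulator slots (consistent with the int64_t accumulation described in the Context section) and assumes the AIM slots are appended after the existing 24. The names ADM_CM_SLOTS, AIM_CM_SLOTS, aim_slot_index and the two size helpers are hypothetical and may not match the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    ADM_CM_SLOTS = 24, /* existing accumulator slots (value from the ADR) */
    AIM_CM_SLOTS = 12, /* new AIM CM accumulator slots (value from the ADR) */
    RES_BUFFER_SIZE = ADM_CM_SLOTS + AIM_CM_SLOTS
};

/* Assumption: AIM slot k lives directly after the existing slots. */
size_t aim_slot_index(size_t k)
{
    assert(k < AIM_CM_SLOTS);
    return ADM_CM_SLOTS + k;
}

/* Total size of the result buffer copied device-to-host each frame. */
size_t res_buffer_bytes(void)
{
    return RES_BUFFER_SIZE * sizeof(int64_t);
}

/* Extra bytes copied per frame because of the AIM slots. */
size_t res_buffer_growth_bytes(void)
{
    return AIM_CM_SLOTS * sizeof(int64_t);
}
```

Under these assumptions the buffer is 36 slots (288 bytes) and the per-frame copy grows by 12 × 8 = 96 bytes.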

Host-side post-processing (write_scores):

score_aim  = aim_num / den   (same denominator as ADM2)
score_adm3 = MAX(score * adm_dlm_weight + (1 - score_aim) * (1 - adm_dlm_weight),
                 adm_min_val)

This matches the CPU integer_adm.c::extract() formula exactly. Two new options are exposed, adm_skip_aim (default false) and adm_dlm_weight (default 0.5), both matching the CPU defaults.
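
The post-processing formulas above can be written as a small C helper. This is a minimal sketch, not the write_scores code: the AdmScores struct and the combine_adm3 name are hypothetical, score is assumed to be the existing ADM2 score, and the den > 0 assertion stands in for whatever zero-denominator handling the real extractor uses.

```c
#include <assert.h>

/* Hypothetical container for the two new per-frame scores. */
typedef struct {
    double score_aim;
    double score_adm3;
} AdmScores;

double max_d(double a, double b) { return a > b ? a : b; }

/*
 * score          : ADM2 score (assumed already computed)
 * aim_num, den   : AIM numerator and the shared ADM2 denominator
 * adm_dlm_weight : blend weight (default 0.5 per the ADR)
 * adm_min_val    : lower clamp applied to the ADM3 score
 */
AdmScores combine_adm3(double score, double aim_num, double den,
                       double adm_dlm_weight, double adm_min_val)
{
    assert(den > 0.0); /* sketch only: real code may special-case den == 0 */
    AdmScores out;
    out.score_aim = aim_num / den;
    out.score_adm3 = max_d(score * adm_dlm_weight +
                               (1.0 - out.score_aim) * (1.0 - adm_dlm_weight),
                           adm_min_val);
    return out;
}
```

For example, with score = 0.9, aim_num = 2, den = 10, adm_dlm_weight = 0.5 and adm_min_val = 0, the helper gives score_aim = 0.2 and score_adm3 of about 0.85.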

Alternatives considered

| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Reuse float kernel from integer path | No new kernel code | Defeats integer-path perf advantage; mixes fp32 and int64 accumulators; precision characteristics differ | Rejected: defeats the entire point of the integer path |
| Separate AIM CSF buffer (like float_adm_cuda ADR-0574) | Symmetric with float_adm approach | Requires new buffer alloc (same size as csf_f), one extra kernel launch, extra VRAM | Not needed: full inline costs ≤ 9× decouple_r recomputes per neighbourhood; decouple is light math |
| Post-process AIM from existing adm_cm accumulators | Zero new kernel code | Mathematically impossible: the AIM threshold swap changes which pixels are masked; the accumulator values are not equivalent | Not viable |
| Separate adm_aim_csf.cu file | Clean separation | No benefit at this scale; kernels share inlines from adm_decouple_inline.cuh | Not justified |

Consequences

  • VMAF_integer_feature_aim_score and VMAF_integer_feature_adm3_score are now emitted by integer_adm_cuda on every frame, matching the CPU path.
  • Setting adm_skip_aim=true disables both features and skips the AIM kernel launches.
  • Per-frame launch count increases by up to 4 in the non-skip case: one adm_cm_aim_line_kernel_8 launch for scale 0, plus one i4_adm_cm_aim_line_kernel_fused launch per scale for scales 1-3.
  • RES_BUFFER_SIZE 24 → 36: D2H copy grows by 96 bytes per frame (negligible).
  • No new device buffers.
  • SYCL / Vulkan / HIP integer_adm twins do not yet emit aim_score / adm3_score; this ADR covers CUDA only.

References

  • req: user task 2026-05-28: "make integer_adm_cuda emit both features, ULP-equal to integer_adm_cpu's emission (NOT float_adm_cuda's)"
  • ADR-0574 — float_adm_cuda AIM/ADM3 port (Phase 1)
  • ADR-0214 — GPU parity CI gate (places=4 tolerance)
  • core/src/feature/integer_adm.c — CPU reference implementation
  • core/src/feature/cuda/AGENTS.md — twin-update rules, parity invariant