Skip to content

ADR-0760: CUDA motion kernel multi-resolution ncu profiling methodology

  • Status: Accepted
  • Date: 2026-05-29
  • Deciders: lusoris
  • Tags: cuda, perf, research

Context

The CUDA motion feature is the only VMAF feature where GPU throughput is lower than CPU at 4K (Research-0751: CPU 290.8 fps vs CUDA 175.5 fps at 4K). Prior ncu analysis (Research-0735) profiled 576p only. Without a multi-resolution profile, the bottleneck class and the resolution at which the transition from dispatch-bound to compute-bound occurs are unknown.

This ADR documents the decision to run the multi-resolution profiling analysis before proposing any implementation changes, and records the approach taken (ncu --set basic with container isolation, three resolutions, live kernel metrics + wall-time cross-check).

Decision

Profile calculate_motion_score_kernel_8bpc at 576p, 1080p, and 4K using ncu --set basic in a one-off vmaf-dev-mcp:cuda13.3 container before any optimization is implemented. Capture kernel duration, occupancy, DRAM throughput, compute SM throughput, waves/SM, and registers per thread. Cross-check with end-to-end wall-time measurements to separate kernel time from dispatch overhead.

Alternatives considered

Option Pros Cons Why not chosen
Profile at 4K only Minimal profiling time Misses the dispatch-dominated regime that defines the CUDA/CPU ratio at sub-4K Sub-4K is the primary deployment target
Use perf record instead of ncu Host-native, no PMU privilege Does not break down CUDA kernel vs dispatch overhead GPU-specific data required
Implement batching first, then profile Fast path to improvement Risks implementing the wrong fix if bottleneck class is wrong Profile first per spec

Consequences

  • Positive: The multi-resolution data (Research-0760) confirms the bottleneck is dispatch overhead at 576p/1080p and latency-bound compute at 4K. Optimizations can now be ordered correctly: batching first (largest leverage), then separable filter (4K improvement).
  • Negative: Research-only commit; no user-visible improvement yet.

References

  • req: "Profile motion CUDA kernel at 576p, 1080p, and 4K to identify optimization candidates"
  • Research-0760 (docs/research/0760-cuda-motion-ncu-multi-resolution-20260529.md)
  • Research-0735 (prior 576p-only analysis)
  • Research-0751 (PR #90 4K cross-backend baseline)