ADR-0753: Runtime Resolution-Aware CUDA Kernel Variant Dispatch¶

Status: Accepted
Date: 2026-05-29
Deciders: lusoris
Tags: cuda, perf, build

Context¶

Multi-resolution profiling (Research-0748, Research-0749, Research-0751) demonstrated that no single CUDA kernel configuration is optimal across all workload sizes. The fork targets three distinct operating points:

576p (WS_SMALL): nearly every CUDA metric kernel is launch-overhead-bound. The ADM family runs < 1 wave at this resolution; adm_cm __launch_bounds__(128,8) shows zero gain. Motion loses to CPU outright. Occupancy optimisations are neutral because there are not enough waves to hide latency.
1080p (WS_MEDIUM): wave count is large enough for occupancy effects to materialise. adm_cm_line_kernel_8 with __launch_bounds__(128,8) saves −9.3% execution time by reducing register-bank pressure; filter1d_8_horizontal_kernel __ldg() + __launch_bounds__(128,10) gives +3.6% end-to-end VIF throughput. ms_ssim_decimate is already saturated (95% L1 hit rate) — no tiling needed.
4K (WS_LARGE): filter1d is fully saturated (253 waves, 69.7% active warps); __ldg() / __launch_bounds__ for that kernel are neutral. adm_cm __launch_bounds__ shows near-zero effect (−0.3%, noise). Shared-memory tiling for the ADM / ms_ssim families is now beneficial because working-sets exceed L2.

Choosing the variant at compile time would require separate binaries or fat-binary tricks. A cheap runtime branch (one integer comparison per frame at kernel-launch time) is the right trade-off.

Decision¶

We will classify every incoming frame pair into one of three workload classes (WS_SMALL, WS_MEDIUM, WS_LARGE) based on pixel count at the luma plane, using the thresholds below. Per-kernel optimisation policies are then expressed as a lookup in a flat dispatch table keyed on (feature, workload_class). The classification lives in a new header+TU (resolution_dispatch.h / resolution_dispatch.c) under core/src/feature/cuda/.

Resolution thresholds (luma pixel count):

Class	Condition	Canonical example
`WS_SMALL`	`w * h < 1280 * 720`	576p (576×324)
`WS_MEDIUM`	`1280720 <= wh < 3840*2160`	1080p (1920×1080)
`WS_LARGE`	`w * h >= 3840 * 2160`	4K (3840×2160)

Optimisation application policy (initial, based on measured data):

Optimisation	WS_SMALL	WS_MEDIUM	WS_LARGE
`adm_cm` `__launch_bounds__(128,8)`	SKIP	APPLY	SKIP
`filter1d` `__ldg + __launch_bounds__(128,10)`	SKIP	APPLY	APPLY
`ssim_vert_combine` `__ldg + __launch_bounds`	SKIP	APPLY	APPLY (confirm via 4K re-measurement follow-up if needed)
`ms_ssim_decimate` smem tiling	SKIP	SKIP	SKIP
motion CUDA vs CPU fallback	CPU	CUDA	CUDA

Three feature extractors now use this dispatch framework:

adm_cm_device() in integer_adm_cuda.c: selects between adm_cm_line_kernel_8 (bounded, WS_MEDIUM) and adm_cm_line_kernel_8_no_bounds (WS_SMALL / WS_LARGE). AdmStateCuda carries func_adm_cm_line_kernel_8_no_bounds.
filter1d_8() in integer_vif_cuda.c: selects between filter1d_8_horizontal_kernel_2_17_9 (bounded, WS_MEDIUM + WS_LARGE) and filter1d_8_horizontal_kernel_2_17_9_no_bounds (WS_SMALL). VifStateCuda carries func_filter1d_8_horizontal_kernel_2_17_9_no_bounds.
submit_fex_cuda() in integer_ssim_cuda.c: selects between calculate_ssim_vert_combine (bounded, WS_MEDIUM + WS_LARGE) and calculate_ssim_vert_combine_no_bounds (WS_SMALL). SsimStateCuda carries func_vert_no_bounds.

Alternatives considered¶

Option	Pros	Cons	Why not chosen
Compile-time variants only (separate build flags)	Zero runtime overhead; simpler code	Requires separate binaries for each resolution class; inconvenient for users	Defeats single-binary deployment story
First-frame auto-tuning (launch both variants, keep faster)	Self-calibrates to actual hardware	Adds 1-2 frame latency; requires result buffering; complex timing infrastructure	Implementation complexity outweighs marginal calibration benefit; profiling data already provides good starting thresholds
Hand-tuned single variant (always-bounds or never-bounds)	Simplest code	−9.3% regression at 1080p (no-bounds) or neutral waste at 576p/4K (always-bounds)	Not neutral — leaves measured gain on the table
Per-SM-count auto-selection (query `cuDeviceGetAttribute`)	Hardware-aware rather than resolution-proxy	A second GPU property query per init; resolution is the primary occupancy driver, SM count a secondary one	More complexity for marginal improvement; resolution is the dominant parameter

Consequences¶

Positive: adm_cm at 1080p recovers the −9.3% that would otherwise be left on the table when the __launch_bounds__ variant is unconditionally applied at all resolutions.
Positive: The dispatch policy is O(1) per-frame — a single w*h multiply and integer compare at kernel-launch time.
Positive: The table is the single source of truth for "which optimisation wins where"; future kernel authors add one row and one cuModuleGetFunction call.
Negative: AdmStateCuda grows one additional CUfunction pointer per resolution-split kernel. Currently one (func_adm_cm_line_kernel_8_no_bounds).
Negative: adm_cm.cu grows a second kernel instantiation, increasing cubin size marginally (a few KB per arch target).
Neutral: The vmaf_cuda_workload_class function is pure C and needs no CUDA headers, making it testable with the CPU-only build.
Neutral / follow-ups: Future expansions (motion CPU fallback at 576p, filter1d at WS_LARGE) follow the same pattern without touching the dispatch infrastructure.

References¶

Research-0748: filter1d_8_horizontal_kernel 1080p re-measurement.
Research-0749: ADM CM __launch_bounds__ profiling at 1080p.
Research-0751: Cross-backend 4K baseline + PR #79 adm_cm A/B at 4K.
ADR-0743 — VIF filter1d register pressure ceiling pattern.
ADR-0750 — adm_cm_line_kernel_8 __launch_bounds__(128,8) measurement and policy.
req: "design (not implement) a dispatch policy that picks the right variant at runtime based on detected workload size. Scaffold the C-side glue + ADR documenting the policy."