Skip to content

Research-0756: CUDA F3 Struct-by-Value Kernel Audit

  • Status: Active
  • Workstream: perf/cuda-f3-struct-by-value-audit (no ADR — research only)
  • Date: 2026-05-29
  • References: PR #93 (F3 finding deferral), Research-0754, Research-0734 through 0738

Background

PR #93 (perf/cuda-ssim-vert-combine-ldg-launch-bounds-leak-20260529) identified "F3" as the pattern where a __global__ kernel accepts a VmafCudaBuffer (or similar aggregating struct) by value. This hides the underlying CUdeviceptr pointer from the compiler's alias analysis and prevents ptxas from classifying those loads as non-aliased reads eligible for:

  • the read-only L1 texture cache path (ld.global.nc), and
  • __restrict__-based alias-free optimisation within the inner loop.

The fix used in PR #93 for calculate_ssim_vert_combine was to extract const float *__restrict__ raw pointers from each VmafCudaBuffer argument before the inner loop and then read via __ldg(). That approach is measured at -4.2% kernel duration at 1080p (Research-0754).

This digest performs the fork-wide inventory and severity-ranks the affected kernels by their measured DRAM throughput from PR #77's ncu profiles.


Struct definitions

Struct Fields What hides the pointer
VmafCudaBuffer (core/src/cuda/common.h:32) size_t size; CUdeviceptr data; data field is CUdeviceptr (an opaque unsigned long long); the compiler cannot deduce non-alias from a struct copy
VmafPicture (core/include/libvmaf/picture.h:40) pix_fmt, bpc, w[3], h[3], stride[3], void *data[3], VmafRef *ref, void *priv void *data[3] — opaque void array; no restrict-path possible
AdmBufferCuda (core/src/feature/cuda/integer_adm/integer_adm_cuda.h) Large struct of cuda_adm_dwt_band_t sub-structs containing int16_t * / int32_t * pointers The containing struct copy blocks restrict analysis on all pointer members

Full kernel inventory

AFFECTED — takes VmafCudaBuffer or similar struct by value

File Kernel Struct args Inner loop reads
integer_ms_ssim/ms_ssim_score.cu:70 ms_ssim_decimate VmafCudaBuffer src, VmafCudaBuffer dst 81 loads/output pixel via reinterpret_cast<const float *>(src.data)
integer_ms_ssim/ms_ssim_score.cu:100 ms_ssim_horiz VmafCudaBuffer 11 loads per channel per output pixel (K=11), 7 buffers
integer_ms_ssim/ms_ssim_score.cu:136 ms_ssim_vert_lcs VmafCudaBuffer 11 loads × 5 input buffers (K=11 vert Gaussian)
integer_psnr_hvs/psnr_hvs_score.cu:198 psnr_hvs VmafCudaBuffer ref_in, VmafCudaBuffer dist_in, VmafCudaBuffer partials_out 64-pixel DCT block reads from ref_in.data, dist_in.data
integer_psnr/psnr_score.cu:36 calculate_psnr_kernel_8bpc VmafCudaBuffer sse (output only) Write-only via sse.data — input is VmafPicture
integer_psnr/psnr_score.cu:74 calculate_psnr_kernel_16bpc VmafCudaBuffer sse (output only) Write-only — same as above
integer_psnr/psnr_score.cu:36 calculate_psnr_kernel_8bpc const VmafPicture ref, const VmafPicture dis void *data[3] pointer fields — no __restrict__ possible
integer_psnr/psnr_score.cu:74 calculate_psnr_kernel_16bpc const VmafPicture ref, const VmafPicture dis Same
integer_moment/moment_score.cu:54 calculate_moment_kernel_8bpc const VmafPicture ref, const VmafPicture dis Per-pixel ref[x], dis[x] loads from void *data[0]
integer_moment/moment_score.cu:87 calculate_moment_kernel_16bpc const VmafPicture ref, const VmafPicture dis Same
integer_ciede/ciede_score.cu:175 calculate_ciede_kernel_8bpc const VmafPicture ref, const VmafPicture dis, VmafCudaBuffer sum 6 channel loads per pixel (Y+U+V × 2)
integer_ciede/ciede_score.cu:218 calculate_ciede_kernel_16bpc Same Same
integer_ssim/ssim_score.cu:75 calculate_ssim_horiz_8bpc const VmafPicture ref, const VmafPicture cmp pixel reads via void *data[0]
integer_ssim/ssim_score.cu:112 calculate_ssim_horiz_16bpc Same Same
integer_adm/adm_decouple.cu:43 adm_decouple_kernel AdmBufferCuda buf buf.ref_dwt2.band_h[i*stride+j] etc — 6 loads per pixel
integer_adm/adm_decouple.cu:126 adm_decouple_s123_kernel AdmBufferCuda buf Same pattern
integer_adm/adm_cm.cu:144 i4_adm_cm_line_kernel_fused AdmBufferCuda buf, AdmFixedParametersCuda params Multiple indirect reads via buf.i4_ref_dwt2, buf.i4_csf_f
integer_adm/adm_cm.cu:365 adm_cm_line_kernel_N (macro) AdmBufferCuda buf Per-pixel cubic accumulation loop
integer_adm/adm_cm.cu:456 i4_adm_cm_aim_line_kernel_fused AdmBufferCuda buf Same
integer_adm/adm_cm.cu:682 adm_cm_aim_line_kernel_N (macro) AdmBufferCuda buf Same

Partial F3 (VmafCudaBuffer is output-only or single read — lower severity):

File Kernel Struct args Note
integer_motion/motion_score.cu:64 calculate_motion_score_kernel_8bpc VmafPicture src + VmafCudaBuffer src_blurred, prev_blurred, sad src.data[0] read once for tile load; src_blurred / prev_blurred written/read as uint16_t * in inner loop — F3 applies but kernel is already smem-tiled
integer_motion_v2/motion_v2_score.cu:57 motion_v2_kernel_8bpc VmafCudaBuffer sad (output-only) Only the atomic output is via struct — input pointers already __restrict__

SAFE-PTR — takes raw const T * __restrict__ pointers

File Kernel Note
float_psnr/float_psnr_score.cu:29 float_psnr_kernel_8bpc const uint8_t *__restrict__ ref/dis — clean
float_psnr/float_psnr_score.cu:63 float_psnr_kernel_16bpc Clean
float_vif/float_vif_score.cu:123 float_vif_compute Raw const uint8_t *ref_raw, *dis_raw
float_vif/float_vif_score.cu:283 float_vif_decimate Raw pointers
float_adm/float_adm_score.cu:115–547 All 6 float_adm kernels Raw const float *ref_band, *dis_band
integer_vif/filter1d.cu All 8 filter1d variants (macro-generated) Struct-internal ptrs promoted to locals before inner loop (PR #74 pattern)
integer_adm/adm_csf.cu adm_csf_kernel_*, i4_adm_csf_kernel_* Raw int pointer args from template params
integer_adm/adm_csf_den.cu Both scale-line kernels Raw int16_t *, int32_t * in template
integer_adm/adm_dwt2.cu All DWT2 kernels Smem-tiled; raw ptrs from struct extracted to locals at kernel entry
integer_cambi/cambi_score.cu All 3 cambi kernels const uint16_t *, uint16_t * — raw ptrs
ssimulacra2/ssimulacra2_blur.cu All 5 blur kernels const float *__restrict__ throughout
ssimulacra2/ssimulacra2_mul.cu ssimulacra2_mul3 __restrict__ on all 3 ptrs
speed/speed_score.cu All 5 speed kernels const float *__restrict__ throughout
integer_psnr_hvs/psnr_hvs_score.cu:198 psnr_hvs inner reads Shared-memory DCT ops after tile load — L1 resident

Severity ranking (DRAM throughput from PR #77 ncu measurements)

The critical question for F3 severity is: does the kernel do significant global-memory reads in a hot inner loop, and are those reads currently missing the read-only L1 cache path?

Rank Kernel DRAM Tp (baseline ncu) Source digest F3 severity Reason
1 ms_ssim_vert_lcs Not independently measured; structural twin of calculate_ssim_vert_combine (55.8% at 1080p — Research-0736) Research-0737 (inferred) HIGH 11-tap vert loop over 5 VmafCudaBuffer args (55 loads/pixel). No pointer extraction before loop. Identical pattern to calculate_ssim_vert_combine pre-PR-#93
2 ms_ssim_horiz Not independently measured; same kernel shape as calculate_ssim_horiz_8bpc (8.8% at 576p but scales with resolution) Research-0737 (inferred) HIGH 7 VmafCudaBuffer args; 11-tap horizontal loop; F3 prevents ld.global.nc on all 7 intermediate buffers
3 calculate_ciede_kernel_8bpc/16bpc Not measured in PR #77 digests (adm/motion/ssim/ms_ssim covered; ciede absent) Research-0738 gap MEDIUM-HIGH 6 channel reads per pixel from VmafPicture.data[0..2] + __restrict__-invisible; also ciede2000_dev() is compute-heavy, reducing relative weight of F3
4 adm_decouple_kernel / adm_decouple_s123_kernel 7.8–7.9% DRAM (Research-0734; all ADM kernels launch-starved at 576p) Research-0734 MEDIUM 6 pointer accesses per pixel from AdmBufferCuda sub-structs. At 1080p and above the DRAM pattern matters; launch starvation dominates at 576p
5 psnr_hvs Not measured in PR #77 digests Research-0738 gap MEDIUM 64-pixel DCT block loaded per CUDA block from ref_in.data, dist_in.data; post-load work is integer DCT in shared memory — F3 only affects the initial tile load

Below top-5:

  • calculate_psnr_kernel_*: VmafCudaBuffer sse is write-only; the VmafPicture inputs read at most 1 pixel/thread; launch-count and integer arithmetic dominate. F3 severity: LOW.
  • calculate_moment_kernel_*: Same pattern as psnr — 2 loads per pixel, arith-light kernel, warp-reduce dominated. F3 severity: LOW.
  • adm_cm_line_kernel_8 / i4_adm_cm_line_kernel_fused: Already receiving __launch_bounds__(128, 8) attention (PR #97). The AdmBufferCuda struct access pattern is indirect (buf.ref_dwt2.band_h[i*stride+j]); the inner loop is already register-limited, not DRAM-limited. Adding __ldg() here requires extracting ~6 per-band sub-pointers which is more invasive without proportionate DRAM gain.
  • calculate_motion_score_kernel_8bpc: Input is VmafPicture.data[0] but is loaded once per tile into shared memory. Post-smem-load inner loop does not re-read global memory. F3 severity: NEGLIGIBLE.
  • motion_v2_kernel_8bpc: VmafCudaBuffer sad is write-only (single atomic). Input args already const uint8_t *__restrict__. F3 severity: NONE (only output struct).

Fix strategies

In-kernel __ldg() extraction (low-risk, proven in PR #93)

// Before inner loop, once per kernel invocation:
const float *h_ref_mu = __ldg_ptr(reinterpret_cast<const float *>(h_ref_mu_buf.data));
// ... then use h_ref_mu[src_idx] throughout the loop

This is the approach PR #93 applied to calculate_ssim_vert_combine. It requires no call-site changes, is bit-identical (measured: 0.00e+00 max diff on all frames in Research-0754), and yields measurable improvement where the kernel is DRAM-bound at usable wave counts (1080p: -4.2% duration).

Host-side extraction (higher impact, more invasive)

Extract raw CUdeviceptr values from each VmafCudaBuffer on the host before cuLaunchKernel, then pass as raw kernel arguments. This removes the struct entirely from the kernel signature and allows ptxas to see const float * __restrict__ directly, enabling full ld.global.nc without __ldg(). Requires changing both the .cu kernel signature and every cuLaunchKernel call site in the corresponding _cuda.c dispatcher. Higher effort, not bit-risk, but more invasive. Recommended only where the kernel is also a candidate for AoS→SoA restructuring (F1, deferred in PR #93).


Top-5 PRs to dispatch from this audit

Priority PR to dispatch Target kernel(s) Strategy Estimated effort Expected ncu gain (1080p)
1 perf/cuda-ms-ssim-vert-lcs-ldg ms_ssim_vert_lcs In-kernel __ldg() extraction of 5 read-only buffers before K=11 vert loop ~30 LOC, 1 .cu file, 0 call-site changes -4 to -6% kernel duration (identical structural pattern to PR #93 ssim_vert)
2 perf/cuda-ms-ssim-horiz-ldg ms_ssim_horiz In-kernel __ldg() for 7 VmafCudaBuffer args before K=11 horiz loop ~35 LOC, same file -3 to -5% at ≥1080p; negligible at 576p (wave-limited)
3 perf/cuda-ciede-restrict-ldg calculate_ciede_kernel_8bpc + _16bpc const uint8_t * extraction from VmafPicture.data[0..2] before per-pixel compute; __ldg() on Y/U/V reads ~25 LOC per variant; needs ncu A/B to confirm (no existing baseline) Unknown — profile first; ciede2000_dev() is compute-heavy so DRAM saving may be <3%
4 perf/cuda-adm-decouple-ldg adm_decouple_kernel + adm_decouple_s123_kernel Extract 6 sub-struct band pointers from AdmBufferCuda before inner pixel loop ~20 LOC; ADM kernels are launch-starved at 576p so must profile at 1080p+ Only meaningful at ≥1080p; at 576p launch starvation dominates (DRAM 7.8%, Research-0734)
5 perf/cuda-psnr-hvs-ldg psnr_hvs Extract ref_in.data + dist_in.data pointers before shared-memory tile load ~10 LOC; tile load is the only global-read phase Moderate; tile load is a one-time per-block cost; post-load is smem-resident

Kernels confirmed already fixed

  • calculate_ssim_vert_combine (ssim_score.cu:153) — fixed in PR #93 / Research-0754. Measured -4.2% kernel duration at 1080p.

ncu profile commands for unvalidated kernels

# ms_ssim_vert_lcs (needs separate launch-count to capture past decimate)
ncu -k "regex:ms_ssim_vert_lcs" --set basic --launch-count 10 \
  core/build-ncu/tools/vmaf --feature float_ms_ssim_cuda --backend cuda \
  --reference python/test/resource/yuv/checkerboard_1920_1080_10_3_0_0.yuv \
  --distorted python/test/resource/yuv/checkerboard_1920_1080_10_3_1_0.yuv \
  --width 1920 --height 1080 --pixel_format 420 --bitdepth 8 -o /dev/null
# ncu section: LaunchStats + MemoryWorkloadAnalysis

# ciede (1080p, both bpc variants)
ncu -k "regex:calculate_ciede" --set basic --launch-count 6 \
  core/build-ncu/tools/vmaf --feature ciede_cuda --backend cuda \
  --reference python/test/resource/yuv/checkerboard_1920_1080_10_3_0_0.yuv \
  --distorted python/test/resource/yuv/checkerboard_1920_1080_10_3_1_0.yuv \
  --width 1920 --height 1080 --pixel_format 420 --bitdepth 8 -o /dev/null

Summary

Total AFFECTED kernels (F3 pattern present): 20 kernel variants across 8 families.

Of those, high severity (inner hot loop, not smem-tiled, no existing pointer extraction): 5 (ms_ssim_vert_lcs, ms_ssim_horiz, ciede 8/16bpc, adm_decouple).

Already fixed: calculate_ssim_vert_combine (PR #93).

Dispatch priority: PR-1 (ms_ssim_vert_lcs) first — identical pattern to PR #93, highest confidence of a measurable win at 1080p.

  • PR #93 / Research-0754 (F3 first fix — ssim_vert_combine)
  • Research-0736 (SSIM ncu hotpath)
  • Research-0737 (MS-SSIM ncu hotpath)
  • Research-0734 (ADM ncu hotpath)
  • Research-0738 (cross-metric summary)
  • ADR-0756 (this audit)