
Research-0734 — CUDA 13.3 Fix-List Deep Audit: Per-Issue Exposure Mapping

Date: 2026-05-28
Author: Lusoris (Claude agent)
Branch: docs/cuda-13.3-fix-list-deep-audit-20260528
Supersedes / extends: PR #64 (thread-reconvergence + __mul24 + C++23 unblock)


Executive Summary

This digest audits every "Fixed an issue where…" / "Resolved an issue where…" entry in the CUDA 13.3 (and 13.2 / 13.1 / 13.0) release notes and maps each to our codebase.

No new CRITICAL findings beyond the thread-reconvergence issue already tracked by PR #64. Of the four items highlighted here, only the first requires action now; the other three are not exposed in the current tree:

  1. CRITICAL (already tracked, still unresolved for our build env): The thread-reconvergence compiler bug [6156910] has been present in NVCC since 12.8. Our Containerfile currently pins cuda-toolkit-13-2, so every .cu kernel compiled with that toolchain can end up with stale or corrupted register values after nested-divergent branches. This affects all 25+ .cu kernels in core/src/feature/cuda/, most critically float_adm_score.cu, adm_decouple.cu, integer_vif/filter1d.cu, and integer_adm/adm_dwt2.cu, which all contain multiple levels of nested if divergence (between 18 and 86 if statements per file among those counted). The fix ships only in NVCC 13.3.

  2. MEDIUM: cuFFT floating-point exceptions [5923044]. If multi-GPU cuFFT plans are ever used (none are in the tree today, although the cuFFT header is available), cufftXtQueryPlan can raise FP exceptions. Not exposed today; recorded below as scope-guarded.

  3. MEDIUM — ptxas WGMMA data race: Affects Hopper/Blackwell wgmma.mma_async instruction sequences only. Our kernels do not use wgmma instructions; not exposed.

  4. MEDIUM — NVRTC + lld/mold initialisation failure [5020829]: Our build does not use NVRTC (runtime compilation). Not exposed.


Severity Rubric

| Severity | Meaning |
|---|---|
| CRITICAL | Silent score corruption (wrong VMAF value) |
| HIGH | Build failure, runtime crash |
| MEDIUM | Perf regression or fault in a path not yet taken |
| LOW | Cosmetic / memory leak / correctness in unexercised API |

Full Per-Issue Classification Table

CUDA 13.3

| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Call-sites | Severity | Recommendation |
|---|---|---|---|---|---|---|---|
| 1 | NVCC compiler | "Fixed a compiler issue, present since CUDA 12.8, that could cause compiler-inserted thread reconvergence to fail and leave stale or corrupted values in registers." | 6156910 | AFFECTED: all .cu kernels with nested divergence compiled under NVCC ≤13.2.x. Kernels confirmed affected: float_adm_score.cu (42 if), adm_decouple.cu (18 if, 8 nested-conditional blocks), integer_vif/filter1d.cu (86 if), integer_adm/adm_dwt2.cu, integer_adm/adm_cm.cu, float_vif_score.cu, integer_cambi/cambi_score.cu, float_motion_score.cu, integer_motion_score.cu, ssimulacra2_blur.cu. Thread-block __syncthreads() calls occur in 12 of 25 .cu files; __syncwarp() in speed/speed_score.cu. | 25 .cu files; worst: filter1d.cu (86 branches), float_adm_score.cu (42 branches) | CRITICAL | Bump Containerfile pin from cuda-toolkit-13-2 to cuda-toolkit-13-3. Tracked by PR #64; this audit confirms the severity and scope. Run cross-backend-diff after upgrade to verify numeric parity. |
| 2 | NVCC / NVRTC | "Fixed an issue where applications that use NVRTC and are linked with LLVM lld or mold could fail to initialize the supported architecture list." | 5020829 | NOT AFFECTED: the fork does not use NVRTC. All kernels are compiled AOT by NVCC. No nvrtcCreate / nvrtcCompile call sites found. | 0 |  | None. |
| 3 | ptxas (CUDA Tools) | "Fixed Data Race in WGMMA A/B Register Copy Propagation — ptxas could incorrectly copy-propagate across wgmma.wait_group.sync.aligned N and eliminate necessary mov instructions." | (no ID) | NOT AFFECTED: our kernels do not use wgmma.mma_async tensor-core instructions. No WGMMA/wmma/mma.sync usage found in core/src/feature/cuda/ or core/src/cuda/. | 0 |  | None. |
| 4 | CUDA Math | "Fixed an issue where silent data corruption could occur when the CUDA Math API __mul24() intrinsic was called with compile-time constant inputs." | 5807344 | NOT AFFECTED: confirmed by PR #65 audit. Zero __mul24 / __umul24 references in our .cu / .cuh files. | 0 |  | None (already audited). |
| 5 | cuFFT | "Fixed an issue where cufftXtQueryPlan could result in floating-point exceptions when querying multi-GPU plans." | 5923044 | NOT AFFECTED: fork does not use cuFFT. No cufft API calls in source tree. | 0 |  | Annotate in future if cuFFT is added. |
| 6 | cuSPARSE | "Fixed a memory leak in SpMVOp when destroy_lrb() was called." | 5974043 | NOT AFFECTED: fork does not use cuSPARSE. | 0 |  | None. |
| 7 | nvJPEG | "Fixed an issue with boundary handling when decoding a region of interest with interpolation enabled." | (no ID) | NOT AFFECTED: fork does not use nvJPEG. | 0 |  | None. |
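
The per-file if counts quoted in row 1 can be gathered for every kernel in one pass. The following is a minimal sketch, not part of the audit tooling: rank_if_lines is a hypothetical helper name, it assumes GNU grep (for `\b`) and file paths that contain no colon-digit suffix, and it counts matching lines, which is only a rough proxy for nested divergence.

```shell
# Sketch: rank CUDA sources by the number of lines that contain an `if` token.
# rank_if_lines is a hypothetical helper, not an existing script in the repo.
# The pattern is the one used in the audit commands, so preprocessor lines
# such as `#endif` are counted as well.
rank_if_lines() {
    dir="$1"
    # grep -cH prints "path:count" for every file, including zero counts.
    find "$dir" -type f -name '*.cu' -exec grep -cH 'if\b' {} + \
        | awk -F: '{ n = $NF; sub(/:[0-9]+$/, ""); print n "\t" $0 }' \
        | sort -k1,1nr -k2,2
}
```

Calling `rank_if_lines core/src/feature/cuda` would print one `count<TAB>path` line per file, highest count first, which makes it easy to see whether any kernel outside the list in row 1 has a comparable branch density.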

CUDA 13.2 / 13.2 Update 1

| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Severity | Recommendation |
|---|---|---|---|---|---|---|
| 8 | cuBLAS | "Fixed an issue in cublasLtMatmulAlgoGetHeuristic() that could result in no algorithm candidates" (Grouped GEMM, Blackwell). | CUB-9657 | NOT AFFECTED: no cuBLAS usage in fork. |  | None. |
| 9 | cuBLAS | "Fixed an issue … could cause FP8 kernels to hang." | CUB-9627 | NOT AFFECTED: no cuBLAS. |  | None. |
| 10 | cuBLAS | "Fixed an issue … incorrect results when C broadcasting was used (LDC = 0)." | 5845724 | NOT AFFECTED: no cuBLAS. |  | None. |
| 11 | cuBLAS | "Fixed integer overflow bug in complex, emulated FP64 matrix multiplication." | 5720478 | NOT AFFECTED: no cuBLAS. |  | None. |
| 12 | cuBLAS | "Fixed concurrent execution issue with Tensor Memory." | 5807900 | NOT AFFECTED: no cuBLAS. |  | None. |
| 13 | CUDA Math | Known issue note: __mul24() silent data corruption (fix delivered in 13.3). | 5807344 | NOT AFFECTED: zero __mul24 call sites (PR #65 audit confirmed). |  | None. |
| 14 | cuSPARSE | "Fixed an issue that caused performance regressions in BSR SpMM." | 5860241 | NOT AFFECTED: no cuSPARSE. |  | None. |

CUDA 13.1 / 13.1 Update 1

| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Severity | Recommendation |
|---|---|---|---|---|---|---|
| 15 | cuBLAS | "Fixed missing memory initialization in cublasCreate() emulation handling." | CUB-9302 | NOT AFFECTED: no cuBLAS. |  | None. |
| 16 | cuBLAS | "Fixed FP8 matmuls failing on multi-device Blackwell GeForce systems." | CUB-9487 | NOT AFFECTED: no cuBLAS. |  | None. |
| 17 | cuBLAS | "Fixed an issue where fixed point emulation with 7 mantissa bits or less could trigger failures." | 5692684 | NOT AFFECTED: no cuBLAS. |  | None. |
| 18 | cuBLAS | "Fixed cublasLtMatmul with FP8 arguments incorrectly requiring 16-byte aligned scaling factors." | 5728938 | NOT AFFECTED: no cuBLAS. |  | None. |
| 19 | cuSOLVER | "Fixed a bug preventing users from changing algorithm for cusolverDnXsyevBatched." | 5539844 | NOT AFFECTED: no cuSOLVER. |  | None. |
| 20 | cuSPARSE | "Fixed an accuracy issue in mixed-precision CSR/COO SpMM computations." | CUSPARSE-2349 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 21 | cuSPARSE | "Fixed an issue in CSR SpMM computations with a high number of columns." | CUSPARSE-2301 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 22 | cuSPARSE | "Fixed potential issues with unaligned pointers in cusparseCsr2cscEx2." | CUSPARSE-2380 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 23 | cuSPARSE | "Fixed a determinism issue in CSR cusparseSpMM ALG3." | CUSPARSE-2612 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 24 | cuSPARSE | "Fixed a potential race condition when dynamically loading driver APIs." | CUSPARSE-2764 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 25 | cuFFT | "Fixed a correctness issue affecting half/bfloat16 precision size-1 strided transforms." | (no ID) | NOT AFFECTED: no cuFFT. |  | None. |
| 26 | nvJPEG | "Fixed nvJPEG's lossless JPEG 92 implementation handling comment markers." | 5484797 | NOT AFFECTED: no nvJPEG. |  | None. |
| 27 | NPP | "Reduced nvJPEG Encoder initialization time on Thor." | 5533951 | NOT AFFECTED: no NPP encoder. |  | None. |
| 28 | NPP | "Fixed an issue in nppiCFAToRGB_8u_C1C3R() affecting SSIM validation." | 5192648 | POSSIBLY AFFECTED: NPP SSIM path. If the fork ever uses nppiCFAToRGB upstream of an SSIM kernel, this could silently corrupt SSIM inputs. Grep shows no nppiCFAToRGB call in our source tree; scope-guarded. | LOW | Monitor if NPP pixel-format conversion is added. |

CUDA 13.0 / 13.0 Updates

| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Severity | Recommendation |
|---|---|---|---|---|---|---|
| 29 | cuBLAS | "Fixed undefined behavior from dereferencing nullptr… when passing uninitialized Cdesc." | CUB-8911 | NOT AFFECTED: no cuBLAS. |  | None. |
| 30 | cuBLAS | "Fixed an issue where some cublasSsyrkx kernels produced incorrect results on Blackwell." | CUB-8846 | NOT AFFECTED: no cuBLAS. |  | None. |
| 31 | cuSOLVER | "Fixed a race condition in cusolverDnXgeev using multiple host threads." | (no ID) | NOT AFFECTED: no cuSOLVER. |  | None. |
| 32 | cuSPARSE | "Fixed incorrect results from cusparseCsr2cscEx2 with zero dimensions." | CUSPARSE-2319 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 33 | cuSPARSE | "Fixed incorrect results from CSR SpMV with zero input dimensions." | CUSPARSE-1800 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 34 | cuSPARSE | "Fixed a bug in cusparseSparseToDense_bufferSize requesting excessive memory." | CUSPARSE-2352 | NOT AFFECTED: no cuSPARSE. |  | None. |
| 35 | nvJPEG | "Made nvJPEG more robust — no longer crashes on malformed bitstreams." | 5168024, 5133845, 5143450 | NOT AFFECTED: no nvJPEG. |  | None. |
| 36 | nvJPEG | "nvjpegEncodeYUV avoids reading outside allocated device memory." | 5133826 | NOT AFFECTED: no nvJPEG. |  | None. |
| 37 | nvJPEG | "Fixed a race condition during progressive encoding." | 5307748 | NOT AFFECTED: no nvJPEG. |  | None. |
| 38 | nvJPEG | "Fixed uninitialized read when encoding as 4:1:0 JPEG bitstreams." | 5308008 | NOT AFFECTED: no nvJPEG. |  | None. |
| 39 | NPP | "Fixed an issue in nppiFloodFillRange_8u_C1IR_Ctx flood fill operations." | 5141474 | NOT AFFECTED: no nppiFloodFill usage. |  | None. |
| 40 | NPP | "Resolved a bug in the nppiDebayer() API affecting color reconstruction." | 5138782 | NOT AFFECTED: no Bayer-to-RGB path in fork. |  | None. |

Summary Counts

| Severity | Count | Items |
|---|---|---|
| CRITICAL | 1 | Thread-reconvergence compiler bug [6156910]; already tracked in PR #64 scope, Containerfile still pins 13.2 |
| HIGH | 0 |  |
| MEDIUM | 1 | cuFFT FP-exception on multi-GPU plan query [5923044]; scope-guarded (no cuFFT usage) |
| LOW | 1 | NPP nppiCFAToRGB SSIM corruption [5192648]; scope-guarded (no call sites) |
| NOT AFFECTED | 37 | All remaining entries: the NVRTC, ptxas WGMMA, and __mul24 fixes plus the other library fixes (cuBLAS, cuSOLVER, cuSPARSE, cuFFT, nvJPEG, NPP); the fork uses none of these APIs |

Action Items

  1. Containerfile CUDA pin bump (CRITICAL): Upgrade cuda-toolkit-13-2 to cuda-toolkit-13-3 in dev/Containerfile (line 186). The reconvergence bug [6156910] is live in 13.2. No ADR is required (this is a pin-version change, not an architectural one). Run /cross-backend-diff after the upgrade.
  2. Dockerfile main image pin (CRITICAL): The Dockerfile FROM line references nvidia/cuda:13.2.1-devel-ubuntu24.04. Bump it to 13.3.x once that image is published.
  3. Annotate __syncwarp call in speed/speed_score.cu: The single __syncwarp call (line 279) is in a warp-uniform code path but sits near a conditional. After the 13.3 pin bump, run the GPU parity gate to confirm no delta.
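
Items 1 and 2 both come down to a version pin that is older than 13.3, so a small guard can keep the pin from regressing once it is bumped. The sketch below is an illustration under assumptions: cuda_pin_status is a hypothetical helper, and it recognises only the two pin spellings quoted above (cuda-toolkit-MAJOR-MINOR and nvidia/cuda:MAJOR.MINOR), taking the first one it finds in the file.

```shell
# Sketch: report whether a container file pins a CUDA toolkit older than 13.3.
# cuda_pin_status is a hypothetical helper, not an existing script in the repo.
cuda_pin_status() {
    file="$1"
    # First occurrence of either pin spelling; empty if the file has none.
    pin=$(grep -Eo 'cuda-toolkit-[0-9]+-[0-9]+|nvidia/cuda:[0-9]+\.[0-9]+' "$file" 2>/dev/null | head -n1)
    if [ -z "$pin" ]; then
        echo "no-pin"
        return 2
    fi
    # Pull out the two version numbers regardless of the separator used.
    major=$(printf '%s\n' "$pin" | grep -Eo '[0-9]+' | sed -n '1p')
    minor=$(printf '%s\n' "$pin" | grep -Eo '[0-9]+' | sed -n '2p')
    if [ "$major" -gt 13 ] || { [ "$major" -eq 13 ] && [ "$minor" -ge 3 ]; }; then
        echo "ok $major.$minor"
    else
        echo "stale $major.$minor"
        return 1
    fi
}
```

Run as `cuda_pin_status dev/Containerfile`, this would print `stale 13.2` and return a non-zero status for the current pin, which is enough to fail a CI step until the bump lands.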

Reproducer / Audit Commands

# 1. Fetch the release notes
# https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

# 2. Enumerate our CUDA sources
find core/src/feature/cuda core/src/cuda \( -name '*.cu' -o -name '*.cuh' \) | sort

# 3. Grep for __mul24 (confirmed zero)
grep -rn '__mul24\|__umul24' core/src/

# 4. Grep for NVRTC (confirmed zero)
grep -rn 'nvrtc\|NVRTC' core/src/ mcp-server/

# 5. Grep for WGMMA (confirmed zero)
grep -rn 'wgmma\|mma\.sync\|wmma' core/src/

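# 6. Count lines containing `if` as a rough proxy for divergence (worst offenders).
#    This does not measure nesting depth, and 'if\b' also matches lines such as #endif.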
grep -c 'if\b' core/src/feature/cuda/integer_vif/filter1d.cu        # 86
grep -c 'if\b' core/src/feature/cuda/float_adm/float_adm_score.cu   # 42
grep -c 'if\b' core/src/feature/cuda/integer_adm/adm_decouple.cu    # 18

# 7. Verify Containerfile CUDA version
grep 'cuda-toolkit' dev/Containerfile
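
The "confirmed zero" greps in steps 3 to 5 can be turned into a pass/fail check so that a later change introducing one of these APIs is caught automatically. This is a minimal sketch under assumptions: assert_no_hits is a hypothetical helper, and the patterns rely on the same GNU grep `\|` alternation as the commands above.

```shell
# Sketch: fail when a pattern that the audit expects to be absent shows up.
# assert_no_hits is a hypothetical helper, not an existing script in the repo.
# Usage: assert_no_hits LABEL PATTERN PATH...
assert_no_hits() {
    label="$1"
    pattern="$2"
    shift 2
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # aborting scripts that run under "set -e".
    hits=$(grep -rn -- "$pattern" "$@" 2>/dev/null || true)
    if [ -n "$hits" ]; then
        printf 'FAIL %s\n%s\n' "$label" "$hits"
        return 1
    fi
    printf 'PASS %s\n' "$label"
}
```

For example, `assert_no_hits mul24 '__mul24\|__umul24' core/src/` mirrors step 3, and `assert_no_hits wgmma 'wgmma\|mma\.sync\|wmma' core/src/` mirrors step 5; each prints the offending file and line numbers if the assumption no longer holds.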

References