Research-0734 — CUDA 13.3 Fix-List Deep Audit: Per-Issue Exposure Mapping

Date: 2026-05-28
Author: Lusoris (Claude agent)
Branch: docs/cuda-13.3-fix-list-deep-audit-20260528
Supersedes / extends: PR #64 (thread-reconvergence + __mul24 + C++23 unblock)

Executive Summary

This digest audits every "Fixed an issue where…" / "Resolved an issue where…" entry in the CUDA 13.3 (and 13.2 / 13.1 / 13.0) release notes and maps each to our codebase.

No new CRITICAL findings beyond the thread-reconvergence issue already tracked by PR #64. However, the following items merit attention:

CRITICAL (already tracked; still unresolved for our build env): The thread-reconvergence compiler bug [6156910] has been present in NVCC since 12.8 and is fixed only in NVCC 13.3. Our Containerfile currently pins cuda-toolkit-13-2, so every .cu kernel compiled with that toolchain may carry stale or corrupted register values after nested divergent branches. This affects all 25+ .cu kernels in core/src/feature/cuda/, most critically float_adm_score.cu, adm_decouple.cu, integer_vif/filter1d.cu, and integer_adm/adm_dwt2.cu, which all contain multi-level nested if divergence (18 to 86 if lines in the three files counted under Reproducer / Audit Commands).
MEDIUM — cuFFT floating-point exceptions [5923044]: cufftXtQueryPlan can raise floating-point exceptions when querying multi-GPU plans. Multi-GPU cuFFT plans are not used in the tree today (the cuFFT header is available, but nothing calls it), so this is not exposed; it is annotated below as scope-guarded.
NOT AFFECTED — ptxas WGMMA data race: Affects Hopper/Blackwell wgmma.mma_async instruction sequences only. Our kernels do not use wgmma instructions.
NOT AFFECTED — NVRTC + lld/mold initialisation failure [5020829]: Our build does not use NVRTC (runtime compilation).
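
The affected range stated above (present since 12.8, fixed in 13.3) can be turned into a quick toolchain check. The sketch below is illustrative only: `nvcc_is_affected` and `nvcc_release` are hypothetical helper names that do not exist in the tree, and the comparison looks at major.minor only, on the assumption that every 13.3 build carries the fix.

```shell
# Return 0 (true) when an NVCC release string falls inside the affected
# range described above: 12.8 <= major.minor < 13.3.
# Assumes input such as "13.2" or "13.2.1" (no leading zeros).
nvcc_is_affected() {
    _major="${1%%.*}"
    _rest="${1#*.}"
    _minor="${_rest%%.*}"
    # Encode major.minor as one integer, e.g. 13.2 -> 1302.
    _code=$(( $_major * 100 + $_minor ))
    [ "$_code" -ge 1208 ] && [ "$_code" -lt 1303 ]
}

# Read `nvcc --version` output on stdin and print the major.minor taken
# from its "release X.Y," token.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
```

With nvcc on PATH, something like `nvcc_is_affected "$(nvcc --version | nvcc_release)" && echo "toolchain in affected range"` could serve as an early warning in a build script.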

Severity Rubric

Severity	Meaning
CRITICAL	Silent score corruption (wrong VMAF value)
HIGH	Build failure, runtime crash
MEDIUM	Perf regression or fault in a path not yet taken
LOW	Cosmetic / memory leak / correctness in unexercised API

Full Per-Issue Classification Table

CUDA 13.3

#	Component	Fix (verbatim)	Bug ID	Our Exposure	Call-sites	Severity	Recommendation
1	NVCC compiler	"Fixed a compiler issue, present since CUDA 12.8, that could cause compiler-inserted thread reconvergence to fail and leave stale or corrupted values in registers."	6156910	AFFECTED — all `.cu` kernels with nested divergence compiled under NVCC ≤13.2.x. Kernels confirmed affected: `float_adm_score.cu` (42 `if`), `adm_decouple.cu` (18 `if`, 8 nested-conditional blocks), `integer_vif/filter1d.cu` (86 `if`), `integer_adm/adm_dwt2.cu`, `integer_adm/adm_cm.cu`, `float_vif_score.cu`, `integer_cambi/cambi_score.cu`, `float_motion_score.cu`, `integer_motion_score.cu`, `ssimulacra2_blur.cu`. Thread-block `__syncthreads()` calls occur in 12 of 25 `.cu` files; `__syncwarp()` in `speed/speed_score.cu`.	25 `.cu` files; worst: filter1d.cu (86 branches), float_adm_score.cu (42 branches)	CRITICAL	Bump `Containerfile` pin from `cuda-toolkit-13-2` to `cuda-toolkit-13-3`. Tracked by PR #64; this audit confirms the severity and scope. Run `cross-backend-diff` after upgrade to verify numeric parity.
2	NVCC / NVRTC	"Fixed an issue where applications that use NVRTC and are linked with LLVM `lld` or `mold` could fail to initialize the supported architecture list."	5020829	NOT AFFECTED — the fork does not use NVRTC. All kernels are compiled AOT by NVCC. No `nvrtcCreate` / `nvrtcCompile` call sites found.	0	—	None.
3	ptxas (CUDA Tools)	"Fixed Data Race in WGMMA A/B Register Copy Propagation — ptxas could incorrectly copy-propagate across `wgmma.wait_group.sync.aligned N` and eliminate necessary `mov` instructions."	(no ID)	NOT AFFECTED — our kernels do not use `wgmma.mma_async` tensor-core instructions. No WGMMA/wmma/mma.sync usage found in `core/src/feature/cuda/` or `core/src/cuda/`.	0	—	None.
4	CUDA Math	"Fixed an issue where silent data corruption could occur when the CUDA Math API `__mul24()` intrinsic was called with compile-time constant inputs."	5807344	NOT AFFECTED — confirmed by PR #65 audit. Zero `__mul24` / `__umul24` references in our `.cu` / `.cuh` files.	0	—	None (already audited).
5	cuFFT	"Fixed an issue where `cufftXtQueryPlan` could result in floating-point exceptions when querying multi-GPU plans."	5923044	NOT AFFECTED TODAY (scope-guarded) — fork does not use cuFFT. No `cufft` API calls in source tree.	0	MEDIUM	Annotate in future if cuFFT is added.
6	cuSPARSE	"Fixed a memory leak in `SpMVOp` when `destroy_lrb()` was called."	5974043	NOT AFFECTED — fork does not use cuSPARSE.	0	—	None.
7	nvJPEG	"Fixed an issue with boundary handling when decoding a region of interest with interpolation enabled."	(no ID)	NOT AFFECTED — fork does not use nvJPEG.	0	—	None.
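
The per-file `if` and `__syncthreads()` figures quoted in row 1 could be regenerated with a small inventory helper. This is a sketch under stated assumptions: `branch_inventory` is a hypothetical name, and it uses `grep -w 'if'`, which is slightly stricter than the `if\b` pattern used elsewhere in this audit (it skips `#endif` and `#elif`), so its counts may come out lower than the numbers in the table. Both approaches count matching lines, not nesting depth.

```shell
# Print "<if-lines> <__syncthreads-lines> <path>" for every .cu file under
# the given directory, most branch-heavy files first.
# Line counts are only a rough proxy for divergence.
branch_inventory() {
    find "$1" -name '*.cu' | while IFS= read -r f; do
        # grep -c exits 1 when the count is 0; "|| true" keeps going.
        branches=$(grep -cw 'if' "$f" || true)
        syncs=$(grep -c '__syncthreads' "$f" || true)
        printf '%s %s %s\n' "$branches" "$syncs" "$f"
    done | sort -k1,1nr
}
```

Run from the repository root as `branch_inventory core/src/feature/cuda` to get a ranked list for triage.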

CUDA 13.2 / 13.2 Update 1

#	Component	Fix (verbatim)	Bug ID	Our Exposure	Severity	Recommendation
8	cuBLAS	"Fixed an issue in `cublasLtMatmulAlgoGetHeuristic()` that could result in no algorithm candidates" (Grouped GEMM, Blackwell).	CUB-9657	NOT AFFECTED — no cuBLAS usage in fork.	—	None.
9	cuBLAS	"Fixed an issue … could cause FP8 kernels to hang."	CUB-9627	NOT AFFECTED — no cuBLAS.	—	None.
10	cuBLAS	"Fixed an issue … incorrect results when C broadcasting was used (`LDC = 0`)."	5845724	NOT AFFECTED — no cuBLAS.	—	None.
11	cuBLAS	"Fixed integer overflow bug in complex, emulated FP64 matrix multiplication."	5720478	NOT AFFECTED — no cuBLAS.	—	None.
12	cuBLAS	"Fixed concurrent execution issue with Tensor Memory."	5807900	NOT AFFECTED — no cuBLAS.	—	None.
13	CUDA Math	Known issue note: `__mul24()` silent data corruption (fix delivered in 13.3).	5807344	NOT AFFECTED — zero `__mul24` call sites (PR #65 audit confirmed).	—	None.
14	cuSPARSE	"Fixed an issue that caused performance regressions in BSR SpMM."	5860241	NOT AFFECTED — no cuSPARSE.	—	None.

CUDA 13.1 / 13.1 Update 1

#	Component	Fix (verbatim)	Bug ID	Our Exposure	Severity	Recommendation
15	cuBLAS	"Fixed missing memory initialization in `cublasCreate()` emulation handling."	CUB-9302	NOT AFFECTED — no cuBLAS.	—	None.
16	cuBLAS	"Fixed FP8 matmuls failing on multi-device Blackwell GeForce systems."	CUB-9487	NOT AFFECTED — no cuBLAS.	—	None.
17	cuBLAS	"Fixed an issue where fixed point emulation with 7 mantissa bits or less could trigger failures."	5692684	NOT AFFECTED — no cuBLAS.	—	None.
18	cuBLAS	"Fixed `cublasLtMatmul` with FP8 arguments incorrectly requiring 16-byte aligned scaling factors."	5728938	NOT AFFECTED — no cuBLAS.	—	None.
19	cuSOLVER	"Fixed a bug preventing users from changing algorithm for `cusolverDnXsyevBatched`."	5539844	NOT AFFECTED — no cuSOLVER.	—	None.
20	cuSPARSE	"Fixed an accuracy issue in mixed-precision CSR/COO SpMM computations."	CUSPARSE-2349	NOT AFFECTED — no cuSPARSE.	—	None.
21	cuSPARSE	"Fixed an issue in CSR SpMM computations with a high number of columns."	CUSPARSE-2301	NOT AFFECTED — no cuSPARSE.	—	None.
22	cuSPARSE	"Fixed potential issues with unaligned pointers in `cusparseCsr2cscEx2`."	CUSPARSE-2380	NOT AFFECTED — no cuSPARSE.	—	None.
23	cuSPARSE	"Fixed a determinism issue in CSR `cusparseSpMM` ALG3."	CUSPARSE-2612	NOT AFFECTED — no cuSPARSE.	—	None.
24	cuSPARSE	"Fixed a potential race condition when dynamically loading driver APIs."	CUSPARSE-2764	NOT AFFECTED — no cuSPARSE.	—	None.
25	cuFFT	"Fixed a correctness issue affecting half/bfloat16 precision size-1 strided transforms."	(no ID)	NOT AFFECTED — no cuFFT.	—	None.
26	nvJPEG	"Fixed nvJPEG's lossless JPEG 92 implementation handling comment markers."	5484797	NOT AFFECTED — no nvJPEG.	—	None.
27	nvJPEG	"Reduced nvJPEG Encoder initialization time on Thor."	5533951	NOT AFFECTED — no nvJPEG.	—	None.
28	NPP	"Fixed an issue in `nppiCFAToRGB_8u_C1C3R()` affecting SSIM validation."	5192648	POSSIBLY AFFECTED — NPP SSIM path: if the fork ever uses `nppiCFAToRGB` upstream of an SSIM kernel, this could silently corrupt SSIM inputs. Grep shows no `nppiCFAToRGB` call in our source tree; scope-guarded.	LOW	Monitor if NPP pixel-format conversion is added.

CUDA 13.0 / 13.0 Updates

#	Component	Fix (verbatim)	Bug ID	Our Exposure	Severity	Recommendation
29	cuBLAS	"Fixed undefined behavior from dereferencing `nullptr`… when passing uninitialized `Cdesc`."	CUB-8911	NOT AFFECTED — no cuBLAS.	—	None.
30	cuBLAS	"Fixed an issue where some `cublasSsyrkx` kernels produced incorrect results on Blackwell."	CUB-8846	NOT AFFECTED — no cuBLAS.	—	None.
31	cuSOLVER	"Fixed a race condition in `cusolverDnXgeev` using multiple host threads."	(no ID)	NOT AFFECTED — no cuSOLVER.	—	None.
32	cuSPARSE	"Fixed incorrect results from `cusparseCsr2cscEx2` with zero dimensions."	CUSPARSE-2319	NOT AFFECTED — no cuSPARSE.	—	None.
33	cuSPARSE	"Fixed incorrect results from CSR SpMV with zero input dimensions."	CUSPARSE-1800	NOT AFFECTED — no cuSPARSE.	—	None.
34	cuSPARSE	"Fixed a bug in `cusparseSparseToDense_bufferSize` requesting excessive memory."	CUSPARSE-2352	NOT AFFECTED — no cuSPARSE.	—	None.
35	nvJPEG	"Made nvJPEG more robust — no longer crashes on malformed bitstreams."	5168024, 5133845, 5143450	NOT AFFECTED — no nvJPEG.	—	None.
36	nvJPEG	"`nvjpegEncodeYUV` avoids reading outside allocated device memory."	5133826	NOT AFFECTED — no nvJPEG.	—	None.
37	nvJPEG	"Fixed a race condition during progressive encoding."	5307748	NOT AFFECTED — no nvJPEG.	—	None.
38	nvJPEG	"Fixed uninitialized read when encoding as 4:1:0 JPEG bitstreams."	5308008	NOT AFFECTED — no nvJPEG.	—	None.
39	NPP	"Fixed an issue in `nppiFloodFillRange_8u_C1IR_Ctx` flood fill operations."	5141474	NOT AFFECTED — no `nppiFloodFill` usage.	—	None.
40	NPP	"Resolved a bug in the `nppiDebayer()` API affecting color reconstruction."	5138782	NOT AFFECTED — no Bayer-to-RGB path in fork.	—	None.

Summary Counts

Severity	Count	Items
CRITICAL	1	Thread-reconvergence compiler bug [6156910] — already tracked in PR #64 scope, Containerfile still pins 13.2
HIGH	0	—
MEDIUM	1	cuFFT FP-exception on multi-GPU plan query [5923044] — scope-guarded (no cuFFT usage)
LOW	1	NPP `nppiCFAToRGB` SSIM corruption [5192648] — scope-guarded (no call sites)
NOT AFFECTED	37	All remaining fixes: libraries (cuBLAS, cuSOLVER, cuSPARSE, cuFFT, nvJPEG, NPP) plus NVRTC, ptxas WGMMA, and `__mul24` — fork uses none of these APIs

Action Items

Containerfile CUDA pin bump (CRITICAL): Upgrade cuda-toolkit-13-2 → cuda-toolkit-13-3 in dev/Containerfile (line 186). Reconvergence bug [6156910] is live in 13.2. ADR not required (pin-version change, not architectural). Run /cross-backend-diff after upgrade.
Dockerfile main image pin (CRITICAL): Dockerfile FROM line references nvidia/cuda:13.2.1-devel-ubuntu24.04. This should be bumped to 13.3.x when that image is published.
Annotate __syncwarp call in speed/speed_score.cu: The single __syncwarp call (line 279) is in a warp-uniform code path but sits near a conditional. After the 13.3 pin bump, run the GPU parity gate to confirm no delta.
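
The two pin bumps above lend themselves to a guard that fails while either file still references a pre-13.3 toolchain. The following is a minimal sketch, not an existing script in the tree: `toolkit_pin_ok` and `base_image_ok` are hypothetical names, the package pin is assumed to appear as `cuda-toolkit-<major>-<minor>`, and the base image is assumed to be declared as a plain `FROM nvidia/cuda:<version>...` line at the start of a line.

```shell
# Succeed only when every cuda-toolkit-<major>-<minor> pin in the file is
# at least 13-3. A file with no pin at all fails closed.
toolkit_pin_ok() {
    pins=$(grep -o 'cuda-toolkit-[0-9][0-9]*-[0-9][0-9]*' "$1") || return 1
    for pin in $pins; do
        ver="${pin#cuda-toolkit-}"
        major="${ver%%-*}"
        minor="${ver#*-}"
        [ $(( $major * 100 + $minor )) -ge 1303 ] || return 1
    done
}

# Succeed only when the first "FROM nvidia/cuda:X.Y..." line is 13.3 or newer.
base_image_ok() {
    tag=$(sed -n 's|^FROM nvidia/cuda:\([0-9][0-9]*\.[0-9][0-9]*\).*|\1|p' "$1" | head -n 1)
    [ -n "$tag" ] || return 1
    [ $(( ${tag%%.*} * 100 + ${tag#*.} )) -ge 1303 ]
}
```

A call such as `toolkit_pin_ok dev/Containerfile && base_image_ok Dockerfile` would keep failing until both pins are bumped, which makes the outstanding work visible in CI.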

Reproducer / Audit Commands

# 1. Fetch the release notes
# https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

# 2. Enumerate our CUDA sources
find core/src/feature/cuda core/src/cuda \( -name '*.cu' -o -name '*.cuh' \) | sort

# 3. Grep for __mul24 (confirmed zero)
grep -rn '__mul24\|__umul24' core/src/

# 4. Grep for NVRTC (confirmed zero)
grep -rn 'nvrtc\|NVRTC' core/src/ mcp-server/

# 5. Grep for WGMMA (confirmed zero)
grep -rn 'wgmma\|mma\.sync\|wmma' core/src/

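# 6. Count lines containing `if` in the worst offenders (a proxy for divergence:
#    this counts matching lines, including #endif / #elif, not nesting depth)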
grep -c 'if\b' core/src/feature/cuda/integer_vif/filter1d.cu        # 86
grep -c 'if\b' core/src/feature/cuda/float_adm/float_adm_score.cu   # 42
grep -c 'if\b' core/src/feature/cuda/integer_adm/adm_decouple.cu    # 18

# 7. Verify Containerfile CUDA version
grep 'cuda-toolkit' dev/Containerfile
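
The "confirmed zero" greps in steps 3 to 5 can be wrapped so that a future match fails loudly instead of relying on someone reading empty output. This is a sketch only: `assert_absent` is a hypothetical helper, not something that exists in the tree.

```shell
# Return 0 when the pattern is absent under the given paths, 1 when it is
# found (matching lines are printed for triage), 2 when grep itself fails,
# for example because a path is missing.
assert_absent() {
    pattern="$1"
    shift
    status=0
    grep -rn -- "$pattern" "$@" || status=$?
    case "$status" in
        0) echo "UNEXPECTED: '$pattern' found" >&2; return 1 ;;
        1) echo "ok: no '$pattern' matches" ;;
        *) echo "grep failed for '$pattern'" >&2; return 2 ;;
    esac
}

# Invocations mirroring steps 3-5 above (run from the repo root):
#   assert_absent '__mul24\|__umul24' core/src/
#   assert_absent 'nvrtc\|NVRTC' core/src/ mcp-server/
#   assert_absent 'wgmma\|mma\.sync\|wmma' core/src/
```

Distinguishing grep's "no match" exit status from a real error avoids reporting a clean audit when a directory has simply been renamed.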

References

CUDA Toolkit 13.3 Release Notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
PR #64: thread-reconvergence + __mul24 + C++23 unblock
PR #65: __mul24 audit (zero exposure confirmed)
Research-0734: this document