Research-0734 - CUDA 13.3 Fix-List Deep Audit: Per-Issue Exposure Mapping
- Date: 2026-05-28
- Author: Lusoris (Claude agent)
- Branch: `docs/cuda-13.3-fix-list-deep-audit-20260528`
- Supersedes / extends: PR #64 (thread-reconvergence + `__mul24` + C++23 unblock)
Executive Summary
This digest audits every "Fixed an issue where…" / "Resolved an issue where…" entry in the CUDA 13.3 (and 13.2 / 13.1 / 13.0) release notes and maps each to our codebase.
No new CRITICAL findings beyond the thread-reconvergence issue already tracked by PR #64. The following items merit attention; only the first requires action today:
- CRITICAL (already tracked, still unresolved for our build env): The thread-reconvergence compiler bug [6156910] has been present in NVCC since 12.8. Our `Containerfile` currently pins `cuda-toolkit-13-2`, so every `.cu` kernel compiled with that toolchain may produce wrong register values after nested-divergent branches. This affects all 25+ `.cu` kernels in `core/src/feature/cuda/`, most critically `float_adm_score.cu`, `adm_decouple.cu`, `integer_vif/filter1d.cu`, and `integer_adm/adm_dwt2.cu`, which all contain multi-level nested `if` divergence (up to 86 `if` occurrences per file). The fix ships only in NVCC 13.3.
- HIGH (cuFFT floating-point exceptions [5923044]): If multi-GPU cuFFT plans are ever used (not currently in tree, but the cuFFT header is available), `cufftXtQueryPlan` can generate FP exceptions. Not exposed today; annotated below as scope-guarded.
- MEDIUM (ptxas WGMMA data race): Affects Hopper/Blackwell `wgmma.mma_async` instruction sequences only. Our kernels do not use `wgmma` instructions; not exposed.
- MEDIUM (NVRTC + lld/mold initialisation failure [5020829]): Our build does not use NVRTC (runtime compilation). Not exposed.
Severity Rubric
| Severity | Meaning |
|---|---|
| CRITICAL | Silent score corruption (wrong VMAF value) |
| HIGH | Build failure, runtime crash |
| MEDIUM | Perf regression or fault in a path not yet taken |
| LOW | Cosmetic / memory leak / correctness in unexercised API |
Full Per-Issue Classification Table
CUDA 13.3
| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Call-sites | Severity | Recommendation |
|---|---|---|---|---|---|---|---|
| 1 | NVCC compiler | "Fixed a compiler issue, present since CUDA 12.8, that could cause compiler-inserted thread reconvergence to fail and leave stale or corrupted values in registers." | 6156910 | AFFECTED — all .cu kernels with nested divergence compiled under NVCC ≤13.2.x. Kernels confirmed affected: float_adm_score.cu (42 if), adm_decouple.cu (18 if, 8 nested-conditional blocks), integer_vif/filter1d.cu (86 if), integer_adm/adm_dwt2.cu, integer_adm/adm_cm.cu, float_vif_score.cu, integer_cambi/cambi_score.cu, float_motion_score.cu, integer_motion_score.cu, ssimulacra2_blur.cu. Thread-block __syncthreads() calls occur in 12 of 25 .cu files; __syncwarp() in speed/speed_score.cu. | 25 .cu files; worst: filter1d.cu (86 branches), float_adm_score.cu (42 branches) | CRITICAL | Bump Containerfile pin from cuda-toolkit-13-2 to cuda-toolkit-13-3. Tracked by PR #64; this audit confirms the severity and scope. Run cross-backend-diff after upgrade to verify numeric parity. |
| 2 | NVCC / NVRTC | "Fixed an issue where applications that use NVRTC and are linked with LLVM lld or mold could fail to initialize the supported architecture list." | 5020829 | NOT AFFECTED — the fork does not use NVRTC. All kernels are compiled AOT by NVCC. No nvrtcCreate / nvrtcCompile call sites found. | 0 | — | None. |
| 3 | ptxas (CUDA Tools) | "Fixed Data Race in WGMMA A/B Register Copy Propagation — ptxas could incorrectly copy-propagate across wgmma.wait_group.sync.aligned N and eliminate necessary mov instructions." | (no ID) | NOT AFFECTED — our kernels do not use wgmma.mma_async tensor-core instructions. No WGMMA/wmma/mma.sync usage found in core/src/feature/cuda/ or core/src/cuda/. | 0 | — | None. |
| 4 | CUDA Math | "Fixed an issue where silent data corruption could occur when the CUDA Math API __mul24() intrinsic was called with compile-time constant inputs." | 5807344 | NOT AFFECTED — confirmed by PR #65 audit. Zero __mul24 / __umul24 references in our .cu / .cuh files. | 0 | — | None (already audited). |
| 5 | cuFFT | "Fixed an issue where cufftXtQueryPlan could result in floating-point exceptions when querying multi-GPU plans." | 5923044 | NOT AFFECTED (scope-guarded): fork does not use cuFFT. No cufft API calls in source tree. | 0 | MEDIUM | Annotate in future if cuFFT is added. |
| 6 | cuSPARSE | "Fixed a memory leak in SpMVOp when destroy_lrb() was called." | 5974043 | NOT AFFECTED — fork does not use cuSPARSE. | 0 | — | None. |
| 7 | nvJPEG | "Fixed an issue with boundary handling when decoding a region of interest with interpolation enabled." | (no ID) | NOT AFFECTED — fork does not use nvJPEG. | 0 | — | None. |
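Whether item 1 applies depends on which NVCC release actually compiles the kernels, so a build script could check the compiler banner before building. The sketch below is one possible way to do that. The helper names `nvcc_release` and `has_reconvergence_fix` are hypothetical, the 13.3 threshold is taken from the release-note entry quoted in row 1, and the parsing assumes the usual `release MAJOR.MINOR` token in the `nvcc --version` output.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of the repository.

# Read `nvcc --version` text on stdin and print the release as MAJOR.MINOR.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1.\2/p' | head -n 1
}

# Return 0 when the MAJOR.MINOR release in $1 is 13.3 or newer,
# 1 when it is older, 2 when the argument is not in MAJOR.MINOR form.
has_reconvergence_fix() {
    case $1 in
        [0-9]*.[0-9]*) ;;
        *) return 2 ;;
    esac
    [ "${1%%.*}" -gt 13 ] || { [ "${1%%.*}" -eq 13 ] && [ "${1#*.}" -ge 3 ]; }
}

# Example use in a build script (assumes nvcc is on PATH):
#   release=$(nvcc --version | nvcc_release)
#   has_reconvergence_fix "$release" || echo "NVCC $release predates the 13.3 fix" >&2
```

A check like this complements the pin bump: the pin states the intended toolchain, while the banner reports the toolchain that actually ran.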
CUDA 13.2 / 13.2 Update 1
| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Severity | Recommendation |
|---|---|---|---|---|---|---|
| 8 | cuBLAS | "Fixed an issue in cublasLtMatmulAlgoGetHeuristic() that could result in no algorithm candidates" (Grouped GEMM, Blackwell). | CUB-9657 | NOT AFFECTED — no cuBLAS usage in fork. | — | None. |
| 9 | cuBLAS | "Fixed an issue … could cause FP8 kernels to hang." | CUB-9627 | NOT AFFECTED — no cuBLAS. | — | None. |
| 10 | cuBLAS | "Fixed an issue … incorrect results when C broadcasting was used (LDC = 0)." | 5845724 | NOT AFFECTED — no cuBLAS. | — | None. |
| 11 | cuBLAS | "Fixed integer overflow bug in complex, emulated FP64 matrix multiplication." | 5720478 | NOT AFFECTED — no cuBLAS. | — | None. |
| 12 | cuBLAS | "Fixed concurrent execution issue with Tensor Memory." | 5807900 | NOT AFFECTED — no cuBLAS. | — | None. |
| 13 | CUDA Math | Known issue note: __mul24() silent data corruption (fix delivered in 13.3). | 5807344 | NOT AFFECTED — zero __mul24 call sites (PR #65 audit confirmed). | — | None. |
| 14 | cuSPARSE | "Fixed an issue that caused performance regressions in BSR SpMM." | 5860241 | NOT AFFECTED — no cuSPARSE. | — | None. |
CUDA 13.1 / 13.1 Update 1
| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Severity | Recommendation |
|---|---|---|---|---|---|---|
| 15 | cuBLAS | "Fixed missing memory initialization in cublasCreate() emulation handling." | CUB-9302 | NOT AFFECTED — no cuBLAS. | — | None. |
| 16 | cuBLAS | "Fixed FP8 matmuls failing on multi-device Blackwell GeForce systems." | CUB-9487 | NOT AFFECTED — no cuBLAS. | — | None. |
| 17 | cuBLAS | "Fixed an issue where fixed point emulation with 7 mantissa bits or less could trigger failures." | 5692684 | NOT AFFECTED — no cuBLAS. | — | None. |
| 18 | cuBLAS | "Fixed cublasLtMatmul with FP8 arguments incorrectly requiring 16-byte aligned scaling factors." | 5728938 | NOT AFFECTED — no cuBLAS. | — | None. |
| 19 | cuSOLVER | "Fixed a bug preventing users from changing algorithm for cusolverDnXsyevBatched." | 5539844 | NOT AFFECTED — no cuSOLVER. | — | None. |
| 20 | cuSPARSE | "Fixed an accuracy issue in mixed-precision CSR/COO SpMM computations." | CUSPARSE-2349 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 21 | cuSPARSE | "Fixed an issue in CSR SpMM computations with a high number of columns." | CUSPARSE-2301 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 22 | cuSPARSE | "Fixed potential issues with unaligned pointers in cusparseCsr2cscEx2." | CUSPARSE-2380 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 23 | cuSPARSE | "Fixed a determinism issue in CSR cusparseSpMM ALG3." | CUSPARSE-2612 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 24 | cuSPARSE | "Fixed a potential race condition when dynamically loading driver APIs." | CUSPARSE-2764 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 25 | cuFFT | "Fixed a correctness issue affecting half/bfloat16 precision size-1 strided transforms." | (no ID) | NOT AFFECTED — no cuFFT. | — | None. |
| 26 | nvJPEG | "Fixed nvJPEG's lossless JPEG 92 implementation handling comment markers." | 5484797 | NOT AFFECTED — no nvJPEG. | — | None. |
| 27 | NPP | "Reduced nvJPEG Encoder initialization time on Thor." | 5533951 | NOT AFFECTED — no NPP encoder. | — | None. |
| 28 | NPP | "Fixed an issue in nppiCFAToRGB_8u_C1C3R() affecting SSIM validation." | 5192648 | POSSIBLY AFFECTED — NPP SSIM path: if the fork ever uses nppiCFAToRGB upstream of an SSIM kernel, this could silently corrupt SSIM inputs. Grep shows no nppiCFAToRGB call in our source tree; scope-guarded. | LOW | Monitor if NPP pixel-format conversion is added. |
CUDA 13.0 / 13.0 Updates
| # | Component | Fix (verbatim) | Bug ID | Our Exposure | Severity | Recommendation |
|---|---|---|---|---|---|---|
| 29 | cuBLAS | "Fixed undefined behavior from dereferencing nullptr… when passing uninitialized Cdesc." | CUB-8911 | NOT AFFECTED — no cuBLAS. | — | None. |
| 30 | cuBLAS | "Fixed an issue where some cublasSsyrkx kernels produced incorrect results on Blackwell." | CUB-8846 | NOT AFFECTED — no cuBLAS. | — | None. |
| 31 | cuSOLVER | "Fixed a race condition in cusolverDnXgeev using multiple host threads." | (no ID) | NOT AFFECTED — no cuSOLVER. | — | None. |
| 32 | cuSPARSE | "Fixed incorrect results from cusparseCsr2cscEx2 with zero dimensions." | CUSPARSE-2319 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 33 | cuSPARSE | "Fixed incorrect results from CSR SpMV with zero input dimensions." | CUSPARSE-1800 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 34 | cuSPARSE | "Fixed a bug in cusparseSparseToDense_bufferSize requesting excessive memory." | CUSPARSE-2352 | NOT AFFECTED — no cuSPARSE. | — | None. |
| 35 | nvJPEG | "Made nvJPEG more robust — no longer crashes on malformed bitstreams." | 5168024, 5133845, 5143450 | NOT AFFECTED — no nvJPEG. | — | None. |
| 36 | nvJPEG | "nvjpegEncodeYUV avoids reading outside allocated device memory." | 5133826 | NOT AFFECTED — no nvJPEG. | — | None. |
| 37 | nvJPEG | "Fixed a race condition during progressive encoding." | 5307748 | NOT AFFECTED — no nvJPEG. | — | None. |
| 38 | nvJPEG | "Fixed uninitialized read when encoding as 4:1:0 JPEG bitstreams." | 5308008 | NOT AFFECTED — no nvJPEG. | — | None. |
| 39 | NPP | "Fixed an issue in nppiFloodFillRange_8u_C1IR_Ctx flood fill operations." | 5141474 | NOT AFFECTED — no nppiFloodFill usage. | — | None. |
| 40 | NPP | "Resolved a bug in the nppiDebayer() API affecting color reconstruction." | 5138782 | NOT AFFECTED — no Bayer-to-RGB path in fork. | — | None. |
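Most rows above rest on the claim that the fork calls none of the CUDA libraries. The audit commands in this document cover `__mul24`, NVRTC and WGMMA; a comparable check for the library prefixes might look like the sketch below. The function name `scan_cuda_library_use` and the file-extension list are assumptions, and because this is a plain text match it will also report comments and string literals, so any hit needs manual review.

```shell
#!/bin/sh
# Sketch only: the function name and extension list are assumptions.

# Print every line under the given paths that mentions a cuBLAS, cuFFT,
# cuSPARSE, cuSOLVER, nvJPEG or NPP identifier or header.
# Exit status follows grep: 0 if something matched, 1 if nothing did.
scan_cuda_library_use() {
    grep -rnE \
        --include='*.cu' --include='*.cuh' \
        --include='*.c' --include='*.cpp' \
        --include='*.h' --include='*.hpp' \
        'cublas|cufft|cusparse|cusolver|nvjpeg|nppi[A-Z]|npps[A-Z]|<npp' "$@"
}

# Example use from the repository root:
#   scan_cuda_library_use core/src || echo "no CUDA library references found"
```

An empty result supports the NOT AFFECTED rows; it does not prove them, since a library could still be pulled in through a build-system link flag that a source grep never sees.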
Summary Counts
| Severity | Count | Items |
|---|---|---|
| CRITICAL | 1 | Thread-reconvergence compiler bug [6156910] — already tracked in PR #64 scope, Containerfile still pins 13.2 |
| HIGH | 0 | — |
| MEDIUM | 1 | cuFFT FP-exception on multi-GPU plan query [5923044] — scope-guarded (no cuFFT usage) |
| LOW | 1 | NPP nppiCFAToRGB SSIM corruption [5192648] — scope-guarded (no call sites) |
| NOT AFFECTED | 37 | All remaining fixes (NVRTC, ptxas WGMMA, __mul24, cuBLAS, cuSOLVER, cuSPARSE, cuFFT, nvJPEG, NPP): the fork uses none of these APIs or instructions |
Action Items
- Containerfile CUDA pin bump (CRITICAL): Upgrade `cuda-toolkit-13-2` → `cuda-toolkit-13-3` in `dev/Containerfile` (line 186). Reconvergence bug [6156910] is live in 13.2. ADR not required (pin-version change, not architectural). Run `/cross-backend-diff` after upgrade.
- Dockerfile main image pin (CRITICAL): The `Dockerfile` FROM line references `nvidia/cuda:13.2.1-devel-ubuntu24.04`. This should be bumped to `13.3.x` when that image is published.
- Annotate `__syncwarp` call in `speed/speed_score.cu`: The single `__syncwarp` call (line 279) is in a warp-uniform code path but sits near a conditional. After the 13.3 pin bump, run the GPU parity gate to confirm no delta.
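The two pin bumps above could be guarded mechanically so that a later edit does not silently reintroduce a pre-13.3 toolchain. The following is a minimal sketch of such a guard, not an existing repository script: the function names `list_cuda_pins` and `check_cuda_pins` are hypothetical, and the patterns assume the pins are spelled `cuda-toolkit-MAJOR-MINOR` and `nvidia/cuda:MAJOR.MINOR...` as quoted in this document.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Print each CUDA pin found in the given files as FILE:LINE:MAJOR.MINOR.
list_cuda_pins() {
    grep -nHoE 'cuda-toolkit-[0-9]+-[0-9]+|nvidia/cuda:[0-9]+\.[0-9]+' "$@" |
        sed -E 's#(cuda-toolkit-|nvidia/cuda:)([0-9]+)[-.]([0-9]+)$#\2.\3#'
}

# Exit 0 if every pin is 13.3 or newer, 1 if any pin is older,
# 2 if no pin was found at all (treated as an error, not a pass).
check_cuda_pins() {
    list_cuda_pins "$@" | awk -F: '
        {
            split($NF, v, "[.]")
            if (v[1] + 0 < 13 || (v[1] + 0 == 13 && v[2] + 0 < 3)) {
                print "stale CUDA pin: " $0 > "/dev/stderr"
                stale = 1
            }
        }
        END {
            if (NR == 0) { print "no CUDA pin found" > "/dev/stderr"; exit 2 }
            exit stale + 0
        }'
}

# Example use from the repository root:
#   check_cuda_pins dev/Containerfile Dockerfile
```

Treating "no pin found" as a failure is a deliberate choice in this sketch: if the pin is ever renamed or moved, the guard fails loudly instead of passing vacuously.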
Reproducer / Audit Commands
# 1. Fetch the release notes
# https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
# 2. Enumerate our CUDA sources
find core/src/feature/cuda core/src/cuda \( -name '*.cu' -o -name '*.cuh' \) | sort
# 3. Grep for __mul24 (confirmed zero)
grep -rn '__mul24\|__umul24' core/src/
# 4. Grep for NVRTC (confirmed zero)
grep -rn 'nvrtc\|NVRTC' core/src/ mcp-server/
# 5. Grep for WGMMA (confirmed zero)
grep -rn 'wgmma\|mma\.sync\|wmma' core/src/
# 6. Count lines containing `if` (rough proxy for nested divergence; also matches #if/#endif lines and comments)
grep -c 'if\b' core/src/feature/cuda/integer_vif/filter1d.cu # 86
grep -c 'if\b' core/src/feature/cuda/float_adm/float_adm_score.cu # 42
grep -c 'if\b' core/src/feature/cuda/integer_adm/adm_decouple.cu # 18
# 7. Verify Containerfile CUDA version
grep 'cuda-toolkit' dev/Containerfile
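The `grep -c 'if\b'` counts in step 6 are line counts, and they include preprocessor lines and commented-out code. A slightly tighter estimate could be obtained with the sketch below, which drops preprocessor lines, strips `//` comments, and counts `if (` tokens. The function names are hypothetical, and block comments (`/* ... */`) and string literals are not handled, so the result is still an approximation of the number of branch sites rather than a measure of nesting depth.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Print the number of `if (` tokens in file $1, ignoring preprocessor
# lines and the text after `//` on each line.
count_if_statements() {
    sed -e '/^[[:space:]]*#/d' -e 's://.*$::' "$1" |
        grep -oE '(^|[^A-Za-z0-9_])if[[:space:]]*\(' | wc -l | tr -d ' '
}

# Print "COUNT<TAB>FILE" for every .cu file under the given directories,
# highest count first.
rank_by_if_count() {
    find "$@" -name '*.cu' -print | while IFS= read -r f; do
        printf '%s\t%s\n' "$(count_if_statements "$f")" "$f"
    done | sort -rn
}

# Example use from the repository root:
#   rank_by_if_count core/src/feature/cuda | head
```

The ranking is only meant to prioritise which kernels to inspect first after the toolchain bump; it says nothing about whether a given branch is actually divergent within a warp.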
References
- CUDA Toolkit 13.3 Release Notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- PR #64: thread-reconvergence + `__mul24` + C++23 unblock
- PR #65: `__mul24` audit (zero exposure confirmed)
- Research-0734: this document