Research digest 0734: cuDNN version audit for ONNX Runtime / tiny-AI inference
Date: 2026-05-28
Triggered by: PR #64 note that cuDNN ships on its own cadence, separate from the CUDA Toolkit, so the ORT cuDNN dependency must be tracked independently.
Scope: Identify which cuDNN version our ONNX Runtime build depends on; check the latest cuDNN release for silent-corruption / performance fixes relevant to our INT8/FP16 small-CNN inference workload.
1. Our cuDNN exposure: summary verdict NONE (CPU-only ORT)
1.1 C library (core/src/dnn/)
dev/Containerfile installs ONNX Runtime from the CPU-only upstream tarball:
ARG ORT_VERSION=1.26.0
RUN curl -fsSL \
"https://github.com/microsoft/onnxruntime/releases/download/v${ORT_VERSION}/onnxruntime-linux-x64-${ORT_VERSION}.tgz" \
...
The artifact onnxruntime-linux-x64-<version>.tgz (no -gpu suffix) is the CPU-only build. It does not link cuDNN. ort_backend.c can append the CUDA execution provider at runtime via SessionOptionsAppendExecutionProvider_CUDA, but that code path is reachable only when the user selects VMAF_DNN_DEVICE_CUDA and a GPU-capable ORT build is present at runtime. The default container and the C library build therefore have no runtime cuDNN dependency.
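The naming rule above can be checked mechanically. The following is a minimal sketch, assuming that only GPU tarballs carry a standalone gpu token in their file name, as this section describes; the helper name is_gpu_artifact is hypothetical and does not exist in the repository.

```python
import re

# Assumption: only GPU builds of the ORT release tarball contain a
# standalone "gpu" token in the file name (the digest's reading of the name).
_GPU_TOKEN = re.compile(r"(^|[-_.])gpu([-_.]|$)", re.IGNORECASE)


def is_gpu_artifact(filename: str) -> bool:
    """Return True if the artifact name contains a standalone 'gpu' token."""
    return bool(_GPU_TOKEN.search(filename))


cpu_name = "onnxruntime-linux-x64-1.26.0.tgz"
print(cpu_name, "->", "GPU" if is_gpu_artifact(cpu_name) else "CPU-only")
```

A check like this only inspects the file name; it says nothing about what the archive actually links against.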
1.2 Python / training layer (ai/)
ai/pyproject.toml pins:
"onnxruntime>=1.26.0,<2.0", # container pins 1.26.0 (ADR-0568); floor kept at 1.20 for host installs
This resolves to the onnxruntime PyPI package (CPU-only). The GPU variant (onnxruntime-gpu) is not listed in any pyproject.toml, requirements.txt, or Containerfile in tree.
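The absence of a GPU variant in a dependency list can also be tested programmatically. This sketch assumes requirement strings in the usual PEP 508 style and normalizes names the way PyPI does; the names GPU_DISTRIBUTIONS, requirement_name, and gpu_requirements are illustrative, not part of the tree.

```python
import re

# Hypothetical set: distribution names treated as GPU variants of ORT.
GPU_DISTRIBUTIONS = {"onnxruntime-gpu"}


def requirement_name(spec: str) -> str:
    """Extract the normalized distribution name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", spec)
    if match is None:
        return ""
    # PyPI-style normalization: runs of '-', '_', '.' collapse to one '-'.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def gpu_requirements(specs):
    """Return the requirement strings that name a GPU build of ORT."""
    return [s for s in specs if requirement_name(s) in GPU_DISTRIBUTIONS]


deps = ["onnxruntime>=1.26.0,<2.0", "numpy"]
print(gpu_requirements(deps))  # nothing is flagged for the pin quoted above
```

Matching on the normalized name rather than on a substring avoids false positives from comments or unrelated packages that merely mention the GPU wheel.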
The single cuDNN mention in the codebase is a documentation comment in ai/scripts/measure_quant_drop_per_ep.py:
# cuda -> ORT CUDAExecutionProvider (NVIDIA, requires
# CUDA 12 / cuDNN 9 runtime libs and an ORT-GPU wheel).
This is a usage note for an optional manual invocation; the script's installed dependencies do not pull onnxruntime-gpu.
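Whether a given Python environment exposes the CUDA execution provider can be probed at runtime. The sketch below uses onnxruntime.get_available_providers() and degrades to an empty list when the package is not installed; the wrapper function names are hypothetical.

```python
def available_providers():
    """Return ORT's execution providers, or [] if onnxruntime is absent."""
    try:
        import onnxruntime as ort  # optional dependency
    except ImportError:
        return []
    return list(ort.get_available_providers())


def cuda_ep_available(providers) -> bool:
    """True if the CUDA execution provider is among the listed providers."""
    return "CUDAExecutionProvider" in providers


providers = available_providers()
print("providers:", providers)
print("CUDA EP present:", cuda_ep_available(providers))
```

As far as is known, the provider list reflects what the installed package was built with; it should not be taken as proof that the CUDA and cuDNN runtime libraries can actually be loaded on the host.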
1.3 Containerfile CUDA install
Stage 2 of dev/Containerfile installs cuda-toolkit (the meta-package, currently CUDA 13.2 — explicitly commented as "CUDA 13.2"). The meta-package does not pull libcudnn*; cuDNN is a separate optional package on NVIDIA's apt repository. No apt-get install libcudnn* line exists anywhere in the Containerfile.
2. Latest cuDNN version
cuDNN 9.22.0 (May 2026). Confirmed via https://developer.nvidia.com/cudnn-downloads.
Previous release: cuDNN 9.21.1.
Per ORT's CUDA EP compatibility table, ORT ≥ 1.18.x with CUDA 12.x requires cuDNN 9.x. ORT 1.26.0 (CUDA 12 variant, if ever installed) would therefore require cuDNN ≥ 9.0.
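The major-version requirement stated above reduces to a one-line comparison. This is a sketch only: the function names are hypothetical, and the required major version of 9 is taken from this digest rather than queried from ORT.

```python
def parse_version(text: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH' (missing parts default to 0) into ints."""
    parts = [int(p) for p in text.split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def cudnn_major_ok(cudnn_version: str, required_major: int = 9) -> bool:
    """Check only the major version, per the 'cuDNN 9.x' requirement above."""
    return parse_version(cudnn_version)[0] == required_major


for candidate in ("8.9.7", "9.0.0", "9.22.0"):
    print(candidate, "compatible:", cudnn_major_ok(candidate))
```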
3. cuDNN 9.22.0 release notes: verbatim excerpts
Source: https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html (retrieved 2026-05-28).
3.1 Known issues in 9.22.0 (excerpted verbatim)
"For certain convolution-related workloads, memory allocations are made that are not released until process termination."
"Performance of some matrix multiplication operations through the cuBLASLt engine might be slower when run with CUDA Toolkit 13.2 Update 1 compared with earlier CUDA Toolkit 13.x versions."
"Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices."
"Scaled dot-product attention bprop is not supported when both key/value have sequence length of 1."
3.2 Fixed in 9.21.1 (excerpted verbatim)
"An issue where scaled dot-product attention causal forward could hang with extremely long sequence lengths on Blackwell-architecture GPUs has been fixed."
"An issue where scaled dot-product attention forward could crash during execution plan creation when dropout was enabled on Blackwell-architecture GPUs has been fixed."
"A performance regression when using the SM count attribute with Layer and RMS Normalization has been fixed."
3.3 Fixed in an earlier 9.x release (from web search aggregation)
"Performance regression for convolutional workloads on Turing-architecture GPUs introduced in cuDNN 9.17.0 is now fixed."
"Updated runtime fusion engine heuristics for INT8 for both Convolution and Matmul operations on the NVIDIA Hopper architecture to support new batched GEMM kernel, with improved fusion performance, and compilation time."
4. VMAF tiny-AI exposure classification
| cuDNN issue | Affected ops | Tiny-AI exposure | Assessment |
|---|---|---|---|
| Convolution memory not freed until process exit (9.22.0 KI) | Conv on CUDA | RELEVANT: we run small-CNN conv kernels via CUDA EP when the GPU path is manually invoked | Low risk: not a correctness bug, and the memory is reclaimed at process exit. Short-lived or periodically restarted processes bound the growth. Severity would rise in a persistent daemon. |
| cuBLASLt matmul perf regression (CUDA 13.2 Update 1, 9.22.0 KI) | MatMul | MARGINALLY RELEVANT: DNN models include FC layers implemented as matmul | Perf only, not correctness. Affects CUDA 13.2 Update 1 specifically. The Containerfile comment indicates CUDA 13.2 (pre-Update-1), so exposure is low as long as the installed toolkit has not advanced to Update 1. |
| Turing conv perf regression (9.17.0–9.20.x) | Conv, Turing GPUs | RELEVANT if running on a Turing-class card (e.g. RTX 20-series, GTX 16-series, T4) | Fixed in 9.21.x. If onnxruntime-gpu is ever installed against cuDNN 9.17–9.20 on Turing, affected. |
| INT8 Hopper fusion heuristic improvement | INT8 Conv + Matmul | RELEVANT: INT8 is one of our quantization modes | Positive change (performance gain on Hopper). Not a bug. |
| Attention hang on Blackwell (9.21.1 fix) | Scaled-dot-product attention | NOT RELEVANT: tiny-AI models are CNN-only, no attention ops | Safe. |
| Attention dropout crash on Blackwell (9.21.1 fix) | Attention + dropout | NOT RELEVANT: no attention layers in tiny-AI models | Safe. |
| RNN ops | RNN/LSTM/GRU | NOT RELEVANT: we are CNN-only | Safe. |
| Large-batch specialised kernels | Large batch | NOT RELEVANT: single-frame-at-a-time inference | Safe. |
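The Turing row of the table amounts to a half-open version window. The sketch below encodes that window under the assumption that the regression spans 9.17.0 up to, but not including, 9.21.0; these bounds come from the table, not from a confirmed NVIDIA fix version, and the function and constant names are hypothetical.

```python
# Assumed bounds, taken from the table above rather than from NVIDIA's notes.
REGRESSION_INTRODUCED = (9, 17, 0)
REGRESSION_FIXED = (9, 21, 0)


def as_tuple(version: str) -> tuple:
    """Convert a full 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(p) for p in version.split("."))
    return (major, minor, patch)


def turing_conv_regression_exposed(cudnn_version: str, on_turing: bool) -> bool:
    """True if the version falls in the assumed window on a Turing GPU."""
    if not on_turing:
        return False
    return REGRESSION_INTRODUCED <= as_tuple(cudnn_version) < REGRESSION_FIXED


for version in ("9.16.0", "9.18.1", "9.21.1"):
    print(version, "exposed:", turing_conv_regression_exposed(version, True))
```

Whether the host GPU is Turing-class is left to the caller here; detecting the architecture is outside the scope of this sketch.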
5. Recommendation
Immediate action: NONE required.
The fork's container and default Python environment install the CPU-only ORT 1.26.0 build; cuDNN is not a transitive dependency of any installed artifact. The CUDA execution-provider code path in ort_backend.c is present but only reachable when a user manually installs onnxruntime-gpu (not provided by the fork).
Track separately:
- If a future PR adds onnxruntime-gpu to ai/pyproject.toml or the Containerfile, pin cuDNN ≥ 9.21.1 (which includes the Turing conv perf regression fix) and document the constraint in dev/Containerfile and ai/pyproject.toml.
- The convolution-memory-not-freed issue (9.22.0 KI) would become medium-severity if the fork ever ships a long-running inference server. Track in docs/state.md as a deferred potential bug; reopen when server-mode is implemented (VMAFX Phase 3 cloud-native plan, ADR-0714 area).
- The cuBLASLt perf regression with CUDA 13.2 Update 1 appears to originate in the cuBLASLt layer rather than in cuDNN itself, so it may also affect matmul paths that do not go through cuDNN; monitor when the Containerfile CUDA version advances to 13.2 Update 1 or later.
6. Reproducer / verification commands
# Confirm CPU-only ORT tarball is used (no 'gpu' in filename):
grep 'ORT_VERSION\|onnxruntime-linux' dev/Containerfile
# Confirm no onnxruntime-gpu dependency in Python packages:
grep -rn 'onnxruntime-gpu\|onnxruntime-cuda' ai/ pyproject.toml python/ tools/
# Confirm no cuDNN apt install in Containerfile:
grep -i 'cudnn' dev/Containerfile
# If investigating GPU EP: check ORT CUDA compatibility table at
# https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html
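The grep checks above can be bundled into a single script that is easier to run in CI. This is a sketch rather than an existing tool: the check names and the scan function are hypothetical, and the demo runs against a throwaway file instead of the real tree.

```python
import re
import tempfile
from pathlib import Path

# Patterns mirror the grep commands above; the check names are illustrative.
CHECKS = {
    "gpu-ort-dependency": re.compile(r"onnxruntime-(gpu|cuda)"),
    "cudnn-reference": re.compile(r"cudnn", re.IGNORECASE),
}


def scan(paths):
    """Yield (check, path, line_number, line) for every matching line."""
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            for name, pattern in CHECKS.items():
                if pattern.search(line):
                    yield name, str(path), number, line.strip()


# Self-contained demo on a temporary file; in the repository one would pass
# dev/Containerfile and ai/pyproject.toml instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Containerfile"
    sample.write_text("ARG ORT_VERSION=1.26.0\nRUN apt-get install -y cuda-toolkit\n")
    hits = list(scan([sample]))
    print("hits:", hits)
```

An empty result corresponds to grep exiting with status 1 (no match). If the scan is pointed at ai/scripts/, the cudnn-reference check is expected to report the documentation comment quoted in section 1.2.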
7. References
- ORT 1.26.0 CPU tarball: github.com/microsoft/onnxruntime/releases/download/v1.26.0/onnxruntime-linux-x64-1.26.0.tgz
- ORT CUDA EP compat table: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html
- cuDNN 9.22.0 release notes: https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html
- cuDNN downloads: https://developer.nvidia.com/cudnn-downloads
- Containerfile: dev/Containerfile lines 529–539 (ORT), lines 146–171 (CUDA 13.2)
- DNN backend: core/src/dnn/ort_backend.c, core/src/dnn/ort_backend.h
- Quantisation script comment: ai/scripts/measure_quant_drop_per_ep.py lines 17–18