ADR-0773: CUDA ADM decouple-inline — __ldg() F3 fix on active path¶
- Status: Accepted
- Date: 2026-05-29
- Deciders: lusoris
- Tags:
cuda,performance,adm,fork-local
Context¶
PR #106 (ADR-0763) applied the F3 __ldg() pattern to adm_decouple.cu (scale-0 and scale-1-3 standalone kernels). That file is dead code in the fork: the decouple logic is inlined into adm_csf.cu and adm_cm.cu via adm_decouple_inline.cuh (rebase-note 0002). The ADR-0763 change was a preparatory maintenance pass with no score impact because those kernels are never dispatched.
The active path is the six inline __device__ helpers in adm_cm.cu (inline_i4_csf_a, inline_i4_decouple_r, inline_s0_csf_a, inline_s0_decouple_r, inline_i4_csf_r, inline_s0_csf_r) and the two kernel templates in adm_csf.cu (i4_adm_csf_kernel, adm_csf_kernel). All eight sites load six band values (band_h/v/d for ref and dis) via plain struct-member access — no __ldg(), no __restrict__ extraction — so ptxas routes those loads through the coherent L2 cache instead of the read-only L1 texture path.
ADR-0756 listed adm_decouple as a priority-dispatch F3 target; ADR-0763 addressed the dead-code file; this ADR addresses the live path.
Decision¶
Extract const T *__restrict__ band pointers from the cuda_*_adm_dwt_band_t struct arguments at the top of each of the eight helper functions / kernel templates before any indexed load, then replace every ref->band_h[idx]-style load with __ldg(&rh[idx]). Write-back calls (none in these helpers — they are read-only) would use plain pointers.
The change covers:
adm_csf.cu:i4_adm_csf_kernel<>(int32 path) andadm_csf_kernel<>(int16 path)adm_cm.cu:inline_i4_csf_a,inline_i4_decouple_r,inline_s0_csf_a,inline_s0_decouple_r,inline_i4_csf_r,inline_s0_csf_r
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
__ldg() extraction in callers (chosen) | Exposes alias-free invariant per-call; < 15 LOC per site | None | Chosen |
Move extraction into .cuh helper signatures | Single place | .cuh functions receive pre-loaded integers, not pointers | Architecture mismatch; .cuh functions have no pointers to extract |
| No change | Zero risk | Leaves 6 global reads per pixel on the coherent L2 path at every CSF/CM dispatch | Performance opportunity missed; inconsistent with ADR-0754/0762/0763 precedent |
Consequences¶
- Positive: Six per-pixel band loads at every CSF and CM kernel call now route through the L1 read-only cache. The ADM pipeline issues CSF then CM every frame; at 1080p the DWT2 planes (6 × int16 or 6 × int32) are too large for L1 without the non-coherent hint. Consistent with F3 pattern applied across the CUDA suite.
- Negative: None; change is semantics-preserving and bit-exact. CUDA vs CPU correctness confirmed at places=4 (max diff ≤ 1.00e-06 on 576×324 Netflix pair, 0.00e+00 on 1080p checkerboard).
- Neutral:
adm_decouple.cu(dead code) already carries ADR-0763; this ADR addresses the live path only.
References¶
- ADR-0763 (PR #106): F3 fix on the dead-code
adm_decouple.cu. - ADR-0756: fork-wide F3 audit;
adm_decouplelisted as dispatch target. - ADR-0754 (PR #93): original
__ldg()F3 fix (SSIM vert_combine). - ADR-0762 (PR #102): F3 fix on CIEDE2000 kernels.
- ADR-0214: GPU vs CPU parity gate (places=4).
- rebase-note 0002: explains why
adm_decouple.cuis dead code. - req: "Apply the same F3 fix (raw
__restrict__pointer extraction +__ldg()on read-only loads) to the INLINE header — that's the path that actually executes."