Research-0435: ADM decouple gather locality (2026-05-18)¶
Summary¶
Analysis of the adm_decouple_avx512 gather bottleneck and evaluation of two candidate acceleration strategies.
Findings¶
LUT properties¶
adm_div_lookupis declaredint32_t div_lookup[65537]incore/src/feature/integer_adm.h(line 33), populated once viapthread_once.- Size: 65 537 × 4 = 262 148 bytes ≈ 256 KB.
- Typical L1 data cache: 32–48 KB. Typical L2: 256–512 KB.
- The table fits in L2 but shares it with the frame buffer data; effective hit rate is low for the scattered gather pattern.
Index pattern¶
The three gather indices are oh + 32768, ov + 32768, od + 32768 where oh, ov, od are int16 DWT sub-band coefficients sign-extended to int32 before the shift. These values are input-dependent and are not monotone within a row.
Inspected path (in adm_decouple_avx512, lines 895–945):
__m512i oh = _mm512_cvtepi16_epi32(
_mm256_loadu_si256((__m256i *)(ref->band_h + i * stride + j)));
// ... (ov, od similarly)
// indices = oh + 32768, range [0, 65536] arbitrary scatter
__m512i oh_div = _mm512_i32gather_epi32(
_mm512_add_epi32(oh, _mm512_set1_epi32(32768)), adm_div_lookup, 4);
Approach A ruling¶
vpermd + sequential load requires that all 16 indices per vector lane fit within a contiguous 16-element window (or at most a 32-element window accessible by a 512-bit load + permute). DWT coefficients from H, V, D sub-bands can take any value in [−32768, +32767]; the expected scatter width across the 65 K-entry table far exceeds any cache-line-width window. Approach A is not viable without a structural change to the LUT layout (e.g., hash-partitioned micro-tables), which would require a scalar reference change and invalidate the bit-exact contract.
Approach B evaluation¶
Software prefetch 2 iterations ahead is standard technique for gather-heavy kernels where the index stream is known 2 iterations before the gather executes. Distance of 2 × 16 = 32 elements covers approximately 300 instructions of arithmetic between the prefetch hint and the gather — comfortably within the L2 miss latency (100–250 cycles on Zen 4 / Skylake-X).
The implementation prefetches into T1 (L2) not T0 (L1) because:
- The 48 cache lines (3 bands × 16 elements) per iteration will immediately be evicted from L1 by the dense band-buffer loads that follow.
- L2 residence is sufficient to eliminate the L3/DRAM stall.
Benchmark results (BBB 1080p, 302 frames, release build)¶
| Build | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Run 8 | Mean |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (master) | 9364 | 8670 | 9651 | 9084 | 10520 | 9593 | 9920 | 10022 | 9603 ms |
| Patched (ADR-0502) | 8747 | 8677 | 9714 | 9540 | 8974 | 9768 | 8876 | 8100 | 9049 ms |
Improvement: 554 ms / 9603 ms = −5.8 % wall time.
Bit-exactness verification¶
diff of all frame metrics across 302 frames × 15 metrics = 4 530 metric-frame pairs: max absolute difference = 0. The change is a pure memory-access-strategy change; no arithmetic path is altered.
Conclusion¶
Approach B (software prefetch, T1) is viable, ships a measured 5.8 % wall-time improvement, and preserves bit-exactness. Implemented as ADR-0502.