Skip to content

Research-0435: ADM decouple gather locality (2026-05-18)

Summary

Analysis of the adm_decouple_avx512 gather bottleneck and evaluation of two candidate acceleration strategies.

Findings

LUT properties

  • adm_div_lookup is declared int32_t div_lookup[65537] in core/src/feature/integer_adm.h (line 33), populated once via pthread_once.
  • Size: 65 537 × 4 = 262 148 bytes ≈ 256 KB.
  • Typical L1 data cache: 32–48 KB. Typical L2: 256–512 KB.
  • The table fits in L2 but shares it with the frame buffer data; effective hit rate is low for the scattered gather pattern.

Index pattern

The three gather indices are oh + 32768, ov + 32768, od + 32768 where oh, ov, od are int16 DWT sub-band coefficients sign-extended to int32 before the shift. These values are input-dependent and are not monotone within a row.

Inspected path (in adm_decouple_avx512, lines 895–945):

__m512i oh = _mm512_cvtepi16_epi32(
    _mm256_loadu_si256((__m256i *)(ref->band_h + i * stride + j)));
// ... (ov, od similarly)
// indices = oh + 32768, range [0, 65536] arbitrary scatter
__m512i oh_div = _mm512_i32gather_epi32(
    _mm512_add_epi32(oh, _mm512_set1_epi32(32768)), adm_div_lookup, 4);

Approach A ruling

vpermd + sequential load requires that all 16 indices per vector lane fit within a contiguous 16-element window (or at most a 32-element window accessible by a 512-bit load + permute). DWT coefficients from H, V, D sub-bands can take any value in [−32768, +32767]; the expected scatter width across the 65 K-entry table far exceeds any cache-line-width window. Approach A is not viable without a structural change to the LUT layout (e.g., hash-partitioned micro-tables), which would require a scalar reference change and invalidate the bit-exact contract.

Approach B evaluation

Software prefetch 2 iterations ahead is standard technique for gather-heavy kernels where the index stream is known 2 iterations before the gather executes. Distance of 2 × 16 = 32 elements covers approximately 300 instructions of arithmetic between the prefetch hint and the gather — comfortably within the L2 miss latency (100–250 cycles on Zen 4 / Skylake-X).

The implementation prefetches into T1 (L2) not T0 (L1) because:

  • The 48 cache lines (3 bands × 16 elements) per iteration will immediately be evicted from L1 by the dense band-buffer loads that follow.
  • L2 residence is sufficient to eliminate the L3/DRAM stall.

Benchmark results (BBB 1080p, 302 frames, release build)

Build Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Run 7 Run 8 Mean
Baseline (master) 9364 8670 9651 9084 10520 9593 9920 10022 9603 ms
Patched (ADR-0502) 8747 8677 9714 9540 8974 9768 8876 8100 9049 ms

Improvement: 554 ms / 9603 ms = −5.8 % wall time.

Bit-exactness verification

diff of all frame metrics across 302 frames × 15 metrics = 4 530 metric-frame pairs: max absolute difference = 0. The change is a pure memory-access-strategy change; no arithmetic path is altered.

Conclusion

Approach B (software prefetch, T1) is viable, ships a measured 5.8 % wall-time improvement, and preserves bit-exactness. Implemented as ADR-0502.