ADR-0502: ADM decouple gather prefetch (Approach B)
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, Claude (Anthropic)
- Tags: simd, perf, adm, avx512, fork-local
Context
The adm_decouple_avx512 function in core/src/feature/x86/adm_avx512.c spends 66.5 % of its own cycles — 2.31 % of total VMAF wall time — on three vpgatherdd instructions that look up the 256 KB adm_div_lookup table (65 537 × int32, indexed by band_h/v/d + 32768).
The LUT exceeds L1 capacity (32–48 KB) and saturates L2 (256–512 KB) because DWT sub-band coefficients are drawn arbitrarily from [−32 768, +32 767] with no monotone ordering within a row. The three gathers therefore issue up to 48 scattered reads per iteration (16 lanes each) with frequent L2/L3 misses, stalling the execution port that feeds the downstream arithmetic.
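The footprint figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the 64-byte cache line is an assumption (typical for x86) that the ADR does not state.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    LUT_ENTRIES = 65537,    /* table length quoted in this ADR */
    LUT_BIAS = 32768,       /* offset added to each int16 coefficient */
    CACHE_LINE_BYTES = 64   /* assumption: typical x86 cache line size */
};

/* Total table size in bytes: 65 537 entries of int32. */
static size_t lut_bytes(void)
{
    return (size_t)LUT_ENTRIES * sizeof(int32_t);
}

/* Number of cache lines the table spans, rounded up. */
static size_t lut_cache_lines(void)
{
    return (lut_bytes() + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Map a signed 16-bit coefficient to its table index. */
static size_t lut_index(int16_t coeff)
{
    return (size_t)((int32_t)coeff + LUT_BIAS);
}
```

Under these assumptions the table occupies 262 148 bytes, just over 256 KiB, spread across 4 097 cache lines, so 48 effectively random indices per iteration rarely share a line.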
The perf-profiler identified two candidate approaches:
- Approach A: replace gather with `vpermd` + sequential loads if indices are monotone per row.
- Approach B: issue software prefetches 2 iterations ahead if indices are not monotone.
On inspecting the index-compute path (oh, ov, od are raw DWT coefficients sign-extended from int16_t; their values are input-dependent and have no ordering guarantee) Approach A is not applicable. Approach B is implemented here.
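The precondition for Approach A can be stated as a simple predicate over one row of coefficients. The check below is a hypothetical diagnostic, not code from the patch; `row_is_monotone` and its parameters are names introduced here for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical diagnostic: returns true when the row of int16 coefficients
 * is non-decreasing, which is the ordering Approach A would need.
 */
static bool row_is_monotone(const int16_t *row, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        if (row[i] < row[i - 1]) {
            return false;   /* a single descent breaks the ordering */
        }
    }
    return true;
}
```

Because DWT coefficients routinely change sign from one sample to the next, a predicate like this would be expected to fail on real input rows, which is the reason Approach A is ruled out.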
Decision
We will prefetch the adm_div_lookup cache lines corresponding to the iteration 2 steps ahead (j + 32 elements) into L2 using _MM_HINT_T1 before each set of three vpgatherdd instructions. The prefetch loop reads 16 int16 values per band from the next-next-iteration pointers, computes [val + 32768] offsets, and calls _mm_prefetch. No computed value changes; the patch is a pure access-strategy change.
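A minimal sketch of the prefetch step described above, assuming a row of `width` coefficients per band and a main loop that advances `j` by 16. It is not the patch itself: `prefetch_lut_ahead`, `PREFETCH_L2`, and the parameter names are hypothetical, and the non-x86 fallback exists only so the sketch compiles elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T1 */
#define PREFETCH_L2(p) _mm_prefetch((const char *)(p), _MM_HINT_T1)
#else
/* Fallback for non-x86 builds of this sketch (GCC/Clang builtin). */
#define PREFETCH_L2(p) __builtin_prefetch((p), 0, 2)
#endif

#define LUT_BIAS 32768                 /* index = coefficient + 32768 */
#define LANES 16                       /* elements per vpgatherdd */
#define PREFETCH_DISTANCE (2 * LANES)  /* two iterations ahead: j + 32 */

/*
 * Hypothetical helper: prefetch the LUT cache lines that the gathers at
 * iteration j + 32 will touch. Returns the number of prefetches issued
 * (48, or 0 when the look-ahead window would run past the row end).
 */
static size_t prefetch_lut_ahead(const int32_t *lut,
                                 const int16_t *band_h,
                                 const int16_t *band_v,
                                 const int16_t *band_d,
                                 size_t j, size_t width)
{
    size_t ahead = j + PREFETCH_DISTANCE;

    /* Row-end guard: never read coefficients beyond the current row. */
    if (ahead + LANES > width) {
        return 0;
    }
    for (size_t k = 0; k < LANES; ++k) {
        PREFETCH_L2(&lut[(int32_t)band_h[ahead + k] + LUT_BIAS]);
        PREFETCH_L2(&lut[(int32_t)band_v[ahead + k] + LUT_BIAS]);
        PREFETCH_L2(&lut[(int32_t)band_d[ahead + k] + LUT_BIAS]);
    }
    return 3 * LANES;
}
```

In a loop shaped like the one this ADR describes, the helper would be called once per iteration, immediately before the three gathers for the current `j`. The prefetch only touches cache state, so the gathered values and all downstream arithmetic are unchanged.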
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Approach A — vpermd + sequential load | Eliminates gather latency entirely; best throughput | Requires monotone-increasing indices within a row | DWT coefficients are arbitrary int16; monotone ordering does not hold |
| Approach B — software prefetch (chosen) | Hides L2/L3 miss latency; preserves gather semantics; zero risk to bit-exactness | Adds 48 scalar prefetch instructions per iteration; guard branch needed for row-end | Only option given non-monotone indices |
| No change | Zero risk | 2.31 % of wall time left on table | Does not meet the T4 perf-win bar |
Consequences
- Positive: −5.8 % total wall time on BBB 1080p (302-frame, 8-run mean: 9 603 ms → 9 049 ms); gather miss latency hidden behind ~300 arithmetic cycles.
- Negative: +48 `prefetchT1` instructions per inner-loop iteration; negligible on modern front-ends (prefetch is a hint and does not stall).
- Neutral / follow-ups: The scalar tail loop (`j = right_mod16 … right`) does not use the LUT in a vectorized manner and is unmodified. The `adm_decouple_avx2` path is not affected; it uses `_mm256_i32gather_epi32` on the same LUT, but that function is not in the hot path at this profile depth.
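The headline figure follows directly from the two 8-run means quoted above. The helper below, with a hypothetical name, just makes the arithmetic explicit.

```c
/*
 * Hypothetical helper: relative change from `before` to `after`, in percent.
 * A negative result means the run got faster.
 */
static double relative_change_percent(double before_ms, double after_ms)
{
    return (after_ms - before_ms) / before_ms * 100.0;
}
```

With the timings from this ADR, `relative_change_percent(9603.0, 9049.0)` evaluates to roughly −5.77, which rounds to the −5.8 % reported.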
References
- perf-profiler finding: `adm_decouple_avx512` gather cluster = 66.5 % of function cycles = 2.31 % total wall time (Perf Win #4).
- Research-0435
- Related: ADR-0500 — VIF LUT shrink (similar gather-miss problem, different solution space).
- PR: `perf/adm-decouple-gather-locality-2026-05-18`