ADR-0463: ADM p-norm fast-path split and VIF scalar-fallback malloc hoist
- Status: Accepted
- Date: 2026-05-16
- Deciders: lusoris
- Tags: perf, adm, vif, simd, cpu, fork-local
Context
Profiling (perf-audit-cpu-2026-05-16.md, findings 4 and 5) identified two CPU-side hot-path inefficiencies in the ADM and VIF feature extractors:
ADM adm_p_norm == 3.0 inner-loop branch: adm_cm_s, adm_csf_den_scale_s, and adm_sum_cube_s each contain if (adm_p_norm == 3.0) { x*x*x } else { powf(x, adm_p_norm) } inside their innermost accumulation loops. The default value of adm_p_norm is 3.0 (set in float_adm.c), so in normal operation the branch is taken, predictably, on every loop iteration. The branch prevents auto-vectorisation of the cube accumulation and wastes branch-predictor state. Across the three functions there are 14 if (adm_p_norm == 3.0) checks inside inner loops.
VIF vif_filter1d_* per-call aligned_malloc: the three scalar fallback functions vif_filter1d_s, vif_filter1d_sq_s, and vif_filter1d_xy_s each call aligned_malloc(ALIGN_CEIL(w * sizeof(float)), MAX_ALIGN) on every invocation. On x86 with AVX2, the ARCH_X86 guard routes to the SIMD path before the fallback fires, so the malloc is skipped in practice. On ARM64 (no ARCH_AARCH64 guard) and non-AVX2 x86, the fallback fires every frame: 3 functions × 4 VIF scales = 12 aligned_malloc + aligned_free pairs per frame. The caller (vif.c:compute_vif) already preallocates a tmpbuf of size buf_sz_one = buf_stride * h >= ALIGN_CEIL(w * sizeof(float)) and passes it as a parameter, but the scalar fallback ignored it.
Decision
Fix 1: Split each branching function into a p3 fast-path variant (adm_cm_s_p3, adm_csf_den_scale_s_p3, adm_sum_cube_s_p3) that hardcodes cube arithmetic and cbrtf in place of powf(x, adm_p_norm). The dispatch is performed once per scale iteration in compute_adm with an if (adm_p_norm == 3.0) guard. The generic path is retained for non-default values.
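For finite non-negative x, cbrtf(x) and powf(x, 1.0f/3.0f) compute the same mathematical quantity, although neither IEEE-754 nor the C standard requires the two library calls to round identically. Accumulation order is identical in both paths. The change is intended to be bit-exact for all inputs when adm_p_norm == 3.0; this should be confirmed against the existing output rather than inferred from the cbrtf/powf equivalence alone.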
Fix 2: Replace the per-call aligned_malloc / aligned_free pattern in the three scalar VIF filter fallbacks with direct use of the caller-supplied tmpbuf parameter. The caller guarantees that tmpbuf != NULL and that the buffer it points to holds at least ALIGN_CEIL(w * sizeof(float)) bytes (ensured by the VIF_BUF_CNT-slab allocation in compute_vif).
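The hoist in Fix 2 can be sketched as follows, assuming a simplified two-pass row filter. The names filter_per_call_malloc, filter_with_tmpbuf, and filter_one_row are hypothetical, the filter arithmetic is a placeholder, and plain malloc stands in for the project's aligned_malloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder row filter: scale the row into tmp, then add each element
 * to its left neighbour (clamped at the row start). */
static void filter_one_row(const float *src_row, float *dst_row,
                           float *tmp, int w)
{
    for (int j = 0; j < w; ++j) {
        tmp[j] = 0.5f * src_row[j];
    }
    for (int j = 0; j < w; ++j) {
        int jl = j > 0 ? j - 1 : 0;
        dst_row[j] = tmp[jl] + tmp[j];
    }
}

/* Before: the scratch row is allocated and freed on every call. */
int filter_per_call_malloc(const float *src, float *dst, int w, int h)
{
    float *tmp = malloc((size_t)w * sizeof(float));
    if (tmp == NULL) {
        return -1;
    }
    for (int i = 0; i < h; ++i) {
        filter_one_row(src + (size_t)i * w, dst + (size_t)i * w, tmp, w);
    }
    free(tmp);
    return 0;
}

/* After: the caller owns the scratch row, which must hold at least
 * w floats, so the filter performs no allocation and cannot fail. */
void filter_with_tmpbuf(const float *src, float *dst, float *tmpbuf,
                        int w, int h)
{
    assert(tmpbuf != NULL);
    for (int i = 0; i < h; ++i) {
        filter_one_row(src + (size_t)i * w, dst + (size_t)i * w, tmpbuf, w);
    }
}
```

Besides removing the allocator round trip, the second form has no failure path, which simplifies error handling at the call site. The cost is that the size contract on tmpbuf moves to the caller and has to be documented there.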
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Function cloning via __attribute__((optimize)) or -fprofile-generate PGO | Zero code duplication | Compiler-specific; not guaranteed to specialize; violates JPL-P10 rule (no compiler extensions not in C99/C11 portable subset) | Non-portable, unreliable |
| Macro-based specialization | No duplication of logic | NOLINT-heavy; hard to read; macros banned by coding standard for non-trivial bodies | Violates style guide |
| Move dispatch from compute_adm to adm.h wrapper inline | Slightly cleaner call site | Inline expansion in header increases build time; wrapper still required | Marginal gain; not worth it |
| Hoist VIF tmpbuf to a separate VifScalarWorkspace struct | Clean ownership model | More invasive API change; overkill when tmpbuf is already passed | Complexity not justified |
Consequences
- Positive: eliminates 14 per-pixel branch-and-powf pairs on the default code path in ADM; removes 12 aligned_malloc/aligned_free pairs per frame on ARM64; inner loops in _p3 variants are auto-vectorisable by GCC/clang.
- Negative: adm_tools.c is ~500 lines longer (the _p3 functions). Any future change to the accumulation logic in the generic path must be mirrored to the _p3 variants. The AGENTS.md invariant documents this.
- Neutral: adm_tools.h gains 3 new declarations; adm.c gains 3 new #define aliases. No public API change (all new symbols are in .c/.h files internal to libvmaf).
References
- Perf-audit findings: .workingdir/perf-audit-cpu-2026-05-16.md (findings 4 + 5).
- PR #881 tmpbuf hoist precedent: commit a3123a8dc introduces the VIF_BUF_CNT-slab pattern that this ADR extends.
- ADR-0418: double-precision accumulator for ADM sum reductions.
- Source: per user direction (task brief perf/adm-p-norm-fast-path-vif-arm64-malloc-2026-05-16).