ADR-0578: Hoist VIF scratch buffer from per-frame allocation to VifState¶
- Status: Accepted
- Date: 2026-05-16
- Deciders: lusoris
- Tags: perf, vif, cpu, build
Context¶
compute_vif in core/src/feature/vif.c allocated a contiguous 10 × ALIGN_CEIL(w × sizeof(float)) × h byte block on every frame call and freed it on return. At 1080p (stride = 7712, h = 1080) this is approximately 79 MB of allocator traffic per frame, consisting of ten full-plane scratch buffers (ref_scale, dis_scale, mu1, mu2, ref_sq_filt, dis_sq_filt, ref_dis_filt, num_array, den_array, tmpbuf). The geometry is known at init_fex time (from the VmafFeatureExtractor init callback), so the allocation can be lifted to the init/close lifecycle.
This matches the pattern already used by float_adm, integer_vif, ssimulacra2, and integer_adm. The CPU perf audit (.workingdir/ perf-audit-cpu-2026-05-16.md) identified this as Win 1, the top CPU-side performance opportunity for the float VIF path.
Win 2 (enabling VIF_OPT_FAST_LOG2) is a no-op: vif_options.h line 28 already defines VIF_OPT_FAST_LOG2 unconditionally. The macro substitutes log2f_approx (a 9-term Horner polynomial) for libm log2f. This was confirmed present before any code changes; documentation in the research digest records the discovery for future maintainers.
Decision¶
- Add a
float *vif_buffield toVifStateinfloat_vif.c. - Change
compute_vif's signature to acceptfloat *data_buf— the caller owns the allocation.compute_vifslices sub-pointers from it exactly as before. - In
float_vif.c::init, allocateVIF_SCRATCH_BUF_CNT (10) × scaled_float_stride × scaled_hbytes once and assign tos->vif_buf. - Pass
s->vif_buftocompute_viffromextract. - Free
s->vif_bufinclose. - In
vifdiff(the standalone utility also invif.c), allocate the scratch buffer once before the frame loop and free it after — still one allocation per video, not one per frame.
Bit-exactness: the scratch buffer is pure write-before-read within each compute_vif call; no cross-frame state is stored in it. The hoist changes lifetime, not contents.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep per-frame malloc | Zero code change | ~79 MB/frame allocator traffic at 1080p; system allocator lock contention in threaded mode | Performance regression is the status quo; ruled out |
| Thread-local storage for scratch | Zero signature change | Complicates sanitiser / valgrind runs; not idiomatic for this codebase | Not chosen — VifState is already the right place |
| Opaque scratch pointer in VifState (this ADR's decision) | Matches existing float_adm, ssimulacra2 patterns; zero arithmetic change | Requires signature update to compute_vif and vif.h | Chosen |
References¶
- req: perf audit
.workingdir/perf-audit-cpu-2026-05-16.mdWin 1 directive - Research digest 0135