ADR-0591: Restore rfe_hw_flags per-frame bitmask cache after PR #1067 clobber¶
- Status: Accepted
- Date: 2026-05-16
- Deciders: lusoris
- Tags:
cuda,perf,bug,libvmaf
Context¶
PR #1056 (commit 86e2498a8) introduced a lazy dirty-flag cache (rfe_hw_flags_cache / rfe_hw_flags_dirty) on VmafContext to eliminate the O(n_extractors) linear scan of registered_feature_extractors that rfe_hw_flags() performs on every call to vmaf_read_pictures. The optimisation was tracked as perf-audit-pipeline-2026-05-16 finding F2-B.
PR #1067 (c2a3c7e0f) was a large squash-merge that included unrelated refactor work; it rebased against a tree that already contained the struct fields and three write/read sites, but the merge silently dropped all five of those lines, reverting libvmaf.c to the pre-#1056 state. The regression restores the per-frame O(n) scan.
Decision¶
Restore the five edit sites from 86e2498a8 verbatim:
VmafContextstruct — two new fields under#ifdef HAVE_CUDA:unsigned rfe_hw_flags_cacheandbool rfe_hw_flags_dirty.vmaf_init— initialiserfe_hw_flags_dirty = trueso the first frame triggers a recompute.vmaf_use_feature— setrfe_hw_flags_dirty = trueafter a new extractor is appended, so the cache is invalidated correctly when the extractor set changes after frame-loop start.vmaf_read_pictures— replace the unconditionalrfe_hw_flags()call with a dirty-flag check; only recompute when dirty, then read from the cache.
All sites are guarded by #ifdef HAVE_CUDA; CPU-only and SYCL builds are unaffected.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Re-apply #1056 as a cherry-pick | Exact replay | Cherry-pick would conflict on the post-#1067 tree | Chosen as manual patch (identical result) |
| Leave as-is (accept regression) | No code change | Per-frame O(n) scan degrades throughput at n_extractors > ~4 | Perf regression, not acceptable |
Consequences¶
- Positive: Restores the F2-B throughput gain;
rfe_hw_flags()called once per unique extractor-set epoch instead of once per frame. - Negative: None — purely additive cache with correct invalidation.
- Neutral / follow-ups: Consider a CI guard (e.g. a grep-based check) that verifies the cache fields are present in
libvmaf.cto prevent silent future regressions.
References¶
- PR #1056 (introduces the cache): commit
86e2498a8 - PR #1067 (clobbers the cache): commit
c2a3c7e0f - Source:
req— "PR #1067 clobbered #1056's F2-B perf optimization … restore the cache fields, the dirty-flag write paths, and the read-path that uses the cache instead of recomputing every frame."