ADR-0147: Thread-pool job-object recycling + inline data buffer¶
- Status: Accepted
- Date: 2026-04-24
- Deciders: Lusoris, Claude (Anthropic)
- Tags: performance, threading, upstream-port
Context¶
The fork's core/src/thread_pool.c allocates and frees a VmafThreadPoolJob and a separate payload buffer on every vmaf_thread_pool_enqueue call. Callers submit tiny payloads (struct { VmafPicture ref, dist; ... } in the main extractor path, ~48 bytes in the MCP frame-event path). For a 120-frame 1080p run with multi-feature extraction, this comes out to thousands of paired malloc / free calls per wall-second inside the hot path vmaf_read_pictures → threaded_read_pictures → vmaf_thread_pool_enqueue chain.
Netflix upstream PR #1464 (Tolga Kilicli, Feb 2026 — closed without merge) bundled twelve unrelated optimizations. Two of them apply cleanly to the fork's thread-pool:
- Job-object free list — recycle
VmafThreadPoolJobslots via a pool-local linked list rather thanmalloc/freeon every job. - Inline data buffer — payloads ≤ 64 bytes are copied into a fixed-size
char inline_data[64]array at the tail of the job struct, eliminating the second allocation.
The rest of Netflix #1464 (PSNR AVX2, ADM micro-opts, VIF epsilon removal, predict.c refactor, feature-collector capacity bump, convolution stride hoisting, comprehensive test suite) either overlaps with fork-local work that has already landed (T3-4 motion, T7-5 predict refactor, ADR-0142 VIF, 81fcd42e PSNR SIMD) or directly conflicts with ADR-0138/0139 bit-exactness invariants. A narrow port is the right shape.
The fork's thread-pool carries two extensions relative to upstream that must survive the port:
VmafThreadPoolWorkerstruct with per-workervoid *dataplus athread_data_freedestructor callback. Callers submit work via afunc(void *data, void **thread_data)signature, not upstream'sfunc(void *data).pthread_cond_signal(notbroadcast) on empty → already incorporated atthread_pool.c:175as of master. Upstream PR #1464 claimed a "thundering-herd fix" ported over the samebroadcast → signalchange; the fork has it already.
Decision¶
Port the job-object free list + inline data buffer from Netflix #1464, adapted to the fork's void **thread_data signature:
- Add
#define JOB_INLINE_DATA_SIZE 64and embedchar inline_data[JOB_INLINE_DATA_SIZE]inVmafThreadPoolJob. - Add a
VmafThreadPoolJob *free_jobslist toVmafThreadPool, protected by the existingqueue.lock. - Split the cleanup path into
vmaf_thread_pool_job_clear_data(drops heap-allocated payload if and only if it isn't pointing at the inline buffer) +vmaf_thread_pool_job_destroy(legacy free-the-whole-thing path, used only on pool teardown for leaked slots) +vmaf_thread_pool_job_recycle(new — pushes a finished slot onto the free list). vmaf_thread_pool_runnernow calls_recycleinstead of_destroyafter running a job.vmaf_thread_pool_enqueuenow takes the queue lock before allocating the slot; it pops fromfree_jobsif non-empty, otherwisemallocs a new slot. Payloads ≤ 64 bytes are copied intojob->inline_dataandjob->datais pointed at that buffer; larger payloads keep the legacymallocpath.vmaf_thread_pool_destroywalks thefree_jobslist aftervmaf_thread_pool_waitreturns (all workers have exited, so no locking needed) and frees every recycled slot.
Bit-exactness and throughput are both load-bearing:
- Bit-exactness: scalar vs SIMD (VMAF_CPU_MASK=0 vs =255) remains byte-identical under
--threads 4on the Netflix golden pair (diffexit 0). - Throughput: a 500 000-job micro-benchmark (4 worker threads,
int v = 1payload per job) shows ~1.8–2.6× speedup:
| Config | Median throughput |
|---|---|
| Before (master) | ~1.20 M jobs/sec |
| After (this PR) | ~2.20 M jobs/sec |
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Full Netflix #1464 port | One-shot sweep of all twelve upstream optimizations | Direct conflict with ADR-0138 / 0139 (VIF epsilon removal, ADM bit-shift aggression), with T7-5 predict.c refactor, with the fork's feature-collector extensions, and with the fork's existing 81fcd42e PSNR SIMD (already more complete than upstream's AVX2-only path) | Rejected — narrow port avoids re-litigating already-decided invariants; the omitted pieces are either done, unsafe, or out of scope |
| Job-pool only, no inline buffer | Smaller patch, removes only one malloc/free pair per job | Callers with small payloads (the common case) still do a second malloc for every enqueue | Rejected — the inline buffer is where the bulk of the throughput win comes from; job recycling alone is about 1.3× |
| Heap-backed object pool with a max-size cap | Could bound memory growth under adversarial enqueue bursts | Current queue already provides natural backpressure via n_working + queue → no unbounded growth in practice; a cap adds mutex-held branch-heavy fallback code with no measurable win | Rejected — match upstream's simpler unbounded-growth-then-free-on-destroy strategy |
| Thread-local slab allocator | Zero lock contention on the pool lock for allocations | Current pool lock is already held while touching the queue; the allocation-under-lock pattern does not serialize any real work that wasn't already serialized | Rejected — premature optimization. Revisit if profile-hotpath shows lock contention later |
Consequences¶
- Positive:
- ~2× enqueue throughput in the micro-benchmark. On a thread-pool-heavy run (multi-feature, 1080p,
--threads 4) this drops the allocator's share of the per-job cost from "visible in flamegraph" to "disappears into the noise floor". - Payload copies now go through a predictable inline buffer for the common case — one less cache-miss class to worry about in subsequent profiling.
vmaf_thread_pool_destroyremains leak-free under all exit paths (normal shutdown, early stop) — every recycled slot is walked and freed after the workers stop.- Negative:
sizeof(VmafThreadPoolJob)grows from 24 bytes to 88 bytes (func 8 + data 8 + inline_data 64 + next 8). Each long-lived pool pays64 * peak_jobs_in_flightbytes more. A 4-thread pool in practice carries a handful of slots alive at any time; the trade is trivial relative to even one 1080p frame.- Enqueue now acquires
queue.lockbefore allocation (previously it allocated first, then locked). Allocations under the lock can theoretically stretch the critical section on a cold cache or heavy fragmentation; in the benchmarked workload throughput strictly improved, so the concern is academic for now. - Neutral / follow-ups:
- If profile shows the 64-byte buffer is undersized for a future consumer, bump
JOB_INLINE_DATA_SIZEand recompile — no API change. - Netflix #1464 is still open upstream. If it ever lands, keep the fork's version on conflict (fork's
void **thread_datasignature is required for the fork's per-worker data path — seecore/src/AGENTS.md).
References¶
- Upstream source: Netflix/vmaf#1464, specifically the thread-pool portion of commit
ad2b90c3("Add AVX2 PSNR, thread pool job pool, and comprehensive tests"). - Backlog item:
.workingdir2/BACKLOG.mdT3-6 (the AVX2-PSNR half was already covered by fork commit81fcd42e; this PR closes the thread-pool half). - ADR-0141 — applied to
thread_pool.c: zero clang-tidy warnings on the touched file, no NOLINT. - User direction 2026-04-24 popup: "Port thread-pool job-pool (Recommended)" after T7-5 merged.