ADR-1079: TSan-eligible thread-safety test for threaded_extract_batch_func¶
- Status: Accepted
- Date: 2026-06-06
- Deciders: Lusoris
- Tags:
ci,test,threading
Context¶
PRs #765 (ADR-1072) and #769 (ADR-1073) fixed two concurrency bugs in threaded_extract_batch_func (libvmaf.c):
-
ADR-1072: bare
memsetonfex->prev_refafterextract()without callingvmaf_picture_unreffirst; exhausted the picture pool after ~pool_size frames and deadlockedpthread_cond_wait. -
ADR-1073:
flush_context_threadedcalledfex->flush()on the shared, never-initialized extractor context.flush()lazily allocateds->feature_name_dict;vmaf_feature_extractor_context_closereturned-EINVALearly (guarded byis_initialized == false) and leaked the dict. Fix: setfex_ctx->is_initialized = trueimmediately before the flush loop, aftervmaf_thread_pool_waithas already serialized all workers.
The existing test_pic_preallocation.c exercises these paths (via vmaf_preallocate_pictures + vmaf_fetch_preallocated_picture) but is excluded from every sanitizer run — ASan, UBSan, and TSan — because vmaf_preallocate_pictures allocates large pool buffers (up to 40 × 1920×1080 ≈ 95 MB) that the TSan shadow-memory allocator cannot satisfy at its default reservation limit, triggering SIGABRT before any test logic runs. The exclusion covers the entire test binary, including the threading sub-tests that have no huge-alloc dependency.
This leaves the post-PR threaded_extract_batch_func PREV_REF code path — specifically the vmaf_picture_unref + memset sequence and the is_initialized write — with no TSan coverage. A latent data race that escapes the happens-before edges established by vmaf_thread_pool_wait would be undetectable until a production deadlock or ASan report surfaced it.
Decision¶
Add core/test/test_thread_safety_batch.c, a TSan-eligible regression test that covers the same threaded_extract_batch_func + flush_context_threaded code paths using vmaf_picture_alloc (heap allocation, no pool) and small (64×64) frames. The test is registered in core/test/meson.build as a fast-suite target, is NOT added to any sanitizer exclusion list, and therefore automatically participates in the TSan run in sanitizers.yml.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Fix test_pic_preallocation.c to avoid huge allocs | Single file, no new file | The pool-based tests are the point of that file; removing the pool would defeat them. The TSan crash is inherent to vmaf_preallocate_pictures at large pool sizes. | Structural conflict between pool-size stress and sanitizer allocator limits. |
Split test_pic_preallocation.c into pool and non-pool halves | Keeps all preallocation tests together | Changes an existing excluded file and risks re-triggering the SIGABRT on new sub-tests if the split is imperfect. | New file is safer: zero risk of accidentally reintroducing huge-alloc patterns. |
Guard huge-alloc sub-tests with #if __has_feature(thread_sanitizer) | Single file | Requires TSan-detection macros in every affected sub-test; fragile across compilers; masks the sub-test from the sanitizer view rather than running a clean variant. | More fragile than a dedicated file that is clean by construction. |
| Do nothing (accept the coverage gap) | No new code | The ADR-1072/ADR-1073 fixes have zero TSan visibility; any regression in the PREV_REF lifecycle or is_initialized guard would be silent until a deadlock surfaces in production. | Unacceptable for two fixes that directly address data-race-adjacent lifetime bugs. |
Consequences¶
- Positive:
threaded_extract_batch_funcPREV_REF lifecycle (ADR-1072) andflush_context_threadedis_initializedgate (ADR-1073) now have TSan coverage on every master push. Regressions in either fix surface as TSan data-race reports within the 35-minute sanitizer budget. - Positive: The test uses
vmaf_picture_alloc+ 64×64 frames; TSan shadow overhead is trivial (~1 MB). No new sanitizer exclusion entry is needed. - Negative: One additional test binary in the
fastsuite; compile time increase is negligible (single.cfile linked against existinglibvmaf). - Neutral:
test_pic_preallocation.cremains excluded from all sanitizer runs; its pool-stress tests (e.g.test_picture_pool_stresswith 16 threads and a 4-picture pool) are not covered by the new test, which is intentional — that test stresses pool exhaustion, not the PREV_REF lifecycle.
References¶
- ADR-1072:
docs/adr/1072-prev-ref-batch-refcount-leak.md(PR #765) - ADR-1073:
docs/adr/1073-mcp-score-at-index-eagain-guard.md(PR #769) - TSan exclusion list:
.github/workflows/sanitizers.ymllines 206–210 - req: "Look for new shared state in PRs #765-#771 that lacks TSan coverage. Especially threaded_extract_batch_func paths and sycl_state graph_extractors array"