ADR-1092: framesync producer-death deadlock — abort flag + shutdown broadcast¶
- Status: Accepted
- Date: 2026-06-07
- Deciders: Lusoris
- Tags:
core,threading,correctness,sanitizer,fork-local
Context¶
VmafFrameSyncContext couples a producer thread (calling vmaf_framesync_acquire_new_buf + vmaf_framesync_submit_filled_data) with a consumer thread (calling vmaf_framesync_retrieve_filled_data).
retrieve_filled_data contains an unbounded while loop that calls pthread_cond_wait whenever the requested frame index is not yet in BUF_FILLED state. There was no way to interrupt this wait. If the producer thread died — from an OOM in acquire_new_buf, from an error returned by the feature extractor's extract() callback, or from an unhandled early exit — the consumer would wait forever. Because vmaf_thread_pool_wait in vmaf_close cannot return until all worker threads finish, the entire process would hang.
As a consequence, calling vmaf_framesync_destroy while a consumer was still blocked in pthread_cond_wait is undefined behaviour per POSIX (pthread_cond_destroy with waiters: UB).
The SAN-FRAMESYNC-MUTEX-DOMAIN fix (PR #548, ADR referenced at state.md line 604) established the strict M0-before-M1 lock ordering that prevents TSan races. That fix correctly handles the normal producer–consumer handshake. The producer-death path was not addressed.
Decision¶
Add an aborted (plain int) field to VmafFrameSyncContext. Add a new function vmaf_framesync_abort(fs_ctx) that:
- Takes
retrieve_lock(M1). - Sets
fs_ctx->aborted = 1. - Broadcasts on the
retrievecondvar. - Releases M1.
Update retrieve_filled_data to check aborted in two places:
- Before entering
pthread_cond_wait(persistent flag covers the case where abort fired before cond_wait was reached). - Immediately after waking from
pthread_cond_wait(covers the case where abort's broadcast was the wake signal).
Both checks return -ECANCELED with all locks released.
Add a safety-net vmaf_framesync_abort call at the top of vmaf_framesync_destroy so that any consumer still sleeping when destroy is called receives a broadcast before the condvar is destroyed — preventing the POSIX UB even if the caller forgot to drain threads first.
vmaf_framesync_abort is declared in framesync.h so callers (feature extractors, libvmaf.c error paths) can invoke it directly on producer failure.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
pthread_cancel + cancellation points | No new API | Non-deterministic; interacts badly with cleanup handlers; cancellation state is thread-scoped, not context-scoped; POSIX discourages it in production code | Not chosen |
Timed wait (pthread_cond_timedwait) | No new API surface | Adds latency on the happy path; retry interval is a policy decision with no right answer; does not actually fix the root cause | Not chosen |
| Abort flag only (no broadcast in destroy) | Simpler destroy | Still leaves a POSIX UB window if a waiter is in flight when destroy is called | Not chosen — defence-in-depth broadcast kept |
C11 _Atomic int for the flag | Formally cleaner memory model | Not needed: POSIX guarantees that all writes made while holding a mutex are visible to any thread that subsequently acquires the same mutex; retrieve_lock is always held when reading/writing aborted | Not chosen |
Consequences¶
- Positive: producer-death no longer causes an infinite hang or POSIX UB in
vmaf_framesync_destroy; TSan / helgrind will see a clean condvar lifecycle. - Positive: consumers get a well-defined
-ECANCELEDerror code they can propagate up the call stack. - Negative: callers that want to surface the abort to the user must check for
-ECANCELEDin theirretrieve_filled_datareturn-value handling. - Neutral: the
abortedfield adds 4 bytes toVmafFrameSyncContext(internal struct, no ABI impact). - Neutral: feature extractor error paths should call
vmaf_framesync_abortbefore returning; thevmaf_framesync_destroysafety-net broadcast is defence-in-depth, not the primary signal path.
References¶
SAN-FRAMESYNC-MUTEX-DOMAINfix: PR #548 (2026-05-09), state.md line 604.- POSIX
pthread_cond_destroy: "Attempting to destroy a condition variable upon which other threads are currently blocked results in undefined behavior." feedback_no_test_weakening: fix the implementation, not the gate.- Internal analysis:
framesync.c:retrieve_filled_datainfinite loop on producer death — identified 2026-06-07 by framesync-deadlock-paths agent.