Skip to content

ADR-0347: Sanitizer matrix — concrete test-set scope per sanitizer

  • Status: Accepted
  • Status update 2026-05-15: implemented; ASan / UBSan / TSan matrix job active in .github/workflows/tests-and-quality-gates.yml; per-sanitizer deselect lists tracked in docs/state.md.
  • Date: 2026-05-09
  • Deciders: Lusoris, Claude (Anthropic)
  • Tags: ci, testing, sanitizer, asan, ubsan, tsan, fork-local

Context

ADR-0015 established the sanitizer matrix (ASan + UBSan + TSan) but left the test-set underspecified. The lane in tests-and-quality-gates.yml runs meson test -C build --suite=unit, but no test() call in core/test/meson.build carries a suite: 'unit' tag — the entire C unit-test set lands in the default suite. Result: every sanitizer leg in the matrix prints No suitable tests defined. and exits 0 with zero correctness coverage. The matrix builds with sanitizers enabled but never exercises code; pure cost, no signal.

This is the failure mode our memory-rule feedback_no_test_weakening warns against: a gate that exists for correctness coverage but is silently neutered by a configuration mismatch. It also matches the Research-0089 §5 finding that flagged the symptom out-of-scope of the parent CI audit.

The matrix's intended job — surfacing memory, UB, and concurrency bugs the Netflix golden gate cannot catch — has been silently OFF since at least ADR-0015 landed. Empirical evidence (this PR's local run) shows the matrix would have caught seven real bugs that have been hiding behind the --suite=unit no-op.

Decision

We will run the full C unit-test set under each of ASan, UBSan, and TSan, with a per-sanitizer deselect list documenting tests that fail because of a real underlying bug (not a sanitizer mis-configuration). Each deselected entry cites a follow-up bug in docs/state.md so the gap stays visible until the underlying defect is fixed and the deselect can be removed.

Concretely:

  • ASan: build with clang-18 + lld-18 + b_sanitize=address + buildtype=debug + b_lto=false + b_lundef=false. Run all C unit tests except test_model, test_predict, test_float_ms_ssim_min_dim (three real defects — see Per-sanitizer verification table).
  • UBSan: same toolchain with b_sanitize=undefined and c_args+cpp_args=-fno-sanitize=function. The function check trips on the K&R-prototype static char *test_X() pattern in ~50 minunit-style harness files (core/test/test.h + every test/test_*.c), which is upstream Netflix harness code we are not refactoring in this PR; suppressing the function check surfaces every other UBSan signal cleanly. Run all C unit tests except test_model (real defect — same svm.cpp parser path ASan flags).
  • TSan: same toolchain with b_sanitize=thread. Run all C unit tests except test_model (same defect), test_pic_preallocation (integer_adm.c global-init race), test_framesync (framesync.c mutex-domain mismatch).

We do not add MSan to the matrix. The matrix in flight has always been ASan + UBSan + TSan (the user prompt's "MSan" naming was a misnomer for TSan). MSan requires every linked library — including system libc++ — to be MSan-instrumented, which is impractical with stock Ubuntu glibc. Adding MSan would require an instrumented-libc++ build leg (-stdlib=libc++ + custom -DLLVM_USE_SANITIZER=Memory toolchain) that is well beyond the T7-MCP-SMOKE-CI / Research-0089 scope. Defer to a separate ADR if a case for MSan emerges.

The Netflix golden Python gate stays in its own job (netflix-golden); the sanitizer matrix is a C-level memory / UB / race net, not a numerical-correctness gate. Running the Netflix Python pairs under sanitized libvmaf would burn ~25 min per leg with marginal additional signal versus the existing non-sanitized golden lane, so we keep the two concerns separated.

The DNN suite is excluded by virtue of not enabling -Denable_dnn=enabled in the sanitizer build (the existing matrix configuration). ONNX Runtime ships uninstrumented binaries that would generate spurious sanitizer reports across core/src/dnn/; the dnn lane has its own job (dnn: in tests-and-quality-gates.yml) which exercises those paths without sanitizers. Adding a sanitized dnn run is a separate ADR.

Alternatives considered

Option Pros Cons Why not chosen
Status quo — --suite=unit (zero tests) Zero CI runtime cost Zero correctness coverage; matrix is theatre; feedback_no_test_weakening fires The whole point of the matrix is signal; cost without signal is the failure mode.
Run all tests, fail on every existing bug Forces immediate fix of all 7 surfaced defects Blocks every unrelated PR until 7 fixes land; conflates "regression net" with "outstanding-debt audit" Land the gate now, file the defects in state.md, fix them in dedicated follow-up PRs. Same final state, much shorter merge train.
Run all tests + continue-on-error: true Surfaces signal without blocking A perma-yellow gate is functionally indistinguishable from no gate (alert blindness) Required-status semantics demand a binary green/red signal; that is the only way to prevent regressions from landing.
Add MSan as a fourth leg Catches uninitialised-memory bugs ASan misses Requires instrumented libc++ + deep build-system surgery; out of scope Defer to a separate ADR after the existing 3-leg matrix produces real signal for ≥1 release cycle.
Run the full Python python/test/ suite under sanitizers Maximum coverage of integration paths ~25 min per leg × 3 sanitizers; CI runner-minute cost dominates the signal-per-minute ratio The Netflix golden lane already exercises the integration paths on a non-sanitized build; the C unit suite under sanitizers is the right cost/signal tradeoff.
Per-sanitizer deselect lists with documented gaps (chosen) Closes the coverage hole today; surfaces real bugs as future work; gate stays binary green/red Three deselect lists to maintain; risk of new test additions silently inheriting "no sanitizer coverage" if a new test fails Mitigated by state.md rows tracking each gap and a follow-up cadence to close them; new tests default to running on all sanitizers.

Consequences

  • Positive: the sanitizer matrix produces real signal again. Seven defects that have been hiding behind --suite=unit are now documented and tracked. Future regressions in the same classes (use-after-free, signed-overflow UB, lock-domain races) fail CI on the introducing PR.
  • Positive: the gap between "the matrix exists" and "the matrix prevents regressions" closes. feedback_no_test_weakening is honoured — instead of relaxing the test set to make CI pass, we land documented gaps that bug-fix PRs erase one at a time.
  • Negative: per-leg wall time grows from ~0s (no tests) to ~5–6s (47 tests under ASan, 49 under UBSan, 47 under TSan). The full sanitizers job still finishes in ~5 min including the build (vs ~5 min before with no test execution), so total wall-clock cost is roughly unchanged once compile-time amortisation is included.
  • Negative: three deselect lists in the workflow YAML need to stay in sync with state.md rows. The PR template's Bug-status hygiene checkbox already requires reviewers to verify.
  • Neutral / follow-ups: file issues for each of the seven bugs surfaced (see Per-sanitizer verification table); each follow-up PR removes the corresponding deselect entry. Track in state.md Open-bugs section.

Per-sanitizer verification table

All numbers from a local run on this PR's branch (ci/sanitizer-matrix-test-scope, tip <filled-in-by-pr-body>), Ubuntu host with clang version 22.1.3 (target x86_64-pc-linux-gnu) standing in for clang-18; CI uses clang-18 + lld-18 + libclang-rt-18-dev per the existing job. The defect classes reproduce on both clang-18 and clang-22, so the local-vs-CI compiler delta does not change the verification.

Reproducer command for any single leg:

cd libvmaf
CC=clang CXX=clang++ LDFLAGS=-fuse-ld=lld \
  meson setup build-asan -Db_sanitize=address \
    -Denable_cuda=false -Denable_sycl=false --buildtype=debug \
    -Db_lto=false -Db_lundef=false
meson compile -C build-asan
meson test -C build-asan --print-errorlogs $(meson test -C build-asan --list \
  | grep '^libvmaf:' \
  | grep -vE 'test_model$|test_predict$|test_float_ms_ssim_min_dim$' \
  | sed 's/^libvmaf://')
Sanitizer Build flags (added on top of existing job) Tests run Wall (test exec only) Deselected Defect class First-bad evidence
ASan none (b_sanitize=address) 36 / 36 OK ~5.3 s test_model, test_predict, test_float_ms_ssim_min_dim (a) test_model::test_json_model_synthetic_branchesSVMModelParser::parse_support_vectors (core/src/svm.cpp:2955) requests 0xfffffffffffffff8-byte allocation on malformed JSON model buffer (allocation-size-too-big); (b) test_predict::test_propagate_metadata — direct + indirect leaks of metadata dict entries (core/src/dict.c:124) and string keys not freed when feature_collector_dispatch_metadata re-enters; (c) test_float_ms_ssim_min_dim::test_float_ms_ssim_init_* — direct calloc leaks in test-side invoke_init (test/test_float_ms_ssim_min_dim.c:33) — init allocates an extractor state never reclaimed ASan abort + stack traces in the PR description
UBSan c_args=-fno-sanitize=function, cpp_args=-fno-sanitize=function 38 / 38 OK ~5.4 s test_model test_model::test_json_model_synthetic_branchesSVMModelParser::parse_support_vectors passes NULL as memcpy arg2 (core/src/svm.cpp:2989, declared __nonnull) on malformed JSON model buffer UBSan runtime-error + abort (same defect as ASan finding (a) — both surface the same parse path's missing-validation bug from different angles)
TSan none (b_sanitize=thread) 36 / 36 OK ~1.3 s test_model, test_pic_preallocation, test_framesync (d) test_model — same defect as ASan/UBSan finding (a); (e) test_pic_preallocationdiv_lookup global table init in div_lookup_generator (core/src/feature/integer_adm.h:36) is racing across worker threads (integer_adm.c::init runs from vmaf_thread_pool_runner) — write-write race on div_lookup[idx], classic missing-atomic on a "compute-once-cache-forever" lookup table; (f) test_framesyncvmaf_framesync_submit_filled_data reads count (core/src/framesync.c:125) under one mutex while vmaf_framesync_acquire_new_buf writes the same count (framesync.c:102) under a different mutex — mutex-domain mismatch on the per-buffer state TSan WARNING: data race + abort

The function-check exclusion under UBSan is documented in the ADR, not deferred to a TODO: the K&R-prototype harness pattern is upstream Netflix code at scale, and rewriting ~50 test files to use static char *test_X(void) is its own PR (T7-5-style sweep). The exclusion preserves UBSan's signal on the actual library sources — the only tests that "would fail" without this exclusion fail because of the harness, not the code under test.

References