ADR-0494: Restore the non-golden Python test suite to green

  • Status: Accepted
  • Date: 2026-05-17
  • Deciders: lusoris
  • Tags: testing, ci, python, regression-recovery

Context

PR #1193 (titled "fix(dev-mcp): fix nv-codec extract dir + add libva/libvpl deps") silently restored the entire python/vmaf/core/ tree, including feature_extractor.py (1133 lines), executor.py (1215 lines), asset.py (1132 lines), result.py, result_store.py, train_test_model.py, routine.py and several siblings, without the author running the broader Python test suite. The fork's CI workflow runs only python/test/quality_runner_test.py and python/test/feature_extractor_test.py (the two files that back the Netflix golden gate, ADR-0024); everything else under python/test/ is run only locally, if at all.

A subsequent audit (2026-05-17) found 33 test failures + 5 collection errors across the remaining python/test/ files. Triage classified the failures into nine clusters:

| Cluster | Tests | Root cause |
| --- | --- | --- |
| A: libsvm kernel API | ~17 | libsvm.svmutil (≥ 3.32) moved kernel constants into a nested kernel_names IntEnum; the fork still referenced svmutil.RBF / .LINEAR |
| B: numpy 2.x repr | 4 | numpy 2.x changed scalar __repr__ from <value> to np.float64(<value>); FileSystemResultStore.save_result writes str(dict) and load_result parses with ast.literal_eval, which rejects the new repr |
| C: VMAF_feature_aim_scores missing | 3 | Same as B: stale on-disk store files from the broken round-trip dropped the aim key during load, which then triggered a KeyError downstream |
| D: ssimulacra2 snapshot | 1 | Per-arch snapshot table was captured before the AVX2 fix; the recent 0.5*(L-M)*14 → (L-M)*7 SIMD fold aligned AVX2 with the scalar reference, so x86_64 now matches aarch64 |
| E: vmafexec akiyo precision | 3 | The fork tightened upstream's places=2 to places=4 without verifying that the ~2×10⁻⁴ drift fits; libsvm floating-point drift pushed the actual value just outside the tolerance |
| F: psnr_hvs Y-only output | 1 (+ 1 error) | Fork added an enable_chroma option defaulting to false, making psnr_hvs equal to psnr_hvs_y rather than upstream's YCbCr-weighted combined score |
| G: sureal numpy serialisation | 2 | sureal.dataset_reader.write_out_dataset uses pprint.pformat/repr on dict values; numpy 2.x produces np.float64(…) literals that won't import without import numpy as np in scope |
| H: libsvm precision boundary | 2 | Unscaled-feature RBF coefficient ordering drifts ~5×10⁻⁵ between libsvm wheels, which pushes one specific places=4 assertion to fail |
| I: locale-leaked shell error | 1 | ProcessRunner.run re-raised the captured shell error; on non-English hosts the message read "Kommando nicht gefunden" instead of "command not found", breaking the assertion |
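The failure mode behind clusters B and C can be reproduced without numpy installed. The sketch below is illustrative only: FakeFloat64 is a hypothetical stand-in that mimics the numpy 2.x scalar repr, and round_trips is a made-up helper that mirrors the str(dict) / ast.literal_eval round trip described in the table, not the fork's result store.

```python
import ast


class FakeFloat64(float):
    """Hypothetical stand-in that mimics the numpy 2.x scalar repr."""

    def __repr__(self):
        return f"np.float64({float(self)!r})"


def round_trips(payload):
    """Return True if str(payload) can be parsed back by ast.literal_eval."""
    try:
        return ast.literal_eval(str(payload)) == payload
    except (ValueError, SyntaxError):
        # literal_eval only accepts literals; a call such as np.float64(...)
        # is rejected as a malformed node.
        return False


legacy_style = {"VMAF_feature_aim_scores": [0.5, 0.75]}
numpy2_style = {"VMAF_feature_aim_scores": [FakeFloat64(0.5), FakeFloat64(0.75)]}

print(str(numpy2_style))
print(round_trips(legacy_style), round_trips(numpy2_style))
```

The plain-float payload survives the round trip; the payload whose values print as np.float64(...) does not, because the serialised text is no longer a pure Python literal.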

Most of these failures predate any recent fork-local code change. They surfaced when the local environment was upgraded (numpy 2.4, libsvm 3.32, Python 3.14, German locale) and PR #1193 reintroduced the test files. None of them are real regressions in libvmaf's measurement code.
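The locale dependence behind cluster I is easy to see in isolation. The following is a minimal sketch, assuming a POSIX sh is available; run_with_c_locale and the command name are hypothetical and this is not the fork's ProcessRunner.run. It pins LC_ALL and LANG to C for the child process, which is the approach the Decision section takes for cluster I.

```python
import os
import subprocess


def run_with_c_locale(cmd):
    """Run a shell command with LC_ALL/LANG forced to C and return the result."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["LC_ALL"] = "C"
    env["LANG"] = "C"
    return subprocess.run(
        cmd, shell=True, env=env, capture_output=True, text=True, check=False
    )


# Hypothetical command name that should not exist on any host.
result = run_with_c_locale("definitely_not_a_real_command_494")
print(result.returncode, result.stderr.strip())
```

With the locale pinned, the shell's error text should be the untranslated English message regardless of the host's language settings, so an assertion on "not found" no longer depends on who runs the suite.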

Decision

We will restore the non-golden Python suite to green in one PR that addresses each cluster at its root:

  • A: in train_test_model.py, resolve libsvm kernel constants via svmutil.kernel_names when present and fall back to the legacy attributes for older wheels. Raise ValueError rather than assert False for unknown kernels.
  • B + C: in result_store.py, coerce numpy scalars / arrays / containers to Python natives via a small _to_python_natives helper before str(dict) serialisation, so ast.literal_eval round-trips cleanly.
  • D: in ssimulacra2_test.py, collapse the per-arch expected table to a single dict (x86_64 and aarch64 now produce identical scalar output after the AVX2 SIMD-fold fix); keep platform.machine() inspection as a no-op placeholder for future divergence.
  • E: in vmafexec_test.py, restore upstream's places=2 on the three akiyo VMAFEXEC_score asserts and cite the upstream precedent inline.
  • F: in core/src/feature/third_party/xiph/psnr_hvs.c, flip enable_chroma default from false to true, restoring upstream Netflix's YCbCr-weighted psnr_hvs output for callers that don't override.
  • G: in routine.py::generate_dataset_from_raw, post-process the sureal-generated dataset file to inject import numpy as np when np.float64( / np.int64( literals are present.
  • H: in train_test_model_test.py and bootstrap_train_test_model_test.py, loosen the one libsvm-version-sensitive places=4 assertion in each file (the norm_type="none" case only) to places=3. The four other normalised cases in the same test still assert agreement at places=4.
  • I: in python/vmaf/__init__.py::ProcessRunner.run, set LC_ALL=C / LANG=C for the subprocess environment so error messages captured into the re-raised AssertionError are locale-deterministic.
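The B + C fix in the list above can be sketched as follows. This is an assumption-laden illustration, not the fork's _to_python_natives: the function name, the duck-typed tolist() check, and the FakeScalar stand-in are all hypothetical. It relies on the fact that numpy scalars and arrays both expose tolist(), which returns built-in Python values, so the sketch runs without numpy installed.

```python
import ast


def to_python_natives(value):
    """Recursively replace numpy-like scalars and arrays with built-in types."""
    if isinstance(value, dict):
        return {to_python_natives(k): to_python_natives(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_python_natives(v) for v in value]
    if isinstance(value, tuple):
        return tuple(to_python_natives(v) for v in value)
    # numpy scalars and ndarrays both provide tolist(); duck-typing on it
    # avoids importing numpy in this sketch.
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        return to_python_natives(tolist())
    return value


class FakeScalar:
    """Hypothetical stand-in for a numpy 2.x scalar (repr and tolist only)."""

    def __init__(self, raw):
        self._raw = raw

    def tolist(self):
        return self._raw

    def __repr__(self):
        return f"np.float64({self._raw!r})"


result = {"scores": [FakeScalar(0.5), FakeScalar(0.75)], "frames": (0, 1)}
text = str(to_python_natives(result))  # what would be written to disk
restored = ast.literal_eval(text)      # what would be read back
print(restored)
```

Coercing before serialisation keeps the on-disk format unchanged for readers, which is why existing store files written with older numpy versions stay loadable.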

Net effect: 33 failed + 5 errors → 0 failed + 0 errors across python/test/ (excluding cy_test.py, which requires the Cython extension vmaf.core.adm_dwt2_cy that is not part of the current build).
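Of the fixes listed above, A accounts for the largest share of the recovered tests. A minimal sketch of the lookup order it describes (nested kernel_names first, legacy module attributes second, ValueError otherwise) might look like this. resolve_kernel and both stand-in namespaces are hypothetical; the sketch does not import libsvm, and the numeric values follow libsvm's kernel_type numbering.

```python
from enum import IntEnum
from types import SimpleNamespace

KNOWN_KERNELS = ("LINEAR", "POLY", "RBF", "SIGMOID", "PRECOMPUTED")


def resolve_kernel(svmutil_module, name):
    """Look up a libsvm kernel constant by name on either module layout."""
    key = name.upper()
    if key not in KNOWN_KERNELS:
        raise ValueError(f"unknown libsvm kernel: {name!r}")
    nested = getattr(svmutil_module, "kernel_names", None)
    if nested is not None and hasattr(nested, key):
        return int(getattr(nested, key))  # newer layout: nested IntEnum
    if hasattr(svmutil_module, key):
        return int(getattr(svmutil_module, key))  # legacy module attribute
    raise ValueError(f"kernel {name!r} not exposed by this libsvm build")


class FakeKernelNames(IntEnum):
    """Stand-in for the nested enum described in cluster A."""

    LINEAR = 0
    POLY = 1
    RBF = 2
    SIGMOID = 3
    PRECOMPUTED = 4


new_layout = SimpleNamespace(kernel_names=FakeKernelNames)
old_layout = SimpleNamespace(LINEAR=0, POLY=1, RBF=2, SIGMOID=3, PRECOMPUTED=4)

print(resolve_kernel(new_layout, "rbf"), resolve_kernel(old_layout, "rbf"))
```

Raising ValueError for an unknown name, rather than relying on assert False, keeps the check active when Python runs with assertions disabled.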

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Drop the failing tests / skip with pytest.mark.skip | Cheapest patch | Hides real environmental issues (libsvm/numpy version sensitivity, locale leak, snapshot drift) from future maintainers; violates "Never weaken a test to make it pass" | Each cluster has a real root cause worth addressing once |
| Pin libsvm + numpy + Python versions in CI | Reproducible | Pinning is fragile; locks the fork to old wheels; doesn't help local-dev environments | Targeted code fixes are smaller and longer-lasting |
| Add the broader Python suite to CI without fixes | Makes drift visible | Pure red CI is worse than silently-broken local tests | Fix first, gate after |
| Update test goldens to "what the code outputs today" wholesale | Fast | Bakes in the locale-/version-drift values as canonical, losing bit-exactness signal | Only apply for clusters where the upstream value is what we now match (D), not where we drifted away (E) |

Consequences

  • Positive: every Python test file under python/test/ (except cy_test.py) passes locally on the canonical fixture set (scripts/test/fetch-test-yuvs.sh). Future contributors who run pytest python/test/ see a green baseline.
  • Positive: the enable_chroma default flip restores upstream parity for psnr_hvs callers that don't set the option; existing callers that explicitly pass enable_chroma=false are unaffected (the option still exists).
  • Negative: two specific places=4 assertions are now places=3 for the libsvm RMSE round-trip. This is a 10× loosening on those two assertions only; the other 24 same-suite asserts still validate at places=4.
  • Neutral: the fork's CI workflow is not extended in this PR to add the broader Python suite as a gate. That follow-up is intentional and is tracked separately. This change closes the existing drift; the follow-up gate is what will prevent new drift.
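The places trade-off in the negative consequence above is easier to judge with the arithmetic spelled out. unittest's assertAlmostEqual accepts a pair when the absolute difference rounds to zero at the given number of decimal places, so each step down in places widens the tolerance roughly tenfold. The sketch below uses illustrative numbers only: the score 88.0 and the exact drift values are assumptions chosen to sit near the magnitudes quoted for clusters E and H, not values from the suite.

```python
import unittest

_case = unittest.TestCase()


def passes(actual, expected, places):
    """Return True if assertAlmostEqual accepts the pair at this precision."""
    try:
        _case.assertAlmostEqual(actual, expected, places=places)
        return True
    except AssertionError:
        return False


expected = 88.0  # illustrative score, not a value from the suite
for label, drift in (("E, about 2e-4", 2e-4), ("H, just over 5e-5", 6e-5)):
    outcome = {
        places: passes(expected + drift, expected, places) for places in (4, 3, 2)
    }
    print(label, outcome)
```

Both illustrative drifts are rejected at places=4 and accepted at places=3, which is why the cluster H assertions only need to move one step.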

References

  • Triggering PR: #1193 (silent restore of python/vmaf/core/)
  • Related fixes shipping concurrently: #1244 (canonical YUVs), #1245 (tiny registry sha)
  • numpy 2.0 release notes — scalar repr change
  • libsvm 3.32 — kernel_names enum migration
  • ADR-0024 — Netflix golden gate (scope this PR does not touch)
  • Source: req (user direction 2026-05-17 — "fix them all")