ADR-0562: VCQ-223 LocalExplainer hang fix — cap neighbor_samples in test runner

  • Status: Accepted
  • Date: 2026-05-18
  • Deciders: lusoris, Claude (Anthropic)
  • Tags: python, test, local-explainer, performance, bugfix, fork-local

Context

ADR-0551 (PR #1325) identified the root cause of the VCQ-223 CI hang: VmafQualityRunnerWithLocalExplainer._run_on_asset constructed LocalExplainer() with the default neighbor_samples=5000, producing ~480 000 libsvm svm_predict_values calls per typical run (wall time 4–8 min; CI timeout). This PR implements the fix proposed in ADR-0551.
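The call count quoted above can be turned into a rough estimate for other sample counts. The sketch below assumes the number of svm_predict_values calls scales linearly with neighbor_samples, which this ADR does not state. It estimates call volume only, not wall time, since fixed per-run overhead is not modeled.

```python
# Figures taken from the paragraph above.
CALLS_AT_DEFAULT = 480_000        # approximate svm_predict_values calls per run
DEFAULT_NEIGHBOR_SAMPLES = 5000   # LocalExplainer default

# Assumption: calls grow linearly with neighbor_samples.
calls_per_sample = CALLS_AT_DEFAULT / DEFAULT_NEIGHBOR_SAMPLES


def estimated_calls(neighbor_samples):
    """Estimate the call count for a given neighbor_samples value."""
    return round(calls_per_sample * neighbor_samples)


print(estimated_calls(5000))  # 480000
print(estimated_calls(100))   # 9600
```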

The test QualityRunnerTest::test_run_vmaf_runner_local_explainer_with_bootstrap_model has been skipped with @unittest.skip("[VCQ-223]") since commit e3827e4dd.

Decision

The fix targets the fallback path inside VmafQualityRunnerWithLocalExplainer._run_on_asset (option (b) from ADR-0551): the fallback now constructs LocalExplainer(neighbor_samples=100) — matching the passing sibling test test_explain_vmaf_results — rather than the 5000-sample default. Production callers that need higher fidelity can pass optional_dict={"explainer_neighbor_samples": 5000} or supply a full LocalExplainer via optional_dict2={"explainer": ...}.
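The precedence described above can be sketched as follows. This is a minimal illustration, not the code in quality_runner_extra.py: the LocalExplainer class here is a stand-in that only records its constructor argument, and resolve_explainer is a hypothetical helper name introduced for this example. Giving an explicit explainer in optional_dict2 priority over the sample-count key is also an assumption.

```python
class LocalExplainer:
    """Stand-in for vmaf's LocalExplainer; only neighbor_samples matters here."""

    DEFAULT_NEIGHBOR_SAMPLES = 5000

    def __init__(self, neighbor_samples=DEFAULT_NEIGHBOR_SAMPLES):
        self.neighbor_samples = neighbor_samples


def resolve_explainer(optional_dict=None, optional_dict2=None):
    """Hypothetical helper mirroring the fallback logic described in this ADR."""
    optional_dict = optional_dict or {}
    optional_dict2 = optional_dict2 or {}

    # A fully configured explainer supplied by the caller is used as-is.
    explainer = optional_dict2.get("explainer")
    if explainer is not None:
        return explainer

    # Fallback path: cap at 100 samples unless the caller overrides it.
    neighbor_samples = optional_dict.get("explainer_neighbor_samples", 100)
    return LocalExplainer(neighbor_samples=neighbor_samples)


# Default fallback uses the capped value.
print(resolve_explainer().neighbor_samples)  # 100

# A production caller restores the previous fidelity explicitly.
restored = resolve_explainer({"explainer_neighbor_samples": 5000})
print(restored.neighbor_samples)  # 5000
```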

The @unittest.skip decorator is removed. Score assertions are recalibrated to the neighbor_samples=100 values.

Implementation notes

  • python/vmaf/core/quality_runner_extra.py:38 — fallback LocalExplainer() replaced with LocalExplainer(neighbor_samples=neighbor_samples) where neighbor_samples defaults to 100 and is overridable via optional_dict.get("explainer_neighbor_samples", 100).
  • python/test/local_explainer_test.py:252: @unittest.skip removed; score assertions updated to the neighbor_samples=100 calibration values:
      • results[0]["VMAF_LE_score"] = 75.40980306756497
      • results[1]["VMAF_LE_score"] = 99.95804823471536
  • Wall time on dev machine (Python 3.14): approximately 78 seconds.
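The recalibrated values listed above could be checked along these lines. This is a sketch only: scores_match is a hypothetical helper, the results are plain dicts standing in for the runner's result objects, and the tolerance of 4 decimal places is an assumption rather than something this ADR specifies.

```python
# Calibration values from the implementation notes (neighbor_samples=100).
EXPECTED_SCORES = [75.40980306756497, 99.95804823471536]


def scores_match(results, expected, places=4):
    """Return True if every VMAF_LE_score agrees with its expected value.

    Uses the same rounding rule as unittest's assertAlmostEqual.
    The default of 4 places is an assumed tolerance.
    """
    if len(results) != len(expected):
        return False
    return all(
        round(abs(result["VMAF_LE_score"] - want), places) == 0
        for result, want in zip(results, expected)
    )


# Stand-in results; a real test would take these from the runner.
results = [{"VMAF_LE_score": score} for score in EXPECTED_SCORES]
print(scores_match(results, EXPECTED_SCORES))  # True
```

In the actual test these comparisons would normally be written as assertAlmostEqual calls on the runner's results.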

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| Lower LocalExplainer.DEFAULT_NEIGHBOR_SAMPLES globally to 100 | Single-point change | Breaks production callers relying on 5000-sample explanations | Rejected: the default is public API surface |
| Fix only at test call site via optional_dict2 | Minimal blast radius | Runner fallback stays broken for any future caller | Rejected: the runner-level fix is the correct layer |
| Raise the test timeout | No code change | Root cause remains; burns 4–8 min of CI time | Rejected: fix the root cause, not the symptom |

Consequences

  • Positive: QualityRunnerTest::test_run_vmaf_runner_local_explainer_with_bootstrap_model runs and completes within the 120-second timeout. The LocalExplainer + VmafQualityRunnerWithLocalExplainer code path is exercised in CI.
  • Negative: The neighbor_samples=100 values differ from what neighbor_samples=5000 would produce; score assertions changed accordingly.
  • Neutral / follow-ups: Production callers relying on the implicit 5000-sample default should pass optional_dict={"explainer_neighbor_samples": 5000}.

References

  • Diagnosis ADR: ADR-0551 (PR #1325)
  • Skip introduced: commit e3827e4dd (2026-05-06)
  • Related passing test: LocalExplainerTest::test_explain_vmaf_results (uses neighbor_samples=100)
  • docs/state.md tracking item: T-VCQ-223-LOCAL-EXPLAINER-HANG