Tiny AI — bisect-model-quality

Binary-search a list of ONNX checkpoints for the first one that drops below a PLCC / SROCC / RMSE gate. Two surfaces:

  • CLI: vmaf-train bisect-model-quality (local use, ad-hoc on a training-run timeline).
  • Nightly CI: .github/workflows/nightly-bisect.yml runs the CLI against the committed fixture cache every night and posts the result to sticky tracker issue #40.

CLI

vmaf-train bisect-model-quality \
    path/to/model_*.onnx \
    --features path/to/features.parquet \
    --min-plcc 0.85 \
    --input-name input \
    --json result.json \
    --fail-on-first-bad

Required: exactly one of --min-plcc, --min-srocc, --max-rmse. The model list is interpreted as a timeline; the head must be "good" and the tail "bad" for the bisect to make sense (the tool exits early with a clear verdict if both ends pass or both fail).
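The "exactly one gate" rule can be sketched as below. This is an illustration only, not the tool's code: the Gate class and its passes method are hypothetical names, and treating a score exactly on the threshold as passing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    """Hypothetical holder for the three mutually exclusive thresholds."""

    min_plcc: Optional[float] = None
    min_srocc: Optional[float] = None
    max_rmse: Optional[float] = None

    def __post_init__(self) -> None:
        # Mirror the CLI rule: exactly one threshold may be set.
        chosen = [v for v in (self.min_plcc, self.min_srocc, self.max_rmse) if v is not None]
        if len(chosen) != 1:
            raise ValueError("exactly one of min_plcc, min_srocc, max_rmse is required")

    def passes(self, plcc: float, srocc: float, rmse: float) -> bool:
        # Assumption: a value exactly on the threshold counts as "good".
        if self.min_plcc is not None:
            return plcc >= self.min_plcc
        if self.min_srocc is not None:
            return srocc >= self.min_srocc
        return rmse <= self.max_rmse
```

Under these assumptions, Gate(min_plcc=0.85) corresponds to the --min-plcc 0.85 invocation above: a model is "good" while its PLCC stays at or above 0.85, and correlation gates are lower bounds while the RMSE gate is an upper bound.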

--fail-on-first-bad exits 2 when the bisect localises a regression, which is what the nightly workflow uses to flip CI red.

Feature parquet shape

The parquet must contain:

  • A mos column (the regression target).
  • At least one of DEFAULT_FEATURES (adm2, vif_scale0..3, motion2).

ONNX models in the timeline must accept input shape [N, F] where F matches the number of feature columns present in the parquet, with input tensor name matching --input-name.
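A quick pre-flight check of a parquet's columns might look like the following sketch. The select_feature_columns helper is hypothetical, the feature names are copied from this page rather than imported from the package, and returning the columns in canonical order is an assumption about how the tool builds the [N, F] input.

```python
# Feature names as listed on this page; the real constant lives in the ai/ package.
DEFAULT_FEATURES = ("adm2", "vif_scale0", "vif_scale1", "vif_scale2", "vif_scale3", "motion2")


def select_feature_columns(columns):
    """Return the default features present in `columns`, after checking for `mos`.

    Hypothetical helper: the ordering (canonical, not file order) is an assumption.
    """
    present = set(columns)
    if "mos" not in present:
        raise ValueError("parquet is missing the 'mos' target column")
    features = [name for name in DEFAULT_FEATURES if name in present]
    if not features:
        raise ValueError("parquet has none of the default feature columns")
    return features
```

With pandas installed, the column list can be taken from pd.read_parquet(path).columns. The length of the returned list is the F that every ONNX model in the timeline has to accept.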

Underlying algorithm

Visit indices 0 and N-1 first to confirm the timeline is bisectable. If it is, a classic binary search localises the first bad index in roughly log₂(N) further evaluations. Each visit caches its EvalReport so re-visited indices are free. Source: ai/src/vmaf_train/bisect_model_quality.py.
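The procedure can be sketched in a few lines. This is a minimal illustration, not the code in bisect_model_quality.py: the function name, the verdict strings, and the handling of a bad head with a good tail are all assumptions, and a plain boolean stands in for the cached EvalReport.

```python
from typing import Callable, Dict, Optional, Tuple


def bisect_first_bad(n: int, is_good: Callable[[int], bool]) -> Tuple[str, Optional[int], int]:
    """Return (verdict, first_bad_index, evaluations) for a timeline of n models."""
    if n < 1:
        raise ValueError("timeline must contain at least one model")
    cache: Dict[int, bool] = {}

    def visit(i: int) -> bool:
        # Evaluate each index at most once; later visits are served from the cache.
        if i not in cache:
            cache[i] = is_good(i)
        return cache[i]

    head_ok, tail_ok = visit(0), visit(n - 1)
    if head_ok and tail_ok:
        return ("all-good", None, len(cache))
    if not head_ok and not tail_ok:
        return ("all-bad", None, len(cache))
    if not head_ok:
        # Bad head with a good tail: treated as not bisectable (assumption).
        return ("not-bisectable", None, len(cache))

    lo, hi = 0, n - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if visit(mid):
            lo = mid
        else:
            hi = mid
    return ("regression", hi, len(cache))
```

The result is only meaningful if quality is monotone along the timeline (good models followed by bad ones). With a timeline that recovers and regresses again, a binary search still returns a good-to-bad boundary, but not necessarily the earliest one.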

Nightly workflow

Cron: 37 4 * * * (04:37 UTC). Steps:

  1. Set up Python 3.12, install the ai/ package.
  2. Run python ai/scripts/build_bisect_cache.py --check — regenerates the committed cache from fixed seeds and asserts content equality against the committed tree. Catches drift in pandas / pyarrow / onnx serialisation. Add --manifest-out runs/bisect-cache-check.json when you need durable replay evidence for the cache check.
  3. Run vmaf-train bisect-model-quality --fail-on-first-bad against ai/testdata/bisect/.
  4. Always upload the JSON report as a workflow artifact (bisect-report).
  5. Always edit the sticky comment on issue #40 with the rendered verdict + per-step Markdown table.

A red workflow run means either:

  • The committed cache regenerated to different bytes (toolchain drift: regenerate and commit as described in the README), or
  • The bisect localised a real regression (with the synthetic placeholder this means the algorithm or runtime broke; with a real cache it means a model regressed).

The sticky comment on #40 always reflects the latest run, even on green, so the per-step PLCC/SROCC/RMSE numbers are one click from the issue. The full historical report set lives in the workflow-artifact retention window.

Cache generator

The committed ai/testdata/bisect/ defaults to deterministic synthetic data so the nightly check is fully reproducible from a clean checkout. The generator also accepts a real DMOS/MOS-aligned feature parquet and materialises the same cache shape:

python ai/scripts/build_bisect_cache.py \
  --source-features runs/dmos_features.parquet \
  --target-column dmos \
  --manifest-out runs/bisect-cache-manifest.json

The source parquet must contain the canonical six columns (adm2, vif_scale0, vif_scale1, vif_scale2, vif_scale3, motion2) and a target column. Without --target-column, the script tries mos, dmos, target, then score. The generated features.parquet always renames the target to mos, and the generated ONNX timeline is fit from that target before deterministic tiny perturbations are applied. When --manifest-out is supplied, the generator writes a bisect-cache-manifest-v1 sidecar with the generation/check mode, target-column candidates, default feature list, artifact counts, and shared AI run_provenance.
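The target-column fallback described above can be sketched as follows. The resolve_target_column helper and its error handling are hypothetical; only the candidate order (mos, dmos, target, score) comes from this page.

```python
# Fallback order as documented above; the helper itself is a hypothetical sketch.
TARGET_CANDIDATES = ("mos", "dmos", "target", "score")


def resolve_target_column(columns, explicit=None):
    """Pick the target column: an explicit name wins, else the first candidate present."""
    available = set(columns)
    if explicit is not None:
        if explicit not in available:
            raise KeyError(f"target column {explicit!r} not found")
        return explicit
    for name in TARGET_CANDIDATES:
        if name in available:
            return name
    raise KeyError("no target column found; tried " + ", ".join(TARGET_CANDIDATES))
```

Because the generated features.parquet always names the target mos, a caller working with pandas would follow this with something like df.rename(columns={target: "mos"}).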

See the fixture README for the exact cache contract and Research-0001 for the original swap-path investigation. The synthetic-regression case ("introducing a deliberately bad ONNX trips the alert") is covered by the unit test test_bisect_localises_first_bad, not by the committed timeline.

Regenerating the cache

python ai/scripts/build_bisect_cache.py
git add ai/testdata/bisect
git commit -m "chore(ai): regenerate bisect-model-quality fixture cache"

The CI --check step will fail until the regenerated bytes are committed.

Sticky tracker issue (#40)

Issue #40 is owned by the workflow. Do not close it while the workflow is enabled. The single comment authored by github-actions[bot] whose body starts with <!-- bisect-tracker --> is the sticky comment; everything else (label edits, manual notes) is fine to add and won't confuse the helper (scripts/ci/post-bisect-comment.py).
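The selection rule for the sticky comment can be illustrated as below. This is a sketch, not the code in post-bisect-comment.py: the find_sticky_comment name is hypothetical, and the input is assumed to be the list of comment objects returned by the GitHub REST API for an issue, each with a user.login and a body field.

```python
MARKER = "<!-- bisect-tracker -->"
BOT_LOGIN = "github-actions[bot]"


def find_sticky_comment(comments):
    """Return the first bot-authored comment whose body starts with the marker, else None."""
    for comment in comments:
        author = comment.get("user", {}).get("login")
        body = comment.get("body") or ""
        if author == BOT_LOGIN and body.startswith(MARKER):
            return comment
    return None
```

Matching on both the author and the leading marker is what makes manual notes safe: a human comment that quotes the marker, or a bot comment without it, is ignored. When nothing matches, a helper along these lines would create the comment instead of editing one.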

See also

  • ADR-0109 — design choices.
  • Research-0001 — cache-shape investigation.
  • Issue #4 — original request, closed by this surface.