
Predictor v2 — real-corpus LOSO training (Phase 2)

docs/ai/predictor.md (introduced by PR #450) covers the runtime predict-then-verify loop and the synthetic-stub training pipeline. This page covers Phase 2: how to promote those stubs into production-flippable models trained on real corpora and gated by the ADR-0303 production-flip threshold.

The Phase-2 trainer ships in this repo; the trained-model artefacts do not — operators run the trainer locally against their corpora and commit the resulting ONNX + model-card diff in a follow-up PR.

When to run this

Run Phase 2 when either of the following holds:

  • A real corpus has been generated under ~/.workingdir2/netflix/ (canonical-6 schema; 9 Netflix Public Dataset sources × NVENC / QSV / SW codecs), ~/.workingdir2/konvid-150k/ (KoNViD-1k UGC, when ingested), or ~/.workingdir2/bvi-dvc-raw/ (BVI-DVC raw YUVs, when ingested).
  • An operator wants to validate that the shipped synthetic stubs remain the right ship for a codec by running the gate against whatever corpus is locally available.

The trainer never auto-overwrites a stub ONNX without first clearing the gate; failing codecs keep the stub and the model card gains an explicit Status: Proposed (gate-failed: REASON) block.

What the gate enforces

The Phase-2 gate is the same two-part threshold as ADR-0303 §Decision applied per codec rather than per ensemble seed:

| Sub-gate | Threshold | Failure consequence |
| --- | --- | --- |
| Mean fold PLCC | >= 0.95 | Codec marked fail; ONNX stub kept. |
| Spread (max - min fold PLCC) | <= 0.005 | Codec marked fail; ONNX stub kept. |
| Per-fold floor | >= 0.95 | Codec marked fail; ONNX stub kept. |
| LOSO fold count | 5 | Corpora with < 5 distinct sources -> insufficient-sources. |

These constants live in ai/scripts/train_predictor_v2_realcorpus.py as SHIP_GATE_MEAN_PLCC, SHIP_GATE_PLCC_SPREAD_MAX, SHIP_GATE_PER_FOLD_MIN, LOSO_FOLD_COUNT. Do not lower them to make a codec pass — per CLAUDE.md §13 / feedback_no_test_weakening, the gate is load-bearing. If a codec genuinely requires a different threshold, supersede ADR-0303 with a new ADR and update both call sites (predictor trainer + scripts/ci/ensemble_prod_gate.py) together.
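As a rough illustration of how the sub-gates combine, the sketch below applies the thresholds from the table to a list of per-fold PLCC values. The constant names come from this page; the `evaluate_gate` helper, its return shape, the `"pass"` / `"fail"` status strings, and the wording of the failure reasons are assumptions made for the example and may not match the trainer.

```python
from statistics import mean

# Thresholds copied from the gate table; names match the constants this page
# says live in ai/scripts/train_predictor_v2_realcorpus.py.
SHIP_GATE_MEAN_PLCC = 0.95
SHIP_GATE_PLCC_SPREAD_MAX = 0.005
SHIP_GATE_PER_FOLD_MIN = 0.95
LOSO_FOLD_COUNT = 5


def evaluate_gate(fold_plccs, n_distinct_sources):
    """Hypothetical helper: return (status, failure_reasons) for one codec."""
    # Too few distinct sources means LOSO cannot form the required folds.
    if n_distinct_sources < LOSO_FOLD_COUNT:
        return "insufficient-sources", [
            f"{n_distinct_sources} distinct sources < {LOSO_FOLD_COUNT}"
        ]

    mean_plcc = mean(fold_plccs)
    spread = max(fold_plccs) - min(fold_plccs)
    floor = min(fold_plccs)

    reasons = []
    if mean_plcc < SHIP_GATE_MEAN_PLCC:
        reasons.append(f"mean PLCC {mean_plcc:.4f} < {SHIP_GATE_MEAN_PLCC:.4f}")
    if spread > SHIP_GATE_PLCC_SPREAD_MAX:
        reasons.append(f"PLCC spread {spread:.4f} > {SHIP_GATE_PLCC_SPREAD_MAX:.4f}")
    if floor < SHIP_GATE_PER_FOLD_MIN:
        reasons.append(f"per-fold PLCC {floor:.4f} < {SHIP_GATE_PER_FOLD_MIN:.4f}")

    # Any single failed sub-gate fails the codec; the stub ONNX would be kept.
    return ("fail" if reasons else "pass"), reasons
```

Collecting every failed sub-gate, rather than stopping at the first, keeps the report useful when a codec misses several bounds at once.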

How to run the trainer

The orchestration shell is the canonical entry point. It auto-discovers corpora, runs the trainer, retrains every passing codec on the full corpus, and patches the model cards:

bash ai/scripts/run_predictor_v2_training.sh

Common overrides:

# Point at a specific corpus.
CORPUS=/path/to/canonical6.jsonl \
  bash ai/scripts/run_predictor_v2_training.sh

# Override the discovery roots.
CORPUS_ROOTS="$HOME/data/corpus_a $HOME/data/corpus_b" \
  bash ai/scripts/run_predictor_v2_training.sh

# Diagnostic run when no corpora are on disk yet.
ALLOW_EMPTY=1 bash ai/scripts/run_predictor_v2_training.sh

The script writes:

  • runs/predictor_v2_realcorpus/report.json — machine-readable per-codec verdict + per-fold metrics. Schema:
      • gate.{mean_plcc_threshold, plcc_spread_max, per_fold_min, loso_fold_count, adr}.
      • codecs[].{codec, status, mean_plcc, plcc_spread, mean_srocc, mean_rmse, n_rows_total, n_distinct_sources, failure_reasons[], folds[]}.
      • summary.{n_pass, n_fail, n_insufficient, n_missing_rows}.
      • run_provenance.{schema, entrypoint, argv, args, inputs, outputs} records the exact trainer script, command line, corpus roots/files, and report target used for the run.
  • runs/predictor_v2_realcorpus/train_<UTC>.log — full trainer log.
  • model/predictor_<codec>.onnx — overwritten only for codecs that PASS the gate. Synthetic stubs are kept for failing codecs.
  • model/predictor_<codec>_card.md — every codec's card gains a Status: Production (ADR-0303 gate cleared) or Status: Proposed (gate-failed: REASON) block. Idempotent across re-runs (the prior Status: block is replaced).
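For a quick look at a finished run, the report can be read with a few lines of Python. This is a minimal sketch that relies only on the key names listed in the schema above; the `summarize_report` helper is hypothetical, and the handling of missing or null metrics is an assumption about codecs that never trained.

```python
import json
from pathlib import Path


def summarize_report(path):
    """Hypothetical helper: return printable lines for a Phase-2 report.json."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    lines = []
    for entry in report["codecs"]:
        line = f"{entry['codec']}: {entry['status']}"
        # Assumption: metrics may be absent or null when a codec had no rows.
        if entry.get("mean_plcc") is not None:
            line += f" mean_plcc={entry['mean_plcc']:.4f}"
        if entry.get("plcc_spread") is not None:
            line += f" plcc_spread={entry['plcc_spread']:.4f}"
        lines.append(line)
        # List each recorded reason under the codec it belongs to.
        for reason in entry.get("failure_reasons") or []:
            lines.append(f"  - {reason}")
    summary = report["summary"]
    lines.append(
        f"pass={summary['n_pass']} fail={summary['n_fail']} "
        f"insufficient={summary['n_insufficient']} "
        f"missing_rows={summary['n_missing_rows']}"
    )
    return lines
```

Calling `print("\n".join(summarize_report("runs/predictor_v2_realcorpus/report.json")))` after a run would print one line per codec followed by the summary counts.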

Direct trainer invocation

The shell wraps ai/scripts/train_predictor_v2_realcorpus.py. For tighter control:

# Restrict to a single codec for debugging.
python ai/scripts/train_predictor_v2_realcorpus.py --codec libx264

# Synthetic-smoke run (no real corpus required; never produces PASS).
python ai/scripts/train_predictor_v2_realcorpus.py --synthetic-smoke

# Custom report path + fewer epochs for a quick iteration.
python ai/scripts/train_predictor_v2_realcorpus.py \
  --epochs 50 --report-out /tmp/p2-fast.json

The trainer's exit code is 0 iff every codec passes the gate; this is the CI hook a future workflow can consume once the corpus is hosted somewhere CI can reach.
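A wrapper that consumes this exit code could look like the sketch below. The `gate_cleared` helper is hypothetical; it assumes only that the command exits 0 when every codec passes, as stated above.

```python
import subprocess
import sys


def gate_cleared(cmd):
    """Hypothetical helper: run a trainer command, True only on exit status 0."""
    # check=False so a non-zero exit is reported as False instead of raising.
    completed = subprocess.run(cmd, check=False)
    return completed.returncode == 0


# Example call (not executed here, since it needs the repo and a corpus):
# gate_cleared([sys.executable,
#               "ai/scripts/train_predictor_v2_realcorpus.py",
#               "--codec", "libx264"])
```

Because a single failing codec makes the whole run exit non-zero, a caller that needs per-codec detail would still have to read report.json.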

Reading a fail report

Example honest-fail output (codec genuinely under-fits the corpus):

libvvenc       FAIL                   0.8520  0.0420    320     8
  - mean PLCC 0.8520 < 0.9500 (ADR-0303 part 1)
  - PLCC spread 0.0420 > 0.0050 (ADR-0303 part 2)

Two recovery paths:

  1. Ship more training data. libvvenc may need additional sources to clear the spread bound; add corpora under one of the discovery roots and re-run.
  2. Supersede ADR-0303. If after exhaustive corpus expansion the gate still fails for a structural reason (e.g. encoder is inherently noisier than the deterministic v2 baseline), open a superseding ADR. Do NOT silently lower the threshold in code.

The Status: Proposed (gate-failed: REASON) block on the model card makes the fail visible to anyone reading the card — there is no silent-pass path.
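The idempotent card update can be pictured as "drop any existing Status: line, then append the new one". The sketch below assumes the status is a single line starting with `Status:` and that it is written at the end of the card; the real card layout and the `patch_status` helper name are assumptions, not the orchestration shell's actual code.

```python
import re

# Assumption: the status block is one line that begins with "Status:".
_STATUS_LINE = re.compile(r"^Status:.*\n?", flags=re.MULTILINE)


def patch_status(card_text, passed, reason=""):
    """Hypothetical helper: return card text with exactly one Status: line."""
    if passed:
        status = "Status: Production (ADR-0303 gate cleared)"
    else:
        status = f"Status: Proposed (gate-failed: {reason})"
    # Remove any earlier Status: line so repeated runs do not stack blocks.
    stripped = _STATUS_LINE.sub("", card_text).rstrip("\n")
    if not stripped:
        return f"{status}\n"
    return f"{stripped}\n\n{status}\n"
```

Running the function twice with the same arguments yields the same text, which is the property the re-run behaviour depends on.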

Test coverage

ai/tests/test_train_predictor_v2_realcorpus.py (23 cases) pins:

  • Gate enforcement is honest — synthetic FoldResults at PLCC = 0.85 land in the n_fail bucket, never silently in n_pass. Constants match ADR-0303 §Decision.
  • LOSO partitioning is by source — held-out fold sources never appear in the training fold, even when the same source contributes many rows.
  • Corpus discovery skips missing roots — operators with only one of the three configured corpora do not crash the batch.
  • Report schema is stable: the gate, codecs[], and summary keys are pinned so the orchestration shell can rely on the layout, and diagnostic reports carry ADR-0661 run_provenance.
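The source-disjoint property pinned by the LOSO test can be illustrated with a small partitioning sketch. The `source` row key, the `loso_folds` helper, and the round-robin assignment of sources to folds are assumptions for the example; this page does not document how the trainer groups sources when a corpus has more than five.

```python
LOSO_FOLD_COUNT = 5


def loso_folds(rows, n_folds=LOSO_FOLD_COUNT):
    """Hypothetical helper: yield (train_rows, heldout_rows) split by source."""
    sources = sorted({row["source"] for row in rows})
    if len(sources) < n_folds:
        raise ValueError("insufficient-sources")
    # Assumption: sources are dealt round-robin into n_folds groups, so every
    # row of a given source always lands in the same fold.
    fold_of = {src: i % n_folds for i, src in enumerate(sources)}
    for fold in range(n_folds):
        heldout = [r for r in rows if fold_of[r["source"]] == fold]
        train = [r for r in rows if fold_of[r["source"]] != fold]
        yield train, heldout
```

Because the split is keyed on the source rather than on individual rows, a source that contributes many rows can never leak from the held-out side into the training side.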

The fold-level training body itself (the per-fold MLP fit) is exercised by the existing tools/vmaf-tune/tests/test_predictor_train.py suite from PR #450. Phase 2 uses the same trainer module, so the ONNX export remains byte-stable across the synthetic-stub and real-corpus paths.

Cross-references