Research-0078: ENCODER_VOCAB v3 — 16-slot schema expansion + retrain plan
- Status: First retrain landed gate-passing (ADR-0323, 2026-05-06); multi-codec retrain still pending
- Date: 2026-05-05
- Companion ADR: ADR-0302
- Predecessors: ADR-0235 (codec-aware decision + LOSO PLCC ship gate), ADR-0272 (smoke scaffold), ADR-0291 (production flip + LOSO PLCC = 0.9681 baseline)
Question
Three vmaf-tune codec adapters (libsvtav1, h264_videotoolbox, hevc_videotoolbox) have landed on master since fr_regressor_v2 flipped to production against the 13-slot ENCODER_VOCAB v2. Inference for those codecs falls through to the unknown one-hot column. What is the smallest change that adds them to the vocab without invalidating the v2 ONNX consumer contract, and what retrain effort does the new vocab require to clear the ADR-0291 ship gate?
Scope
- In scope: the schema expansion definition (16-slot tuple), the LOSO ship gate the retrain must clear, and the backwards-compat shim that keeps the v2 ONNX serving until v3 ONNX clears the gate.
- Out of scope: the actual Phase A corpus expansion run for the three new codecs (the Phase A runner already supports them, but the retrain corpus has not been generated yet; that lands in the follow-up retrain PR), and any change to the fr_regressor_v2 graph topology beyond the codec-block one-hot width.
Three new slots
| idx | slot | adapter file | corpus tag |
|---|---|---|---|
| 13 | libsvtav1 | tools/vmaf-tune/src/vmaftune/codec_adapters/svtav1.py | "libsvtav1" |
| 14 | h264_videotoolbox | tools/vmaf-tune/src/vmaftune/codec_adapters/h264_videotoolbox.py | "h264_videotoolbox" |
| 15 | hevc_videotoolbox | tools/vmaf-tune/src/vmaftune/codec_adapters/hevc_videotoolbox.py | "hevc_videotoolbox" |
The corpus runner emits the codec tag verbatim from the adapter registry key, so the vocab strings must match the adapter file's registry name exactly. ADR-0235 §References lists this rule ("never silently default to a codec that doesn't match what the script actually encoded"); this PR honours it by deriving the new slot strings directly from the registry.
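The append-only rule and the exact-match rule can be checked mechanically. The following is a minimal sketch, not the project's code: the 13 v2 entries are placeholders (this digest does not list them, and the position of the unknown slot is assumed), and the helper names `is_append_only` and `missing_from_vocab` are hypothetical.

```python
from __future__ import annotations

# Placeholder names: this digest does not enumerate the 13 v2 slots, and the
# position of "unknown" inside them is an assumption made for illustration.
V2_PLACEHOLDER: tuple[str, ...] = tuple(f"v2_slot_{i:02d}" for i in range(12)) + ("unknown",)

# Slot names taken from the table above (indices 13, 14, 15).
NEW_SLOTS: tuple[str, ...] = ("libsvtav1", "h264_videotoolbox", "hevc_videotoolbox")

# The v3 sketch appends the new slots; it never reorders or inserts.
V3_SKETCH: tuple[str, ...] = V2_PLACEHOLDER + NEW_SLOTS


def is_append_only(old: tuple[str, ...], new: tuple[str, ...]) -> bool:
    """Return True when `new` keeps every `old` entry at its original index."""
    return len(new) >= len(old) and tuple(new[: len(old)]) == tuple(old)


def missing_from_vocab(vocab: tuple[str, ...], registry_keys: list[str]) -> list[str]:
    """Return registry keys with no exactly matching vocab string."""
    return sorted(set(registry_keys) - set(vocab))
```

In a real check, `registry_keys` would come from the adapter registry rather than a hand-written list, so a renamed adapter surfaces as a non-empty result instead of silently landing in the unknown column.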
Retrain ship gate
Mean LOSO PLCC ≥ 0.95 across all 9 Netflix sources, matching the gate ADR-0291 cleared at 0.9681 ± 0.0207. Acceptance criteria for the follow-up retrain PR:
- Mean LOSO PLCC ≥ 0.95 (hard floor — exit non-zero on failure).
- Multi-codec PLCC lift ≥ +0.005 over the v1 single-input regressor, matching the ADR-0235 invariant. v2 cleared this comfortably; v3 adds 3 slots' worth of one-hot width without changing the MLP topology, so the lift floor should remain trivially clearable.
- No source held-out PLCC below 0.85 (relaxed per-source floor; the v2 OldTownCross outlier sat at 0.9183 and was held in scope by ADR-0291; v3 inherits the same relaxation rather than tightening it on a vocab-only change).
- RMSE within 1.5× of the v2 production checkpoint's per-source RMSE for any source already covered in v2 (regression detector — a vocab bump should not degrade prediction quality on previously shipped codecs).
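The four criteria above could be evaluated along the following lines. This is a sketch under stated assumptions: the function name `gate_failures`, its argument shapes (dicts keyed by source name), and the way the lift over v1 is passed in are all hypothetical; only the thresholds come from this digest.

```python
from __future__ import annotations

from statistics import fmean

# Thresholds as stated in the acceptance criteria above.
MEAN_PLCC_FLOOR = 0.95
PER_SOURCE_PLCC_FLOOR = 0.85
MIN_LIFT_OVER_V1 = 0.005
MAX_RMSE_RATIO = 1.5


def gate_failures(
    plcc: dict[str, float],
    lift_over_v1: float,
    rmse: dict[str, float],
    v2_rmse: dict[str, float],
) -> list[str]:
    """Return one message per failed criterion; an empty list means PASS."""
    failures: list[str] = []
    mean_plcc = fmean(plcc.values())
    if mean_plcc < MEAN_PLCC_FLOOR:
        failures.append(f"mean LOSO PLCC {mean_plcc:.4f} < {MEAN_PLCC_FLOOR}")
    if lift_over_v1 < MIN_LIFT_OVER_V1:
        failures.append(f"multi-codec lift {lift_over_v1:+.4f} < +{MIN_LIFT_OVER_V1}")
    for source, value in plcc.items():
        if value < PER_SOURCE_PLCC_FLOOR:
            failures.append(f"{source}: PLCC {value:.4f} < {PER_SOURCE_PLCC_FLOOR}")
    # Regression detector: only sources already covered by v2 are compared.
    for source, baseline in v2_rmse.items():
        if source in rmse and rmse[source] > MAX_RMSE_RATIO * baseline:
            failures.append(f"{source}: RMSE {rmse[source]:.3f} > 1.5x v2 ({baseline:.3f})")
    return failures


def exit_code(failures: list[str]) -> int:
    """Map the failure list to a process exit status (non-zero on failure)."""
    return 1 if failures else 0
```

A training script would typically finish with `sys.exit(exit_code(failures))` so that CI treats a missed floor as a hard failure.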
Backwards-compat strategy
The schema scaffold (this PR) does not alter the live ENCODER_VOCAB constant or ENCODER_VOCAB_VERSION. It adds an ENCODER_VOCAB_V3 parallel tuple as documentation of the target schema. The v2 13-slot ONNX continues to serve every consumer; the runtime fallback for an unrecognised encoder string remains the unknown one-hot column.
The follow-up retrain PR is responsible for:
- Bumping ENCODER_VOCAB to the 16-slot v3 tuple in place.
- Bumping ENCODER_VOCAB_VERSION from 2 to 3.
- Removing the ENCODER_VOCAB_V3 parallel constant (the live ENCODER_VOCAB becomes the single source of truth again).
- Training a fresh ONNX against the expanded Phase A corpus and shipping it under model/tiny/fr_regressor_v2.onnx (path stays stable; sha256 + sidecar in model/tiny/registry.json are the integrity contract that prevents accidental v2-vs-v3 mixing).
- Honouring the documented load-fallback shim: a runtime that encounters a v2 ONNX in registry but receives a v3 vocab string collapses the unknown indices into the unknown column rather than failing the inference call. Symmetrically, a v2 vocab string loaded against a v3 ONNX uses the matching v3 column index; the v2 indices 0..12 are preserved verbatim under append-only.
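The load-fallback behaviour described in the last item can be illustrated with a small encoder. This is a sketch only: `encode_one_hot` is a hypothetical helper, the v2 slot names are placeholders, and the assumption is that the model's own vocab (read from its sidecar) decides the column layout.

```python
from __future__ import annotations

# Placeholder v2 vocab (13 slots); real names and the position of "unknown"
# are not given in this digest and are assumed here.
V2_PLACEHOLDER: tuple[str, ...] = tuple(f"v2_slot_{i:02d}" for i in range(12)) + ("unknown",)
V3_SKETCH: tuple[str, ...] = V2_PLACEHOLDER + (
    "libsvtav1",
    "h264_videotoolbox",
    "hevc_videotoolbox",
)


def encode_one_hot(encoder: str, model_vocab: tuple[str, ...]) -> list[float]:
    """One-hot encode `encoder` against the vocab the loaded ONNX was trained on.

    Strings the model does not know collapse into the "unknown" column
    instead of raising, so the inference call still succeeds.
    """
    if encoder in model_vocab:
        index = model_vocab.index(encoder)
    else:
        index = model_vocab.index("unknown")
    row = [0.0] * len(model_vocab)
    row[index] = 1.0
    return row
```

Because v3 only appends, a v2 string maps to the same column index whichever ONNX is loaded; only the row width differs (13 against 16).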
Production-flip checklist (for the follow-up retrain PR)
- Phase A corpus coverage: runs/phase_a/full_grid/per_frame_canonical6.jsonl contains rows tagged libsvtav1, h264_videotoolbox, hevc_videotoolbox for ≥ 6 of the 9 Netflix sources each (matching v2's per-source coverage on the existing 12 hardware encoders).
- LOSO eval clears the four acceptance criteria above.
- ENCODER_VOCAB_VERSION bumped 2 → 3 in ai/scripts/train_fr_regressor_v2.py.
- model/tiny/registry.json updated: new sha256, byte length, and vocab_version: 3 field (the schema must allow this; if it does not today, the registry schema bump rides along in the same PR).
- Sidecar JSON encoder_vocab array contains 16 entries in the order documented in ADR-0302's table.
- docs/ai/inference.md example includes at least one of the three new codecs in its sample output.
- ai/AGENTS.md v3 retrain invariant section moved from "pending" to "shipped"; the v2 ONNX entry is removed from the "do not replace until cleared" list.
- Smoke check: a synthetic 16-row test passes python -m pytest ai/tests/ -k encoder_vocab after the retrain PR adds the test fixture.
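The corpus-coverage item in the checklist above could be verified with a short script over the JSONL file. This is a sketch under assumptions: the row field names `encoder` and `source` are hypothetical (the digest does not document the corpus schema), as are the helper names.

```python
from __future__ import annotations

import json
from collections import defaultdict
from typing import Iterable

NEW_CODECS = ("libsvtav1", "h264_videotoolbox", "hevc_videotoolbox")


def sources_per_codec(
    lines: Iterable[str],
    codec_key: str = "encoder",  # assumed field name
    source_key: str = "source",  # assumed field name
) -> dict[str, int]:
    """Count the distinct sources seen for each codec tag in a JSONL stream."""
    seen: dict[str, set[str]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        seen[row[codec_key]].add(row[source_key])
    return {codec: len(sources) for codec, sources in seen.items()}


def under_covered(
    counts: dict[str, int],
    required: Iterable[str] = NEW_CODECS,
    min_sources: int = 6,
) -> list[str]:
    """Return the required codecs that appear in fewer than `min_sources` sources."""
    return [codec for codec in required if counts.get(codec, 0) < min_sources]
```

Passing an open file handle for the corpus as `lines` streams it row by row, so the check does not need to load the whole corpus into memory.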
Reproducer (this PR — schema scaffold only)
```bash
# Verify the v3 constant parses and has the documented length.
python3 -c "
import importlib.util, pathlib
spec = importlib.util.spec_from_file_location(
    't', pathlib.Path('ai/scripts/train_fr_regressor_v2.py')
)
m = importlib.util.module_from_spec(spec)
spec.loader.exec_module(m)
assert len(m.ENCODER_VOCAB_V3) == 16, len(m.ENCODER_VOCAB_V3)
assert m.ENCODER_VOCAB_VERSION == 2, 'live vocab still v2 on the scaffold PR'
print('OK')
"
# Will be a no-op until the follow-up retrain PR adds the test
# fixture; included here as the canonical smoke command.
python -m pytest ai/tests/ -k encoder_vocab
```
Headline results — first retrain (ADR-0323, 2026-05-06)
The first v3 retrain shipped under the PR "feat(ai): fr_regressor_v3 — train + register on ENCODER_VOCAB v3 (16-slot)", which runs ai/scripts/train_fr_regressor_v3.py on the existing Phase A canonical-6 corpus (5,640 rows, NVENC-only).
Gate PASS. Mean LOSO PLCC = 0.9975 ± 0.0018 across the 9 Netflix sources. Per-source PLCC range 0.9945 → 0.9996; every source clears the 0.99 mark and the relaxed per-source 0.85 floor. The 0.95 hard floor is cleared by a margin of about 0.0475 (roughly 5 points of PLCC).
| Source | PLCC | SROCC | RMSE |
|---|---|---|---|
| BigBuckBunny_25fps | 0.9973 | 0.9878 | 0.787 |
| BirdsInCage_30fps | 0.9988 | 0.9989 | 0.432 |
| CrowdRun_25fps | 0.9996 | 0.9972 | 0.677 |
| ElFuente1_30fps | 0.9987 | 0.8805 | 0.822 |
| ElFuente2_30fps | 0.9950 | 0.9984 | 3.288 |
| FoxBird_25fps | 0.9945 | 0.9329 | 0.904 |
| OldTownCross_25fps | 0.9981 | 0.9951 | 0.810 |
| Seeking_25fps | 0.9989 | 0.9877 | 1.013 |
| Tennis_24fps | 0.9962 | 0.9436 | 1.061 |
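The headline mean and spread can be recomputed from the PLCC column of the table above. The snippet below is a small check, not the project's evaluation code; it assumes the ± figure is the sample standard deviation across the nine folds, which is what reproduces 0.0018 from these rounded values (the population form gives 0.0017).

```python
from statistics import fmean, stdev

# Per-source LOSO PLCC values copied from the table above.
PLCC = {
    "BigBuckBunny_25fps": 0.9973,
    "BirdsInCage_30fps": 0.9988,
    "CrowdRun_25fps": 0.9996,
    "ElFuente1_30fps": 0.9987,
    "ElFuente2_30fps": 0.9950,
    "FoxBird_25fps": 0.9945,
    "OldTownCross_25fps": 0.9981,
    "Seeking_25fps": 0.9989,
    "Tennis_24fps": 0.9962,
}

values = list(PLCC.values())
mean_plcc = fmean(values)   # arithmetic mean over the 9 held-out sources
spread = stdev(values)      # sample standard deviation (n - 1 denominator)

print(f"mean LOSO PLCC = {mean_plcc:.4f} ± {spread:.4f}")
# prints: mean LOSO PLCC = 0.9975 ± 0.0018
```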
The OldTownCross outlier from v2 (0.9183) cleared 0.998 on v3 — the extra two-epoch budget (200 vs the v2 ensemble's 200) and the fold-local StandardScaler combine to lift the trickiest-content fold. ElFuente2's 3.288 RMSE is the largest residual; the per-frame VMAF range on that source is wide (panning + saturation transitions), but PLCC stays at 0.995.
Caveat — the multi-codec lift floor (≥+0.005 PLCC over v1 per ADR-0235) is NOT yet measurable on this corpus drop. The corpus is NVENC-only; 15 of 16 vocab slots receive zero training rows. v3 vs v1 on NVENC-only collapses to v1-vs-v1 on a single codec. The multi-codec lift gate is deferred to the follow-up retrain that consumes a multi-codec Phase A corpus drop. The first retrain ships with smoke: false because the 0.95 floor — the ADR-0302-cited gate — passed; the multi-codec lift becomes a gate on the v2 → v3 in-place promotion PR, not on this parallel-checkpoint PR.
Open questions (for the follow-up retrain PR)
- VT corpus availability: the Phase A runner supports VT, but VT requires Apple silicon. Does the local corpus drop need to include VT rows, or can the retrain skip VT and document its slots as zero-weight columns until VT corpus is generated? Provisional answer: include VT slots in the schema today (this PR), defer VT corpus rows to a follow-up; the retrain proceeds on libsvtav1 + the existing 12 hardware encoders, with the VT slots receiving training-time mask = 0. This keeps the column indices stable and avoids a second vocab bump when VT corpus eventually lands.
- Per-source CQ range parity: OldTownCross was the v2 outlier. Does adding three new codecs widen its per-pair VMAF range enough to lift its LOSO PLCC above 0.95? Empirical question for the retrain run.
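One possible reading of "training-time mask = 0" in the provisional answer above is an elementwise mask over the codec one-hot block that zeroes the VT columns while leaving their indices in place. The sketch below illustrates that reading only; the digest does not specify how the mask is applied, and `SLOT_MASK` and `mask_codec_block` are hypothetical names.

```python
from __future__ import annotations

VOCAB_WIDTH = 16
VT_SLOTS = (14, 15)  # h264_videotoolbox, hevc_videotoolbox per the slot table

# 1.0 keeps a column, 0.0 silences it; the width never changes, so column
# indices stay stable when VT corpus rows eventually arrive.
SLOT_MASK: list[float] = [0.0 if i in VT_SLOTS else 1.0 for i in range(VOCAB_WIDTH)]


def mask_codec_block(one_hot: list[float]) -> list[float]:
    """Apply the slot mask to a 16-wide codec one-hot row."""
    if len(one_hot) != VOCAB_WIDTH:
        raise ValueError(f"expected {VOCAB_WIDTH} columns, got {len(one_hot)}")
    return [value * keep for value, keep in zip(one_hot, SLOT_MASK)]
```

Under this reading, lifting the mask later is a one-line change to `VT_SLOTS`, with no second vocab bump.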
References
- ADR-0302 — this digest's companion ADR.
- ADR-0291 — v2 production flip + 0.95 LOSO PLCC ship gate this digest re-uses.
- ADR-0235 — append-only vocab invariant + multi-codec lift floor (+0.005 PLCC).
- ADR-0272 — smoke scaffold; documents the codec block layout this digest preserves.
- ADR-0283 — VT adapters that motivate slots 14/15.
- ADR-0294 — libsvtav1 adapter that motivates slot 13.