ADR-0302: ENCODER_VOCAB v3 — 16-slot schema expansion + retrain plan

Status: Accepted
Date: 2026-05-05
Companion research digest: Research-0078
Related: ADR-0235 (codec-aware FRRegressor + 0.95 LOSO PLCC ship gate), ADR-0272 (fr_regressor_v2 smoke scaffold), ADR-0291 (fr_regressor_v2 flip from smoke to production)
Re-scope of: PR #373 (deferred VT-adapters-plus-vocab change; the VT adapters landed via a separate PR — see tools/vmaf-tune/src/vmaftune/codec_adapters/h264_videotoolbox.py / hevc_videotoolbox.py on master via ADR-0283)

Context

fr_regressor_v2 shipped to production in ADR-0291 against ENCODER_VOCAB v2 (13 slots: libx264, libaom-av1, libx265, h264_nvenc, hevc_nvenc, av1_nvenc, h264_amf, hevc_amf, av1_amf, h264_qsv, hevc_qsv, av1_qsv, libvvenc). Three vmaf-tune codec adapters have landed since:

libsvtav1 (PR #294-series, ADR-0294) — software AV1 alongside libaom-av1, materially different rate-distortion behaviour at matched CQ + preset.
h264_videotoolbox and hevc_videotoolbox (ADR-0283) — Apple hardware-accelerated H.264 / HEVC adapters; first VT family on the corpus side.

The Phase A corpus runner can already emit canonical-6 features for those three encoders, but fr_regressor_v2 has never seen them: the inference path silently maps every unrecognised encoder string to the unknown one-hot column and returns a low-confidence prediction. The cleanest fix is a vocab bump (v2 → v3) plus a fresh LOSO retrain that clears the same 0.95 mean-LOSO-PLCC ship gate ADR-0291 cleared.
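The fallback described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the unknown column is a single trailing column after the vocab slots; the actual column layout and the function name `one_hot_with_unknown` are hypothetical and are not taken from the fr_regressor_v2 code.

```python
from typing import Sequence

# v2 vocabulary as listed above (13 slots).
ENCODER_VOCAB_V2 = (
    "libx264", "libaom-av1", "libx265",
    "h264_nvenc", "hevc_nvenc", "av1_nvenc",
    "h264_amf", "hevc_amf", "av1_amf",
    "h264_qsv", "hevc_qsv", "av1_qsv",
    "libvvenc",
)


def one_hot_with_unknown(encoder: str, vocab: Sequence[str]) -> list[float]:
    """One-hot encode `encoder`; unrecognised names land in a trailing column.

    ASSUMPTION: the "unknown" column is modelled here as one extra slot at
    the end of the vector. The real model may place it elsewhere.
    """
    vec = [0.0] * (len(vocab) + 1)
    try:
        vec[vocab.index(encoder)] = 1.0
    except ValueError:
        # Encoder string is not in the vocabulary: silent fallback.
        vec[-1] = 1.0
    return vec


known = one_hot_with_unknown("libx264", ENCODER_VOCAB_V2)
unseen = one_hot_with_unknown("libsvtav1", ENCODER_VOCAB_V2)
```

Under this assumed layout, `libsvtav1`, `h264_videotoolbox`, and `hevc_videotoolbox` all produce the same vector against the v2 vocabulary, so the regressor cannot tell them apart.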

This ADR documents the schema expansion as a scaffold-only change. The production ONNX swap is gated on a follow-up retrain PR landing the new checkpoint and clearing the LOSO PLCC ship gate; the in-tree v2 ONNX continues to serve until then. ADR-0235's append-only invariant is preserved — every v2 index keeps its column position; the three new slots append at indices 13/14/15 (one-based: 14/15/16).

Decision

Land a 16-slot ENCODER_VOCAB v3 schema scaffold in ai/scripts/train_fr_regressor_v2.py as a parallel constant (ENCODER_VOCAB_V3), without wiring it into the active training pipeline. The live ENCODER_VOCAB and ENCODER_VOCAB_VERSION = 2 remain the source of truth for any retraining run shipping today — this PR ships only the schema definition + the documentation contract that future v3 retrains MUST satisfy.

v3 schema (16 slots, append-only over the user-facing v2 layout documented in ADR-0291):

idx	slot	family	new in v3
0	libx264	SW H.264	—
1	libaom-av1	SW AV1	—
2	libx265	SW HEVC	—
3	h264_nvenc	NVENC H.264	—
4	hevc_nvenc	NVENC HEVC	—
5	av1_nvenc	NVENC AV1	—
6	h264_amf	AMF H.264	—
7	hevc_amf	AMF HEVC	—
8	av1_amf	AMF AV1	—
9	h264_qsv	QSV H.264	—
10	hevc_qsv	QSV HEVC	—
11	av1_qsv	QSV AV1	—
12	libvvenc	SW VVC	—
13	libsvtav1	SW AV1 (SVT)	new
14	h264_videotoolbox	VT H.264	new
15	hevc_videotoolbox	VT HEVC	new
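The table and the append-only invariant can be expressed as a short sketch. The names `ENCODER_VOCAB_V2` and `is_append_only` are hypothetical helpers for illustration, and building the v3 tuple by concatenation is an assumption; the scaffold in ai/scripts/train_fr_regressor_v2.py may define `ENCODER_VOCAB_V3` differently.

```python
# v2 layout (13 slots), in the column order shown in the table.
ENCODER_VOCAB_V2 = (
    "libx264", "libaom-av1", "libx265",
    "h264_nvenc", "hevc_nvenc", "av1_nvenc",
    "h264_amf", "hevc_amf", "av1_amf",
    "h264_qsv", "hevc_qsv", "av1_qsv",
    "libvvenc",
)

# v3 layout: the v2 prefix is untouched and three slots are appended
# at zero-based indices 13, 14, and 15.
ENCODER_VOCAB_V3 = ENCODER_VOCAB_V2 + (
    "libsvtav1",
    "h264_videotoolbox",
    "hevc_videotoolbox",
)


def is_append_only(old: tuple, new: tuple) -> bool:
    """Return True when `new` keeps every `old` entry at its original index."""
    return len(new) >= len(old) and new[: len(old)] == old


# Cheap guard that a future retrain script or unit test could run.
assert is_append_only(ENCODER_VOCAB_V2, ENCODER_VOCAB_V3)
assert len(ENCODER_VOCAB_V3) == 16
```

A check of this shape fails as soon as a slot is reordered, renamed, or dropped, which is the failure mode the append-only invariant is meant to rule out.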

Backwards-compat strategy. Until the v3 ONNX ships and clears the LOSO PLCC ship gate, the runtime continues to load the v2 13-slot ONNX. The v3 schema constant is information-only; no inference path consumes it yet. Once a follow-up retrain PR clears the ship gate, that PR (not this one) flips ENCODER_VOCAB_VERSION from 2 to 3, replaces the live ENCODER_VOCAB tuple, registers the new ONNX in model/tiny/registry.json, and documents the v2 → v3 fallback shim removal.

Ship gate. Mean LOSO PLCC ≥ 0.95 across all 9 Netflix sources, matching the gate ADR-0291 cleared. Per ADR-0235, the multi-codec lift over the v1 single-input regressor must remain ≥ +0.005 PLCC; that floor was already cleared by v2 and is preserved as the v3 acceptance criterion.
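As a rough sketch, the two acceptance conditions could be checked as follows. The function name, the dictionary shape, and the choice to measure lift as the difference of mean LOSO PLCC values are assumptions for illustration; the numbers in the usage lines are placeholders, not measured results.

```python
from statistics import fmean

PLCC_GATE = 0.95   # mean LOSO PLCC floor, per the ship gate above
MIN_LIFT = 0.005   # minimum lift over the v1 single-input regressor


def clears_ship_gate(per_source_plcc: dict[str, float],
                     v1_mean_plcc: float) -> bool:
    """Check both acceptance conditions for a candidate checkpoint.

    ASSUMPTION: lift is taken as (candidate mean PLCC - v1 mean PLCC).
    The exact definition is set by ADR-0235 and may differ.
    """
    mean_plcc = fmean(per_source_plcc.values())
    lift = mean_plcc - v1_mean_plcc
    return mean_plcc >= PLCC_GATE and lift >= MIN_LIFT


# Placeholder values only: nine hypothetical held-out sources.
candidate = {f"source_{i}": 0.96 for i in range(9)}
passes = clears_ship_gate(candidate, v1_mean_plcc=0.94)
```

Keeping both thresholds as named constants makes it harder for a retrain script to pass the PLCC floor while silently dropping the lift requirement.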

Alternatives considered

Option	Pros	Cons	Verdict
16-slot retrain (chosen)	Single LOSO run covers all three new codecs; preserves append-only invariant; matches ADR-0291 ship-gate cadence	Requires Phase A corpus coverage for SVT-AV1 + VT (the corpus runner already supports them, no blocker)	Selected — clears the ship gate in one pass, no schema churn
Incremental per-PR retrains (one new slot per PR)	Smallest blast-radius per change; easier bisect if a single codec drags PLCC	3× the LOSO wall-time + 3× the PR overhead; vocab churn invalidates intermediate ONNX checkpoints; users running `fr_regressor_v2` would see three back-to-back schema bumps	Rejected — cost-of-PR overhead dominates; no real bisect benefit since LOSO already attributes drag per source × encoder cell
Deprecate v2 + retrain from scratch (open vocab, no append-only)	Frees the column ordering; lets us drop unused slots	Breaks ADR-0235's append-only invariant; invalidates every shipped `fr_regressor_v2_*.onnx` consumer; forces a v3 major version bump on every downstream caller	Rejected — append-only is the contract that lets the ONNX checkpoint freeze across vocab edits; abandoning it for a one-time cleanup costs more than the slot waste saves
Defer until a "real" multi-corpus lands	Avoids the risk of OldTownCross-style outliers on the new codecs	Holds back vmaf-tune Phase B consumers that already encode SVT-AV1 + VT material; the Phase A corpus runner can already produce canonical-6 rows for these encoders today	Rejected — the corpus is not the bottleneck, the vocab is; deferring blocks usable predictions on shipped adapters

Consequences

Visible behaviour (this PR): zero. The schema scaffold lands as a parallel constant; existing v2 inference paths are unaffected.
Visible behaviour (follow-up retrain PR, gated on ship gate): fr_regressor_v2 predictions for SVT-AV1, VT-H.264, and VT-HEVC encodes stop falling through to the unknown one-hot and start receiving codec-aware lift.
Backlog opened: T-FR-V2-VOCAB-V3-RETRAIN — produce Phase A corpus rows for libsvtav1 + the two VT encoders, run LOSO, retrain, ship if ≥ 0.95 mean LOSO PLCC.
No upstream interaction. ai/scripts/train_fr_regressor_v2.py is fork-introduced (ADR-0272); upstream Netflix/vmaf has no equivalent surface.

References

req (2026-05-05, popup re-scope): drop the VT adapters from PR #373 (already landed via ADR-0283), keep only the 13 → 16 vocab expansion + retrain plan; ship as scaffold under a new ADR.
ADR-0235 — codec-aware FR regressor + LOSO PLCC ship gate + append-only CODEC_VOCAB invariant.
ADR-0272 — fr_regressor_v2 codec-aware smoke scaffold (smoke checkpoint shipped pending a real Phase A corpus).
ADR-0291 — flip from smoke to production; documents the v2 13-slot vocab and the 0.95 LOSO PLCC ship gate this ADR re-uses.
ADR-0283 — VT codec adapters that motivate slots 14/15 (zero-based indices).
ADR-0294 — libsvtav1 adapter that motivates slot 13 (zero-based index).
Research-0078 — companion research digest with retrain plan, ship gate, reproducer.

Status update 2026-05-09: namespace collision resolved

Two parallel agent reports (abd6ed552ac8cae60, abda108c8263491da) surfaced a name collision: a future "feature-set v3" workstream (canonical-6 + encoder_internal + shot-boundary + hwcap) was unintentionally referring to itself as fr_regressor_v3 — the same id this ADR's retrain checkpoint already claims. The collision is resolved per ADR-0349: this ADR's fr_regressor_v3 registry row stays bit-identical (sha256 eaa16d23…, smoke: false) and the future feature-set work claims the reserved name fr_regressor_v3plus_features. No code change in this ADR; this appendix lands per ADR-0028 (immutability of Accepted-ADR bodies — append-only status updates only).