ADR-1105: fr_regressor_v2_ensemble production flip deferred to the one-shot post-RC retrain
- Status: Accepted
- Date: 2026-06-13
- Deciders: Lusoris
- Tags: ai, models, rc, docs
Context
ADR-0321 flipped the five fr_regressor_v2_ensemble_v1_seed{0..4} rows in model/tiny/registry.json from smoke to production: it shipped real LOSO-validated ONNX weights (gate verdict PROMOTE, mean PLCC ≈ 0.997), per-seed sidecars, and smoke: false. Those weights were trained with a codec one-hot of width 14 (codec_vocab = 12 codec entries + 2 norm dims).
codec_vocab was subsequently trimmed to 6 (x264, x265, libsvtav1, libvvenc, libvpx-vp9, unknown; see model/tiny/fr_regressor_v2_ensemble_v1.json). This made the production ONNX input dimension stale: a model expecting [batch, 14] can no longer be fed the current [batch, 6] codec one-hot, which surfaced as an eval_probabilistic_proxy.py --smoke load failure.
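The width mismatch can be illustrated with a minimal, dependency-free sketch. The helper names codec_one_hot and check_input_width are hypothetical, the vocabulary order is taken from the list above, and mapping unseen codecs to unknown is an assumption rather than confirmed trainer behaviour.

```python
# Illustrative sketch only: these helpers are hypothetical, not the repository's API.
CODEC_VOCAB = ["x264", "x265", "libsvtav1", "libvvenc", "libvpx-vp9", "unknown"]


def codec_one_hot(codec: str) -> list[float]:
    """Return a one-hot row over CODEC_VOCAB (assumption: unseen codecs map to 'unknown')."""
    name = codec if codec in CODEC_VOCAB else "unknown"
    index = CODEC_VOCAB.index(name)
    return [1.0 if i == index else 0.0 for i in range(len(CODEC_VOCAB))]


def check_input_width(batch: list[list[float]], expected_width: int) -> None:
    """Raise ValueError when a row's width differs from the model's declared input width."""
    for row in batch:
        if len(row) != expected_width:
            raise ValueError(
                f"model expects [batch, {expected_width}], got [batch, {len(row)}]"
            )


batch = [codec_one_hot("x265"), codec_one_hot("libaom")]

# A model exported at the current width accepts the batch.
check_input_width(batch, expected_width=6)

# A model exported at the old width rejects it, mirroring the load failure.
try:
    check_input_width(batch, expected_width=14)
except ValueError as err:
    print(err)
```

A real loader would read the expected width from the ONNX graph's input shape instead of taking it as an argument.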
PR #865 fixed the load path by regenerating the five ONNX files at the correct [batch, 6] width, but it did so with the trainer's --smoke mode — one epoch on a synthetic corpus, i.e. throwaway placeholder weights, not a production fit. The registry was correspondingly set to smoke: true with new sha256 values. Two side effects of that PR were not intended:
- The license, license_url, and sigstore_bundle fields were dropped from the five rows. This is a pure regression: every registry entry (smoke or not) must carry license metadata, and the test_every_entry_has_license_metadata invariant enforces it. The fields are restored unconditionally in this PR.
- The five rows now claim neither production nor a consistent provenance record: the on-disk ONNX are smoke (new sha), while the retained per-seed sidecars still describe the older [batch, 14] production weights (old sha, PROMOTE). The test_fr_regressor_v2_ensemble_seed_rows_are_production invariant (added by ADR-0321) consequently fails on its smoke is False assertion.
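As a rough illustration of the license-metadata invariant, the sketch below reports which rows lack the three fields. The registry layout (a list of dicts keyed by id), the placeholder values, and the helper name entries_missing_license_metadata are assumptions for illustration; the actual schema test may be structured differently.

```python
# Illustrative sketch only: row layout and helper name are hypothetical.
REQUIRED_LICENSE_FIELDS = ("license", "license_url", "sigstore_bundle")


def entries_missing_license_metadata(entries: list[dict]) -> dict[str, list[str]]:
    """Map each entry id to the required license fields that are absent from it."""
    missing = {}
    for entry in entries:
        # Presence check only; whether a null value is acceptable is not assumed here.
        absent = [field for field in REQUIRED_LICENSE_FIELDS if field not in entry]
        if absent:
            missing[entry.get("id", "<no id>")] = absent
    return missing


rows = [
    {
        "id": "seed0",
        "smoke": True,
        "license": "EXAMPLE-LICENSE",              # placeholder value
        "license_url": "https://example.invalid/",  # placeholder value
        "sigstore_bundle": "seed0.sigstore.json",   # placeholder value
    },
    {"id": "seed1", "smoke": True},  # mimics a row that lost its license metadata
]

print(entries_missing_license_metadata(rows))
```

The check applies to smoke rows as well, which is why a smoke regeneration that drops the fields counts as a regression.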
Producing real production weights at codec_vocab = 6 requires re-running export_ensemble_v2_seeds.py against the corpus — a full retrain/re-export. The operator has locked all model retraining into a single one-shot step to be run only after the toolchain reaches RC and the feature numbers are frozen, explicitly to avoid retraining models repeatedly. The ensemble is in scope for that one-shot retrain. Doing a piecemeal ensemble-only retrain now would contradict that decision and risk shifting numbers that the one-shot run is meant to freeze.
Decision
For the release candidate, ship the ensemble seed rows honestly as smoke placeholders and defer the production flip to the locked one-shot retrain:
- Restore license, license_url, and sigstore_bundle on all five rows (the #865 regression). Keep smoke: true and the regenerated [batch, 6] sha256 values, which match the ONNX actually shipped. Update each row's notes to state plainly that these are smoke placeholders pending the one-shot production re-export.
- Keep the test_fr_regressor_v2_ensemble_seed_rows_are_production assertions verbatim (they remain the target production contract), but mark the test @pytest.mark.xfail(strict=True) with a reason citing this ADR. strict=True means the test fails the suite the moment the one-shot retrain lands real weights (smoke: false + a sidecar whose sha256 matches the shipped ONNX), forcing removal of the marker, so the deferral cannot silently outlive its cause.
- When the one-shot retrain runs, it must re-run export_ensemble_v2_seeds.py so the ONNX bytes and sidecars regenerate together at codec_vocab = 6, then flip smoke: false and remove the xfail marker. This is exactly the workflow ADR-0321's follow-ups already mandate (hand-flipping rows remains forbidden).
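The strict-xfail mechanism can be sketched as follows. SEED_ROW and the test name are hypothetical stand-ins, not the real contents of model_registry_schema_test.py; only pytest.mark.xfail and its strict and reason parameters are actual pytest API.

```python
import pytest

# Hypothetical stand-in for one ensemble seed row as shipped in the RC.
SEED_ROW = {"id": "fr_regressor_v2_ensemble_v1_seed0", "smoke": True}


@pytest.mark.xfail(
    strict=True,
    reason="ADR-1105: production flip deferred to the one-shot post-RC retrain",
)
def test_seed_row_is_production():
    # The production contract is left untouched; it fails today because smoke is True.
    assert SEED_ROW["smoke"] is False
```

While the row is smoke, pytest reports the test as an expected failure (xfail). Once the assertion starts passing, strict=True turns the unexpected pass into a suite failure, which is the signal to delete the marker.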
This does not modify any Netflix golden-data assertion (CLAUDE.md §8); model_registry_schema_test.py is a fork-local file.
Alternatives considered
- Retrain the five ensemble seeds now (production flip immediately). This is the eventual correct end state but contradicts the locked one-shot retrain decision (no piecemeal retraining before the toolchain is RC-frozen) and would risk moving numbers the one-shot run is meant to freeze. Rejected for RC; it is precisely what the one-shot retrain will do.
- Revert the ONNX to the old [batch, 14] production weights. Restores smoke: false consistency but re-breaks the load path under the current codec_vocab = 6, reintroducing the eval_probabilistic_proxy failure #865 fixed. Rejected.
- Leave the test failing as a known local red. The schema test is non-gating in CI (it runs under the || true block in tests-and-quality-gates.yml), so this would not break master. But a bare red is noise that can mask a future regression in the same test and carries no self-healing signal. The strict-xfail marker is the honest, self-documenting, auto-alerting representation. Rejected in favour of strict xfail.
- Delete the stale per-seed sidecars. They describe the older production weights, not the shipped smoke ONNX. Keeping them preserves the genuine PROMOTE provenance and the sha the one-shot retrain will supersede; the only consumer is the now-xfailed production test. Rejected (sidecars kept) to retain provenance.
Consequences
- Positive: The RC registry is internally consistent and honest: every entry has license metadata, and smoke: true truthfully describes the shipped weights. The deferral is tracked by a strict marker that fails loudly when resolved.
- Negative: The probabilistic ensemble head ships at smoke quality in the RC. Any consumer that loads it gets placeholder predictions until the one-shot retrain. This is documented in the model card and state.md.
- Neutral / follow-ups: The one-shot retrain must (a) re-export the five seeds via export_ensemble_v2_seeds.py at codec_vocab = 6, (b) flip smoke: false, (c) remove the xfail marker in model_registry_schema_test.py. Tracked in docs/state.md and the retrain plan.
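Before flipping smoke: false, the retrain could confirm that ONNX bytes, sidecar, and registry row all agree on one digest. The sketch below is a minimal version of that check under stated assumptions: the sidecar is taken to be a JSON file with a sha256 key, and ready_for_production_flip is a hypothetical helper, not an existing script.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def ready_for_production_flip(onnx_path: Path, sidecar_path: Path, row: dict) -> bool:
    """True only when the sidecar and the registry row both record the ONNX file's digest."""
    onnx_sha = sha256_of(onnx_path)
    sidecar = json.loads(sidecar_path.read_text())  # assumption: JSON sidecar with 'sha256'
    return sidecar.get("sha256") == onnx_sha and row.get("sha256") == onnx_sha


with tempfile.TemporaryDirectory() as tmp:
    onnx = Path(tmp) / "seed0.onnx"
    sidecar = Path(tmp) / "seed0.json"
    onnx.write_bytes(b"placeholder bytes standing in for an ONNX file")
    digest = sha256_of(onnx)

    # Sidecar regenerated together with the ONNX: digests agree.
    sidecar.write_text(json.dumps({"sha256": digest}))
    fresh = ready_for_production_flip(onnx, sidecar, {"sha256": digest})

    # Stale sidecar describing older weights: digests disagree.
    sidecar.write_text(json.dumps({"sha256": "0" * 64}))
    stale = ready_for_production_flip(onnx, sidecar, {"sha256": digest})

print(fresh, stale)
```

The stale case corresponds to the current RC state, where the retained sidecars still carry the sha of the older [batch, 14] weights.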
Supply-chain impact
- New dependencies: none.
- Removed dependencies: none.
- Build-time fetches: none.
References
- Parent / superseded-context: ADR-0321, ADR-0303, ADR-0309.
- Regression source: PR #865 (8b7ae731a), which dropped license metadata and regenerated ONNX in --smoke mode.
- req: operator direction. Retrain all models exactly once, after the toolchain reaches RC and feature numbers are frozen, to avoid repeating the retrain; the ensemble is in scope for that one-shot run.