ADR-0656: External-bench wrappers emit registry competitor keys
- Status: Accepted
- Date: 2026-05-20
- Deciders: Lusoris, Codex
- Tags: ai, testing, tooling, fork-local
Context
tools/external-bench/compare.py validates each wrapper payload before aggregation. The validator requires summary.competitor to equal the wrapper registry key from WRAPPERS, because that key is what the CLI accepts through --competitors and what the report table groups by.
The two fork-side shell wrappers emitted descriptive labels instead: fork-fr-regressor-v2-ensemble and fork-nr-metric-v1. Those labels are useful as display prose, but they violate the schema introduced by Research-0118 and cause the fork's own competitors to be skipped before aggregation. This weakens the external-bench path exactly where the signal-mix audit needs NR/MOS second-opinion comparisons.
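The identity rule described above can be sketched as follows. This is a minimal illustration, not the code in `compare.py`: the function name `check_competitor_key`, the registry values, and the error strings are all assumptions; only the two registry keys and the `summary.competitor` field come from this ADR.

```python
# Hypothetical registry. Only the two keys are taken from this ADR;
# the script paths are placeholders for illustration.
WRAPPERS = {
    "fork-fr-regressor": "wrappers/fork-fr-regressor.sh",
    "fork-nr-metric": "wrappers/fork-nr-metric.sh",
}


def check_competitor_key(registry_key: str, payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload can be aggregated.

    Stand-in for the real validator, whose signature is not shown in this ADR.
    """
    problems = []
    if registry_key not in WRAPPERS:
        problems.append(f"unknown registry key: {registry_key!r}")
    summary = payload.get("summary")
    if not isinstance(summary, dict):
        problems.append("missing 'summary' object")
        return problems
    emitted = summary.get("competitor")
    # The identity field must match the registry key exactly: no aliases,
    # no version suffixes.
    if emitted != registry_key:
        problems.append(
            f"summary.competitor is {emitted!r}, expected {registry_key!r}"
        )
    return problems
```

Under this sketch, a payload carrying `fork-nr-metric-v1` for the `fork-nr-metric` wrapper yields one problem and would be skipped, which matches the failure described in the Context.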
Decision
The fork-side wrappers will emit the exact registry keys fork-fr-regressor and fork-nr-metric in summary.competitor. Model and version detail stays out of that identity field and belongs in optional metadata, wrapper logs, or future report columns. Shell-wrapper smoke tests will exercise the two fork wrappers with a fake vmaf-tune binary so this contract is pinned without installing external competitors.
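A smoke test of this kind might look roughly like the sketch below. It assumes a POSIX `sh` is available. The stand-in wrapper script, the `score` field, and the helper name `run_wrapper_with_fake_binary` are hypothetical; a real test would run the fork's own wrapper scripts instead of the stand-in.

```python
import json
import os
import subprocess
import tempfile
from pathlib import Path

# Fake vmaf-tune: ignores its arguments and prints a fixed number.
FAKE_VMAF_TUNE = "#!/bin/sh\necho 42.0\n"

# Hypothetical stand-in for a fork wrapper; the real scripts are not
# reproduced here. The "score" field is illustrative only.
STAND_IN_WRAPPER = r"""#!/bin/sh
score="$(vmaf-tune)"
printf '{"summary": {"competitor": "fork-nr-metric", "score": %s}}\n' "$score"
"""


def run_wrapper_with_fake_binary(wrapper_source: str) -> dict:
    """Run a wrapper script with a fake vmaf-tune first on PATH; return its JSON."""
    with tempfile.TemporaryDirectory() as tmp:
        bin_dir = Path(tmp) / "bin"
        bin_dir.mkdir()
        fake = bin_dir / "vmaf-tune"
        fake.write_text(FAKE_VMAF_TUNE)
        fake.chmod(0o755)  # the wrapper resolves and executes it via PATH

        wrapper = Path(tmp) / "wrapper.sh"
        wrapper.write_text(wrapper_source)

        # Prepend the fake binary's directory so it shadows any real install.
        path = f"{bin_dir}{os.pathsep}{os.environ.get('PATH', '')}"
        env = dict(os.environ, PATH=path)
        result = subprocess.run(
            ["sh", str(wrapper)],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        return json.loads(result.stdout)


def test_wrapper_emits_registry_key() -> None:
    payload = run_wrapper_with_fake_binary(STAND_IN_WRAPPER)
    assert payload["summary"]["competitor"] == "fork-nr-metric"
```

Shadowing the binary through `PATH` keeps the wrapper script itself unmodified, so the assertion covers the same output path that `compare.py` would read.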
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Fix wrapper output to registry keys | Keeps the documented schema strict; makes table grouping deterministic; minimal user-visible change | Descriptive version labels are no longer shown in the identity column | Chosen. The identity field is a machine contract, not a display label. |
| Relax validate_wrapper_output() to accept aliases | Preserves the current descriptive labels | Reopens schema ambiguity and needs alias tables everywhere aggregation compares keys | Rejected. It makes the validator weaker immediately after adding it. |
| Change WRAPPERS keys to the descriptive labels | Keeps wrapper output unchanged | Breaks existing --competitors fork-fr-regressor fork-nr-metric usage and docs | Rejected. The public CLI keys are already documented. |
Consequences
- Positive: `compare.py` can aggregate the fork-side FR and NR wrappers instead of skipping them as invalid schema.
- Positive: the tests now exercise the real shell wrappers at the schema boundary while still stubbing `vmaf-tune` and external binaries.
- Negative: operators who read only the `summary.competitor` field get the stable key rather than the exact model version; future metadata can carry model IDs explicitly if needed.
- Neutral / follow-ups: NR/MOS second-opinion feature materialisation should consume the same registry keys when it reuses external-bench wrapper output.
References
- req: "well go on i guess we have enough backlog..."
- ADR-0368: wrapper-only external benchmark harness.
- Research-0118: schema validation at the wrapper boundary.
- ADR-0650: signal-mix audit identifies NR/MOS second opinions as a missing signal family.