ADR-0656: External-bench wrappers emit registry competitor keys

  • Status: Accepted
  • Date: 2026-05-20
  • Deciders: Lusoris, Codex
  • Tags: ai, testing, tooling, fork-local

Context

tools/external-bench/compare.py validates each wrapper payload before aggregation. The validator requires summary.competitor to equal the wrapper registry key from WRAPPERS, because that key is what the CLI accepts through --competitors and what the report table groups by.
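The identity check described above can be sketched roughly as follows. This is a minimal illustration, not the code in compare.py: the signature and return type of validate_wrapper_output(), the contents of WRAPPERS, and the script paths are all assumptions made for the example.

```python
# Hypothetical registry. The real WRAPPERS mapping in compare.py may hold
# more keys and different values; the script paths here are illustrative only.
WRAPPERS = {
    "fork-fr-regressor": "wrappers/fork-fr-regressor.sh",
    "fork-nr-metric": "wrappers/fork-nr-metric.sh",
}


def validate_wrapper_output(registry_key, payload):
    """Return a list of schema problems; an empty list means the payload passes.

    Assumed signature: the real validator may raise or return a bool instead.
    """
    if registry_key not in WRAPPERS:
        return [f"unknown registry key {registry_key!r}"]
    summary = payload.get("summary")
    if not isinstance(summary, dict):
        return ["payload has no summary object"]
    competitor = summary.get("competitor")
    if competitor != registry_key:
        # A descriptive label fails here even though it names the right tool.
        return [f"summary.competitor is {competitor!r}, expected {registry_key!r}"]
    return []


good = validate_wrapper_output(
    "fork-nr-metric", {"summary": {"competitor": "fork-nr-metric"}}
)
bad = validate_wrapper_output(
    "fork-nr-metric", {"summary": {"competitor": "fork-nr-metric-v1"}}
)
print(good)
print(bad)
```

Under this sketch, a payload carrying the descriptive label fork-nr-metric-v1 is reported as invalid, which is the condition that leads to the wrapper being skipped before aggregation.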

The two fork-side shell wrappers emitted descriptive labels instead: fork-fr-regressor-v2-ensemble and fork-nr-metric-v1. Those labels are useful as display text, but they violate the schema introduced by Research-0118, so compare.py skips the fork's own competitors before aggregation. This weakens the external-bench path exactly where the signal-mix audit (ADR-0650) needs NR/MOS second-opinion comparisons.

Decision

The fork-side wrappers will emit the exact registry keys fork-fr-regressor and fork-nr-metric in summary.competitor. Model and version detail stays out of that identity field and belongs in optional metadata, wrapper logs, or future report columns. Shell-wrapper smoke tests will exercise the two fork wrappers with a fake vmaf-tune binary so this contract is pinned without installing external competitors.
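One way to keep model and version detail out of the identity field is sketched below. The helper name build_summary_payload, the score field, and the metadata.model_id location are hypothetical and chosen only for illustration; the actual wrappers are shell scripts and the real payload schema may differ.

```python
import json


def build_summary_payload(registry_key, score, model_id=None):
    """Build a wrapper payload whose identity field is the bare registry key.

    Hypothetical helper: field names other than summary.competitor are assumed.
    """
    payload = {"summary": {"competitor": registry_key, "score": score}}
    if model_id is not None:
        # Version detail travels in optional metadata, never in the identity field.
        payload["metadata"] = {"model_id": model_id}
    return payload


payload = build_summary_payload(
    "fork-fr-regressor", 93.4, model_id="fork-fr-regressor-v2-ensemble"
)
print(json.dumps(payload, indent=2))
```

With this shape, grouping and --competitors filtering only ever see the stable key, while the descriptive label remains available to any report column that chooses to read the metadata.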

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Fix wrapper output to registry keys | Keeps the documented schema strict; makes table grouping deterministic; minimal user-visible change | Descriptive version labels are no longer shown in the identity column | Chosen. The identity field is a machine contract, not a display label. |
| Relax validate_wrapper_output() to accept aliases | Preserves the current descriptive labels | Reopens schema ambiguity and needs alias tables everywhere aggregation compares keys | Rejected. It makes the validator weaker immediately after adding it. |
| Change WRAPPERS keys to the descriptive labels | Keeps wrapper output unchanged | Breaks existing --competitors fork-fr-regressor fork-nr-metric usage and docs | Rejected. The public CLI keys are already documented. |

Consequences

  • Positive: compare.py can aggregate the fork-side FR and NR wrappers instead of skipping them as invalid schema.
  • Positive: the tests now exercise the real shell wrappers at the schema boundary while still stubbing vmaf-tune and external binaries.
  • Negative: operators who read only the summary.competitor field get the stable key rather than the exact model version; future metadata can carry model IDs explicitly if needed.
  • Neutral / follow-ups: NR/MOS second-opinion feature materialisation should consume the same registry keys when it reuses external-bench wrapper output.

References

  • req: "well go on i guess we have enough backlog..."
  • ADR-0368 — wrapper-only external benchmark harness.
  • Research-0118 — schema validation at the wrapper boundary.
  • ADR-0650 — signal-mix audit identifies NR/MOS second opinions as a missing signal family.