ADR-0368: External-competitor benchmark harness (wrapper-only architecture)
- Status: Accepted
- Status update 2026-05-15: implemented; `tools/external-bench/` directory present with `run.sh` wrappers and `compare.py` orchestrator; wrapper-only architecture active.
- Date: 2026-05-08
- Deciders: Lusoris, Claude
- Tags: ai, testing, license, tooling, fork-local
Context
The fork ships two perceptual-quality predictors that warrant side-by-side comparison against external open-source competitors:
- `fr_regressor_v2_ensemble_v1`: a full-reference VMAF regressor ensemble (5 seeds; ADR-0319 / ADR-0321).
- `nr_metric_v1`: a no-reference MOS predictor.
Two competitors are publicly available and worth comparing against on the same corpus:
- Synamedia/Quortex `x264-pVMAF` (github.com/quortex/x264-pVMAF, November 2024): predicted-VMAF estimator integrated into a forked `x264` encoder. Upstream licence: GPL-2.0.
- DOVER-Mobile: no-reference video quality predictor distributed as a Python package. Upstream licence: Apache-2.0 (code) plus CC-BY-NC-SA 4.0 (weights).
The fork is BSD-3-Clause-Plus-Patent. GPL-2.0 code cannot be combined with permissively licensed code in a redistributed work unless the combined work is distributed under GPL-2.0. Vendoring x264-pVMAF source into the fork would therefore force the entire fork to GPL-2.0, a non-starter given the upstream Netflix/vmaf licence and every downstream consumer (FFmpeg filters, third-party tools, MCP server) the fork ships for.
Yet running a side-by-side benchmark against x264-pVMAF is the only way to substantiate claims of relative accuracy / runtime — and the user explicitly asked for that comparison.
Decision
We will land a benchmark harness at tools/external-bench/ under a wrapper-only architecture:
- Each external competitor lives in `tools/external-bench/<competitor>/run.sh`, a thin bash wrapper that invokes a user-installed external binary (path via env var) and re-shapes its output into a normalised JSON schema.
- The fork-side predictors get the same wrapper shape (`fork-fr-regressor`, `fork-nr-metric`) so `compare.py` can aggregate all four into a single comparison table.
- `compare.py` is the orchestrator: it discovers a corpus (BVI-DVC test fold + Netflix Public Drop by default), runs each wrapper across every (ref, dis) pair, aggregates PLCC / SROCC / RMSE / runtime, and renders a fixed-width comparison table.
- Tests under `tools/external-bench/tests/test_compare.py` stub `subprocess.run` so the suite never depends on external binaries being installed.
The fork redistributes only the wrapper scripts, the comparison logic, and the documentation. No GPL'd code is vendored, linked, or copied into this fork. Side-by-side benchmarking is permissible because the harness invokes the external binary as a subprocess and reads its (factual) numerical output, the same posture as running `/usr/bin/ffmpeg` from a BSD-licensed test harness.
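The aggregation step can be illustrated with a minimal, standard-library-only sketch. It is not the code in `compare.py`; the function names and the shape of the `summarise` result are assumptions made for illustration. It only shows how PLCC, SROCC, RMSE, and mean runtime might be computed from per-clip scores once every wrapper has reported.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson linear correlation coefficient (PLCC)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """SROCC: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(xs), average_ranks(ys))


def rmse(xs, ys):
    """Root-mean-square error between predictions and reference scores."""
    return math.sqrt(mean((x - y) ** 2 for x, y in zip(xs, ys)))


def summarise(predicted, reference, runtimes_ms):
    """Hypothetical per-wrapper summary row for the comparison table."""
    return {
        "plcc": pearson(predicted, reference),
        "srocc": spearman(predicted, reference),
        "rmse": rmse(predicted, reference),
        "runtime_ms": mean(runtimes_ms),
    }
```

Keeping the metrics dependency-free is one possible design choice; a real harness could equally delegate to SciPy, at the cost of an extra install for operators.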
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Vendor x264-pVMAF source | Reproducible build; no operator install step | Relicenses the entire fork to GPL-2.0; breaks every downstream consumer (FFmpeg filter integration, MCP server, BSD-licensed tiny-AI surfaces); upstream Netflix/vmaf licence terms forbid it | Existential licence break: fork loses its permissive posture and every downstream relicenses by association |
| Skip x264-pVMAF, compare only against DOVER-Mobile | No GPL question | User explicitly asked for the Synamedia comparison; benchmarking against DOVER-Mobile alone leaves the most directly competitive predictor unmeasured | Drops the most informative comparison; the GPL boundary is solvable without dropping the comparison |
| Wrapper-only architecture (this ADR) | Zero GPL'd code in the fork; operator installs external binary themselves; same wrapper shape works for any future competitor (copyleft or not); tests stub the subprocess so CI never depends on external installs | Operator must install binaries themselves; CLI shapes drift across upstream versions and the wrapper's schema-shim has to track them | Chosen. The boundary cost (a documented env-var per competitor) is small; the licence safety is total |
| Build a separate GPL-licensed sibling repo | Vendoring is then legal in that repo | Doubles the maintenance surface for one comparison; bench harness has to live somewhere and "out-of-tree" means it rots; reviewers cannot easily inspect the apples-to-apples invocation | Operational cost outweighs any advantage over wrapper-only |
Consequences
Positive
- The fork stays BSD-3-Clause-Plus-Patent. Every downstream consumer (FFmpeg filter, MCP server, tiny-AI surfaces) is unaffected.
- Adding a new external competitor (Netflix VMAF NEG, GMSD, ITU-R BT.500-style models, future GPL'd predictors) follows the same recipe: drop in `run.sh`, register in `WRAPPERS`, add a stubbed test.
- The harness ships with deterministic stubbed tests (`tools/external-bench/tests/test_compare.py`, 7 passing) so CI can verify schema + aggregation regressions without external installs.
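The recipe above (register a wrapper, then test it against a stubbed subprocess) might look roughly like the following sketch. Everything specific here is an assumption for illustration: the dict shape of `WRAPPERS`, the `run_wrapper` helper, the positional `ref`/`dis` arguments passed to `run.sh`, and the JSON field names are hypothetical, not taken from the actual harness.

```python
import json
import os
import subprocess
from pathlib import Path
from unittest import mock

# Hypothetical registry: wrapper name -> run.sh path relative to the harness root.
# Only the two fork-side names appear in this ADR; a new competitor would add a row.
WRAPPERS = {
    "fork-fr-regressor": "fork-fr-regressor/run.sh",
    "fork-nr-metric": "fork-nr-metric/run.sh",
}


def run_wrapper(name, ref, dis, root=Path("tools/external-bench")):
    """Invoke one wrapper as a subprocess and parse the JSON it prints on stdout."""
    script = root / WRAPPERS[name]
    proc = subprocess.run(
        ["bash", str(script), str(ref), str(dis)],  # argument order is assumed
        capture_output=True,
        text=True,
        check=True,
        env=os.environ.copy(),  # binary locations reach the wrapper via env vars
    )
    return json.loads(proc.stdout)


def fake_run(cmd, **kwargs):
    """Stand-in for subprocess.run that returns canned JSON without running anything."""
    payload = {"competitor": "fork-nr-metric", "score": 3.9, "runtime_ms": 12.0}
    return subprocess.CompletedProcess(cmd, 0, stdout=json.dumps(payload), stderr="")


# With subprocess.run patched, no wrapper script or external binary is needed.
with mock.patch("subprocess.run", side_effect=fake_run) as stub:
    result = run_wrapper("fork-nr-metric", "ref.yuv", "dis.yuv")
```

Because the stub replaces `subprocess.run` itself, a test written this way exercises the JSON parsing and registry lookup while staying independent of whatever is installed on the CI runner.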
Negative
- Operators have to install the external binaries themselves (`pipx install dover-mobile`; `git clone … && make` for `x264-pVMAF`). Documented in `tools/external-bench/README.md`.
- Wrapper schema-shims may drift across upstream versions of the external binaries. Mitigation: each `run.sh` has a single Python heredoc that does the JSON re-shape, so an upstream CLI break needs at most a one-file fix.
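As an illustration of why the shim is a one-file fix, here is a minimal sketch of the kind of re-shape logic such a heredoc could contain. The raw `score = <number>` line format is invented for this example and does not describe the output of x264-pVMAF, DOVER-Mobile, or any other real tool; the normalised field names are likewise assumptions.

```python
import re

# Hypothetical raw-output pattern. Each real tool needs its own pattern, and this
# is the only piece that should change when an upstream CLI changes its output.
SCORE_PATTERN = re.compile(
    r"^score\s*=\s*(?P<score>-?\d+(?:\.\d+)?)\s*$", re.MULTILINE
)


def reshape(raw_text, competitor, runtime_ms):
    """Map a tool's raw text output onto the (assumed) normalised schema."""
    match = SCORE_PATTERN.search(raw_text)
    if match is None:
        # Fail loudly so upstream drift is noticed instead of producing empty rows.
        raise ValueError(f"{competitor}: no score line found in tool output")
    return {
        "competitor": competitor,
        "score": float(match.group("score")),
        "runtime_ms": float(runtime_ms),
    }
```

In a wrapper, the result would typically be serialised with `json.dumps` and printed to stdout for the orchestrator to read.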
Neutral / follow-ups
- The harness's BVI-DVC corpus default assumes the operator has the archive locally per ADR-0310. Failure mode is a clear stderr message naming the expected paths, not a silent zero-row run.
- A future ADR may layer a GPU-runtime gate on the comparison (e.g. require the runtime metric to come from the same backend as `/cross-backend-diff`), but that is out of scope for this PR.
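The fail-loudly behaviour described for a missing corpus could be sketched as below. The `ref/` and `dis/` directory layout, the `.yuv` extension, the exit code, and the `discover_pairs` name are all assumptions for illustration; the real layout is whatever ADR-0310 defines.

```python
import sys
from pathlib import Path


def discover_pairs(corpus_root):
    """Return (ref, dis) path pairs, or exit with a message naming the missing paths."""
    root = Path(corpus_root)
    ref_dir, dis_dir = root / "ref", root / "dis"  # hypothetical layout
    missing = [str(p) for p in (ref_dir, dis_dir) if not p.is_dir()]
    if missing:
        print(
            "corpus not found; expected directories: " + ", ".join(missing),
            file=sys.stderr,
        )
        raise SystemExit(2)
    # Pair each reference clip with the distorted clip of the same file name.
    pairs = [
        (ref, dis_dir / ref.name)
        for ref in sorted(ref_dir.glob("*.yuv"))
        if (dis_dir / ref.name).is_file()
    ]
    if not pairs:
        # An empty corpus is treated as an error rather than a silent zero-row run.
        print(f"no (ref, dis) pairs found under {root}", file=sys.stderr)
        raise SystemExit(2)
    return pairs
```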
References
- req: user direction to "build a side-by-side benchmark harness comparing the fork's fr_regressor_v2_ensemble and nr_metric_v1 against two external open-source competitors: Synamedia/Quortex x264-pVMAF (GPL-2.0 OSS, github.com/quortex/x264-pVMAF, Nov 2024) and DOVER-Mobile" with the explicit constraint "x264-pVMAF is GPL-2.0. The fork is BSD-3-Clause-Plus-Patent. The harness MUST NOT vendor, link, or copy any code from x264-pVMAF."
- ADR-0024: the Netflix golden numerical-correctness gate against which the fork's predictors are ultimately calibrated.
- ADR-0108: deep-dive deliverables rule covering this PR.
- ADR-0310: BVI-DVC corpus ingestion that this harness defaults to.
- ADR-0319 / ADR-0321: fork-side `fr_regressor_v2_ensemble_v1` lineage.
- Synamedia/Quortex x264-pVMAF
- DOVER: DOVER / DOVER-Mobile upstream.