ADR-0495: MCP server probe-driven bug-fix cluster (2026-05-17)¶
- Status: Accepted
- Date: 2026-05-17
- Deciders: lusoris
- Tags: mcp, ai, testing, regression-recovery
Context¶
A scripted probe of the vmaf-dev-mcp-stdio container against vmaf 3.0.0 (driver at /tmp/mcp_probe.py, full triage at /tmp/mcp_probe_findings.md) surfaced five concrete defects in the MCP surface:
- Silent backend fallback.
vmaf_scorewithbackend="cuda"on a CPU-only build returned success with bit-identical CPU scores, no warning, nobackend_usedfield. Cross-backend parity tests built on top of MCP would silently report "CUDA == CPU bit-exact" while never running CUDA. - Schema enum drops
vulkan/hip/metal. Bothvmaf_scoreanddescribe_worst_framesrejected backends that the underlying binary and the server's own_BACKEND_DISABLEmap already supported. run_benchmarkreturnsexit_code=1with empty stdout AND stderr.testdata/bench_all.shaborts before flushing output underset -euo pipefail(typically when the vmaf binary is missing orsource setvars.shis non-zero). The MCP wrapper returned the success-shaped envelope with two empty strings, leaving the caller no way to diagnose the failure.ref == dis≠ 100. Scoring an 8-bit YUV against itself yields ~97.43 on 47/48 frames; per-feature integers are all 1.0. This is a property of thevmaf_v0.6.1model's negative-bias regressor — the trained polynomial does not saturate to 100 even when all feature inputs are perfect. Documented as a model artefact below; no code fix.vmaf_4k_v0.6.1on 576×324 returns 100 on every frame. The 4K model silently saturates on sub-1080p sources because its activation range was fit against 3840×2160 statistics; the MCP wrapper passed the bogus pool through with no hint.
Decision¶
We will:
- Probe the local vmaf binary's
--helpoutput at firstvmaf_scorecall (cached for the process lifetime) and refuse any caller-requested backend that is not advertised — instead of silently letting the binary fall back to CPU. The response now includesbackend_requestedandbackend_usedfields on every successful call so downstream parity tests can assert which backend actually ran. - Add
vulkan,hip,metalto thebackendenum of bothvmaf_scoreanddescribe_worst_frames. The_BACKEND_DISABLEmap already supported these — only the JSON-schema surface had drifted. - Extend the
run_benchmarkwrapper to detect the "non-zero exit + empty stdout + empty stderr" pattern and populate a meaningfulerrorfield with the most common root-cause shortlist (missing vmaf binary, oneAPI setvars, missing fixtures) plus thebash -xre-run hint. The benchmark script itself now exits2with a clear stderr message when the vmaf binary is missing, removing the most common silent-abort path. - Emit a
mismatched_model_warningfield in thevmaf_scoreresponse when the model's intended resolution preset (hd/4k/sd) disagrees with the source frame size's bucket. The classifier understands the in-tree models (vmaf_v0.6.1,vmaf_v0.6.1neg,vmaf_b_v0.6.3→hd; anything withvmaf_4k→4k) and stays silent on bespoke ONNX models we have no resolution metadata for. - Document Bug #4 as a model artefact rather than patching scores: the Netflix
vmaf_v0.6.1model is the golden-data ground truth (ADR-0024, CLAUDE.md §8) — clamping at the MCP layer would diverge the wrapper from the libvmaf CLI and from the Netflix assertions we are forbidden to modify.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Use --backend $name exclusive selector instead of the --no_* flag set | Single canonical CLI surface, matches bench_all.sh documentation | Doesn't work on older fork builds that predate the selector (2026-04-28); breaks vmaf-dev-mcp-stdio containers built against older snapshots | Probe-and-refuse via --no_* works on every fork build that ever shipped |
Clamp vmaf_v0.6.1 scores >100 → 100, <0 → 0 at the MCP layer (Bug #4) | Caller intuition "identical YUV ⇒ 100" satisfied | Would diverge MCP scores from the libvmaf CLI and the Netflix golden gate — exactly the assertion class CLAUDE.md §8 forbids modifying | Documented as a model artefact; agents and humans who need a 100-on-identical pair should pick the vmaf_v0.6.1neg model (which clips) |
| Hard-fail on 4K-model + SD-source instead of warning (Bug #5) | Loud, unmissable | Breaks legitimate use cases (smoke-testing the 4K model on tiny fixtures during dev); too rigid a policy for a per-call tool | Warning is the right register — surfaces the foot-gun without removing capability |
Consequences¶
- Positive: agents using
vmaf_scorefor cross-backend parity work now get an actionable error instead of a false "CUDA == CPU" conclusion;backend_usedlets parity tests assert against the binary's actual selection;run_benchmarkfailures are debuggable; resolution-mismatch foot-guns surface a warning early. - Negative: callers that previously relied on the silent-fallback behaviour ("just give me CPU scores even when I asked for CUDA") must now explicitly pass
backend='auto'or'cpu'. This is a breaking change in the wire contract; documented indocs/mcp/tools.md. - Neutral / follow-ups: 5 regression tests added under
mcp-server/vmaf-mcp/tests/test_probe_findings_2026_05_17.py. Doc updates todocs/mcp/tools.md. Changelog fragments underchangelog.d/fixed/. Probe driver and findings file are process-local (/tmp/) and not in-tree.