ADR-0501: vmaf-tune ladder cross-resolution scoring + report degraded flag¶
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, claude
- Tags: vmaf-tune, ladder, corpus, report, vmaf-cli, docs
Context¶
The BBB end-to-end v4 probe (after PR #1257 / ADR-0499 closed the v3 cluster) surfaced three findings:
- V4-A —
vmaf --backend vulkanagainst a Vulkan-less runtime prints the ADR-0498 strict-mode refusal and (per the v4 bug log) was reported as exiting0. A re-run on the current dev-mcp binary (built 2026-05-17) shows the propagation chaininit_gpu_backends() -> ret = -1 -> main() -> exit(255)is in fact working, but the contract had no regression test pinning it. The risk was a future refactor ofmain()'s ret-chain silently re-introducing the failure. - V4-B —
vmaf-tune ladder --resolutions 1920x1080,1280x720on the BBB 1080p mp4 produced a single rung withvmaf=21.3 @ 23 Mbps 720p, which is implausible (compareat the same CRF returns ~93). Root cause:vmaftune.corpus._maybe_decode_reference(added in ADR-0499) decodes the reference at the source's native geometry, but the rung'sScoreRequestthen tells the libvmaf CLI to read both legs at the rung target (1280x720). The binary silently mis-parses the 1080p planar bytes as a 720p frame and emits garbage VMAF — a value low enough that the post-hull select collapses the grid to a single rendition. The encoded distorted leg was already scaled correctly via the-vf scale=W:Hfilter added in ADR-0498; the reference leg never got the matching treatment. Separately, the JSON descriptor only emittedrenditions[]— thereportconsumer'sladder_samplescounter always read0because nosamples[]array was wired. - V4-C —
vmaf-tune reportaggregatedok=trueonly when every codec row wasok=true. The ADR-0498 bisect discriminator marks rows for unavailable encoders aserror="encoder unavailable (NAME): …"— an infrastructure gap, not a quality failure. Conflating the two flips the report took=falsepurely because the dev-mcp ffmpeg ships withoutlibsvtav1, masking real regressions in the same field.
Decision¶
- V4-A pin: add a Python integration test under
tools/vmaf-tune/tests/that runsvmaf --backend vulkanagainst the Netflix golden-checkerboard YUVs and asserts a non-zero exit byte plus the refusal message in stderr. The test skips when the binary isn't on$PATHor when the host has a working Vulkan device (the refusal path doesn't engage on real hardware). - V4-B: extend
_maybe_decode_reference(and its_decode_source_to_yuvbuilding block) with optionaltarget_width/target_heightkwargs that append-vf scale=W:Hto the ffmpeg decode argv and embed the dims in the sidecar filename so multi-rung sweeps don't collide. Wireiter_rowsto pass the rung target wheneverCorpusJob.src_width/src_heightdiffers fromwidth/height. Extendemit_manifest/_emit_jsonto accept asamples=sequence and emit a top-levelsamples[]array in thevmaf-tune-ladder/v1JSON; thread the pre-hull sampler cloud through frombuild_and_emit. - V4-C: split row-level failures in
_run_reportinto "real failures" (anything whoseerrordoes not start with"encoder unavailable") and "unavailable rows".okgates only on real failures + at-least-one-success; a new top-leveldegradedfield flipstruewhenever any unavailable row is present.codec_rows_unavailableis added so dashboards can show the count without parsing every row'serror.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| V4-B: decode reference at native geometry once and pass dims to libvmaf separately | One decode for all rungs; less I/O | Requires a new libvmaf CLI flag pair (--ref_width / --ref_height) — the existing single --width / --height apply to both legs; would gate on a separate libvmaf PR and break Netflix-golden CLI parity | Per-rung scale matches what we already do for the distorted leg; symmetric, no new flag |
V4-B: emit samples only when caller passes a non-empty list | Smaller JSON when uncalled | KeyError on downstream report consumers; inconsistent schema | Always-emit (possibly empty) array; stable schema beats marginal byte savings |
V4-C: ignore unavailable rows entirely (drop from codec_rows_failed) | Cleaner ok=true semantics | Hides the infrastructure gap from dashboards; operators can't see which encoder is missing | New degraded flag + dedicated counter — keeps the signal, removes the false negative |
V4-C: classify on a row-level error_kind enum instead of a string prefix | Stable taxonomy | Requires schema migration of every existing compare JSON consumer | String prefix is a pragmatic 1-PR fix; an enum migration can land later if more error kinds appear |
Consequences¶
- Positive:
vmaf-tune ladderproduces plausible VMAF on cross-resolution rungs against container sources.reportconsumers can render the Pareto cloud (samples) and a single PR no longer flipsok=falsepurely because an optional encoder isn't built into the local ffmpeg. - Negative: One extra ffmpeg decode per resolution (not per cell — the per-rung sidecar is reused across the CRF sweep) when the source dims differ from the rung target. For a typical 5-rung ladder against a 1080p source that's 4 extra decodes; dominated by the encode budget at < 1 % of wall time.
- Neutral / follow-ups:
test_bbb_e2e_v4_bug_cluster.pypins one regression per finding. The existing V2 cross-res scale-filter test was extended to also assert the reference-side scale invocation (the V2 fix didn't realise the reference was a parallel issue).docs/usage/vmaf-tune.mddocuments the cross-res reference decode under "Container and Y4M sources" and the newsamples[]/degradedoutputs under the JSON descriptor section.
References¶
- BBB e2e v4 bug log:
/tmp/bbb_e2e_bugs_v4.md(gitignored) - Predecessor (v3): ADR-0499
- Predecessor (v2): ADR-0498
- Predecessor (v1): ADR-0497
- Container build policy: ADR-0496
- Reference-decode helper:
tools/vmaf-tune/src/vmaftune/corpus.py:_maybe_decode_reference - Report aggregation:
tools/vmaf-tune/src/vmaftune/cli.py:_run_report - JSON emitter:
tools/vmaf-tune/src/vmaftune/ladder.py:_emit_json - Source:
req(direct user direction in the agent dispatch briefing on 2026-05-18 — paraphrased: "fix the v4 bug cluster: vmaf vulkan strict-mode must exit non-zero; ladder grid must not collapse and must produce plausible VMAF + a samples array; report must distinguish encoder-unavailable from encode-failed").