ADR-0505: vmaf-tune ladder container-source encode + full per-CRF sample cloud¶
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, claude
- Tags: vmaf-tune, ladder, corpus, encode, vmaf-cli, docs
Context¶
The BBB end-to-end v5 probe (after PR #1258 / ADR-0501 closed the v4 cluster) surfaced three follow-ups that fall into two root causes:
- V5-1 —
vmaf --backend vulkanstrict-mode refusal was again reported as exiting0on the dev-mcp container. A live re-run against the post-ADR-0501 binary shows the propagation chaininit_gpu_backends() -> ret = -1 -> main() -> exit(255)is intact, so the behaviour is correct — but the V4-A integration test that pinned it was gated onshutil.which("vmaf"). On every developer host that did not install the binary onto$PATH(including the dev-mcp container, where the binary lives at/usr/local/bin/vmafbut isn't reached through a relative invocation in the test's pytest harness) the gate silently disengaged. A future RC=0 regression would have shipped without the regression test firing. - V5-2 —
vmaf-tune ladderagainst a container source (bbb_sunflower_1080p_30fps_normal.mp4) produced VMAF in the 4-9 band at ~50 Mbps regardless of CRF — a uniformly-bogus encode. Root cause:corpus.iter_rowsinstantiatesEncodeRequestwithout ever inspecting whether the source is a container; the encode argv builder (encode.build_ffmpeg_command) therefore emits the raw-video framing flags (-f rawvideo -pix_fmt yuv420p -s WxH -r FR -i src.mp4) against every source. FFmpeg reads the MP4's compressed payload as planar YUV pixels — the resulting "encode" is bogus regardless of CRF, and the per-CRF cloud is uniformly low quality. ADR-0501's reference-side scale fix corrected the score-side geometry but did nothing for the encode-side container detection. - V5-3 — The
samples[]array in the v1 JSON descriptor double-listed every rendition. Root cause: ADR-0501's emit path derived the sample cloud fromladder.points, which contains one row per(resolution, target_vmaf)cell. The CLI ships--crf-sweepso the sampler scores a multi-CRF grid per cell, but only thepick_target_vmafwinner survived intoladder.points. Two target-VMAFs that converged on the same CRF emitted two identical sample rows; every non-winning CRF was silently dropped (the V5-2 symptom "only top-CRF kept per resolution").
The V5-2 garbage encode is the dominant failure — even with the V5-3 sample cloud fixed, the VMAF would still read 4-9 against the container source. The V5-1 test gate is independent.
Decision¶
-
V5-1 test hardening: rewrite the regression test under
tools/vmaf-tune/tests/test_bbb_e2e_v5_bug_cluster.pyto probe three reachable locations for thevmafbinary in order:$VMAF_BIN_FOR_TESTSenv override,shutil.which("vmaf"), then the canonical meson build pathbuild/tools/vmafwalked up from the test file. The test only skips when no binary is reachable, and the skip message names every probed location so CI debugging is a one-line task. The assertion shape (non-zero exit byte AND refusal stderr) is unchanged from V4-A. -
V5-2 container detection: in
corpus.iter_rows, derivesource_is_container = source.suffix.lower() not in _VMAF_RAW_SUFFIXESand set the matching field on theEncodeRequest. When the source is a container the rung-target scale filter is always appended (-vf scale=W:H), giving ffmpeg an explicit rendition geometry regardless of the source's native resolution. Raw-YUV sources keep the legacy-f rawvideo -pix_fmt … -s WxHframing; the dedicatedtest_iter_rows_keeps_raw_yuv_source_uncontainerisedguard pins the back-compat invariant. -
V5-3 full per-CRF sample cloud: extend
make_default_samplerand the underlying_default_samplerwith an optionalcloud_sink: list[LadderPoint] | Nonekwarg. When wired in, the sampler appends every successfully-scored CRF row fromiter_rowsinto the sink before thepick_target_vmafcollapse.build_and_emitgains anextra_sampleskwarg that, when present, supersedes the per-targetladder.pointscloud as the source of the JSONsamplesarray. The emit step calls a new_dedup_sampleshelper that drops(width, height, crf)repeats so the array is stable across sampler quirks (e.g. two targets that select the same CRF). The CLI's_run_ladderconstructs the sink locally and threads it through both call sites — the legacybuild_and_emitinvocation withoutextra_samplesstill benefits from the dedup pass so V5-3 is fixed end-to-end.
Alternatives considered¶
- V5-2: probe ffprobe per cell to determine container shape. Rejected: ffprobe is already invoked once per source for HDR detection (ADR-0300), but its output is needed before the encode argv is built. Switching from a cheap path-extension check to a probe wires an avoidable subprocess into every cell; the suffix check is correct for every input we ship and gives a caller-friendly error mode (a
.mp4named.yuvopts into raw framing, matching the convention incorpus._VMAF_RAW_SUFFIXES). - V5-3: collect the cloud post-hoc by re-reading the corpus JSONL. Rejected: the sampler writes the JSONL into a temp directory that's
tempfile.TemporaryDirectory-managed, so the file is gone by the timebuild_and_emitruns. Persisting it would require a public emit-side dependency on the tempdir's lifetime, plus a re-parse cost; the sink list is a single-shot reference that costs O(rows) memory and zero I/O. - V5-3: change
pick_target_vmafto return the full sweep. Rejected:RecommendResult.rowis the bisect contract used byvmaftune.bisectand downstream consumers; widening it to a list would break ABI for every caller. The sink kwarg is opt-in and confined to the ladder pipeline.
Consequences¶
- Positive:
vmaf-tune ladderagainst container sources now produces plausible VMAF (~85-95 on BBB 1080p, matching the single-sourcecompareruns).- JSON
samples[]carries every encoded CRF point per resolution, exactly once. Downstream consumers (vmaf-tune reportPareto overlay) see the full Pareto cloud instead of the per-target subset. - V5-1 regression test fires on every developer host that has built the binary, not just hosts where it landed on
$PATH. - Negative:
- The container-source path adds a
-vf scale=W:Hfilter to every encode regardless of whether the source is already at the rung target. The scale is a no-op for native-geometry rungs, but ffmpeg's filter graph still allocates buffers; dominated by the encode budget at < 1 % of wall time. - Neutral / follow-ups:
test_bbb_e2e_v5_bug_cluster.pypins one regression per finding plus an opt-in end-to-end docker probe (test_ladder_against_bbb_container_yields_plausible_vmaf). The V4test_build_and_emit_threads_samples_into_jsonis updated to reflect the new dedup'd sample-cloud invariant (len(samples) == 2for the all-same-CRF stub, not 4) — not a weakening, it's the V5-3 fix made testable on the v4 fixture.docs/usage/vmaf-tune.mddocuments the container-source encode behaviour and the full-cloudsamples[]semantics.
References¶
- BBB e2e v5 bug log:
/tmp/bbb_e2e_bugs_v5.md(gitignored) - Predecessor (v4): ADR-0501
- Predecessor (v3): ADR-0499
- Predecessor (v2): ADR-0498
- Encode driver:
tools/vmaf-tune/src/vmaftune/encode.py:build_ffmpeg_command - Corpus iter_rows:
tools/vmaf-tune/src/vmaftune/corpus.py:iter_rows - Ladder sampler + emit:
tools/vmaf-tune/src/vmaftune/ladder.py - CLI:
tools/vmaf-tune/src/vmaftune/cli.py:_run_ladder - Source:
req(direct user direction in the agent dispatch briefing on 2026-05-18 — paraphrased: "fix the v5 bug cluster: vmaf vulkan strict-mode test must hit a real binary; ladder cross-res VMAF must be plausible; samples array must contain every CRF row, not just the per-target picks, and must not duplicate.")