ADR-0509: vmaf-tune compare auto-probes container-source framerate / duration¶
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, claude
- Tags: vmaf-tune, compare, bisect, encode, vmaf-cli
Context¶
The BBB end-to-end v7 probe surfaced a sister bug to ADR-0505: the vmaf-tune compare CLI silently returned catastrophically wrong VMAF scores when --src was a container (mp4 / mkv / mov / ...). Reproducer (against the dev-mcp BBB MP4 at native 60 fps):
vmaf-tune compare \
--src /workspace/.corpus/bbb_e2e/bbb_sunflower_1080p_60fps_normal.mp4 \
--width 1920 --height 1080 --target-vmaf 92 \
--encoders libx264,libx265 --duration 5 --sample-clip-seconds 3 \
--max-iterations 3 --score-backend cuda --format json
Both encoders converged to ok=false: libx264 "closest miss VMAF=90.43 at CRF=6" (CRF 6 at 1080p is near-lossless and must score
= 98);
libx265capped at VMAF=90.05 at CRF=18. The numbers are not just low — they are physically impossible for the encode geometry, mirroring the ADR-0505 symptom shape (uniformly bogus VMAF regardless of CRF on container sources).
Root cause: the compare CLI threaded --framerate (argparse default 24.0) and --duration (default 0.0) into make_bisect_predicate verbatim. The per-iteration frame_skip_ref / frame_cnt (computed in bisect._sample_clip_window from the user-supplied framerate) no longer indexes the same source frames the encoder pulled — the encoder's input-side -ss 1.0 -t 3.0 clips a 3-second window of the container, but ffmpeg then decodes both the reference and the distorted MKV back to raw YUV at the container's native rate (60 fps), while vmaf --frame_skip_ref 24 (= 1.0 s * 24 fps) skips only 24 frames = 0.4 s of native-rate YUV in the reference, comparing misaligned content frame-by-frame. The result: a VMAF collapse to the 4 - 90 band regardless of CRF.
The ADR-0505 / V5-2 fix plumbed source_is_container=True through CorpusJob -> EncodeRequest -> build_ffmpeg_command for the ladder path. The compare path's encode plumbing was already correct (bisect._encode_and_score already builds the EncodeRequest with source_is_container=src_is_container); the remaining defect was that the per-iteration score-side frame-window alignment used the user-supplied --framerate which silently disagreed with the container's native rate.
Decision¶
We will auto-probe container sources in _run_compare and substitute the probed framerate / duration when the user left those flags at their argparse defaults. Explicit user values still win, with a one- line stderr warning on explicit mismatch.
Implementation:
- Add
_TrackedDefaultAction(argparse custom action) and a post-parse_args_stamp_tracked_default_sentinelspass so that_run_comparecan distinguish "user explicitly passed--framerate 24" from "argparse default 24" — the inverse pattern is needed because argparse never invokesAction.__call__when the user omits the flag. - Wire
_TrackedDefaultActiononto the compare subparser's--framerateand--durationflags. - Add
_resolve_compare_source_geometry(src, ..., probe_fn=...)that callsvmaftune.report.probe_sourceon a container source, substitutes probed framerate / duration when the user left them at defaults, fills in missing--width/--heightfrom the probe, and emits a stderr warning when an explicit user--frameratedisagrees with the probe by more than 0.01 fps. _run_comparecalls the helper before constructing themake_bisect_predicateclosure; the existing--width/--heightmandatory-check moves to the post-probe geometry.
The helper is pure (probe and warn-stream are injectable) so the regression test stubs both without shelling out to ffprobe.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Auto-probe + override user values when they mismatch the probe | Single source of truth; eliminates user-error footgun entirely | Breaks legitimate subsampling use-cases ("score this 60fps source as 30fps for a quick smoke run"); silent override is exactly the class of behaviour ADR-0505 was rolled back for | Rejected: probe wins by default but explicit user value must take precedence; warn on mismatch instead |
Require user to always pass --framerate for container sources | Forces operator awareness; no silent probe path | Hostile UX — ffprobe is right there and the answer is unambiguous for 99% of sources; doesn't actually fix the bug for callers who pass the wrong rate | Rejected: the default path is the dominant one; usability matters |
Change --framerate default to None and require it for raw YUV but auto-fill for containers | Cleaner sentinel semantics (no marker attribute needed) | Breaks every existing caller that relies on the 24.0 default for raw YUV; existing test test_cli_compare_binds_real_bisect_predicate already pins the 24.0 behaviour | Rejected: backward-compat for raw YUV trumps sentinel cleanliness |
Probe inside make_bisect_predicate rather than _run_compare | Centralises the fix at the bisect entry point so all callers (not just CLI) get it | make_bisect_predicate is a pure-Python adapter consumed by programmatic callers who may already have their geometry pinned; an unsolicited probe inside it would surprise tests and library users | Rejected: CLI is the right layer for an ffprobe-touching helper |
Consequences¶
- Positive:
vmaf-tune compare --src container.mp4 ...now returns sensible VMAFs out of the box on any container the system'sffprobeunderstands. The reproducer above now converges at libx264 CRF=26 (VMAF=94.9, 528 kbps) and libx265 CRF=32 (VMAF=93.6, 365 kbps).- Closes the compare half of the BBB e2e v7 cluster; sister to ADR-0505 (ladder) and ADR-0506 (V6-1 ladder duration clamp) / ADR-0508 (pass-1 stats duration clamp).
- Negative:
- Adds one
ffprobecall percompareinvocation against a container source. Cost is negligible (< 100 ms) compared to the per-iteration encode + score. - The
_TrackedDefaultActionsentinel pattern is mildly subtle — adding more--<flag>to the tracked list requires editing_stamp_tracked_default_sentinelstoo. The hardcoded tuple is intentional (cheap loop) but a future generalisation that introspectsparser._actionswould remove the duplication. - Neutral / follow-ups:
- The same auto-probe pattern could be lifted to
vmaf-tune ladderandvmaf-tune tune-per-shotfor symmetry — neither currently has the silent-misalignment failure in the same shape (ladder relies onCorpusJob.framerateset explicitly by the corpus builder), but a future container-default rollout might want the same helper. - The regression test stubs
vmaftune.report.probe_sourceviamonkeypatch.setattr; the helper is intentionallyprobe_fn-injectable to keep test coverage on the helper itself independent of the CLI wiring.
References¶
- ADR-0505: ladder container-source encode + sample-cloud dedup (sibling fix; ADR-0509 closes the compare half of the same bug class).
- ADR-0506: V6-1 duration-clamp on encoder side.
- ADR-0508: V8-A pass-1 stats duration-clamp.
- ADR-0501 / ADR-0498: prior BBB e2e clusters establishing the container-source / sample-window plumbing pattern.
- Source:
req(user-provided reproducer + root-cause hypothesis in the agent brief 2026-05-18).