ADR-0598: vmaf-tune workdir relocation — disk-space preflight + VMAFTUNE_WORKDIR env var
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, Claude (Sonnet 4.6)
- Tags: vmaf-tune, bugfix, cli, container, workspace
Context
`vmaf-tune compare` (and `ladder`, `tune-per-shot`) decodes the reference source to raw YUV before the bisect loop so each encode can be scored against frame-aligned raw content. For a 634-second 1080p60 BBB source decoded to `yuv420p`, the raw file is approximately 118 GB (1920 × 1080 × 1.5 bytes per pixel × 38 040 frames). The dev-mcp container's `/tmp` is an 8 GB tmpfs, so `ffmpeg` exits with return code 228 (the unsigned 8-bit representation of −28 = −ENOSPC) as soon as the tmpfs fills. All 44 rows in the BBB compare run failed with this opaque exit code; no human-readable diagnostic was emitted, and the error was indistinguishable from a codec initialization failure.
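The size and exit-code figures above can be checked with a few lines of arithmetic. This is only a sanity check of the numbers quoted in this ADR; the variable names are illustrative and do not come from `bisect.py`.

```python
import errno

# Raw yuv420p size for the BBB source described above.
width, height = 1920, 1080
bytes_per_pixel = 1.5      # yuv420p: full-resolution luma + 2x2-subsampled chroma, 8-bit
frames = 634 * 60          # 634 s at 60 fps

raw_bytes = int(width * height * bytes_per_pixel * frames)

# A process exit status keeps only the low 8 bits, so -ENOSPC wraps around.
exit_status = (-errno.ENOSPC) & 0xFF

print(f"{frames} frames, {raw_bytes / 1e9:.1f} GB raw, exit status {exit_status}")
```

The raw size is roughly fifteen times the 8 GB tmpfs, so the decode could never have completed under `/tmp`.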
Two independent problems required fixing:
- No disk-space guard. `_encode_and_score` launched `ffmpeg` unconditionally; the first signal that disk space was insufficient came from `ffmpeg`'s exit code after several seconds of wasted I/O.
- No routing to a larger volume. Even if the caller knew to set a different workdir, `bisect.py` had no API knob and all `TemporaryDirectory` calls were rooted at the OS default (`/tmp`).
Decision
Four orthogonal changes land together:
- Disk-space preflight. Before invoking the reference decode, `_estimate_yuv_bytes` computes an upper-bound byte count (`width × height × bpp × max(frames, 1)`, where `bpp` is the bytes-per-pixel value looked up from a `pix_fmt` table that conservatively defaults to `1.5`). If `shutil.disk_usage(workdir).free < estimated_bytes × 1.1`, the bisect returns `BisectResult(ok=False, error=<human-readable message>)` immediately, without touching `ffmpeg`. The error string names both the `--workdir` CLI flag and the `VMAFTUNE_WORKDIR` environment variable so the operator has an actionable remedy.
- `VMAFTUNE_WORKDIR` environment variable. `_workdir_parent()` reads `os.environ.get("VMAFTUNE_WORKDIR", "")` and returns a `Path` when set. `bisect_target_vmaf` passes the result as the `dir=` argument to `tempfile.TemporaryDirectory`, routing all scratch I/O to the operator-specified volume. Resolution order: explicit `workdir=` kwarg > `VMAFTUNE_WORKDIR` env > OS default.
- `--workdir PATH` CLI flag. Added to the `compare`, `tune-per-shot`, and `ladder` subparsers. The flag's value is forwarded to `bisect_target_vmaf` as the `workdir=` kwarg.
- Container default. `dev/Containerfile` sets `ENV VMAFTUNE_WORKDIR=/probes/vmaftune-work`; the `/probes` bind-mount (`source: ./.workingdir/dev-mcp-probes`, `target: /probes`) provides approximately 435 GB of writable space on the host. `dev/scripts/dev-mcp-entrypoint.sh` creates the directory at startup with `mkdir -p`.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Raise ENOSPC to a Python exception | Cleaner caller semantics | Breaks the BisectResult contract; all callers check result.ok | The structured error return is the existing convention |
| Catch ffmpeg rc=228 and retry on a different path | No API change required | Fragile: rc=228 is not exclusively ENOSPC; also requires a retry path with bookkeeping | Preflight is more reliable and fires before wasting I/O |
| Hard-code /probes/vmaftune-work in bisect.py | Zero configuration burden | Not portable outside the dev container | Env var gives operators control without code changes |
| Use XDG_RUNTIME_DIR or TMPDIR instead of a new env var | Familiar convention | Changing TMPDIR redirects all temp I/O (Python, ffmpeg, system tools), which is too broad | Dedicated VMAFTUNE_WORKDIR is scoped to vmaf-tune only |
Consequences
- Positive:
  - BBB 634 s 1080p60 compare runs no longer silently fail with rc=228. The preflight fires immediately with a human-readable error and the `--workdir` / `VMAFTUNE_WORKDIR` remedy.
  - Operators on space-constrained hosts can route scratch I/O to any mounted volume without modifying code.
  - The container environment is zero-configuration: `VMAFTUNE_WORKDIR` is pre-set to the 435 GB `/probes` bind-mount.
- Negative:
  - Disk-space estimation is conservative (worst-case upper bound for the given `pix_fmt`). With a 118 GB estimate the preflight requires about 130 GB free (118 GB × 1.1), so a host with 120–130 GB free will be refused even though the raw YUV itself would fit. The 10 % headroom margin is intentional.
- Neutral / follow-ups:
  - `_estimate_yuv_bytes`, `_check_disk_space`, and `_workdir_parent` are private helpers (not in `__all__`). Tests import them directly for unit coverage; they are not part of the stable public API.
  - The `/probes` bind-mount's 435 GB capacity is shared across all vmaf-tune scratch runs; parallel BBB runs can collectively exceed it during a long-running dev session. No automatic cleanup is implemented; operators should periodically purge `$VMAFTUNE_WORKDIR`.
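Since no automatic cleanup exists, the periodic purge mentioned in the last follow-up could look something like the sketch below. This is not part of the change described in this ADR: the `purge_stale` function, its 24-hour default, and the dry-run behaviour are hypothetical, and it assumes every top-level entry under `$VMAFTUNE_WORKDIR` is disposable scratch data.

```python
import os
import shutil
import time
from pathlib import Path
from typing import List


def purge_stale(workdir: Path, max_age_hours: float = 24.0,
                dry_run: bool = True) -> List[Path]:
    """List top-level entries older than the cutoff; delete them unless dry_run."""
    cutoff = time.time() - max_age_hours * 3600
    stale = sorted(p for p in workdir.iterdir() if p.lstat().st_mtime < cutoff)
    if not dry_run:
        for path in stale:
            if path.is_dir() and not path.is_symlink():
                shutil.rmtree(path, ignore_errors=True)
            else:
                path.unlink(missing_ok=True)
    return stale


if __name__ == "__main__":
    # Dry run by default: only report what would be removed.
    root = os.environ.get("VMAFTUNE_WORKDIR", "")
    if root and Path(root).is_dir():
        for path in purge_stale(Path(root)):
            print("stale:", path)
```

Defaulting to a dry run keeps the sketch safe to try while a bisect is in flight; an age cutoff is only a heuristic and does not detect scratch directories that are still in use by a slow run.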
References
- ADR-0498 (BBB e2e v2 bug cluster, container-source decode root cause): `0498-vmaf-tune-bbb-e2e-v2-bug-cluster.md`
- Source: `req`, user direction (2026-05-18, paraphrased): the BBB e2e v13 compare run failed all 44 rows with rc=228 (ENOSPC). The full 634 s 1080p60 source decodes to raw YUV of approximately 118 GB, but the dev-mcp container's `/tmp` is an 8 GB tmpfs. Fix: add a disk-space preflight, honour `VMAFTUNE_WORKDIR` for workdir routing, add `--workdir PATH` to `compare`, `tune-per-shot`, and `ladder`, and set `VMAFTUNE_WORKDIR=/probes/vmaftune-work` in the container.