ADR-0543: ADR-0498 enforcement hardening — distinct exit code + structured JSON error + per-feature gate¶
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, Claude (Opus 4.7)
- Tags:
cli,libvmaf,bugfix,backend,exit-code,extends-adr-0498
Context¶
ADR-0498 introduced the explicit-backend gate: when the user passes --backend NAME (anything other than auto / cpu), an init failure for the requested backend must surface as a hard error rather than silently falling back to CPU. The gate was wired into init_gpu_backends in core/tools/vmaf.c and returned -1 on failure, which propagates to main()'s return ret. POSIX truncates that return to an unsigned 8-bit exit status (255), which is technically non-zero.
A targeted v4-cluster regression test (ADR-0501 / V4-A) pinned the non-zero propagation for --backend vulkan, but three sharp edges remained that a downstream consumer (CI gate, vmaf-tune compare, MCP probe) had no clean way to handle:
-
Exit code 255 collides with generic error returns.
report_pooled_scores,vmaf_read_pictures, model loaders, and feature loaders all return non-zero on failure too. A wrapper that wants to distinguish "you asked for SYCL but it isn't there" from "scoring blew up mid-frame" has to grep stderr for the ADR-0498 marker string. That's brittle — a future i18n / log-format change silently breaks the predicate. -
The
--output X.jsonfile is empty (or 0-byte) on backend failure.vmaf_write_output_with_formatis only reached on the success path, so wheninit_gpu_backendsfails the file is never written (if stat-ed before the run) or left at whatever previous content sat at the path. Tooling that expects a JSON descriptor sees an empty file and crashes with a JSON-parse error rather than a structured signal. The empty file is especially misleading when the path is freshlytouch-ed by the wrapper before invocation: the wrapper sees a 0-byte JSON, can't decode anything, and reports a generic "vmaf failure" instead of "vmaf rejected your backend request". -
Per-feature pinning has no equivalent gate. Features named
*_cuda/*_sycl/*_vulkan/*_hip/*_metalare GPU-pinned variants of CPU features. If the user passes--feature motion_hipbut the matching backend isn't active in this run (not compiled in, not requested, or failed to init),vmaf_use_featuresilently falls back to the CPU twin. The resulting scores are bit-identical to the explicit-backend invocation, but were computed on the wrong silicon — exactly the bug ADR-0498 banned for--backend NAME. The asymmetry leaves a quiet path around the explicit-backend gate.
Decision¶
Extend ADR-0498 with three orthogonal hardening points, all landed in the same PR (fix/adr-0498-enforcement):
-
Dedicated exit code 100 for explicit-backend init failure. Define
VMAF_EXIT_BACKEND_INIT_FAILED = 100incore/tools/vmaf.c.init_gpu_backendsnow returns the sentinelVMAF_INIT_GPU_EXPLICIT_FAIL = -100(instead of-1) for the explicit-backend failure path;main()maps the sentinel toVMAF_EXIT_BACKEND_INIT_FAILEDfor the binary's exit code, while keeping the generic-1path for non-explicit failures (e.g.vmaf_*_import_stateerrors after a successfulstate_init). The value 100 is chosen because it's distinct from any errno-range value the binary might naturally emit and mirrors theEX_*convention from<sysexits.h>(without pulling that non-portable header in). -
Structured JSON error descriptor when
--outputis set and format is JSON. A newwrite_backend_error_jsonhelper overwrites the--outputpath with a single-line JSON object:
{
"error": "<human-readable reason>",
"backend_requested": "<sycl|cuda|vulkan|hip|metal>",
"errno": <int>,
"adr": "ADR-0498",
"exit_code": 100
}
Called from every explicit-backend failure path (not-compiled-in, state_init failed for SYCL / CUDA / Vulkan / HIP / Metal) and from the new per-feature gate. No-op when --output is unset or the requested format isn't JSON; non-JSON consumers still get the non-zero exit code.
- Per-feature backend gate. New helpers
feature_backend_suffix(returns the suffix-derived backend keyword for*_cuda/*_sycl/*_vulkan/*_hip/*_metalfeature names) andbackend_active(checks the relevant*_activeflag for the named backend). The feature-loading loop inmain()consults both before callingvmaf_use_featureand hard-fails with the same exit code + JSON descriptor when a GPU-pinned feature is requested but the matching backend isn't active. CUDA's active flag, previously local toinit_gpu_backends, is now propagated tomain()via a newbool *cuda_active_outparameter so the gate (and the existingbackend_usedJSON echo from ADR-0498) can see it.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Keep return -1 (exit 255), consumers parse stderr | Zero new behaviour | Couples every consumer to the exact marker string; breaks on i18n | The dedicated exit code is a one-line change in the consumer |
| Use exit code 200 instead of 100 | Higher = "more error" intuition | Collides with reserved 200+ range in some shells | 100 already mirrors <sysexits.h> convention; nothing in libvmaf CLI uses it |
| Emit JSON descriptor at every failure path (XML / CSV / SUB too) | Uniform error shape | Breaks format contract for non-JSON consumers | Exit code already conveys the signal; format-switch on failure surprises tooling |
| Use libvmaf's own error-formatting helper | Single source of truth | Expands public C API for a CLI-level concern | Inline fprintf is ~5 lines, stays local to vmaf.c |
| Libvmaf-side feature-registry refusal | Closer to the root cause | Requires threading backend-active state through vmaf_use_feature (API change) | CLI-side gate catches the same bug at the layer where intent is unambiguous |
Consequences¶
- Positive:
- CI gates can
[[ $rc -eq 100 ]]to distinguish backend failures from other errors without parsing stderr. vmaf-tune compare/ MCP probes get a structured JSON error to decode instead of an empty file — cleaner error reporting in downstream reports.- Per-feature gate closes the silent-fallback gap for users who pass
--feature *_hipwithout--backend hip. -
CUDA active flag is now visible in
main(), simplifying thebackend_usedJSON echo (drops the gpumask + no-flags inference). -
Negative:
- Exit code 100 is a new public contract; consumers that expected 255 from explicit-backend failures see a behaviour change. Mitigated by the fact that the only known consumer (vmaf-tune compare) was already checking
rc != 0, notrc == 255. -
The
--outputJSON file is now overwritten unconditionally on backend-init failure. A wrapper that pre-populated the file with metadata loses that content. No known consumer does this today. -
Neutral / follow-ups:
- The per-feature gate refuses some invocations that previously silently fell back to CPU. The behaviour change is fail-loud, not fail-silent, so no scoring drift can result.
- The Vulkan-empty-JSON case (
--backend vulkansucceeds at init, then frame loop / score writing emits a 0-byte JSON) is a separate root cause and is not addressed here. Tracked separately for follow-up — needs investigation intovmaf_write_output_with_format's behaviour when scores are missing.
References¶
- ADR-0498 (predecessor):
0498-vmaf-tune-bbb-e2e-v2-bug-cluster.md— introduced the explicit-backend gate and thebackend_usedJSON echo. This ADR extends, not supersedes. - ADR-0501 (V4-A):
0501-vmaf-tune-bbb-e2e-v4-bug-cluster.md— pinned the ADR-0498 strict-mode non-zero exit propagation as a regression test, providing the runway for the dedicated-exit-code work here. - Source:
req— user direction (2026-05-18, paraphrased to neutral English): "ADR-0498 says: when the user passes--backend NAME(not the defaultauto), an init failure for the requested backend must surface as a hard error. Today this is partially enforced … the run doesn't actually exit with an error code. The JSON output is empty too. Apply the same enforcement to all backends, with backend-specific exit codes (100 = backend init failed when explicitly requested), JSON error field, and a per-feature check for HIP-pinned feature names."