AI stub-and-silent-fallback audit

Date: 2026-05-18
ADR: ADR-0556
Status: Accepted (P0 fixes and selected P2/P3 fixes in this PR; P1 and remaining P2/P3 findings queued as T-rows in docs/state.md)

Scope

Exhaustive read-pass across five surfaces for "stub, scaffold, dead wiring, silent fallback, half-fix" patterns.

| Surface | Entry points audited |
|---|---|
| `tools/vmaf-tune/src/vmaftune/` | `score.py`, `auto.py`, `predictor_train.py`, `predictor.py`, `bisect.py`, `cli.py` (4094 lines) |
| `mcp-server/vmaf-mcp/src/vmaf_mcp/server.py` | All 7 MCP tools |
| `ai/scripts/` | `bvi_dvc_to_full_features.py`, `validate_model_registry.py` |
| `python/vmaf/` | `routine.py`, `core/train_test_model.py`, `core/local_explainer.py` |
| `scripts/dev/` | `permutation_importance.py` |

Prior MCP audit (2026-05-15, .workingdir/audit-2026-05-15/D-mcp-and-backends.md) confirmed that the D.10 P0 findings (backend enum gaps, _list_backends returning 4 keys only) had already been resolved. This audit extends to the Python harness and AI scripts.

P0 - Silent correctness bugs (fixed in this PR)

P0-SCORE-JSON-CORRUPT (tools/vmaf-tune/src/vmaftune/score.py:368): The json.load call had no JSONDecodeError guard. If vmaf exited 0 but wrote corrupt JSON (killed mid-write), the exception propagated uncaught and crashed the entire corpus run. Fix: wrap the call in try/except json.JSONDecodeError; on failure, set rc=65 and payload=None and skip parse_vmaf_json and parse_feature_aggregates, so the corpus row receives a NaN score.
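The guard described above might look roughly like the following. This is a minimal sketch, not the code in score.py: `load_vmaf_payload` and `score_from_payload` are hypothetical names, and the `pooled_metrics` / `vmaf` / `mean` layout is assumed for illustration.

```python
import json
import math
from pathlib import Path

EX_DATAERR = 65  # the rc=65 named in the finding


def load_vmaf_payload(json_path: Path, rc: int):
    """Return (rc, payload); payload is None when the JSON cannot be parsed."""
    try:
        with json_path.open("r", encoding="utf-8") as fh:
            return rc, json.load(fh)
    except json.JSONDecodeError:
        # The process reported success but the file is truncated or corrupt.
        return EX_DATAERR, None


def score_from_payload(payload) -> float:
    """Hypothetical stand-in for the parse step: NaN when there is no payload."""
    if payload is None:
        return math.nan
    # Assumed layout; the real parser may read different keys.
    return float(payload["pooled_metrics"]["vmaf"]["mean"])
```

With this shape, one corrupt file yields a NaN row and a non-zero rc for that clip, and the rest of the corpus run continues.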

P0-BVI-DIR-ZERO-CLIPS (ai/scripts/bvi_dvc_to_full_features.py:545, _run_dir_mode): When _select_tier_entries_dir returned an empty list, the loop silently iterated zero times, _write_parquet wrote a zero-row parquet, and main() returned 0, with no warning visible to the operator. _run_zip_mode (line 493) had the same issue. Fix: return 2 early with a descriptive error message when entries is empty, in both modes.
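A minimal sketch of the early-return pattern follows. `run_mode` and its parameters are hypothetical and only illustrate the control flow, not the actual script.

```python
import sys

EXIT_NO_INPUT = 2  # the "return 2" named in the fix


def run_mode(entries, mode_name, process=lambda entry: entry):
    """Process the selected entries; refuse to produce output for an empty selection."""
    if not entries:
        print(
            f"error: {mode_name}: tier selection matched zero clips; nothing to write",
            file=sys.stderr,
        )
        return EXIT_NO_INPUT
    rows = [process(entry) for entry in entries]
    # A real implementation would write `rows` to parquet at this point.
    return 0
```

Checking before the loop keeps a zero-row output file from ever being written, so a downstream consumer cannot mistake an empty selection for a successful run.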

P1 - Feature accepted but not fully plumbed (queued as T-rows)

P1-COMPARE-NO-BACKEND-PRECHECK (tools/vmaf-tune/src/vmaftune/cli.py:2173,2729,2876): Three code paths, in _build_per_shot_bisect_predicate, _run_compare (bisect branch), and _run_compare_crf_sweep (no-bisect branch), set score_backend = None if arg == "auto" else arg and pass the raw string to bisect_target_vmaf() without first calling select_backend(). A user who passes --score-backend cuda to a CPU-only binary gets a cryptic vmaf binary error mid-bisect instead of the friendly BackendUnavailableError exit 2 that the corpus, ladder, and fast subcommands produce. Tracked as T-PYTHON-COMPARE-NO-BACKEND-PRECHECK in state.md.
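One possible shape for the missing precheck is sketched below. The `select_backend` and `BackendUnavailableError` defined here are local stand-ins with assumed signatures, not the real ones from score_backend.py, and `resolve_score_backend` is a hypothetical helper.

```python
import sys


class BackendUnavailableError(RuntimeError):
    """Local stand-in for the error named in the finding."""


def select_backend(requested, available):
    """Stand-in with an assumed signature: validate an explicit backend up front."""
    if requested not in available:
        raise BackendUnavailableError(
            f"backend {requested!r} is not available; have {sorted(available)}"
        )
    return requested


def resolve_score_backend(arg, available):
    """Map the CLI value to what the bisect receives, failing fast on a miss."""
    if arg == "auto":
        return None  # same mapping as the existing code paths
    return select_backend(arg, available)


def main(arg, available):
    try:
        backend = resolve_score_backend(arg, available)
    except BackendUnavailableError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 2
    # bisect_target_vmaf(..., score_backend=backend) would run here.
    return 0
```

Routing all three code paths through one helper would give them the same exit-2 behaviour before any encode or score work starts.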

P2 - Half-finished implementation (queued as T-rows or fixed in this PR)

P2-AUTO-PLACEHOLDER-SILENT (tools/vmaf-tune/src/vmaftune/auto.py:286): _load_calibrated_recipes() logged the F.4 placeholder fallback at DEBUG level. Because Python's default logging level is WARNING, operators running vmaf-tune auto without an explicit recipes JSON silently received documented placeholder values (not measured outcomes, per Research-0067) with no visible indication. Fix (in this PR): promoted to _LOG.warning with an actionable hint.
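The effect of the log level can be reproduced in isolation. The sketch below uses a throwaway logger pinned to WARNING, the threshold an unconfigured root logger has; the logger name and message text are illustrative only.

```python
import io
import logging


def emitted_text(level_name: str) -> str:
    """Log one message at `level_name` and return what the handler actually received."""
    logger = logging.getLogger(f"demo.placeholder.{level_name}")
    logger.setLevel(logging.WARNING)  # mirrors the default root-logger threshold
    logger.propagate = False          # keep the demo isolated from other handlers
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        getattr(logger, level_name)("no recipes JSON given; using placeholder recipes")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

Calling `emitted_text("debug")` returns an empty string, while `emitted_text("warning")` returns the message, which is the difference the operator sees on the terminal.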

P2-TRAIN-TEST-STD-ZERO (python/vmaf/core/train_test_model.py:354): The code carries the comment "# FIXME: setting std to 0 may be misleading". The standard deviation defaults to zero when ys_label_stddev is absent from the stats dict, which can cause downstream division-by-zero or misleading uncertainty estimates. Tracked as T-PYTHON-TRAIN-TEST-STD-ZERO in state.md.
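One way to avoid the misleading zero is to represent "unknown" explicitly. The sketch below assumes a scalar ys_label_stddev for simplicity (the real value may be per-sample); `label_stddev` and `z_score` are hypothetical helpers, not functions in train_test_model.py.

```python
import math


def label_stddev(stats: dict) -> float:
    """Return ys_label_stddev, or NaN when the key is absent (instead of 0)."""
    value = stats.get("ys_label_stddev")
    return math.nan if value is None else float(value)


def z_score(pred: float, label: float, stats: dict) -> float:
    """Normalised error; NaN when the spread is unknown or exactly zero."""
    std = label_stddev(stats)
    if math.isnan(std) or std == 0.0:
        return math.nan  # propagate "unknown" rather than dividing by zero
    return (pred - label) / std
```

NaN propagates through later arithmetic, so a missing spread stays visible in the output instead of reading as perfect certainty.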

P2-ROUTINE-SWALLOWED-EXCEPTION (python/vmaf/routine.py:604): The handler has the shape except Exception as e: print("Warning: ...") followed by a fallback to default stats. It swallows the exception and continues with potentially incorrect stats. Tracked as T-PYTHON-ROUTINE-SWALLOWED-EXCEPTION in state.md.
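A narrower alternative to the catch-all is sketched here. Which exception types the real stats computation can raise is not known from this audit, so the (KeyError, ValueError) tuple, the `strict` flag, and `DEFAULT_STATS` are all assumptions for illustration.

```python
import logging

_LOG = logging.getLogger(__name__)

DEFAULT_STATS = {"fallback": True}  # placeholder for the real default stats


def compute_stats_or_fallback(compute, *, strict=False):
    """Return (stats, used_fallback); only expected failures trigger the fallback."""
    try:
        return compute(), False
    except (KeyError, ValueError) as exc:
        if strict:
            raise
        # Log with traceback so the failure is diagnosable, then fall back.
        _LOG.warning("stats computation failed, using defaults: %s", exc, exc_info=True)
        return dict(DEFAULT_STATS), True
```

Returning the `used_fallback` flag lets the caller mark the result as degraded, and anything outside the expected exception types still surfaces as a normal traceback.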

P3 - Cleanup / cosmetic (queued as T-rows or fixed in this PR)

P3-SERVER-LIST-BACKENDS-DESC (mcp-server/vmaf-mcp/src/vmaf_mcp/server.py:833): The list_backends tool description string listed cpu / cuda / sycl / hip, omitting vulkan and metal. Fix (in this PR): updated to cpu / cuda / sycl / vulkan / hip / metal.
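Drift of this kind can be prevented by generating the description from the backend list instead of hand-writing it. This is a sketch only: `BACKENDS`, `list_backends_description`, and the description wording are hypothetical, and the fix in this PR simply edits the string.

```python
# Order taken from the corrected description in the finding above.
BACKENDS = ("cpu", "cuda", "sycl", "vulkan", "hip", "metal")


def list_backends_description(backends=BACKENDS) -> str:
    """Build the tool description from the same tuple the tool reports on."""
    return "Report which backends are available: " + " / ".join(backends) + "."
```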

P3-PERMUTATION-IMPORTANCE-HARDCODED-PATH (scripts/dev/permutation_importance.py:22): REPO = Path("/home/kilian/dev/vmaf") is a hardcoded developer-machine path that breaks on any other host. Tracked as T-PYTHON-PERMUTATION-IMPORTANCE-HARDCODED-PATH in state.md.
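A portable replacement could derive the root from the script's own location. The sketch below assumes the script stays at scripts/dev/ inside the repository; `repo_root` and the `VMAF_REPO_ROOT` override variable are hypothetical names.

```python
import os
from pathlib import Path


def repo_root(script_path, env=os.environ) -> Path:
    """Locate the repository root without a machine-specific absolute path."""
    override = env.get("VMAF_REPO_ROOT")  # hypothetical opt-in override
    if override:
        return Path(override)
    # <root>/scripts/dev/<script>.py: parents[0]=scripts/dev, [1]=scripts, [2]=<root>
    return Path(script_path).resolve().parents[2]
```

Inside the script this would be used as `REPO = repo_root(__file__)`.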

P3-VALIDATE-REGISTRY-MISLEADING-OK (ai/scripts/validate_model_registry.py:183): After successful jsonschema validation, a second read_text + json.loads pass that counts entries caught all exceptions and printed "OK: 0 registry entries valid" on any read failure, producing a misleading success message. Fix (in this PR): propagate a count-read failure as rc=1 with an ERROR: message.
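The corrected behaviour might be sketched as follows. `report_registry` is a hypothetical function, and the top-level "models" key is an assumption about the registry layout.

```python
import json
import sys
from pathlib import Path


def report_registry(path: Path) -> int:
    """Print the entry count, or fail loudly when the count cannot be read."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        count = len(data["models"])  # assumed key; adjust to the real schema
    except (OSError, json.JSONDecodeError, KeyError, TypeError) as exc:
        print(f"ERROR: could not count registry entries in {path}: {exc}", file=sys.stderr)
        return 1
    print(f"OK: {count} registry entries valid")
    return 0
```

The OK line is only reachable after the count has actually been read, so a read failure can no longer masquerade as "0 entries valid".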

P3-LOCAL-EXPLAINER-HACKY (python/vmaf/core/local_explainer.py:121): The line model = model[0] # HACKY, TODO: fix it silently takes the first model from a list and may produce the wrong explanation if multiple models are present. Tracked as T-PYTHON-LOCAL-EXPLAINER-HACKY in state.md.
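A stricter unwrapping step could make the single-model assumption explicit. `single_model` is a hypothetical helper, not part of local_explainer.py.

```python
def single_model(model):
    """Unwrap a one-element list or tuple; reject ambiguous multi-model input."""
    if isinstance(model, (list, tuple)):
        if len(model) != 1:
            raise ValueError(f"expected exactly one model, got {len(model)}")
        return model[0]
    return model
```

Raising on a multi-model list turns a silently wrong explanation into an immediate, attributable error.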

Confirmed non-issues (investigated and ruled out)

| Location | Finding | Verdict |
|---|---|---|
| `predictor_train.py:210` | Synthetic corpus rows use `schema_version: 2` vs current `SCHEMA_VERSION=3` | Not a bug - `_upgrade_row_in_place` fills missing v3 columns with NaN |
| `predictor.py:164` | Missing `onnxruntime` triggers the analytical fallback | Intentional per docstring; not silent |
| `server.py:491` | `_compare_models` catches per-model exceptions into `errors` list | Intentional - caller sees all errors, not just first |
| `server.py:522` | `_load_vlm` `except Exception: continue` on VLM candidate scan | Intentional - VLM is best-effort |
| `server.py:699` | `finally: pass` in `_describe_worst_frames` | Correct - PNGs left for caller access until next invocation |
| `score_backend.py:205` | `select_backend()` raises `BackendUnavailableError` on explicit backend miss | Correct hard-fail; no silent fallback |
| `cli.py:3090` | `auto` subcommand `--execute`/`--runs-dir`/`--execute-all` | Fully consumed at lines 3090-3101 |
| `cli.py:1699` | `predict` subcommand all args | Fully consumed |
| `cli.py:3875` | Report JSON loading | Guarded with `(OSError, json.JSONDecodeError)` |

References

Prior MCP + backend audit: .workingdir/audit-2026-05-15/D-mcp-and-backends.md
Research-0067: calibrated recipe provenance
ADR-0543: exit-code-100 hard-fail for explicit GPU backend failures (C binary)
ADR-0498: strict-mode enforcement for backend selection
req: user requested exhaustive Python / MCP / AI audit matching the prior C-surface audit