ADR-0451: Local dev-MCP container for live probing¶
- Status: Accepted
- Date: 2026-05-15
- Deciders: lusoris, Claude (Anthropic)
- Tags:
infra,docker,mcp,gpu,hip,cuda,sycl,vulkan,dev,fork-local
Context¶
The VMAF fork now has four GPU backends (CUDA, SYCL, Vulkan, HIP), an embedded MCP server (ADR-0128 / ADR-0402), and a growing set of Python tools (vmaf-mcp, vmaf-train, vmaf-tune). During active development it became necessary to probe all backends simultaneously without disrupting the CHUG overnight training job running on the same machine.
The operator's primary workstation has an RTX 4090 (NVIDIA-only). The HIP backend is functionally complete at the toolchain level but there is no AMD GPU on the host. Build-regression catching for HIP requires the ROCm compiler toolchain to be available, but runtime kernel execution is not required for this purpose.
A persistent, always-on environment that:
- Builds libvmaf with all four backends enabled.
- Runs the embedded MCP UDS server continuously.
- Executes a periodic smoke probe (every 15 minutes) across all backends.
- Writes structured JSON probe records for CI / audit consumption.
…was identified as the correct next step. The alternative (a bare-metal install script) was rejected because it would conflict with the running CHUG job and because the Docker isolation boundary cleanly separates SDK versions from the host environment.
Decision¶
We will ship a multi-stage dev/Containerfile (Ubuntu 24.04 base) together with dev/docker-compose.yml and a set of convenience wrapper scripts under dev/scripts/.
The container installs all four GPU SDKs:
- CUDA 12.6 —
cuda-toolkit-12-6from the NVIDIA apt repository. - Intel oneAPI 2025.3 —
intel-basekit-2025.3from the Intel apt repository. - Vulkan + lavapipe —
libvulkan-dev,mesa-vulkan-drivers(software ICD for correctness testing without GPU passthrough). - ROCm 6.x / HIP —
rocm-hip-sdkfrom the AMD apt repository; toolchain only on NVIDIA hosts (no kernel execution).
libvmaf is built from the in-tree source with:
-Denable_cuda=true -Denable_sycl=true -Denable_vulkan=enabled
-Denable_hip=true -Denable_hipcc=true -Denable_metal=auto
-Denable_dnn=enabled -Denable_mcp=true -Denable_mcp_stdio=true
-Denable_mcp_uds=true -Denable_mcp_sse=auto
The MCP UDS server starts at container boot via dev/scripts/dev-mcp-entrypoint.sh. The smoke-probe-cron service runs dev/scripts/smoke-probe-loop.sh every 15 minutes against the golden pair testdata/ref_576x324_48f.yuv / testdata/dis_576x324_48f.yuv, writing structured JSON to .workingdir/dev-mcp-probes/.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Bare-metal install script | No container overhead; fastest startup | Conflicts with the running CHUG training job; pollutes host SDK environment; cannot isolate ROCm 6.x from existing CUDA | Rejected — the CHUG/RTX-4090 conflict is a hard constraint |
| No continuous probe; manual-only | Simpler; no cron noise | Regressions can silently accumulate between sessions; operators forget to probe | Rejected — the cron loop is cheap and the JSON records give a time-series audit trail |
| Single GPU backend only (CPU + CUDA) | Faster build; smaller image | HIP and SYCL toolchain regressions go undetected | Rejected — the user explicitly requested all 4 backends including HIP for build-regression catching even when kernel execution is not possible |
| Two containers per SDK to avoid image bloat | Smaller per-service image | Complex compose topology; hard to run cross-backend probes atomically | Rejected — all probes must share one libvmaf build to be comparable |
Consequences¶
- Positive: All four backends are exercised on every probe cycle, providing a 15-minute maximum detection window for regressions. The JSON probe records form a lightweight time-series audit trail without requiring a database. The Docker boundary prevents the container's SDK stack (ROCm 6.x, oneAPI 2025.3) from interfering with the host CUDA environment used by the CHUG training job.
- Negative: The container image is large (~8–12 GB compressed) due to bundling all four GPU SDK layers. The first build is slow (20–40 min). HIP kernel execution is not available on NVIDIA-only hosts; probe records will show
ENOSYSfor HIP feature extractors — this is expected and documented. - Neutral / follow-ups:
- The
smoke-probe-cronprobe schema (backend_results,mcp_results) is an internal format. If a dashboard is built on top of it, formalise the schema in a follow-up ADR. - Metal remains disabled on Linux (
-Denable_metal=auto). The Containerfile needs no update when Metal kernel work lands (macOS-only path). - The NVIDIA Container Toolkit must be installed on the host separately; it cannot be installed inside the container image.
References¶
req(2026-05-15 user direction): Docker container preferred; all 4 GPU backends including HIP for build-regression catching; continuous smoke probe loop every 15 minutes.- ADR-0128 — embedded MCP scaffold
- ADR-0402 — MCP runtime v2 UDS transport
- ADR-0212 — HIP backend scaffold
- ADR-0214 — GPU parity CI gate (ULP thresholds)
- ADR-0374 — ENOSYS contract for missing GPU backends
docs/development/dev-mcp.md— operator guide for this container