ADR-0514: dev-MCP container exposes every host GPU backend (CUDA + SYCL + Vulkan + HIP)
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, Claude (Anthropic)
- Tags: container, dev-experience, gpu, sycl, vulkan, hip, cuda, fork-local
Context
The vmaf-dev-mcp container (dev/Containerfile / dev/docker-compose.yml) is the project-mandated default execution environment for vmaf / vmaf-tune / ai / MCP work (CLAUDE.md §12 rule 15, ADR-0496). Despite shipping CUDA, Intel oneAPI, ROCm, and Mesa Vulkan SDKs in its image layers, only CPU and CUDA were actually usable at run-time on the dev host:

    backend=cpu     ok    VMAF 76.66783
    backend=cuda    ok    VMAF 76.66783
    backend=sycl    fail  "No device of requested type available"
    backend=vulkan  fail  Intel-Arc-only enumeration; nvidia_icd hidden
    backend=hip     fail  "built without hip support (ADR-0498)"
    backend=metal   fail  "built without metal support"   <-- Apple-only, expected on Linux

The host concurrently has three working Vulkan adapters (RTX 4090, Intel Arc A380, AMD gfx1036), a working SYCL/level-zero Arc target, and a working ROCm HIP target, none of which the container surfaced. Three backends (SYCL, Vulkan, HIP) were affected; Research-0138 traced the failures to the five findings below:
- SYCL: missing `libhwloc.so.15` on `LD_LIBRARY_PATH`. The Intel oneAPI 2026.0 level-zero UR adapter (`libur_adapter_level_zero.so.0`) dlopens `libhwloc.so.15` at adapter-load time to enumerate NUMA topology before touching a device. The library is present at `/opt/intel/oneapi/tcm/latest/lib/libhwloc.so.15` but `tcm/latest/lib` was absent from the image's `LD_LIBRARY_PATH`, so the adapter failed to load and `sycl-ls` reported "No platforms found" even with the Arc device node visible.
- SYCL: missing `/dev/dri/by-path/` symlinks. The Intel compute-runtime (`libze_intel_gpu.so.1`) discovers Arc GPUs through the udev-managed `pci-XXXX:YY:ZZ.W-render` symlinks under `/dev/dri/by-path/`. Docker's `devices: ["/dev/dri:/dev/dri"]` directive passes the leaf device nodes but does not preserve the parent directory's symlinks: the container saw `card0..card2` / `renderD128..renderD130` but only one orphaned by-path entry, collapsing device enumeration to zero platforms.
- Vulkan: `VK_ICD_FILENAMES` pinned to a non-existent lavapipe path. The image set `VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json`, but the file mesa ships in this layer is `lvp_icd.json` (no `.x86_64` suffix). The Vulkan loader honoured the env var, found no matching ICD, and reported zero devices, hiding the NVIDIA ICD that the NVIDIA Container Toolkit had correctly bind-mounted into `/etc/vulkan/icd.d/nvidia_icd.json`, and the Intel/AMD ICDs the mesa-vulkan-drivers package installs into `/usr/share/vulkan/icd.d/`.
- HIP: `core/tools/meson.build` missing `-DHAVE_HIP=1` cflag. The Containerfile passes `-Denable_hip=true` to meson, the meson summary reports `enable_hip : true`, and `libvmaf.so.3` links against `libamdhip64.so.6`. But the vmaf CLI's HIP support gate is `#ifdef HAVE_HIP` (compile-time guards around `#include "libvmaf/libvmaf_hip.h"`, the `VmafHipState` init/cleanup, and the `--backend hip` strict-mode arm). `core/tools/meson.build` conditionally appends `-DHAVE_CUDA=1` / `-DHAVE_SYCL=1` / `-DHAVE_VULKAN=1` to `vmaf_tool_cflags` but had no matching `-DHAVE_HIP=1` branch, so the CLI compiled with the "built without hip support" strict-mode error path active even though libvmaf itself was built with HIP enabled. The fix is one 3-line meson conditional. Verified post-fix: `vmaf --backend hip` returns VMAF 76.66783 / rc=0 against the golden CPU pair on the host's AMD gfx1036.
- `VK_ICD_FILENAMES`: `ENV` cannot truly unset. Initial attempt to clear the env vars via `ENV VK_ICD_FILENAMES=` / `ENV VK_DRIVER_FILES=` in the Containerfile produced empty-string values at runtime, which the Vulkan loader treats the same as a non-existent file (`ERROR_INCOMPATIBLE_DRIVER`). The fix is to not set the env var in the Containerfile at all, and explicitly `unset VK_ICD_FILENAMES VK_DRIVER_FILES` in `dev/scripts/dev-mcp-entrypoint.sh` on container startup. Operators that need to force a single ICD can still pass it per-invocation with `docker exec -e VK_ICD_FILENAMES=…`.
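
The empty-versus-unset distinction in the last finding is easy to miss. The following is a minimal Python sketch of it, not the project's fix (that lives in the shell entrypoint). The helper names `pin_state` and `scrub_vulkan_env` are hypothetical and exist only for illustration.

```python
import os

# The two variables the entrypoint unsets, as named in the finding above.
VULKAN_PIN_VARS = ("VK_ICD_FILENAMES", "VK_DRIVER_FILES")


def pin_state(env, name):
    """Report whether `name` is 'unset', 'empty', or 'set' in a mapping.

    Hypothetical helper: an empty string is still a value, which is why
    `ENV VK_ICD_FILENAMES=` in a Containerfile does not behave like `unset`.
    """
    value = env.get(name)
    if value is None:
        return "unset"
    return "empty" if value == "" else "set"


def scrub_vulkan_env(env):
    """Return a copy of `env` with the pin variables removed, not blanked."""
    return {k: v for k, v in env.items() if k not in VULKAN_PIN_VARS}


if __name__ == "__main__":
    # Show the state of both variables in the current process environment.
    for name in VULKAN_PIN_VARS:
        print(name, pin_state(os.environ, name))
```

Passing the scrubbed mapping as the `env` argument of `subprocess.run` would give a child process the same view that `unset` gives the shell, assuming nothing else re-exports the variables.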
The Vulkan backend has a separate fp64 enable-bit issue tracked under ADR-0492 (fix/vulkan-fp64-gate-relax branch) that prevents Intel Arc from passing the parity gate. This ADR addresses container exposure only — once both PRs land, Vulkan on Arc / RADV / NVIDIA all probe green inside the container.
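
The first SYCL finding in the list above (a library present on disk but not on `LD_LIBRARY_PATH`) can be checked with a few lines of Python. This is a rough diagnostic sketch under stated assumptions: `find_on_ld_path` is a hypothetical helper, it only walks the directories in the variable, and it ignores the dynamic linker's other search locations (cache, default paths, RPATH), so a `None` result is a hint rather than proof that `dlopen` would fail.

```python
import os


def find_on_ld_path(soname, ld_library_path):
    """Return the first path to `soname` found on an LD_LIBRARY_PATH string.

    Simplification: empty entries (which the dynamic linker treats as the
    current directory) are skipped here.
    """
    for directory in ld_library_path.split(":"):
        if not directory:
            continue
        candidate = os.path.join(directory, soname)
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    hit = find_on_ld_path("libhwloc.so.15", os.environ.get("LD_LIBRARY_PATH", ""))
    print(hit if hit else "libhwloc.so.15 not found on LD_LIBRARY_PATH")
```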
Decision
Patch dev/Containerfile + dev/docker-compose.yml so every GPU backend libvmaf builds for is also runnable inside the container on hosts that have the matching silicon. Concretely:
- `dev/Containerfile`: append `${ONEAPI_ROOT}/tcm/latest/lib` to `LD_LIBRARY_PATH` in the gpu-sdks stage so the level-zero UR adapter resolves `libhwloc.so.15`.
- `dev/Containerfile`: delete the previous `ENV VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json` line and do not replace it. `ENV VK_ICD_FILENAMES=` (empty string) is not a fix: the Vulkan loader treats an empty value the same as a non-existent file and bails with `ERROR_INCOMPATIBLE_DRIVER`.
- `dev/scripts/dev-mcp-entrypoint.sh`: `unset VK_ICD_FILENAMES VK_DRIVER_FILES` on container start, so the Vulkan loader falls back to its default search of `/etc/vulkan/icd.d/` (NVIDIA Container Toolkit's bind-mount) and `/usr/share/vulkan/icd.d/` (mesa). The entrypoint is the right place because Docker's `ENV` cannot truly unset, and operators that need to force a single ICD can still pass it per-invocation via `docker exec -e VK_ICD_FILENAMES=…`.
- `core/tools/meson.build`: add `-DHAVE_HIP=1` to `vmaf_tool_cflags` when `enable_hip` is true, mirroring the existing CUDA / SYCL / Vulkan conditionals. Without this the vmaf CLI's HIP gate (`#ifdef HAVE_HIP` in `vmaf.c`) is inactive even when libvmaf itself is built with HIP enabled, and `vmaf --backend hip` returns the "built without hip support" strict-mode error.
- `dev/Containerfile`: add a build-time probe loop after the libvmaf install step that iterates `cpu cuda sycl vulkan hip` and runs vmaf against the Netflix golden CPU pair. GPU backends without device passthrough are soft-failures (`|| echo WARN`); the line we actually care about is `built without X support`, which trips on the precise HIP regression that motivated this ADR.
- `dev/docker-compose.yml`: add a read-only bind-mount of `/dev/dri/by-path` to both `dev-mcp` and `smoke-probe-cron` services. The leaf device nodes are still passed via `devices: ["/dev/dri:/dev/dri", "/dev/kfd:/dev/kfd"]` so Docker writes the cgroup rwm rules; the bind-mount carries the udev symlinks that `devices:` cannot.
- `dev/AGENTS.md`: document the four invariants (LD path includes tcm/latest/lib, no VK_ICD_FILENAMES pin, by-path bind-mount, build probe) so future rebases preserve them.
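
The soft-versus-hard failure rule of the build-time probe can be sketched in Python as below. This is an illustration of the classification logic only; the probe described above is a shell loop in the Containerfile. The function name `classify_probe` is hypothetical, and treating a failing `cpu` probe as a hard error is an assumption of this sketch rather than something the ADR states.

```python
import re

# Backends iterated by the probe loop, as listed in the decision above.
BACKENDS = ("cpu", "cuda", "sycl", "vulkan", "hip")

# Strict-mode error string the probe scans for; the group is the backend name.
BUILT_WITHOUT = re.compile(r"built without (\w+) support")


def classify_probe(backend, returncode, output):
    """Map one probe run to 'ok', 'warn' (soft failure), or 'error'."""
    match = BUILT_WITHOUT.search(output)
    if match and match.group(1) == backend:
        # The binary itself lacks the backend: a build regression.
        return "error"
    if returncode == 0:
        return "ok"
    # Assumption of this sketch: cpu needs no device passthrough, so a cpu
    # failure is never soft. GPU failures at build time are expected.
    return "error" if backend == "cpu" else "warn"


if __name__ == "__main__":
    sample = {
        "cuda": (0, "VMAF 76.66783"),
        "sycl": (1, "No device of requested type available"),
        "hip": (1, "built without hip support (ADR-0498)"),
    }
    for name, (rc, text) in sample.items():
        print(name, classify_probe(name, rc, text))
```

Keying the hard failure on the error string rather than the exit code is what lets the probe run on build hosts that have no GPUs attached.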
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| (Chosen) Patch Containerfile env + compose volumes; keep ROCm/oneAPI install layers as-is | Minimal change; preserves all CUDA/oneAPI/Vulkan/HIP investment from earlier ADRs (0496/0498); future operators don't need new host packages | Two files to keep in sync; relies on host having /dev/dri/by-path (true on any modern systemd-udev host) | Smallest blast radius; touches the exact lines responsible for the breakage. |
| Drop oneAPI/Vulkan/HIP from the image and require operators to install on host | Smaller image | Defeats the purpose of ADR-0496 (default-to-container); host-build debt returns | Direct contradiction of ADR-0496. |
| Bind-mount the entire /dev/dri directory via volumes: instead of devices: | Single mount entry; symlinks preserved automatically | volumes: does not generate cgroup device.allow rules; render-node opens get EPERM on default-cgroup hosts | Functional regression; only works if container runs --privileged. |
| Run the dev-MCP container as --privileged and let everything in /dev be visible | Simplest cgroup story | Defeats CERT C ENV32 (already protected by the non-root vmaf user); broad attack surface for an everyday dev container | Disproportionate trust escalation. |
| Pin a single Vulkan ICD via VK_ICD_FILENAMES (e.g. only NVIDIA) | Reproducible; one device per probe | Cross-backend parity testing (the whole point of the container) needs concurrent visibility of NVIDIA + Arc + AMD | Defeats /cross-backend-diff workflow. |
| Install host-side intel-compute-runtime + level-zero packages and mount the host's /opt/intel into the container | One fewer image layer | Pins container behaviour to the host's exact oneAPI version; portability lost; CI builds break (CI runners don't have oneAPI on host) | Anti-pattern for reproducible containers. |
Consequences
- Positive: All four GPU backends (`cuda`, `sycl`, `vulkan`, `hip`) are runnable inside `vmaf-dev-mcp` on hosts with the matching silicon, with CPU-parity scores (5-place tolerance per ADR-0214). Cross-backend numeric-diff workflows (`/cross-backend-diff`) run end-to-end without leaving the container. The build-time probe surfaces silent backend-disable regressions in the image build log, so a future commit that accidentally flips `-Denable_hip=true` off prints `built without hip support` next to the probe line for `backend=hip` on every rebuild instead of waiting for a user to notice.
- Negative: The `/dev/dri/by-path` bind-mount adds a host-path dependency. Operators on systems without systemd-udev (rare; minimal container hosts) may need to comment it out. The Vulkan loader now enumerates every available ICD on every vmaf invocation, which adds a few milliseconds of startup overhead per run.
- Neutral / follow-ups:
  - ADR-0492 (`fix/vulkan-fp64-gate-relax`) must merge before Vulkan on Intel Arc actually returns a score; this ADR only exposes the device.
  - `vmaf_hip_import_state` returns `-ENOSYS` (library-side gap): once this PR ships, `vmaf --backend hip` proceeds past the CLI compile-time gate, initialises HIP state successfully, and then fails at `vmaf_hip_import_state` because the function in `core/src/hip/common.c:149` still returns `-ENOSYS` with the comment "stays unwired until the first feature kernel lands" (despite ADR-0468 having landed the first HIP feature kernel). Wiring `vmaf_hip_import_state` to stash the HIP state on the VmafContext and routing HIP-capable extractors through it is a separate library-side change out of scope here; the container-side plumbing this PR ships is the prerequisite. Tracked separately in `docs/state.md` as the HIP runtime gap.
  - `docs/development/dev-mcp.md` updated with the new backend matrix and env-var requirements.
  - `dev/AGENTS.md` gains four new invariants.
  - State.md row added under "Recently closed" closing finding 8 of `SESSION_FINDINGS_v9_GPU_PROBE.md`.
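
The host-path dependency noted under "Negative" can be checked before starting the container. The sketch below is a hedged Python illustration: `render_symlinks` is a hypothetical helper, and the `-render` suffix follows the `pci-XXXX:YY:ZZ.W-render` naming described in the Context section. It lists the by-path render symlinks whose targets actually exist, so an empty result suggests either a host without udev-managed symlinks or orphaned entries like the one the container originally saw.

```python
import os


def render_symlinks(by_path_dir):
    """Return {link name: resolved target} for live '*-render' symlinks.

    Dangling symlinks are skipped because os.path.exists follows the link
    and reports False when the target is missing.
    """
    found = {}
    try:
        names = sorted(os.listdir(by_path_dir))
    except OSError:
        return found  # directory missing or unreadable
    for name in names:
        link = os.path.join(by_path_dir, name)
        if name.endswith("-render") and os.path.islink(link) and os.path.exists(link):
            found[name] = os.path.realpath(link)
    return found


if __name__ == "__main__":
    links = render_symlinks("/dev/dri/by-path")
    if not links:
        print("no live render symlinks under /dev/dri/by-path")
    for name, target in links.items():
        print(name, "->", target)
```

Running the same script on the host and inside the container gives a quick comparison of what the bind-mount is carrying across.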
References
- Research-0138: `docs/research/0138-dev-mcp-container-backend-exposure-2026-05-18.md`
- ADR-0492: `0492-vulkan-vif-shader-fp64-g-computation.md` (separate Vulkan Arc fp64 fix)
- ADR-0496: `0496-prefer-dev-mcp-container-rule.md` (CLAUDE rule 15: default to dev-MCP container)
- ADR-0498: `0498-vmaf-tune-bbb-e2e-v2-bug-cluster.md` (origin of the strict-mode "built without X" error string the probe scans for)
- Source: `req`, a user-provided briefing identifying the three container-side gaps (Vulkan ICD, SYCL runtime, HIP build) and the Containerfile patches required to close them.