ADR-0542: Full GPU backend plumbing in the dev-mcp container
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, claude
- Tags: dev-container, cuda, vulkan, sycl, hip, rocm
Context
The vmaf-dev-mcp container (CLAUDE.md §12 r15) is the default execution surface for every libvmaf / vmaf-tune / ai / MCP probe on the dev machine. Its goal is reproducible cross-vendor GPU coverage: NVIDIA via CUDA, Intel Arc via SYCL (Level Zero), Vulkan across all three vendors, AMD via HIP. After ADR-0509 / ADR-0514 / ADR-0528 landed the bind-mounts and the level-zero ICD packages, the matrix still produced silent CPU fallback or software emulation on three of the four GPU lanes when empirically measured against a host with NVIDIA RTX 4090 + Intel Arc A380 + AMD gfx1036 (Raphael iGPU):
- Vulkan: `vmaf --backend vulkan --vulkan_device 0` landed on `lvp_icd.json` (lavapipe, software). The mesa lavapipe ICD sorts lexicographically before `nvidia_icd.json` / `intel_icd.x86_64.json` / `radeon_icd.x86_64.json` and wins the `vulkan_device=0` index race when the loader enumerates the default ICD search path. The user-visible symptom was a Vulkan run that either returned lavapipe numbers (3–5× slower than CPU, no diagnostic) or segfaulted on shaders lavapipe does not implement.
- SYCL on Intel Arc: `zeInit()` returned `0x70010000` and `sycl-ls` reported `Platforms: 0`. The level-zero loader was correctly installed but the VA-API user-space driver (`iHD_drv_video.so`) the Intel compute-runtime dlopens during GPU capability probing was missing.
- HIP on AMD `gfx1036`: `rocminfo` returned "Unable to open /dev/kfd read-write: Invalid argument" and `hsa_init()` failed with `HSA_STATUS_ERROR_OUT_OF_RESOURCES`. The host's iGPU is not on the ROCm 6.x supported-GPU allowlist that `libhsa-runtime64.so` consults; without `HSA_OVERRIDE_GFX_VERSION` the runtime refuses to initialise even though `/dev/kfd` is bind-mounted and the `video` + `render` groups are joined.
- NVIDIA Vulkan ICD bind-mount: depends on `NVIDIA_DRIVER_CAPABILITIES=compute,graphics,utility,video` (the `graphics` token is what causes the Container Toolkit to bind-mount `nvidia_icd.json` into `/etc/vulkan/icd.d/`). The compose file already carried that token, but no documentation pinned the invariant for future edits.
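The `graphics`-token invariant in the last bullet can be checked mechanically. The following is a minimal sketch, not code from the repository: `has_capability` is a hypothetical helper name, and accepting `all` as equivalent is an assumption based on the NVIDIA Container Toolkit treating `all` as every capability.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): succeed if the comma-separated
# capability list in $1 contains the exact token $2.
has_capability() {
  local list="$1" token="$2"
  # Wrap both sides in commas so "graphics" never matches a longer token.
  case ",${list}," in
    *",${token},"*) return 0 ;;
    *) return 1 ;;
  esac
}

caps="${NVIDIA_DRIVER_CAPABILITIES:-}"
if has_capability "$caps" graphics || has_capability "$caps" all; then
  echo "ok: graphics capability requested"
else
  # Per the Context above, without this token nvidia_icd.json may not be mounted.
  echo "warn: NVIDIA_DRIVER_CAPABILITIES='${caps}' lacks 'graphics'" >&2
fi
```

Running a check like this at container start would turn an accidental removal of the token into a visible warning instead of a silent lavapipe fallback.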
Decision
We will close all four gaps in `dev/Containerfile`, `dev/docker-compose.yml`, and `dev/scripts/dev-mcp-entrypoint.sh`:
- Add `intel-media-va-driver-non-free` + `mesa-va-drivers` to the stage-1 apt list so VA-API codec drivers are present for both Intel and AMD.
- Have the entrypoint dynamically rewrite `VK_DRIVER_FILES` to a colon-separated list of all non-lavapipe ICD JSONs visible under `/etc/vulkan/icd.d/` + `/usr/share/vulkan/icd.d/`. Lavapipe stays on disk as the CPU-only fallback but is filtered out whenever any real GPU ICD is present.
- Pin `HSA_OVERRIDE_GFX_VERSION=10.3.0`, `HSA_ENABLE_SDMA=0`, and `ROCR_VISIBLE_DEVICES=0` in the docker-compose `common-env` block.
- Document the `NVIDIA_DRIVER_CAPABILITIES` graphics-token invariant inline in `dev/docker-compose.yml` and in `dev/AGENTS.md`.
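The `VK_DRIVER_FILES` rewrite described above could look roughly like the sketch below. It illustrates the filtering idea and is not the actual contents of `dev-mcp-entrypoint.sh`: the function name `build_vk_driver_files` and the `lvp_icd*.json` filename pattern are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: print a colon-separated list of every ICD manifest in the given
# directories, skipping lavapipe. Name and pattern are illustrative assumptions.
build_vk_driver_files() {
  local dir f out=""
  for dir in "$@"; do
    for f in "$dir"/*.json; do
      [ -f "$f" ] || continue          # missing directory or unmatched glob
      case "${f##*/}" in
        lvp_icd*.json) continue ;;     # lavapipe: software rasteriser, skip
      esac
      out="${out:+${out}:}${f}"        # append with ':' only between entries
    done
  done
  printf '%s' "$out"
}

real_icds="$(build_vk_driver_files /etc/vulkan/icd.d /usr/share/vulkan/icd.d)"
if [ -n "$real_icds" ]; then
  # At least one hardware ICD is visible: restrict the loader to those files.
  export VK_DRIVER_FILES="$real_icds"
else
  # No hardware ICD found: leave the default search path so lavapipe remains
  # available as the CPU-only fallback.
  unset VK_DRIVER_FILES
fi
```

Leaving the variable unset in the no-GPU case keeps the fallback behaviour the decision asks for, since the loader then enumerates its default search path and finds lavapipe on its own.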
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Delete `/usr/share/vulkan/icd.d/lvp_icd.json` from the image | Most direct fix: the loader cannot pick what is not on disk | Couples us to mesa's packaging (file moves between minor releases); breaks the CPU-only fallback path useful for hosts without GPU passthrough | Entrypoint-time `VK_DRIVER_FILES` rewrite gives the same outcome without coupling to mesa internals |
| Use `VK_LOADER_DRIVERS_DISABLE=lvp_icd` (Vulkan 1.3.250+ feature) | Single-line env var; documented Vulkan loader flag | Requires loader ≥ 1.3.250 baked into the image; Ubuntu 24.04 ships 1.3.275, so it is technically available, but the disable token matches by filename glob in some loader versions and by full path in others, which is fragile across loader updates | `VK_DRIVER_FILES` allowlist semantics are stable across every loader release we ship |
| Switch to ROCm 7 to get `gfx1036` on the supported-GPU allowlist | No env-var override needed | ROCm 7 noble apt repo is not generally available yet (2026-05-18); switching means losing HIP coverage entirely until 7.x ships for noble | `HSA_OVERRIDE_GFX_VERSION=10.3.0` is the documented escape hatch and works on every ROCm 6.x release |
| Drop `mesa-vulkan-drivers` entirely (NVIDIA-only Vulkan) | Removes lavapipe by removing its package source | Loses Intel Arc + AMD Vulkan coverage, which is the whole point of cross-vendor parity | Cross-vendor coverage is a non-negotiable for the parity gate |
Consequences
- Positive:
  - All five `--backend {cpu, cuda, sycl, vulkan, hip}` lanes run on their intended device on the dev machine; the four GPU lanes return a real GPU score with no silent CPU/software fallback.
  - The `vmaf-tune compare` sweep can dispatch across every GPU backend in parallel without device-multiplex risk (CLAUDE.md §12 r15 last bullet).
- Negative:
  - `HSA_OVERRIDE_GFX_VERSION=10.3.0` is a lie to the ROCm runtime; kernels that exercise gfx1030 features absent on `gfx1036` (e.g. wavefront-size-64 specific instructions) could in principle miscompile. Mitigation: the libvmaf HIP feature kernels stay on the ADR-0214 places=4 cross-backend gate; any regression surfaces in the regular CI matrix.
  - The entrypoint-time `VK_DRIVER_FILES` rewrite happens once at container start; new ICD JSONs installed via `docker exec apt install …` after startup will not be picked up until the next restart. This is acceptable: package installs at runtime are an anti-pattern this container already discourages.
- Neutral / follow-ups:
  - When ROCm 7 ships in the noble apt repo, drop the `HSA_OVERRIDE_GFX_VERSION` override and re-evaluate whether `gfx1036` reaches the supported-GPU allowlist natively.
  - The lavapipe lane remains as the CPU-only fallback; a future ADR could add a `--vulkan-allow-software` opt-in flag in libvmaf for users who explicitly want software emulation (no current consumer).
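Because the `HSA_OVERRIDE_GFX_VERSION` pin misreports the hardware to the ROCm runtime and is meant to be dropped later, it may help to log the ROCm pins from the Decision section at container start. The sketch below is one way to do that; `check_pin` is a hypothetical helper, not an existing script in the repository, and it relies on the variables being exported into the container environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare an exported environment variable ($1) against
# the value the compose file is expected to pin ($2).
check_pin() {
  local name="$1" want="$2" got
  got="$(printenv "$name")"            # empty if the variable is not exported
  if [ "$got" = "$want" ]; then
    echo "ok: ${name}=${got}"
  else
    echo "mismatch: ${name}='${got}' (expected '${want}')" >&2
    return 1
  fi
}

# Pins listed in the Decision section; report each one without aborting startup.
for pin in HSA_OVERRIDE_GFX_VERSION=10.3.0 HSA_ENABLE_SDMA=0 ROCR_VISIBLE_DEVICES=0; do
  check_pin "${pin%%=*}" "${pin#*=}" || true
done
```

When the override is eventually removed, the expected-value list is the single place that needs to change.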
References
- ADR-0509: prior round that unset `VK_ICD_FILENAMES` / `VK_DRIVER_FILES` in the entrypoint.
- ADR-0514 / ADR-0528: GPU backend exposure invariants, `/dev/dri` whole-directory bind-mount.
- ADR-0530 / ADR-0538: HIP runtime works against `gfx1036` once initialised.
- ADR-0540: FFmpeg encoder matrix companion change to the GPU matrix.
- ROCm supported-GPU list: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
- Vulkan loader env-var reference: https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderInterfaceArchitecture.md
- Source: `req`. The user reported the 4-backend silent-fallback matrix in the dispatch brief that opened this PR (paraphrased: "fix the remaining dev-mcp container GPU plumbing gaps so all four GPU backends actually run on real hardware, not lavapipe and not CPU fallback").