ADR-1053: Default docker-compose runtime to nvidia and expand GPU capabilities
- Status: Accepted
- Date: 2026-06-04
- Deciders: Lusoris
- Tags: dev, cuda, docker, build
Context
dev/docker-compose.yml previously defaulted runtime: to runc for both the dev-mcp and smoke-probe-cron services, meaning GPU passthrough was only active when an operator remembered to pass CONTAINER_RUNTIME=nvidia explicitly. On any standard NVIDIA Container Toolkit host the containers silently lost all NVIDIA access (no libcuda.so, no nvidia-smi, no NVDEC/NVENC), breaking CUDA scoring, MCP probes, and the FFmpeg encode paths without any obvious error message.
Additionally, the deploy.resources.reservations.devices[].capabilities list was restricted to [gpu]. The NVIDIA Container Toolkit uses this list to decide which device files and library bind-mounts to inject at runtime. Omitting compute, utility, video, and graphics prevented the container from receiving libcuda.so (compute), nvidia-smi / NVML (utility), NVDEC/NVENC (video), and the Vulkan ICD JSON path needed for CUDA+NVTX interop (graphics — see x-common-env comment in docker-compose.yml).
The smoke-probe-cron service also lacked a deploy block entirely, so the NVIDIA Container Toolkit never injected any NVIDIA resources there, regardless of the runtime setting.
Decision
Change the default value of ${CONTAINER_RUNTIME} from runc to nvidia in both service definitions. Hosts without the NVIDIA Container Toolkit are instructed to override via CONTAINER_RUNTIME=runc docker compose up -d (documented in an inline comment). Expand capabilities to [gpu, compute, utility, video, graphics] in both services. Add a deploy block to smoke-probe-cron matching the one in dev-mcp.
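The resulting configuration can be sketched roughly as follows. This is an illustrative fragment, not the actual contents of dev/docker-compose.yml: the `driver: nvidia` and `count: all` entries and the `${CONTAINER_RUNTIME:-nvidia}` interpolation form are assumptions about how the default is expressed, and image, command, and environment keys are omitted.

```yaml
# Illustrative fragment only; the key layout is assumed, not copied from the repo.
services:
  dev-mcp:
    # Defaults to nvidia. On hosts without the NVIDIA Container Toolkit:
    #   CONTAINER_RUNTIME=runc docker compose up -d
    runtime: ${CONTAINER_RUNTIME:-nvidia}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia   # assumed driver entry
              count: all       # assumed; the ADR does not state a device count
              capabilities: [gpu, compute, utility, video, graphics]

  smoke-probe-cron:
    runtime: ${CONTAINER_RUNTIME:-nvidia}
    # deploy block added to match dev-mcp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu, compute, utility, video, graphics]
```

Repeating the deploy block in both services keeps each definition readable on its own; a YAML anchor under an `x-` extension key would remove the duplication at the cost of some indirection.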
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep runc default, document manual override | No change for non-NVIDIA hosts | Silently breaks CUDA on every NVIDIA host unless operator knows the override | Too error-prone; NVIDIA is the primary target |
| Environment-variable-only (no deploy block) | Simpler compose file | deploy.resources is required by Docker Swarm and compose v2 for device scheduling; runtime: alone does not guarantee NVML injection on newer Toolkit versions | Less portable across compose v2 variants |
| Separate compose override file (docker-compose.nvidia.yml) | Clean separation | Adds operational complexity; users must remember to pass -f docker-compose.nvidia.yml | Worse UX than a sensible default |
Consequences
- Positive: CUDA scoring, NVDEC/NVENC-accelerated FFmpeg, nvidia-smi, and NVML all work out of the box on NVIDIA hosts. The smoke probe correctly validates the CUDA backend on every 15-minute tick.
- Negative: Hosts without the NVIDIA Container Toolkit will see "Error response from daemon: unknown or invalid runtime name: nvidia" unless they set CONTAINER_RUNTIME=runc. This is documented in the inline comment.
- Neutral / follow-ups: dev/Containerfile and docs/development/dev-mcp.md may want a note about the override for non-NVIDIA CI environments; left as a follow-up item.
References
- req (user direction): wire NVIDIA GPU passthrough in docker-compose.yml; change runtime default to nvidia and expand capabilities for both services.
- Related: ADR-0509, ADR-0541 (GPU passthrough rationale), ADR-0528 (/dev/dri whole-directory bind-mount).
- NVIDIA Container Toolkit capabilities reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html