ADR-1053: Default docker-compose runtime to nvidia and expand GPU capabilities

  • Status: Accepted
  • Date: 2026-06-04
  • Deciders: Lusoris
  • Tags: dev, cuda, docker, build

Context

dev/docker-compose.yml previously defaulted runtime: to runc for both the dev-mcp and smoke-probe-cron services, so GPU passthrough was active only when an operator remembered to pass CONTAINER_RUNTIME=nvidia explicitly. On a standard NVIDIA Container Toolkit host, containers started without that override silently lost all NVIDIA access (no libcuda.so, no nvidia-smi, no NVDEC/NVENC), which broke CUDA scoring, MCP probes, and the FFmpeg encode paths without any obvious error message.

Additionally, the deploy.resources.reservations.devices[].capabilities list was restricted to [gpu]. The NVIDIA Container Toolkit uses this list to decide which device files and library bind-mounts to inject at runtime. Omitting compute, utility, video, and graphics prevented the container from receiving libcuda.so (compute), nvidia-smi / NVML (utility), NVDEC/NVENC (video), and the Vulkan ICD JSON path needed for CUDA+NVTX interop (graphics — see x-common-env comment in docker-compose.yml).

The smoke-probe-cron service also lacked a deploy block entirely, so the NVIDIA Container Toolkit never injected any NVIDIA resources there, regardless of the runtime setting.

Decision

Change the default value of ${CONTAINER_RUNTIME} from runc to nvidia in both service definitions. Hosts without the NVIDIA Container Toolkit are instructed to override via CONTAINER_RUNTIME=runc docker compose up -d (documented in an inline comment). Expand capabilities to [gpu, compute, utility, video, graphics] in both services. Add a deploy block to smoke-probe-cron matching the one in dev-mcp.
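
The decision can be sketched as the following compose fragment. This is an illustration under stated assumptions, not a copy of the actual file: the x-gpu-deploy anchor name, the driver: nvidia and count: all values, and the ${CONTAINER_RUNTIME:-nvidia} interpolation form are assumed here, and all unrelated keys (image, build, environment, and so on) are omitted.

```yaml
# Sketch of dev/docker-compose.yml after this ADR; unrelated keys omitted.
# "x-gpu-deploy" is a hypothetical anchor name used only to avoid repetition.
x-gpu-deploy: &gpu-deploy
  resources:
    reservations:
      devices:
        - driver: nvidia          # assumed; not stated in the ADR
          count: all              # assumed; not stated in the ADR
          # Expanded from [gpu] so the Toolkit injects libcuda.so (compute),
          # nvidia-smi / NVML (utility), NVDEC/NVENC (video), and the
          # Vulkan ICD JSON path (graphics).
          capabilities: [gpu, compute, utility, video, graphics]

services:
  dev-mcp:
    # Defaults to nvidia; hosts without the NVIDIA Container Toolkit
    # override with: CONTAINER_RUNTIME=runc docker compose up -d
    runtime: ${CONTAINER_RUNTIME:-nvidia}
    deploy: *gpu-deploy

  smoke-probe-cron:
    runtime: ${CONTAINER_RUNTIME:-nvidia}
    # Previously had no deploy block at all; now matches dev-mcp.
    deploy: *gpu-deploy
```

Sharing one anchor between the two services is only a convenience in this sketch; the ADR requires that the two deploy blocks match, not how that is achieved.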

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Keep runc default, document manual override | No change for non-NVIDIA hosts | Silently breaks CUDA on every NVIDIA host unless operator knows the override | Too error-prone; NVIDIA is the primary target |
| Environment-variable-only (no deploy block) | Simpler compose file | deploy.resources is required by Docker Swarm and compose v2 for device scheduling; runtime: alone does not guarantee NVML injection on newer Toolkit versions | Less portable across compose v2 variants |
| Separate compose override file (docker-compose.nvidia.yml) | Clean separation | Adds operational complexity; users must remember to pass -f docker-compose.nvidia.yml | Worse UX than a sensible default |

Consequences

  • Positive: CUDA scoring, NVDEC/NVENC-accelerated FFmpeg, nvidia-smi, and NVML all work out-of-the-box on NVIDIA hosts. The smoke probe correctly validates the CUDA backend on every 15-minute tick.
  • Negative: Hosts without the NVIDIA Container Toolkit will see Error response from daemon: unknown or invalid runtime name: nvidia unless they set CONTAINER_RUNTIME=runc. This is documented in the inline comment.
  • Neutral / follow-ups: dev/Containerfile and docs/development/dev-mcp.md may need a note about the CONTAINER_RUNTIME=runc override for non-NVIDIA CI environments; this is left as a follow-up item.

References