ADR-0699: VMAFX Helm Chart and Kubernetes Manifests with 3-Vendor GPU Device-Plugin Support
- Status: Proposed
- Date: 2026-05-28
- Deciders: Lusoris
- Tags: deploy, kubernetes, helm, gpu, cuda, hip, sycl, vulkan, fork-local
Context
The VMAFX fork rebrand (ADR-0686) and cloud-native redesign (ADR-0697) identified Kubernetes deployment as a required delivery surface. Operators need a standardised way to deploy the VMAFX scoring server, one-shot batch scoring jobs, and the MCP server inside Kubernetes clusters, with support for all three GPU vendors that the fork already supports at the binary level (NVIDIA/CUDA, AMD/HIP, Intel/SYCL).
Kubernetes GPU scheduling is mediated by vendor-specific device-plugin daemonsets that advertise extended resources (e.g. nvidia.com/gpu, amd.com/gpu, gpu.intel.com/i915). A workload requests one of these resources in its resources.limits; the scheduler places the pod on a compatible node and the kubelet allocates the physical device. Vulkan is not a separate resource — it runs through whichever vendor's device is allocated.
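As a minimal sketch of the request mechanism described above, a pod that needs one NVIDIA GPU might declare the extended resource as shown here. The pod name, container name, and image are placeholders rather than values taken from the chart; swapping the key for `amd.com/gpu` or `gpu.intel.com/i915` would target the other two vendors.

```yaml
# Hypothetical pod; the names and the image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vmafx-gpu-example
spec:
  restartPolicy: Never
  containers:
    - name: vmafx
      image: registry.example.com/vmafx:latest  # placeholder image
      resources:
        limits:
          # Extended resources are requested in whole units. If a request
          # is also given, it must equal the limit.
          nvidia.com/gpu: 1
```

The pod stays Pending until a node whose device plugin advertises the requested resource has a free device.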
Three distinct deployment patterns are required:
- Long-running server (Deployment): HTTP scoring endpoint; horizontal scale via replicas; rolling update.
- One-shot batch job (Job): `vmaf-tune compare` / `vmaf-tune ladder`; CI pipelines, nightly sweeps.
- Stateful MCP server (StatefulSet): sticky session state, stable DNS identity for socket-based MCP clients.
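The stable DNS identity mentioned for the StatefulSet pattern normally comes from pairing the StatefulSet with a headless Service. The sketch below illustrates that pairing only; the `vmafx-mcp` name, the labels, the image, and port 8080 are all assumptions, not values from the chart.

```yaml
# Hypothetical headless Service; name, labels, and port are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: vmafx-mcp
spec:
  clusterIP: None            # headless: DNS resolves to individual pod IPs
  selector:
    app.kubernetes.io/name: vmafx-mcp
  ports:
    - name: mcp
      port: 8080             # assumed port, not taken from the chart
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmafx-mcp
spec:
  serviceName: vmafx-mcp     # ties per-pod DNS names to the headless Service
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vmafx-mcp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vmafx-mcp
    spec:
      containers:
        - name: mcp
          image: registry.example.com/vmafx:latest  # placeholder image
          ports:
            - name: mcp
              containerPort: 8080
```

With this layout the first pod is reachable as `vmafx-mcp-0.vmafx-mcp.<namespace>.svc.cluster.local` (assuming the default cluster domain), and it keeps that name across restarts.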
Decision
Ship a Helm chart at deploy/helm/vmafx/ with:
- A `gpu` block in `values.yaml` that maps `vendor` (`nvidia | amd | intel | cpu`) to the correct Kubernetes extended-resource key via a `vmafx.gpuResource` named template in `_helpers.tpl`.
- A companion `vmafx.backendEnvValue` helper that sets `VMAFX_BACKEND` automatically, so the VMAFX runtime picks the right backend without operator intervention.
- Conditional workload templates (Deployment, Job, StatefulSet) selected via `values.workload`, sharing common pod-spec configuration.
- A `prometheus-pushgateway` optional dependency for Job workloads that push metrics rather than exposing a scrape endpoint.
- `helm lint` must pass with no errors; `helm template` output must be valid YAML for all four `gpu.vendor` values and all three `workload` types.
- Human-readable documentation in `docs/development/k8s-deployment.md` and `docs/development/gpu-scheduling.md`.
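To make the values-to-manifest mapping concrete, the sketch below pairs a possible `values.yaml` fragment with the container fragment it might render to for `gpu.vendor=amd`. Only `gpu.vendor`, `workload`, `image.repository`, and `image.tag` are named in this ADR. The `gpu.count` key, the `deployment` spelling, the pushgateway toggle, and the `hip` backend value are assumptions made for illustration.

```yaml
# Hypothetical values.yaml fragment.
workload: deployment          # assumed spelling; one of the three workload types
gpu:
  vendor: amd                 # nvidia | amd | intel | cpu
  count: 1                    # assumed key: devices requested per pod
image:
  repository: registry.example.com/vmafx   # placeholder
  tag: "0.0.0"                              # placeholder
prometheus-pushgateway:
  enabled: false              # assumed toggle for the optional dependency
---
# Container fragment the helpers might render for the values above.
env:
  - name: VMAFX_BACKEND
    value: hip                # assumed output of vmafx.backendEnvValue
resources:
  limits:
    amd.com/gpu: 1            # resource key selected by vmafx.gpuResource
```

Under the same assumptions, `nvidia` and `intel` would select `nvidia.com/gpu` and `gpu.intel.com/i915`, while `cpu` would presumably emit no extended resource at all.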
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Raw Kubernetes YAML (kustomize) | No Helm dependency; straightforward | No values templating, no helm lint, no dependency management; verbose for multi-vendor GPU logic | Helm is the de-facto operator standard and provides upgrade --install idempotency |
| Operator (controller-runtime) | Most powerful; CRD-driven lifecycle | Large scope, requires Go controller project, far exceeds the deployment requirement | Premature; a Helm chart is the right baseline; an operator can be layered later |
| Single workload type only | Simpler chart | Forces users to write custom manifests for batch or MCP scenarios | All three deployment patterns are user-discoverable and documented in ADR-0697 |
| Separate charts per vendor | Maximally explicit | Triplicates templates; GPU vendor is a runtime concern, not a chart concern | One chart with a vendor selector is idiomatic; keeps the operator mental model simple |
Consequences
- Positive:
  - Operators install VMAFX on any GPU vendor with a single `helm upgrade --install` + `--set gpu.vendor=<vendor>`.
  - `VMAFX_BACKEND` is set automatically; no manual env-var wiring is needed.
  - All three workload types (Deployment, Job, StatefulSet) are supported from day one.
  - Security context is hardened by default (non-root, read-only rootfs, all caps dropped).
  - `helm lint` passes with no errors for all vendor/workload combinations.
- Negative:
  - Helm dependency on `prometheus-pushgateway` requires `helm dependency build` before first install (even if pushgateway is disabled).
  - The chart ships without a real container image; operators must supply `image.repository` + `image.tag` matching their registry.
- Neutral / follow-ups:
  - A `values.schema.json` for strict schema validation can be added in a follow-up.
  - An Argo CD `Application` or Flux `HelmRelease` example can be added to docs.
  - A GitHub Actions workflow for `helm package` + OCI push to `ghcr.io/lusoris/charts` is a natural follow-up (ADR-0697 §Phase 2).
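The hardened security context listed among the positive consequences could look roughly like the container-level fragment below. It is a sketch built from standard Kubernetes fields, not a copy of the chart's template, and `allowPrivilegeEscalation: false` is an added assumption that commonly accompanies the three properties named above.

```yaml
# Container-level securityContext sketch; not copied from the chart.
securityContext:
  runAsNonRoot: true               # refuse to start as UID 0
  readOnlyRootFilesystem: true     # container filesystem is immutable
  allowPrivilegeEscalation: false  # assumption: not stated in this ADR
  capabilities:
    drop:
      - ALL                        # drop every Linux capability
```

With a read-only root filesystem, any scratch space the workload needs has to come from a mounted volume such as an `emptyDir`.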
References
- ADR-0697 (vmafx-cloud-native-redesign) — parent cloud-native umbrella
- ADR-0698 (vmafx-production-dockerfile) — sibling production container
- ADR-0686 (vmafx-rebrand-aggressive-modernization) — project rebrand umbrella
- req: user direction 2026-05-28 — Helm chart + K8s manifests with 3-vendor GPU device-plugin support