ADR-1047: Helm chart schema and values.yaml correctness fixes (R9 batch)
- Status: Accepted
- Date: 2026-06-04
- Deciders: Lusoris
- Tags: helm, k8s, bug
Context
The R9 bug hunt identified four correctness gaps in the Helm chart that cause silent misconfiguration:
- The `storage` key is defined in `values.schema.json` (with `mode` and `rclone` sub-fields) but absent from `values.yaml`. A user setting `storage.mode=rclone` at install time would pass schema validation but produce an undefined-key structure in the templates.
- Three top-level keys (`networkPolicy`, `auth`, and `otelCollector`) are documented in `values.yaml` but absent from the schema. Typos in those blocks are silently accepted at `helm lint`/`install` time.
- `gpu.count` carries `"minimum": 0`, making a value of 0 with `gpu.enabled: true` schema-valid. Every vendor device plugin treats 0 units as a silent no-op, so a misconfigured chart deploys a pod that requests no GPU and runs CPU-only without any warning.
- `gpu.enabled` is not listed in the `gpu` object's `required` array; a user who deletes the key gets no validation error from `helm lint`.
Decision
Fix all four gaps in a single atomic commit:
- Add `storage` with correct defaults (`mode: "http-serve"`, `rclone.config: ""`) to `values.yaml` so the key exists and the documented default is explicit.
- Add `networkPolicy`, `auth`, and `otelCollector` to `values.schema.json` with `additionalProperties: true` so nested sub-keys pass validation while still surfacing the top-level key in the validated surface.
- Change `gpu.count.minimum` from `0` to `1`.
- Add `"enabled"` to `gpu.required`.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| `additionalProperties: false` on `networkPolicy`/`auth`/`otelCollector` | Tighter validation | Would require exhaustive enumeration of every `allow.*` and `tenant.*` sub-field; high maintenance burden | Not chosen; `additionalProperties: true` catches top-level-key typos without requiring full enumeration |
| Conditional `if gpu.enabled then minimum: 1` | More precise | JSON Schema 2020-12 `if`/`then` is supported but adds complexity | Simple `minimum: 1` is sufficient; `count: 0` with `enabled: false` is an operator error anyway |
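
For comparison, the rejected conditional variant could be sketched roughly as follows for the `gpu` object. This is a hypothetical fragment, not taken from the chart: the property types are assumptions, and only the `if`/`then` keywords (available since JSON Schema draft-07) are the point of the example.

```json
{
  "$comment": "Hypothetical sketch of the rejected alternative; types are assumed, not copied from the chart.",
  "type": "object",
  "required": ["enabled"],
  "properties": {
    "enabled": { "type": "boolean" },
    "count": { "type": "integer", "minimum": 0 }
  },
  "if": {
    "properties": { "enabled": { "const": true } },
    "required": ["enabled"]
  },
  "then": {
    "properties": { "count": { "minimum": 1 } }
  }
}
```

Under this form `count: 0` stays valid while `enabled` is `false`, which is the extra precision the table refers to; the cost is a second place where the lower bound on `count` is defined.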
Consequences
- Positive: `helm lint` and `helm install --dry-run` will now catch the four classes of misconfiguration before they reach a cluster.
- Negative: Any user who was relying on `gpu.count: 0` as a valid value will receive a schema validation error. This is intentional: 0 GPUs is a misconfiguration.
- Neutral / follow-ups: The `storage` default value (`http-serve`) matches the existing controller behaviour; no template change required.
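
As a rough illustration of what the schema enforces after this change, the relevant part of `values.schema.json` might look like the fragment below. It is a sketch under assumptions: the property types and the surrounding nesting are guessed, and only the `required` entry, the `minimum` of 1, and the three `additionalProperties: true` blocks come from the decision itself.

```json
{
  "$comment": "Illustrative sketch only; property types and nesting are assumed, not copied from the chart.",
  "type": "object",
  "properties": {
    "gpu": {
      "type": "object",
      "required": ["enabled"],
      "properties": {
        "enabled": { "type": "boolean" },
        "count": { "type": "integer", "minimum": 1 }
      }
    },
    "networkPolicy": { "type": "object", "additionalProperties": true },
    "auth": { "type": "object", "additionalProperties": true },
    "otelCollector": { "type": "object", "additionalProperties": true }
  }
}
```

Note that `additionalProperties` already defaults to `true` in JSON Schema, so spelling it out mainly documents intent. A misspelled top-level key is rejected only if the root object itself disallows unknown properties, which this sketch does not show and which is assumed to be handled elsewhere in the schema.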
References
- R9 bug hunt report (2026-06-04).
- `deploy/helm/vmafx/values.yaml`, `deploy/helm/vmafx/values.schema.json`.