ADR-0870: Helm chart values.schema.json + dev-MCP Containerfile rebuild audit
- Status: Accepted
- Date: 2026-05-30
- Deciders: Lusoris
- Tags: helm, k8s, deploy, devx, dev-mcp, container, rebase-hygiene
Context
Two adjacent operational-hygiene gaps surfaced during the audit window on 2026-05-30:
- Helm chart had no `values.schema.json`. `deploy/helm/vmafx/values.yaml` carries ~25 top-level keys and three load-bearing enums (`workload` ∈ {Deployment, Job, StatefulSet}, `gpu.vendor` ∈ {nvidia, amd, intel, cpu}, `storage.mode` ∈ {http-serve, rclone}). Without a schema, a typo like `--set gpu.vendor=qualcomm` or `--set workload=Daemonset` is accepted silently at install time; the chart renders a manifest that either schedules onto a non-existent device-plugin resource or selects no workload branch at all, and the operator only discovers the mistake when pods fail to schedule (NVIDIA selector) or when no workload is created (silent template branch). The standard mitigation is a JSON Schema co-located with `values.yaml`, which `helm install` / `helm upgrade` / `helm lint` consult automatically.
- `dev/Containerfile` drifted from the post-ADR-0700 layout. ADR-0700 renamed `libvmaf/` → `core/` (and `python/vmaf/` → `compat/python-vmaf/` with a shim). The dev-MCP image's stage 3 was still doing `COPY --chown=vmaf:vmaf libvmaf/ /build/vmaf/libvmaf/` and `cd libvmaf && meson setup build`. Against current master that `COPY` fails with `"libvmaf": not found`, breaking the entire container build. The drift went unnoticed because the image hadn't been rebuilt since the rename merged. CLAUDE.md §15 says rebuild "when its image predates the last master sync that touched anything under `core/`, `mcp-server/`, `ai/`, `tools/vmaf-tune/`, or `dev/`", but until now nothing forced the audit on a schedule.
The audit also surveyed .dockerignore (still excluded libvmaf/build*/ only; missing the core/build*/ parallel) and hadolint findings on the Containerfile (8 pre-existing DL3003/DL4006/DL3002/DL3009 warnings, none HIGH severity).
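The path drift described above is the kind of thing a small scan could flag before a rebuild is attempted. The following is a minimal sketch, not part of this ADR's deliverables: the helper name `stale_lines` and the regular expression are assumptions made for illustration, and it only looks for the two patterns named in the Context (a `COPY` whose source is `libvmaf/`, and a `cd libvmaf`).

```python
import re

# Matches a COPY whose *source* is the old top-level directory, or a `cd` into it.
# Destination paths such as /build/vmaf/libvmaf/ are deliberately not matched.
STALE = re.compile(r"^\s*COPY\b.*\slibvmaf/\s|\bcd\s+libvmaf\b")


def stale_lines(containerfile_text):
    """Return (line_number, line) pairs that still use the pre-ADR-0700 path.

    Hypothetical helper: comment lines are skipped, so stale comments
    would need a separate check.
    """
    return [
        (number, line.strip())
        for number, line in enumerate(containerfile_text.splitlines(), start=1)
        if not line.lstrip().startswith("#") and STALE.search(line)
    ]


SAMPLE = """\
# meson.build lives in libvmaf/
COPY --chown=vmaf:vmaf libvmaf/ /build/vmaf/libvmaf/
RUN cd libvmaf && meson setup build
COPY --chown=vmaf:vmaf core/ /build/vmaf/core/
RUN cd core && meson setup build
"""

for number, line in stale_lines(SAMPLE):
    print(f"line {number}: {line}")
```

Run against the sample, this reports lines 2 and 3 and leaves the already-migrated `core/` lines alone.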
Decision
Land both fixes in one PR because they share a single audit cycle and the deliverables (changelog fragments, docs/state.md row, ADR-0108 checklist) collapse into one set:
- Add `deploy/helm/vmafx/values.schema.json` covering every top-level key in `values.yaml`. Enforce `enum` constraints on the three load-bearing fields (`workload`, `gpu.vendor`, `storage.mode`) plus `image.pullPolicy`, `service.type`, `persistence.accessMode`, `operator.logLevel`, `statefulSet.podManagementPolicy`, and `monitoring.serviceMonitor.scheme`. Use `additionalProperties: false` on the top level and on each typed sub-object to catch sibling-key typos (e.g., `replicaCounts` instead of `replicaCount`). Leave `affinity`, `tolerations`, `nodeSelector`, `podSecurityContext`, `securityContext`, `livenessProbe`, `readinessProbe`, `env`, `envFrom`, `podAnnotations`, `config`, and `topologySpreadConstraints` as generic `object`/`array` types: they are pass-through structures that the chart hands to Kubernetes verbatim, and constraining their shape would conflict with upstream `kubectl` semantics every time the API server adds a field.
- Fix the `dev/Containerfile` ADR-0700 path drift:
  - `COPY libvmaf/ … libvmaf/` → `COPY core/ … core/`, plus add `COPY compat/ … compat/` because the editable Python install pulls from `compat/python-vmaf/` via the `python/` shim.
  - `cd libvmaf && meson setup build` → `cd core && meson setup build` (two occurrences: the configure step and the `ninja install` step).
  - Update the two comment blocks that still call out `libvmaf/src/meson.build` and "meson.build lives in libvmaf/".
  - Extend `.dockerignore` with `core/build*/` siblings to the pre-existing `libvmaf/build*/` entries, with a comment naming ADR-0700.
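To make the schema item concrete, here is a minimal sketch of how `enum` and `additionalProperties: false` interact. It is an illustrative subset, not the shipped `values.schema.json`: only the three enums listed above are included, the schema is written as a Python dict rather than a JSON file, and `validate` is a hypothetical stand-in that handles just `enum`, `type: object`, `properties`, and `additionalProperties: false`. Helm's own validator produces differently worded messages.

```python
# Illustrative subset only; the real schema covers every top-level key.
SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "workload": {"enum": ["Deployment", "Job", "StatefulSet"]},
        "gpu": {
            "type": "object",
            "additionalProperties": False,
            "properties": {"vendor": {"enum": ["nvidia", "amd", "intel", "cpu"]}},
        },
        "storage": {
            "type": "object",
            "additionalProperties": False,
            "properties": {"mode": {"enum": ["http-serve", "rclone"]}},
        },
        # Pass-through structure: typed loosely on purpose.
        "nodeSelector": {"type": "object"},
    },
}


def validate(values, schema=SCHEMA, path=""):
    """Return a list of error strings (hypothetical, deliberately tiny validator)."""
    errors = []
    if "enum" in schema and values not in schema["enum"]:
        errors.append(f"{path or '/'}: value must be one of {schema['enum']}")
    if schema.get("type") == "object":
        if not isinstance(values, dict):
            return errors + [f"{path or '/'}: expected an object"]
        props = schema.get("properties", {})
        for key, value in values.items():
            child = f"{path}/{key}"
            if key in props:
                errors.extend(validate(value, props[key], child))
            elif schema.get("additionalProperties") is False:
                errors.append(f"{child}: unknown key")
    return errors


print(validate({"gpu": {"vendor": "qualcomm"}}))   # enum violation at /gpu/vendor
print(validate({"replicaCounts": 2}))              # sibling-key typo at top level
print(validate({"nodeSelector": {"disk": "ssd"}})) # pass-through object is accepted
```

The design point is the asymmetry: typed sub-objects reject unknown keys, while pass-through objects such as `nodeSelector` declare no `properties` and no `additionalProperties: false`, so any shape is accepted.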
The hadolint warnings remain unchanged. They are pre-existing non-HIGH-severity advisories (DL3003 cd in RUN is the standard multi-stage build idiom for ROCm/NEO/SVT-AV1 source builds where WORKDIR would break the &&-chained cleanup; DL3002 last-USER-root is forced by the FFmpeg make install step that needs root for /usr/local/; DL3009 / DL4006 occur on apt-mark-style verification RUNs that don't fetch lists). Suppressing them line-by-line is a separate cleanup pass tracked in BACKLOG; the audit's scope is the ADR-0700 drift, not a hadolint zero-warnings push.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Schema only, defer Containerfile to a separate PR | Smaller diff, narrower review | Two ADR-0108 deliverable rounds for related work; the Containerfile drift is a hard build break, not a nice-to-have | Both fixes share the audit cycle; bundling halves the process overhead |
| Containerfile only, schema deferred | Tighter scope | Schema is the higher-leverage fix (operator-facing typo-catching), and it costs ~5 minutes once the values.yaml has been read | Same logic in reverse: bundling is cheaper |
| Use `helm schema-gen` plugin to auto-generate the schema | No hand-curation effort | The plugin emits the loosest possible schema (every field optional, no enums); misses the typo-catching value entirely | Manual schema is ~250 lines and explicit, which is the point |
| Generate schema from `values.yaml` comments via a build step | Future-proof against `values.yaml` changes | New tooling dependency for one chart; the chart's top-level surface is stable enough that a hand-written schema is fine for now | Defer until a second chart appears |
| Replace `COPY libvmaf/` with `COPY . .` (whole repo) | Simplest fix | Re-introduces the `.corpus/` (781 GB) leak the explicit COPY list was added to prevent (see comment block above the COPY lines) | Hard regression of an earlier fix |
Consequences
- Positive: `helm install vmafx … --set gpu.vendor=qualcomm` fails fast with a clear error message (`at '/gpu/vendor': value must be one of 'nvidia', 'amd', 'intel', 'cpu'`) before any manifest is rendered. The dev-MCP container builds clean against current master without manual path rewrites.
- Positive: `additionalProperties: false` on typed sub-objects catches `replicaCounts` / `repostiory` / `maxSurg` typos at install time.
- Negative: every future addition to `values.yaml` now requires a matching schema entry, or `helm lint` will fail. The schema needs to be kept in lockstep with the chart; a CI check could enforce this in a follow-up.
- Neutral / follow-ups:
  - Add a CI gate that diffs `values.yaml` keys against `values.schema.json` properties on PRs that touch the helm chart.
  - Address hadolint advisories (DL3003 / DL4006 / DL3002 / DL3009) in a dedicated cleanup pass; they are non-HIGH-severity and out of scope here.
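The proposed CI gate could be as small as a recursive key comparison. The sketch below is one possible shape, under stated assumptions: `missing_from_schema` is a hypothetical helper, it takes already-parsed dicts (loading `values.yaml` would need a YAML parser, which is left out to keep the example stdlib-only), and the keys `gpu.count` and `newKey` are invented purely to trigger the check.

```python
def missing_from_schema(values, schema, path=""):
    """Return sorted dotted paths of keys in `values` with no schema property.

    Objects whose schema declares no `properties` are treated as pass-through
    and are not descended into.
    """
    props = schema.get("properties")
    if props is None:
        return []
    missing = []
    for key, value in values.items():
        dotted = f"{path}.{key}" if path else key
        if key not in props:
            missing.append(dotted)
        elif isinstance(value, dict):
            missing.extend(missing_from_schema(value, props[key], dotted))
    return sorted(missing)


# Hypothetical inputs: `gpu.count` and `newKey` exist only for this example.
values = {
    "workload": "Deployment",
    "gpu": {"vendor": "nvidia", "count": 1},
    "nodeSelector": {"disk": "ssd"},
    "newKey": True,
}
schema = {
    "type": "object",
    "properties": {
        "workload": {"enum": ["Deployment", "Job", "StatefulSet"]},
        "gpu": {"type": "object", "properties": {"vendor": {"type": "string"}}},
        "nodeSelector": {"type": "object"},
    },
}

drift = missing_from_schema(values, schema)
print(drift)  # a CI job would exit non-zero when this list is non-empty
```

Skipping objects without declared `properties` keeps the gate aligned with the decision to leave pass-through structures loosely typed.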
References
- See ADR-0700 for the `libvmaf/` → `core/` rename that the Containerfile drift came from.
- See ADR-0703 for the chart's `image.repository` target (`ghcr.io/vmafx/vmafx-server`).
- See ADR-0714 for the operator deployment that the schema's `operator` block validates.
- See ADR-0719 for the `storage.mode` enum and the rclone config secret surface.
- See ADR-0726: the `gpu.vendor` enum omits `vulkan` because the backend was removed.
- See ADR-0738 for the R610.43.02 driver floor referenced in `values.yaml` GPU comments.
- See `CLAUDE.md §15` for the dev-mcp container rebuild policy this audit enforces.
- Source: req (audit dispatch 2026-05-30, "Audit Helm chart values schema enforcement + verify the vmaf-dev-mcp container image still builds clean against current master").