ADR-1094: Helm chart rolling-update correctness (node strategy, PDB default, probe fix, grace period)
- Status: Accepted
- Date: 2026-06-07
- Deciders: Lusoris
- Tags: helm, kubernetes, deploy, fork-local
Context
Four correctness gaps in deploy/helm/vmafx/ were discovered during a targeted rolling-update audit:
- `templates/node.yaml`: no `spec.strategy`. The vmafx-node worker Deployment had no explicit strategy block. Kubernetes defaults to `RollingUpdate` with `maxUnavailable: 25%` and `maxSurge: 25%`, which means up to 25% of node pods can be terminated before their replacements are ready. For GPU-attached workers executing in-flight scoring jobs, this silently drops work (SIGTERM arrives before any replacement starts serving). The controller Deployment already had `maxUnavailable: 0`; the same treatment was missing from the node.
- `templates/node.yaml`: probes hit a phantom HTTP endpoint. Both `livenessProbe` and `readinessProbe` used `httpGet` on container port 9090 (named `metrics`). The vmafx-node binary (`cmd/vmafx-node/main.go`) exposes only a gRPC server (default `:50052`). There is no HTTP listener on 9090. Every probe failed with `ECONNREFUSED`, leaving all node pods permanently unready and never eligible to receive traffic. Cluster operators enabling `node.enabled=true` saw Deployments stuck at 0/N ready replicas.
- `templates/pdb.yaml`: `minAvailable: 1` blocks single-replica drain. The previous PDB default was `minAvailable: 1`. With `replicaCount: 1` (the chart default), enabling the PDB permanently prevents voluntary disruptions: Kubernetes cannot satisfy `minAvailable: 1` while draining the node that hosts the only pod. Node drains, cluster upgrades, and AKS/GKE node pool rotations all stall indefinitely. The correct default for arbitrary replica counts is `maxUnavailable: 1` (allow one voluntary disruption at a time); `minAvailable` should be opt-in for operators who need a hard lower bound on serving capacity.
- `terminationGracePeriodSeconds` absent from Deployment + StatefulSet. Both workload templates relied on Kubernetes' 30-second default. A full-feature VMAF score pass on a 1080p segment can take 15–40 s; vmaf-tune encode + score passes on long segments regularly exceed 30 s. A mid-frame SIGKILL corrupts the output JSON and forces a full re-run of the segment. 60 s is chosen as a conservative default that covers the vast majority of single-segment jobs; the value is configurable for operators running long CHUG extractions.
A secondary issue (StatefulSet updateStrategy.rollingUpdate implicit defaults) was also addressed by making maxUnavailable: 1 and partition: 0 explicit, removing silent reliance on Kubernetes internal defaults.
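The PDB gap above comes down to simple arithmetic in the disruption controller, which allows `currentHealthy - desiredHealthy` voluntary disruptions. The fragment below is an illustrative sketch (not captured cluster output) of the `spec` and `status` fields of a `policy/v1` PodDisruptionBudget for a single-replica release under the old and new defaults:

```yaml
# Case 1: old default (minAvailable: 1) with replicaCount: 1
spec:
  minAvailable: 1
status:
  expectedPods: 1
  currentHealthy: 1
  desiredHealthy: 1        # equals minAvailable
  disruptionsAllowed: 0    # 1 - 1 = 0, so every drain is refused
---
# Case 2: new default (maxUnavailable: 1) with replicaCount: 1
spec:
  maxUnavailable: 1
status:
  expectedPods: 1
  currentHealthy: 1
  desiredHealthy: 0        # expectedPods - maxUnavailable
  disruptionsAllowed: 1    # 1 - 0 = 1, so the single pod can be evicted
```

With more replicas, `maxUnavailable: 1` still limits voluntary disruptions to one pod at a time, which is why it behaves sensibly for any replica count.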
Decision
We will apply the following changes to deploy/helm/vmafx/:
- `templates/node.yaml`: add `spec.strategy` sourced from `node.strategy` (default: `maxUnavailable: 0, maxSurge: 1`); replace `httpGet` probes on phantom port 9090 with `tcpSocket` probes on the gRPC port (default 50052, configurable via `node.grpcPort`); rename the `vmafx-node-metrics` Service to `vmafx-node` and wire it to the gRPC port instead of the phantom metrics port; add `terminationGracePeriodSeconds` from the shared `terminationGracePeriodSeconds` value (default 60 s).
- `templates/pdb.yaml`: flip default precedence from `minAvailable`-first to `maxUnavailable`-first, matching the new `values.yaml` default of `maxUnavailable: 1`. `minAvailable` is still rendered when explicitly set.
- `templates/deployment.yaml` and `templates/statefulset.yaml`: add `terminationGracePeriodSeconds` from the shared `terminationGracePeriodSeconds` value (default 60 s).
- `values.yaml`: add `node.strategy`, `node.grpcPort`, `terminationGracePeriodSeconds`; change PDB default from `minAvailable: 1` to `maxUnavailable: 1`; add explicit `maxUnavailable` and `partition` to `statefulSet.updateStrategy`.
- `values.schema.json`: add schema entries for `terminationGracePeriodSeconds`, `node.strategy`, and `node.grpcPort`.
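As a rough sketch, the new defaults listed above might look like the following in `values.yaml`. The key names and default values are taken from this ADR; the exact nesting (in particular the `type`/`rollingUpdate` layout under `node.strategy`, and `enabled: false` for the PDB) is an assumption and may differ from the actual chart.

```yaml
# Shared by the controller Deployment, the StatefulSet and the node Deployment.
terminationGracePeriodSeconds: 60

podDisruptionBudget:
  enabled: false       # assumed: the ADR describes enabling the PDB as non-default
  maxUnavailable: 1    # new default; replaces minAvailable: 1

node:
  grpcPort: 50052      # used for the tcpSocket probes and the vmafx-node Service
  strategy:            # assumed layout, mirroring a Deployment's spec.strategy
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove a worker before its replacement is Ready
      maxSurge: 1

statefulSet:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # may be ignored where the MaxUnavailableStatefulSet feature gate is off
      partition: 0
```

Operators running long CHUG extractions would raise `terminationGracePeriodSeconds` in their own values file rather than changing the chart default.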
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Add HTTP `/readyz` endpoint to vmafx-node binary | Real application-layer readiness | Requires Go binary change; out of scope for a Helm-only audit | Defer to a separate PR when node metrics endpoint is added |
| Use `grpc` probe type (k8s 1.24+) | Native gRPC health protocol | Requires implementing gRPC health service in the node binary | Defer to the same metrics endpoint follow-up |
| Keep `minAvailable: 1` as PDB default with a warning comment | No behaviour change | Still blocks single-replica drains when enabled | Incorrect; the default must not be a footgun |
| Set `terminationGracePeriodSeconds: 300` | Covers CHUG extractions | Too long for rolling upgrades on small clusters | Operators raising the value is the documented path; 60 s is the sensible default |
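To make the probe trade-off concrete, the fragment below sketches the container-level probe stanzas for the chosen `tcpSocket` approach, with the deferred native `grpc` probe shown commented out. The port value is the default from this ADR; the absence of timing fields is a simplification, not the chart's actual settings.

```yaml
# Chosen here: TCP-level check against the gRPC listener.
# Succeeds as soon as the port accepts connections; it does not
# verify that the worker can actually serve scoring requests.
livenessProbe:
  tcpSocket:
    port: 50052
readinessProbe:
  tcpSocket:
    port: 50052

# Deferred alternative: native gRPC probe. The binary would need to
# implement the standard grpc.health.v1.Health service first.
# The port must be numeric; named ports are not accepted here.
# readinessProbe:
#   grpc:
#     port: 50052
```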
Consequences
- Positive: node Deployments now become Ready (probes were always failing before this fix); rolling upgrades no longer SIGKILL in-flight scoring jobs; enabling PDB on single-replica dev clusters no longer permanently blocks drain operations; grace period is visible and configurable.
- Negative: the `vmafx-node-metrics` Service name changes to `vmafx-node`; operators who have hard-coded the old service name in external tooling must update it. The PDB default changes from `minAvailable` to `maxUnavailable`; operators who explicitly set `podDisruptionBudget.enabled: true` (non-default) and relied on `minAvailable: 1` semantics must explicitly set `podDisruptionBudget.minAvailable: 1` and remove `maxUnavailable`.
- Neutral / follow-ups: add a Prometheus metrics HTTP endpoint and `/readyz` handler to the vmafx-node binary so the probe can be upgraded from `tcpSocket` to `httpGet`; align NetworkPolicy `controllerToNode.nodePort` (currently 50051 in the docs) with the actual gRPC default (50052).
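For operators affected by the PDB default change, a values override along the following lines should restore the previous behaviour. This is a sketch under two assumptions: that both keys live directly under `podDisruptionBudget`, and that the chart's schema accepts a `null` override. Helm treats a `null` value in a user-supplied values file as a request to delete the chart's default for that key.

```yaml
# my-values.yaml (hypothetical file name), passed with `helm upgrade -f my-values.yaml`
podDisruptionBudget:
  enabled: true
  minAvailable: 1        # hard lower bound on serving capacity
  maxUnavailable: null   # drop the chart's new default so only minAvailable is rendered
```

This only makes sense with more than one replica; with `replicaCount: 1` it reintroduces the blocked-drain problem described in the Context section.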
References
- ADR-1058: Helm chart security hardening (PDB, RBAC, NetworkPolicy schema).
- ADR-0930: Helm NetworkPolicy + PSS hardening.
- ADR-0713: vmafx-node Go worker binary.
- Related: `deploy/helm/vmafx/templates/node.yaml`, `pdb.yaml`, `deployment.yaml`, `statefulset.yaml`, `values.yaml`, `values.schema.json`.
- Source: agent audit round 4, rolling-update strategy correctness.