ADR-1058: Helm chart security hardening (PDB, RBAC split, metrics NetworkPolicy, schema tightening)
- Status: Accepted
- Date: 2026-06-06
- Deciders: Lusoris
- Tags: helm, k8s, rbac, security, networkpolicy
Context
A focused audit of deploy/helm/vmafx/ identified six actionable gaps:
- No PodDisruptionBudget: no `pdb.yaml` template exists. When `replicaCount >= 2` (HA mode), Kubernetes can evict all pods simultaneously during a node drain or cluster upgrade, causing a full service outage.
- RBAC over-permissive: `operator-rbac.yaml` issued a single `ClusterRole` + `ClusterRoleBinding` covering both CRD access (legitimately cluster-scoped) and namespaced resources (pods, events, leases). The `ClusterRole` granted the operator `create`/`update`/`patch`/`delete` on `pods` cluster-wide, violating the principle of least privilege on multi-tenant clusters.
- VmafxTenant missing from operator RBAC: the operator reconciler watches `VmafxTenant` CRs to enforce per-tenant auth policy (ADR-0794), but the `vmafxtenants` resource was absent from the RBAC rules, causing the controller-runtime watch setup to silently fail at startup.
- NetworkPolicy Prometheus gap: the `allow-http-ingress` policy opens only `service.targetPort` (8080). The vmafx-node metrics port (9090) had no allow rule, so Prometheus scraping was silently blocked under `networkPolicy.enabled=true`.
- `values.schema.json` `networkPolicy.allow` uses `additionalProperties: true`: typos in sub-keys (e.g. `enable` instead of `enabled`, `controllerToNodes` instead of `controllerToNode`) were silently accepted, defeating the purpose of the schema gate.
- No `podDisruptionBudget` in schema: after adding the PDB values key, the schema must be updated to validate it.
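To make the schema gap concrete, the following values override is a minimal sketch of the kind of input that would have passed validation while `networkPolicy.allow` used `additionalProperties: true`. The exact nesting of the sub-keys (for example, that `dns` carries an `enabled` flag) is an assumption made for illustration; only the key names `controllerToNode`, `dns`, and `enabled` come from this ADR.

```yaml
# Hypothetical override file; both mistakes below are accepted when the
# schema allows additional properties under networkPolicy.allow.
networkPolicy:
  enabled: true
  allow:
    controllerToNodes:   # typo: the real key is controllerToNode
      enabled: true
    dns:
      enable: true       # typo: the real key is enabled (assumed shape)
```

Because Helm validates values against `values.schema.json` during `helm install`, `helm upgrade`, `helm lint`, and `helm template`, a schema that rejects unknown keys turns mistakes like these into a validation error instead of a silently ignored setting.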
Decision
Fix all six gaps in a single PR:
- Add `deploy/helm/vmafx/templates/pdb.yaml` with `policy/v1` `PodDisruptionBudget` resources for the controller, node pool, and operator. Add `podDisruptionBudget.enabled` / `minAvailable` / `maxUnavailable` to `values.yaml` (disabled by default for single-replica dev deployments).
- Split the operator `ClusterRole` into:
  - `ClusterRole` + `ClusterRoleBinding` for CRD access only (`vmafxjobs`, `vmafxnodes`, `vmafxmodeltrainings`, `vmafxtenants`).
  - `Role` + `RoleBinding` (namespace-scoped) for pods, events, and leader-election leases in the release namespace.
- Add `vmafxtenants: [get, list, watch]` + `vmafxtenants/status: [get, update, patch]` to the CRD ClusterRole (fixing the silent watch failure).
- Add an `allow-node-metrics-ingress` NetworkPolicy with an ingress rule on port 9090 for vmafx-node pods, guarded by `networkPolicy.allow.nodeMetrics.enabled` (default `true` when networkPolicy is enabled). `fromPodSelector` allows narrowing the scraper identity.
- Change the `networkPolicy.allow` schema from `additionalProperties: true` to `additionalProperties: false` with exhaustively enumerated sub-keys (`controllerToNode`, `nodeEgressObjectStore`, `operatorToApiserver`, `nodeMetrics`, `dns`).
- Add a `podDisruptionBudget` property to `values.schema.json` with `additionalProperties: false`, validating `enabled`, `minAvailable`, and `maxUnavailable` (the latter two each accept an integer or a percentage string).
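The decisions above might render to manifests roughly like the following. This is a sketch of rendered output, not the chart's actual templates: the release prefix `vmafx`, the namespace `vmafx`, the API group `vmafx.example.com`, all label selectors, and the verbs on events and leases are assumptions chosen for illustration. The `RoleBinding` and `ClusterRoleBinding` objects, and the rules for the other three CRDs, are omitted for brevity.

```yaml
# PodDisruptionBudget for the node pool (one of three PDBs in pdb.yaml).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vmafx-node                      # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: node # hypothetical label
---
# Cluster-scoped half of the split: CRD access only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vmafx-operator-crds
rules:
  - apiGroups: ["vmafx.example.com"]    # hypothetical API group
    resources: ["vmafxtenants"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["vmafx.example.com"]
    resources: ["vmafxtenants/status"]
    verbs: ["get", "update", "patch"]
---
# Namespace-scoped half: pods, events, and leader-election leases.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: vmafx-operator-ns
  namespace: vmafx                      # the release namespace (assumed)
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]          # assumed verbs
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]  # assumed verbs
---
# Ingress rule that lets a scraper reach the node metrics port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vmafx-allow-node-metrics-ingress
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: node # hypothetical label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        # A bare podSelector only matches pods in the same namespace; a
        # scraper in another namespace would also need a namespaceSelector.
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus  # hypothetical scraper label
      ports:
        - protocol: TCP
          port: 9090
```

The point of the split is visible in the `kind` fields: pod write access now lives in a `Role`, which cannot grant anything outside its own namespace, while the `ClusterRole` is reduced to the custom resources.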
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep single ClusterRole, add namespace filter in rules | Simpler manifest | Kubernetes RBAC has no namespace filter on ClusterRole rules — the filter is only on RoleBindings | Not valid; only a Role scoped to a namespace achieves the required restriction |
| `maxUnavailable: 1` as PDB default | More permissive, survives single pod loss | On a 2-replica deployment `minAvailable: 1` and `maxUnavailable: 1` are equivalent; `minAvailable` is more intuitive | Chose `minAvailable: 1` as default; operators can switch via values |
| Leave `networkPolicy.allow` as `additionalProperties: true` | Lower maintenance burden | Silently swallows typos, which undermines the schema gate | Rejected; the exhaustive enumeration is already small and stable |
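The equivalence claimed for the 2-replica case, and where it stops holding, can be worked through with two alternative overrides. This is an illustrative sketch using the `podDisruptionBudget` keys named in this ADR; the two documents are alternatives, since a PodDisruptionBudget spec accepts only one of `minAvailable` and `maxUnavailable`.

```yaml
# Alternative A: the chosen default style.
podDisruptionBudget:
  enabled: true
  minAvailable: 1
  # 2 replicas: 2 - 1 = 1 pod may be evicted at a time.
  # 3 replicas: 3 - 1 = 2 pods may be evicted at a time.
---
# Alternative B: the option considered and not chosen as default.
podDisruptionBudget:
  enabled: true
  maxUnavailable: 1
  # 2 replicas: 1 pod may be evicted at a time (same as alternative A).
  # 3 replicas: still 1 pod at a time (stricter than alternative A).
```

The two settings only coincide at exactly two replicas. With more replicas, `minAvailable: 1` permits progressively more simultaneous evictions, so larger installs may prefer switching via values.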
Consequences
- Positive: PDB prevents full outage during voluntary disruptions in HA mode. RBAC now follows least-privilege. Prometheus scraping works under default-deny NetworkPolicy. Schema catches `networkPolicy.allow` typos.
- Negative: The RBAC split introduces a second binding object per `operator.enabled=true` install; `helm diff` will show the rename from `*-operator-role` to `*-operator-crds` / `*-operator-ns`. Operators must run `helm upgrade` (not an in-place patch) to pick up the new ClusterRole name.
- Neutral / follow-ups: `podDisruptionBudget.enabled` defaults to `false`, so existing single-replica installs are unaffected. Follow-up: document the recommended PDB settings for HA deployments in `docs/development/k8s-deployment.md`.
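Since the PDB is opt-in, an HA install has to enable it explicitly. The override below is one possible shape, not the recommended settings that the follow-up is meant to document. The file name, the top-level position of `replicaCount`, and the location and label-map form of `fromPodSelector` are assumptions; the remaining keys are the ones named in this ADR.

```yaml
# values-ha.yaml (hypothetical file name)
replicaCount: 2              # assumed to be a top-level key
podDisruptionBudget:
  enabled: true              # defaults to false
  minAvailable: 1
operator:
  enabled: true
networkPolicy:
  enabled: true
  allow:
    nodeMetrics:
      enabled: true          # already the default when networkPolicy is enabled
      fromPodSelector:       # assumed location and shape
        app.kubernetes.io/name: prometheus
```

Such a file would be applied with `helm upgrade <release> deploy/helm/vmafx -f values-ha.yaml`, which also picks up the renamed RBAC objects.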
References
- Helm chart deep-audit (2026-06-06 parallel-push agent).
- ADR-0714: vmafx-operator kubebuilder skeleton.
- ADR-0794: multi-tenant auth gateway (VmafxTenant CRs).
- ADR-0930: Helm NetworkPolicy + Pod Security Standards baseline.
- ADR-1047: Helm chart schema and values.yaml correctness fixes (R9 batch).
- `deploy/helm/vmafx/templates/operator-rbac.yaml`
- `deploy/helm/vmafx/templates/pdb.yaml`
- `deploy/helm/vmafx/templates/networkpolicy.yaml`
- `deploy/helm/vmafx/values.yaml`
- `deploy/helm/vmafx/values.schema.json`