ADR-1069: Operator CRD status-schema gaps and VmafxNode LastHeartbeat ownership
- Status: Accepted
- Date: 2026-06-06
- Deciders: Lusoris
- Tags: operator, crd, k8s, bug
Context
Two correctness bugs were identified in the vmafx-operator (helm + operator deep audit, R14):
Bug 1: controllerJobID missing from VmafxJob CRD schema
VmafxJobStatus.ControllerJobID is declared in api/vmafx/v1/vmafxjob_types.go and is the field the VmafxJobReconciler reads to determine whether a job has been accepted by the vmafx-controller scheduler. However, the field was absent from both config/crd/bases/vmafx.dev_vmafxjobs.yaml and deploy/helm/vmafx/crds/vmafx.dev_vmafxjobs.yaml.
When the CRD OpenAPI schema does not include a field, the Kubernetes API server's structural schema pruning silently drops the field on every write. An external scheduler writing controllerJobID to the status subresource would have the value stripped, causing the reconciler to loop forever in the "waiting for scheduler" branch — visible only as a perpetual Pending phase with no error in the logs.
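The pruning effect can be illustrated with a minimal sketch. This is a simplified model, not the API server's implementation: real pruning is recursive and driven by the full structural schema. The `pruneUnknown` helper, the `phase` key, and the `job-123` value are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// pruneUnknown returns a copy of obj that keeps only the keys listed in
// known. It is a flat, simplified stand-in for the pruning the API server
// applies to fields that a CRD's structural schema does not declare.
func pruneUnknown(obj map[string]any, known map[string]bool) map[string]any {
	out := make(map[string]any, len(obj))
	for k, v := range obj {
		if known[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Hypothetical status payload written by an external scheduler.
	status := map[string]any{"phase": "Pending", "controllerJobID": "job-123"}

	// Schema before the fix: controllerJobID is not declared.
	before := map[string]bool{"phase": true}
	// Schema after the fix: controllerJobID is declared.
	after := map[string]bool{"phase": true, "controllerJobID": true}

	_, keptBefore := pruneUnknown(status, before)["controllerJobID"]
	_, keptAfter := pruneUnknown(status, after)["controllerJobID"]
	fmt.Println("controllerJobID kept before fix:", keptBefore)
	fmt.Println("controllerJobID kept after fix:", keptAfter)
}
```

The write itself succeeds in both cases, which is why the failure surfaces only as a job that never leaves Pending.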
Bug 2: VmafxNode reconciler unconditionally overwrites LastHeartbeat
The VmafxNodeReconciler.Reconcile function performed node.Status.LastHeartbeat = &nowMeta on every reconcile cycle, setting the field to the operator's local clock time.
status.lastHeartbeat is intended to be written exclusively by the node agent via the vmafx-controller's Heartbeat RPC — the operator reads it to detect stale nodes (nodes whose agent has stopped sending heartbeats). Overwriting the field every 30 s caused:
- The stale-threshold check (nodeStaleThreshold = 60 s) could never fire after the first detection cycle: after detecting stale, the operator reset LastHeartbeat to now, making the field appear fresh on the next reconcile.
- The Healthy field would oscillate between false and true every 30 s for a node whose agent had stopped: one reconcile marks it unhealthy (stale), the next marks it healthy (fresh timestamp written last cycle, probe succeeds).
Decision
Fix 1
Add controllerJobID to the status.properties block of both CRD YAML files with type: string, matching the Go struct definition.
Fix 2
Remove the node.Status.LastHeartbeat = &nowMeta write from VmafxNodeReconciler.Reconcile. The reconciler must only write status.healthy. status.lastHeartbeat is written exclusively by the node agent; the operator only reads it.
Update the controller's doc comment and the tests, and drop the now-unused metav1 import.
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Add x-kubernetes-preserve-unknown-fields: true to the VmafxJob status schema | Simpler, no field listing | Opts out of structural schema for the entire status; loses all other validation | Not chosen: field-by-field schema is the correct pattern |
| Have the operator track its own probe time in a separate lastProbeTime field | Preserves both timestamps | Requires a new status field and CRD schema update | Out of scope; the key fix is to stop clobbering the node-owned field |
Consequences
- Positive: External schedulers can reliably set controllerJobID; the reconciler will correctly detect when the field is present and begin polling.
- Positive: status.lastHeartbeat now accurately reflects the node agent's last reported heartbeat, not the operator's probe time. Stale detection fires and stays fired until the agent resumes.
- Test impact: Three tests in vmafxnode_controller_test.go were updated to reflect the correct ownership semantics: LastHeartbeat is nil for nodes that have never had an agent heartbeat, and unchanged after reconcile for nodes with a stale heartbeat.
References
- R14 helm + operator deep audit (2026-06-06).
- api/vmafx/v1/vmafxjob_types.go, config/crd/bases/vmafx.dev_vmafxjobs.yaml, deploy/helm/vmafx/crds/vmafx.dev_vmafxjobs.yaml, cmd/vmafx-operator/internal/controller/vmafxnode_controller.go.
- ADR-0786: vmafx-operator Stage 2 (parent).
- ADR-0714: vmafx-operator kubebuilder skeleton + CRDs (grandparent).