ADR-1069: Operator CRD status-schema gaps and VmafxNode LastHeartbeat ownership

Status: Accepted
Date: 2026-06-06
Deciders: Lusoris
Tags: operator, crd, k8s, bug

Context

Two correctness bugs were identified in the vmafx-operator (helm + operator deep audit, R14):

Bug 1: `controllerJobID` missing from VmafxJob CRD schema

VmafxJobStatus.ControllerJobID is declared in api/vmafx/v1/vmafxjob_types.go and is the field the VmafxJobReconciler reads to determine whether a job has been accepted by the vmafx-controller scheduler. However, the field was absent from both config/crd/bases/vmafx.dev_vmafxjobs.yaml and deploy/helm/vmafx/crds/vmafx.dev_vmafxjobs.yaml.

When the CRD OpenAPI schema does not include a field, the Kubernetes API server's structural schema pruning silently drops the field on every write. An external scheduler writing controllerJobID to the status subresource would have the value stripped, causing the reconciler to loop forever in the "waiting for scheduler" branch — visible only as a perpetual Pending phase with no error in the logs.
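The effect of pruning on the reconciler can be illustrated with a small, self-contained Go sketch. It is not the API server's implementation: `prune`, `waitingForScheduler`, and `storedStatusWaits` are hypothetical helpers, the `phase` key stands in for whatever other status fields the CRD declares, and `job-123` is a made-up ID. Only the `controllerJobID` field name comes from the text above.

```go
package main

import "fmt"

// prune returns a copy of obj that keeps only the keys declared in schema.
// This is a rough model of how undeclared fields are dropped from a custom
// resource on write; the real pruning logic is recursive and type-aware.
func prune(obj map[string]any, schema map[string]bool) map[string]any {
	out := make(map[string]any, len(obj))
	for k, v := range obj {
		if schema[k] {
			out[k] = v
		}
	}
	return out
}

// waitingForScheduler models the reconciler branch described above: the job
// counts as accepted only once controllerJobID is a non-empty string.
func waitingForScheduler(status map[string]any) bool {
	id, _ := status["controllerJobID"].(string)
	return id == ""
}

// storedStatusWaits passes a status carrying controllerJobID through prune
// and reports whether the reconciler would still be waiting afterwards.
func storedStatusWaits(schemaHasField bool) bool {
	// "phase" is a hypothetical stand-in for the other declared fields.
	schema := map[string]bool{"phase": true}
	if schemaHasField {
		schema["controllerJobID"] = true
	}
	written := map[string]any{"phase": "Pending", "controllerJobID": "job-123"}
	return waitingForScheduler(prune(written, schema))
}

func main() {
	fmt.Println("field missing from schema, still waiting:", storedStatusWaits(false))
	fmt.Println("field declared in schema, still waiting:", storedStatusWaits(true))
}
```

With the field missing from the schema, the write succeeds but the stored status never contains the ID, so the sketch keeps reporting that the reconciler is waiting. Once the field is declared, the same write is visible to the reconciler.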

Bug 2: VmafxNode reconciler unconditionally overwrites `LastHeartbeat`

The VmafxNodeReconciler.Reconcile function performed node.Status.LastHeartbeat = &nowMeta on every reconcile cycle, setting the field to the operator's local clock time.

status.lastHeartbeat is intended to be written exclusively by the node agent via the vmafx-controller's Heartbeat RPC — the operator reads it to detect stale nodes (nodes whose agent has stopped sending heartbeats). Overwriting the field every 30 s caused:

The stale-threshold check (nodeStaleThreshold = 60 s) could never fire after the first detection cycle: after detecting stale, the operator reset LastHeartbeat to now, making the field appear fresh on the next reconcile.
The Healthy field would oscillate between false and true every 30 s for a node whose agent had stopped — one reconcile marks it unhealthy (stale), the next marks it healthy (fresh timestamp written last cycle, probe succeeds).
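The masking effect can be reproduced with a simplified model of the old reconcile logic. This is a sketch under assumptions, not the operator's code: `nodeStatus`, `buggyReconcile`, and `simulateBuggy` are hypothetical names, a nil heartbeat is assumed to count as not fresh, and `Healthy` is assumed to require both a successful probe and a fresh heartbeat. The 60 s threshold and 30 s interval are taken from the text above. Only the first two cycles are simulated.

```go
package main

import (
	"fmt"
	"time"
)

const (
	nodeStaleThreshold = 60 * time.Second // stale threshold from the ADR
	reconcileInterval  = 30 * time.Second // reconcile period from the ADR
)

// nodeStatus is a simplified stand-in for the VmafxNode status.
type nodeStatus struct {
	Healthy       bool
	LastHeartbeat *time.Time
}

// buggyReconcile models the old behaviour: it evaluates staleness and then
// overwrites LastHeartbeat with the operator's own clock.
// Assumption: a nil heartbeat counts as not fresh.
func buggyReconcile(s *nodeStatus, now time.Time, probeOK bool) {
	fresh := s.LastHeartbeat != nil && now.Sub(*s.LastHeartbeat) <= nodeStaleThreshold
	s.Healthy = probeOK && fresh
	s.LastHeartbeat = &now // the bug: clobbers the agent-owned timestamp
}

// simulateBuggy reconciles a node whose agent stopped five minutes before
// the first cycle and returns the Healthy value observed after each cycle.
func simulateBuggy(cycles int) []bool {
	start := time.Date(2026, 6, 6, 12, 0, 0, 0, time.UTC)
	stale := start.Add(-5 * time.Minute)
	s := nodeStatus{LastHeartbeat: &stale}
	healthy := make([]bool, 0, cycles)
	for i := 0; i < cycles; i++ {
		buggyReconcile(&s, start.Add(time.Duration(i)*reconcileInterval), true)
		healthy = append(healthy, s.Healthy)
	}
	return healthy
}

func main() {
	fmt.Println(simulateBuggy(2)) // [false true]
}
```

The first cycle correctly reports the node as unhealthy. The second cycle then sees the timestamp the operator itself wrote 30 s earlier and reports the node as healthy, even though the agent never sent another heartbeat.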

Decision

Fix 1

Add controllerJobID to the status.properties block of both CRD YAML files with type: string, matching the Go struct definition.

Fix 2

Remove the node.Status.LastHeartbeat = &nowMeta write from VmafxNodeReconciler.Reconcile. The reconciler must only write status.healthy. status.lastHeartbeat is written exclusively by the node agent; the operator only reads it.
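A minimal sketch of the intended ownership split follows. It is illustrative only: `nodeStatus`, `reconcile`, and `runStale` are hypothetical names rather than the operator's actual types, and treating a nil heartbeat as not fresh is an assumption of this sketch. The point it demonstrates is that the reconcile step reads `LastHeartbeat` and writes only `Healthy`.

```go
package main

import (
	"fmt"
	"time"
)

const (
	nodeStaleThreshold = 60 * time.Second // stale threshold from the ADR
	reconcileInterval  = 30 * time.Second // reconcile period from the ADR
)

// nodeStatus is a simplified stand-in for the VmafxNode status.
type nodeStatus struct {
	Healthy       bool
	LastHeartbeat *time.Time // owned by the node agent, read-only here
}

// reconcile derives Healthy from the agent-owned heartbeat and the probe
// result. It never assigns LastHeartbeat.
// Assumption: a nil heartbeat counts as not fresh.
func reconcile(s *nodeStatus, now time.Time, probeOK bool) {
	fresh := s.LastHeartbeat != nil && now.Sub(*s.LastHeartbeat) <= nodeStaleThreshold
	s.Healthy = probeOK && fresh
}

// runStale reconciles a node whose agent stopped five minutes before the
// first cycle. It returns Healthy per cycle and whether LastHeartbeat still
// holds the agent's original value afterwards.
func runStale(cycles int) ([]bool, bool) {
	start := time.Date(2026, 6, 6, 12, 0, 0, 0, time.UTC)
	agentBeat := start.Add(-5 * time.Minute)
	beat := agentBeat
	s := nodeStatus{LastHeartbeat: &beat}
	healthy := make([]bool, 0, cycles)
	for i := 0; i < cycles; i++ {
		reconcile(&s, start.Add(time.Duration(i)*reconcileInterval), true)
		healthy = append(healthy, s.Healthy)
	}
	unchanged := s.LastHeartbeat != nil && s.LastHeartbeat.Equal(agentBeat)
	return healthy, unchanged
}

func main() {
	healthy, unchanged := runStale(4)
	fmt.Println("healthy per cycle:", healthy)
	fmt.Println("heartbeat unchanged:", unchanged)
}
```

Because the reconcile step no longer touches the timestamp, a node with a stopped agent stays unhealthy on every cycle, and the stored heartbeat keeps the agent's last reported value.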

Update the controller's doc comment and the tests, and drop the now-unused metav1 import.

Alternatives considered

| Option | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Add `x-kubernetes-preserve-unknown-fields: true` to the VmafxJob status schema | Simpler, no field listing | Disables pruning for the entire status subtree; undeclared or misspelled fields are persisted without validation | Not chosen: a field-by-field schema is the correct pattern |
| Have the operator track its own probe time in a separate `lastProbeTime` field | Preserves both timestamps | Requires a new status field and CRD schema update | Out of scope; the key fix is to stop clobbering the node-owned field |

Consequences

Positive: External schedulers can reliably set controllerJobID; the reconciler will correctly detect when the field is present and begin polling.
Positive: status.lastHeartbeat now accurately reflects the node agent's last reported heartbeat, not the operator's probe time. Stale detection fires and stays fired until the agent resumes.
Test impact: Three tests in vmafxnode_controller_test.go were updated to reflect the correct ownership semantics: LastHeartbeat is nil for nodes that have never had an agent heartbeat, and unchanged after reconcile for nodes with a stale heartbeat.

References

R14 helm + operator deep audit (2026-06-06).
api/vmafx/v1/vmafxjob_types.go, config/crd/bases/vmafx.dev_vmafxjobs.yaml, deploy/helm/vmafx/crds/vmafx.dev_vmafxjobs.yaml, cmd/vmafx-operator/internal/controller/vmafxnode_controller.go.
ADR-0786: vmafx-operator Stage 2 (parent).
ADR-0714: vmafx-operator kubebuilder skeleton + CRDs (grandparent).