ADR-0961: Controller queue — roll back PullWork on post-update Get failure (round-25 audit B.1)¶
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris, Claude (Anthropic)
- Tags: go, controller, correctness, queue, phase4b, fork-local
Context¶
Round-25 scheduler-correctness audit (bug B.1) identified a permanent job-stranding scenario in cmd/vmafx-controller/queue/queue.go, PullWork.
The sequence that triggers the bug:
PullWorkremoves the job frompendingFIFO(line 273).- The SQL
UPDATE jobs SET status='running', assigned_node=<nodeID>commits successfully (line 277). q.runningSet[matchID]is populated (line 287).q.getUnlocked(matchID)is called to return the fully-hydrated job to the caller. If this read fails (transient I/O error, disk full, corrupt page), the function returned an error at line 291 — but the in-memory and SQL states were not restored.
Net result: the job's SQL status is running with assigned_node set; it is absent from pendingFIFO; no caller ever receives it; and it can only be recovered by restarting the controller (which runs reload(), which resets running → pending).
This is a correctness hole: a transient failure permanently strands a job until the next controller restart, which may be hours or days away in a production cluster.
The fix applies the "guess + check → act" principle from the project-wide correctness rule (feedback_correctness_first) — once we confirmed the rollback path was entirely missing, we added it rather than papering over with a looser timeout or retry at a higher layer.
Decision¶
On getUnlocked failure inside PullWork, after the SQL UPDATE to running has committed:
- Execute
UPDATE jobs SET status='pending', assigned_node=NULL WHERE id=?to restore the SQL row. delete(q.runningSet, jobID)to remove the in-memory running entry.- Re-prepend the job ID to
pendingFIFOso the nextPullWorkcall sees it. - Return the original
getUnlockederror to the caller.
If the rollback SQL itself fails, log a CRITICAL-level message identifying the job as stranded and return a wrapped error carrying both the original error and the rollback error so operators see both in the log stream.
A getUnlockedHook func(id string) error field (nil in production) is added to SQLiteQueue to allow tests to inject failures via SetGetUnlockedHookForTest. The hook is an in-package escape hatch; it is not part of the Queue interface.
Alternatives considered¶
| Alternative | Pros | Cons |
|---|---|---|
Wrap the entire PullWork body in a SQLite transaction | True atomicity; no rollback code needed | modernc.org/sqlite in WAL mode serialises writes; holding a transaction open across the getUnlocked read adds contention; complexity of BEGIN/COMMIT/ROLLBACK scattered across methods |
| Accept the orphan; rely on controller restart | Zero code change | Job is unavailable until restart; unacceptable in a long-running cluster |
Add a periodic reload()-style scrub | Self-healing without explicit rollback | Introduces timer goroutine + lock contention; delay is unbounded (scrub interval); root cause still present |
| Close the DB to inject test failure | Clean simulation | Causes all subsequent calls to fail; makes post-rollback assertions impossible in the same test |
The explicit rollback (chosen approach) is the minimal, correct fix: it closes the exact gap with no new goroutines, no transaction scope creep, and a clear audit trail.
Consequences¶
Positive
- Jobs are never permanently stranded due to a transient post-UPDATE read failure.
- The CRITICAL log message provides an unambiguous operator signal when the rollback itself fails.
- Test coverage via
TestPullWork_GetUnlockedFailure_RollsBackToPendingexercises all three rollback assertions (SQL status, runningSet, FIFO).
Negative
- The rollback SQL adds one extra write on the failure path; this is acceptable because the path is already an error condition.
- The
getUnlockedHookfield is a production-visible (exported setter) test escape hatch; it must not be called outside tests. The naming convention (ForTestsuffix) makes the restriction clear.
Neutral follow-ups
- A future ADR may wrap
PullWorkin a transaction once a performance baseline for WAL-mode write serialisation is established.
References¶
req: round-25 audit bug B.1 — "After SQL UPDATE succeeds and runningSet is populated, getUnlocked failure returns error without rolling back SQL state or runningSet"- ADR-0711: vmafx-controller Phase 4b.1 scope (original queue implementation)
feedback_correctness_first: project-wide rule — investigate root cause, never lower thresholds or skip gates- PR: fix(queue): roll back PullWork RUNNING state when post-update Get fails (round-25 audit B.1)