ADR-1012: Go queue state-machine guards — PullWork AND-status, ReportResult idempotency¶
- Status: Accepted
- Date: 2026-06-04
- Deciders: Lusoris
- Tags:
go,controller,queue,correctness,concurrency
Context¶
Two related state-machine correctness defects in cmd/vmafx-controller/queue/queue.go identified by the r3/r4 audit:
-
PullWorkUPDATE lacksAND status=?guard (line 299) —PullWorkdequeues a job frompendingFIFOwhile holdingq.mu, then updates SQLite toRUNNING. If a concurrentCancelRPC wins the SQLite write first (setting the job toCANCELLED) whilePullWorkholdsq.mu, the subsequentPullWorkUPDATE overwrites theCANCELLEDstatus withRUNNINGand the cancellation is silently lost. AddingAND status=?withStatusPendingmakes the UPDATE conditional and returns zeroRowsAffectedwhen a race occurred. -
ReportResultUPDATE overwrites any existing status (line 378) — A node that retriesReportResultafter a transient gRPC error (server received, client timed out) will execute the UPDATE twice. If aCancelarrived between the two attempts the second UPDATE silently reinstates the result, overwriting theCANCELLEDstatus. AddingAND status NOT IN ('completed','failed','cancelled')makesReportResultidempotent: the second call hits zero rows and logs an informational message.
Note: the getUnlocked failure rollback path was fixed in ADR-0961 (rollbackTopending). This ADR closes the two remaining r3-state-machine / r4-retry-idempotency findings for PullWork and ReportResult.
Decision¶
-
Change the
PullWorkUPDATE to:UPDATE jobs SET status=?,… WHERE id=? AND status=?withStatusPendingas the second WHERE argument. On zeroRowsAffected, re-prepend the job topendingFIFOand return an error so the caller retries. -
Change the
ReportResultUPDATE to:UPDATE jobs SET status=?,… WHERE id=? AND status NOT IN (?,?,?)withStatusCompleted, StatusFailed, StatusCancelled. On zero rows, log at INFO level and return nil (idempotent success).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Distributed lock / advisory lock | Fully serialised | Overkill; SQLite is single-writer | Rejected |
| Read-then-check before UPDATE | Simple | TOCTOU race between read and write | Rejected |
| Conditional UPDATE + RowsAffected check | Atomic, idiomatic SQLite | Slightly more code | Chosen |
Consequences¶
- Positive:
Cancelcan never be silently overwritten by a racingPullWork;ReportResultretries are safe. - Neutral: Callers of
PullWorkthat receive the "job was cancelled before assignment" error are expected to retry the pull — normal queue behaviour. - Negative: A
PullWorkcaller that does not retry on this error class will see a spurious error. The controller's own retry loop already handles this.
References¶
- r3/r4 audit findings:
[r3-state-machine],[r4-retry-idempotency] - ADR-0961 (PullWork rollback on post-update Get failure)