ADR-0962: Controller fixes — implement StreamJobs snapshot and add reaper stop signal (round-25 audit B.3 + B.4)¶
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris
- Tags: controller, grpc, go, correctness, goroutine, phase4b, fork-local
Context¶
A round-25 audit of cmd/vmafx-controller identified two correctness bugs:
B.3 — controllerServer.StreamJobs is a no-op
StreamJobs (file: grpc_server.go:180–185) logs a single line and returns nil. The server-streaming RPC therefore sends zero messages to every caller. A client that invokes StreamJobs to inspect queue state will receive an immediately-closing stream and silently infer "no jobs queued" — the same false-success pattern noted for MCP tools in the project memory (analogous to isError=False on an error path).
B.4 — nodes.Registry.reaper goroutine has no stop signal
NewRegistry spawned go r.reaper() with no cancellation path. The reaper(_ ...context.Context) signature accepted a variadic context but the call site passed nothing, so the goroutine looped on ticker.C forever. In production this is benign (goroutine lifetime == process lifetime), but in tests each NewRegistry call leaked one goroutine.
Decision¶
B.3 — implement the snapshot: call queue.ListAll(ctx, statusFilter) to obtain a point-in-time list of jobs matching the optional proto status filter, then iterate and call stream.Send(queueJobToProto(j)) for each job. Add ListAll(ctx, []string) ([]*Job, error) to the Queue interface and SQLiteQueue (parameterised SELECT … WHERE status IN (…) with a fallback unconditional query when no filter is supplied). Add protoStatusToQueue helper (inverse of existing queueStatusToProto) to translate the proto enum filter into queue status strings.
B.4 — change NewRegistry signature to NewRegistry(ctx context.Context, log *slog.Logger) *Registry. Store a derived cancellable context internally; add a Close() method that cancels it. Change reaper(_ ...context.Context) to reaper(ctx context.Context) and replace for range ticker.C with a select { case <-ctx.Done(): return; case <-ticker.C: ... } loop. Propagate from main.go (shutdown context arrives from observability.NewShutdownContext which fires on SIGTERM/SIGINT). All call sites in tests updated to pass context.Background() + t.Cleanup(func() { r.Close() }).
Alternatives considered¶
| Option | Pros | Cons |
|---|---|---|
B.3: Return codes.Unimplemented | Honest, caller gets a clear gRPC error | Feature remains unusable; breaks any client that expected the docstring contract |
| B.3: Implement snapshot (chosen) | Fulfils the documented Phase 4b.1 promise; unblocks CLI/MCP usage | Requires adding ListAll to the Queue interface — minor surface growth |
B.4: Internal Stop() channel | No public API change | Requires storing a separate chan struct{}; context.Context is the idiomatic Go pattern |
B.4: NewRegistry(ctx, log) (chosen) | Idiomatic; propagates lifetime from caller; Close() is an idempotent double-cancel no-op | Breaks call-site signature — one-time mechanical update |
Consequences¶
Positive
StreamJobsnow delivers actual queue state; CLI and MCP clients that call it will receive correct data instead of false-empty responses.- Reaper goroutines stop cleanly on SIGTERM / test teardown; no goroutine leaks in
go test -count=Nruns. ListAllexposes a generally useful snapshot capability that future HTTP endpoints (GET /v1/jobs) can reuse.
Negative
Queueinterface grows by one method (ListAll). Any in-tree mock implementations must be updated — currently none exist; all tests use the realSQLiteQueue.
Neutral
NewRegistrysignature change is a compile-time break; all call sites in tree were updated in the same PR.
References¶
- Bug report: round-25 infrastructure audit B.3 + B.4 (2026-05-31 session)
- Memory:
project_mcp_iserror_must_be_true— analogous silent-success anti-pattern (B.3) - ADR-0711: vmafx-controller Phase 4b.1 scope expansion (origin)
- ADR-0709: VMAFX Phase 4b distributed platform (umbrella)