ADR-1009: Fix Go shutdown / goroutine correctness — WaitForShutdown unconditional block, unbounded GracefulStop¶
- Status: Accepted
- Date: 2026-06-04
- Deciders: Lusoris
- Tags:
go,server,controller,shutdown,correctness
Context¶
Two related signal-handling / shutdown correctness defects identified by the r3/r4 audit:
-
pkg/observability/observability.go:198—WaitForShutdownends with an unconditional<-time.After(timeout)(the "drain window"). When the signal arrives and all in-flight requests complete in milliseconds, the function still stalls for the fullGracefulShutdownTimeout(30 s) before returning, making every clean shutdown artificially slow. Thectx.Done()arm is only checked in the wait-for-signal select above; the drain window had no such arm. -
cmd/vmafx-server/grpc_server.go:188andcmd/vmafx-controller/grpc_server.go:434—srv.GracefulStop()is called with no timeout inside the shutdown goroutine. A stuck streaming RPC (e.g., a client that never sends EOF) will blockGracefulStopindefinitely. Becausesrv.Servewaits for all connections to close before returning, the server process never exits after SIGTERM, blockingdefer scorer.Close(),defer jobQueue.Close(), and any pod-termination hook.
Note: NewShutdownContext was already fixed in ADR-0978 to use signal.NotifyContext (no goroutine leak). The reaper goroutine in cmd/vmafx-controller/nodes/registry.go was fixed in ADR-0962. This ADR addresses the remaining two issues from the audit.
Decision¶
-
WaitForShutdown: replace the bare<-deadlinewithselect { case <-time.After(timeout): case <-ctx.Done(): }so the function returns as soon as the context is done (servers stopped) without waiting for the full timeout. -
runGRPCWithServer/RunGRPC: wrapGracefulStop()in a goroutine and race it againsttime.After(GracefulShutdownTimeout). On timeout, callsrv.Stop()(hard stop) and log a warning. This bounds the worst-case shutdown time to2 × GracefulShutdownTimeout(once for the drain window inWaitForShutdown, once for the gRPC hard-stop fallback).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Keep unconditional drain window | Simple | 30 s stall on every clean shutdown | Rejected |
| Remove drain window entirely | Instant exit | In-flight requests get no drain time | Rejected |
Use grpc.Server.GetServiceInfo to detect stuck streams | Precise | Complex introspection | Overkill for this use case |
Hard-stop fallback at GracefulShutdownTimeout | Bounds worst-case | Forcibly aborts stuck RPCs | Chosen — correct trade-off |
Consequences¶
- Positive: Clean shutdowns no longer stall for 30 s; stuck streaming RPCs cannot prevent process exit after SIGTERM.
- Neutral: The drain window is still respected (
ctx.Done()arm only fires after the caller cancels the context, which happens after servers stop). - Negative: A truly hung streaming RPC will be hard-stopped after 30 s instead of waiting forever — this is the desired behaviour.
References¶
- r3/r4 audit findings:
[r3-signal]cluster - ADR-0978 (NewShutdownContext goroutine-leak fix)
- ADR-0962 (reaper goroutine ctx propagation)