ADR-0978: vmafx-server + pkg/score bug-audit — shutdown leak, gRPC Send-EOF surfacing, HTTP body cap, panic recovery¶
- Status: Accepted
- Date: 2026-05-31
- Deciders: Lusoris
- Tags:
security,bug,audit,go,grpc,http,server
Context¶
A deep-dive audit of cmd/vmafx-server/ (Go gRPC + HTTP scoring service — ADR-0703) and pkg/score/ (gRPC client wrapper — ADR-0703 / ADR-0933) found four reachable defects plus one defensive cleanup. None are score-correctness bugs; all are reliability / robustness defects on production-server surfaces.
pkg/observability.NewShutdownContextgoroutine + signal-handler leak. The previous implementation spawned a goroutine blocked on<-chwith no<-ctx.Done()arm:
func NewShutdownContext() (context.Context, context.CancelFunc) {
ctx, cancel := context.WithCancel(context.Background())
go func() {
ch := make(chan os.Signal, 1)
signal.Notify(ch, syscall.SIGTERM, syscall.SIGINT)
defer signal.Stop(ch)
<-ch
cancel()
}()
return ctx, cancel
}
Documented contract: "the returned stop function must be called when the context is no longer needed to release resources." But calling stop() (the returned cancel func) only cancelled the parent context — it did NOT unblock the goroutine's <-ch, nor did it signal.Stop(ch). Each NewShutdownContext / stop() cycle that exited via stop() first (i.e. any code path that did not receive an actual SIGTERM/SIGINT) leaked one goroutine plus one signal-handler subscription for the rest of the process lifetime. Reachable on every early-return path in cmd/vmafx-server/main.go (scorer, err := libvmaf.New(...) returning non-nil err → os.Exit(1) skips the defer stop()); the more subtle case is repeated test usage of NewShutdownContext as a primitive, where the leak accumulates per call.
-
pkg/score.Client.OpenScoreStream+ScoreStream.PushFramesurface meaninglessio.EOFinstead of the server's real status. Per thegrpc.ClientStreamcontract,Sendreturnsio.EOF"whenever the stream is done from the server's perspective" — including when the server has already half-closed the stream with a non-OK status (e.g.codes.InvalidArgumentfor a malformedStreamConfig). The actual status only surfaces onRecv. The pre-fix wrappers wrapped the bare EOF infmt.Errorf("score: send StreamConfig: %w", err), so a caller sending a malformed config saw"score: send StreamConfig: EOF"instead of"InvalidArgument: ScoreStream: first message must set the config oneof". Triggers any time the server rejects an opening frame before reading client input — including the existing ScoreStreamwidth/height == 0/pixel_format == UNSPECIFIED/ wrong-oneof validators ingrpc_server.go:107-138. The drag-along bug is inPushFrame, which has the same Send call shape and the same EOF path. -
cmd/vmafx-server/http_server.handleScorehas no request-body size limit.json.NewDecoder(r.Body).Decode(&req)reads until EOF. Combined with a public, unauthenticated POST endpoint (the current default — TLS / auth is deferred to a follow-up ADR), an attacker can POST a multi-GB body and balloon the decoder's read buffer until the process OOMs. The legitimate use case is a few-hundred-byte JSON blob (three string fields, longest field bounded byPATH_MAX = 4096). The cap that fits production-grade comfortably is 1 MiB. -
cmd/vmafx-server/grpc_serverhas no panic-recovery interceptor. A panic inside any handler (the cgo libvmaf call path is the obvious risk surface, but any handler counts) tears down the gRPC worker goroutine; the gRPC library re-raises it in the connection's read loop, crashing the entire server process and dropping every in-flight request and stream on the floor. Production gRPC servers conventionally installgrpc.UnaryInterceptor+grpc.StreamInterceptorrecovery handlers that translate the panic intocodes.Internalto the offending client while keeping the server alive. -
Defensive:
pkg/score.ScoreStream.Recvcompareserr == io.EOF. The current gRPC client surfaces io.EOF as the bare sentinel, so==works today. The defensive formerrors.Is(err, io.EOF)keeps the wrapper's contract stable if a future gRPC release wraps the sentinel inside a status. Cheap to fix; rolled into the same PR.
The audit deliberately did not modify Netflix golden assertions (CLAUDE.md §8), did not weaken any test threshold, and did not change any scoring math. All five fixes are pure reliability / hardening.
Out of scope, deliberately deferred (would force a multi-package signature change touching pkg/libvmaf + cmd/vmafx-controller + cmd/vmafx-node): the scorer.Score() API takes no context.Context, so a vmaf CLI subprocess keeps running after the HTTP / gRPC client disconnects. Tracked separately.
Decision¶
Apply five coordinated fixes in one PR:
-
pkg/observability.NewShutdownContext: delegate tosignal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT). The stdlib idiom (Go 1.16+) unwinds correctly on both paths — signal arrival ANDstop()called first — releasing both the goroutine and the signal-handler subscription. -
pkg/score.OpenScoreStream+PushFrame: introduce arecvStatusOnEOFhelper that detects anio.EOFfromSend, callsRecvto drain the underlying gRPC status, and returns it to the caller. Pass-through for any non-EOF Send error and for the case where Recv yields no usable status (fall back to the original EOF). -
cmd/vmafx-server/http_server.handleScore: cap the request body at 1 MiB viahttp.MaxBytesReader, map*http.MaxBytesErrorto HTTP 413 Request Entity Too Large with a typederrorResponse{Error: ...}body. ConstantmaxScoreRequestBodyBytes = 1 << 20lives at the package scope so the test can reference it. -
cmd/vmafx-server/grpc_server.runGRPC: installrecoveryUnaryInterceptor+recoveryStreamInterceptoron the gRPC server. Eachdefer-recovers, logs the panic at ERROR with full stack, and returnsstatus.Errorf(codes.Internal, "internal server error"). The server stays alive across panics. -
pkg/score.ScoreStream.Recv: replaceerr == io.EOFwitherrors.Is(err, io.EOF). Defensive; behaviour-preserving today.
New regression tests:
TestNewShutdownContext_NoGoroutineLeak(pkg/observability) — 100 cycles of NewShutdownContext / stop(), then assertsruntime.NumGoroutine()returns to baseline (tolerance ≤ baseline+2 for scheduler noise). Pre-fix would leak monotonically and trip.TestOpenScoreStream_SurfaceServerStatusOnSendEOF(pkg/score) — drives a stub server that immediately rejects ScoreStream with InvalidArgument and asserts the client wrapper surfaces InvalidArgument rather than EOF on either OpenScoreStream or PushFrame.TestRecv_EOFAfterAggregateUsesErrorsIs(pkg/score) — exercises Recv against the existing Phase-1 stub and asserts the Unimplemented status surfaces correctly (covers the errors.Is path change).TestScoreEndpoint_RejectsOversizedBody(cmd/vmafx-server) — POSTs a ~1.001 MiB body, asserts 413.TestScoreEndpoint_AcceptsBoundedBody(cmd/vmafx-server) — POSTs a 64 KiB body, asserts NOT-413 (cap is permissive enough for realistic loads).TestPanicRecovery_Unary/TestPanicRecovery_Stream(cmd/vmafx-server) — register a panicking handler against the same interceptor stack and assert codes.Internal + server stays alive across a second call.
All tests pass under go test -race -count=1.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Fix the observability leak in isolation (single-file PR) | Small surface, easy review | Three of the five defects share the same root cause (production-server hardening on Go surfaces nobody had audited end-to-end); shipping one fix at a time fragments the audit story. CLAUDE feedback ("PR hygiene: bigger PRs over micro-PRs") explicitly prefers the bundle | Bundle matches the user's stated PR-sizing preference |
Use oklog/run or errgroup's shutdown semantics instead of signal.NotifyContext | Slightly more featureful (timer-bounded shutdown) | Adds a dependency; existing main.go already uses errgroup for the runHTTP/runGRPC pair and that part stays unchanged | signal.NotifyContext is stdlib, zero-dep, and the bug is specifically about the signal-context primitive — keep the fix narrow |
| Translate Send-EOF inside the generated gRPC code (codegen patch) | Affects every Send call uniformly | Forks the protoc-gen-go-grpc generator; massive maintenance cost, breaks make proto-gen reproducibility | Translate at the wrapper layer only — that's exactly what pkg/score exists for |
Run a panic-recovery middleware via go-grpc-middleware/recovery package | Battle-tested, single-line install | One more dependency (and a transitive on the v2 module that pulls in zap) | Hand-rolled 15-line recovery is trivial; avoiding a transitive zap pull keeps the binary smaller |
Configure HTTP body cap globally via http.Server.MaxBytesReader | One config knob, applies to every handler | Applies to readyz/healthz too where the body is empty anyway; conceptually wrong for differently-shaped endpoints. The score endpoint is the only one that reads a body | Apply the cap at the only endpoint that consumes a body |
| Add TLS + bearer-token auth at the same time | Removes the unauthenticated-attacker assumption that motivates the body cap | Scope creep; TLS/auth is a separate decision that deserves its own ADR | Body cap is still right defence-in-depth even with auth (a compromised credential should not OOM the server) |
// nolint the Send-EOF + Recv == lines instead of fixing | Zero risk of behaviour change | Surfaces leak to every caller; CLAUDE.md §12 rule 12 requires NOLINT only for load-bearing invariants and the EOF wrap is neither | Refactor over nolint |
Consequences¶
- Positive:
vmafx-serverno longer leaks one goroutine + one signal-handler subscription perNewShutdownContext/stop()cycle; ScoreStream clients see the server's actual gRPC status when framing is rejected; the/v1/scoreendpoint can't be OOM-DoSed by an unauthenticated POST; any handler panic translates tocodes.Internalrather than crashing the server; the Recv error-comparison is robust to a future gRPC release wrappingio.EOF. - Negative: Six new test additions; one new package-scope constant (
maxScoreRequestBodyBytes). The gRPC server now carries two interceptor wrappers; per-RPC cost is a singledefer recover()(sub-microsecond). - Neutral / follow-ups:
- Scorer subprocess cancellation (
exec.CommandContext) — deferred to a separate PR; requires extendingScorer.Scoreto take acontext.Context, touchingpkg/libvmaf,cmd/vmafx-controller/{http_server,grpc_server}, andcmd/vmafx-node/executor. New entry added todocs/state.md → Open bugsasT-LIBVMAF-SCORE-NEEDS-CTX-2026-05-31. - TLS + bearer-token auth on the public HTTP + gRPC endpoints — already tracked as future work in
pkg/score/grpc_client.gocomments (Dial defaults to insecure). - Optional follow-up: add a similar body cap to the
cmd/vmafx-controllerHTTP score endpoint (same shape; out of scope for this targeted audit).
References¶
- Source: bug-audit task brief, 2026-05-31 session.
- ADR-0703 (vmafx-server Go gRPC + HTTP service) — the parent surface these defects live on.
- ADR-0933 (gRPC streaming multi-frame scoring) — the streaming surface whose Send/Recv ordering motivates the EOF-translation fix.
- gRPC Go docs: https://pkg.go.dev/google.golang.org/grpc#ClientStream — definitive statement that Send returns
io.EOFwhen "the stream is done" and the underlying status comes via Recv. - Go stdlib docs: https://pkg.go.dev/os/signal#NotifyContext — the idiomatic shutdown-context primitive (Go 1.16+).
- Companion audit shipped earlier in the same session: PR #495 (
fix(core/tools)input-reader safety, ADR-0977).