ADR-1085: MCP streaming backpressure — kill child processes on client disconnect
- Status: Accepted
- Date: 2026-06-06
- Deciders: Lusoris
- Tags: mcp, security, go, python
Context
The MCP server exposes long-running tools (vmaf_score, run_benchmark, run_compare, run_ladder, run_tune_per_shot, describe_worst_frames, eval_model_on_split) that delegate work to child processes (vmaf CLI, vmaf-tune CLI, Python evaluation script, ffmpeg).
Two separate bugs caused orphan child processes when an MCP client disconnected mid-stream:
Bug 1 — Go runVmafScore and delegateToPythonEval ignored the request context. Both functions received a ctx context.Context from the MCP dispatch layer but launched child processes via exec.Command(...) instead of exec.CommandContext(ctx, ...). A client disconnecting during a long vmaf or onnxruntime evaluation left the subprocess running to completion on CPU or GPU.
Bug 2 — Python _communicate_with_timeout did not handle asyncio.CancelledError. The function handled asyncio.TimeoutError correctly (killed the child), but if the MCP framework cancelled the outer coroutine (the standard mechanism for a client disconnect on the stdio or HTTP transport), asyncio.wait_for raised CancelledError which propagated upward without killing the child process. The child ran to completion as an orphan consuming CPU, GPU memory, and open file descriptors.
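Bug 2 can be reproduced in a few lines. The following is a minimal sketch, not the server's actual code: communicate_with_timeout_buggy and demo_orphan are hypothetical names, and a sleeping Python child stands in for a long vmaf or ffmpeg run. It assumes the pre-fix helper had roughly this shape, with only the timeout path killing the child.

```python
import asyncio
import sys


async def communicate_with_timeout_buggy(proc, timeout):
    """Assumed pre-fix shape: only the timeout path kills the child."""
    try:
        return await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    # No CancelledError branch: cancellation leaves the child running.


async def demo_orphan():
    # Stand-in for a long-running tool: a child that sleeps for 30 seconds.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(30)",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    task = asyncio.create_task(communicate_with_timeout_buggy(proc, timeout=60))
    await asyncio.sleep(0.2)  # let the task start waiting on the child
    task.cancel()             # simulates the framework reacting to a disconnect
    try:
        await task
    except asyncio.CancelledError:
        pass
    # The coroutine is gone, but the child has not exited.
    still_running = proc.returncode is None
    # Clean up so the demo itself leaves no orphan behind.
    proc.kill()
    await proc.wait()
    return still_running
```

Calling asyncio.run(demo_orphan()) reports whether the child outlived the cancelled coroutine, which is the orphan condition described above.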
Decision
- Go: thread the caller's ctx context.Context through runVmafScore, runVmafScoreDirect, and delegateToPythonEval; change all internal exec.Command(...) calls in those functions to exec.CommandContext(ctx, ...) so that OS-level SIGKILL is delivered to the child when the context is cancelled.
- Python: add an except asyncio.CancelledError branch to _communicate_with_timeout that kills the child process and then re-raises, mirroring the existing asyncio.TimeoutError handling. Re-raising is mandatory so asyncio can propagate the cancellation to the MCP framework.
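The Python half of the decision might look like the sketch below. It is an illustration under stated assumptions rather than the code in server.py: the helper's real signature is not shown in this ADR, and communicate_with_timeout, _kill_and_reap, and demo_cancel_kills_child are hypothetical names. A sleeping Python child again stands in for a real tool.

```python
import asyncio
import sys


async def _kill_and_reap(proc):
    """Kill the child if it is still running, then wait so it is reaped."""
    if proc.returncode is None:
        try:
            proc.kill()
        except ProcessLookupError:
            pass  # the child exited between the check and the kill
        await proc.wait()


async def communicate_with_timeout(proc, timeout):
    """Wait for the child; kill it on timeout and on cancellation."""
    try:
        return await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        await _kill_and_reap(proc)
        raise
    except asyncio.CancelledError:
        # Client disconnected: stop the child, then re-raise so the
        # cancellation still reaches the caller.
        await _kill_and_reap(proc)
        raise


async def demo_cancel_kills_child():
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(30)",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    task = asyncio.create_task(communicate_with_timeout(proc, timeout=60))
    await asyncio.sleep(0.2)  # let the task start waiting on the child
    task.cancel()             # simulates a client disconnect
    try:
        await task
        cancelled = False
    except asyncio.CancelledError:
        cancelled = True
    # returncode is set once the child has been killed and reaped.
    return cancelled, proc.returncode
```

After asyncio.run(demo_cancel_kills_child()), the task has ended with CancelledError and the child has a return code, so it is no longer running. Swallowing the exception instead of re-raising would make the cancelled task appear to complete normally, which is why the re-raise is kept in both branches.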
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| asyncio.shield on each proc.communicate() call | Prevents cancellation mid-read, simpler error semantics | Orphan process stays alive indefinitely regardless of client state | The whole point is to kill the orphan |
| Per-tool cancel wrappers instead of fixing _communicate_with_timeout | More granular | Code duplication across 10+ call sites | Central fix in the shared helper is cleaner and covers all tools atomically |
| Leave as-is, rely on OS to reap orphans at server shutdown | No code change | Orphan vmaf/ffmpeg processes consume GPU VRAM for the full lifetime of the server (hours in container deployments) | Unacceptable in long-lived container deployments |
Consequences
- Positive: client disconnect during any long-running MCP tool call now terminates the underlying child process promptly, freeing CPU, GPU, and file resources.
- Positive: symmetric cancel handling across Go and Python implementations.
- Negative: none. The change is purely additive — contexts that are never cancelled behave identically to before.
- Neutral / follow-ups: runVmafScoreDirect now also receives a context and passes it to libvmaf.ScoreDirect, which already accepted one. The old context.Background() comment citing "future ADR" is retired by this ADR.
References
- ADR-1018 (r5-mcp-streaming GPU probe CommandContext finding, same class of bug).
- cmd/vmafx-mcp/impl.go: runVmafScore (line 224), delegateToPythonEval (line 530).
- cmd/vmafx-mcp/impl_direct.go: runVmafScoreDirect (line 83, context.Background() removed).
- mcp-server/vmaf-mcp/src/vmaf_mcp/server.py: _communicate_with_timeout.
- Source: automated workflow audit (mcp-streaming-backpressure).