ADR-1095: Fix OTel trace context propagation across gRPC boundaries
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-06-07 |
| Deciders | Lusoris |
| Tags | observability, otel, grpc, bug |
Context
ADR-0927 introduced `pkg/observability.InitOTel`, which installs the global OTel `TracerProvider`, `MeterProvider`, and a W3C `TextMapPropagator` (TraceContext + Baggage) on startup for all four Go binaries.
ADR-0782 defined the span schema and instrumented the `vmafx-controller` gRPC server with `otelgrpc.NewServerHandler()` so incoming RPCs create server spans parented to the caller's trace.
Three gaps were left unaddressed:
1. `vmafx-server` gRPC server (`cmd/vmafx-server/grpc_server.go`, `runGRPCWithServer`): the `otelgrpc.NewServerHandler()` stats handler was never installed. Any `traceparent` header arriving from a client was silently discarded, so every server span started a new trace as its root instead of becoming a child span of the calling process.
2. `pkg/score` gRPC client (`pkg/score/grpc_client.go`, `Dial`): the `otelgrpc.NewClientHandler()` stats handler was never attached to the `grpc.NewClient` call. Outgoing RPCs carried no `traceparent` header, so the controller-side `otelgrpc.NewServerHandler()` received no parent span ID and could not link the two sides of the call into one trace.
3. `ObserveScoreLatency` context (`pkg/observability/otel_instruments.go`): the helper passed `context.Background()` to the SDK histogram `Record` call, discarding baggage and preventing exemplar attachment (the OTel SDK uses the context to read the active span ID when attaching trace exemplars to histogram data points).
The net effect: every scoring trace appeared as a forest of disconnected roots. The controller's `vmafx.job.submit` span, the server's (missing) span, and the node's `vmafx.scoring` / `vmafx.frame.extraction` spans were all unlinked, making distributed tracing unusable for debugging end-to-end latency.
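The failure mode described above can be illustrated with a minimal, standard-library-only sketch of W3C `traceparent` propagation. This is not the `otelgrpc` implementation: `SpanContext`, `Inject`, `Extract`, and `StartServerSpan` are hypothetical names, metadata is modelled as a plain map, and header validation is deliberately simplified. It only shows why a missing header on either side produces a fresh trace root.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// SpanContext is a hypothetical, simplified stand-in for an OTel span context.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
}

// randomHex returns 2*nBytes random hex characters.
func randomHex(nBytes int) string {
	b := make([]byte, nBytes)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// Inject writes the caller's span context into outgoing metadata using the
// W3C layout: version-traceid-parentid-flags (flags 01 = sampled).
func Inject(md map[string]string, sc SpanContext) {
	md["traceparent"] = fmt.Sprintf("00-%s-%s-01", sc.TraceID, sc.SpanID)
}

// Extract parses an incoming traceparent header. Validation is simplified:
// only the field count and the ID lengths are checked.
func Extract(md map[string]string) (SpanContext, bool) {
	parts := strings.Split(md["traceparent"], "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2]}, true
}

// StartServerSpan joins the caller's trace when a valid header is present.
// Otherwise it starts a new trace, and parentSpanID is empty (a root span).
func StartServerSpan(md map[string]string) (span SpanContext, parentSpanID string) {
	if parent, ok := Extract(md); ok {
		return SpanContext{TraceID: parent.TraceID, SpanID: randomHex(8)}, parent.SpanID
	}
	return SpanContext{TraceID: randomHex(16), SpanID: randomHex(8)}, ""
}

func main() {
	client := SpanContext{TraceID: randomHex(16), SpanID: randomHex(8)}

	// Uninstrumented client: nothing injects the header.
	bare := map[string]string{}
	s1, p1 := StartServerSpan(bare)
	fmt.Println("no header: same trace =", s1.TraceID == client.TraceID, "parent =", p1)

	// Instrumented client and server: the header links both sides.
	md := map[string]string{}
	Inject(md, client)
	s2, p2 := StartServerSpan(md)
	fmt.Println("with header: same trace =", s2.TraceID == client.TraceID, "parent =", p2)
}
```

In the real binaries the propagator installed by `InitOTel` and the `otelgrpc` stats handlers perform the inject and extract steps. The sketch only makes visible that both halves are required for a linked trace.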
Decision
Fix all three gaps in a single PR:
1. Add `grpc.StatsHandler(otelgrpc.NewServerHandler())` to the `grpc.NewServer` call inside `runGRPCWithServer` in `cmd/vmafx-server/grpc_server.go`. This mirrors the pattern already present in `cmd/vmafx-controller/grpc_server.go`.
2. Add `grpc.WithStatsHandler(otelgrpc.NewClientHandler())` to the `grpc.NewClient` call inside `Dial` in `pkg/score/grpc_client.go`.
3. Change `ObserveScoreLatency` to accept a `context.Context` as its first argument and forward it to `ScoreLatency.Record`. All call sites are in test code only (no production callers existed before this PR).
Items 1 and 2 would not require a new ADR on their own: they are a straightforward application of the pattern already documented in ADR-0927. This ADR is filed because the combined gap is an architectural correctness defect (distributed tracing silently broken across all scoring paths) that should be on record for any engineer who revisits this code.
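The effect of the context change to `ObserveScoreLatency` can be sketched without the OTel SDK. Everything below is a hypothetical stand-in: `Histogram`, `Point`, `WithSpanID`, `observeOld`, and `observeNew` are illustrative names, the span ID is carried as a plain context value, and recording in seconds is an assumption. The real SDK stores the active span in the context through its own API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// spanIDKey is a private context key used by this sketch only.
type spanIDKey struct{}

// WithSpanID is a hypothetical helper that stores a span ID in the context.
func WithSpanID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, spanIDKey{}, id)
}

// Point is one recorded measurement plus the span it can be traced back to.
type Point struct {
	Seconds        float64
	ExemplarSpanID string // empty when no span was found in the context
}

// Histogram is a toy stand-in for an SDK histogram instrument.
type Histogram struct{ Points []Point }

// Record reads the span ID from ctx, as an exemplar-aware SDK would.
func (h *Histogram) Record(ctx context.Context, seconds float64) {
	id, _ := ctx.Value(spanIDKey{}).(string)
	h.Points = append(h.Points, Point{Seconds: seconds, ExemplarSpanID: id})
}

// observeOld mirrors the pre-fix shape: the caller's context never arrives.
func observeOld(h *Histogram, d time.Duration) {
	h.Record(context.Background(), d.Seconds())
}

// observeNew mirrors the post-fix shape: ctx is the first argument and is
// forwarded unchanged to Record.
func observeNew(ctx context.Context, h *Histogram, d time.Duration) {
	h.Record(ctx, d.Seconds())
}

// run records one measurement with each variant from the same request context.
func run() []Point {
	h := &Histogram{}
	ctx := WithSpanID(context.Background(), "00f067aa0ba902b7")
	observeOld(h, time.Second)
	observeNew(ctx, h, 2*time.Second)
	return h.Points
}

func main() {
	for i, p := range run() {
		fmt.Printf("point %d: %.0fs exemplar=%q\n", i, p.Seconds, p.ExemplarSpanID)
	}
}
```

Only the second point can be linked back to a span, which is the exemplar behaviour the signature change is meant to restore.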
Consequences
- Distributed traces that cross the HTTP/gRPC boundary between a score client and `vmafx-server` are now correctly linked.
- The controller, server, and node spans for a single scoring job appear in the same trace waterfall in Jaeger / Grafana Tempo.
- The `ObserveScoreLatency` signature change is a breaking API change, but the function is package-internal (no callers outside `pkg/observability` and its tests), so there is no downstream impact.
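One way to check the "single waterfall" outcome in a test is to count trace roots among the exported spans. The sketch below is an assumption-laden illustration, not code from this PR: `Span` and `CountRoots` are hypothetical, the IDs are shortened placeholders, the `server-rpc` span name is invented because the ADR does not name the server span, and the parent chain in `after` is assumed.

```go
package main

import "fmt"

// Span is a hypothetical, minimal record of an exported span.
type Span struct {
	Name     string
	TraceID  string
	SpanID   string
	ParentID string // empty for a root span
}

// CountRoots counts spans that start their own tree: spans with no parent,
// or whose parent is not present in the same trace.
func CountRoots(spans []Span) int {
	known := make(map[string]bool, len(spans))
	for _, s := range spans {
		known[s.TraceID+"/"+s.SpanID] = true
	}
	roots := 0
	for _, s := range spans {
		if s.ParentID == "" || !known[s.TraceID+"/"+s.ParentID] {
			roots++
		}
	}
	return roots
}

// before models the broken state: every span is the root of its own trace.
var before = []Span{
	{Name: "vmafx.job.submit", TraceID: "t1", SpanID: "a"},
	{Name: "server-rpc", TraceID: "t2", SpanID: "b"},
	{Name: "vmafx.scoring", TraceID: "t3", SpanID: "c"},
	{Name: "vmafx.frame.extraction", TraceID: "t4", SpanID: "d"},
}

// after models the fixed state: one trace ID and an unbroken parent chain.
var after = []Span{
	{Name: "vmafx.job.submit", TraceID: "t1", SpanID: "a"},
	{Name: "server-rpc", TraceID: "t1", SpanID: "b", ParentID: "a"},
	{Name: "vmafx.scoring", TraceID: "t1", SpanID: "c", ParentID: "b"},
	{Name: "vmafx.frame.extraction", TraceID: "t1", SpanID: "d", ParentID: "c"},
}

func main() {
	fmt.Println("roots before:", CountRoots(before))
	fmt.Println("roots after:", CountRoots(after))
}
```

A root count of one across the spans of a scoring job is a compact assertion that propagation is intact end to end.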
Alternatives considered
| Alternative | Reason rejected |
|---|---|
| Apply only the client-side fix (item 2) and leave the server without the stats handler | The server still creates no span, so Jaeger shows the trace ending at the client and the controller's span appearing disconnected. Both sides must be instrumented. |
| Add a middleware interceptor instead of the stats handler | `otelgrpc` recommends the `StatsHandler` API over the deprecated `UnaryInterceptor` wrappers. The interceptor approach works but produces lower-quality metadata (missing internal event attributes that the stats handler captures). |
| Keep `ObserveScoreLatency` with `context.Background()` | The only downside is missing exemplar linking, a minor quality issue rather than a correctness defect. However, the fix is trivial (one argument) and adds semantic correctness at zero cost. |
References
- ADR-0927: OpenTelemetry traces + metrics Phase 1 (provider initialisation).
- ADR-0782: OpenTelemetry tracing and metrics schema (span names, attributes).
- `go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc` v0.53.0 (already in `go.mod`).