vmafx-controller gRPC service
vmafx-controller is the distributed platform controller for VMAFX Phase 4b. It is a single Go binary that exposes VMAF scoring and job orchestration over both gRPC and HTTP/JSON.
This page covers the gRPC interface. See http-transport.md for the HTTP endpoints (/healthz, /readyz, /metrics, /v1/score).
The controller exposes two gRPC services on the same port:
| Service | Purpose |
|---|---|
| VmafxScoring | Direct scoring (retained from Phase 4a, ADR-0703) |
| VmafxController | Job queue + node API (Phase 4b.1, ADR-0711) |
Quick start
# Local dev (requires core/build-cpu to exist)
go run ./cmd/vmafx-controller \
--vmaf-binary core/build-cpu/tools/vmaf \
--model-dir model/ \
--port 8080 \
--grpc-port 50051 \
--db /tmp/vmafx-controller.db
# Docker
docker build -f docker/Dockerfile.controller -t vmafx-controller:dev .
docker run --rm \
-e VMAFX_VMAF_BINARY=/usr/local/bin/vmaf \
-e VMAFX_MODEL_DIR=/usr/local/share/vmafx/model \
-e VMAFX_DB_PATH=/data/vmafx-controller.db \
-p 8080:8080 -p 50051:50051 \
vmafx-controller:dev
Configuration
Every setting can be supplied either as a CLI flag or as a 12-factor environment variable. When both are set, the CLI flag takes precedence.
| Flag | Env var | Default | Description |
|---|---|---|---|
| --port | VMAFX_PORT | 8080 | HTTP listen port |
| --grpc-port | VMAFX_GRPC_PORT | 50051 | gRPC listen port |
| --log-level | VMAFX_LOG_LEVEL | INFO | slog level (DEBUG/INFO/WARN/ERROR) |
| --vmaf-binary | VMAFX_VMAF_BINARY | (PATH lookup) | Path to the vmaf CLI binary |
| --model-dir | VMAFX_MODEL_DIR | (none) | Directory containing VMAF .json model files |
| --db | VMAFX_DB_PATH | vmafx-controller.db | Path to the SQLite job-persistence database |
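The precedence rule (flag, then environment variable, then default) can be sketched with the standard library alone. This is a minimal illustration for the --db setting, not the controller's actual configuration code; the function name resolveDB and the injected getenv parameter are assumptions made so the logic is easy to exercise.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolveDB returns the database path using the documented precedence:
// CLI flag first, then VMAFX_DB_PATH, then the built-in default.
// resolveDB and its getenv parameter are hypothetical names for this sketch.
func resolveDB(args []string, getenv func(string) string) (string, error) {
	// The environment value (if any) becomes the flag's default, so an
	// explicit --db on the command line still overrides it.
	def := "vmafx-controller.db"
	if v := getenv("VMAFX_DB_PATH"); v != "" {
		def = v
	}
	fs := flag.NewFlagSet("vmafx-controller", flag.ContinueOnError)
	db := fs.String("db", def, "path to the SQLite job-persistence database")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *db, nil
}

func main() {
	db, err := resolveDB(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("db path:", db)
}
```

Feeding the environment value in as the flag default keeps the precedence in one place: the flag package only replaces the default when the flag is actually present on the command line.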
VmafxScoring service (direct scoring)
Retained from Phase 4a for backward compatibility. Clients that already talk to vmafx-server continue to work against vmafx-controller without changes.
service VmafxScoring {
rpc Score(ScoreRequest) returns (ScoreResponse);
rpc Health(HealthRequest) returns (HealthResponse);
}
Proto source: proto/vmafx.proto. Generated stubs: gen/go/vmafxv1/.
Example: direct score
grpcurl -plaintext \
-d '{"reference":"/data/ref.yuv","distorted":"/data/dis.yuv","model":"vmaf_v0.6.1"}' \
localhost:50051 vmafx.v1.VmafxScoring/Score
VmafxController service (job queue + node API)
Phase 4b.1 distributed orchestration surface. Proto source: cmd/vmafx-controller/proto/controller.proto. Generated stubs: gen/go/controller/.
Client API
Used by the CLI, MCP server, and future Web UI to submit and track jobs.
service VmafxController {
rpc SubmitJob(SubmitJobRequest) returns (SubmitJobResponse);
rpc GetJob(GetJobRequest) returns (Job);
rpc CancelJob(CancelJobRequest) returns (CancelJobResponse);
rpc StreamJobs(StreamJobsRequest) returns (stream Job); // snapshot in Phase 4b.1
...
}
Submit a job
grpcurl -plaintext \
-d '{"scoring":{"reference":"/data/ref.yuv","distorted":"/data/dis.yuv","backend":"cuda"}}' \
localhost:50051 vmafx.controller.v1.VmafxController/SubmitJob
# → {"jobId": "550e8400-e29b-41d4-a716-446655440000"}
Poll job status
grpcurl -plaintext \
-d '{"jobId":"550e8400-e29b-41d4-a716-446655440000"}' \
localhost:50051 vmafx.controller.v1.VmafxController/GetJob
# → {"id":"...","status":"COMPLETED","scoring":{...},"assignedNode":"node-abc"}
Node API
Used by vmafx-node worker processes to pull and report work.
service VmafxController {
...
rpc RegisterNode(RegisterNodeRequest) returns (RegisterNodeResponse);
rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse);
rpc PullWork(PullWorkRequest) returns (PullWorkResponse);
rpc ReportResult(ReportResultRequest) returns (ReportResultResponse);
}
Node lifecycle:
- On startup, the node calls RegisterNode with its capability (GPU vendor, available backends, concurrency slots). The controller returns a node_id and a session_token.
- The node calls Heartbeat every ~10 s with the node_id and session_token. A node that misses heartbeats for 60 s is evicted; its in-flight jobs return to PENDING.
- When the node has capacity, it calls PullWork. The controller assigns the oldest PENDING job whose backend requirement matches the node's capabilities.
- After the job completes (or fails), the node calls ReportResult with final=true.
Job lifecycle
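The job statuses mentioned on this page (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED) and the events that move a job between them can be collected into a small transition table. The sketch below is inferred from the surrounding text and is not taken from the controller's source; in particular, allowing CancelJob on a RUNNING job is an assumption, since the page only states that jobs can be cancelled.

```go
package main

import "fmt"

type status string

const (
	pending   status = "PENDING"
	running   status = "RUNNING"
	completed status = "COMPLETED"
	failed    status = "FAILED"
	cancelled status = "CANCELLED"
)

// transitions lists the moves implied by this page (an inference, not the
// controller's implementation). Terminal states have no outgoing moves.
var transitions = map[status][]status{
	// PullWork assigns the job; CancelJob cancels it.
	pending: {running, cancelled},
	// ReportResult finishes the job; node eviction or a controller restart
	// re-queues it; cancelling a RUNNING job is an assumption.
	running: {completed, failed, pending, cancelled},
}

// canTransition reports whether a job may move from one status to another.
func canTransition(from, to status) bool {
	for _, s := range transitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(pending, running))   // PullWork
	fmt.Println(canTransition(running, pending))   // eviction or restart
	fmt.Println(canTransition(completed, pending)) // terminal states stay put
}
```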
Backend capability matching
A job's scoring.backend field specifies which backend the job requires (e.g. "cuda", "sycl", "cpu"). If empty, any node can accept the job. A node must list the required backend in its capability.backends to receive the job.
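The matching rule combines with the "oldest PENDING job first" behaviour of PullWork roughly as follows. This is a minimal sketch assuming the pending queue is already ordered oldest first; the job type and the function names matches and pickJob are hypothetical.

```go
package main

import "fmt"

// job is a simplified stand-in for a queued job (hypothetical type).
type job struct {
	ID      string
	Backend string // empty means any node may take the job
}

// matches reports whether a node advertising nodeBackends may run j.
func matches(j job, nodeBackends []string) bool {
	if j.Backend == "" {
		return true
	}
	for _, b := range nodeBackends {
		if b == j.Backend {
			return true
		}
	}
	return false
}

// pickJob returns the first matching job from pending, which is assumed to
// be ordered oldest first.
func pickJob(pending []job, nodeBackends []string) (job, bool) {
	for _, j := range pending {
		if matches(j, nodeBackends) {
			return j, true
		}
	}
	return job{}, false
}

func main() {
	queue := []job{
		{ID: "job-1", Backend: "cuda"},
		{ID: "job-2"},
		{ID: "job-3", Backend: "cpu"},
	}
	// A CPU-only node skips the CUDA job and takes the next eligible one.
	j, ok := pickJob(queue, []string{"cpu"})
	fmt.Println(j.ID, ok)
}
```

One consequence of this rule is that a job requiring a backend no registered node offers stays PENDING without blocking jobs queued behind it.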
Job persistence
The controller persists jobs in a SQLite database at --db / VMAFX_DB_PATH. On controller restart:
- PENDING jobs are reloaded and re-queued in submission order.
- RUNNING jobs are reset to PENDING (their assigned nodes are gone).
- COMPLETED, FAILED, and CANCELLED jobs are retained for audit.
Schema: cmd/vmafx-controller/queue/schema.sql.
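The restart rules can be expressed as a single pass over the stored jobs. The sketch below works on an in-memory slice rather than SQLite, so it only illustrates the logic. The job type and its Submitted field are hypothetical, and interleaving reset RUNNING jobs with PENDING ones by submission time is an assumption; the page only states that PENDING jobs are re-queued in submission order.

```go
package main

import (
	"fmt"
	"sort"
)

// job is a simplified stand-in for a persisted row (hypothetical type).
type job struct {
	ID        string
	Status    string
	Submitted int64 // submission time, e.g. Unix nanoseconds (hypothetical field)
}

// recoverQueue applies the restart rules: RUNNING jobs are reset to PENDING,
// PENDING jobs are ordered by submission time, and terminal jobs are left
// untouched. It mutates jobs in place and returns the IDs to re-queue.
func recoverQueue(jobs []job) []string {
	var pending []job
	for i := range jobs {
		if jobs[i].Status == "RUNNING" {
			jobs[i].Status = "PENDING"
		}
		if jobs[i].Status == "PENDING" {
			pending = append(pending, jobs[i])
		}
	}
	sort.SliceStable(pending, func(a, b int) bool {
		return pending[a].Submitted < pending[b].Submitted
	})
	ids := make([]string, len(pending))
	for i, j := range pending {
		ids[i] = j.ID
	}
	return ids
}

func main() {
	jobs := []job{
		{ID: "c", Status: "PENDING", Submitted: 3},
		{ID: "a", Status: "RUNNING", Submitted: 1},
		{ID: "b", Status: "COMPLETED", Submitted: 2},
	}
	fmt.Println(recoverQueue(jobs))
}
```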
Prometheus metrics
/metrics exposes the following in Prometheus exposition format.
| Metric | Type | Description |
|---|---|---|
| vmafx_controller_score_requests_total | Counter | Direct Score requests (HTTP + gRPC VmafxScoring) |
| vmafx_controller_score_errors_total | Counter | Direct Score requests that returned an error |
| vmafx_controller_score_duration_seconds | Histogram | Direct scoring latency |
| vmafx_controller_health_requests_total | Counter | Health / /healthz calls |
| vmafx_controller_ready_requests_total | Counter | /readyz calls |
| vmafx_controller_jobs_pending | Gauge | Current number of PENDING jobs |
| vmafx_controller_jobs_running | Gauge | Current number of RUNNING jobs |
| vmafx_controller_nodes_registered | Gauge | Current live node registrations |
| vmafx_controller_jobs_submitted_total | Counter | Jobs submitted via SubmitJob RPC |
| vmafx_controller_jobs_completed_total | Counter | Jobs completed successfully |
| vmafx_controller_jobs_failed_total | Counter | Jobs that ended in failure |
| vmafx_controller_jobs_cancelled_total | Counter | Jobs cancelled |
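For a quick scripted check (for example, waiting for the queue to drain in a test harness), a gauge can be read directly from the exposition text without a Prometheus server. The sketch below assumes the sample is exported without labels, which this page does not confirm; gaugeValue is a hypothetical helper, and the sample text in main is made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue scans Prometheus text exposition output for an unlabelled
// sample named metric and returns its value.
func gaugeValue(exposition, metric string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// A sample line is "name value [timestamp]".
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == metric {
			v, err := strconv.ParseFloat(fields[1], 64)
			return v, err == nil
		}
	}
	return 0, false
}

func main() {
	// Illustrative input only; real output comes from GET /metrics.
	sample := "# TYPE vmafx_controller_jobs_pending gauge\n" +
		"vmafx_controller_jobs_pending 3\n" +
		"vmafx_controller_jobs_running 1\n"
	v, ok := gaugeValue(sample, "vmafx_controller_jobs_pending")
	fmt.Println(v, ok)
}
```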
Graceful shutdown
The controller listens for SIGTERM and SIGINT. On receipt it:
- Stops accepting new connections.
- Waits up to 30 seconds for in-flight requests to drain.
- Exits with code 0.
Jobs remain in the SQLite database; the next controller instance reloads them.
Further reading
- ADR-0711: Phase 4b.1 decision record.
- ADR-0709: Phase 4b umbrella architecture.
- ADR-0703: Phase 4a origin (vmafx-server).
- HTTP transport docs: /healthz, /readyz, /metrics, /v1/score.
- k8s deployment guide: Helm chart configuration.