ADR-0709: VMAFX Phase 4b — Distributed Video-Quality, Encoding, and ML Platform
- Status: Proposed
- Date: 2026-05-28
- Deciders: Lusoris
- Tags: architecture, go, k8s, operator, controller, node, ffmpeg, rclone, ebpf, onnx, training, abi, platform, phase4b, fork-local
Context
VMAFX reached a phase-transition point with Phase 4a (ADR-0702): the repository now has Go and Rust workspaces, a C++23 migration policy, and the vmafx-server, vmafx-mcp (MCP server), and vmafx-tune binaries in development as independent Go modules.
Phase 4b is the architectural pivot from a single-binary scoring tool to a distributed video-quality, encoding, and ML platform. The forces driving this pivot:
- Scale requirement: batch scoring sweeps (CHUG, K150K, BVI-DVC) currently run as long-lived single processes; horizontal scaling requires a controller/worker split.
- Heterogeneous GPU pools: the fork already supports CUDA, SYCL, HIP, Vulkan, and Metal backends. Scheduling work to the right GPU vendor pool requires a cluster-aware orchestration layer — not ad-hoc `--backend` flags.
- Online training demand: encoding a video is the ideal moment to collect `(ref, dis, score, metadata)` triples for continuous model refinement. The existing Python-only offline training loop in `ai/` cannot consume real-time encoder output.
- Storage costs: materializing full YUV frames to disk before scoring defeats the point of cloud-native deployment. A zero-copy storage layer (rclone-mount) eliminates the intermediate disk write.
- Platform ambition: VMAFX targets production video-quality measurement at CDN scale — encoder-ladder tuning, batch transcoding QA, and real-time quality monitoring. None of these use cases fit the single-binary model.
Phase 4b defines the target architecture. The in-flight Phase 4a agents (vmafx-server, vmafx-mcp, vmafx-tune, vmafx-sys Rust bindings, C++23 internals) finish first; Phase 4b layers on top of their output.
Decision
We will transform VMAFX into a cloud-native distributed platform with the following components:
- `vmafx-controller` (Go) — the cluster brain. Exposes a gRPC + HTTP API and owns the job queue, node registry, and work scheduler. Exposes `/healthz`, `/readyz`, and `/metrics` (Prometheus). The existing `vmafx-server` (in-flight Phase 4a agent) is renamed and extended with controller-specific scope (job queue, node registry) in Phase 4b.1.
- `vmafx-node` (Go) — the execution worker. Pulls work items from the controller, runs encoders (via ffmpeg subprocess), scores via libvmaf (cgo against `bindings/rust/vmafx-sys`), and runs AI inference via Go ONNX Runtime (`onnxruntime-go` with CUDA EP + ROCm EP + OpenVINO EP). Reports results back to the controller. GPU pool affinity via k8s `nodeSelector` / `nodeAffinity` (vendor-keyed: `nvidia.com/gpu`, `amd.com/gpu`, `gpu.intel.com/i915`).
- `vmafx-operator` (Go, `controller-runtime` / kubebuilder) — the Kubernetes Operator. Watches the `VmafxJob`, `VmafxNode`, and `VmafxModelTraining` CRDs, reconciles pod lifecycle, and scales nodes via HPA against queue depth. Deployed alongside the controller in the Helm chart (ADR-0699).
- Thin clients — `vmafx-mcp` and `vmafx-tune` (in-flight Phase 4a agents) are rewired to talk to the controller's gRPC API instead of running libvmaf directly.
- ffmpeg integration — ffmpeg (latest pinned release) is bundled into the `vmafx-node` distroless image. Encoding is done via ffmpeg subprocess; scoring via libvmaf cgo directly. The existing `ffmpeg-patches/` stack continues to apply inside the container.
- rclone storage — rclone is bundled into the node image. The node mounts the source bucket at `/mnt/source` via `rclone mount` or `rclone-vfs`, exposing a POSIX view of S3 / GCS / Azure Blob / SSH / SFTP. ffmpeg and libvmaf read directly from the mount; no intermediate disk materialization.
- eBPF optimizations — research-first. A research digest identifies ONE concrete eBPF optimization target (I/O hot path, scheduling signal, XDP gRPC acceleration, or profiling) with a measurable baseline before any implementation PR ships.
- AI inference in the node — Go ONNX Runtime (`onnxruntime-go`) inside `vmafx-node`. Single Go binary, GPU-aware (CUDA EP, ROCm EP, OpenVINO EP). Same image runs scoring and AI inference. Training continues in `ai/` (Python / PyTorch + Lightning) for now.
- Sidecar training — Python sidecar container (v1) co-located with each node. The Go node captures `(ref, dis, score, metadata)` triples and ships them to the Python sidecar for continuous model refinement via the existing PyTorch + Lightning stack. A dedicated `vmafx-training-node` pool (v2) is deferred until scale demands it.
- C ABI break — the libvmaf public C API is no longer preserved as a stable external contract. The fork rewrites the public API surface toward C++23, Rust, and Go bindings. The in-tree `ffmpeg-patches/` stack is updated in the same PR to consume the new API. Downstream consumers external to this repository are not a constraint.
- Native build scope tightening — libvmaf continues to exist inside the container (the Go node loads it via cgo). The `ffmpeg-patches/` stack continues to apply against ffmpeg-in-container. External publication of native `.so` / `.deb` / `.rpm` packages is intentionally dropped. The user-facing release artifacts are Docker images and the Helm chart only.
Alternatives considered
Architecture pattern
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Single-binary with goroutines | Zero operational complexity; straightforward Go concurrency | No horizontal scale; GPU pool affinity impossible; no k8s lifecycle control | Does not meet scale or multi-vendor GPU requirements |
| Simple job queue (Redis/NATS) + workers | Lower operational overhead than a full Operator | No native k8s integration; requires external queue infrastructure; no CRD-based lifecycle | Adding an Operator on top later is harder than starting with one |
| Controller/node + custom k8s Operator (chosen) | Native k8s experience; CRD lifecycle; HPA; GPU nodeSelector per vendor; Helm-bundled | More moving parts; kubebuilder learning curve | Best fit for production multi-GPU multi-tenant cluster deployment |
| Upstream solutions (Argo Workflows, Tekton) | Mature; avoid custom Operator code | Heavyweight; require significant platform investment; do not understand VMAFX-specific CRDs | Per user direction: "of course this has to be fully connected to a ffmpeg worker as well" — bespoke CRDs are necessary |
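
The HPA-against-queue-depth scaling named in the chosen row comes down to a small calculation. For a metric with an average-value target, the autoscaler's desired replica count is roughly the ceiling of the total metric divided by the per-replica target, clamped to the configured bounds. The sketch below assumes that simplified form; `desiredNodes` and its parameters are hypothetical names, and the actual thresholds are left to Phase 4b.3.

```go
package main

import "fmt"

// desiredNodes returns how many vmafx-node replicas are needed so that each
// replica handles at most targetPerNode queued items, clamped to [minN, maxN].
// It assumes queueDepth is non-negative.
func desiredNodes(queueDepth, targetPerNode, minN, maxN int) int {
	if targetPerNode <= 0 {
		return minN // no usable target: stay at the floor
	}
	n := (queueDepth + targetPerNode - 1) / targetPerNode // integer ceiling
	if n < minN {
		return minN
	}
	if n > maxN {
		return maxN
	}
	return n
}

func main() {
	for _, depth := range []int{0, 95, 1000} {
		fmt.Printf("queue depth %d -> %d nodes\n", depth, desiredNodes(depth, 10, 1, 20))
	}
}
```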
ffmpeg integration
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Separate vmafx-ffmpeg-node worker type | Clean separation; can scale encoder fleet independently | Two worker types to manage; controller complexity | Fold into standard node for v1; split later if scale demands it |
| Fold ffmpeg into vmafx-node runtime (chosen) | Single image; simpler scheduler; ffmpeg and libvmaf share the same process lifecycle | Node image is larger | Simpler v1; per user: "latest of course" ffmpeg pinned |
| Use libavcodec directly instead of ffmpeg subprocess | No subprocess overhead | Significant C complexity; lose ffmpeg filter graph | ffmpeg subprocess reuses the existing ffmpeg-patches integration path |
Storage layer
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Mount k8s PersistentVolumeClaim | Simple; native k8s | Requires CSI driver + provisioner per cloud; forces full materialization to PVC | Cloud-provider lock-in; no zero-copy |
| Copy files to node ephemeral storage | Simplest code path | Wastes disk; wastes cluster I/O budget; not viable at CHUG scale | Not zero-copy |
| rclone-mount / rclone-vfs (chosen) | Zero-copy POSIX view; supports S3, GCS, Azure Blob, SSH, SFTP; one integration point | Adds rclone binary to node image; FUSE mount lifecycle complexity | Per user direction and per recommendation: "use rclone for using files without copying to disk/RAM first" |
eBPF scope
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Skip eBPF entirely | No kernel-level complexity | Leaves observable performance on the table | User explicitly requested eBPF: "if possible do ebpf optimizations" |
| Implement all eBPF use cases (XDP, scheduling, profiling) at once | Maximum optimization coverage | Massive scope; research territory; high risk of over-engineering | Research-first: measure baseline, identify one target, ship one PR |
| Research-first: one concrete target (chosen) | Controlled scope; measurable baseline; reversible | Defers potential gains | Best practice for kernel-level work; per memory file |
AI inference in the node
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Python subprocess for ONNX inference | Reuses existing ai/ Python stack | Per-frame subprocess overhead; two process runtimes per node | Unacceptable latency for real-time scoring |
| C-cgo calling libvmaf DNN path directly | Minimal new dependencies | Bypasses Go-native EP selection; CUDA/ROCm EP harder to wire | Go ONNX Runtime has native EP support |
| Go ONNX Runtime (onnxruntime-go) (chosen) | Single Go binary; GPU EP selection (CUDA, ROCm, OpenVINO) native; same image runs scoring + inference | onnxruntime-go is less mature than C/Python ORT | Per user popup answer: "Inside vmafx-node via Go ONNX Runtime (Recommended)" |
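
The execution-provider selection mentioned in the chosen row can be illustrated without touching the ONNX Runtime bindings. The sketch below deliberately does not call `onnxruntime-go`; it only shows a fallback policy. The preference order (CUDA, then ROCm, then OpenVINO) and the CPU fallback are assumptions, and `selectProvider` is a hypothetical helper.

```go
package main

import "fmt"

// provider names an ONNX Runtime execution provider.
type provider string

const (
	CUDA     provider = "CUDA"
	ROCm     provider = "ROCm"
	OpenVINO provider = "OpenVINO"
	CPU      provider = "CPU"
)

// preference is the assumed order in which accelerated providers are tried.
var preference = []provider{CUDA, ROCm, OpenVINO}

// selectProvider returns the first preferred provider that the node reports
// as available, falling back to CPU when none is.
func selectProvider(available map[provider]bool) provider {
	for _, p := range preference {
		if available[p] {
			return p
		}
	}
	return CPU
}

func main() {
	amdNode := map[provider]bool{ROCm: true}
	fmt.Println(selectProvider(amdNode))
	fmt.Println(selectProvider(nil))
}
```

How availability is detected (device plugins, environment variables, or probing the runtime) is left open here.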
Sidecar training
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Python sidecar container per node (chosen v1) | Reuses existing PyTorch + Lightning stack; pragmatic; no new ML framework | Requires sidecar container lifecycle management; data shipping between Go node and Python sidecar | Chosen for v1; per user: "ml training in python only is wrong — we want sidecar training while encoding" |
| Go-native online learning (Gorgonia / pure-Go SGD) | Single binary; no Python dependency | Go ML ecosystem not viable for full PyTorch fine-tune path | Not viable for full model training |
| Dedicated vmafx-training-node pool | Cleanest separation; dedicated GPU + PyTorch | More moving parts; overkill for v1 scale | Deferred to v2 per recommendation |
C ABI break
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Preserve libvmaf public C ABI | Downstream consumers (FFmpeg mainline, GStreamer, third-party) remain unbroken | Constrains C++23 rewrite; prevents idiomatic Go/Rust/C++ surface | Per user direction: "we rewrite and we update the patches and then we are fine? because I don't care about what others do with my project or not" |
| Break ABI; update ffmpeg-patches (chosen) | Enables idiomatic C++23 + Rust + Go public surface; no legacy compatibility debt | ffmpeg-patches must be updated in the same PR; one-time migration cost | External downstream consumers are not a constraint for this fork |
Native build publishing
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Continue publishing .deb/.rpm/.so | Wider reach for non-k8s users | Maintenance overhead; diverges from container-first mental model | Per user direction: "now that I concentrate on docker images/k8s, do we still need to build native things?" — external packages intentionally dropped |
| Docker images + Helm chart only (chosen) | Single release artifact type; aligns with container-first model | No native package install path | The Go node requires libvmaf inside the container anyway; external packages add cost with no user base |
Implementation plan
The following phase sequence produces mergeable PRs in dependency order. Each phase produces its own ADR + PR. This ADR is the umbrella.
| Phase | Title | Scope |
|---|---|---|
| 4b.1 | vmafx-server → vmafx-controller | Rename post-Phase-4a vmafx-server binary; add job queue (Redis or k8s Job CRDs), node registry, scheduler API |
| 4b.2 | vmafx-node Go binary | New cmd/vmafx-node/ module; libvmaf cgo via vmafx-sys, ffmpeg subprocess, Go ONNX Runtime inference, result reporting |
| 4b.3 | vmafx-operator kubebuilder skeleton | kubebuilder init + CRDs: VmafxJob, VmafxNode, VmafxModelTraining; stub reconcile loops; RBAC |
| 4b.4 | ffmpeg latest bundled in node layer | Pin latest ffmpeg release; apply ffmpeg-patches/ series inside container build; distroless node image update |
| 4b.5 | rclone integration | Bundle rclone in node distroless layer; rclone mount source bucket at /mnt/source; investigate rclone-vfs for true streaming |
| 4b.6 | eBPF research digest + ONE optimization | Research digest: measure baseline (I/O hot path / scheduling / XDP); select one concrete target; implement; gate on measurable improvement |
| 4b.7 | Sidecar training v1 | Python sidecar container spec; Go node triple-capture API; sidecar PyTorch + Lightning continuous training loop; CRD VmafxModelTraining reconciler |
| 4b.8 | C ABI break + ffmpeg-patches update | Public API rewrite to C++23 + Rust + Go surface; ffmpeg-patches/ series updated to consume new API in the same PR |
| 4b.9 | Native build sunset | Release pipeline: publish Docker images + Helm chart only; drop .deb / .rpm / .so publication steps |
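
The "stub reconcile loops" in Phase 4b.3 can be pictured as a pure state-transition function that the operator calls each time it observes a `VmafxJob`. The sketch below is independent of `controller-runtime`; the phase names, the `observed` fields, and `nextPhase` are hypothetical and stand in for whatever the CRD status ends up containing.

```go
package main

import "fmt"

// phase is a hypothetical lifecycle state stored in a VmafxJob's status.
type phase string

const (
	pending   phase = "Pending"
	running   phase = "Running"
	succeeded phase = "Succeeded"
	failed    phase = "Failed"
)

// observed is what the reconciler sees in the cluster for one job.
type observed struct {
	PodExists bool // a worker pod for the job is present
	PodExited bool // the pod's container has terminated
	ExitCode  int  // container exit code, valid when PodExited is true
}

// nextPhase computes the job's next phase from its current phase and the
// observed pod state. It is idempotent: terminal phases never change, and
// calling it twice with the same inputs gives the same result.
func nextPhase(cur phase, o observed) phase {
	switch cur {
	case pending:
		if o.PodExists {
			return running
		}
	case running:
		switch {
		case !o.PodExists:
			return failed // pod vanished before reporting a result
		case o.PodExited && o.ExitCode == 0:
			return succeeded
		case o.PodExited:
			return failed
		}
	}
	return cur
}

func main() {
	p := pending
	for _, o := range []observed{
		{PodExists: true},
		{PodExists: true, PodExited: true, ExitCode: 0},
	} {
		p = nextPhase(p, o)
		fmt.Println(p)
	}
}
```

Keeping the transition logic free of API calls makes it testable without a cluster; the real reconciler would wrap it with the reads and status updates.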
Out of scope
This ADR does NOT cover:
- Specific model architectures — handled by per-model ADRs (e.g., ADR-0682, future tiny-AI model ADRs).
- eBPF specifics — the concrete optimization target, kernel program design, and performance gate are all handled in the Phase 4b.6 research digest and its accompanying ADR.
- Sidecar training algorithm choice — architecture selection (Python sidecar v1 vs dedicated training nodes v2), loss function, fine-tune strategy, and data schema are handled by the Phase 4b.7 research digest.
- External native package publishing — intentionally out of scope per user direction. No `.deb`, `.rpm`, or standalone `.so` publication path. This is a permanent removal, not a deferral.
- Netflix pipeline function backlog — a parallel research digest will inventory Netflix-upstream functions not yet ported to the fork and produce a porting backlog. That backlog integrates into the Phase 4b workstream but is its own research artifact.
- Helm chart GPU node pool specifics — per-vendor `nodeSelector` details and HPA thresholds are handled inside Phase 4b.3 and the Phase 3 Helm chart ADR-0699.
Consequences
Positive:
- Horizontal scaling: add `vmafx-node` pods to process more jobs in parallel.
- k8s-native deployment: CRDs, RBAC, HPA, Helm chart — standard operator pattern.
- GPU-vendor-agnostic pools: controller dispatches to NVIDIA / AMD / Intel nodes by vendor-keyed nodeSelector; same job definition runs on any pool.
- Online learning: sidecar training closes the encode → score → train loop; model quality improves continuously as the platform processes real workloads.
- Zero-copy storage: rclone-mount eliminates intermediate disk writes; significant I/O cost reduction at CHUG / K150K / BVI-DVC scale.
- Idiomatic multi-language public surface: C++23 + Rust + Go instead of a C11 API frozen for downstream compatibility reasons.
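
The vendor-keyed dispatch described in the list above, where a job either targets one GPU pool or can run on any, can be sketched as a small in-memory queue that nodes pull from. This is an illustration only: the `job` and `queue` types are hypothetical, an empty `Pool` is assumed to mean "any pool", and the real controller may back the queue with Redis or k8s Job CRDs as Phase 4b.1 notes.

```go
package main

import (
	"fmt"
	"sync"
)

// job is a hypothetical work item. Pool is a vendor key such as
// "nvidia.com/gpu"; an empty Pool means the job may run on any pool.
type job struct {
	ID   string
	Pool string
}

// queue is a FIFO of pending jobs, safe for concurrent pulls.
type queue struct {
	mu      sync.Mutex
	pending []job
}

// push appends a job to the back of the queue.
func (q *queue) push(j job) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, j)
}

// pull removes and returns the oldest job that the given pool may run.
// It returns false when nothing in the queue matches.
func (q *queue) pull(pool string) (job, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for i, j := range q.pending {
		if j.Pool == "" || j.Pool == pool {
			q.pending = append(q.pending[:i], q.pending[i+1:]...)
			return j, true
		}
	}
	return job{}, false
}

func main() {
	q := &queue{}
	q.push(job{ID: "a", Pool: "nvidia.com/gpu"})
	q.push(job{ID: "b"})
	q.push(job{ID: "c", Pool: "amd.com/gpu"})
	for {
		j, ok := q.pull("amd.com/gpu")
		if !ok {
			break
		}
		fmt.Println("amd node runs", j.ID)
	}
}
```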
Negative:
- Significant engineering effort: controller, node, and operator are three new Go binaries with distinct responsibilities.
- ffmpeg-patches series must be updated when the C ABI break lands (Phase 4b.8); one-time migration cost.
- No external native package install path after Phase 4b.9; users outside Docker / k8s must build from source.
- kubebuilder / controller-runtime learning curve for contributors unfamiliar with Kubernetes Operator patterns.
- rclone FUSE mount adds a kernel-level FUSE dependency inside the node container; must be validated against distroless image constraints.
Neutral / follow-ups:
- The in-flight Phase 4a agents (vmafx-server, vmafx-mcp, vmafx-tune, vmafx-sys Rust bindings, C++23 internals) finish before Phase 4b sweeps start. Each completed agent output becomes an input dependency for the corresponding Phase 4b phase.
- Each Phase 4b.N sweep ships its own child ADR, research digest (where applicable), changelog fragment, and `docs/rebase-notes.md` entry.
- Netflix pipeline function audit runs in parallel as a research-only agent; its output integrates into Phase 4b prioritization.
- Thin clients (vmafx-mcp, vmafx-tune) are rewired to the controller gRPC API in follow-up PRs after Phase 4b.1 lands.
References
Parent ADRs:
- ADR-0686 — VMAFX rebrand and aggressive modernization umbrella (Phase 1 + 2).
- ADR-0701 — VMAFX cloud-native redesign (Phase 3: server-mode, Dockerfile, Helm chart, observability).
- ADR-0702 — VMAFX Phase 4 multi-language modernization foundation (Phase 4a: Go workspace, Rust workspace, C++23 policy).
- ADR-0699 — Helm chart + k8s manifests (Phase 3).
- ADR-0706 — Rust `vmafx-sys` FFI crate (Phase 4a).
Memory files consulted:
- `project_vmafx_phase4b_distributed_platform.md` — locked Phase 4b decisions, popup answers verbatim, in-flight agent status.
- `project_vmafx_k8s_cloud_native.md` — Phase 3 cloud-native redesign decisions.
- `project_vmafx_phase4_language_modernization.md` — Phase 4a language modernization.
- `project_vmafx_rebrand_plan.md` — Phase 1+2 rebrand plan.
Verbatim user popup answers (req):
- `req` — "of course this has to be fully connected to a ffmpeg worker as well (latest of course)... and I think it was (thanks lawrence) that we should use rclone for using files without copying to disk/ram first? and if possible do ebpf optimizations..." (architecture popup, 2026-05-28)
- `req` — "Inside vmafx-node via Go ONNX Runtime (Recommended)" (AI inference popup, 2026-05-28)
- `req` — "ml training in python only is wrong as well -> we want sidecar training while encoding etc.... (look at the 1000 things our software can do)" (training popup, 2026-05-28)
- `req` — "option one but: we also are still missing (i think there was an audit file somewhere) the rests of the netflix pipeline functions" (in-flight agents popup, 2026-05-28)
- `req` — "we rewrite and we update the patches and then we are fine? because I don't care about what others do with my project or not" (C ABI break popup, 2026-05-28)
- `req` — "now that I concentrate on docker images/k8s, do we still need to build native things?... the only thing we still need is the patches for ffmpeg?" (native builds popup, 2026-05-28)