
VMAFX Phase 4b — Distributed Platform Architecture

Status: Proposed (2026-05-28). Locks when ADR-0709 moves to Accepted. See ADR-0709 for the full decision record.

This document describes the target component layout introduced in Phase 4b. The layout shown in the diagram below replaces the single-binary model from Phase 3 (ADR-0701) and Phase 4a (ADR-0702) with a controller/node/operator split designed for horizontal scale on heterogeneous GPU clusters.

Component diagram (Mermaid)

graph TB
    subgraph "Thin clients"
        CLI["vmafx CLI\n(Go)"]
        MCP["vmafx-mcp\n(Go / JSON-RPC)"]
        Tune["vmafx-tune\n(Go)"]
    end

    subgraph "Control plane"
        CTRL["vmafx-controller\n(Go)\ngRPC + HTTP API\nJob queue · Node registry\nScheduler · /metrics · /healthz"]
        OP["vmafx-operator\n(Go · controller-runtime)\nCRDs: VmafxJob\nVmafxNode\nVmafxModelTraining\nHPA reconciler"]
    end

    subgraph "Worker pool — NVIDIA nodes"
        NODE_NV["vmafx-node\n(Go · CUDA EP)\nlibvmaf cgo\nffmpeg subprocess\nGo ONNX Runtime\nrclone mount"]
        SC_NV["training-sidecar\n(Python · PyTorch + Lightning)\ncontinuous fine-tune\nv1: per-node sidecar"]
    end

    subgraph "Worker pool — AMD nodes"
        NODE_AMD["vmafx-node\n(Go · ROCm EP)\nlibvmaf cgo\nffmpeg subprocess\nGo ONNX Runtime\nrclone mount"]
        SC_AMD["training-sidecar\n(Python · PyTorch + Lightning)"]
    end

    subgraph "Worker pool — Intel nodes"
        NODE_INT["vmafx-node\n(Go · OpenVINO EP)\nlibvmaf cgo\nffmpeg subprocess\nGo ONNX Runtime\nrclone mount"]
    end

    subgraph "Storage"
        S3["Object store\nS3 / GCS / Azure Blob\nSFTP / SSH\n(via rclone-mount)"]
        MODELREG["model/ registry\n.onnx + registry.json"]
    end

    subgraph "Kubernetes"
        K8S["k8s API server\nCRD watch"]
    end

    CLI -->|gRPC| CTRL
    MCP -->|gRPC| CTRL
    Tune -->|gRPC| CTRL

    CTRL -->|work items| NODE_NV
    CTRL -->|work items| NODE_AMD
    CTRL -->|work items| NODE_INT

    OP -->|reconcile| K8S
    K8S -->|CRD events| OP
    OP -->|pod lifecycle| CTRL

    NODE_NV -->|triples\n(ref,dis,score,meta)| SC_NV
    NODE_AMD -->|triples| SC_AMD

    SC_NV -->|updated .onnx| MODELREG
    SC_AMD -->|updated .onnx| MODELREG

    NODE_NV -->|rclone-vfs read| S3
    NODE_AMD -->|rclone-vfs read| S3
    NODE_INT -->|rclone-vfs read| S3

    NODE_NV -->|ONNX load| MODELREG
    NODE_AMD -->|ONNX load| MODELREG
    NODE_INT -->|ONNX load| MODELREG
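
The controller-to-node edges in the diagram can be illustrated with a small sketch of a node registry that hands each work item to the least-loaded node in the requested pool. This is only an assumption about how the scheduler might behave: the `Registry`, `Node`, and `Assign` names, the vendor strings, and the least-loaded policy are hypothetical and do not come from ADR-0709.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Vendor names one of the three worker pools in the diagram (hypothetical values).
type Vendor string

const (
	NVIDIA Vendor = "nvidia"
	AMD    Vendor = "amd"
	Intel  Vendor = "intel"
)

// Node is a hypothetical registry entry for one vmafx-node instance.
type Node struct {
	ID       string
	Vendor   Vendor
	InFlight int // work items currently assigned to this node
}

// Registry is a hypothetical in-memory node registry.
type Registry struct {
	nodes map[string]*Node
}

func NewRegistry() *Registry { return &Registry{nodes: make(map[string]*Node)} }

// Register adds a node to the registry with no work assigned.
func (r *Registry) Register(id string, v Vendor) {
	r.nodes[id] = &Node{ID: id, Vendor: v}
}

var ErrNoNode = errors.New("no node registered for pool")

// Assign picks the least-loaded node in the pool (ties broken by ID)
// and records one more in-flight work item against it.
func (r *Registry) Assign(pool Vendor) (string, error) {
	var candidates []*Node
	for _, n := range r.nodes {
		if n.Vendor == pool {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		return "", ErrNoNode
	}
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].InFlight != candidates[j].InFlight {
			return candidates[i].InFlight < candidates[j].InFlight
		}
		return candidates[i].ID < candidates[j].ID
	})
	candidates[0].InFlight++
	return candidates[0].ID, nil
}

func main() {
	r := NewRegistry()
	r.Register("nv-0", NVIDIA)
	r.Register("nv-1", NVIDIA)
	r.Register("amd-0", AMD)
	for i := 0; i < 3; i++ {
		id, _ := r.Assign(NVIDIA)
		fmt.Println(id) // nv-0, nv-1, nv-0
	}
}
```

A real scheduler would also need node heartbeats, completion callbacks that decrement `InFlight`, and locking; those are left out to keep the sketch short.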

Component responsibilities

| Component | Language | Image | Key responsibilities |
| --- | --- | --- | --- |
| `vmafx-controller` | Go | distroless/cc | gRPC + HTTP API, job queue, node registry, scheduler, `/healthz`, `/readyz`, `/metrics` |
| `vmafx-operator` | Go (controller-runtime) | distroless/cc | Watches `VmafxJob` / `VmafxNode` / `VmafxModelTraining` CRDs; reconciles pod lifecycle; drives HPA |
| `vmafx-node` | Go | distroless/cc + ffmpeg + rclone | Pulls work, runs ffmpeg subprocess, scores via libvmaf cgo, AI inference via Go ONNX Runtime, captures training triples |
| `training-sidecar` | Python (PyTorch + Lightning) | pytorch base | Consumes `(ref, dis, score, metadata)` triples from co-located node; continuously fine-tunes ONNX model; writes updated `.onnx` to model registry |
| `vmafx-mcp` | Go | distroless/cc | MCP JSON-RPC server; delegates scoring to controller gRPC API |
| `vmafx-tune` | Go | distroless/cc | Encoder-ladder optimizer; submits jobs to controller |
| `vmafx` CLI | Go | — (binary) | User-facing CLI; submits jobs to controller |

CRD summary

| CRD | Group | Scope | Purpose |
| --- | --- | --- | --- |
| `VmafxJob` | `vmafx.io/v1alpha1` | Namespaced | Describes a scoring / encoding / QA job (source, models, encoder params, target node pool) |
| `VmafxNode` | `vmafx.io/v1alpha1` | Cluster | Describes a node pool (GPU vendor, count, image, resource limits) |
| `VmafxModelTraining` | `vmafx.io/v1alpha1` | Namespaced | Describes a sidecar training run (base model, training config, output target) |

Storage flow (zero-copy via rclone)

Object store (S3 / GCS / Azure Blob / SFTP)
        │
        │  rclone mount / rclone-vfs (FUSE)
        ▼
  /mnt/source/   (inside vmafx-node pod)
        │
        │  POSIX read — no intermediate disk write
        ├──► ffmpeg subprocess (encode → encoded stream)
        └──► libvmaf cgo (score → result JSON)

GPU pool affinity

Node pods are scheduled onto the matching GPU pool via k8s nodeSelector / nodeAffinity, using the vendor resource keys below:

| Vendor | Resource key | Backend |
| --- | --- | --- |
| NVIDIA | `nvidia.com/gpu` | CUDA EP |
| AMD | `amd.com/gpu` | ROCm EP + HIP |
| Intel | `gpu.intel.com/i915` | OpenVINO EP + SYCL |

Each backend runs on whichever GPU device the vendor's device plugin allocates to the pod (per ADR-0701). (The Vulkan backend was removed in ADR-0726.)
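
The vendor table can be expressed as a small lookup that yields the resource entry a node pod would request. This is a sketch under assumptions: the `pools` map, the lowercase vendor names, and the `GPULimits` helper are hypothetical; only the resource keys and backend names are taken from the table.

```go
package main

import (
	"fmt"
	"strconv"
)

// Pool pairs a device-plugin resource key with its execution backend.
type Pool struct {
	ResourceKey string
	Backend     string
}

// pools restates the vendor table; the lowercase map keys are an assumption.
var pools = map[string]Pool{
	"nvidia": {ResourceKey: "nvidia.com/gpu", Backend: "CUDA EP"},
	"amd":    {ResourceKey: "amd.com/gpu", Backend: "ROCm EP + HIP"},
	"intel":  {ResourceKey: "gpu.intel.com/i915", Backend: "OpenVINO EP + SYCL"},
}

// GPULimits returns the resource-name-to-quantity entry for a pod that
// needs count GPUs from the given vendor's pool.
func GPULimits(vendor string, count int) (map[string]string, error) {
	p, ok := pools[vendor]
	if !ok {
		return nil, fmt.Errorf("unknown GPU vendor %q", vendor)
	}
	if count < 1 {
		return nil, fmt.Errorf("GPU count must be at least 1, got %d", count)
	}
	return map[string]string{p.ResourceKey: strconv.Itoa(count)}, nil
}

func main() {
	limits, err := GPULimits("amd", 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(limits) // map[amd.com/gpu:2]
}
```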

Phase 4b sweep sequence

| Phase | Description | Input dependency |
| --- | --- | --- |
| 4b.1 | `vmafx-server` → `vmafx-controller` (job queue, node registry, scheduler) | Phase 4a vmafx-server PR merged |
| 4b.2 | `vmafx-node` Go binary (libvmaf cgo, ffmpeg, Go ONNX Runtime) | vmafx-sys Rust bindings (Phase 4a) |
| 4b.3 | `vmafx-operator` kubebuilder skeleton + CRDs | Phase 4b.1 |
| 4b.4 | ffmpeg latest + `ffmpeg-patches/` bundled in node image | Phase 4b.2 |
| 4b.5 | rclone integration (node distroless layer + mount lifecycle) | Phase 4b.2 |
| 4b.6 | eBPF research digest + ONE concrete optimization | Phase 4b.2 (baseline measurement) |
| 4b.7 | Sidecar training v1 (Python sidecar + triple-capture API) | Phase 4b.2 + Phase 4b.3 |
| 4b.8 | C ABI break + ffmpeg-patches update | Phase 4b.4 |
| 4b.9 | Native build sunset (Docker + Helm only release artifacts) | Phase 4b.8 |

ADR-0709 — umbrella decision record
ADR-0702 — Phase 4a foundation
ADR-0701 — Phase 3 cloud-native redesign
ADR-0699 — Helm chart + k8s manifests
c4-container.md — C4 Level 2 container view (to be updated as Phase 4b components stabilize)