Sidecar Online Training¶
The VMAFX sidecar trainer continuously fine-tunes a tiny-AI ONNX model while vmafx-node processes encoding jobs. Every scored job contributes a (features, true_score) training pair that is shipped to a co-located Python sidecar container over a Unix socket. No round-trip to a training cluster is required; with the default settings, a new checkpoint is exported once at least 10 minutes have passed and 1000 new samples have arrived since the previous export.
This surface is part of the VMAFX Phase 4b distributed platform (ADR-0781, ADR-0709). It is separate from the on-host bias-correction sidecar used by vmaf-tune (ADR-0394), which is a single-host, non-k8s, non-ONNX surface.
Quick start¶
Enable the sidecar in your Helm values:
sidecar:
  trainer:
    enabled: true
    image:
      repository: ghcr.io/vmafx/vmafx-sidecar
      tag: latest
    baseModelPath: /mnt/vmafx-models/base/fr_regressor_v3.onnx
    checkpointDir: /mnt/vmafx-models/online
    nFeatures: 80
    resources:
      requests:
        cpu: "0.5"
        memory: "512Mi"
      limits:
        cpu: "2"
        memory: "1Gi"
persistence:
  models:
    enabled: true
    mountPath: /mnt/vmafx-models
The sidecar container starts alongside vmafx-node in the same pod, binds /tmp/vmafx-sidecar.sock, and begins accepting training pairs immediately.
Architecture¶
vmafx-node (Go)                      vmafx-sidecar (Python)
─────────────────                    ──────────────────────────
scorer.Score(ref, dis)       ──→     ReplayBuffer (10 000 cap)
  │ FeedbackClient.Send()              │
  │ (non-blocking, 1000-entry          │ batch (32 samples, 50% replay)
  │ ring buffer)                       ↓
  │                                  SGDEMATrainer.step()
  │                                    │ EMA update (β=0.999)
  │                                    │
  │                                  every 10 min + 1000 new samples:
  │                                    export_onnx() → /mnt/vmafx-models/online/
  │                                    write SHA-256 sidecar
  │
  ↓ next restart: node loads new ONNX checkpoint
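The export condition in the diagram (10 minutes elapsed and 1000 new samples) can be sketched as a small gate. This is an illustration only, not the sidecar's code: the class name `CheckpointGate` and its methods are hypothetical, and it assumes both thresholds must be met before an export, as the defaults for checkpointIntervalSeconds and minSamplesPerCheckpoint suggest.

```python
import time


class CheckpointGate:
    """Hypothetical gate: export only when BOTH thresholds are met."""

    def __init__(self, interval_s=600.0, min_samples=1000, clock=time.monotonic):
        self.interval_s = interval_s
        self.min_samples = min_samples
        self._clock = clock
        self._last_export = clock()
        self._new_samples = 0

    def record_samples(self, n=1):
        """Count training pairs received since the last export."""
        self._new_samples += n

    def should_export(self):
        """True once enough time has passed AND enough new samples arrived."""
        elapsed = self._clock() - self._last_export
        return elapsed >= self.interval_s and self._new_samples >= self.min_samples

    def mark_exported(self):
        """Reset the timer and the sample counter after a checkpoint is written."""
        self._last_export = self._clock()
        self._new_samples = 0
```

Injecting the clock keeps the gate testable without sleeping; a training loop would call `record_samples()` per pair and check `should_export()` after each step.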
Why Unix socket? The vmafx-node and vmafx-sidecar containers share an emptyDir volume mounted at /tmp. Unix domain sockets over a shared volume have lower per-message overhead than TCP loopback for the expected message rate (< 100 pairs/s) and avoid an extra container port declaration.
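To make the transport concrete, here is a minimal sketch of moving one (features, true_score) pair across a Unix domain socket. The wire format is not specified in this document, so the framing below (81 little-endian float32 values per pair, features first, score last) is purely an assumption, as are the helper names. The demo uses `socket.socketpair` so it runs without a socket file.

```python
import socket
import struct

N_FEATURES = 80  # matches the nFeatures default
FRAME_BYTES = 4 * (N_FEATURES + 1)  # assumed frame: 80 features + 1 score, float32


def encode_pair(features, true_score):
    """Pack one training pair as little-endian float32 (assumed format)."""
    if len(features) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
    return struct.pack(f"<{N_FEATURES + 1}f", *features, true_score)


def decode_pair(frame):
    """Inverse of encode_pair: returns (features, true_score)."""
    values = struct.unpack(f"<{N_FEATURES + 1}f", frame)
    return list(values[:-1]), values[-1]


def recv_exact(sock, n_bytes):
    """Read exactly n_bytes from a stream socket, or raise on early close."""
    chunks = []
    remaining = n_bytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before a full frame arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Demonstration over a connected pair of Unix sockets.
node_end, sidecar_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
node_end.sendall(encode_pair([0.5] * N_FEATURES, 87.25))
features, score = decode_pair(recv_exact(sidecar_end, FRAME_BYTES))
node_end.close()
sidecar_end.close()
```

Fixed-size frames avoid a length prefix, which only works because the feature dimension is fixed per deployment.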
Why SGD + EMA? Standard SGD is lower overhead per step than Adam for the tiny two-layer regression heads used here. The EMA shadow (Polyak averaging, beta=0.999) smooths out noisy gradient steps from short-content bursts (ADR-0781 §Decision; Mean Teacher paper, 2017).
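The update rule can be sketched in a few lines. For brevity this example uses a single linear layer rather than the two-layer heads mentioned above, and the function name `sgd_ema_step` is hypothetical; it only illustrates plain SGD followed by an EMA (Polyak) shadow update.

```python
def sgd_ema_step(weights, ema_weights, features, target, lr=1e-4, beta=0.999):
    """One SGD step on squared error for a linear model, then an EMA update."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    # Gradient of 0.5 * error**2 with respect to each weight is error * x.
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    # The shadow weights move a fraction (1 - beta) toward the live weights.
    new_ema = [beta * e + (1.0 - beta) * w for e, w in zip(ema_weights, new_weights)]
    return new_weights, new_ema
```

With beta=0.999 the shadow weights average over roughly the last 1000 steps, so a short burst of unusual content shifts the exported model much less than it shifts the live weights.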
Why a replay buffer? Without replay, a sudden wave of narrow-content jobs (e.g., one hour of HDR animation) overwrites the model's knowledge of other content types. The 10 000-sample ring buffer (≈ 3.2 MB at 80 float32 features/sample) retains ~200 hours of a 50-encode/hour workload.
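A minimal sketch of such a buffer, with the mixed-batch sampling described above. The `mixed_batch` method and its arguments are hypothetical names, and uniform random sampling from the buffer is an assumption; the sidecar's actual sampling strategy is not described here.

```python
import random
from collections import deque

# 10 000 samples x 80 features x 4 bytes per float32 (targets not counted).
BUFFER_BYTES = 10_000 * 80 * 4


class ReplayBuffer:
    """Fixed-capacity ring buffer; the oldest pair is evicted when full."""

    def __init__(self, capacity=10_000, seed=None):
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, pair):
        self._items.append(pair)

    def __len__(self):
        return len(self._items)

    def mixed_batch(self, fresh, batch_size=32, replay_mix=0.5):
        """Newest fresh pairs plus up to replay_mix of the batch from the buffer."""
        n_replay = min(int(batch_size * replay_mix), len(self._items))
        n_fresh = min(batch_size - n_replay, len(fresh))
        replayed = self._rng.sample(list(self._items), n_replay)
        newest = list(fresh[-n_fresh:]) if n_fresh > 0 else []
        return newest + replayed
```

If the buffer holds fewer samples than the replay share asks for, the batch simply contains more fresh pairs.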
Configuration reference¶
All settings are Helm values under sidecar.trainer.* and are passed to the sidecar container as environment variables.
| Helm key | Env var | Default | Description |
|---|---|---|---|
| enabled | — | false | Enable the sidecar container. |
| image.repository | — | — | Sidecar container image. |
| image.tag | — | chart AppVersion | Image tag. |
| baseModelPath | VMAFX_BASE_MODEL_PATH | /mnt/vmafx-models/base/model.onnx | ONNX or PyTorch state-dict to fine-tune. |
| checkpointDir | VMAFX_SIDECAR_CHECKPOINT_DIR | /mnt/vmafx-models/online | Directory for versioned ONNX checkpoint output. |
| replayBufferSize | VMAFX_SIDECAR_REPLAY_CAPACITY | 10000 | Replay buffer capacity (samples). |
| batchSize | VMAFX_SIDECAR_BATCH_SIZE | 32 | Mini-batch size per gradient step. |
| replayMixRatio | VMAFX_SIDECAR_REPLAY_MIX | 0.5 | Fraction of each batch drawn from replay buffer. |
| learningRate | VMAFX_SIDECAR_LR | 0.0001 | SGD/Adam learning rate. |
| emaDecay | VMAFX_SIDECAR_EMA_DECAY | 0.999 | EMA decay beta. |
| checkpointIntervalSeconds | VMAFX_SIDECAR_CKPT_INTERVAL_S | 600 | Minimum seconds between checkpoints. |
| minSamplesPerCheckpoint | VMAFX_SIDECAR_MIN_SAMPLES_CKPT | 1000 | Minimum new samples before checkpoint. |
| nFeatures | VMAFX_SIDECAR_N_FEATURES | 80 | Feature vector dimension from vmafx-node. |
The socket path (VMAFX_SIDECAR_SOCKET) defaults to /tmp/vmafx-sidecar.sock and should not be changed unless the volume mount path also changes.
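For orientation, the environment variables above could be read on the Python side roughly as follows. The `TrainerConfig` dataclass and `load_config` function are hypothetical names used only for this sketch; the variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainerConfig:
    base_model_path: str
    checkpoint_dir: str
    replay_capacity: int
    batch_size: int
    replay_mix: float
    learning_rate: float
    ema_decay: float
    ckpt_interval_s: int
    min_samples_ckpt: int
    n_features: int
    socket_path: str


def load_config(env=None):
    """Read the documented variables, falling back to the table defaults."""
    env = os.environ if env is None else env
    return TrainerConfig(
        base_model_path=env.get("VMAFX_BASE_MODEL_PATH", "/mnt/vmafx-models/base/model.onnx"),
        checkpoint_dir=env.get("VMAFX_SIDECAR_CHECKPOINT_DIR", "/mnt/vmafx-models/online"),
        replay_capacity=int(env.get("VMAFX_SIDECAR_REPLAY_CAPACITY", "10000")),
        batch_size=int(env.get("VMAFX_SIDECAR_BATCH_SIZE", "32")),
        replay_mix=float(env.get("VMAFX_SIDECAR_REPLAY_MIX", "0.5")),
        learning_rate=float(env.get("VMAFX_SIDECAR_LR", "0.0001")),
        ema_decay=float(env.get("VMAFX_SIDECAR_EMA_DECAY", "0.999")),
        ckpt_interval_s=int(env.get("VMAFX_SIDECAR_CKPT_INTERVAL_S", "600")),
        min_samples_ckpt=int(env.get("VMAFX_SIDECAR_MIN_SAMPLES_CKPT", "1000")),
        n_features=int(env.get("VMAFX_SIDECAR_N_FEATURES", "80")),
        socket_path=env.get("VMAFX_SIDECAR_SOCKET", "/tmp/vmafx-sidecar.sock"),
    )
```

Passing a plain dict as `env` makes it easy to check overrides, for example `load_config({"VMAFX_SIDECAR_BATCH_SIZE": "64"})`.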
Checkpoint format¶
Each checkpoint export writes two files atomically:
/mnt/vmafx-models/online/model_v000042.onnx # EMA model
/mnt/vmafx-models/online/model_v000042.onnx.sha256 # SHA-256 digest
The ONNX file uses opset 17 (matching ADR-0249 and the rest of the tiny-AI export stack). Input shape: (batch, n_features). Output shape: (batch, 1).
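One common way to get an atomic two-file export is to write each file to a temporary name in the same directory and rename it into place. The sketch below assumes that approach, and assumes the .sha256 file holds the hex digest followed by a newline; neither detail is confirmed by this document, and the function names are hypothetical.

```python
import hashlib
import os
import tempfile


def _atomic_write(path, data):
    """Write data to a temp file beside path, then rename it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path), prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def write_checkpoint(checkpoint_dir, version, onnx_bytes):
    """Write model_vNNNNNN.onnx, then its .sha256 file; return the model path."""
    model_path = os.path.join(checkpoint_dir, f"model_v{version:06d}.onnx")
    digest = hashlib.sha256(onnx_bytes).hexdigest()
    _atomic_write(model_path, onnx_bytes)
    # The digest is written last, so its presence marks a complete pair.
    _atomic_write(model_path + ".sha256", (digest + "\n").encode("ascii"))
    return model_path
```

Readers never observe a half-written file with this pattern, although the pair itself appears in two steps, which is why the digest goes last.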
To pick up a new checkpoint, set VMAFX_BASE_MODEL_PATH (or the Helm baseModelPath value) to the new checkpoint path and restart vmafx-node. Automated live hot-reload is a follow-up (requires the VmafxModelTraining CRD, Research-0733 §3.4).
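Until hot-reload exists, choosing which checkpoint path to configure is a manual step. A helper along these lines could find the newest checkpoint whose digest checks out. It is a sketch under assumptions: the function name is hypothetical, and the .sha256 file is assumed to start with the hex digest (optionally followed by a file name, as `sha256sum` prints).

```python
import hashlib
import os
import re

_CKPT_RE = re.compile(r"^model_v(\d{6})\.onnx$")


def latest_verified_checkpoint(checkpoint_dir):
    """Return the highest-versioned checkpoint whose digest matches, else None."""
    versions = []
    for name in os.listdir(checkpoint_dir):
        match = _CKPT_RE.match(name)
        if match:
            versions.append((int(match.group(1)), name))
    for _, name in sorted(versions, reverse=True):
        path = os.path.join(checkpoint_dir, name)
        digest_path = path + ".sha256"
        if not os.path.exists(digest_path):
            continue  # export may still be in progress
        with open(digest_path, "r", encoding="ascii") as handle:
            parts = handle.read().split()
        if not parts:
            continue
        with open(path, "rb") as handle:
            actual = hashlib.sha256(handle.read()).hexdigest()
        if actual == parts[0].lower():
            return path
    return None
```

Skipping unverifiable files means a truncated or corrupted export falls back to the previous good version instead of being loaded.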
Kubernetes sidecar lifecycle (k8s 1.29+)¶
On clusters running Kubernetes 1.29 or later the sidecar container is declared with restartPolicy: Always (native sidecar KEP-753 semantics). This means:
- The sidecar starts before the main vmafx-node container.
- The pod's Ready condition waits for the sidecar's readiness probe to pass (Unix socket bound).
- If the sidecar crashes it is restarted without restarting vmafx-node.
On older clusters, where restartPolicy: Always is ignored, the sidecar behaves as a regular container; a crash restarts the whole pod.
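A readiness check for "Unix socket bound" could look like the sketch below, for example run from an exec probe. The function name is hypothetical, and the check is slightly stricter than the condition named above: it verifies that something accepts connections on the path, not merely that the socket file exists.

```python
import socket


def socket_is_ready(path, timeout_s=1.0):
    """Return True if something accepts connections on the Unix socket at path."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout_s)
        try:
            probe.connect(path)
        except OSError:
            # Missing file, stale socket file, refused connection, or timeout.
            return False
    return True
```

A probe script would call `socket_is_ready("/tmp/vmafx-sidecar.sock")` and exit non-zero when it returns False.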
Go-side metrics¶
FeedbackClient exposes two counters available via the node's Prometheus metrics endpoint (when instrumented):
| Counter | Description |
|---|---|
| vmafx_training_feedback_dropped_total | Messages dropped due to queue overflow or sidecar unavailability. |
| vmafx_training_feedback_delivered_total | Messages successfully delivered to the sidecar. |
A growing dropped counter means the sidecar is not keeping up with the scoring rate or is unavailable. If the sidecar is running but falling behind, increase resources.limits.cpu for the sidecar or reduce batchSize.
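The relationship between the bounded queue and the two counters can be illustrated in Python. This is not the Go FeedbackClient: the class below is a hypothetical stand-in, and it assumes that a full queue rejects the newest message rather than overwriting the oldest.

```python
from collections import deque


class FeedbackQueue:
    """Illustrative non-blocking, bounded send queue with drop accounting."""

    def __init__(self, capacity=1000):
        self._capacity = capacity
        self._pending = deque()
        self.dropped_total = 0
        self.delivered_total = 0

    def send(self, pair):
        """Never blocks: returns False and counts a drop when the queue is full."""
        if len(self._pending) >= self._capacity:
            self.dropped_total += 1
            return False
        self._pending.append(pair)
        return True

    def drain(self, deliver):
        """Deliver queued pairs; a pair whose delivery fails counts as dropped."""
        while self._pending:
            pair = self._pending.popleft()
            try:
                deliver(pair)
            except OSError:
                self.dropped_total += 1
            else:
                self.delivered_total += 1
```

Under this model the scoring path never waits on the sidecar; the cost of a slow or absent sidecar shows up only as lost training pairs.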
Limitations (v1)¶
- No live hot-reload: the node loads the new checkpoint only on restart. The VmafxModelTraining CRD + controller (atomic session swap via atomic.Pointer) is the follow-up.
- CPU training only by default: CUDA training available via VMAFX_SIDECAR_CUDA=1 but requires the node pod to have excess GPU budget.
- Single base model per sidecar: per-tenant adapter heads (LoRA) are deferred to v2.
- No stability gate: a checkpoint that regresses PLCC by more than 0.005 is not automatically quarantined in v1. Planned per Research-0733 §3.4.
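Until a stability gate ships, operators can check a candidate checkpoint by hand. The sketch below shows the kind of comparison the last limitation refers to: compute PLCC (Pearson linear correlation) for the baseline and candidate predictions on a held-out set, and flag a drop of more than 0.005. It is a hypothetical illustration, not the planned design; both function names are made up for this example.

```python
import math


def plcc(predicted, observed):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("PLCC is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


def regresses(baseline_plcc, candidate_plcc, tolerance=0.005):
    """True when the candidate's PLCC is more than tolerance below the baseline."""
    return (baseline_plcc - candidate_plcc) > tolerance
```

A candidate for which `regresses(...)` returns True should not be promoted to baseModelPath.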
See also¶
- ADR-0781 — design rationale
- Research-0733 — architecture evaluation
- ADR-0394 — on-host vmaf-tune predictor sidecar (different surface)
- ADR-0249 — ONNX export opset constraints
- docs/ai/local-sidecar-training.md — vmaf-tune on-host ridge sidecar