ADR-0779: eBPF FUSE read-path bypass for vmafx-node rclone mounts
- Status: Proposed
- Date: 2026-05-29
- Deciders: lusoris
- Tags: ebpf, node, rclone, performance, phase4b, fork-local
Context
vmafx-node fetches video clips via rclone HTTP-serve, exposed as a FUSE mount under /rclone-mount/. Profiling (Research-0733) showed that small sequential reads (4–64 KiB) through the FUSE daemon add a round-trip through a kernel upcall mechanism: each read(2) on a FUSE-backed file wakes the rclone FUSE daemon, which in turn issues its own HTTP range request. For clips that have already been cached locally by rclone, this daemon hop is pure overhead.
Measured baseline on a 100 MiB sequential read from a warm rclone HTTP-serve cache: ~370 ms p50. Target after bypass: ~10 ms p50 (37× improvement, Research-0733 headline figure).
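As an illustration of how a p50 figure like the one above could be reproduced, the following is a minimal sketch, not the Research-0733 harness. The function names `p50Millis`, `timeSequentialRead`, and `speedup` are hypothetical, and the median definition (averaging the two middle samples for an even count) is an assumption.

```go
package main

import (
	"io"
	"os"
	"sort"
	"time"
)

// p50Millis returns the median of the given latency samples (milliseconds).
// For an even number of samples it averages the two middle values.
// It returns 0 for an empty slice.
func p50Millis(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is not reordered
	sort.Float64s(sorted)
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid]
	}
	return (sorted[mid-1] + sorted[mid]) / 2
}

// timeSequentialRead reads the file at path from start to end in
// chunkSize-byte reads and returns the elapsed wall time in milliseconds.
func timeSequentialRead(path string, chunkSize int) (float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	buf := make([]byte, chunkSize)
	start := time.Now()
	for {
		_, err := f.Read(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			return 0, err
		}
	}
	return float64(time.Since(start).Microseconds()) / 1000.0, nil
}

// speedup returns how many times faster targetMs is than baselineMs.
func speedup(baselineMs, targetMs float64) float64 {
	return baselineMs / targetMs
}
```

Calling `timeSequentialRead` repeatedly against the same file through the FUSE path and through the direct path, then feeding each set of samples to `p50Millis`, would give two medians whose ratio `speedup` reports.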
The bypass is intentionally probe-only: the eBPF program marks file descriptors opened under the configured mount prefix in a shared BPF hash map. The Go-side loader reads that map; the actual "bypass" is the application opening the backing cache file directly and reading it with pread(2) instead of going through FUSE. No kernel memory is modified by the eBPF program.
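The application-side half of this can be sketched as follows. This is a minimal illustration under stated assumptions, not the vmafx-node implementation: `backingPath` and `readDirect` are hypothetical names, and the sketch assumes the local cache directory mirrors the relative layout of the mount, which may not match how rclone actually lays out its cache.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// backingPath maps a path under mountPrefix to the same relative path
// under cacheRoot. ok is false when path is not strictly inside the mount.
// ASSUMPTION: the cache mirrors the mount's relative layout.
func backingPath(mountPrefix, cacheRoot, path string) (string, bool) {
	prefix := strings.TrimSuffix(filepath.Clean(mountPrefix), "/") + "/"
	clean := filepath.Clean(path) // also collapses ".." segments
	if !strings.HasPrefix(clean, prefix) {
		return "", false
	}
	return filepath.Join(cacheRoot, strings.TrimPrefix(clean, prefix)), true
}

// readDirect opens the backing cache file and reads len(buf) bytes at
// offset off. On Linux, (*os.File).ReadAt is implemented with pread(2),
// so the FUSE daemon is not involved. ReadAt returns io.EOF when fewer
// than len(buf) bytes are available.
func readDirect(cachePath string, buf []byte, off int64) (int, error) {
	f, err := os.Open(cachePath)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return f.ReadAt(buf, off)
}
```

A caller would presumably fall back to the ordinary FUSE path whenever `backingPath` reports `false` or `readDirect` fails, for example because the clip is not yet cached.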
The feature is gated behind VMAFX_EBPF_BYPASS=1 (default off) so it can be validated incrementally without affecting production deployments that have not yet confirmed CAP_BPF availability.
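A minimal sketch of such a gate is shown here. The function names are hypothetical, and treating only the exact value `1` as "enabled" is an assumption; the real loader may accept other spellings.

```go
package main

import "os"

// bypassEnvVar is the opt-in variable named in this ADR.
const bypassEnvVar = "VMAFX_EBPF_BYPASS"

// bypassRequested reports whether the bypass was explicitly enabled.
// The lookup function is injected so the check can be exercised without
// touching the process environment.
// ASSUMPTION: only the exact value "1" enables the feature.
func bypassRequested(lookup func(string) (string, bool)) bool {
	v, ok := lookup(bypassEnvVar)
	return ok && v == "1"
}

// bypassRequestedFromEnv applies the check to the real environment.
func bypassRequestedFromEnv() bool {
	return bypassRequested(os.LookupEnv)
}
```

Evaluating the gate once at process start keeps the default-off behaviour: an unset or differently valued variable leaves the FUSE read path untouched.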
Decision
We will ship an eBPF tracepoint-based FD tracker (cmd/vmafx-node/bpf/) that:
- Intercepts sys_enter_openat/sys_exit_openat for paths under the configured mount prefix and records the resulting fd in a BPF hash map.
- Intercepts sys_enter_close to evict the fd from the map on close.
- Emits ring-buffer events so the Go loader's in-process cache stays warm without polling.
- Is compiled from rclone_bypass.bpf.c via bpf2go; a compile-time stub (rclone_bypass_stub.go) makes the package build in CI without a BPF toolchain.
- Is enabled only when VMAFX_EBPF_BYPASS=1 at process start.
The loader uses github.com/cilium/ebpf (v0.21.0) as the Go-side BPF library.
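The open/close tracking and the event-driven cache described in the list above can be modelled in user space as follows. This is a sketch only: the 12-byte little-endian record layout, the op codes, and all type and function names are assumptions made for illustration, not the format emitted by rclone_bypass.bpf.c. In the real loader the raw bytes would presumably come from the ring-buffer reader in github.com/cilium/ebpf.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// Hypothetical event layout (little-endian, 12 bytes):
//   bytes 0..3  pid (uint32)
//   bytes 4..7  fd  (int32)
//   byte  8     op  (1 = open under mount prefix, 2 = close)
//   bytes 9..11 padding
const (
	opOpen    = 1
	opClose   = 2
	eventSize = 12
)

// fdKey identifies a file descriptor within one process.
type fdKey struct {
	PID uint32
	FD  int32
}

// fdEvent is one decoded ring-buffer record.
type fdEvent struct {
	Key fdKey
	Op  uint8
}

var errShortEvent = errors.New("fd event: record shorter than 12 bytes")

// decodeEvent parses one raw record using the assumed layout above.
func decodeEvent(raw []byte) (fdEvent, error) {
	if len(raw) < eventSize {
		return fdEvent{}, errShortEvent
	}
	return fdEvent{
		Key: fdKey{
			PID: binary.LittleEndian.Uint32(raw[0:4]),
			FD:  int32(binary.LittleEndian.Uint32(raw[4:8])),
		},
		Op: raw[8],
	}, nil
}

// fdCache is the loader's in-process view of tracked descriptors.
type fdCache map[fdKey]struct{}

// apply records an fd on open and evicts it on close; other ops are ignored.
func (c fdCache) apply(ev fdEvent) {
	switch ev.Op {
	case opOpen:
		c[ev.Key] = struct{}{}
	case opClose:
		delete(c, ev.Key)
	}
}

// tracked reports whether (pid, fd) is currently marked as FUSE-backed.
func (c fdCache) tracked(pid uint32, fd int32) bool {
	_, ok := c[fdKey{PID: pid, FD: fd}]
	return ok
}
```

Because every open and close arrives as an event, the cache can stay current without the loader repeatedly scanning the BPF hash map.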
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| FUSE passthrough mode (kernel 6.9+ FUSE_PASSTHROUGH) | Zero eBPF complexity; kernel-native | Requires rclone to opt in; rclone upstream does not yet expose FUSE_PASSTHROUGH in stable builds | Not available with current rclone |
| LD_PRELOAD shim intercepting open/read | No kernel privileges needed | Fragile with static binaries and CGo; does not compose well with Go runtime | Too brittle |
| io_uring fixed-file registration | Low syscall overhead for bulk reads | Requires rework of all read paths; no FUSE-specific benefit without the FD tracking layer anyway | Larger scope; orthogonal to the FUSE problem |
| rclone VFS cache + direct path access | Simple; no eBPF | Only works when the clip is 100% cached; partial-cache case still funnels through FUSE | Cannot guarantee full pre-cache in the general case |
Consequences
- Positive: p50 read latency for warm-cache clips is expected to drop from ~370 ms to ~10 ms (37×, per Research-0733) when bypass is active. No FUSE daemon involvement for reads after the initial open.
- Negative:
  - Requires Linux 5.15+ and CAP_BPF (or CAP_SYS_ADMIN).
  - Pod security context must allow privileged: true or a fine-grained seccomp/AppArmor profile that permits BPF syscalls and tracefs access.
  - The compiled .o object must be regenerated when the BPF C source changes (go generate ./cmd/vmafx-node/bpf/); the stub enables CI to pass without the BPF toolchain, but production images must include the real object.
  - VMAFX_EBPF_BYPASS=1 is off by default; operators must explicitly opt in and confirm their kernel version and capability grants.
- Neutral / follow-ups:
  - Helm chart values (deploy/helm/vmafx-node/values.yaml) need an ebpfBypass.enabled flag that sets the env var and adds CAP_BPF to the container security context.
  - A container build step (RUN apt-get install -y clang libbpf-dev linux-headers-$(uname -r)) must be added to the node Dockerfile when VMAFX_EBPF_BYPASS is promoted from experimental to default.
  - The smoke test (TestReadLatencyComparison) should be wired into a manual perf-gate job in CI once a privileged runner is available.
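Since operators are asked to confirm the kernel version before opting in, a preflight check along the following lines could help. It is a sketch with hypothetical function names. It covers only the Linux 5.15+ requirement and does not verify CAP_BPF or tracefs access.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseKernelRelease extracts major and minor from a release string such
// as "5.15.0-91-generic" or "6.9-rc1".
func parseKernelRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimSpace(release), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unrecognised kernel release %q", release)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	// The minor field may carry a suffix ("9-rc1"); keep the leading digits.
	digits := parts[1]
	for i, r := range digits {
		if r < '0' || r > '9' {
			digits = digits[:i]
			break
		}
	}
	if minor, err = strconv.Atoi(digits); err != nil {
		return 0, 0, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	return major, minor, nil
}

// kernelAtLeast reports whether release is at or above wantMajor.wantMinor.
func kernelAtLeast(release string, wantMajor, wantMinor int) (bool, error) {
	major, minor, err := parseKernelRelease(release)
	if err != nil {
		return false, err
	}
	return major > wantMajor || (major == wantMajor && minor >= wantMinor), nil
}

// currentKernelRelease reads the running kernel's release string on Linux.
func currentKernelRelease() (string, error) {
	b, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}
```

A loader could call `kernelAtLeast(release, 5, 15)` with the result of `currentKernelRelease` and leave the bypass off, with a log line, when the check fails or errors.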
References
- Research-0733 (37× p50 latency measurement, rclone FUSE overhead profiling).
- ADR-0709 — Phase 4b distributed platform; eBPF listed as a component.
- ADR-0713 — vmafx-node worker binary design.
- github.com/cilium/ebpf v0.21.0 (Go BPF library).
- Linux kernel docs: Documentation/filesystems/fuse.rst, Documentation/bpf/.
- req: "Implement the eBPF FUSE bypass per Research-0733 (37× p50 latency win)."