vmaf-perShot — per-shot CRF predictor sidecar
vmaf-perShot is a fork-added CLI binary (T6-3b / ADR-0222) that turns a single YUV reference into a per-shot CRF plan: a CSV or JSON sidecar your encoder can consume to drive content-adaptive bitrate without an ML framework on the encode side.
It does not measure VMAF — its output is an encoder hint, not a quality score. Use vmaf to verify the post-encode VMAF.
Pipeline
ref.yuv ──► vmaf-perShot ──► plan.csv ──► encoder (--zones / --crf table)
                                             │
                                             └──► (post-encode) vmaf to verify VMAF
The plan describes shot boundaries and a recommended per-shot CRF clamped to [--crf-min, --crf-max]. Downstream encoders are free to adjust — every per-shot signal is exposed in the sidecar.
Quick start
# Generate a CSV plan targeting VMAF 90, CRF 18-35.
vmaf-perShot \
  --reference ref.yuv \
  --width 1920 --height 1080 \
  --pixel_format 420 --bitdepth 8 \
  --output plan.csv \
  --target-vmaf 90 \
  --crf-min 18 --crf-max 35

# JSON variant (for pipelines that prefer it).
vmaf-perShot \
  --reference ref.yuv \
  --width 1920 --height 1080 \
  --pixel_format 420 --bitdepth 8 \
  --output plan.json \
  --format json
Required flags
| Flag | Argument | Notes |
|---|---|---|
| -r / --reference | PATH | Planar YUV input. |
| -w / --width | N | Frame width in pixels. |
| -h / --height | N | Frame height in pixels. |
| -p / --pixel_format | 420 \| 422 \| 444 | Planar YUV chroma subsampling. |
| -b / --bitdepth | 8 \| 10 \| 12 \| 16 | Planar YUV bit depth. |
| -o / --output | PATH | Plan destination (CSV or JSON). |
Optional flags
| Flag | Default | Notes |
|---|---|---|
| -t / --target-vmaf | 90 | Target VMAF; reduces CRF as target rises. |
| -m / --crf-min | 18 | Lower CRF clamp. |
| -M / --crf-max | 35 | Upper CRF clamp. |
| -d / --diff-threshold | 12.0 | Shot-detector cutoff (8-bit mean-abs-delta units). |
| -f / --format | csv | csv or json. |
| --help | - | Print usage and exit 0. |
Output format — CSV
shot_id,start_frame,end_frame,frames,mean_complexity,mean_motion,predicted_crf
0,0,3,4,0.000051,0.020046,25.48
1,4,47,44,0.019353,0.016716,24.62
Columns:
- shot_id — zero-based ordinal.
- start_frame / end_frame — inclusive frame range.
- frames — end_frame - start_frame + 1.
- mean_complexity — mean luma sample variance over the shot, normalised to [0, 1] per pixel (then squared into variance units).
- mean_motion — mean absolute frame-to-frame luma delta over the shot, normalised to [0, 1] per pixel.
- predicted_crf — the v1 linear-blend prediction, clamped into [--crf-min, --crf-max].
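A consumer can load the CSV with nothing beyond a standard library. The following is a minimal Python sketch, not part of the fork: the `Shot` class and `load_plan` function are hypothetical names introduced for illustration. It parses the columns listed above and checks that `frames` matches the inclusive frame range.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Shot:
    """One row of the plan (hypothetical container, not a fork API)."""
    shot_id: int
    start_frame: int
    end_frame: int
    frames: int
    mean_complexity: float
    mean_motion: float
    predicted_crf: float


def load_plan(text: str) -> list[Shot]:
    """Parse a vmaf-perShot CSV plan and sanity-check each row."""
    shots = []
    for row in csv.DictReader(io.StringIO(text)):
        shot = Shot(
            shot_id=int(row["shot_id"]),
            start_frame=int(row["start_frame"]),
            end_frame=int(row["end_frame"]),
            frames=int(row["frames"]),
            mean_complexity=float(row["mean_complexity"]),
            mean_motion=float(row["mean_motion"]),
            predicted_crf=float(row["predicted_crf"]),
        )
        # frames is documented as end_frame - start_frame + 1.
        if shot.frames != shot.end_frame - shot.start_frame + 1:
            raise ValueError(f"shot {shot.shot_id}: inconsistent frame count")
        shots.append(shot)
    return shots


# Sample rows copied from the CSV example above.
SAMPLE = """\
shot_id,start_frame,end_frame,frames,mean_complexity,mean_motion,predicted_crf
0,0,3,4,0.000051,0.020046,25.48
1,4,47,44,0.019353,0.016716,24.62
"""

plan = load_plan(SAMPLE)
for shot in plan:
    print(shot.shot_id, shot.start_frame, shot.end_frame, shot.predicted_crf)
```

Keeping every column in the container, rather than only `predicted_crf`, leaves room for a downstream encoder to apply its own adjustment from the raw signals.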
Output format — JSON
{
  "target_vmaf": 90.00,
  "crf_min": 18,
  "crf_max": 35,
  "shots": [
    {"shot_id": 0, "start_frame": 0, "end_frame": 3, "frames": 4,
     "mean_complexity": 0.000051, "mean_motion": 0.020046,
     "predicted_crf": 25.48}
  ]
}
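Because the JSON variant carries the clamp bounds alongside the shots, a consumer can verify the plan before handing it to an encoder. This is a small Python sketch under that assumption; `check_plan` is a hypothetical helper, and the only keys it reads are the ones shown in the example above.

```python
import json


def check_plan(text: str) -> int:
    """Validate a JSON plan and return the total number of frames it covers."""
    plan = json.loads(text)
    crf_min, crf_max = plan["crf_min"], plan["crf_max"]
    total_frames = 0
    for shot in plan["shots"]:
        crf = shot["predicted_crf"]
        # predicted_crf is documented as clamped into [crf_min, crf_max].
        if not crf_min <= crf <= crf_max:
            raise ValueError(f"shot {shot['shot_id']}: CRF {crf} outside clamp")
        total_frames += shot["frames"]
    return total_frames


# Sample document copied from the JSON example above.
SAMPLE = """
{
  "target_vmaf": 90.00,
  "crf_min": 18,
  "crf_max": 35,
  "shots": [
    {"shot_id": 0, "start_frame": 0, "end_frame": 3, "frames": 4,
     "mean_complexity": 0.000051, "mean_motion": 0.020046,
     "predicted_crf": 25.48}
  ]
}
"""

print(check_plan(SAMPLE))
```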
How the CRF prediction works (v1)
The v1 predictor is a transparent linear blend, not a trained model. This keeps the binary debuggable today; v2 (separate ADR) will wire a small MLP once a labelled per-shot CRF corpus exists. See ADR-0222 §Decision for the exact formula.
The intuitions baked in:
- Higher target VMAF → lower CRF (better quality).
- Higher complexity (busier shot) → lower CRF (artefacts more visible).
- Higher motion → higher CRF (motion masks artefacts; saves bits).
- Very short shots (< 24 frames) → smaller motion bonus (rate-control startup amortisation).
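To make the direction of each term concrete, here is an illustrative Python sketch of a linear blend with the same signs as the list above. Every coefficient in it is invented for the example and is not the formula from ADR-0222; only the clamp defaults (18 and 35) and the 24-frame short-shot cutoff come from this page.

```python
# HYPOTHETICAL coefficients: chosen only to show the sign of each term.
# The real v1 formula lives in ADR-0222 and may differ in every constant.
BASE_CRF = 26.0            # assumed CRF at target VMAF 90 with zero signals
VMAF_SLOPE = 0.3           # assumed CRF drop per extra VMAF point
COMPLEXITY_WEIGHT = 40.0   # assumed CRF drop per unit of mean_complexity
MOTION_WEIGHT = 20.0       # assumed CRF rise per unit of mean_motion
SHORT_SHOT_FRAMES = 24     # cutoff stated in the list above
SHORT_SHOT_SCALE = 0.5     # assumed reduction of the motion bonus


def predict_crf(target_vmaf, complexity, motion, frames,
                crf_min=18, crf_max=35):
    """Blend the per-shot signals linearly, then clamp to [crf_min, crf_max]."""
    motion_bonus = MOTION_WEIGHT * motion
    if frames < SHORT_SHOT_FRAMES:
        # Very short shots get a smaller motion bonus.
        motion_bonus *= SHORT_SHOT_SCALE
    crf = (BASE_CRF
           - VMAF_SLOPE * (target_vmaf - 90.0)   # higher target -> lower CRF
           - COMPLEXITY_WEIGHT * complexity      # busier shot -> lower CRF
           + motion_bonus)                       # more motion -> higher CRF
    return min(max(crf, crf_min), crf_max)


baseline = predict_crf(90, complexity=0.01, motion=0.02, frames=48)
print(baseline)
print(predict_crf(95, complexity=0.01, motion=0.02, frames=48))
```

A blend of this shape stays debuggable because each term can be read off the sidecar columns and checked by hand.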
Shot detector (v1 fallback)
The built-in detector is a frame-difference heuristic: a frame's mean absolute luma delta vs. its predecessor is compared against --diff-threshold (8-bit domain, default 12.0). Detected cuts only fire after the running shot has reached 4 frames (suppresses flash / fade flicker).
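The heuristic described above can be sketched in a few lines of Python. This is an approximation for illustration, not the binary's code: it assumes 8-bit luma planes passed as `bytes`, a strict greater-than comparison against the threshold, and hypothetical function names.

```python
def mean_abs_delta(prev: bytes, cur: bytes) -> float:
    """Mean absolute difference between two 8-bit luma planes."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def detect_shots(luma_frames, diff_threshold=12.0, min_shot_frames=4):
    """Return inclusive (start_frame, end_frame) ranges, one per shot."""
    shots = []
    start = 0
    for i in range(1, len(luma_frames)):
        delta = mean_abs_delta(luma_frames[i - 1], luma_frames[i])
        # A cut fires only once the running shot holds min_shot_frames
        # frames, which suppresses flash / fade flicker.
        if delta > diff_threshold and i - start >= min_shot_frames:
            shots.append((start, i - 1))
            start = i
    if luma_frames:
        shots.append((start, len(luma_frames) - 1))
    return shots


# Synthetic 8x8 luma planes: a dark shot followed by a bright one.
dark = bytes([16]) * 64
bright = bytes([200]) * 64
print(detect_shots([dark] * 4 + [bright] * 6))

# A single bright flash inside the first four frames does not split the shot.
print(detect_shots([dark, dark, bright, dark, dark, dark]))
```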
This is intentionally simple. Once the TransNet V2 extractor (T6-3a / ADR-0223) lands, a future revision will accept a pre-computed shot map (--shots PATH) and bypass the detector entirely.
Worked example — feeding x265 --zones
# 1. Build a plan.
vmaf-perShot \
  --reference ref.yuv -w 1920 -h 1080 -p 420 -b 8 \
  --output plan.csv
# 2. Convert plan.csv to x265 --zones syntax (one zone per row, joined by "/").
#    Note: x265's documentation lists only q= and b= as zone options, so
#    confirm that your encoder build accepts crf= before relying on it.
awk -F, 'NR>1 { printf("%s%s,%s,crf=%.0f", sep, $2, $3, $7); sep="/" }' plan.csv \
  > zones.txt
# 3. Encode using the zone string (ref.y4m is assumed to hold the same frames as ref.yuv).
ffmpeg -i ref.y4m -c:v libx265 \
  -x265-params "zones=$(< zones.txt)" out.mp4
# 4. Verify the post-encode VMAF.
vmaf --reference ref.y4m --distorted out.mp4 --output verify.json
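Pipelines that already run Python may prefer to build the zone string without awk. The sketch below performs the same plan-to-zones conversion as step 2, joining zones with "/" and leaving no trailing separator. `plan_to_zones` is a hypothetical helper, and the `crf` option key is taken as-is from the example above rather than from the encoder's documentation.

```python
import csv
import io


def plan_to_zones(plan_csv: str, option: str = "crf") -> str:
    """Turn a vmaf-perShot CSV plan into a start,end,option=value zone string."""
    zones = []
    for row in csv.DictReader(io.StringIO(plan_csv)):
        crf = float(row["predicted_crf"])
        # Round to an integer value, as the awk step does with %.0f.
        zones.append(f'{row["start_frame"]},{row["end_frame"]},{option}={crf:.0f}')
    return "/".join(zones)


# Sample rows copied from the CSV example earlier on this page.
PLAN_CSV = """\
shot_id,start_frame,end_frame,frames,mean_complexity,mean_motion,predicted_crf
0,0,3,4,0.000051,0.020046,25.48
1,4,47,44,0.019353,0.016716,24.62
"""

print(plan_to_zones(PLAN_CSV))
```

Exposing the option key as a parameter makes it straightforward to switch to a key such as `q` if the target encoder only accepts forced-QP zones.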
Input formats
vmaf-perShot reads planar YUV directly with no demuxer. The shot detector and CRF prior use only the luma plane, but the scanner still counts and skips chroma bytes according to --pixel_format so frame iteration stays aligned for 4:2:0, 4:2:2, and 4:4:4 inputs. High-bit-depth inputs (10/12/16-bit) use little-endian 16-bit sample containers, matching the fork's raw-YUV CLI convention.
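The byte accounting this implies can be sketched as follows. This is an illustrative Python version, not the scanner's actual code: `plane_sizes` and `iter_luma` are hypothetical names, and rounding odd dimensions up for subsampled chroma is an assumption.

```python
import io


def plane_sizes(width, height, pixel_format, bitdepth):
    """Return (luma_bytes, chroma_bytes) for one planar YUV frame."""
    # 8-bit samples take one byte; 10/12/16-bit use 16-bit containers.
    bytes_per_sample = 1 if bitdepth == 8 else 2
    if pixel_format == "420":
        cw, ch = (width + 1) // 2, (height + 1) // 2
    elif pixel_format == "422":
        cw, ch = (width + 1) // 2, height
    elif pixel_format == "444":
        cw, ch = width, height
    else:
        raise ValueError(f"unsupported pixel format: {pixel_format}")
    luma = width * height * bytes_per_sample
    chroma = 2 * cw * ch * bytes_per_sample  # U plane plus V plane
    return luma, chroma


def iter_luma(stream, width, height, pixel_format, bitdepth):
    """Yield each frame's luma plane, skipping chroma to stay frame-aligned."""
    luma_size, chroma_size = plane_sizes(width, height, pixel_format, bitdepth)
    while True:
        luma = stream.read(luma_size)
        if len(luma) < luma_size:
            return
        if len(stream.read(chroma_size)) < chroma_size:
            return  # truncated final frame: drop it
        yield luma


print(plane_sizes(1920, 1080, "420", 8))

# Two tiny 4x2 4:2:0 8-bit frames: 8 luma bytes plus 4 chroma bytes each.
data = b"\x01" * 8 + b"\x80" * 4 + b"\x02" * 8 + b"\x80" * 4
print(len(list(iter_luma(io.BytesIO(data), 4, 2, "420", 8))))
```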
Limitations (v1)
- Frame-difference detector misses dissolves / cross-fades — those segments will collapse into a single longer shot. T6-3a / TransNet V2 fixes this once integrated.
- The linear-blend CRF is a static prior, not a trained fit. v2 is planned to ship a small MLP under the same CSV / JSON schema (no consumer break expected).
- Shot table capped at 4096 entries (covers ≈2.3 hours of content at one cut every 2 s); overflow surfaces as ENOSPC.
Related
- ADR-0222 — design + alternatives.
- ADR-0223 — TransNet V2 extractor (T6-3a, in-flight).
- cli.md — the main vmaf scoring CLI.
- docs/ai/roadmap.md §2.4 — the broader per-shot CRF roadmap.