BVI-DVC corpus ingestion → fr_regressor_v2
The BVI-DVC dataset (Ma, Zhang, Bull 2021) is a 4-tier 4:2:0 10-bit YCbCr reference corpus distributed by the Bristol Visual Information Lab. This page documents how to bring BVI-DVC into the fr_regressor_v2 training corpus alongside the existing Netflix Public drop and what to expect from doing so.
See ADR-0310 for the decision record and Research-0082 for the license / overlap / fold-expansion analysis.
1. Dataset overview
BVI-DVC ships reference-only material — no human DMOS scores — across four resolution tiers, encoded in the filename prefix:
| Tier | Resolution | Notes |
|---|---|---|
| A | 3840 × 2176 | UHD, large clip files (~370 MiB each) |
| B | 1920 × 1088 | HD, the most common production resolution |
| C | 960 × 544 | Quarter-HD |
| D | 480 × 272 | Mobile / preview tier, fastest to iterate on |
Content categories include high-motion sports, urban / architectural walks, natural scenes, and a handful of texture-heavy clips. Compared to the Netflix Public drop (9 sources, mostly cinematic film_drama plus two sports / two wildlife), BVI-DVC widens content diversity — particularly on high-motion and texture-heavy material that the Netflix drop under-represents.
Citation:
Di Ma, Fan Zhang, David R. Bull. BVI-DVC: A Training Database for Deep Video Compression. IEEE Transactions on Multimedia, 2021.
2. Where to download
BVI-DVC is distributed by the University of Bristol via Zenodo / the lab's research-data portal. The fork does not redistribute the corpus (license is research-only — see Research-0082 §2). Obtain it from the upstream source and place it locally:
.workingdir2/BVI-DVC Part 1.zip # original archive (~84 GiB)
.workingdir2/bvi-dvc-extracted/ # gitignored extraction target
Both paths are gitignored. The repo never commits BVI-DVC YUV / MP4 data, the extracted parquet, or any cached vmaf JSON — only derived training weights and scripts ship in-tree. This is the same posture the fork already takes for the Netflix Public drop (ADR-0203).
2.1 If you already have the YUVs extracted
If you have already extracted the archive (e.g. a directory of raw .yuv files or the local lossless .mkv bundle), pass --bvi-dir instead of --bvi-zip to skip the streaming-extraction step entirely:
python ai/scripts/bvi_dvc_to_full_features.py \
    --bvi-dir /path/to/bvi-dvc-extracted \
    --tier D \
    --vmaf-bin core/build-cpu/tools/vmaf \
    --out runs/full_features_bvi_dvc_D.parquet
--bvi-dir and --bvi-zip are mutually exclusive. Omitting both falls back to --bvi-zip using the default path or the $VMAF_BVI_DVC_ZIP environment variable. See ADR-0527 for the design rationale.
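The flag handling described above could be sketched with `argparse` as follows. This is a minimal illustration, not the script's actual parser: the function name `parse_source` is hypothetical, and the precedence between `$VMAF_BVI_DVC_ZIP` and the default path is an assumption (the text says "or" without stating which wins).

```python
import argparse
import os

# Default archive location from section 2.
DEFAULT_ZIP = ".workingdir2/BVI-DVC Part 1.zip"


def parse_source(argv, env=None):
    """Return ("dir", path) or ("zip", path) for the chosen input mode.

    Hypothetical helper; only the two flag names and the environment
    variable name come from this page.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # argparse rejects command lines that pass both flags.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--bvi-zip")
    group.add_argument("--bvi-dir")
    args = parser.parse_args(argv)

    if args.bvi_dir is not None:
        return ("dir", args.bvi_dir)
    if args.bvi_zip is not None:
        return ("zip", args.bvi_zip)
    # Neither flag given: fall back to zip mode.
    # Assumption: the environment variable overrides the built-in default.
    return ("zip", env.get("VMAF_BVI_DVC_ZIP", DEFAULT_ZIP))
```

Passing both flags makes `argparse` exit with a usage error, which matches the mutual-exclusion rule without extra validation code.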
Accepted file types in --bvi-dir mode:
| Suffix | Behaviour |
|---|---|
| .yuv | Used directly as the reference; width, height, fps, and bit-depth are parsed from the filename. No intermediate decode step. |
| .mp4 | Decoded to raw YUV via ffmpeg (same as the zip path). |
| .mkv | Decoded to raw YUV via ffmpeg (same path as .mp4; used by the local lossless bundle). |
Files that do not match the BVI-DVC naming convention (<Stem>_<W>x<H>_<fps>fps_<depth>bit_420.<ext>) are skipped with a warning. Files whose resolution does not map to one of the four canonical tiers (A/B/C/D) are also skipped with a warning.
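The filename and tier checks above might look roughly like this. It is a sketch under assumptions: `parse_clip_name` and the regular expression are illustrative rather than the script's real code, the frame rate is assumed to be an integer, and suffix matching is assumed to be case-sensitive. The tier table is the one from section 1.

```python
import re
from pathlib import Path

# (width, height) -> tier letter, from the table in section 1.
TIER_BY_RESOLUTION = {
    (3840, 2176): "A",
    (1920, 1088): "B",
    (960, 544): "C",
    (480, 272): "D",
}

# <Stem>_<W>x<H>_<fps>fps_<depth>bit_420.<ext>
NAME_RE = re.compile(
    r"^(?P<stem>.+)_(?P<w>\d+)x(?P<h>\d+)_(?P<fps>\d+)fps_(?P<depth>\d+)bit_420"
    r"\.(?P<ext>yuv|mp4|mkv)$"
)


def parse_clip_name(path):
    """Return clip metadata, or None if the file should be skipped."""
    match = NAME_RE.match(Path(path).name)
    if match is None:
        return None  # does not follow the naming convention
    width, height = int(match["w"]), int(match["h"])
    tier = TIER_BY_RESOLUTION.get((width, height))
    if tier is None:
        return None  # resolution is not one of the four canonical tiers
    return {
        "stem": match["stem"],
        "width": width,
        "height": height,
        "fps": int(match["fps"]),
        "bit_depth": int(match["depth"]),
        "ext": match["ext"],
        "tier": tier,
    }
```

A caller would log a warning whenever the function returns `None`; the clip name `DExampleClip_480x272_25fps_10bit_420.yuv` (a made-up name) would parse as a tier-D raw YUV.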
3. Pipeline
The end-to-end ingestion is three stages:
.workingdir2/BVI-DVC Part 1.zip   OR    /path/to/bvi-dvc-extracted/
        │                                         │
        │  --bvi-zip (default)                    │  --bvi-dir (ADR-0527)
        └──────────────────────┬──────────────────┘
                               │  (1) feature parquet
                               ▼
            ai/scripts/bvi_dvc_to_full_features.py
                               │  → runs/full_features_bvi_dvc_<tier>.parquet
                               │
                               │  (2) corpus JSONL (fr_regressor_v2 schema)
                               ▼
             ai/scripts/bvi_dvc_to_corpus_jsonl.py
                               │  → runs/bvi_dvc_corpus.jsonl
                               │
                               │  (3) merge with Netflix shard
                               ▼
                  ai/scripts/merge_corpora.py
                               → runs/fr_v2_train_corpus.jsonl
Stage (1) is the per-frame feature parquet consumed by the vmaf_tiny_v* and fr_regressor_v1 trainers. It runs libvmaf with the current FULL_FEATURES pool (25 feature columns as of the SpEED chroma/temporal refresh) and writes one parquet row per frame. This stage already existed in tree; ADR-0310 added the downstream corpus-JSONL reshape.
The stage also writes runs/full_features_bvi_dvc_<tier>.manifest.json by default, or a caller-selected path via --manifest-out. That sidecar records whether the run used --bvi-zip or --bvi-dir, the tier, cache/model/vmaf inputs, CRF/codec recipe, selected clip count, emitted row/column counts, feature order, extractor list, and ADR-0661 run_provenance. Keep it beside the local parquet so later training refreshes can prove which BVI-DVC material was scored.
Stage (2) is new in ADR-0310. It reshapes the BVI-DVC encodes into the vmaf-tune Phase A corpus row schema (CORPUS_ROW_KEYS) that fr_regressor_v2 consumes, emitting one JSONL row per (source, preset, CRF) tuple. This mirrors what vmaf-tune corpus would emit if it had a BVI-DVC adapter.
Stage (3) is the merge utility added by ADR-0310. It de-duplicates by (src_sha256, encoder, preset, crf) so re-runs and overlap with other corpora cannot inflate the training set.
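The de-duplication in stage (3) can be sketched as a first-seen-wins pass over the input shards. This is an illustration only: `merge_rows` is a hypothetical name, the rows are assumed to carry the four key fields under exactly those names, and keeping the first occurrence (rather than the last) is an assumption. The counter names follow the fields of the summary line shown in section 4.

```python
def merge_rows(shards):
    """Merge iterables of corpus rows, keeping the first row seen per key.

    Returns (merged_rows, stats). Hypothetical helper, not the in-tree code.
    """
    seen = set()
    merged = []
    rows_in = 0
    for shard in shards:
        for row in shard:
            rows_in += 1
            # De-duplication key named in the text above.
            key = (row["src_sha256"], row["encoder"], row["preset"], row["crf"])
            if key in seen:
                continue  # duplicate from a re-run or an overlapping corpus
            seen.add(key)
            merged.append(row)
    stats = {
        "rows_in": rows_in,
        "rows_out": len(merged),
        "duplicates": rows_in - len(merged),
        "unique_sources": len({row["src_sha256"] for row in merged}),
    }
    return merged, stats
```

Keying on the source hash rather than the filename means the same clip ingested under two names still collapses to one row per (encoder, preset, crf).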
4. Run command
Once the parquet exists (stage 1) and the JSONL adapter has produced runs/bvi_dvc_corpus.jsonl (stage 2), merge with the Netflix shard:
python ai/scripts/merge_corpora.py \
    --inputs runs/netflix_corpus.jsonl runs/bvi_dvc_corpus.jsonl \
    --output runs/fr_v2_train_corpus.jsonl
The summary line lands on stderr:
[merge_corpora] rows_in=N rows_out=M duplicates=K unique_sources=S \
-> runs/fr_v2_train_corpus.jsonl
The trainer then consumes the merged JSONL exactly as it consumes a single-source corpus today:
python ai/scripts/train_fr_regressor_v2.py \
    --corpus runs/fr_v2_train_corpus.jsonl \
    --epochs 200 --seed 0
5. Expected impact on fr_regressor_v2
The Netflix-only LOSO baseline (see ADR-0303) has 9 folds of ~24 rows each (216 rows total). Adding BVI-DVC's tier-D clips (~120 sources) roughly triples the training corpus and expands the LOSO partition from 9 source-folds to 9 + N folds, where N is the number of BVI-DVC sources retained after dedup.
LOSO methodology is unchanged: each fold holds out one source, trains on the remainder, and reports per-fold PLCC / SROCC / RMSE against the vmaf_v0.6.1 per-frame teacher. Aggregate quality is the mean ± std across folds. This is the same gate ADR-0303 uses for the production-flip decision.
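A leave-one-source-out split and the three per-fold metrics can be sketched with the standard library alone. The names below are hypothetical and this is not the trainer's code; in particular, grouping rows by `src_sha256` is an assumption about which field identifies a source, and SROCC is computed here as the Pearson correlation of average ranks.

```python
import math
from collections import defaultdict


def loso_folds(rows, source_key="src_sha256"):
    """Yield (held_out_source, train_rows, test_rows), one fold per source."""
    by_source = defaultdict(list)
    for row in rows:
        by_source[row[source_key]].append(row)
    for held_out in sorted(by_source):
        train = [r for s, rs in by_source.items() if s != held_out for r in rs]
        yield held_out, train, by_source[held_out]


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom > 0 else float("nan")


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def fold_metrics(pred, target):
    """PLCC, SROCC and RMSE of predictions against the teacher scores."""
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))
    return {
        "plcc": pearson(pred, target),
        "srocc": pearson(ranks(pred), ranks(target)),
        "rmse": rmse,
    }
```

Aggregating is then a matter of collecting `fold_metrics` over every fold and reporting the mean and standard deviation of each metric.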
Ship-gate posture: a corpus expansion that does not raise mean LOSO PLCC by at least one σ above the Netflix-only baseline is not shipped to production weights. The inclusion criterion is empirical; ADR-0310's decision is to make the corpus available for training, not to commit ahead of measurement to a production weights flip.
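Read literally, the gate compares the expanded corpus's mean LOSO PLCC against the baseline mean plus one standard deviation. The sketch below is one possible reading, not the project's gate script: `passes_ship_gate` is a hypothetical name, and taking σ as the sample standard deviation of the baseline's per-fold PLCC values is an assumption, since the text does not say which σ is meant.

```python
import statistics


def passes_ship_gate(baseline_plcc, candidate_plcc):
    """True if the candidate's mean PLCC clears baseline mean + 1 sigma.

    Both arguments are lists of per-fold PLCC values.
    """
    base_mean = statistics.mean(baseline_plcc)
    # Assumption: sigma is the sample std across the baseline folds.
    base_sigma = statistics.stdev(baseline_plcc)
    return statistics.mean(candidate_plcc) >= base_mean + base_sigma
```

With made-up baseline folds of 0.90, 0.92 and 0.94 (mean 0.92, σ 0.02), a candidate would need a mean PLCC of about 0.94 or higher to pass.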
6. Reproducibility
Every step is deterministic given the same input archive. The feature-extraction stage caches per-clip libvmaf JSON under $VMAF_TINY_AI_CACHE_BVI_DVC_FULL (default ~/.cache/vmaf-tiny-ai-bvi-dvc-full/); re-runs with the same archive hit the cache. The corpus-JSONL stage and the merge are pure transforms with no hidden state.
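One way to spot-check the determinism claim locally is to run a stage twice and compare digests of the two outputs. This is a sketch with hypothetical helper names, and it assumes the outputs are byte-identical across runs (stable row order and serialization), which fits the JSONL stages more naturally than file formats that may embed run-specific metadata.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(path_a, path_b):
    """True if two pipeline outputs are byte-identical."""
    return file_sha256(path_a) == file_sha256(path_b)
```

For example, merging the same two shards into two different output paths and calling `outputs_match` on them should return `True` if the merge is a pure transform.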
CI cannot retrain end-to-end (the corpus is non-redistributable); the merge_corpora smoke test under ai/tests/test_merge_corpora.py covers the schema contract on synthetic fixture rows and runs in under one second.