BVI-DVC corpus ingestion → fr_regressor_v2

The BVI-DVC dataset (Ma, Zhang, Bull 2021) is a 4-tier 4:2:0 10-bit YCbCr reference corpus distributed by the Bristol Visual Information Lab. This page documents how to bring BVI-DVC into the fr_regressor_v2 training corpus alongside the existing Netflix Public drop and what to expect from doing so.

See ADR-0310 for the decision record and Research-0082 for the license / overlap / fold-expansion analysis.

1. Dataset overview

BVI-DVC ships reference-only material — no human DMOS scores — across four resolution tiers, encoded in the filename prefix:

Tier  Resolution    Notes
A     3840 × 2176   UHD, large clip files (~370 MiB each)
B     1920 × 1088   HD, the most common production resolution
C     960 × 544     Quarter-HD
D     480 × 272     Mobile / preview tier, fastest to iterate on

Content categories include high-motion sports, urban / architectural walks, natural scenes, and a handful of texture-heavy clips. Compared to the Netflix Public drop (9 sources, mostly cinematic film_drama plus two sports / two wildlife), BVI-DVC widens content diversity — particularly on high-motion and texture-heavy material that the Netflix drop under-represents.

Citation:

Di Ma, Fan Zhang, David R. Bull. BVI-DVC: A Training Database for Deep Video Compression. IEEE Transactions on Multimedia, 2021.

2. Where to download

BVI-DVC is distributed by the University of Bristol via Zenodo / the lab's research-data portal. The fork does not redistribute the corpus (license is research-only — see Research-0082 §2). Obtain it from the upstream source and place it locally:

.workingdir2/BVI-DVC Part 1.zip   # original archive (~84 GiB)
.workingdir2/bvi-dvc-extracted/   # gitignored extraction target

Both paths are gitignored. The repo never commits BVI-DVC YUV / MP4 data, the extracted parquet, or any cached vmaf JSON — only derived training weights and scripts ship in-tree. This is the same posture the fork already takes for the Netflix Public drop (ADR-0203).

2.1 If you already have the YUVs extracted

If you have already extracted the archive (e.g. a directory of raw .yuv files or the local lossless .mkv bundle), pass --bvi-dir instead of --bvi-zip to skip the streaming-extraction step entirely:

python ai/scripts/bvi_dvc_to_full_features.py \
    --bvi-dir /path/to/bvi-dvc-extracted \
    --tier D \
    --vmaf-bin core/build-cpu/tools/vmaf \
    --out runs/full_features_bvi_dvc_D.parquet

--bvi-dir and --bvi-zip are mutually exclusive. Omitting both falls back to --bvi-zip using the default path or the $VMAF_BVI_DVC_ZIP environment variable. See ADR-0527 for the design rationale.
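
The flag resolution described above can be sketched with argparse. This is a minimal illustration, not the script's actual code: the function name parse_source_args, the DEFAULT_ZIP constant (taken from the layout shown in section 2), and the choice to let the environment variable take precedence over the default path are all assumptions.

```python
import argparse
import os

# Assumed default archive location, taken from the layout in section 2;
# the real script's default may differ.
DEFAULT_ZIP = ".workingdir2/BVI-DVC Part 1.zip"


def parse_source_args(argv, env=None):
    """Return ("dir", path) or ("zip", path) for the chosen input mode."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="bvi_dvc_source_sketch")
    # argparse rejects the command line if both flags are supplied.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--bvi-dir")
    group.add_argument("--bvi-zip")
    args = parser.parse_args(argv)

    if args.bvi_dir is not None:
        return ("dir", args.bvi_dir)
    if args.bvi_zip is not None:
        return ("zip", args.bvi_zip)
    # Neither flag given: fall back to zip mode.
    # Assumption: $VMAF_BVI_DVC_ZIP wins over the built-in default.
    return ("zip", env.get("VMAF_BVI_DVC_ZIP", DEFAULT_ZIP))
```

Passing both flags makes argparse exit with a usage error, which matches the mutually exclusive behaviour described above.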

Accepted file types in --bvi-dir mode:

Suffix  Behaviour
.yuv    Used directly as the reference; width, height, fps, and bit-depth are parsed from the filename. No intermediate decode step.
.mp4    Decoded to raw YUV via ffmpeg (same as the zip path).
.mkv    Decoded to raw YUV via ffmpeg (same path as .mp4; used by the local lossless bundle).

Files that do not match the BVI-DVC naming convention (<Stem>_<W>x<H>_<fps>fps_<depth>bit_420.<ext>) are skipped with a warning. Files whose resolution does not map to one of the four canonical tiers (A/B/C/D) are also skipped with a warning.
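
The naming convention and tier mapping can be illustrated with a small parser. This is a sketch under stated assumptions, not the script's implementation: the regular expression, the ClipInfo type, and parse_clip_name are hypothetical names, the frame rate is assumed to be an integer, and the example filename in the usage note is made up. Where the real script logs a warning, this sketch simply returns None.

```python
import re
from typing import NamedTuple, Optional

# Canonical tiers, keyed by (width, height), from the table in section 1.
TIERS = {(3840, 2176): "A", (1920, 1088): "B", (960, 544): "C", (480, 272): "D"}

# <Stem>_<W>x<H>_<fps>fps_<depth>bit_420.<ext>
# Assumption: fps is written as an integer.
NAME_RE = re.compile(
    r"^(?P<stem>.+)_(?P<w>\d+)x(?P<h>\d+)_(?P<fps>\d+)fps_(?P<depth>\d+)bit_420"
    r"\.(?P<ext>yuv|mp4|mkv)$"
)


class ClipInfo(NamedTuple):
    stem: str
    width: int
    height: int
    fps: int
    bit_depth: int
    tier: str


def parse_clip_name(filename: str) -> Optional[ClipInfo]:
    """Parse a clip filename; return None if it should be skipped."""
    m = NAME_RE.match(filename)
    if m is None:
        return None  # does not follow the naming convention
    w, h = int(m["w"]), int(m["h"])
    tier = TIERS.get((w, h))
    if tier is None:
        return None  # resolution is not one of the four canonical tiers
    return ClipInfo(m["stem"], w, h, int(m["fps"]), int(m["depth"]), tier)
```

For example, parse_clip_name("DExampleClip_480x272_25fps_10bit_420.yuv") yields a ClipInfo in tier D, while a 1280x720 file or a file without the resolution/fps/depth fields returns None.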

3. Pipeline

The end-to-end ingestion is three stages:

   .workingdir2/BVI-DVC Part 1.zip    OR    /path/to/bvi-dvc-extracted/
              │                                          │
              │  --bvi-zip (default)                    │  --bvi-dir (ADR-0527)
              └──────────────────────┬──────────────────┘
                                     │  (1) feature parquet
              ai/scripts/bvi_dvc_to_full_features.py
                                     │      → runs/full_features_bvi_dvc_<tier>.parquet
                                     │  (2) corpus JSONL (fr_regressor_v2 schema)
              ai/scripts/bvi_dvc_to_corpus_jsonl.py
                                     │      → runs/bvi_dvc_corpus.jsonl
                                     │  (3) merge with Netflix shard
              ai/scripts/merge_corpora.py
                                           → runs/fr_v2_train_corpus.jsonl

Stage (1) is the per-frame feature parquet consumed by the vmaf_tiny_v* and fr_regressor_v1 trainers. It runs libvmaf with the current FULL_FEATURES pool (25 feature columns as of the SpEED chroma/temporal refresh) and writes one parquet row per frame. This stage already existed in tree; ADR-0310 added the downstream corpus-JSONL reshape.

The stage also writes runs/full_features_bvi_dvc_<tier>.manifest.json by default, or a caller-selected path via --manifest-out. That sidecar records whether the run used --bvi-zip or --bvi-dir, the tier, cache/model/vmaf inputs, CRF/codec recipe, selected clip count, emitted row/column counts, feature order, extractor list, and ADR-0661 run_provenance. Keep it beside the local parquet so later training refreshes can prove which BVI-DVC material was scored.

Stage (2) is new in ADR-0310. It re-shapes the BVI-DVC encodes into the vmaf-tune Phase A corpus row schema (CORPUS_ROW_KEYS) that fr_regressor_v2 consumes. One JSONL row per (source, preset, CRF) tuple, mirroring what vmaf-tune corpus would emit if it had a BVI-DVC adapter.

Stage (3) is the merge utility added by ADR-0310. It de-duplicates by (src_sha256, encoder, preset, crf) so re-runs and overlap with other corpora cannot inflate the training set.
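
The de-duplication rule can be sketched as a keyed first-wins merge. This is an illustration only: merge_rows is a hypothetical helper, and keeping the first occurrence of a duplicate key is an assumption about merge_corpora.py, not documented behaviour.

```python
from typing import Dict, Iterable, List, Tuple

# The four-field key named above.
DEDUP_KEYS = ("src_sha256", "encoder", "preset", "crf")


def merge_rows(shards: Iterable[Iterable[Dict]]) -> Tuple[List[Dict], int]:
    """Merge corpus shards; return (unique rows, number of duplicates dropped)."""
    seen = set()
    merged = []
    rows_in = 0
    for shard in shards:
        for row in shard:
            rows_in += 1
            key = tuple(row[k] for k in DEDUP_KEYS)
            if key in seen:
                continue  # assumption: the first occurrence wins
            seen.add(key)
            merged.append(row)
    return merged, rows_in - len(merged)
```

Because the key includes the source hash rather than the filename, the same clip arriving through two corpora under different names still collapses to a single row per (encoder, preset, crf) combination.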

4. Run command

Once the parquet exists (stage 1) and the JSONL adapter has produced runs/bvi_dvc_corpus.jsonl (stage 2), merge with the Netflix shard:

python ai/scripts/merge_corpora.py \
    --inputs runs/netflix_corpus.jsonl runs/bvi_dvc_corpus.jsonl \
    --output runs/fr_v2_train_corpus.jsonl

The summary line lands on stderr:

[merge_corpora] rows_in=N rows_out=M duplicates=K unique_sources=S \
    -> runs/fr_v2_train_corpus.jsonl

The trainer then consumes the merged JSONL exactly as it consumes a single-source corpus today:

python ai/scripts/train_fr_regressor_v2.py \
    --corpus runs/fr_v2_train_corpus.jsonl \
    --epochs 200 --seed 0

5. Expected impact on fr_regressor_v2

The Netflix-only LOSO baseline (see ADR-0303) has 9 folds of roughly 24 rows each (216 rows total). Adding BVI-DVC's tier-D clips (~120 sources) roughly triples the training corpus and expands the LOSO partition from 9 source-folds to 9 + N folds, where N is the number of BVI-DVC sources retained after dedup.

LOSO methodology is unchanged: each fold holds out one source, trains on the remainder, and reports per-fold PLCC / SROCC / RMSE against the vmaf_v0.6.1 per-frame teacher. Aggregate quality is the mean ± std across folds. This is the same gate ADR-0303 uses for the production-flip decision.
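
The leave-one-source-out partition and the aggregate statistic can be sketched as follows. The helper names loso_folds and aggregate are hypothetical, grouping by the src_sha256 field is an assumption about how the trainer identifies a source, and the use of the sample standard deviation is likewise assumed.

```python
from statistics import mean, stdev
from typing import Dict, Iterator, List, Sequence, Tuple


def loso_folds(rows: Sequence[Dict]) -> Iterator[Tuple[str, List[Dict], List[Dict]]]:
    """Yield (held_out_source, train_rows, test_rows), one fold per source."""
    # Assumption: a "source" is identified by its src_sha256 field.
    sources = sorted({r["src_sha256"] for r in rows})
    for held_out in sources:
        train = [r for r in rows if r["src_sha256"] != held_out]
        test = [r for r in rows if r["src_sha256"] == held_out]
        yield held_out, train, test


def aggregate(per_fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of a per-fold metric such as PLCC."""
    # Assumption: sample standard deviation (n - 1 denominator).
    return mean(per_fold_scores), stdev(per_fold_scores)
```

With 9 sources of 24 rows each, this yields 9 folds, each training on 192 rows and testing on 24.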

Ship-gate posture: a corpus expansion that does not raise mean LOSO PLCC by at least one σ above the Netflix-only baseline is not shipped to production weights. The inclusion criterion is empirical; ADR-0310's decision is to make the corpus available for training, not to commit ahead of measurement to a production weights flip.
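
The gate can be expressed as a one-line comparison. This is a sketch, not the project's gate code: passes_ship_gate is a hypothetical name, and σ is assumed here to be the standard deviation of the Netflix-only baseline's per-fold PLCC, which the text above does not spell out.

```python
from statistics import mean, stdev
from typing import Sequence


def passes_ship_gate(baseline_plcc: Sequence[float],
                     candidate_plcc: Sequence[float]) -> bool:
    """True if the candidate's mean PLCC beats the baseline mean by >= 1 sigma."""
    # Assumption: sigma is the sample std of the baseline's per-fold PLCC.
    threshold = mean(baseline_plcc) + stdev(baseline_plcc)
    return mean(candidate_plcc) >= threshold
```

The two inputs may have different lengths, since the expanded corpus has 9 + N folds while the baseline has 9.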

6. Reproducibility

Every step is deterministic given the same input archive. The feature-extraction stage caches per-clip libvmaf JSON under $VMAF_TINY_AI_CACHE_BVI_DVC_FULL (default ~/.cache/vmaf-tiny-ai-bvi-dvc-full/); re-runs with the same archive hit the cache. The corpus-JSONL stage and the merge are pure transforms with no hidden state.
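
The cache location lookup described above amounts to an environment variable with a home-relative default. A minimal sketch, with cache_dir as a hypothetical helper name:

```python
import os
from pathlib import Path

# Environment variable and default path as documented above.
CACHE_ENV = "VMAF_TINY_AI_CACHE_BVI_DVC_FULL"
CACHE_DEFAULT = "~/.cache/vmaf-tiny-ai-bvi-dvc-full/"


def cache_dir(env=None) -> Path:
    """Resolve the per-clip libvmaf JSON cache directory."""
    env = os.environ if env is None else env
    # Expand a leading "~" so the default resolves under the user's home.
    return Path(env.get(CACHE_ENV, CACHE_DEFAULT)).expanduser()
```

Pointing the variable at a scratch directory is a simple way to force a cold run without deleting the default cache.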

CI cannot retrain end-to-end (the corpus is non-redistributable); the merge_corpora smoke test under ai/tests/test_merge_corpora.py covers the schema contract on synthetic fixture rows and runs in under one second.