MOS-corpus ingestion family

The VMAFx fork trains no-reference and mixed-reference VQA models against human Mean Opinion Score labels. Several public video-quality corpora are supported. Each corpus ships its own adapter script that produces a corpus JSONL shard; the shards are then unified via ai/scripts/aggregate_corpora.py before the trainer consumes them.

This page is the index for the entire family. Follow the per-corpus links for acquisition steps, operator flags, and schema details.

Available corpora

| Corpus | Clips | MOS scale | Size (approx.) | Adapter script | Per-corpus doc |
| --- | --- | --- | --- | --- | --- |
| KonViD-1k | 1 200 | 1–5 ACR Likert | ~2.3 GB | ai/scripts/konvid_1k_to_corpus_jsonl.py | konvid-1k-ingestion.md |
| KonViD-150k | ~150 000 | 1–5 ACR Likert | ~120–200 GB | ai/scripts/konvid_150k_to_corpus_jsonl.py | konvid-150k-ingestion.md |
| LSVQ | ~39 000 | 1–5 ACR Likert | ~500 GB (whole) | ai/scripts/lsvq_to_corpus_jsonl.py | lsvq-ingestion.md |
| YouTube UGC | ~1 500 | 1–5 ACR Likert | ~2 TB (whole) | ai/scripts/youtube_ugc_to_corpus_jsonl.py | youtube-ugc-ingestion.md |
| Waterloo IVC 4K-VQA | 1 200 | 0–100 continuous | multi-TB (whole) | ai/scripts/waterloo_ivc_to_corpus_jsonl.py | waterloo-ivc-4k-ingestion.md |
| LIVE-VQC | 585 | 0–100 continuous | ~few GB | ai/scripts/live_vqc_to_corpus_jsonl.py | live-vqc-ingestion.md |
| CHUG UGC-HDR | 5 992 | 0–100 continuous, mapped to 1–5 at ingest | tens of GB | ai/scripts/chug_to_corpus_jsonl.py | chug-ingestion.md |
| BVI-DVC (no-MOS FR shard) | ~120+ | n/a (no human MOS) | ~84 GiB archive | ai/scripts/bvi_dvc_to_corpus_jsonl.py | bvi-dvc-corpus-ingestion.md |

BVI-DVC is a reference-only corpus without human MOS labels. It feeds the fr_regressor_v2 encode-grid trainer via ai/scripts/merge_corpora.py rather than the MOS-aggregation path; it is listed here for completeness because merge_corpora.py and aggregate_corpora.py are sibling utilities (see multi-corpus-aggregation.md §3).

Citations

| Corpus | Citation |
| --- | --- |
| KonViD-1k | Hosu, Hahn, Jenadeleh, Lin, Men, Szirányi, Li, Saupe. The Konstanz natural video database (KoNViD-1k). QoMEX 2017. http://database.mmsp-kn.de |
| KonViD-150k | Götz-Hahn, Hosu, Lin, Saupe. KonVid-150k: A Dataset for No-Reference Video Quality Assessment of Videos in-the-Wild. IEEE Access 2021. https://database.mmsp-kn.de/konvid-150k-vqa-database.html |
| LSVQ | Ying, Mandal, Ghadiyaram, Bovik. Patch-VQ: 'Patching Up' the Video Quality Problem. CVPR 2021. https://github.com/baidut/PatchVQ |
| YouTube UGC | Wang, Inguva, Adsumilli. YouTube UGC Dataset for Video Compression Research. MMSP 2019. https://research.google/pubs/youtube-ugc-dataset-for-video-compression-research/ |
| Waterloo IVC 4K-VQA | Li, Duanmu, Liu, Wang. 4K-VQA: A 4K Video Quality Assessment Database. ICIAR 2019. https://ivc.uwaterloo.ca/database/4KVQA.html |
| LIVE-VQC | Sinno, Bovik. Large-Scale Study of Perceptual Video Quality. IEEE TIP 2019. https://live.ece.utexas.edu/research/LIVEVQC/ |
| CHUG | Saini, Bovik, Birkbeck, Wang, Adsumilli. CHUG: Crowdsourced User-Generated HDR Video Quality Dataset. ICIP 2025. https://doi.org/10.1109/ICIP55913.2025.11084488 |
| BVI-DVC | Ma, Zhang, Bull. BVI-DVC: A Training Database for Deep Video Compression. IEEE TMM 2021. |

Output schema — corpus JSONL

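Every adapter writes one JSON object per line. The example below is pretty-printed and annotated for readability; real shard lines carry no comments or line breaks. These fields are present in every shard produced by the MOS-labelled adapters above (the BVI-DVC shard uses the vmaf-tune schema instead):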

{
  "src":               "clip_or_basename.mp4",   // filename within the dataset
  "src_sha256":        "<64-hex>",               // SHA-256 of clip bytes, 1 MiB chunks
  "src_size_bytes":    1234567,
  "width":             1920,
  "height":            1080,
  "framerate":         30.0,
  "duration_s":        8.0,
  "pix_fmt":           "yuv420p",
  "encoder_upstream":  "h264",                   // codec reported by ffprobe
  "mos":               3.42,                     // native-scale MOS (NOT normalised here)
  "mos_std_dev":       0.51,                     // inter-rater std; 0.0 if unpublished
  "n_ratings":         50,                       // number of crowdworker ratings
  "corpus":            "konvid-1k",              // stable corpus label
  "corpus_version":    "konvid-1k-2017",         // dataset release identifier
  "ingested_at_utc":   "2026-05-08T12:00:00+00:00"
}

Key invariants:

  • src_sha256 is the deduplication key across corpora. It is computed by the adapter, never taken from the dataset's own metadata.
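  • mos is recorded verbatim from the dataset's native scale at ingest time; normalisation to a unified axis is the aggregator's job. CHUG is an exception: its adapter keeps the native 0–100 value in mos_raw_0_100 and writes a value mapped onto [1, 5] to mos (see the CHUG quick-start below).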
  • mos_std_dev of 0.0 signals that the dataset did not publish inter-rater spread (distinct from a real zero spread).
  • The schema is disjoint from the vmaf-tune Phase A CORPUS_ROW_KEYS row (no vmaf_score, encoder, preset, crf). Mixing them requires the appropriate merge utility — see §Combining corpora below.
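The invariants above can be checked mechanically before a shard is handed to the aggregator. The following is a minimal sketch, not the fork's own validator: check_row, std_dev_or_none, and REQUIRED_KEYS are hypothetical names, the key list is copied from the example row above, and the lowercase-hex check assumes the adapter writes a standard hex digest.

```python
import json
import re

# Keys copied from the example row above; this set is an illustration,
# not something imported from the fork's code.
REQUIRED_KEYS = {
    "src", "src_sha256", "src_size_bytes", "width", "height", "framerate",
    "duration_s", "pix_fmt", "encoder_upstream", "mos", "mos_std_dev",
    "n_ratings", "corpus", "corpus_version", "ingested_at_utc",
}
# Assumption: digests are written as 64 lowercase hex characters.
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def check_row(line):
    """Parse one JSONL line and return a list of problems (empty if clean)."""
    row = json.loads(line)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - row.keys())]
    if not SHA256_HEX.fullmatch(str(row.get("src_sha256", ""))):
        problems.append("src_sha256 is not 64 lowercase hex characters")
    return problems


def std_dev_or_none(row):
    """Treat the 0.0 sentinel as 'spread not published', not as zero spread."""
    return None if row["mos_std_dev"] == 0.0 else row["mos_std_dev"]
```

Returning a list of problems rather than raising on the first one makes it easier to report every defect in a long shard in a single pass.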

Combining corpora

MOS-labelled corpora → unified training JSONL

Use ai/scripts/aggregate_corpora.py (PR #518, ADR-0340). It normalises each shard to a common 0–100 axis, deduplicates by src_sha256, and emits a unified JSONL the v2 trainer can consume directly:

python ai/scripts/aggregate_corpora.py \
    --inputs .workingdir2/konvid-150k/konvid_150k.jsonl \
             .workingdir2/lsvq/lsvq.jsonl \
             .workingdir2/waterloo-ivc-4k/waterloo_ivc_4k.jsonl \
             .workingdir2/youtube-ugc/youtube_ugc.jsonl \
    --output .workingdir2/aggregated/unified_corpus.jsonl

The command writes .workingdir2/aggregated/unified_corpus.manifest.json by default. That manifest records source-shard hashes, MOS scale conversions, dedup counters, corpus-source overrides, and ADR-0661 run_provenance. Use --manifest-out PATH when the JSONL and manifest need to live in a separate experiment bundle.

Each source adapter writes the same kind of replay sidecar before aggregation. By default chug_to_corpus_jsonl.py, konvid_1k_to_corpus_jsonl.py, konvid_150k_to_corpus_jsonl.py, youtube_ugc_to_corpus_jsonl.py, lsvq_to_corpus_jsonl.py, live_vqc_to_corpus_jsonl.py, and waterloo_ivc_to_corpus_jsonl.py write <output>.manifest.json; pass --manifest-out PATH when storing a dated experiment bundle. The source manifests record the corpus root, manifest CSV, progress path, row caps, written/skipped/dedup counters, and ADR-0661 run_provenance.

For a one-command discover-then-train workflow:

bash ai/scripts/run_aggregated_training.sh

See multi-corpus-aggregation.md for the full scale-conversion table, dedup policy, and failure-mode reference.
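As a rough illustration of what "normalise, then deduplicate" means, the sketch below maps each MOS linearly from its native range onto 0–100 and keeps the first row seen per src_sha256. It is not the code in aggregate_corpora.py: the linear map, the first-row-wins policy, the function names, and the "waterloo-ivc-4k" corpus label are all assumptions made for the example. The authoritative conversion table and dedup policy are in multi-corpus-aggregation.md.

```python
def to_unified(mos, lo, hi):
    """Linearly map a MOS from its native [lo, hi] range onto 0-100."""
    return (mos - lo) / (hi - lo) * 100.0


def aggregate(rows, scales):
    """Normalise each row and keep the first row seen for each src_sha256.

    `scales` maps a corpus label to its native (lo, hi) MOS range.
    """
    seen = set()
    unified = []
    for row in rows:
        if row["src_sha256"] in seen:
            continue  # same clip bytes already emitted by an earlier row
        seen.add(row["src_sha256"])
        lo, hi = scales[row["corpus"]]
        unified.append({**row, "mos": to_unified(row["mos"], lo, hi)})
    return unified


rows = [
    {"src_sha256": "aa", "corpus": "konvid-1k", "mos": 3.0},
    {"src_sha256": "aa", "corpus": "konvid-1k", "mos": 3.0},  # duplicate bytes
    {"src_sha256": "bb", "corpus": "waterloo-ivc-4k", "mos": 62.0},  # label assumed
]
scales = {"konvid-1k": (1.0, 5.0), "waterloo-ivc-4k": (0.0, 100.0)}
unified = aggregate(rows, scales)  # two rows survive; the 1-5 score of 3.0 lands at 50
```

With a (0, 100) native range the linear map reduces to the identity, which matches the behaviour described for the 0–100 corpora.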

Feature tables + MOS labels → MOS-head parquet

Real MOS-head training tables must already carry mos or mos_raw_0_100. When a refreshed feature parquet only contains extracted metrics, join the labels first with ai/scripts/materialize_mos_labels.py:

.venv/bin/python ai/scripts/materialize_mos_labels.py \
    --features runs/full_features_konvid_refresh_20260520_with_folds.parquet \
    --labels .corpus/konvid-150k/konvid_150k.jsonl \
    --feature-key-column key \
    --label-key-column src \
    --feature-key-regex '([0-9]{6,})' \
    --label-key-regex '([0-9]{6,})' \
    --out runs/full_features_konvid_refresh_20260520_with_mos.parquet \
    --audit-json runs/full_features_konvid_refresh_20260520_with_mos.audit.json

The materializer fails by default below 95% unique-key coverage and refuses to overwrite existing MOS columns unless --overwrite is passed. See mos-label-materializer.md for the key-matching and audit schema.
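The regex flags in the command above reduce both sides to a shared numeric key before joining. The sketch below shows one plausible reading of that step and of the 95% gate. It is illustrative only: extract_key and coverage are hypothetical names, and the exact definition of "unique-key coverage" used by materialize_mos_labels.py is documented in mos-label-materializer.md, not here.

```python
import re


def extract_key(value, pattern):
    """Return the first capture group of `pattern` found in `value`, or None."""
    match = re.search(pattern, value)
    return match.group(1) if match else None


def coverage(feature_keys, label_srcs, pattern=r"([0-9]{6,})"):
    """Fraction of unique feature keys whose extracted key also appears in the labels."""
    label_keys = {extract_key(src, pattern) for src in label_srcs} - {None}
    unique = set(feature_keys)
    if not unique:
        return 0.0
    matched = sum(1 for key in unique if extract_key(key, pattern) in label_keys)
    return matched / len(unique)


ratio = coverage(
    ["orig_1234567_540p", "orig_7654321_540p"],  # feature-table keys (made up)
    ["1234567.mp4"],                             # label "src" values (made up)
)
if ratio < 0.95:
    print(f"coverage {ratio:.0%} is below the 95% gate")
```

Feature keys that yield no regex match count as uncovered in this sketch, so a bad pattern shows up as low coverage rather than being silently ignored.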

Encode-grid corpora (Netflix + BVI-DVC) → FR-regressor JSONL

Use ai/scripts/merge_corpora.py (PR #310). This utility operates on the vmaf-tune Phase A CORPUS_ROW_KEYS schema and deduplicates by (src_sha256, encoder, preset, crf):

python ai/scripts/merge_corpora.py \
    --inputs runs/netflix_corpus.jsonl runs/bvi_dvc_corpus.jsonl \
    --output runs/fr_v2_train_corpus.jsonl

This command writes runs/fr_v2_train_corpus.manifest.json by default with the input shard hashes, required vmaf-tune corpus keys, the natural dedup key (src_sha256, encoder, preset, crf), summary counters, and ADR-0661 run_provenance. Pass --manifest-out PATH to place the sidecar elsewhere.

The older BVI-DVC JSONL adapter follows the same convention: direct bvi_dvc_to_corpus_jsonl.py runs write <output>.manifest.json with cache inputs, row counts, adapter labels, row schema version, and run_provenance. The adapter emits the current vmaf-tune v3 additive columns, with explicit unavailable defaults for HDR, shot, canonical-feature aggregates, and encoder-internal signals when those fields are not present in the cached libvmaf JSON.
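The composite dedup key used on the encode-grid path differs from the MOS path because the same source clip legitimately appears once per (encoder, preset, crf) cell. A minimal sketch of that idea follows. merge_rows is a hypothetical name, and keeping the first row per key is an assumption rather than a statement about merge_corpora.py.

```python
def merge_rows(shards):
    """Concatenate shards, keeping the first row per (src_sha256, encoder, preset, crf)."""
    seen = set()
    merged = []
    for shard in shards:
        for row in shard:
            key = (row["src_sha256"], row["encoder"], row["preset"], row["crf"])
            if key in seen:
                continue  # same source encoded with the same settings
            seen.add(key)
            merged.append(row)
    return merged


shard_a = [
    {"src_sha256": "aa", "encoder": "libx264", "preset": "medium", "crf": 23},
    {"src_sha256": "aa", "encoder": "libx264", "preset": "medium", "crf": 28},
]
shard_b = [
    {"src_sha256": "aa", "encoder": "libx264", "preset": "medium", "crf": 23},  # repeat
]
merged = merge_rows([shard_a, shard_b])  # the crf 23 and crf 28 rows both survive
```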

Shared ingestion infrastructure (ADR-0371)

All MOS-corpus adapter scripts share a common base class defined in ai/src/corpus/base.py (PYTHONPATH=ai/src). The base class provides:

  • sha256_file(path) — SHA-256 computed in 1 MiB chunks (dedup key).
  • probe_geometry(clip_path, ...) — ffprobe wrapper; returns a dict with width, height, framerate, duration_s, pix_fmt, encoder_upstream, or None on probe failure. Injected via the runner kwarg for unit tests.
  • load_progress / save_progress / mark_done / mark_failed / should_attempt — atomic tempfile-rename progress state (JSON) so multi-hour runs are safe to Ctrl-C and resume.
  • read_sha_index(jsonl_path) — builds a set[str] of already-ingested src_sha256 values from a partially-written JSONL so re-runs skip duplicates.
  • download_clip(...) — curl-based download with configurable timeout, returning (ok, reason). Also injectable via runner.
  • RunStats — dataclass accumulating written, skipped_download, skipped_broken, dedups with a computed attrition_pct.
  • CorpusIngestBase — abstract base class. Subclass, set corpus_label, implement iter_source_rows(clips_dir) -> Iterator[(clip_path, row_dict)], and call ingest.run().
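Two of the helpers listed above, chunked hashing and atomic progress writes, follow well-known patterns that can be sketched with the standard library alone. The functions below are stand-ins written for illustration: sha256_in_chunks and atomic_write_json are hypothetical names, and the real sha256_file and save_progress in ai/src/corpus/base.py may differ in signature and detail.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

CHUNK = 1024 * 1024  # 1 MiB, the chunk size named in the list above


def sha256_in_chunks(path):
    """Hash a file without loading it into memory all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def atomic_write_json(path, state):
    """Write JSON to a temp file in the same directory, then rename over the target."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Creating the temporary file in the target's own directory keeps the final rename on one filesystem, which is what lets an interrupted run leave the previous progress file intact.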

Adding a new MOS corpus:

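# ai/scripts/my_corpus_to_corpus_jsonl.py
import csv

from corpus.base import CorpusIngestBase


def parse_my_csv(manifest_path):
    """Hypothetical helper: yield one dict per manifest row."""
    with open(manifest_path, newline="", encoding="utf-8") as fh:
        yield from csv.DictReader(fh)


class MyCorpusIngest(CorpusIngestBase):
    corpus_label = "my-corpus"  # stable label written to each row's "corpus" field

    def iter_source_rows(self, clips_dir):
        # Assumption: manifest.csv sits next to the clips directory.
        for row in parse_my_csv(clips_dir.parent / "manifest.csv"):
            yield clips_dir / row["filename"], row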

See ADR-0371 and the unit tests at ai/tests/test_corpus_base.py for the full contract.

Per-corpus quick-start commands

KonViD-1k (1 200 clips, ~2.3 GB, ~5 min)

# 1. Fetch + extract (idempotent — skips completed files).
python ai/scripts/fetch_konvid_1k.py
#    → ~/datasets/konvid-1k/KoNViD_1k_videos/  (default location)
#    → ~/datasets/konvid-1k/fetch_manifest.json

# 2. Convert to JSONL.
python ai/scripts/konvid_1k_to_corpus_jsonl.py
#    → .workingdir2/konvid-1k/konvid_1k.jsonl

# Smoke (5 clips only):
python ai/scripts/konvid_1k_to_corpus_jsonl.py --max-rows 5

KonViD-150k (~150 000 clips, ~120–200 GB)

# Drop manifest.csv first:  https://database.mmsp-kn.de/konvid-150k-vqa-database.html

python ai/scripts/konvid_150k_to_corpus_jsonl.py
#    → .workingdir2/konvid-150k/konvid_150k.jsonl
#    Resumable — safe to Ctrl-C and re-run.

# Smoke (50 clips):
python ai/scripts/konvid_150k_to_corpus_jsonl.py --max-rows 50

LSVQ (~39 000 clips, ~500 GB whole)

# Drop LSVQ_whole_train CSV at .workingdir2/lsvq/manifest.csv, then:
python ai/scripts/lsvq_to_corpus_jsonl.py          # laptop subset (500 clips)
python ai/scripts/lsvq_to_corpus_jsonl.py --full   # whole corpus
#    → .workingdir2/lsvq/lsvq.jsonl

YouTube UGC (~1 500 clips, ~2 TB whole)

# Drop original_videos.csv at .workingdir2/youtube-ugc/manifest.csv, then:
python ai/scripts/youtube_ugc_to_corpus_jsonl.py          # laptop subset (300 clips)
python ai/scripts/youtube_ugc_to_corpus_jsonl.py --full   # whole corpus
#    → .workingdir2/youtube-ugc/youtube_ugc.jsonl

Waterloo IVC 4K-VQA (1 200 clips, staged locally)

# Extract bulk archives to .workingdir2/waterloo-ivc-4k/clips/
# Drop scores.txt at .workingdir2/waterloo-ivc-4k/manifest.csv, then:
python ai/scripts/waterloo_ivc_to_corpus_jsonl.py          # default subset
python ai/scripts/waterloo_ivc_to_corpus_jsonl.py --full   # whole corpus
#    → .workingdir2/waterloo-ivc-4k/waterloo_ivc_4k.jsonl

Note: Waterloo IVC 4K-VQA uses a 0–100 continuous scale, not 1–5 ACR Likert. The aggregator converts it via identity (no rescaling needed). See multi-corpus-aggregation.md §2 for the conversion table.

CHUG UGC-HDR (5 992 clips, S3-hosted)

mkdir -p .workingdir2/chug
curl -L https://raw.githubusercontent.com/shreshthsaini/CHUG/master/chug.csv \
  -o .workingdir2/chug/manifest.csv

PYTHONPATH=ai/src python ai/scripts/chug_to_corpus_jsonl.py          # 500-row subset
PYTHONPATH=ai/src python ai/scripts/chug_to_corpus_jsonl.py --full   # whole corpus
#    → .workingdir2/chug/chug.jsonl

CHUG is UGC-HDR and reports MOS on a 0–100 continuous scale. The adapter preserves that source value as mos_raw_0_100 and maps trainer-facing mos onto [1, 5] via 1 + 4 * mos_raw_0_100 / 100 so the existing MOS-head trainer can consume the rows directly. The adapter also preserves CHUG bitrate-ladder, orientation, manifest geometry, and content-name metadata under chug_* optional fields.
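The mapping quoted above is a plain affine rescale, so its endpoints and midpoint are easy to verify by hand. The sketch below restates it in Python; the function name chug_mos_to_likert is hypothetical, and only the formula itself comes from the text.

```python
def chug_mos_to_likert(mos_raw_0_100):
    """Map a 0-100 CHUG MOS onto [1, 5] via 1 + 4 * x / 100."""
    return 1.0 + 4.0 * mos_raw_0_100 / 100.0


print(chug_mos_to_likert(0.0))    # 1.0
print(chug_mos_to_likert(50.0))   # 3.0
print(chug_mos_to_likert(100.0))  # 5.0

# A row keeps both values, as described above.
row = {"mos_raw_0_100": 75.0, "mos": chug_mos_to_likert(75.0)}  # mos == 4.0
```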

LIVE-VQC (585 clips, ~few GB)

# Drop manifest CSV at .workingdir2/live-vqc/manifest.csv
# and clips at .workingdir2/live-vqc/clips/, then:
python ai/scripts/live_vqc_to_corpus_jsonl.py          # laptop subset (200 clips)
python ai/scripts/live_vqc_to_corpus_jsonl.py --full   # whole corpus (585 clips)
#    → .workingdir2/live-vqc/live_vqc.jsonl

Note: LIVE-VQC uses a 0–100 continuous scale (same as Waterloo IVC 4K-VQA). Obtain the dataset from https://live.ece.utexas.edu/research/LIVEVQC/. Two manifest shapes are accepted: a headerless <filename>,<mos> file (minimal MOS export) or a standard named-column CSV. See live-vqc-ingestion.md for acquisition and operator flag details.
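To make the two manifest shapes concrete, the sketch below reads either one into (filename, mos) pairs. It is an illustration, not the adapter's parser: read_manifest is a hypothetical name, and the column names "filename" and "mos" are assumptions. The names the adapter actually recognises are listed in live-vqc-ingestion.md.

```python
import csv
import io


def read_manifest(text, name_col="filename", mos_col="mos"):
    """Return (filename, mos) pairs from a headerless or named-column manifest."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    if not rows:
        return []
    header = [cell.strip().lower() for cell in rows[0]]
    if name_col in header and mos_col in header:
        # Named-column shape: locate the two columns, skip the header row.
        name_idx, mos_idx = header.index(name_col), header.index(mos_col)
        body = rows[1:]
    else:
        # Headerless shape: column 0 is the filename, column 1 is the MOS.
        name_idx, mos_idx = 0, 1
        body = rows
    return [(r[name_idx].strip(), float(r[mos_idx])) for r in body]


headerless = "A001.mp4,62.5\nA002.mp4,80.25\n"
named = "mos,filename\n62.5,A001.mp4\n"
pairs_a = read_manifest(headerless)
pairs_b = read_manifest(named)
```

Detecting the shape from the first row keeps the caller free of a mode flag, at the cost of misreading a headerless file whose first filename happens to equal a column name.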

KonViD MOS head v1

After building a unified JSONL from KonViD-1k and KonViD-150k, the fork can train a lightweight MLP that maps libvmaf canonical-6 features plus saliency and TransNet shot-metadata to a scalar MOS prediction in [1, 5]. This is the konvid_mos_head_v1 model (PR #491, ADR-0336):

# Smoke (synthetic corpus — no real data needed, ~30 s):
python ai/scripts/train_konvid_mos_head.py --smoke

# Production (real KonViD JSONL drops on disk):
python ai/scripts/train_konvid_mos_head.py \
    --konvid-1k   .workingdir2/konvid-1k/konvid_1k.jsonl \
    --konvid-150k .workingdir2/konvid-150k/konvid_150k.jsonl
#    → model/konvid_mos_head_v1.onnx
#    → model/konvid_mos_head_v1.json  (manifest sidecar)

# Production from refreshed feature parquet, after MOS materialisation:
python ai/scripts/train_konvid_mos_head.py \
    --konvid-1k /tmp/no-konvid-1k.jsonl \
    --konvid-150k /tmp/no-konvid-150k.jsonl \
    --feature-parquet runs/full_features_konvid_refresh_20260520_with_mos.parquet

Real mode fails when the input paths yield zero MOS-labelled rows. Use --smoke for synthetic pipeline checks; use the MOS label materializer for real feature tables that do not yet carry mos.

CHUG HDR MOS training is a separate local experiment because CHUG is HDR subjective-MOS data and the current Netflix VMAF teacher is SDR/8-bit. The CHUG wrapper defaults to the chug-hdr-wide-v1 feature schema, which uses the CHUG temporal quantiles/std columns and HDR ladder metadata in addition to the canonical-6 feature means:

python ai/scripts/train_chug_hdr_mos_head.py
#    → .workingdir2/chug/chug_hdr_mos_head_v1.onnx
#    → .workingdir2/chug/chug_hdr_mos_head_v1.json

To train against a target HDR display instead of treating MOS as display-invariant, provide a display profile:

python ai/scripts/train_chug_hdr_mos_head.py \
  --display-profile-json .workingdir2/chug/display-profile.json

With that flag and no explicit --feature-schema, the wrapper uses chug-hdr-display-v1, a 45-column schema that appends normalized panel and viewing-context features to chug-hdr-wide-v1. The manifest records the normalized display profile and the source JSON sha256.

For an apples-to-apples ablation against the older 11-feature baseline, add --feature-schema konvid-v1. Do not use that ablation as the default CHUG command; it discards CHUG-specific signal that was already materialised by the extractor.

See models/konvid_mos_head_v1.md for the full model card (architecture, I/O contract, production-flip gate, and predictor integration).

License and redistribution posture

The fork ships adapter scripts and schemas only. No corpus clips, no per-clip MOS values, and no derived feature caches are committed. Only the trained ONNX weights (derived from the corpora) are redistributable, with attribution following the source licence:

| Corpus | Licence |
| --- | --- |
| KonViD-1k | Research-use, citation required |
| KonViD-150k | Research-use, citation required |
| LSVQ | CC-BY-4.0 |
| YouTube UGC | Creative Commons Attribution |
| Waterloo IVC 4K-VQA | Permissive academic, attribution required |
| LIVE-VQC | Research-use, attribution required |
| CHUG UGC-HDR | CC BY-NC / CC BY-NC-SA mismatch; treat as non-commercial/share-alike until clarified |
| BVI-DVC | Research-use (non-redistributable) |

For the full licence analysis per corpus see the respective ADR: ADR-0325 (KonViD), ADR-0333 (LSVQ), ADR-0334 / ADR-0368 (YouTube UGC), ADR-0369 (Waterloo IVC), ADR-0370 (LIVE-VQC), ADR-0426 (CHUG), ADR-0310 (BVI-DVC).