ADR-0413: YouTube UGC corpus ingestion for nr_metric_v1
- Status: Accepted
- Date: 2026-05-08
- Deciders: @Lusoris
- Tags: ai, training, corpus, license, fork-local
Context
nr_metric_v1 (~19 K params) is the fork's tiny no-reference VQA model. ADR-0325 Phase 2 ingests KonViD-150k and ADR-0333 ingests LSVQ; the contributor-pack research digest #465 flags YouTube UGC (Wang, Inguva, Adsumilli; MMSP 2019) as the under-represented content-distribution axis of the training mix. KonViD draws content from Flickr, LSVQ from Internet Archive social-video; YouTube UGC adds the genuine YouTube content distribution (gaming, vlogs, lyric-videos, HDR clips, animation, …) that the production scoring path cares about most.
The corpus is hosted by Google in the public-readable Google Cloud Storage bucket gs://ugc-dataset/. The bucket has the allUsers:objectViewer IAM role applied, so there is no request form, no sign-up, and no API key. The licence is Creative Commons Attribution (CC-BY) per the bucket-root ATTRIBUTION file. The working set is ~2 TB end-to-end (~1500 originals at up to 4K, plus the transcoded ladder), so the ingestion adapter follows the LSVQ posture: a laptop-class default subset, with the whole corpus opt-in via --full.
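Because the bucket is world-readable, its objects can be fetched over plain HTTPS without credentials. The following is a minimal sketch of the gs:// to HTTPS mapping, assuming the standard path-style storage.googleapis.com endpoint. The helper name gs_to_https is hypothetical and is not taken from the adapter.

```python
from urllib.parse import quote

# Public path-style endpoint for Google Cloud Storage objects.
PUBLIC_ENDPOINT = "https://storage.googleapis.com"


def gs_to_https(gs_uri: str) -> str:
    """Map gs://bucket/object to its public HTTPS URL (hypothetical helper)."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    bucket, _, obj = gs_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {gs_uri!r}")
    # Percent-encode the object name but keep '/' as the path separator.
    return f"{PUBLIC_ENDPOINT}/{bucket}/{quote(obj, safe='/')}"
```

For example, gs_to_https("gs://ugc-dataset/ATTRIBUTION") produces the attribution URL listed under References.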
What is missing today is a JSONL adapter that bridges the YouTube UGC manifest CSV to the same MOS-corpus row schema the LSVQ / KonViD-150k adapters emit, so the trainer can consume all three shards through one loader without per-corpus branching.
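To make the bridging step concrete, here is a rough sketch of a manifest-CSV to JSONL conversion. It is not the adapter itself: the CSV column names (vid, mos) and the output key src are assumptions for illustration, and the real MOS-corpus row schema carries more fields than shown. Only the corpus and corpus_version keys and the no-rescaling rule come from this ADR.

```python
import csv
import io
import json


def manifest_to_jsonl(csv_text, corpus_version, max_rows=None):
    """Yield one JSON line per manifest row that has a parseable MOS."""
    reader = csv.DictReader(io.StringIO(csv_text))
    emitted = 0
    for row in reader:
        if max_rows is not None and emitted >= max_rows:
            break
        try:
            name = row["vid"]        # hypothetical column name
            mos = float(row["mos"])  # hypothetical column name
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable MOS
        record = {
            "corpus": "youtube-ugc",
            "corpus_version": corpus_version,
            "src": name,  # hypothetical output key
            "mos": mos,   # recorded as-is, no rescaling
        }
        yield json.dumps(record, sort_keys=True)
        emitted += 1
```

Because every shard would emit the same keys, a trainer-side loader can read the concatenated JSONL files and tell the shards apart by the corpus field alone.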
Decision
We will adopt YouTube UGC as a fourth training shard for nr_metric_v1 under three constraints:
- The YouTube UGC archive, extracted clips, and any cached features stay local-only (.workingdir2/youtube-ugc/). The fork redistributes only derived nr_metric_v1_*.onnx weights, with CC-BY attribution travelling alongside.
- The MOS-corpus row schema (introduced for KonViD-150k Phase 2 and reused by ADR-0333 LSVQ) is the merge contract. A new adapter (ai/scripts/youtube_ugc_to_corpus_jsonl.py) emits one JSONL row per surviving clip with corpus = "youtube-ugc". The schema is byte-identical to the LSVQ / KonViD-150k adapters', modulo the corpus and corpus_version literals.
- Laptop-class development is the default path. The script ingests the first --max-rows=300 clips by default; the ~2 TB whole-corpus run is opt-in via --full. The resumable-download contract from ADR-0325 Phase 2 / ADR-0333 carries over verbatim (.download-progress.json, atomic tempfile-rename writes, non-retriable failure persistence).
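The atomic tempfile-rename write named in the resumable-download contract can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: the function names and the state layout (a done list plus a failed mapping for non-retriable failures) are hypothetical.

```python
import json
import os
import tempfile


def save_progress(path, state):
    """Write state as JSON via tempfile + rename so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".progress-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic replacement of the old progress file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_progress(path):
    """Return saved state, or an empty state when no progress file exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"done": [], "failed": {}}  # hypothetical state layout
```

An interrupted run then either sees the previous complete progress file or the new one, never a truncated mix, which is what makes resuming safe.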
The ENCODER_VOCAB v4 collapse to "ugc-mixed" is not done here: this script records encoder_upstream from ffprobe verbatim, identical to LSVQ / KonViD-150k. The trainer-side collapse lands in a separate PR and will apply uniformly across the three UGC shards once it does.
Per-clip scoring methodology: YouTube UGC's 2019 release (Wang et al. MMSP 2019) provides per-original-clip MOS values on the same 1.0-5.0 Likert scale as LSVQ / KonViD. The 2020 transcoded follow-up (Wang et al. CVPR 2021) adds per-bitrate ratings on transcoded outputs at four rate points (orig / cbr / vod / vodlb); operators wanting those ratings pre-aggregate them into a one-row-per-orig CSV with corpus_version = "ugc-2020-transcoded-mean". The adapter records whatever the manifest's MOS column contains, without rescaling (matching LSVQ / KonViD-150k).
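The pre-aggregation step for the transcoded release might look like the sketch below. The corpus_version literal suggests an arithmetic mean over the rate points, but the exact aggregation and the input layout of (original id, variant, MOS) tuples are assumptions here, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def mean_mos_per_original(ratings):
    """Collapse (orig_id, variant, mos) tuples into one mean MOS per original clip."""
    buckets = defaultdict(list)
    for orig_id, _variant, mos in ratings:
        buckets[orig_id].append(float(mos))
    # One entry per original, averaging over whichever variants were rated.
    return {orig_id: fmean(values) for orig_id, values in buckets.items()}
```

The resulting mapping can be written out as a one-row-per-orig CSV, which the adapter then ingests without rescaling.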
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| KonViD-150k + LSVQ only (skip UGC) | Smallest surface; two ingestion adapters; two licence reviews. | Misses the genuine YouTube content distribution; under-represents gaming / animation / HDR; the contributor-pack digest #465 specifically flags UGC as the under-weighted axis. | Leaves measurable signal on the table; UGC is the third leg of the public-corpus tripod every modern NR-VQA paper trains on. |
| Add YouTube UGC alongside LSVQ + KonViD-150k | Adds the canonical UGC distribution; CC-BY is permissively redistributable for derived weights; adapter shape is a verbatim port of ADR-0333. | Working set ~2 TB; per-clip MOS is split across a 2019 originals release and a 2020 transcoded follow-up so operators must understand which release they have; partial-corpus runs need explicit operator opt-in. | Chosen. The marginal infra (one adapter mirroring ADR-0333 + 18 tests + a --max-rows / --full CLI knob + a --bucket-prefix knob for synthesised URLs) is small. |
| Scrape YouTube directly via yt-dlp | No 2 TB GCS pull; can target newer / longer-tail content. | Violates YouTube ToS; per-clip MOS values do not exist outside Google's curated release; reproducibility nil; licence posture indeterminate per clip. | Hard rejected. The Google YouTube UGC release is the only legal-and-MOS-bearing source for this content. |
| KonViD-150k + LSVQ + LIVE-VQC + YouTube UGC | Largest possible training corpus; broadest content. | LIVE-VQC redistribution licence is research-only-non-commercial; LOSO partition explodes; ADR-0287 already showed marginal ensemble gains past three-corpus regime. | Premature without a clean three-corpus measurement. The three-public-shard regime is the next step; LIVE-VQC remains gated behind a separate licence review if the three-shard run leaves PLCC headroom. |
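The --max-rows / --full knob from the chosen option could be wired roughly as below. This sketch assumes --full simply lifts the row cap and takes precedence over --max-rows; that interaction, the help strings, and the helper names are assumptions rather than the script's documented behaviour.

```python
import argparse

DEFAULT_MAX_ROWS = 300  # laptop-class default subset, per this ADR


def build_parser():
    """Build a parser exposing the two ingestion-size flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="YouTube UGC to corpus JSONL (sketch)")
    parser.add_argument("--max-rows", type=int, default=DEFAULT_MAX_ROWS,
                        help="ingest only the first N manifest rows")
    parser.add_argument("--full", action="store_true",
                        help="ingest the whole corpus (~2 TB working set)")
    return parser


def effective_row_limit(args):
    """Return None for an uncapped run, else the row cap (assumed precedence)."""
    return None if args.full else args.max_rows
```

With no flags the limit resolves to 300, so a default run never triggers the whole-corpus download by accident.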
Consequences
- Positive: nr_metric_v1 becomes trainable on the same large-scale UGC corpus the field uses for content-distribution breadth (DOVER, FAST-VQA, Q-Align, MaxVQA all train on the YouTube UGC bucket). The LSVQ adapter pattern becomes the canonical shape for any future MOS-corpus ingestion (one more adapter per dataset, no schema drift).
- Negative: Operators who want the whole corpus need ~2 TB free under .workingdir2/. The --max-rows=300 default avoids surprise disk-fill but means a default run is not a full ingestion; operators must read the CLI help. The per-clip-MOS-source ambiguity (2019 originals vs 2020 transcoded) is documented, but pinning --corpus-version correctly remains the operator's responsibility.
- Neutral / follow-ups: The ENCODER_VOCAB v4 trainer-side collapse to "ugc-mixed" is still pending; landing it is decoupled from this PR. A future PR may also wire the YouTube UGC _test per-resolution slices explicitly to the held-out evaluation harness once the trainer can consume the new shard.
References
- Wang, Y., Inguva, S., Adsumilli, B., "YouTube UGC Dataset for Video Compression Research," IEEE Workshop on Multimedia Signal Processing (MMSP) 2019.
- Wang, Y. et al., "Rich features for perceptual quality assessment of UGC videos," CVPR 2021 (transcoded-quality follow-up release).
- Public-readable GCS bucket: https://storage.googleapis.com/ugc-dataset/ (license: Creative Commons Attribution, verified 2026-05-08).
- Bucket-root attribution file: https://storage.googleapis.com/ugc-dataset/ATTRIBUTION.
- Original-video listing CSV: https://storage.googleapis.com/ugc-dataset/original_videos.csv (verified 2026-05-08).
- Companion research digest: Research-0091.
- Prior corpus ingestion ADRs: ADR-0310 (BVI-DVC), ADR-0325 Phase 2 (KonViD-150k, in flight as PR #447), ADR-0333 (LSVQ, in flight as PR #471).
- Source: req, an implementation task spec routed through the agent harness on 2026-05-08, citing the contributor-pack digest #465 and PR #471 (LSVQ) as the pattern source.