ADR-0367: LSVQ corpus ingestion for nr_metric_v1¶
- Status: Accepted
- Date: 2026-05-08
- Deciders: @Lusoris
- Tags: ai, training, corpus, license, fork-local
Context¶
nr_metric_v1 (~19 K params) is the fork's tiny no-reference VQA model. To compete with DOVER-Mobile (~9.86 M params, PLCC 0.853 on KoNViD / 0.867 on LSVQ_test) on published leaderboards we need a training corpus broad enough to span community-uploaded video content. ADR-0325 Phase 2 ingests KonViD-150k (~150 k clips) on its own; the Tiny-AI SOTA digest flags LSVQ (Ying et al. ICCV 2021, ~39 K videos / ~5.5 M ratings) as the canonical large NR-VQA training corpus the field trains on. KonViD-150k alone has too narrow a content distribution to land on the LSVQ_test / LSVQ_test_1080p leaderboards every modern NR-VQA paper benchmarks against.
LSVQ is hosted on Hugging Face under teowu/LSVQ-videos, mirroring the original Patch-VQ author drop. The corpus is large (~500 GB end-to-end raw); the licence is CC-BY-4.0 per the dataset card, so derived ONNX models can ship with attribution but the raw clips and per-clip MOS values must stay local-only on this fork (same posture as ADR-0310 BVI-DVC and ADR-0325 KonViD-150k).
What is missing today is a JSONL adapter that bridges the LSVQ split CSV(s) to the same MOS-corpus row schema the KonViD-150k adapter emits, so the trainer can consume both shards through one loader without per-corpus branching.
Decision¶
We will adopt LSVQ as a third training shard for nr_metric_v1 under three constraints:
- The LSVQ archive, extracted MP4s, and any cached features stay local-only (
.workingdir2/lsvq/). The fork redistributes only derivednr_metric_v1_*.onnxweights, with CC-BY-4.0 attribution travelling alongside. - The MOS-corpus row schema (introduced for KonViD-150k Phase 2) is the merge contract. A new adapter (
ai/scripts/lsvq_to_corpus_jsonl.py) emits one JSONL row per surviving clip withcorpus = "lsvq". The schema is byte-identical to the KonViD-150k adapter's modulo thecorpusandcorpus_versionliterals. - Laptop-class development is the default path. The script ingests the first
--max-rows=500clips by default; the ~500 GB whole-corpus run is opt-in via--full. The resumable-download contract from ADR-0325 Phase 2 carries over verbatim (.download-progress.json, atomic tempfile-rename writes, non-retriable failure persistence).
The ENCODER_VOCAB v4 collapse to "ugc-mixed" is not done here — this script records encoder_upstream from ffprobe verbatim, identical to KonViD-150k. The trainer-side collapse lands in a separate PR.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| KonViD-150k only | Smallest surface; one ingestion adapter; one license review. | Cannot benchmark against the LSVQ_test / LSVQ_test_1080p leaderboards every modern NR-VQA paper publishes; content distribution narrower than the field's de-facto training corpus. | The fork has the LSVQ Hugging Face mirror available and the KonViD-150k adapter shape ports trivially; not using it leaves measured signal on the table. |
| KonViD-150k + LSVQ | Adds the canonical NR-VQA corpus; the leaderboards we want to publish against use LSVQ_test; CC-BY-4.0 is permissively redistributable for derived weights. | Working set ~500 GB; CSV column-name drift across mirrors; partial-corpus runs need explicit operator opt-in. | Chosen. The marginal infra (one adapter mirroring KonViD-150k Phase 2 + tests + a --max-rows / --full CLI knob) is small. |
| KonViD-150k + LSVQ + LIVE-VQC + YouTube-UGC | Largest possible training corpus; broadest content; matches the exact union DOVER trains on. | LIVE-VQC redistribution licence is research-only-non-commercial; YouTube-UGC clip URLs degrade over time; LOSO partition explodes; ADR-0287 already showed marginal ensemble gains past two-corpus regime. | Premature without a clean KonViD-150k + LSVQ measurement. The two-public-shard regime is the next step; broader corpora work returns when (and if) the two-shard run leaves PLCC headroom. |
Consequences¶
- Positive:
nr_metric_v1becomes trainable on the same large-scale UGC corpus DOVER / FAST-VQA / Q-Align use, so cross-paper comparison on LSVQ_test / LSVQ_test_1080p becomes possible. The KonViD-150k adapter pattern becomes the canonical shape for any future MOS-corpus ingestion (one more adapter per dataset, no schema drift). - Negative: Operators who want the whole corpus need ~500 GB free under
.workingdir2/. The--max-rowsdefault avoids surprise disk-fill but means a default run is not a full ingestion — operators must read the CLI help. - Neutral / follow-ups: The ENCODER_VOCAB v4 trainer-side collapse to
"ugc-mixed"is still pending; landing it is decoupled from this PR. A future PR may also add the LSVQ test-split CSV (LSVQ_test,LSVQ_test_1080p) explicitly to the held-out evaluation harness once the trainer can consume the new shard.
References¶
- Ying, Z., Mandal, M., Ghadiyaram, D., Bovik, A. C., "Patch-VQ: 'Patching Up' the Video Quality Problem," ICCV 2021.
- Hugging Face dataset card:
teowu/LSVQ-videos(CC-BY-4.0, verified 2026-05-08). - Patch-VQ author drop: https://github.com/baidut/PatchVQ.
- Tiny-AI SOTA deep-dive digest:
docs/research/0086-tiny-ai-sota-deep-dive-2026-05-08.md. - Prior corpus ingestion ADRs: ADR-0310 (BVI-DVC), ADR-0325 Phase 2 (KonViD-150k, in flight as PR #447).
- Source:
req— implementation task spec routed through the agent harness 2026-05-08, citing the SOTA digest #449 and the KonViD-150k Phase 2 PR #447 as the pattern source.