ADR-0362: K150K-A corpus integration, FR-from-NR extraction of FULL_FEATURES
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-09 |
| Tags | ai, training-data, corpus, k150k, full-features, fork-local |
Context
KoNViD-150k-A (K150K-A) is the largest publicly available no-reference (NR) video quality corpus: 152,265 clips, each carrying a per-clip mean opinion score (MOS) aggregated from crowd-sourced ratings. Integrating it into the tiny-AI training pipeline requires mapping from the NR setting (no reference video) to the full-reference (FR) VMAF extractor interface.
The existing training corpora (Netflix Public, BVI-DVC, KoNViD-1k, YouTube-UGC subset) cover at most ~15,000 clips total. Adding K150K-A increases training scale by an order of magnitude and covers a wider distribution of user-generated content quality levels.
The FULL_FEATURES set (Research-0026) — 22 features including ADM sub-bands, VIF sub-bands, motion, PSNR, SSIM/MS-SSIM, CAMBI, ciede2000, psnr_hvs, ssimulacra2, and the VMAF teacher — is the target feature space for the Phase 3 tiny-AI models.
Decision
Use the FR-from-NR adapter (ADR-0346): decode each K150K-A clip once to raw YUV and feed the same buffer as both --reference and --distorted in the libvmaf CLI. Run all 11 FULL_FEATURES extractors plus the vmaf_v0.6.1 model for the VMAF teacher score. Aggregate per-frame values to per-clip mean + std.
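The identity-pair invocation described above can be sketched as a small command builder. This is a minimal illustration, not the extraction script itself: the helper name `build_identity_cmd`, the default pixel format and bit depth, and the example feature names are assumptions, and `--backend cuda` is the fork-specific flag named in this ADR rather than an upstream libvmaf option.

```python
from typing import List, Sequence

# Binary path as given in this ADR (fork build).
VMAF_BIN = "build-cpu/tools/vmaf"


def build_identity_cmd(
    yuv_path: str,
    width: int,
    height: int,
    out_json: str,
    features: Sequence[str],
    pixel_format: str = "420",  # assumed default; depends on the decode step
    bitdepth: int = 8,          # assumed default; depends on the decode step
) -> List[str]:
    """Build an argv list that feeds one YUV file as both reference and distorted."""
    cmd = [
        VMAF_BIN,
        "--reference", yuv_path,
        "--distorted", yuv_path,   # same buffer on both sides: the identity pair
        "--width", str(width),
        "--height", str(height),
        "--pixel_format", pixel_format,
        "--bitdepth", str(bitdepth),
        "--model", "version=vmaf_v0.6.1",  # VMAF teacher score
        "--backend", "cuda",               # fork-local flag per this ADR
        "--json",
        "--output", out_json,
    ]
    for name in features:
        cmd += ["--feature", name]
    return cmd


# Illustrative call; the resolution and feature names are placeholders.
example = build_identity_cmd("clip.yuv", 960, 540, "clip.json", ["cambi", "psnr"])
print(" ".join(example))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building it as a list rather than a shell string avoids quoting problems with clip file names.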
Output: runs/full_features_k150k.parquet (gitignored). One row per clip, 48 columns: clip_name, mos, width, height, plus <feat>_mean and <feat>_std for each of the 22 FEATURE_NAMES.
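As a rough sketch of how one output row could be assembled, the function below aggregates per-frame values into `<feat>_mean` and `<feat>_std` columns. The function name `clip_row`, the per-frame input shape (one dict per frame), and the use of population standard deviation are assumptions; the actual script may pool differently.

```python
import math
import statistics
from typing import Dict, List, Optional, Sequence


def clip_row(
    clip_name: str,
    mos: float,
    width: int,
    height: int,
    frames: List[Dict[str, Optional[float]]],
    feature_names: Sequence[str],
) -> Dict[str, object]:
    """Collapse per-frame feature values into one per-clip row."""
    row: Dict[str, object] = {
        "clip_name": clip_name,
        "mos": mos,
        "width": width,
        "height": height,
    }
    for name in feature_names:
        # Skip missing (None/null) and NaN frame values.
        values = [
            f[name]
            for f in frames
            if f.get(name) is not None and not math.isnan(f[name])
        ]
        # A feature with no usable frames becomes NaN in both columns.
        row[f"{name}_mean"] = statistics.fmean(values) if values else math.nan
        # Population std (ddof=0) is an assumption of this sketch.
        row[f"{name}_std"] = statistics.pstdev(values) if values else math.nan
    return row


# Two frames, one informative feature and one that is null at identity.
frames = [{"cambi": 1.0, "ciede2000": None}, {"cambi": 3.0, "ciede2000": None}]
row = clip_row("example_clip", 3.5, 960, 540, frames, ["cambi", "ciede2000"])
print(row["cambi_mean"], row["cambi_std"])  # 2.0 1.0
```

With 22 feature names the row has 4 + 2 × 22 = 48 keys, matching the column count stated above.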
Hardware: RTX 4090 via build-cpu/tools/vmaf --backend cuda (fork build).
Alternatives considered
| Alternative | Why rejected |
|---|---|
| Full NrToFrAdapter Python pipeline | 5–10× compute overhead from the re-encoding step; not needed when the MOS is the training target and FR features at identity suffice for content fingerprinting. |
| Canonical-6 features only (adm2, vif_scale*, motion, vmaf) | Wastes the CUDA call — adding the remaining 16 features costs negligible extra per-frame time once the YUV decode is done. |
| KoNViD-1k only | Only ~1,200 clips; K150K-A is the same domain at 100× scale. |
| Skip corpus entirely | Leaves tiny-AI training data-constrained in the UGC domain; K150K-A is the highest-leverage single dataset addition available. |
Consequences
Positive:
- Training corpus grows from ~15,000 clips to ~167,000 clips.
- K150K-A's MOS distribution spans a wider quality range than the Netflix reference corpus, improving model calibration on low-quality content.
- Fully restartable extraction (`.done` checkpoint + atomic parquet flush).
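The restartable pattern in the last bullet can be sketched as follows. This assumes the `.done` checkpoint is a plain-text file with one finished clip name per line, and that "atomic flush" means writing to a temporary file and renaming it over the target; both are assumptions about the script, and all helper names here are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def atomic_write(path: Path, data: bytes) -> None:
    """Write data to a temp file in the same directory, then rename over path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_done(done_path: Path) -> Set[str]:
    """Return the set of clip names already recorded as finished."""
    if not done_path.exists():
        return set()
    lines = done_path.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_done(done_path: Path, clip_name: str) -> None:
    """Append one finished clip name to the checkpoint file."""
    with done_path.open("a") as fh:
        fh.write(clip_name + "\n")


def pending(clips: Iterable[str], done_path: Path) -> List[str]:
    """Filter out clips that a previous run already completed."""
    done = load_done(done_path)
    return [c for c in clips if c not in done]
```

A readers never sees a half-written output file with this approach, because the rename either happens completely or not at all; a crashed run leaves at most a stray `.tmp` file behind.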
Negative:
- `ciede2000` and `psnr_hvs` are all-NaN for every K150K-A clip. The libvmaf ciede2000 and psnr_hvs implementations return `null` when ref == distorted (identity pair); this is correct behaviour, not a bug. Downstream loaders must handle NaN columns gracefully (e.g. drop or impute before training).
- ADM, VIF, SSIM, MS-SSIM, and VMAF all saturate at their identity values (1.0 / trivial) and carry zero discriminative signal for model training. Only CAMBI, motion, motion2, motion3, and ssimulacra2 remain informative.
- Full run ETA: ~296 h single-process sequential at ~7 s/clip on an RTX 4090. Parallelisation via `--limit` batches + `xargs -P` or a task queue is a follow-up.
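One way a downstream loader might handle the all-NaN and constant columns is to classify them before training. The sketch below is dependency-free and works on a list of row dicts; the function name `split_columns`, the tolerance, and the example column names are hypothetical, and the same check could equally be done on a DataFrame after reading the parquet file.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def _is_missing(value: Optional[float]) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_columns(
    rows: List[Dict[str, Optional[float]]],
    feature_columns: Sequence[str],
    tol: float = 1e-12,  # assumed tolerance for "constant across clips"
) -> Tuple[List[str], List[str], List[str]]:
    """Classify feature columns as all-NaN, constant, or usable."""
    all_nan: List[str] = []
    constant: List[str] = []
    usable: List[str] = []
    for col in feature_columns:
        values = [r[col] for r in rows if not _is_missing(r.get(col))]
        if not values:
            all_nan.append(col)       # e.g. features that are null at identity
        elif max(values) - min(values) <= tol:
            constant.append(col)      # e.g. features pinned at their identity value
        else:
            usable.append(col)
    return all_nan, constant, usable


# Illustrative rows; the column names follow the <feat>_mean pattern.
rows = [
    {"ciede2000_mean": math.nan, "float_ssim_mean": 1.0, "cambi_mean": 0.4},
    {"ciede2000_mean": math.nan, "float_ssim_mean": 1.0, "cambi_mean": 2.1},
]
print(split_columns(rows, ["ciede2000_mean", "float_ssim_mean", "cambi_mean"]))
```

Dropping the first two groups is the simplest policy; imputing is the alternative mentioned above, though imputing an all-NaN column adds no information.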
References
- req: "Write a K150K full-feature extraction script + run it on the local CUDA card..." (paraphrased: user requested the extraction pipeline, ADR, research digest, and all six ADR-0108 deliverables in this PR).
- ADR-0346: FR-from-NR adapter pattern.
- Research-0026: FULL_FEATURES 22-feature set.
- Research-0067: companion digest.
- ADR-0108: six deep-dive deliverables rule.