ADR-0362 — K150K-A corpus integration: FR-from-NR extraction of FULL_FEATURES

| Field | Value |
| --- | --- |
| Status | Accepted |
| Date | 2026-05-09 |
| Tags | ai, training-data, corpus, k150k, full-features, fork-local |

Context

KoNViD-150k-A (K150K-A) is the largest publicly available no-reference (NR) video quality corpus: 152,265 clips, each carrying a per-clip mean opinion score (MOS) aggregated from crowd-sourced ratings. Because the clips have no reference video, integrating the corpus into the tiny-AI training pipeline requires adapting the NR setting to the full-reference (FR) VMAF extractor interface, which expects a reference/distorted pair.

The existing training corpora (Netflix Public, BVI-DVC, KoNViD-1k, YouTube-UGC subset) cover at most ~15,000 clips total. Adding K150K-A increases training scale by an order of magnitude and covers a wider distribution of user-generated content quality levels.

The FULL_FEATURES set (Research-0026) — 22 features including ADM sub-bands, VIF sub-bands, motion, PSNR, SSIM/MS-SSIM, CAMBI, ciede2000, psnr_hvs, ssimulacra2, and the VMAF teacher — is the target feature space for the Phase 3 tiny-AI models.

Decision

Use the FR-from-NR adapter (ADR-0346): decode each K150K-A clip once to raw YUV and feed the same buffer as both --reference and --distorted in the libvmaf CLI. Run all 11 FULL_FEATURES extractors plus the vmaf_v0.6.1 model for the VMAF teacher score. Aggregate per-frame values to per-clip mean + std.
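
The identity-pair call and the per-clip aggregation described above could look roughly like the following. This is a minimal sketch, not the extraction script itself: the function names, the 8-bit 4:2:0 decode format, and the use of population standard deviation are assumptions, and `--backend cuda` is taken from this ADR's description of the fork build rather than from upstream libvmaf.

```python
import math
import statistics


def identity_vmaf_cmd(vmaf_bin, yuv_path, width, height, features, out_json):
    """Build a vmaf CLI call that scores a decoded clip against itself."""
    cmd = [
        vmaf_bin,
        "--reference", yuv_path,
        "--distorted", yuv_path,  # same buffer on both sides (identity pair)
        "--width", str(width),
        "--height", str(height),
        "--pixel_format", "420",  # assumption: clips are decoded to 4:2:0
        "--bitdepth", "8",        # assumption: 8-bit decode
        "--model", "version=vmaf_v0.6.1",
        "--backend", "cuda",      # fork-specific flag quoted in this ADR
        "--output", out_json,
        "--json",
    ]
    for name in features:
        cmd += ["--feature", name]
    return cmd


def aggregate_clip(frames, feature_names):
    """Collapse per-frame feature dicts into <feat>_mean / <feat>_std values."""
    row = {}
    for name in feature_names:
        # Keep only finite values; identity pairs can yield null for some features.
        values = [
            f[name] for f in frames
            if f.get(name) is not None and math.isfinite(f[name])
        ]
        if values:
            row[f"{name}_mean"] = statistics.fmean(values)
            # Population std (ddof=0) is an assumption; the ADR does not specify.
            row[f"{name}_std"] = statistics.pstdev(values)
        else:
            row[f"{name}_mean"] = math.nan
            row[f"{name}_std"] = math.nan
    return row


# Two illustrative frames with made-up values.
frames = [
    {"cambi": 1.0, "motion": 2.0, "ciede2000": None},
    {"cambi": 3.0, "motion": 4.0, "ciede2000": None},
]
row = aggregate_clip(frames, ["cambi", "motion", "ciede2000"])
cmd = identity_vmaf_cmd("vmaf", "clip.yuv", 960, 540, ["cambi", "motion"], "clip.json")
```

A feature with no finite per-frame values ends up as NaN in both columns, which matches the all-NaN behaviour noted under Consequences.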

Output: runs/full_features_k150k.parquet (gitignored). One row per clip, 48 columns: clip_name, mos, width, height, plus <feat>_mean and <feat>_std for each of the 22 FEATURE_NAMES.
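
The 48-column count follows from 4 metadata columns plus 2 statistics for each of 22 features. A short sketch of that layout, using placeholder feature names (the real FEATURE_NAMES list is defined in Research-0026 and is not reproduced here) and assuming mean and std columns are interleaved per feature:

```python
META_COLUMNS = ["clip_name", "mos", "width", "height"]

# Placeholder names only; the actual 22 entries come from Research-0026.
FEATURE_NAMES = [f"feat_{i:02d}" for i in range(22)]


def output_columns(feature_names):
    """Return the parquet column order: metadata, then mean/std per feature."""
    cols = list(META_COLUMNS)
    for name in feature_names:
        cols.append(f"{name}_mean")
        cols.append(f"{name}_std")
    return cols


columns = output_columns(FEATURE_NAMES)
# 4 metadata columns + 2 * 22 feature statistics = 48 columns
```

Checking `len(columns)` against 48 when the parquet is written is a cheap guard against a feature silently dropping out of the list.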

Hardware: RTX 4090 via build-cpu/tools/vmaf --backend cuda (fork build).

Alternatives considered

| Alternative | Why rejected |
| --- | --- |
| Full NrToFrAdapter Python pipeline | 5–10× compute overhead from the re-encoding step; not needed when the MOS is the training target and FR features at identity suffice for content fingerprinting. |
| Canonical-6 features only (adm2, vif_scale*, motion, vmaf) | Wastes the CUDA call: adding the remaining 16 features costs negligible extra per-frame time once the YUV decode is done. |
| KoNViD-1k only | Only ~1,200 clips; K150K-A is the same domain at 100× scale. |
| Skip corpus entirely | Leaves tiny-AI training data-constrained in the UGC domain; K150K-A is the highest-leverage single dataset addition available. |

Consequences

Positive:

  • Training corpus grows from ~15,000 clips to ~167,000 clips.
  • K150K-A's MOS distribution spans a wider quality range than the Netflix reference corpus, improving model calibration on low-quality content.
  • Fully restartable extraction (.done checkpoint + atomic parquet flush).
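
The restartable pattern named in the last bullet (a .done checkpoint plus an atomic flush) can be sketched as below. This is an illustration under stated assumptions, not the actual script: it writes JSON with the standard library instead of parquet, and the function names, file layout, and per-clip flush cadence are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text):
    """Write text via a temp file and rename, so no partial file is ever visible."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic when tmp and path share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_done(done_path):
    """Read the set of clip names already processed in earlier runs."""
    p = Path(done_path)
    if not p.exists():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def run(clips, extract, out_path, done_path):
    """Extract every clip not yet checkpointed; safe to re-run after a crash."""
    done = load_done(done_path)
    out = Path(out_path)
    # Rows are keyed by clip name so a re-extracted clip overwrites, never duplicates.
    rows = json.loads(out.read_text(encoding="utf-8")) if out.exists() else {}
    for clip in clips:
        if clip in done:
            continue
        rows[clip] = extract(clip)
        atomic_write_text(out, json.dumps(rows))  # flush results first
        done.add(clip)
        atomic_write_text(done_path, "\n".join(sorted(done)) + "\n")  # then checkpoint
    return rows
```

Flushing results before updating the checkpoint means a crash between the two writes costs one repeated extraction rather than a missing row. Rewriting the whole output after every clip would be too slow at 152,265 clips, so a real run would presumably flush in batches.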

Negative:

  • ciede2000 and psnr_hvs are all-NaN for every K150K-A clip. The libvmaf implementations of both return null when the reference equals the distorted input (identity pair); this is correct behaviour, not a bug. Downstream loaders must handle the NaN columns gracefully (e.g. drop or impute them before training).
  • ADM, VIF, SSIM, MS-SSIM, and VMAF all saturate at their identity values (1.0 / trivial) and carry no discriminative signal for model training. Only CAMBI, motion, motion2, motion3, and ssimulacra2 remain informative.
  • Full run ETA: ~296 h as a single sequential process on an RTX 4090 (152,265 clips × ~7 s/clip ≈ 1.07 × 10^6 s). Parallelisation via --limit batches + xargs -P or a task queue is a follow-up.
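
One way a downstream loader might handle the first two points above is to drop columns that are all-NaN or constant across clips before training. The sketch below uses plain Python rows for self-containment; the function name, the tolerance, and the sample values are hypothetical, and a real loader would more likely operate on a DataFrame read from the parquet file.

```python
import math


def usable_feature_columns(rows, feature_columns, tol=1e-12):
    """Return the feature columns that have finite values and vary across clips."""
    keep = []
    for col in feature_columns:
        values = [
            r[col] for r in rows
            if r.get(col) is not None and not math.isnan(r[col])
        ]
        if not values:
            continue  # all-NaN column (e.g. ciede2000 on identity pairs)
        if max(values) - min(values) <= tol:
            continue  # constant column (e.g. SSIM pinned at its identity value)
        keep.append(col)
    return keep


# Made-up rows mirroring the three cases: all-NaN, constant, informative.
rows = [
    {"ciede2000_mean": math.nan, "ssim_mean": 1.0, "cambi_mean": 0.4},
    {"ciede2000_mean": math.nan, "ssim_mean": 1.0, "cambi_mean": 2.1},
]
kept = usable_feature_columns(rows, ["ciede2000_mean", "ssim_mean", "cambi_mean"])
```

Dropping rather than imputing is the simpler choice here: an imputed all-NaN column would itself be constant and add nothing to training.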

References

  • req: "Write a K150K full-feature extraction script + run it on the local CUDA card..." (paraphrased: user requested the extraction pipeline, ADR, research digest, and all six ADR-0108 deliverables in this PR).
  • ADR-0346 — FR-from-NR adapter pattern.
  • Research-0026 — FULL_FEATURES 22-feature set.
  • Research-0067 — companion digest.
  • ADR-0108 — six deep-dive deliverables rule.