Research-0067: K150K-A corpus integration feasibility and pipeline design

Date: 2026-05-09
ADR: ADR-0362
Tags: ai, training-data, corpus, k150k, full-features

Summary

KoNViD-150k-A (K150K-A) is the largest publicly available no-reference video quality corpus. This digest documents the dataset profile, the FR-from-NR extraction pipeline, smoke-test results, and ETA analysis for the full 152K-clip run.

Dataset profile

| Property | Value |
| --- | --- |
| Corpus name | KoNViD-150k-A (K150K-A) |
| Clip count | 152,265 |
| MOS source | Crowd-sourced, per-clip mean |
| Resolution | Mixed (primarily 540p, some 720p/1080p) |
| Duration | ~5 s per clip |
| Container / codec | MP4, H.264/AVC |
| MOS range | 1.0 to 5.0 (crowd-sourced scale) |
| Local path | `.workingdir2/konvid-150k/k150ka_extracted/` |
| Labels CSV | `.workingdir2/konvid-150k/k150ka_scores.csv` |

Extraction pipeline

The NR-to-FR adapter (ADR-0346) feeds the same decoded YUV as both the reference and the distorted input, so every clip is scored against itself (an identity pair). Pipeline per clip:

1. `ffprobe`: probe geometry (width, height, pixel format).
2. `ffmpeg`: decode MP4 to raw YUV420 (8-bit for H.264 content; 10-bit if detected).
3. `build-cpu/tools/vmaf --backend cuda`: run 11 extractors over the identity pair; emit per-frame JSON.
4. Aggregate per-frame values: nanmean + nanstd per feature.
5. Append checkpoint to `.done` file; flush parquet every 1000 clips.
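
Steps 1 to 3 can be sketched as three argv builders. This is a minimal sketch, not the project's actual script: the function names and the `features` argument are hypothetical, the `ffprobe` and `ffmpeg` options are standard ones, the `vmaf` long options are assumed to match the upstream CLI, and `--backend cuda` is taken from the step list above as a flag of the fork build.

```python
import json
from pathlib import Path

VMAF_BIN = "build-cpu/tools/vmaf"  # fork build named in this digest


def probe_cmd(clip: Path) -> list[str]:
    """argv that asks ffprobe for the first video stream's geometry as JSON."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=width,height,pix_fmt", "-of", "json", str(clip)]


def parse_probe(ffprobe_json: str) -> dict:
    """Extract width, height and bit depth from ffprobe's JSON output."""
    stream = json.loads(ffprobe_json)["streams"][0]
    # Assumption: 10-bit content shows up as a pix_fmt such as yuv420p10le.
    bitdepth = 10 if stream["pix_fmt"].endswith(("p10le", "p10be")) else 8
    return {"width": stream["width"], "height": stream["height"], "bitdepth": bitdepth}


def decode_cmd(clip: Path, yuv: Path, geom: dict) -> list[str]:
    """argv that decodes the clip to raw planar YUV 4:2:0 at the probed bit depth."""
    pix_fmt = "yuv420p10le" if geom["bitdepth"] == 10 else "yuv420p"
    return ["ffmpeg", "-y", "-v", "error", "-i", str(clip),
            "-pix_fmt", pix_fmt, "-f", "rawvideo", str(yuv)]


def extract_cmd(yuv: Path, out_json: Path, geom: dict, features: list[str]) -> list[str]:
    """argv that runs the extractors with the same YUV as reference and distorted."""
    cmd = [VMAF_BIN, "--backend", "cuda",
           "--reference", str(yuv), "--distorted", str(yuv),  # identity pair
           "--width", str(geom["width"]), "--height", str(geom["height"]),
           "--pixel_format", "420", "--bitdepth", str(geom["bitdepth"]),
           "--json", "--output", str(out_json)]
    for name in features:  # hypothetical: caller supplies the extractor names
        cmd += ["--feature", name]
    return cmd
```

Each list can be passed to `subprocess.run(cmd, check=True, capture_output=True)`; the probe's `stdout` then goes through `parse_probe` before the other two commands are built.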

The system binary `/usr/local/bin/vmaf` (v3.0.0) was evaluated and rejected because it lacks the `ssimulacra2` and `motion_v2` extractor plugins required by `FULL_FEATURES`. The fork build at `build-cpu/tools/vmaf` supports all 11 extractors and the `--backend cuda` flag.
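
The checkpoint step of the pipeline (a `.done` file plus a flush every 1000 clips) might look roughly like the sketch below. All names here are hypothetical, and `process` and `flush` are stand-ins for the per-clip extraction and the parquet writer. As a conservative assumption, clip names are recorded in the `.done` file only after their rows have been flushed, so a crash cannot mark unflushed clips as finished.

```python
from pathlib import Path

FLUSH_EVERY = 1000  # flush interval taken from the pipeline description


def load_done(done_path: Path) -> set:
    """Return the clip names already recorded in the checkpoint file."""
    if not done_path.exists():
        return set()
    lines = done_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def _commit(done_path: Path, rows: list, names: list, flush) -> None:
    """Flush buffered rows first, then mark their clips as done."""
    flush(rows)
    with done_path.open("a", encoding="utf-8") as fh:
        fh.writelines(f"{name}\n" for name in names)


def run_with_checkpoint(clips, done_path: Path, process, flush,
                        flush_every: int = FLUSH_EVERY) -> None:
    """Process clips missing from the .done file, flushing every flush_every rows."""
    done = load_done(done_path)
    rows, names = [], []
    for clip in clips:
        if clip in done:
            continue  # resume: skip work finished in an earlier run
        rows.append(process(clip))
        names.append(clip)
        if len(rows) >= flush_every:
            _commit(done_path, rows, names, flush)
            rows, names = [], []
    if rows:
        _commit(done_path, rows, names, flush)  # final partial batch
```

Re-running the same call after an interruption repeats at most the last unflushed batch.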

Smoke-test results (10 clips, 2026-05-09)

| Clip | cambi_mean | motion_mean | vmaf_mean | ok |
| --- | --- | --- | --- | --- |
| orig_10000251326_540_5s.mp4 | 0.0 | 3.77 | 100.0 | yes |
| orig_10000958013_540_5s.mp4 | 0.0 | 2.14 | 100.0 | yes |
| orig_10001646563_540_5s.mp4 | 0.0 | 4.31 | 100.0 | yes |
| orig_10001767205_540_5s.mp4 | 0.0 | 1.62 | 100.0 | yes |
| orig_10002025004_540_5s.mp4 | 0.0 | 5.89 | 100.0 | yes |
| (5 more) | ... | ... | 100.0 | yes |

Wall time: ~70 s for 10 clips (~7 s/clip), ok=10 fail=0.

`ciede2000_mean` and `psnr_hvs_mean` are NaN for all 10 clips. This is expected: it is an identity-pair artifact (see ADR-0362 §Consequences). `vmaf_mean` = 100.0 for all clips because an identity pair saturates at the maximum score. `cambi_mean` = 0.0 for all clips in the smoke set, meaning CAMBI detected no banding in these ten H.264 UGC clips.
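
The way a feature ends up NaN or constant after aggregation can be illustrated with a small stdlib-only sketch of the nanmean/nanstd step. The function names and the `_std` column suffix are assumptions (the `_mean` suffix follows the table above), the population standard deviation mirrors NumPy's default `ddof=0`, and the per-frame dict layout is hypothetical rather than the tool's actual JSON schema.

```python
import math
import statistics


def nan_aggregate(values):
    """Return (mean, population std) over non-NaN entries; (nan, nan) if none remain."""
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return math.nan, math.nan
    return statistics.fmean(finite), statistics.pstdev(finite)


def aggregate_clip(frames, features):
    """Collapse per-frame feature dicts into <feature>_mean / <feature>_std columns."""
    row = {}
    for name in features:
        # A feature missing from a frame is treated as NaN for that frame.
        mean, std = nan_aggregate([frame.get(name, math.nan) for frame in frames])
        row[f"{name}_mean"] = mean
        row[f"{name}_std"] = std
    return row
```

A feature that is NaN on every frame stays NaN after aggregation, while one that is 100.0 on every frame yields a mean of 100.0 and a standard deviation of 0.0.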

Constant vs informative columns

| Feature | Identity-pair behaviour | Useful for training? |
| --- | --- | --- |
| adm2, adm_scale* | Pinned at 1.0 (maximum) | No |
| vif_scale* | Pinned at 1.0 (maximum) | No |
| float_ssim, float_ms_ssim | Pinned at 1.0 (maximum) | No |
| vmaf | Pinned at 100.0 (maximum) | No (identity) |
| ciede2000, psnr_hvs | NaN | No |
| psnr_y/cb/cr | Pinned at the clipped maximum (~∞) | No |
| cambi | Content-dependent | Yes |
| motion, motion2, motion3 | Content-dependent | Yes |
| ssimulacra2 | Pinned near 0 | Marginal |

Only CAMBI and motion features carry discriminative signal under the FR-from-NR adapter. The identity-pair limitation is inherent to the NR→FR mapping; it is documented and expected. Downstream training must either drop constant columns or use only the informative subset.
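
Dropping constant columns can be done mechanically rather than from a hard-coded list. The sketch below is one possible filter, with a hypothetical function name and tolerance; it treats a column as uninformative when it is NaN everywhere or when its finite values do not vary across rows.

```python
import math


def informative_columns(rows, tol=1e-9):
    """Return the column names whose finite values vary by more than tol across rows."""
    keep = []
    for name in rows[0]:
        finite = [
            r[name] for r in rows
            if isinstance(r[name], (int, float)) and math.isfinite(r[name])
        ]
        # All-NaN columns have no finite values; constant columns have zero spread.
        if finite and max(finite) - min(finite) > tol:
            keep.append(name)
    return keep
```

Such a filter is best run on the full parquet rather than on a small sample: in the 10-clip smoke set `cambi_mean` is 0.0 throughout and would be discarded, even though CAMBI is content-dependent in general.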

Full-run ETA

| Parameter | Value |
| --- | --- |
| Clip count | 152,265 |
| Time per clip (observed) | ~7 s |
| Single-process wall time | ~296 h |
| Hardware | RTX 4090, CUDA 13.2, driver 595.71.05 |
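
The wall-time figure follows directly from the clip count and the observed per-clip time. The helper below is a hypothetical sketch of that arithmetic; its `workers` parameter assumes ideal linear scaling, which concurrent processes sharing one GPU may not achieve.

```python
CLIPS = 152_265          # clip count from the dataset profile
SECONDS_PER_CLIP = 7.0   # observed in the 10-clip smoke test


def eta_hours(clips: int, seconds_per_clip: float, workers: int = 1) -> float:
    """Estimated wall time in hours, assuming perfect scaling across workers."""
    return clips * seconds_per_clip / workers / 3600.0


single = eta_hours(CLIPS, SECONDS_PER_CLIP)            # about 296 h, roughly 12.3 days
four = eta_hours(CLIPS, SECONDS_PER_CLIP, workers=4)   # about 74 h under ideal scaling
```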

Parallelisation paths (follow-up):

- N parallel processes, each given `--limit` and its own `--clips-dir` subset.
- Task queue (e.g. `xargs -P 4` over clip batches).
- Multi-GPU: route subsets to different CUDA devices via `CUDA_VISIBLE_DEVICES`.
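
As a rough illustration of the subset and multi-GPU options, the sketch below splits a clip list into shards and builds a per-worker environment that pins one CUDA device. The function names are hypothetical, and the extraction command itself is left out because this digest does not name the script.

```python
import os


def shard(clips, n):
    """Split clips into n round-robin shards of near-equal size."""
    return [clips[i::n] for i in range(n)]


def worker_env(gpu_index, base_env=None):
    """Copy the environment and restrict the worker to a single CUDA device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env
```

Each shard could then be handed to its own process, for example through `subprocess.Popen(cmd, env=worker_env(i))`, where `cmd` is whatever extraction command the project uses.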

Conclusion

The FR-from-NR adapter is feasible and correct for K150K-A. The 10-clip smoke test confirms stable extraction at ~7 s/clip with zero failures. The resulting parquet will substantially expand tiny-AI training data in the UGC domain. Constant columns under the identity-pair adapter are a known limitation; downstream training should drop them or train on the informative subset only.