KoNViD-150k-A (K150K-A) — Feature Extraction¶
KoNViD-150k-A is the largest publicly available no-reference video quality corpus: 152,265 H.264 clips, each annotated with a crowd-sourced mean opinion score (MOS). This document covers how to extract the 22 FULL_FEATURES (Research-0026) from it for tiny-AI model training.
Quick start¶
# Smoke-test: 100 clips (~3 min at 8 workers on a 32-thread CPU)
python ai/scripts/extract_k150k_features.py --limit 100 --threads-cuda 8
# Full run (background, ~60–80 h at 8 workers)
nohup python ai/scripts/extract_k150k_features.py \
    --threads-cuda 8 --threads 4 \
    > runs/k150k_extract.log 2>&1 &
echo "PID $!"
Both commands resume automatically from the .done checkpoint if interrupted.
Throughput¶
Default K150K throughput comes from parallel CPU worker processes; despite its name, --threads-cuda sets the number of those workers. Benchmarking showed that 540p 5-second clips gain nothing from the GPU: CUDA context initialisation overhead makes the GPU binary 3.7× slower than the CPU binary for this clip geometry. At 8 parallel workers the CPU path achieves 0.5–0.7 clip/s against the 0.14 clip/s serial baseline, a roughly 4–5× improvement. See ADR-0383 and Research-0096 for the full investigation.
For larger local corpora such as CHUG, passing a CUDA-capable --vmaf-bin enables the split extractor path described below: stable CUDA twins run on the GPU, while residual float_ssim / cambi extraction stays on the CPU binary. Pass --metadata-jsonl .workingdir2/chug/chug.jsonl for CHUG runs so the output parquet keeps content, bitrate-ladder, raw-MOS, and deterministic train/validation/test split metadata alongside the feature columns. ai/scripts/train_konvid_mos_head.py --feature-parquet <path> consumes these FULL_FEATURES parquet files directly and maps <feat>_mean columns onto the trainer's canonical feature vector.
| Configuration | clip/s | Full-corpus ETA |
|---|---|---|
| Serial (old baseline) | 0.14 | ~296 h |
| 8 workers, 4 threads | ~0.6 | ~70 h |
| 12 workers, 2 threads | ~0.6 | ~70 h |
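The ETA column follows from dividing the corpus size by the sustained clip rate. A minimal sketch of that arithmetic, using the 152,265-clip count quoted at the top of this page; `eta_hours` is an illustrative helper and not part of the extraction script.

```python
CORPUS_CLIPS = 152_265  # clip count quoted at the top of this page


def eta_hours(clips_per_second: float, n_clips: int = CORPUS_CLIPS) -> float:
    """Wall-clock hours needed to process n_clips at a sustained rate."""
    if clips_per_second <= 0:
        raise ValueError("clips_per_second must be positive")
    return n_clips / clips_per_second / 3600.0


if __name__ == "__main__":
    # Measured range for 8 parallel workers, per the Throughput section.
    for rate in (0.5, 0.6, 0.7):
        print(f"{rate:.1f} clip/s -> ~{eta_hours(rate):.0f} h")
```

At the measured 0.5 to 0.7 clip/s this gives roughly 60 to 85 hours for the full corpus, so the real run time depends on where in that range a given machine settles.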
Build¶
The system /usr/local/bin/vmaf v3.0.0 is not compatible — it lacks ssimulacra2 and motion_v2. Build the fork binary:
meson setup core/build-cpu core \
    -Denable_cuda=false -Denable_sycl=false --buildtype=release
ninja -C core/build-cpu
Optional CUDA extraction uses a CUDA-capable fork binary for the stable CUDA feature set and a CPU fork binary for the residual feature pass:
meson setup core/build-cuda core \
    -Denable_cuda=true -Denable_sycl=false --buildtype=release
ninja -C core/build-cuda
python ai/scripts/extract_k150k_features.py \
    --vmaf-bin core/build-cuda/tools/vmaf \
    --cpu-vmaf-bin core/build-cpu/tools/vmaf
The split is intentional. The CUDA binary exposes GPU twins for most FULL_FEATURES, but the all-feature --backend cuda auto-selection path can double-register feature keys on mixed extractor bundles. The extractor uses explicit CUDA feature names for the GPU-safe pass and keeps float_ssim / cambi on the CPU binary so the emitted 22-column parquet schema stays unchanged.
Options¶
| Flag | Default | Description |
|---|---|---|
| --clips-dir | .workingdir2/konvid-150k/k150ka_extracted | Directory of K150K-A .mp4 clips |
| --scores | .workingdir2/konvid-150k/k150ka_scores.csv | CSV with video_name, video_score MOS labels |
| --vmaf-bin | core/build-cpu/tools/vmaf | Fork vmaf binary (must support ssimulacra2 + motion_v2) |
| --cpu-vmaf-bin | core/build-cpu/tools/vmaf | CPU fork binary used for residual CPU-only feature passes when --vmaf-bin is CUDA-capable |
| --out | runs/full_features_k150k.parquet | Output parquet path (gitignored) |
| --threads | 2 | vmaf --threads value per worker (inner threading) |
| --threads-cuda | 8 | Number of parallel worker processes (outer parallelism) |
| --flush-every | 200 | Flush parquet every N clips |
| --limit | (none) | Process at most N clips (smoke-test mode) |
| --no-cuda | off | Pass --no_cuda --no_sycl to vmaf binary (Vulkan removed in ADR-0726) |
| --scratch-dir | /tmp/k150k_yuv_scratch | Temporary YUV decode directory |
| --metadata-jsonl | (none) | Optional CHUG/K150K sidecar JSONL whose identity, ladder, raw-MOS, and split columns are copied into the output parquet |
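For orientation, the options above could be declared with `argparse` roughly as follows. This is a sketch only: the flag names and defaults are copied from the table, while the `build_parser` helper, the argument types, and the use of `store_true` for --no-cuda are assumptions, so the real script may differ.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(description="K150K-A feature extraction (sketch)")
    p.add_argument("--clips-dir", type=Path,
                   default=Path(".workingdir2/konvid-150k/k150ka_extracted"))
    p.add_argument("--scores", type=Path,
                   default=Path(".workingdir2/konvid-150k/k150ka_scores.csv"))
    p.add_argument("--vmaf-bin", type=Path,
                   default=Path("core/build-cpu/tools/vmaf"))
    p.add_argument("--cpu-vmaf-bin", type=Path,
                   default=Path("core/build-cpu/tools/vmaf"))
    p.add_argument("--out", type=Path,
                   default=Path("runs/full_features_k150k.parquet"))
    p.add_argument("--threads", type=int, default=2)       # inner: vmaf threads
    p.add_argument("--threads-cuda", type=int, default=8)  # outer: worker count
    p.add_argument("--flush-every", type=int, default=200)
    p.add_argument("--limit", type=int, default=None)
    p.add_argument("--no-cuda", action="store_true")
    p.add_argument("--scratch-dir", type=Path,
                   default=Path("/tmp/k150k_yuv_scratch"))
    p.add_argument("--metadata-jsonl", type=Path, default=None)
    return p


if __name__ == "__main__":
    args = build_parser().parse_args(["--limit", "100", "--threads-cuda", "8"])
    print(args.limit, args.threads_cuda, args.threads)
```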
Output schema¶
The output parquet has one row per clip and 48 columns:
| Column(s) | Description |
|---|---|
| clip_name | Filename of the source .mp4 |
| mos | Mean opinion score (1.0 – 5.0) |
| width, height | Decoded frame dimensions |
| <feat>_mean (22 cols) | Per-clip nanmean of the feature across frames |
| <feat>_std (22 cols) | Per-clip nanstd of the feature across frames |
CHUG runs that pass --metadata-jsonl append sidecar metadata columns such as content_id, bitrate_ladder_id, mos_raw_0_100, split, chug_split_key, and chug_split_policy. The MOS-head trainer preserves split for held-out validation and uses <feat>_mean values for the canonical feature columns it knows about.
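The 48-column count is 4 identity columns plus a mean and a standard deviation for each of the 22 features. The sketch below illustrates that layout and how a consumer might pull the `<feat>_mean` values into a fixed-order vector. The feature names are placeholders (the real list is FEATURE_NAMES in the extraction script), and both helper functions and the means-then-stds ordering are assumptions made for illustration.

```python
import math

ID_COLUMNS = ["clip_name", "mos", "width", "height"]


def expected_columns(feature_names):
    """Identity columns, then one _mean and one _std column per feature."""
    means = [f"{name}_mean" for name in feature_names]
    stds = [f"{name}_std" for name in feature_names]
    return ID_COLUMNS + means + stds


def mean_vector(row, feature_names):
    """Pick <feat>_mean values in canonical order; absent columns become NaN."""
    return [float(row.get(f"{name}_mean", math.nan)) for name in feature_names]


# Placeholder names only; they stand in for the 22 real feature names.
placeholder_features = [f"feat_{i:02d}" for i in range(22)]
columns = expected_columns(placeholder_features)

if __name__ == "__main__":
    print(len(columns))
    print(mean_vector({"feat_00_mean": 1.5}, placeholder_features[:2]))
```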
Existing FULL_FEATURES parquets can be enriched after extraction when the original run omitted --metadata-jsonl:
python ai/scripts/enrich_k150k_parquet_metadata.py \
    --features-parquet .workingdir2/chug/training/full_features_chug.parquet \
    --metadata-jsonl .workingdir2/chug/chug.jsonl
The utility rewrites the output atomically, fills only missing metadata cells by default, and prints a JSON summary with matched and missing row counts.
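The "fill only missing cells" behaviour can be illustrated on plain dictionaries. This is a sketch under stated assumptions: the join key `clip_name`, the `enrich_rows` helper, and the summary key names are hypothetical, and the real utility operates on parquet files rather than in-memory rows.

```python
import json
import math


def _is_missing(value) -> bool:
    """Treat None and float NaN as empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def enrich_rows(rows, metadata_by_clip, key="clip_name"):
    """Copy sidecar metadata into rows without overwriting existing values."""
    matched = missing = 0
    for row in rows:
        meta = metadata_by_clip.get(row[key])
        if meta is None:
            missing += 1
            continue
        matched += 1
        for column, value in meta.items():
            if column == key:
                continue
            if column not in row or _is_missing(row[column]):
                row[column] = value
    return {"matched_rows": matched, "missing_rows": missing}


if __name__ == "__main__":
    rows = [
        {"clip_name": "a.mp4", "split": None},      # empty cell: gets filled
        {"clip_name": "b.mp4", "split": "train"},   # populated cell: kept
        {"clip_name": "c.mp4"},                     # no sidecar entry
    ]
    sidecar = {
        "a.mp4": {"split": "test", "content_id": "c1"},
        "b.mp4": {"split": "validation", "content_id": "c2"},
    }
    print(json.dumps(enrich_rows(rows, sidecar)))
```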
Feature order follows FEATURE_NAMES in ai/scripts/extract_k150k_features.py (column-order-locked — see ai/AGENTS.md §K150K-A corpus extraction).
Parallelism architecture¶
Each worker in the ProcessPoolExecutor independently:
- Decodes one clip to a worker-private YUV file at --scratch-dir/<stem>_w<id>_<pid>.yuv.
- Scores it via vmaf with --threads <N>.
- Aggregates per-frame metrics.
- Deletes the YUV file immediately (keeps peak scratch usage to ~8 × 120 MB = ~1 GB).
- Returns the row dict to the main process.
The main process collects results from as_completed(), writes the .done checkpoint, and flushes the parquet every --flush-every clips. Worker failures are isolated — one bad clip emits a FAIL log line and does not abort the run.
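The collection loop described above can be sketched with the standard library as follows. `run_pool` and `demo_worker` are illustrative names, not the script's own functions, and checkpoint and parquet writes are omitted. The demo passes `ThreadPoolExecutor` so it runs in any environment; the script itself is described as using `ProcessPoolExecutor`, which is the default here.

```python
from concurrent.futures import (
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)


def run_pool(clips, worker, max_workers=8, executor_cls=ProcessPoolExecutor):
    """Run worker(clip) for every clip; collect rows and isolate failures."""
    rows, failed = [], []
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, clip): clip for clip in clips}
        for future in as_completed(futures):
            clip = futures[future]
            try:
                rows.append(future.result())
            except Exception as exc:
                # One bad clip is logged and skipped; the run continues.
                print(f"FAIL {clip}: {exc}")
                failed.append(clip)
    return rows, failed


def demo_worker(clip):
    """Stand-in for decode -> score -> aggregate -> delete scratch file."""
    if clip.startswith("bad"):
        raise RuntimeError("decode failed")
    return {"clip_name": clip}


if __name__ == "__main__":
    rows, failed = run_pool(
        ["a.mp4", "bad.mp4", "b.mp4"],
        demo_worker,
        max_workers=2,
        executor_cls=ThreadPoolExecutor,  # threads only for the demo
    )
    print(sorted(r["clip_name"] for r in rows), failed)
```

Because results arrive through `as_completed()`, row order reflects completion order rather than submission order, so anything that needs a stable order has to sort afterwards.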
FR-from-NR adapter¶
K150K-A carries no reference video. To run full-reference libvmaf extractors the script uses the FR-from-NR adapter (ADR-0346): the decoded YUV is passed as both --reference and --distorted. This has two consequences:
- Informative features (cambi, motion, motion2, motion3, ssimulacra2) reflect actual content properties and are useful for training.
- Constant features (ADM, VIF, SSIM, VMAF) sit at their identity value because the reference and distorted inputs are the same file; null features (ciede2000, psnr_hvs) are all-NaN. Downstream training must drop or impute these columns.
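One way to find the columns that need dropping or imputing is to classify each feature column by its observed values. The sketch below works on plain lists; the column names and numbers in the demo are made-up toy data, and `classify_columns` is a hypothetical helper rather than part of the training code.

```python
import math


def classify_columns(table, tol=1e-12):
    """Split feature columns into all-NaN, constant, and informative."""
    all_nan, constant, informative = [], [], []
    for name, values in table.items():
        finite = [v for v in values if not math.isnan(v)]
        if not finite:
            all_nan.append(name)          # nothing usable in this column
        elif max(finite) - min(finite) <= tol:
            constant.append(name)         # same value for every clip
        else:
            informative.append(name)
    return all_nan, constant, informative


if __name__ == "__main__":
    toy = {
        "psnr_hvs_mean": [math.nan, math.nan, math.nan],
        "vif_mean": [1.0, 1.0, 1.0],
        "motion_mean": [3.2, 0.4, 7.9],
    }
    print(classify_columns(toy))
```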
Hardware requirements¶
| Component | Requirement |
|---|---|
| CPU | 32+ threads recommended (workers × vmaf threads per worker, e.g. 8 × 4) |
| Disk (scratch) | ~1 GB for peak YUV scratch across 8 workers |
| Disk (output) | ~500 MB for the full parquet |
| GPU | Optional — not required for K150K 540p clips; useful for larger local FR-from-NR corpora through the split CUDA path |
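The scratch figure can be sanity-checked from the raw frame size. The sketch assumes 8-bit 4:2:0 YUV, 960×540 frames, and 30 fps; only the 540p height, the 5-second duration, and the 8-worker count come from this page, so treat the rest as illustrative.

```python
def yuv420p8_bytes(width: int, height: int, frames: int) -> int:
    """Raw 8-bit 4:2:0 video size: 1.5 bytes per pixel per frame."""
    return width * height * 3 // 2 * frames


# Assumed geometry: 960x540 at 30 fps for 5 seconds, 8 concurrent workers.
per_clip = yuv420p8_bytes(960, 540, 30 * 5)
peak = 8 * per_clip

if __name__ == "__main__":
    print(f"per clip: {per_clip / 1e6:.1f} MB, peak: {peak / 1e9:.2f} GB")
```

Under these assumptions one decoded clip is about 117 MB and eight concurrent workers peak at about 0.93 GB. Higher frame rates or bit depths scale the figure proportionally.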
Restartability¶
The script writes a .done checkpoint file (same path as --out with .done extension) listing completed clip names, one per line. On restart it skips already-processed clips.
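The skip-on-restart logic amounts to a set lookup against the checkpoint file. A minimal sketch, assuming the one-name-per-line format described above; the helper names are hypothetical and the checkpoint path is passed in explicitly rather than derived from --out.

```python
import tempfile
from pathlib import Path


def load_done(done_path: Path) -> set[str]:
    """Read completed clip names, one per line; missing file means none."""
    if not done_path.exists():
        return set()
    text = done_path.read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def mark_done(done_path: Path, clip_name: str) -> None:
    """Append one finished clip to the checkpoint."""
    with done_path.open("a", encoding="utf-8") as fh:
        fh.write(clip_name + "\n")


def pending_clips(all_clips, done_path: Path):
    """Clips still to process, in their original order."""
    done = load_done(done_path)
    return [clip for clip in all_clips if clip not in done]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "demo.done"
        mark_done(checkpoint, "a.mp4")
        print(pending_clips(["a.mp4", "b.mp4"], checkpoint))
```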
Two durability layers cover the on-run write path (Research-0135, ADR-0862):
- A <out>.rows.jsonl staging file is appended one line per completed clip so a crash mid-run does not lose features that already finished their VMAF pass. The staging file is the in-run write-ahead log.
- The parquet at <out> is written exactly once at the end via a .tmp rename, then fsync'd (file + parent directory) before the staging file is unlinked.
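Both layers can be sketched with standard-library calls. This is the common POSIX pattern (write a temporary file, fsync it, rename it into place, fsync the parent directory), not a copy of the script: the function names are hypothetical, raw bytes stand in for the parquet payload, and the per-append fsync on the staging file is an assumption.

```python
import json
import os
import tempfile
from pathlib import Path


def append_staging_row(staging_path: Path, row: dict) -> None:
    """Append one JSON line per finished clip (the in-run write-ahead log)."""
    with staging_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def read_staging_rows(staging_path: Path) -> list:
    """Recover rows left behind by an interrupted run."""
    if not staging_path.exists():
        return []
    with staging_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def atomic_write(out_path: Path, payload: bytes) -> None:
    """Write via a .tmp sibling, rename into place, then fsync the directory."""
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    with tmp_path.open("wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, out_path)  # atomic on POSIX filesystems
    dir_fd = os.open(out_path.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # persist the rename itself
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "features.parquet"
        staging = Path(str(out) + ".rows.jsonl")
        append_staging_row(staging, {"clip_name": "a.mp4"})
        atomic_write(out, json.dumps(read_staging_rows(staging)).encode())
        staging.unlink()  # only after the final file is durable
        print(out.exists(), staging.exists())
```

Unlinking the staging file last means a crash at any earlier point still leaves either the staging rows or the finished output on disk.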
On a clean restart, .done is consulted; clips already listed are skipped. If the staging file survives a prior crash, its rows are merged into the parquet during the no-op or end-of-run write.
Consistency check (ADR-0862, since 2026-05-30). Before the no-op early-exit returns, the script compares len(.done) against _parquet_row_count(<out>) plus any rows recovered from the staging file. On mismatch (typically: a prior run crashed after the parquet rename but before the staging unlink, on a filesystem that does not order metadata vs data writes), the script raises RuntimeError naming the missing-clip count and refuses to silently confirm the loss. Recovery: remove the affected entries from .done (or delete it to re-extract everything) and re-run.
End-of-run accounting. The post-write code path asserts len(rows) == len(recovered_rows) + ok before writing the parquet; on mismatch it raises RuntimeError and preserves the staging file for forensic recovery instead of letting a broken parquet land.
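The two invariants above reduce to simple count comparisons. The sketch below restates them as standalone functions; the function names and error wording are illustrative, and the counts are passed in directly rather than read from the .done, parquet, and staging files.

```python
def check_restart_consistency(done_count: int, parquet_rows: int,
                              recovered_rows: int) -> None:
    """Refuse a no-op exit when .done lists more clips than are stored."""
    available = parquet_rows + recovered_rows
    if done_count != available:
        raise RuntimeError(
            f".done lists {done_count} clips but {available} rows are "
            f"available ({done_count - available} missing)"
        )


def check_end_of_run(total_rows: int, recovered_rows: int, ok: int) -> None:
    """Every row about to be written must be recovered or newly extracted."""
    if total_rows != recovered_rows + ok:
        raise RuntimeError(
            f"row accounting mismatch: {total_rows} rows, "
            f"{recovered_rows} recovered + {ok} ok"
        )


if __name__ == "__main__":
    check_restart_consistency(done_count=100, parquet_rows=90, recovered_rows=10)
    check_end_of_run(total_rows=100, recovered_rows=10, ok=90)
    try:
        check_restart_consistency(done_count=100, parquet_rows=90, recovered_rows=0)
    except RuntimeError as exc:
        print(f"refused: {exc}")
```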
Further reading¶
- ADR-0383: parallelism redesign decision
- Research-0096: CUDA investigation and timing data
- ADR-0362: original corpus integration design
- ADR-0346: FR-from-NR adapter pattern
- ADR-0862: .done vs parquet consistency check on restart
- Research-0026: FULL_FEATURES definition