KoNViD-150k-A (K150K-A) — Feature Extraction

KoNViD-150k-A is the largest publicly available no-reference video quality corpus: 152,265 H.264 clips, each annotated with a crowd-sourced mean opinion score (MOS). This document covers how to extract the 22 FULL_FEATURES (Research-0026) from it for tiny-AI model training.

Quick start

# Smoke-test: 100 clips (~3 min at 8 workers on a 32-thread CPU)
python ai/scripts/extract_k150k_features.py --limit 100 --threads-cuda 8

# Full run (background, ~60–80 h at 8 workers)
nohup python ai/scripts/extract_k150k_features.py \
    --threads-cuda 8 --threads 4 \
    > runs/k150k_extract.log 2>&1 &
echo "PID $!"

Both commands resume automatically from the .done checkpoint if interrupted.

Throughput

By default, K150K throughput comes from parallel CPU workers. Despite its name, --threads-cuda sets the number of CPU worker processes. Benchmarking showed that 540p 5-second clips are best processed on the CPU: CUDA context initialisation overhead makes the GPU binary 3.7× slower than the CPU binary for this clip geometry. At 8 parallel workers the CPU path achieves 0.5–0.7 clip/s against the 0.14 clip/s serial baseline, roughly a 4–5× improvement. See ADR-0383 and Research-0096 for the full investigation.

For larger local corpora such as CHUG, passing a CUDA-capable --vmaf-bin enables the split extractor path described below: stable CUDA twins run on the GPU, while residual float_ssim / cambi extraction stays on the CPU binary. Pass --metadata-jsonl .workingdir2/chug/chug.jsonl for CHUG runs so the output parquet keeps content, bitrate-ladder, raw-MOS, and deterministic train/validation/test split metadata alongside the feature columns. ai/scripts/train_konvid_mos_head.py --feature-parquet <path> consumes these FULL_FEATURES parquet files directly and maps <feat>_mean columns onto the trainer's canonical feature vector.

| Configuration | clip/s | Full-corpus ETA |
|---|---|---|
| Serial (old baseline) | 0.14 | ~296 h |
| 8 workers, 4 threads | ~0.6 | ~70 h |
| 12 workers, 2 threads | ~0.6 | ~70 h |
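
The ETA column follows from dividing the corpus size by the sustained clip rate. A minimal sketch of that arithmetic, assuming a steady rate with no restarts (the function name is illustrative and not part of the extraction script):

```python
CORPUS_CLIPS = 152_265  # clip count quoted at the top of this document


def full_corpus_eta_hours(clips_per_second: float, clips: int = CORPUS_CLIPS) -> float:
    """Return the wall-clock hours needed to process `clips` at a steady rate."""
    if clips_per_second <= 0:
        raise ValueError("clips_per_second must be positive")
    return clips / clips_per_second / 3600.0


if __name__ == "__main__":
    # Rates taken from the throughput table above.
    for label, rate in [("serial", 0.14), ("8 workers", 0.6)]:
        print(f"{label}: ~{full_corpus_eta_hours(rate):.0f} h")
```

With the rounded rates from the table, the results land within a few percent of the tabulated ETAs; the small gap on the serial row is presumably because the published rates are themselves rounded.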

Build

The system /usr/local/bin/vmaf v3.0.0 is not compatible — it lacks ssimulacra2 and motion_v2. Build the fork binary:

meson setup core/build-cpu core \
    -Denable_cuda=false -Denable_sycl=false --buildtype=release
ninja -C core/build-cpu

Optional CUDA extraction uses a CUDA-capable fork binary for the stable CUDA feature set and a CPU fork binary for the residual feature pass:

meson setup core/build-cuda core \
    -Denable_cuda=true -Denable_sycl=false --buildtype=release
ninja -C core/build-cuda

python ai/scripts/extract_k150k_features.py \
    --vmaf-bin core/build-cuda/tools/vmaf \
    --cpu-vmaf-bin core/build-cpu/tools/vmaf

The split is intentional. The CUDA binary exposes GPU twins for most FULL_FEATURES, but the all-feature --backend cuda auto-selection path can double-register feature keys on mixed extractor bundles. The extractor uses explicit CUDA feature names for the GPU-safe pass and keeps float_ssim / cambi on the CPU binary so the emitted parquet schema (22 features, 48 columns) stays unchanged.
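
The split can be pictured as a partition of the feature list followed by a merge that refuses duplicate keys. The sketch below is illustrative only: the CPU-only set comes from the paragraph above, while the function names and the placeholder feature names are assumptions and do not mirror the script's internals.

```python
CPU_ONLY = frozenset({"float_ssim", "cambi"})  # residual extractors named in this document


def split_feature_passes(features, cuda_available):
    """Return (gpu_pass, cpu_pass), preserving input order inside each list."""
    if not cuda_available:
        return [], list(features)
    gpu_pass = [f for f in features if f not in CPU_ONLY]
    cpu_pass = [f for f in features if f in CPU_ONLY]
    return gpu_pass, cpu_pass


def merge_pass_results(features, gpu_result, cpu_result):
    """Recombine per-pass results in the original feature order.

    Raises ValueError if a key appears in both passes (a double registration)
    and KeyError if a requested feature is absent from both.
    """
    duplicated = sorted(set(gpu_result) & set(cpu_result))
    if duplicated:
        raise ValueError(f"feature keys registered twice: {duplicated}")
    merged = {**gpu_result, **cpu_result}
    missing = [f for f in features if f not in merged]
    if missing:
        raise KeyError(f"features missing from both passes: {missing}")
    return {f: merged[f] for f in features}
```

Merging back in the original order is what keeps the output schema identical whether or not the GPU pass ran.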

Options

| Flag | Default | Description |
|---|---|---|
| --clips-dir | .workingdir2/konvid-150k/k150ka_extracted | Directory of K150K-A .mp4 clips |
| --scores | .workingdir2/konvid-150k/k150ka_scores.csv | CSV with video_name, video_score MOS labels |
| --vmaf-bin | core/build-cpu/tools/vmaf | Fork vmaf binary (must support ssimulacra2 + motion_v2) |
| --cpu-vmaf-bin | core/build-cpu/tools/vmaf | CPU fork binary used for residual CPU-only feature passes when --vmaf-bin is CUDA-capable |
| --out | runs/full_features_k150k.parquet | Output parquet path (gitignored) |
| --threads | 2 | vmaf --threads value per worker (inner threading) |
| --threads-cuda | 8 | Number of parallel worker processes (outer parallelism) |
| --flush-every | 200 | Flush parquet every N clips |
| --limit | (none) | Process at most N clips (smoke-test mode) |
| --no-cuda | off | Pass --no_cuda --no_sycl to vmaf binary (Vulkan removed in ADR-0726) |
| --scratch-dir | /tmp/k150k_yuv_scratch | Temporary YUV decode directory |
| --metadata-jsonl | (none) | Optional CHUG/K150K sidecar JSONL whose identity, ladder, raw-MOS, and split columns are copied into the output parquet |

Output schema

The output parquet has one row per clip and 48 columns:

| Column(s) | Description |
|---|---|
| clip_name | Filename of the source .mp4 |
| mos | Mean opinion score (1.0 – 5.0) |
| width, height | Decoded frame dimensions |
| <feat>_mean (22 cols) | Per-clip nanmean of the feature across frames |
| <feat>_std (22 cols) | Per-clip nanstd of the feature across frames |

CHUG runs that pass --metadata-jsonl append sidecar metadata columns such as content_id, bitrate_ladder_id, mos_raw_0_100, split, chug_split_key, and chug_split_policy. The MOS-head trainer preserves split for held-out validation and uses <feat>_mean values for the canonical feature columns it knows about.

Existing FULL_FEATURES parquets can be enriched after extraction when the original run omitted --metadata-jsonl:

python ai/scripts/enrich_k150k_parquet_metadata.py \
    --features-parquet .workingdir2/chug/training/full_features_chug.parquet \
    --metadata-jsonl .workingdir2/chug/chug.jsonl

The utility rewrites the output atomically, fills only missing metadata cells by default, and prints a JSON summary with matched and missing row counts.
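
The fill-only-missing behaviour can be sketched on plain dictionaries. This is an illustration under stated assumptions, not the utility's code: joining on clip_name, the overwrite switch, and the function names are all hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def enrich_rows(rows, metadata_by_clip, overwrite=False):
    """Copy sidecar metadata into feature rows.

    rows: list of dicts, each carrying a "clip_name" key (assumed join key).
    metadata_by_clip: dict mapping clip name to a dict of metadata columns.
    Returns (enriched_rows, summary); the input rows are not mutated.
    """
    matched = missing = 0
    enriched = []
    for row in rows:
        new_row = dict(row)
        meta = metadata_by_clip.get(row["clip_name"])
        if meta is None:
            missing += 1
        else:
            matched += 1
            for key, value in meta.items():
                # Default: only fill cells that are absent or empty.
                if overwrite or is_missing(new_row.get(key)):
                    new_row[key] = value
        enriched.append(new_row)
    return enriched, {"matched": matched, "missing": missing}
```

Leaving populated cells alone by default means re-running the enrichment over an already enriched file does not clobber values written by the original extraction.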

Feature order follows FEATURE_NAMES in ai/scripts/extract_k150k_features.py (column-order-locked — see ai/AGENTS.md §K150K-A corpus extraction).
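
As a sanity check on the 48-column figure, the expected column list can be derived from an ordered feature list. The sketch below assumes the mean columns are grouped before the std columns; the actual interleaving is defined by the script, and the placeholder feature names used in the example are hypothetical.

```python
BASE_COLUMNS = ["clip_name", "mos", "width", "height"]


def expected_columns(feature_names):
    """Return base columns, then <feat>_mean columns, then <feat>_std columns."""
    if len(set(feature_names)) != len(feature_names):
        raise ValueError("feature names must be unique")
    columns = list(BASE_COLUMNS)
    columns += [f"{name}_mean" for name in feature_names]
    columns += [f"{name}_std" for name in feature_names]
    return columns


if __name__ == "__main__":
    placeholder_features = [f"feat_{i:02d}" for i in range(22)]  # stand-ins for FEATURE_NAMES
    print(len(expected_columns(placeholder_features)))  # 4 + 22 + 22
```

Comparing a list like this against the columns of a freshly written parquet is a cheap way to notice an accidental reordering of a column-order-locked schema.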

Parallelism architecture

Each worker in the ProcessPoolExecutor independently:

  1. Decodes one clip to a worker-private YUV file at --scratch-dir/<stem>_w<id>_<pid>.yuv.
  2. Scores it via vmaf with --threads <N>.
  3. Aggregates per-frame metrics.
  4. Deletes the YUV file immediately (keeps peak scratch usage to ~8 × 120 MB = ~1 GB).
  5. Returns the row dict to the main process.

The main process collects results from as_completed(), writes the .done checkpoint, and flushes the parquet every --flush-every clips. Worker failures are isolated — one bad clip emits a FAIL log line and does not abort the run.
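
The collect-and-isolate pattern described above can be sketched with the standard library. The worker passed in is a stand-in for the decode, score, aggregate, delete sequence; run_pool and its parameters are illustrative names, not the script's API.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_pool(clips, worker, max_workers=8, executor_cls=ProcessPoolExecutor):
    """Run worker(clip) for every clip and return (rows, failed_clips).

    An exception raised inside one worker call is logged and recorded,
    and the remaining clips continue to be processed.
    """
    rows, failed = [], []
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, clip): clip for clip in clips}
        for future in as_completed(futures):
            clip = futures[future]
            try:
                rows.append(future.result())
            except Exception as exc:  # one bad clip must not abort the run
                print(f"FAIL {clip}: {exc}")
                failed.append(clip)
    return rows, failed
```

With ProcessPoolExecutor the worker must be a picklable top-level function, and results arrive in completion order rather than submission order. Note that this only isolates Python exceptions: if a worker process is killed outright, the pool raises BrokenProcessPool for the outstanding futures.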

FR-from-NR adapter

K150K-A carries no reference video. To run full-reference libvmaf extractors the script uses the FR-from-NR adapter (ADR-0346): the decoded YUV is passed as both --reference and --distorted. This has two consequences:

  • Informative features (cambi, motion, motion2, motion3, ssimulacra2) reflect actual content properties and are useful for training.
  • Constant features (ADM, VIF, SSIM, VMAF) sit at their identity value because the reference and distorted inputs are identical; null features (ciede2000, psnr_hvs) are all-NaN. Downstream training must drop or impute these columns.
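
One way to find the columns that need dropping is to scan for those that are all-NaN or never vary. A minimal sketch on plain lists, where the function name, the tolerance parameter, and the column names in the example are assumptions for illustration:

```python
import math


def uninformative_columns(columns, tol=0.0):
    """Return names of columns that are all-NaN or constant within `tol`.

    columns: dict mapping column name to a list of float values.
    """
    drop = []
    for name, values in columns.items():
        finite = [v for v in values if not math.isnan(v)]
        if not finite:
            drop.append(name)  # null feature: every value is NaN
        elif max(finite) - min(finite) <= tol:
            drop.append(name)  # constant feature: carries no signal
    return drop


if __name__ == "__main__":
    nan = float("nan")
    example = {
        "cambi_mean": [0.1, 0.4, nan],
        "vmaf_mean": [100.0, 100.0, 100.0],
        "ciede2000_mean": [nan, nan, nan],
    }
    print(uninformative_columns(example))
```

Detecting these columns from the data, rather than hard-coding a list, keeps the training step correct if a later extractor version changes which features degenerate under the adapter.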

Hardware requirements

| Component | Requirement |
|---|---|
| CPU | 32+ threads recommended (one per vmaf thread across all workers) |
| Disk (scratch) | ~1 GB for peak YUV scratch across 8 workers |
| Disk (output) | ~500 MB for the full parquet |
| GPU | Optional; not required for K150K 540p clips; useful for larger local FR-from-NR corpora through the split CUDA path |
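
The scratch figure can be checked with a back-of-the-envelope calculation. The sketch assumes 8-bit 4:2:0 frames at 960×540 and 30 fps; the document only states "540p" and "5-second", so the width, frame rate, and pixel format are assumptions.

```python
def yuv420p_bytes(width, height, frames, bytes_per_sample=1):
    """Size of a raw planar 4:2:0 file.

    Each frame holds one full-resolution luma plane plus two chroma planes
    subsampled by two in each dimension.
    """
    luma = width * height
    chroma = 2 * ((width + 1) // 2) * ((height + 1) // 2)
    return (luma + chroma) * frames * bytes_per_sample


if __name__ == "__main__":
    per_clip = yuv420p_bytes(960, 540, frames=5 * 30)  # assumed geometry and frame rate
    workers = 8
    print(f"per clip: {per_clip / 1e6:.0f} MB, peak: {per_clip * workers / 1e9:.2f} GB")
```

Under these assumptions a clip decodes to about 117 MB, so eight concurrent workers stay just under 1 GB, in line with the table. Peak scratch scales linearly with the worker count.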

Restartability

The script writes a .done checkpoint file (same path as --out with .done extension) listing completed clip names, one per line. On restart it skips already-processed clips.
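
A minimal sketch of the checkpoint logic, assuming "same path with .done extension" means the parquet suffix is replaced, and that clip names are stored exactly as passed in. The helper names are illustrative.

```python
from pathlib import Path


def done_path_for(out_path):
    """Derive the checkpoint path by swapping the output suffix for .done."""
    return Path(out_path).with_suffix(".done")


def load_done(done_path):
    """Return the set of completed clip names (empty if no checkpoint exists)."""
    path = Path(done_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_done(done_path, clip_name):
    """Append one completed clip name to the checkpoint."""
    with open(done_path, "a", encoding="utf-8") as fh:
        fh.write(clip_name + "\n")


def pending_clips(all_clips, done):
    """Keep the original clip order while skipping completed clips."""
    return [clip for clip in all_clips if clip not in done]
```

Loading the checkpoint into a set keeps the skip test constant-time per clip, which matters when the list holds over 150,000 entries.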

Two durability layers cover the on-run write path (Research-0135, ADR-0862):

  1. A <out>.rows.jsonl staging file is appended one line per completed clip so a crash mid-run does not lose features that already finished their VMAF pass. The staging file is the in-run write-ahead log.
  2. The parquet at <out> is written exactly once at the end via a .tmp rename, then fsync'd (file + parent directory) before the staging file is unlinked.
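
The two layers can be sketched with the standard library alone. This is an illustration under assumptions, not the script's code: the helper names are hypothetical, parquet serialisation is replaced by an opaque bytes payload, the temporary file is fsync'd before the rename (the conventional ordering, which may differ from the script's), and the directory fsync assumes a POSIX system.

```python
import json
import os
from pathlib import Path


def append_staging_row(staging_path, row):
    """Append one completed row to the JSONL write-ahead log and force it to disk."""
    with open(staging_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def read_staging_rows(staging_path):
    """Recover rows from a surviving staging file, ignoring a torn final line."""
    path = Path(staging_path)
    if not path.exists():
        return []
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            break  # a crash mid-append can leave a partial last line
    return rows


def atomic_write(out_path, payload):
    """Write bytes to <out>.tmp, rename over <out>, then fsync the parent directory."""
    out = Path(out_path)
    tmp = out.with_name(out.name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, out)  # atomic when both paths are on the same filesystem
    dir_fd = os.open(out.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # persist the rename itself
    finally:
        os.close(dir_fd)
```

The staging file should only be unlinked after atomic_write returns, so at every instant at least one durable copy of each finished row exists.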

On a clean restart, .done is consulted; clips already listed are skipped. If the staging file survives a prior crash, its rows are merged into the parquet during the no-op or end-of-run write.

Consistency check (ADR-0862, since 2026-05-30). Before the no-op early-exit returns, the script compares len(.done) against _parquet_row_count(<out>) plus any rows recovered from the staging file. On mismatch (typically: a prior run crashed after the parquet rename but before the staging unlink, on a filesystem that does not order metadata vs data writes), the script raises RuntimeError naming the missing-clip count and refuses to silently confirm the loss. Recovery: remove the affected entries from .done (or delete it to re-extract everything) and re-run.

End-of-run accounting. The post-write code path asserts len(rows) == len(recovered_rows) + ok before writing the parquet; on mismatch it raises RuntimeError and preserves the staging file for forensic recovery instead of letting a broken parquet land.
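
Both checks reduce to simple count comparisons that fail loudly. A minimal sketch, with hypothetical function names and error messages; the counts would come from the .done file, the parquet, and the staging file as described above:

```python
def check_noop_consistency(done_count, parquet_rows, recovered_rows):
    """Raise if the checkpoint lists more or fewer clips than can be accounted for."""
    available = parquet_rows + recovered_rows
    if done_count != available:
        raise RuntimeError(
            f".done lists {done_count} clips but only {available} rows are "
            f"recoverable (difference: {done_count - available})"
        )


def check_end_of_run(total_rows, recovered_rows, ok_count):
    """Raise before the final write if the row accounting does not add up."""
    expected = recovered_rows + ok_count
    if total_rows != expected:
        raise RuntimeError(
            f"row accounting mismatch: have {total_rows}, expected {expected}; "
            "staging file preserved"
        )


if __name__ == "__main__":
    check_noop_consistency(done_count=10, parquet_rows=7, recovered_rows=3)  # passes
    check_end_of_run(total_rows=10, recovered_rows=3, ok_count=7)  # passes
```

Raising instead of logging a warning is the point of both checks: a silent mismatch would let a parquet with missing clips pass as complete.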

Further reading