KoNViD-150k-A (K150K-A) — Feature Extraction

KoNViD-150k-A is the largest publicly available no-reference video quality corpus: 152,265 H.264 clips, each annotated with a crowd-sourced mean opinion score (MOS). This document covers how to extract the 22 FULL_FEATURES (Research-0026) from it for tiny-AI model training.

Quick start

# Smoke-test: 100 clips (~3 min at 8 workers on a 32-thread CPU)
python ai/scripts/extract_k150k_features.py --limit 100 --threads-cuda 8

# Full run (background, ~60–80 h at 8 workers)
nohup python ai/scripts/extract_k150k_features.py \
    --threads-cuda 8 --threads 4 \
    > runs/k150k_extract.log 2>&1 &
echo "PID $!"

Both commands resume automatically from the .done checkpoint if interrupted.

Throughput

K150K extraction gets its default throughput from parallel CPU workers. Despite its name, --threads-cuda sets the number of parallel worker processes. Benchmarking showed that 540p 5-second clips are CPU-bound: CUDA context initialisation overhead makes the GPU binary 3.7× slower than CPU for this clip geometry. At 8 parallel workers the CPU path achieves 0.5–0.7 clip/s against the 0.14 clip/s serial baseline, roughly a 4–5× improvement. See ADR-0383 and Research-0096 for the full investigation.

For larger local corpora such as CHUG, passing a CUDA-capable --vmaf-bin enables the split extractor path described below: stable CUDA twins run on the GPU, while residual float_ssim / cambi extraction stays on the CPU binary. Pass --metadata-jsonl .workingdir2/chug/chug.jsonl for CHUG runs so the output parquet keeps content, bitrate-ladder, raw-MOS, and deterministic train/validation/test split metadata alongside the feature columns. ai/scripts/train_konvid_mos_head.py --feature-parquet <path> consumes these FULL_FEATURES parquet files directly and maps <feat>_mean columns onto the trainer's canonical feature vector.

Configuration	clip/s	Full-corpus ETA
Serial (old baseline)	0.14	~296 h
8 workers, 4 threads	~0.6	~70 h
12 workers, 2 threads	~0.6	~70 h
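
The ETA column follows from dividing the corpus size by the sustained rate. A minimal sketch of that arithmetic, assuming the 152,265-clip count from the introduction; the function name is illustrative and not part of the extraction script:

```python
CORPUS_CLIPS = 152_265  # clip count quoted in the introduction


def full_corpus_eta_hours(clips_per_second: float, n_clips: int = CORPUS_CLIPS) -> float:
    """Wall-clock hours needed to process n_clips at a sustained rate."""
    if clips_per_second <= 0:
        raise ValueError("clips_per_second must be positive")
    return n_clips / clips_per_second / 3600.0


if __name__ == "__main__":
    # Rates taken from the 8-worker range reported above.
    for rate in (0.5, 0.6, 0.7):
        print(f"{rate:.1f} clip/s -> ~{full_corpus_eta_hours(rate):.0f} h")
```

At 0.6 clip/s this gives about 70 h, in line with the table; small changes in the sustained rate move the estimate by several hours.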

Build

The system /usr/local/bin/vmaf v3.0.0 is not compatible — it lacks ssimulacra2 and motion_v2. Build the fork binary:

meson setup core/build-cpu core \
    -Denable_cuda=false -Denable_sycl=false --buildtype=release
ninja -C core/build-cpu

Optional CUDA extraction uses a CUDA-capable fork binary for the stable CUDA feature set and a CPU fork binary for the residual feature pass:

meson setup core/build-cuda core \
    -Denable_cuda=true -Denable_sycl=false --buildtype=release
ninja -C core/build-cuda

python ai/scripts/extract_k150k_features.py \
    --vmaf-bin core/build-cuda/tools/vmaf \
    --cpu-vmaf-bin core/build-cpu/tools/vmaf

The split is intentional. The CUDA binary exposes GPU twins for most FULL_FEATURES, but the all-feature --backend cuda auto-selection path can double-register feature keys on mixed extractor bundles. The extractor uses explicit CUDA feature names for the GPU-safe pass and keeps float_ssim / cambi on the CPU binary so the emitted 22-column parquet schema stays unchanged.

Options

Flag	Default	Description
`--clips-dir`	`.workingdir2/konvid-150k/k150ka_extracted`	Directory of K150K-A `.mp4` clips
`--scores`	`.workingdir2/konvid-150k/k150ka_scores.csv`	CSV with `video_name`, `video_score` MOS labels
`--vmaf-bin`	`core/build-cpu/tools/vmaf`	Fork vmaf binary (must support ssimulacra2 + motion_v2)
`--cpu-vmaf-bin`	`core/build-cpu/tools/vmaf`	CPU fork binary used for residual CPU-only feature passes when `--vmaf-bin` is CUDA-capable
`--out`	`runs/full_features_k150k.parquet`	Output parquet path (gitignored)
`--threads`	`2`	vmaf `--threads` value per worker (inner threading)
`--threads-cuda`	`8`	Number of parallel worker processes (outer parallelism)
`--flush-every`	`200`	Flush parquet every N clips
`--limit`	(none)	Process at most N clips (smoke-test mode)
`--no-cuda`	off	Pass `--no_cuda --no_sycl` to vmaf binary (Vulkan removed in ADR-0726)
`--scratch-dir`	`/tmp/k150k_yuv_scratch`	Temporary YUV decode directory
`--metadata-jsonl`	(none)	Optional CHUG/K150K sidecar JSONL whose identity, ladder, raw-MOS, and split columns are copied into the output parquet

Output schema

The output parquet has one row per clip and 48 columns:

Column(s)	Description
`clip_name`	Filename of the source `.mp4`
`mos`	Mean opinion score (1.0 – 5.0)
`width`, `height`	Decoded frame dimensions
`<feat>_mean` (22 cols)	Per-clip nanmean of the feature across frames
`<feat>_std` (22 cols)	Per-clip nanstd of the feature across frames
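
The 48-column count is 4 identity columns plus 22 mean and 22 standard-deviation columns. The sketch below shows one way a consumer might check a parquet's column list against that shape. The helper names are hypothetical, and the placement of all `_mean` columns before all `_std` columns is an assumption; the authoritative order comes from FEATURE_NAMES in the extraction script.

```python
ID_COLUMNS = ["clip_name", "mos", "width", "height"]


def expected_columns(feature_names):
    """Return ID columns, then <feat>_mean, then <feat>_std (assumed ordering)."""
    means = [f"{name}_mean" for name in feature_names]
    stds = [f"{name}_std" for name in feature_names]
    return ID_COLUMNS + means + stds


def check_schema(actual_columns, feature_names):
    """Return (missing, extra) column lists relative to the expected base schema."""
    expected = expected_columns(feature_names)
    missing = [c for c in expected if c not in actual_columns]
    extra = [c for c in actual_columns if c not in expected]
    return missing, extra


if __name__ == "__main__":
    # Placeholder names only; the real 22 names live in FEATURE_NAMES.
    placeholder = [f"feat_{i:02d}" for i in range(22)]
    print(len(expected_columns(placeholder)))
```

Columns reported as extra are not necessarily a problem: runs that pass --metadata-jsonl add sidecar columns on top of the base schema.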

CHUG runs that pass --metadata-jsonl append sidecar metadata columns such as content_id, bitrate_ladder_id, mos_raw_0_100, split, chug_split_key, and chug_split_policy. The MOS-head trainer preserves split for held-out validation and uses <feat>_mean values for the canonical feature columns it knows about.

Existing FULL_FEATURES parquets can be enriched after extraction when the original run omitted --metadata-jsonl:

python ai/scripts/enrich_k150k_parquet_metadata.py \
    --features-parquet .workingdir2/chug/training/full_features_chug.parquet \
    --metadata-jsonl .workingdir2/chug/chug.jsonl

The utility rewrites the output atomically, fills only missing metadata cells by default, and prints a JSON summary with matched and missing row counts.
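
The "fill only missing cells" behaviour can be pictured with plain dictionaries. This is a sketch of the idea, not the utility's code: the `clip_name` join key and the summary field names are assumptions made for illustration.

```python
def fill_missing_metadata(rows, sidecar, key="clip_name"):
    """Copy sidecar fields into rows, touching only cells that are absent or None."""
    by_key = {record[key]: record for record in sidecar}
    matched = missing = 0
    for row in rows:
        record = by_key.get(row[key])
        if record is None:
            missing += 1  # no sidecar entry for this clip
            continue
        matched += 1
        for field, value in record.items():
            if row.get(field) is None:  # existing values are left untouched
                row[field] = value
    return {"matched_rows": matched, "missing_rows": missing}
```

Rows are modified in place and the returned dictionary mirrors the matched/missing counts the utility reports.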

Feature order follows FEATURE_NAMES in ai/scripts/extract_k150k_features.py (column-order-locked — see ai/AGENTS.md §K150K-A corpus extraction).

Parallelism architecture

Each worker in the ProcessPoolExecutor independently:

1. Decodes one clip to a worker-private YUV file at --scratch-dir/<stem>_w<id>_<pid>.yuv.
2. Scores it via vmaf with --threads <N>.
3. Aggregates per-frame metrics.
4. Deletes the YUV file immediately (keeps peak scratch usage to ~8 × 120 MB = ~1 GB).
5. Returns the row dict to the main process.

The main process collects results from as_completed(), writes the .done checkpoint, and flushes the parquet every --flush-every clips. Worker failures are isolated — one bad clip emits a FAIL log line and does not abort the run.
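
The collect-and-isolate pattern described above can be sketched with the standard library. The worker here is a stand-in for the decode, score, and aggregate steps; `run_pool`, `demo_worker`, and the `executor_cls` parameter are hypothetical names, and the real script's log format may differ.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def run_pool(clips, worker, max_workers=8, executor_cls=ProcessPoolExecutor):
    """Run worker(clip) for each clip; return (rows, failed_clips)."""
    rows, failed = [], []
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, clip): clip for clip in clips}
        for future in as_completed(futures):
            clip = futures[future]
            try:
                rows.append(future.result())
            except Exception as exc:  # one bad clip must not abort the run
                print(f"FAIL {clip}: {exc}")
                failed.append(clip)
    return rows, failed


def demo_worker(clip):
    """Stand-in for decode + vmaf + aggregate; fails on a marker name."""
    if "corrupt" in clip:
        raise ValueError("decode failed")
    return {"clip_name": clip}


if __name__ == "__main__":
    # Threads are enough for this toy worker; the script itself uses processes.
    rows, failed = run_pool(
        ["a.mp4", "corrupt.mp4", "b.mp4"], demo_worker,
        max_workers=2, executor_cls=ThreadPoolExecutor,
    )
    print(len(rows), failed)
```

Because results arrive through as_completed(), row order follows completion order rather than submission order, so anything order-sensitive should sort or key by clip name afterwards.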

FR-from-NR adapter

K150K-A carries no reference video. To run full-reference libvmaf extractors, the script uses the FR-from-NR adapter (ADR-0346): the decoded YUV is passed as both --reference and --distorted. This has two consequences:

- Informative features (cambi, motion, motion2, motion3, ssimulacra2) reflect actual content properties and are useful for training.
- Constant features (ADM, VIF, SSIM, VMAF) sit at their identity value because the reference and distorted inputs are identical; null features (ciede2000, psnr_hvs) are all-NaN. Downstream training must drop or impute these columns.
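
Rather than hard-coding which columns to drop, a training pipeline could detect degenerate columns from the data. The sketch below assumes rows are plain dictionaries of floats; the function name and the column names in the usage example are illustrative, not the parquet's actual names.

```python
import math


def degenerate_columns(rows, columns, tol=0.0):
    """Split columns into (constant, all_nan) based on their values across rows."""
    constant, all_nan = [], []
    for col in columns:
        finite = [row[col] for row in rows if not math.isnan(row[col])]
        if not finite:
            all_nan.append(col)  # nothing but NaN: a null feature
        elif max(finite) - min(finite) <= tol:
            constant.append(col)  # no spread: pinned at one value
    return constant, all_nan


if __name__ == "__main__":
    nan = float("nan")
    rows = [
        {"vif_mean": 1.0, "psnr_hvs_mean": nan, "motion_mean": 0.5},
        {"vif_mean": 1.0, "psnr_hvs_mean": nan, "motion_mean": 2.0},
    ]
    print(degenerate_columns(rows, ["vif_mean", "psnr_hvs_mean", "motion_mean"]))
```

A small positive `tol` may be preferable to exact equality if identity values carry floating-point noise.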

Hardware requirements

Component	Requirement
CPU	32+ threads recommended (one hardware thread per vmaf thread across all workers, e.g. 8 workers × 4 threads)
Disk (scratch)	~1 GB for peak YUV scratch across 8 workers
Disk (output)	~500 MB for the full parquet
GPU	Optional — not required for K150K 540p clips; useful for larger local FR-from-NR corpora through the split CUDA path

Restartability

The script writes a .done checkpoint file (same path as --out with .done extension) listing completed clip names, one per line. On restart it skips already-processed clips.
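
A one-name-per-line checkpoint needs very little code to read and apply. The following is a minimal sketch with hypothetical helper names; it takes the checkpoint path as an argument rather than deriving it from --out.

```python
from pathlib import Path


def load_done(done_path):
    """Return the set of clip names already recorded in the checkpoint."""
    path = Path(done_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def pending_clips(all_clips, done_path):
    """Filter out clips listed in the checkpoint, preserving input order."""
    done = load_done(done_path)
    return [clip for clip in all_clips if clip not in done]


def mark_done(done_path, clip_name):
    """Append one completed clip name to the checkpoint."""
    with open(done_path, "a", encoding="utf-8") as fh:
        fh.write(clip_name + "\n")
```

Appending one line per clip keeps each checkpoint update small, and a missing file simply means nothing has been processed yet.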

Two durability layers cover the on-run write path (Research-0135, ADR-0862):

- A <out>.rows.jsonl staging file is appended one line per completed clip so a crash mid-run does not lose features that already finished their VMAF pass. The staging file is the in-run write-ahead log.
- The parquet at <out> is written exactly once at the end via a .tmp rename, then fsync'd (file + parent directory) before the staging file is unlinked.
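
Both layers map onto familiar POSIX idioms. The sketch below is an illustration under stated assumptions, not the script's code: the helper names are hypothetical, raw bytes stand in for the serialised parquet, the staging append is fsync'd per row (the script may batch this), and the file is fsync'd before the rename, which is the common ordering and may differ from the script's exact sequence.

```python
import json
import os


def append_staging_row(staging_path, row):
    """Append one completed row to the JSONL write-ahead log and flush it to disk."""
    with open(staging_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def recover_staging_rows(staging_path):
    """Read back rows left behind by an interrupted run."""
    if not os.path.exists(staging_path):
        return []
    with open(staging_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def atomic_write(out_path, payload: bytes):
    """Write payload via a temporary file, rename it into place, then fsync the directory."""
    tmp_path = f"{out_path}.tmp"
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # contents reach disk before the rename
    os.replace(tmp_path, out_path)  # atomic within one filesystem
    parent = os.path.dirname(os.path.abspath(out_path))
    dir_fd = os.open(parent, os.O_RDONLY)  # directory fsync is POSIX-only
    try:
        os.fsync(dir_fd)  # persist the rename itself
    finally:
        os.close(dir_fd)
```

Unlinking the staging file only after the directory fsync is what makes a leftover staging file a reliable signal that the final write may not have completed.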

On a clean restart, .done is consulted; clips already listed are skipped. If the staging file survives a prior crash, its rows are merged into the parquet during the no-op or end-of-run write.

Consistency check (ADR-0862, since 2026-05-30). Before the no-op early-exit returns, the script compares len(.done) against _parquet_row_count(<out>) plus any rows recovered from the staging file. On mismatch (typically: a prior run crashed after the parquet rename but before the staging unlink, on a filesystem that does not order metadata vs data writes), the script raises RuntimeError naming the missing-clip count and refuses to silently confirm the loss. Recovery: remove the affected entries from .done (or delete it to re-extract everything) and re-run.

End-of-run accounting. The post-write code path asserts len(rows) == len(recovered_rows) + ok before writing the parquet; on mismatch it raises RuntimeError and preserves the staging file for forensic recovery instead of letting a broken parquet land.
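
The two guards above reduce to simple count comparisons. A minimal sketch follows, with hypothetical function names and message wording; the script's own checks read the counts from the .done file, the parquet, and the staging file rather than taking them as arguments.

```python
def check_done_consistency(done_count, parquet_rows, recovered_rows):
    """Raise if the checkpoint and the stored rows disagree before a no-op exit."""
    available = parquet_rows + recovered_rows
    if done_count != available:
        raise RuntimeError(
            f".done lists {done_count} clips but {available} rows are stored "
            f"(difference: {done_count - available})"
        )


def check_end_of_run(total_rows, recovered_count, ok_count):
    """Raise if the rows about to be written do not add up."""
    if total_rows != recovered_count + ok_count:
        raise RuntimeError(
            f"row accounting mismatch: {total_rows} rows vs "
            f"{recovered_count} recovered + {ok_count} newly extracted"
        )
```

Raising instead of logging means a count mismatch stops the run before the parquet is confirmed or overwritten.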