KonViD-1k corpus ingestion → MOS-corpus JSONL

The KonViD-1k dataset (Hosu et al., QoMEX 2017) is a 1,200-clip user-generated-video corpus with crowdsourced subjective Mean Opinion Scores. The VMAFx fork uses it as Phase 1 of the ADR-0325 KonViD-150k ingestion plan: a small, fast-to-iterate predecessor that validates the JSONL conversion shape before scaling to the full ~150 k corpus in Phase 2.

See ADR-0325 for the two-phase decision and Research-0086 for the feasibility analysis.

1. Dataset overview

KonViD-1k ships distorted-only content — every clip is a user-uploaded YouTube/Flickr encode, and every clip carries one subjective MOS aggregated from ≥ 50 crowdworker ratings on a 1–5 scale. There is no separate raw reference; each clip is its own opinion datum. Clips are mostly 540p, ~8 s long, h264 / vp9.

Citation:

Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tamás Szirányi, Shujun Li, Dietmar Saupe. The Konstanz natural video database (KoNViD-1k). QoMEX 2017.

2. Where to download

KonViD-1k is distributed by the University of Konstanz / MMSP group as two zip files (videos + metadata). The fork does not redistribute the corpus, the per-clip MOS values, or any derived per-clip statistics (license is research-only — see ADR-0325 §License). Obtain it from the upstream source and place it locally:

.workingdir2/konvid-1k/
  ├── KoNViD_1k_videos/
  │     ├── 1.mp4
  │     ├── 2.mp4
  │     └── ... (1200 clips)
  └── KoNViD_1k_metadata/
        └── KoNViD_1k_attributes.csv

Both paths are gitignored (.workingdir2/ is the fork's standard research-data drop, see CLAUDE.md §5). The companion downloader ai/scripts/fetch_konvid_1k.py fetches and extracts everything to a similarly shaped location under $VMAF_DATA_ROOT/konvid-1k/; the ingestion script accepts either layout via --konvid-dir. The fetcher writes <root>/fetch_manifest.json by default and accepts --manifest-out PATH to write it elsewhere. That sidecar records the archive URLs, observed archive sizes, extracted-directory status, the --keep-zips setting, and ADR-0661 run_provenance before the JSONL adapter consumes the local root.

Dataset URL: https://database.mmsp-kn.de/konvid-1k-database.html
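
Before running the adapter, it can help to confirm that the local tree matches the layout shown above. The following is a minimal sketch, not part of the fork: the function and constant names are hypothetical, and only the directory and file names are taken from the tree above.

```python
from pathlib import Path

# Hypothetical helper names; the directory and CSV names mirror the tree above.
VIDEO_DIR = "KoNViD_1k_videos"
METADATA_CSV = Path("KoNViD_1k_metadata") / "KoNViD_1k_attributes.csv"


def check_layout(root: Path) -> list:
    """Return a list of human-readable problems; an empty list means the layout looks right."""
    problems = []
    videos = root / VIDEO_DIR
    if not videos.is_dir():
        problems.append(f"missing directory: {videos}")
    elif not any(videos.glob("*.mp4")):
        problems.append(f"no .mp4 clips under: {videos}")
    csv_path = root / METADATA_CSV
    if not csv_path.is_file():
        problems.append(f"missing attribute CSV: {csv_path}")
    return problems
```

Calling check_layout(Path(".workingdir2/konvid-1k")) and printing the result gives a quick pre-flight report without touching ffprobe.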

3. Pipeline

The Phase 1 ingestion is a single transform:

   .workingdir2/konvid-1k/
            │
            │  ai/scripts/konvid_1k_to_corpus_jsonl.py
            │      (ffprobe per clip + CSV MOS join)
            ▼
   .workingdir2/konvid-1k/konvid_1k.jsonl
            │
            │  (Phase 3 — out of scope here, ADR-0325 §Phase 3)
            ▼
   ensemble-training-kit MOS-head trainer

The script runs ffprobe once per clip, joins with the attribute CSV's MOS / SD / rating-count columns, and emits one JSONL row per clip. Heavy feature extraction (libvmaf, GPU-cached parquet) does not run here — Phase 1 is a pure metadata-and-MOS join. That keeps the working set small (~50 MB JSONL on top of the 5 GB clip corpus) and re-runs idempotent.
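
The per-clip probe step can be sketched as follows. This is an illustration under assumptions, not the adapter's actual code: the function names are hypothetical, the choice of avg_frame_rate for framerate and codec_name for encoder_upstream is a guess at how the schema fields are populated, and only standard ffprobe options are used.

```python
import json
import subprocess
from fractions import Fraction


def run_ffprobe(path: str, ffprobe_bin: str = "ffprobe") -> dict:
    """Run ffprobe once on a clip and return its parsed JSON report."""
    cmd = [ffprobe_bin, "-v", "error", "-print_format", "json",
           "-show_format", "-show_streams", path]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


def geometry_from_probe(probe: dict):
    """Extract geometry from the first video stream; return None if the clip is unusable."""
    video = next(
        (s for s in probe.get("streams", []) if s.get("codec_type") == "video"),
        None,
    )
    if video is None:
        return None
    width = int(video.get("width", 0))
    height = int(video.get("height", 0))
    if width <= 0 or height <= 0:
        return None
    try:
        # ffprobe reports rates as rational strings such as "30000/1001".
        framerate = float(Fraction(video.get("avg_frame_rate", "0/1")))
    except (ValueError, ZeroDivisionError):
        framerate = 0.0
    return {
        "width": width,
        "height": height,
        "framerate": framerate,
        "duration_s": float(probe.get("format", {}).get("duration", 0.0)),
        "pix_fmt": video.get("pix_fmt"),
        "encoder_upstream": video.get("codec_name"),  # assumption: codec name
    }
```

Keeping the subprocess call separate from the parsing step means the parser can be exercised with a synthesised payload, without ffprobe installed.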

4. Run command

After dropping the extracted KonViD-1k under .workingdir2/konvid-1k/:

python ai/scripts/konvid_1k_to_corpus_jsonl.py

Default output path is .workingdir2/konvid-1k/konvid_1k.jsonl. Override with --output. Override the input layout with --konvid-dir. Override the ffprobe binary with --ffprobe-bin (also picked up from $FFPROBE_BIN).

The adapter writes <output>.manifest.json by default and accepts --manifest-out PATH when a run bundle needs a separate sidecar location. The manifest records the KonViD root, effective corpus version, row counters, and ADR-0661 run_provenance.

The summary line lands on stderr on completion:

[konvid-1k-jsonl] wrote N rows, skipped M (broken), K dedups -> <path>

Re-running against an existing output is idempotent: clips already present (keyed by src_sha256) are skipped and counted as dedups; new clips are appended. Existing rows are never rewritten.
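
The append-only behaviour described above can be illustrated with a short sketch. It assumes the simplest possible approach (read the existing hashes, then append anything new); the function name is hypothetical and the real adapter may be structured differently.

```python
import json
from pathlib import Path


def append_new_rows(output: Path, rows: list) -> tuple:
    """Append rows whose src_sha256 is not yet in the output; return (written, dedups)."""
    seen = set()
    if output.exists():
        with output.open("r", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    seen.add(json.loads(line)["src_sha256"])
    written = dedups = 0
    # Open in append mode so rows already on disk are never rewritten.
    with output.open("a", encoding="utf-8") as fh:
        for row in rows:
            if row["src_sha256"] in seen:
                dedups += 1
                continue
            fh.write(json.dumps(row, sort_keys=True) + "\n")
            seen.add(row["src_sha256"])
            written += 1
    return written, dedups
```

Running it twice with the same rows writes nothing the second time, which is the property that makes re-runs safe.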

5. Output schema

One JSON object per line:

{
  "src": "1234567.mp4",
  "src_sha256": "ab12...",
  "src_size_bytes": 4321987,
  "width": 960,
  "height": 540,
  "framerate": 30.0,
  "duration_s": 8.0,
  "pix_fmt": "yuv420p",
  "encoder_upstream": "h264",
  "mos": 3.42,
  "mos_std_dev": 0.51,
  "n_ratings": 64,
  "corpus": "konvid-1k",
  "corpus_version": "konvid-1k-2017",
  "ingested_at_utc": "2026-05-08T12:00:00+00:00"
}

The schema is distinct from the existing vmaf-tune Phase A CORPUS_ROW_KEYS row: it carries no vmaf_score, encoder, preset, or crf. The two corpora are merged at the trainer level in Phase 3, not at the JSONL level, because their natural keys differ. vmaf-tune rows key on (src_sha256, encoder, preset, crf), since they are synthetic encodes of a known reference, while KonViD rows key on src_sha256 alone (the clip is the artefact, and it carries a human MOS instead of an algorithmic VMAF score).
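
A small sketch of the row shape and the two natural keys may make the distinction concrete. The field names are copied from the example row above; the constant and function names are hypothetical, and using the corpus field to tell the two row types apart is an assumption made only for this illustration.

```python
# Field names come from the example row; everything else here is illustrative.
KONVID_ROW_KEYS = frozenset({
    "src", "src_sha256", "src_size_bytes", "width", "height", "framerate",
    "duration_s", "pix_fmt", "encoder_upstream", "mos", "mos_std_dev",
    "n_ratings", "corpus", "corpus_version", "ingested_at_utc",
})


def check_row(row: dict) -> tuple:
    """Return (missing, unexpected) key lists for one KonViD JSONL row."""
    keys = set(row)
    return sorted(KONVID_ROW_KEYS - keys), sorted(keys - KONVID_ROW_KEYS)


def natural_key(row: dict) -> tuple:
    """KonViD rows key on the clip hash alone; vmaf-tune rows add the encode settings."""
    if row.get("corpus") == "konvid-1k":
        return (row["src_sha256"],)
    return (row["src_sha256"], row["encoder"], row["preset"], row["crf"])
```

Because the two key tuples have different lengths, rows from the two corpora cannot collide if a trainer groups by natural_key.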

The corpus_version field defaults to "konvid-1k-2017" (the QoMEX release year) and is overridable via --corpus-version for downstream shards (e.g. a re-rated 2019 metadata refresh).

6. Refusal: KonViD-150k mis-mount

If the operator points the script at a KoNViD_1k_metadata/ directory that actually holds the KonViD-150k attribute CSV (~150 000 rows), the script aborts with a hint pointing at konvid_150k_to_corpus_jsonl.py (Phase 2; not yet shipped, see ADR-0325 §Phase 2). The threshold is 1500 rows: the actual KonViD-1k size is exactly 1200, so the gap absorbs minor index-row variations without false positives. This guards against silently ingesting a corpus more than 100× larger through a 1k-shaped pipeline; the geometry probe alone would take days at that scale and the disk impact (200+ GB) would surprise the operator.
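
The guard amounts to a row count checked against a fixed limit. A minimal sketch follows; the 1500-row threshold is from the text above, while the names, the exception type, the message wording, and the choice of a strict "greater than" comparison are assumptions.

```python
import csv
import io

MAX_KONVID_1K_ROWS = 1500  # threshold stated in the docs; KonViD-1k itself has 1200 rows


class WrongCorpusError(RuntimeError):
    """Raised when the attribute CSV is too large to be KonViD-1k (hypothetical type)."""


def count_data_rows(fh) -> int:
    """Count CSV data rows, excluding the header line."""
    return sum(1 for _ in csv.DictReader(fh))


def guard_corpus_size(n_rows: int, limit: int = MAX_KONVID_1K_ROWS) -> None:
    """Abort early if the CSV looks like the 150k corpus rather than the 1k one."""
    if n_rows > limit:
        raise WrongCorpusError(
            f"attribute CSV has {n_rows} rows (limit {limit}); this looks like "
            "KonViD-150k, see konvid_150k_to_corpus_jsonl.py"
        )
```

Counting rows before any clip is probed keeps the failure cheap: the check costs one pass over the CSV rather than hours of ffprobe calls.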

7. Reproducibility and CI

Every step is deterministic given the same input archive. The src_sha256 field is a SHA-256 of the clip bytes, read in 1 MiB chunks (the same approach vmaftune.corpus.py already uses). Chunking only bounds memory use and does not change the digest, so re-runs across machines produce identical hashes for identical clips.
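
Chunked hashing of this kind is commonly written as below. This is a generic sketch using the standard library, with a hypothetical function name; it is not copied from the adapter or from vmaftune.

```python
import hashlib

CHUNK_BYTES = 1024 * 1024  # 1 MiB, matching the chunk size named above


def sha256_file(path, chunk_bytes: int = CHUNK_BYTES) -> str:
    """Hash a file incrementally so memory use stays bounded by one chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is identical to hashing the whole file in one call, so the chunk size can differ between machines without affecting deduplication.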

CI cannot retrain end-to-end (the corpus is non-redistributable). The adapter is exercised by ai/tests/test_konvid_1k.py, which mocks ffprobe via a synthesised JSON payload and stands up a temporary .workingdir2/konvid-1k/-shaped tree on disk. The tests run in well under one second and require neither ffprobe nor the corpus.

8. Operational notes

Broken clips are skipped, not fatal. If ffprobe returns a non-zero exit code, fails to parse JSON, or reports zero streams / zero geometry, the clip is logged as skipped (broken) and the run continues. The summary line reports the count.
CSV column-name aliases. The attribute CSV uses different column names across the 2017 / 2019 dataset releases (MOS vs mos, SD vs mos_std, n vs num_ratings, file_name vs video_name). The adapter accepts every spelling that has shipped to date; if a release adds a new alias, edit _CSV_*_KEYS at the top of the script.
License posture. Per ADR-0325 §License, neither the clips, the per-clip MOS values, nor the JSONL itself ship in the repo. Only the ingestion script, this docs page, and the schema definition are in-tree. Trained model weights derived from the corpus are redistributable; per-clip data is not.
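
The column-alias handling described in the notes above can be sketched as a first-match lookup over tuples of accepted spellings. The alias spellings are the ones listed in the text; the tuple names, function names, and the mapping onto output fields are hypothetical stand-ins for the script's _CSV_*_KEYS constants, whose exact contents are not shown here.

```python
import csv
import io

# Illustrative stand-ins for the script's _CSV_*_KEYS tuples.
NAME_KEYS = ("file_name", "video_name")
MOS_KEYS = ("MOS", "mos")
SD_KEYS = ("SD", "mos_std")
COUNT_KEYS = ("n", "num_ratings")


def pick(row: dict, aliases: tuple) -> str:
    """Return the value under the first alias present and non-empty in the row."""
    for key in aliases:
        if row.get(key) not in (None, ""):
            return row[key]
    raise KeyError(f"none of {aliases} found in CSV row")


def parse_attribute_row(row: dict) -> dict:
    """Map one attribute-CSV row onto the JSONL field names, whatever the spelling."""
    return {
        "src": pick(row, NAME_KEYS),
        "mos": float(pick(row, MOS_KEYS)),
        "mos_std_dev": float(pick(row, SD_KEYS)),
        # int(float(...)) tolerates counts written as "64.0".
        "n_ratings": int(float(pick(row, COUNT_KEYS))),
    }
```

Supporting a new release under this scheme means adding one string to the relevant tuple, which matches the maintenance step the note describes.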

9. Next phases

Phase 1.5 (optional). Run the same script shape against YouTube-UGC (Google's 1.5k-clip MOS+VMAF set) for a cross-corpus sanity check.
Phase 2. Scale to KonViD-150k via ai/scripts/konvid_150k_to_corpus_jsonl.py (not yet shipped — see ADR-0325 §Phase 2). Adds resumable downloads, ~5–8 % attrition tolerance, and an "ugc-mixed" ENCODER_VOCAB slot.
Phase 3. Train a sibling MOS-head ONNX via the existing ensemble-training-kit harness (ADR-0324). Held-out fold gates production-flip via the ADR-0303 protocol.