LIVE-VQC corpus ingestion

LIVE Video Quality Challenge (LIVE-VQC; Sinno & Bovik, IEEE TIP 2019) is a 585-video user-generated-content (UGC) dataset collected by the LIVE Lab at UT Austin. Videos were captured on consumer smartphones and tablets across diverse real-world scenes, so the distortions are authentic in-the-wild capture distortions introduced on the device rather than by a transcoding pipeline.

This page covers acquisition, operator flags, and schema details for the live_vqc_to_corpus_jsonl.py adapter (ADR-0370). For the unified family index see mos-corpora.md.

Corpus identity

| Property | Value |
| --- | --- |
| Clips | 585 |
| MOS scale | 0–100 continuous (LIVE Lab crowdsourcing framework) |
| Size (approximate) | A few GB |
| Corpus label (corpus field) | live-vqc |
| Default corpus_version | live-vqc-2019 |
| License | Research use with attribution |
| Dataset page | https://live.ece.utexas.edu/research/LIVEVQC/ |

Citation:

Sinno, Z., Bovik, A. C., "Large-Scale Study of Perceptual Video Quality," IEEE Transactions on Image Processing, 28(2), pp. 612–627, Feb. 2019. DOI: 10.1109/TIP.2018.2875341

Acquisition

LIVE-VQC is distributed through the UT Austin LIVE Lab website (see the dataset page above). Access typically requires submitting a short request form. The repository ships only the conversion script; it does not include any clips, per-clip MOS values, or derived feature caches.

Expected local layout after extraction:

.workingdir2/live-vqc/
  ├── manifest.csv        # MOS table (operator drops — see Manifest below)
  └── clips/              # video files (operator extraction)
        ├── 001.mp4
        └── ...

Quick-start

# Laptop-class smoke run (200 clips, requires manifest + clips on disk):
python ai/scripts/live_vqc_to_corpus_jsonl.py
#    → .workingdir2/live-vqc/live_vqc.jsonl

# Full-corpus ingestion (all 585 clips):
python ai/scripts/live_vqc_to_corpus_jsonl.py --full
#    → .workingdir2/live-vqc/live_vqc.jsonl

# Custom directory:
python ai/scripts/live_vqc_to_corpus_jsonl.py \
    --live-vqc-dir /data/live-vqc \
    --output /data/live-vqc/live_vqc.jsonl

The run is resumable: interrupt it with Ctrl-C and re-run the same command. The .download-progress.json sidecar tracks per-clip state.

Manifest format

Two shapes are accepted and auto-detected:

1. Canonical two-column headerless (minimal MOS export)

The LIVE-VQC MOS spreadsheet can be exported as a headerless <filename>,<mos> CSV. Drop it at <live-vqc-dir>/manifest.csv:

001.mp4,45.23
002.mp4,72.18
003.mp4,61.05

In this shape mos_std_dev and n_ratings default to 0.0 and 0 respectively, because the export does not include inter-rater spread.

2. Standard adapter CSV (named-column header)

Alternatively, produce a standard CSV matching the LSVQ / KonViD-150k header convention:

name,url,mos,sd,n
001.mp4,https://...,45.23,8.1,30
002.mp4,https://...,72.18,6.4,28

Column aliases accepted for each field:

| Field | Aliases |
| --- | --- |
| filename | name, video_name, filename, file_name |
| URL | url, download_url, video_url |
| MOS | mos, MOS, mos_score |
| MOS SD | sd, SD, mos_std, mos_std_dev, SD_MOS |
| n ratings | n, ratings, num_ratings, n_ratings |
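
To make the two manifest shapes concrete, here is a minimal sketch of how auto-detection and alias resolution could work. It is not the adapter's actual code: the helper names (parse_manifest, _cell, _is_float) and the detection heuristic (two columns with a numeric second cell means headerless) are assumptions for illustration. Only the alias sets and the 0.0 / 0 defaults are taken from this page.

```python
import csv
import io

# Alias sets copied from the table above; the lookup logic is an assumption.
ALIASES = {
    "filename": {"name", "video_name", "filename", "file_name"},
    "url": {"url", "download_url", "video_url"},
    "mos": {"mos", "MOS", "mos_score"},
    "mos_std_dev": {"sd", "SD", "mos_std", "mos_std_dev", "SD_MOS"},
    "n_ratings": {"n", "ratings", "num_ratings", "n_ratings"},
}


def _is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _cell(row, index, field):
    """Return the stripped cell for a field, or "" if the column is absent."""
    pos = index.get(field)
    if pos is None or pos >= len(row):
        return ""
    return row[pos].strip()


def parse_manifest(text):
    """Parse either manifest shape into a list of dicts (hypothetical helper)."""
    rows = [r for r in csv.reader(io.StringIO(text)) if any(c.strip() for c in r)]
    if not rows:
        return []
    first = [c.strip() for c in rows[0]]
    # Assumed heuristic: two columns with a numeric second cell is headerless.
    if len(first) == 2 and _is_float(first[1]):
        return [
            {"filename": r[0].strip(), "url": None, "mos": float(r[1]),
             "mos_std_dev": 0.0, "n_ratings": 0}
            for r in rows
        ]
    # Otherwise treat the first row as a header and resolve aliases.
    index = {}
    for field, names in ALIASES.items():
        for pos, col in enumerate(first):
            if col in names:
                index[field] = pos
                break
    if "filename" not in index or "mos" not in index:
        raise ValueError("manifest needs at least a filename and a MOS column")
    return [
        {
            "filename": _cell(r, index, "filename"),
            "url": _cell(r, index, "url") or None,
            "mos": float(_cell(r, index, "mos")),
            "mos_std_dev": float(_cell(r, index, "mos_std_dev") or 0.0),
            "n_ratings": int(float(_cell(r, index, "n_ratings") or 0)),
        }
        for r in rows[1:]
    ]
```

Calling parse_manifest on the two-column example yields rows with mos_std_dev 0.0 and n_ratings 0, while the named-column example carries the sd and n values through.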

If the URL column is present, the adapter attempts to download missing clips via curl (resumable, with a 120-second per-clip timeout).
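
As a rough illustration of the download step, the sketch below builds a curl invocation that is resumable and time-limited. The exact options the adapter passes are not documented here, so this argument list is an assumption. The curl options themselves (--fail, --location, --continue-at, --max-time, --output) are standard. The function names are hypothetical.

```python
import subprocess


def build_curl_command(url, dest, curl_bin="curl", timeout_s=120):
    """Return an argv list for a resumable, time-limited download (assumed flags)."""
    return [
        curl_bin,
        "--fail",                      # exit non-zero on HTTP errors
        "--location",                  # follow redirects
        "--continue-at", "-",          # resume from an existing partial file
        "--max-time", str(timeout_s),  # per-clip timeout in seconds
        "--output", str(dest),
        url,
    ]


def download_clip(url, dest, **kwargs):
    """Run curl and report success as a bool; does not raise on failure."""
    result = subprocess.run(build_curl_command(url, dest, **kwargs), check=False)
    return result.returncode == 0
```

Returning a bool instead of raising lets a caller count failures across the whole manifest, which is what an attrition threshold needs.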

CLI flags

| Flag | Default | Description |
| --- | --- | --- |
| --live-vqc-dir | .workingdir2/live-vqc/ | Local working directory |
| --manifest-csv | <dir>/manifest.csv | MOS manifest path |
| --clips-subdir | clips | Sub-directory for video files |
| --output | <dir>/live_vqc.jsonl | Output JSONL path |
| --manifest-out | <output>.manifest.json | Replay manifest JSON sidecar |
| --max-rows | 200 | Cap on rows ingested (laptop-class subset) |
| --full | off | Ingest the entire manifest; overrides --max-rows |
| --corpus-version | live-vqc-2019 | Version string baked into each row |
| --ffprobe-bin | ffprobe | ffprobe binary path / env $FFPROBE_BIN |
| --curl-bin | curl | curl binary path / env $CURL_BIN |
| --attrition-warn-threshold | 0.10 | Warn if download failures exceed this fraction |
| --download-timeout-s | 120 | Per-clip curl --max-time seconds |
| --log-level | INFO | Logging level (DEBUG / INFO / WARNING / ERROR) |
| --progress-path | <dir>/.download-progress.json | Resumable download state; delete to retry failures |

The replay manifest records the LIVE-VQC working directory, manifest path, row cap, attrition counters, effective corpus version, and ADR-0661 run_provenance.
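
The interaction between --max-rows, --full, and --attrition-warn-threshold can be sketched as follows. This is an illustration of the documented semantics, not the adapter's implementation. The function names are hypothetical, and treating "exceed" as a strict greater-than comparison is an assumption.

```python
def effective_row_cap(max_rows=200, full=False):
    """Return None (no cap) when --full is set, otherwise the --max-rows value."""
    return None if full else max_rows


def attrition_exceeded(n_failed, n_attempted, threshold=0.10):
    """True when the failed fraction is strictly above the threshold (assumed)."""
    if n_attempted == 0:
        return False
    return (n_failed / n_attempted) > threshold


# Slicing with None keeps every row, so one expression covers both modes.
manifest_rows = list(range(585))  # stand-in for parsed manifest rows
subset = manifest_rows[:effective_row_cap()]
everything = manifest_rows[:effective_row_cap(full=True)]
```

With the defaults, the subset holds 200 rows and the full run holds all 585.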

Output schema

Every row in the output JSONL follows the corpus_v3 schema (ADR-0366):

{
  "src":               "001.mp4",
  "src_sha256":        "<64-hex>",
  "src_size_bytes":    4321000,
  "width":             1920,
  "height":            1080,
  "framerate":         30.0,
  "duration_s":        6.0,
  "pix_fmt":           "yuv420p",
  "encoder_upstream":  "h264",
  "mos":               63.4,          // native 0–100 scale — NOT normalised
  "mos_std_dev":       8.2,           // 0.0 if two-column CSV consumed
  "n_ratings":         30,            // 0 if two-column CSV consumed
  "corpus":            "live-vqc",
  "corpus_version":    "live-vqc-2019",
  "ingested_at_utc":   "2026-05-09T12:00:00+00:00"
}
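
The file-level and geometry fields above can be derived from the clip itself. The sketch below shows one plausible way to do so with hashlib and ffprobe. Several points are assumptions: the helper names are hypothetical, the framerate is read from r_frame_rate (the adapter may use a different stream field), and encoder_upstream is mapped to ffprobe's codec_name because the example value "h264" is consistent with that.

```python
import hashlib
import json
import subprocess
from fractions import Fraction


def sha256_and_size(path, chunk_size=1 << 20):
    """Stream a file once, returning (hex digest, size in bytes)."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def ffprobe_json(path, ffprobe_bin="ffprobe"):
    """Query the first video stream and container duration as parsed JSON."""
    cmd = [
        ffprobe_bin, "-v", "error", "-select_streams", "v:0",
        "-show_entries",
        "stream=width,height,r_frame_rate,pix_fmt,codec_name:format=duration",
        "-of", "json", str(path),
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def geometry_fields(probe):
    """Map ffprobe JSON to row fields (field mapping is an assumption)."""
    stream = probe["streams"][0]
    return {
        "width": int(stream["width"]),
        "height": int(stream["height"]),
        # r_frame_rate is a rational string such as "30/1" or "30000/1001".
        "framerate": float(Fraction(stream["r_frame_rate"])),
        "duration_s": float(probe["format"]["duration"]),
        "pix_fmt": stream["pix_fmt"],
        "encoder_upstream": stream["codec_name"],
    }
```

Keeping geometry_fields separate from the subprocess call means the mapping can be exercised on canned ffprobe output without a video file on disk.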

MOS scale note: LIVE-VQC uses a 0–100 continuous scale, not the 1–5 ACR Likert scale used by KonViD-1k / KonViD-150k / LSVQ. When combining corpora use ai/scripts/aggregate_corpora.py, which normalises each shard to a common axis before the trainer consumes it. See multi-corpus-aggregation.md.
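
To show why the scale difference matters, here is a minimal sketch of mapping native MOS values onto a shared [0, 1] axis by linear min-max rescaling. The method aggregate_corpora.py actually uses is not described on this page, so the linear mapping is an assumption. The corpus labels other than live-vqc and the function name are hypothetical.

```python
# Native MOS ranges. The live-vqc range and the 1-5 ACR range come from the
# note above; the konvid/lsvq label strings are assumed for illustration.
NATIVE_RANGE = {
    "live-vqc": (0.0, 100.0),
    "konvid-1k": (1.0, 5.0),
    "konvid-150k": (1.0, 5.0),
    "lsvq": (1.0, 5.0),
}


def normalise_mos(mos, corpus):
    """Linearly rescale a native MOS to [0, 1] (assumed mapping)."""
    lo, hi = NATIVE_RANGE[corpus]
    return (mos - lo) / (hi - lo)
```

Under this mapping a LIVE-VQC score of 50.0 and an ACR score of 3.0 both land at 0.5, the midpoint of their respective scales.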

Integrating with the unified training pipeline

python ai/scripts/aggregate_corpora.py \
    --inputs .workingdir2/konvid-150k/konvid_150k.jsonl \
             .workingdir2/lsvq/lsvq.jsonl \
             .workingdir2/live-vqc/live_vqc.jsonl \
    --output .workingdir2/aggregated/unified_corpus.jsonl

License and redistribution

LIVE-VQC is available for research use with attribution. No clips, MOS values, or derived features are committed to this repository (per ADR-0370). ONNX weights trained on LIVE-VQC data travel with the Sinno & Bovik 2019 citation in their model-card sidecar.