ADR-0371 — Shared `CorpusIngestBase` for MOS-corpus ingestion adapters¶

Field	Value
Status	Accepted
Date	2026-05-10
Scope	ai, corpus, refactor, fork-local
PR	(this PR)

Context¶

Six MOS-corpus ingestion scripts (KonViD-1k, KonViD-150k, LSVQ, LIVE-VQC, Waterloo IVC 4K-VQA, YouTube UGC) each duplicated approximately 200 lines of identical boilerplate:

_sha256_file — chunked SHA-256 with 1 MiB reads
_utc_now_iso — second-precision ISO-8601 UTC timestamp
probe_geometry — ffprobe JSON geometry extractor with identical command construction, stream selection, and error handling
_pick — case-insensitive CSV column picker (identical across all six files)
_parse_framerate — rational a/b ffprobe framerate parser
load_progress / save_progress / mark_done / mark_failed / should_attempt — resumable-download progress state (four scripts)
download_clip — curl-backed per-clip downloader (four scripts)
_read_existing_sha_index — JSONL dedup reader
RunStats — aggregate counters class (four scripts)
build_row — JSONL row assembler (identical output schema across all)

Total duplicated lines: ~1 200 across the six files (average ~200 per script), in addition to ~1 600 lines of corpus-specific CSV parsing and CLI that are legitimately different.

The user requested consolidation to reduce maintenance surface and prevent future divergence as new corpora are added.

Decision¶

Extract the shared boilerplate into ai/src/corpus/base.py as a package-level module exporting:

Free functions: sha256_file, utc_now_iso, pick, normalise_clip_name, probe_geometry, load_progress, save_progress, mark_done, mark_failed, should_attempt, download_clip, read_sha_index
RunStats class (aggregate counters)
CorpusIngestBase ABC with:
__init__ accepting all shared configuration (corpus_dir, output, ffprobe_bin, curl_bin, corpus_version, runner seam, etc.)
Abstract iter_source_rows(clips_dir) that subclasses override
run() orchestrator implementing the probe-SHA-write-dedup loop

Each of the six scripts is refactored to a ~80-150 LOC file containing only:

Corpus-specific constants and CSV column aliases
A parse_manifest_csv function (the only legitimately different logic per script)
A CorpusIngestBase subclass with iter_source_rows implemented
A backward-compatible module-level run() function preserving the existing test and caller API
The argparse CLI

bvi_dvc_to_corpus_jsonl.py is intentionally excluded: it wraps the vmaftune.CORPUS_ROW_KEYS schema (a full-reference encode-quality corpus), not the MOS-ingest schema; conflating the two would couple unrelated abstractions.

Consequences¶

Positive¶

Single authoritative implementation of ffprobe invocation, SHA-256 dedup, resumable-download state management, and the JSONL row schema builder — future bug fixes propagate to all corpora automatically.
New corpora require only a parse_manifest_csv function and a ~10-line CorpusIngestBase subclass.
Test coverage of shared logic lives in one place (ai/tests/test_corpus_base.py) rather than being re-tested across six per-corpus test files.

Negative / risks¶

from corpus.base import ... requires PYTHONPATH=ai/src (already the convention in this repo's conftest.py and CI).
The CorpusIngestBase.run() materialises iter_source_rows into a list before the main loop (to support max_rows capping). Very large manifests (KonViD-150k ~150 000 rows) will hold the full parsed list in memory simultaneously with the JSONL output handle. The rows are pure-Python dicts (~200 bytes each); 150 000 rows is ~30 MB — well within acceptable limits.

Alternatives considered¶

Keep the duplication, enforce via linting. A semgrep rule could detect divergence between the six probe_geometry implementations. Ruled out: linting catches drift after the fact; the base-class approach prevents it structurally.

Use a code-generation approach (Jinja templates). Ruled out: adds a build step and loses static analysis on generated code. The object-oriented base-class approach is directly readable and importable.

References¶

req: "Extract a shared CorpusIngestBase class to ai/src/corpus/base.py for the 7 corpus ingestion scripts. Each script duplicates ~200 lines of ffprobe / SHA-index / JSONL-append / argparse boilerplate."
ADR-0325 (KonViD corpus ingestion)
ADR-0333 / ADR-0367 (LSVQ)
ADR-0368 (YouTube UGC)
ADR-0369 (Waterloo IVC 4K-VQA)
ADR-0370 (LIVE-VQC)

ADR-0371 — Shared CorpusIngestBase for MOS-corpus ingestion adapters¶