ADR-0371 — Shared CorpusIngestBase for MOS-corpus ingestion adapters¶
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-10 |
| Scope | ai, corpus, refactor, fork-local |
| PR | (this PR) |
Context¶
Six MOS-corpus ingestion scripts (KonViD-1k, KonViD-150k, LSVQ, LIVE-VQC, Waterloo IVC 4K-VQA, YouTube UGC) each duplicated approximately 200 lines of identical boilerplate:
_sha256_file— chunked SHA-256 with 1 MiB reads_utc_now_iso— second-precision ISO-8601 UTC timestampprobe_geometry— ffprobe JSON geometry extractor with identical command construction, stream selection, and error handling_pick— case-insensitive CSV column picker (identical across all six files)_parse_framerate— rationala/bffprobe framerate parserload_progress/save_progress/mark_done/mark_failed/should_attempt— resumable-download progress state (four scripts)download_clip— curl-backed per-clip downloader (four scripts)_read_existing_sha_index— JSONL dedup readerRunStats— aggregate counters class (four scripts)build_row— JSONL row assembler (identical output schema across all)
Total duplicated lines: ~1 200 across the six files (average ~200 per script), in addition to ~1 600 lines of corpus-specific CSV parsing and CLI that are legitimately different.
The user requested consolidation to reduce maintenance surface and prevent future divergence as new corpora are added.
Decision¶
Extract the shared boilerplate into ai/src/corpus/base.py as a package-level module exporting:
- Free functions:
sha256_file,utc_now_iso,pick,normalise_clip_name,probe_geometry,load_progress,save_progress,mark_done,mark_failed,should_attempt,download_clip,read_sha_index RunStatsclass (aggregate counters)CorpusIngestBaseABC with:__init__accepting all shared configuration (corpus_dir, output, ffprobe_bin, curl_bin, corpus_version, runner seam, etc.)- Abstract
iter_source_rows(clips_dir)that subclasses override run()orchestrator implementing the probe-SHA-write-dedup loop
Each of the six scripts is refactored to a ~80-150 LOC file containing only:
- Corpus-specific constants and CSV column aliases
- A
parse_manifest_csvfunction (the only legitimately different logic per script) - A
CorpusIngestBasesubclass withiter_source_rowsimplemented - A backward-compatible module-level
run()function preserving the existing test and caller API - The
argparseCLI
bvi_dvc_to_corpus_jsonl.py is intentionally excluded: it wraps the vmaftune.CORPUS_ROW_KEYS schema (a full-reference encode-quality corpus), not the MOS-ingest schema; conflating the two would couple unrelated abstractions.
Consequences¶
Positive¶
- Single authoritative implementation of ffprobe invocation, SHA-256 dedup, resumable-download state management, and the JSONL row schema builder — future bug fixes propagate to all corpora automatically.
- New corpora require only a
parse_manifest_csvfunction and a ~10-lineCorpusIngestBasesubclass. - Test coverage of shared logic lives in one place (
ai/tests/test_corpus_base.py) rather than being re-tested across six per-corpus test files.
Negative / risks¶
from corpus.base import ...requiresPYTHONPATH=ai/src(already the convention in this repo'sconftest.pyand CI).- The
CorpusIngestBase.run()materialisesiter_source_rowsinto a list before the main loop (to supportmax_rowscapping). Very large manifests (KonViD-150k ~150 000 rows) will hold the full parsed list in memory simultaneously with the JSONL output handle. The rows are pure-Python dicts (~200 bytes each); 150 000 rows is ~30 MB — well within acceptable limits.
Alternatives considered¶
Keep the duplication, enforce via linting. A semgrep rule could detect divergence between the six probe_geometry implementations. Ruled out: linting catches drift after the fact; the base-class approach prevents it structurally.
Use a code-generation approach (Jinja templates). Ruled out: adds a build step and loses static analysis on generated code. The object-oriented base-class approach is directly readable and importable.
References¶
- req: "Extract a shared CorpusIngestBase class to ai/src/corpus/base.py for the 7 corpus ingestion scripts. Each script duplicates ~200 lines of ffprobe / SHA-index / JSONL-append / argparse boilerplate."
- ADR-0325 (KonViD corpus ingestion)
- ADR-0333 / ADR-0367 (LSVQ)
- ADR-0368 (YouTube UGC)
- ADR-0369 (Waterloo IVC 4K-VQA)
- ADR-0370 (LIVE-VQC)