ADR-0926: Parquet schema v2 — canonical column order, zstd-3, schema metadata
- Status: Accepted
- Date: 2026-05-31
- Deciders: lusoris
- Tags: ai, data, storage, parquet, k150k, chug
Context
The fork's AI training and evaluation pipelines emit parquet files at several points in the stack: K150K feature extraction (extract_k150k_features.py), CHUG sweeps, KonViD-1k full-feature builds, BVI-DVC sweeps, eval-LOSO reports, and metadata enrichment. Corpora at scale (K150K is approximately 150 000 clips, CHUG is hundreds of GB) make storage cost and read-side ergonomics first-class concerns.
Two pain points showed up repeatedly:
- Compression default is snappy. Snappy is fast but its compression ratio is poor on the float / categorical mix typical of VMAF feature tables. On real K150K-shape data with repeating clip ids and per-clip shared MOS, switching to zstd-3 reclaims roughly 20-30 % of disk at comparable read/write CPU. On float-dense synthetic data the win is smaller (about 6 %), but real data is rarely float-dense once metadata and identifiers are included.
- Column order is whatever pandas decided. Downstream readers (the training loop, `validate_norm.py`, eval scripts, the manifest enricher) re-discover "which column holds the score" by name lookup at every site. This is harmless, but it adds boilerplate to every new script.
The fork also has no way to detect what produced a parquet file. A reader cannot tell whether a file came from a v1 ad-hoc df.to_parquet(...) call, from a snappy-compressed bulk extraction, or from a fresh run with the current column conventions — and that ambiguity has blocked at least one in-flight refactor.
Decision
Define a parquet schema v2 that the central aiutils.parquet_utils.write_parquet_atomic helper produces by default, and keep the v1 reader path intact so legacy files remain consumable.
The v2 contract:
- Canonical column order: `[clip_id?, clip_name?, frame_idx?, frame_index?, ...features (alphabetical), ...labels, ...metadata]`. Labels and metadata are identified from a small built-in allowlist (`mos`, `dmos`, `score`, `*_label`, `*_target` → labels; `source`, `split`, `codec`, `*_hash`, ... → metadata) or from explicit `labels=` / `metadata=` lists the caller passes.
- Compression: `zstd` at `compression_level=3` by default. Callers may override via `compression="snappy"` or `compression_level=N`; the override is honoured exactly.
- File-level metadata: pyarrow custom-metadata block carries `vmafx_schema_version=2` (ASCII bytes) and `vmafx_pipeline_hash=<12-char git short SHA>`. The hash is best-effort; an empty value is acceptable in environments without git.
- Reader: a new `read_parquet_with_schema(path)` returns `(dataframe, schema_version)`. Files with no `vmafx_schema_version` key are reported as v1. The simple `pd.read_parquet(...)` path keeps working for both versions.
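The column-ordering rule in the contract can be sketched in plain Python. This is an illustration, not the helper's actual code: the function name `canonical_order` is hypothetical, the metadata patterns cover only the entries the ADR spells out (the full allowlist is elided), and two behaviours are assumptions the ADR does not state: explicit `labels=` / `metadata=` lists replace the heuristics entirely, and labels and metadata keep their input order within their blocks.

```python
from fnmatch import fnmatchcase

ID_COLUMNS = ["clip_id", "clip_name", "frame_idx", "frame_index"]
# Only the patterns named in the ADR; the real allowlist may be longer.
LABEL_PATTERNS = ["mos", "dmos", "score", "*_label", "*_target"]
METADATA_PATTERNS = ["source", "split", "codec", "*_hash"]


def _matches(name, patterns):
    """True when the column name matches any glob-style pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def canonical_order(columns, labels=None, metadata=None):
    """Return column names as ids, features (alphabetical), labels, metadata."""
    cols = list(columns)
    ids = [c for c in ID_COLUMNS if c in cols]
    rest = [c for c in cols if c not in ids]

    def is_label(name):
        # Assumption: an explicit list overrides the built-in patterns.
        if labels is not None:
            return name in labels
        return _matches(name, LABEL_PATTERNS)

    def is_metadata(name):
        if is_label(name):  # a column is never both label and metadata
            return False
        if metadata is not None:
            return name in metadata
        return _matches(name, METADATA_PATTERNS)

    label_cols = [c for c in rest if is_label(c)]
    meta_cols = [c for c in rest if is_metadata(c)]
    features = sorted(c for c in rest if not is_label(c) and not is_metadata(c))
    return ids + features + label_cols + meta_cols


heuristic = canonical_order(
    ["split", "vif_scale0", "mos", "adm2", "clip_id", "content_hash", "frame_idx", "motion"]
)
explicit = canonical_order(
    ["quality", "adm2", "clip_id", "rater"], labels=["quality"], metadata=["rater"]
)
print(heuristic)
print(explicit)
```

With a pandas frame, the result could be applied as `df = df[canonical_order(df.columns)]` before writing.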
Alternatives considered
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| zstd-9 (higher ratio) | Another 5-10 % smaller | 2-4x write CPU; marginal benefit | The default has to stay cheap on the extraction hot path. Cold archives can opt into a higher level. |
| brotli | Best ratio on tabular text | Pyarrow brotli support varies across builds | Pyarrow zstd is the de-facto standard; portability beats marginal ratio. |
| LZ4 | Faster than snappy | Worse ratio than snappy | We want more compression, not less. |
| Schema versioning in a sidecar `.manifest.json` | No format coupling | Sidecar drift is a real failure mode (see the K150K resume-set bug); two files travel together poorly. | Inline pyarrow custom-metadata is a single source of truth and survives `cp` / `rclone copy`. |
| Hard-code a column allowlist (no heuristics) | Predictable | Every new feature requires a code change | Heuristics keep the helper usable for ad-hoc tables; explicit labels= / metadata= is available for callers that need precision. |
| Bump compression default but skip column reorder | Smaller PR | Doesn't address the "find the score column" pain point | The two changes share the same write path; landing them together is one ADR and one rebase note. |
Consequences
- Positive:
  - K150K / CHUG cold storage shrinks roughly 20-30 % on real data (mix of floats plus repeating categorical identifiers); test-synthetic random-float data sees about 6 %.
  - Downstream readers can rely on column 0 being `clip_id` (when present) and the score column being in a predictable block; this opens the door to deleting per-script "find score column" boilerplate in a follow-up.
  - `vmafx_pipeline_hash` makes "which build wrote this?" answerable without sidecar files.
  - `vmafx_schema_version` future-proofs further format changes; v3 can branch on the same key.
- Negative:
  - Files written by older callers (`df.to_parquet(...)` directly) remain v1; the codebase keeps two read paths in mind until the call sites are migrated.
  - Zstd decompression is slightly slower than snappy on large reads; the difference is negligible on PCIe NVMe with modern CPUs but is measurable on cold spinning disk or RAM-bound workers.
- Neutral / follow-ups:
  - Migrate the direct `df.to_parquet(...)` call sites in `ai/scripts/` to `write_parquet_atomic` in a follow-up sweep (tracked separately; out of scope for this ADR to keep the change surface small).
  - Once all writers are migrated, evaluate raising the default zstd level to 5 for archival writes.
References
- Source: req — user direction for modernization #19, 2026-05-31, to "standardize Parquet column ordering + switch snappy to zstd-3 compression + add schema_version metadata".
- Related: `ai/src/aiutils/parquet_utils.py`, `ai/src/aiutils/AGENTS.md`.
- ADR-0221 (changelog fragment pattern).
- ADR-0108 (deep-dive deliverables rule).