ADR-0862: K150K extractor — .done vs parquet consistency check on restart¶
- Status: Accepted
- Date: 2026-05-30
- Deciders: lusoris, Claude (Anthropic)
- Tags:
ai,pipeline,durability,k150k
Context¶
ai/scripts/extract_k150k_features.py extracts VMAF-CPU features for the 152 265-clip K150K corpus and persists them as a single parquet file. Three durability mechanisms have grown up around it:
- A
<out>.donecheckpoint — JSONL of completed clip names, appended one line at a time, fsync-durable per_append_done. - A
<out>.rows.jsonlstaging file — append-only buffer of completed rows, written one line at a time so a crash mid-run does not lose already-extracted features. - The final
<out>.parquet— written exactly once at the end ofmain()via_write_parquet_from_rows(rename(2)-then-unlink), then the staging file is unlinked.
The 2026-05-30 RCA for Bug-3 found that the .done checkpoint had grown to 152 265 entries while the on-disk parquet held only 59 812 rows — a silent loss of ~92 K rows. The mechanism:
- A prior K150K run was killed (OOM / operator interrupt) after the parquet write at L1406 completed but before the staging-file
unlink(missing_ok=True)at L1408 — or, more likely, after a partial parquet rename(2) into place but before fsync, on a filesystem that reordered the rename ahead of the data writes. - The next invocation observed
pending == [](because.donealready covered every clip in--clips-dir) and fell into the no-op early-exit branch at L1310–L1329. - That branch unconditionally wrote a
status=complete-noopmanifest and returned 0. It never compared.donecount to parquet row count. - The operator saw "nothing to do, complete" and trusted the (truncated) parquet downstream.
The same branch will be entered every time the script is re-run, because the gap between .done and the parquet is permanent unless somebody truncates .done or re-extracts. The silent-confirmation property meant the failure was discoverable only by counting parquet rows by hand.
Decision¶
We will treat the .done checkpoint as the authoritative ledger of which clips have been processed, and require the no-op early-exit branch to fail closed (RuntimeError) when the parquet row count plus any recovered staging rows is less than len(done_set). The end-of-run write path gets a matching row-count sanity check (recovered + ok must equal the final row list length) and an explicit fsync of the parquet file and its parent directory before the staging file is unlinked.
The four concrete changes shipped together:
_load_staging_rowssurfaces a malformed-line count — replaces the silentJSONDecodeError: continuewith a stderr WARNING reporting skip count and recovered count. A truncated staging tail is a leading indicator of the same crash class.- No-op branch consistency check — at L1310, after recovering any staging rows, compare
len(done_set)against parquet row count (plus recovered if not already merged). On mismatch, raiseRuntimeErrornaming the missing-clip count and the recovery hint ("remove the affected entries from.doneor delete it to re-extract everything"). - End-of-run row accounting assert — before writing the parquet at L1406, verify
len(rows) == len(recovered_rows) + ok. On mismatch, raiseRuntimeErrorand preserve the staging file for forensic recovery instead of letting the broken parquet land. - fsync before unlink — call
_fsync_path(args.out)after the parquet rename and before the staging unlink, both in the no-op branch and in the end-of-run write path. The helper fsyncs the file then the parent directory so the rename(2) is durable before the companion unlink can race ahead of it on power loss.
Operator workflow on the existing 92 K-row deficit: delete (or trim) the .done file's missing entries and re-run; the new no-op-branch guard will refuse to silently confirm any future mismatch.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Status quo — no consistency check, rely on operator vigilance | Zero code | Already failed in production (~92 K rows lost without detection); operator vigilance is exactly what failed | Lossy-silent is unacceptable for a multi-week extraction job |
| Periodic parquet writes (every N clips) instead of write-once-at-end | Smaller blast radius per crash | Reverts Research-0135 Win 1 (pandas DataFrame allocation dominates wall-time when re-run repeatedly); doesn't fix the no-op branch trust gap | The staging-JSONL design is correct; the bug is the missing post-condition check |
| Replace JSONL staging with sqlite WAL | True transactional durability per row; row count is queryable in O(1) | New dependency in the AI script tree; sqlite locking interacts badly with the ProcessPoolExecutor row append pattern; large schema-migration cost | Costs are outsized for a one-shot per-corpus extraction; existing JSONL is fine with a consistency check |
Auto-truncate .done to match parquet on mismatch | Re-runnable without operator intervention | Hides the loss; rewards the same antipattern that caused the bug | Loud-fail is the explicit goal — the operator must know rows were lost |
| Add fsync only (skip consistency check) | One-liner; covers future crashes | Doesn't detect or recover from the existing 92 K-row deficit; future operator silence still possible if fsync isn't reached | Both are needed: fsync prevents the next loss, consistency check catches the loss that already happened |
Consequences¶
- Positive:
- Silent row loss is now structurally impossible on restart — the no-op branch refuses to confirm a mismatched corpus.
- End-of-run accounting catches in-process row-bookkeeping bugs (the
recovered_rowsmerge logic is non-trivial; the assert pins its invariant). - fsync before unlink prevents the next instance of the same crash class, even on filesystems that reorder rename ahead of data writes (ext4 with
data=writeback, NFS withoutsyncmount option). -
Malformed-staging WARNING surfaces leading indicators an operator can act on (e.g. an in-progress run was killed during the final JSONL line write).
-
Negative:
- The no-op branch is now slower by one
pd.read_parquet(columns=["clip_name"])pass on every restart. For 152 K rows the cost is ~1–2 s on local NVMe — negligible against a multi-week extraction wall-time. - The existing 92 K-row deficit needs operator action (truncate
.doneand re-extract) before the script is runnable on the current corpus; the script will refuse to silently no-op until that's done. This is intentional — silent no-op is the antipattern. -
_fsync_pathis best-effort: it catchesOSErrorand warns, so on tmpfs / FUSE / overlayfs it degrades gracefully but the durability guarantee weakens. Production K150K runs on local ext4 where fsync is real. -
Neutral / follow-ups:
- Unit test added:
ai/tests/test_extract_k150k_consistency.py— four cases (consistency-error raise, negative control, malformed-line WARNING, fsync no-raise on missing). - Operator runbook update: when restarting after a crash on the current K150K corpus, expect the
RuntimeErrorand follow the recovery hint in the error message. - The K150K extraction currently in flight (PID 307050, 98.3 % complete as of 2026-05-30) must complete before this fix lands. Once landed, the next operator-initiated restart will surface the historical deficit.
References¶
- Bug-3 RCA: 2026-05-30 investigation by user;
.donecount 152 265 vs parquet 59 812 rows on shared K150K parquet output. - Source:
req(operator-provided RCA sketch and fix plan, 2026-05-30 13:54 UTC) — paraphrased: "surface JSONDecodeError counts, add a.done-vs-parquet consistency check on restart, sanity-check at end-of-run, and don't unlink staging before parquet is fsync'd." - Related: Research-0135 (parquet write-once-at-end optimisation — preserved; this ADR adds a durability gate on top of it, not in place of it).
- Related ADRs: ADR-0042 (tiny-AI docs bar; the operator-facing failure mode is documented in
docs/ai/k150k-extraction.mdas part of the same PR).