K150K extractor crash-restart row-loss RCA (2026-05-30)
Summary
ai/scripts/extract_k150k_features.py reported 152 265 completed clips in its .done checkpoint while the on-disk parquet held only 59 812 rows — a silent loss of approximately 92 K rows discovered on 2026-05-30 during a routine integrity check of the K150K feature corpus.
The root cause is a missing post-condition in the restart no-op branch: when pending == [] the script copies any recovered staging rows into the parquet and returns status=complete-noop without ever verifying that the resulting row count matches .done. A prior run killed mid-write (parquet partially flushed but staging file already unlinked, or rename(2) reordered ahead of data on a non-fsync'd filesystem) leaves the parquet truncated and the staging file gone; every subsequent invocation silently confirms the loss.
The fix is three coordinated guards plus an fsync ordering fix.
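The restart-path guards can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `load_staging_rows` and `check_noop_consistency` are hypothetical, and the staging file is assumed to be line-delimited JSON (which the `JSONDecodeError` failure mode suggests but does not confirm).

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("k150k.sketch")


def load_staging_rows(staging_path: Path) -> list[dict]:
    """Read staging rows, counting undecodable lines instead of hiding them.

    Assumption: one JSON object per line.
    """
    rows: list[dict] = []
    skipped = 0
    if not staging_path.exists():
        return rows
    with staging_path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1  # typically a truncated tail after a kill
    if skipped:
        log.warning(
            "staging: skipped %d undecodable line(s), recovered %d row(s)",
            skipped,
            len(rows),
        )
    return rows


def check_noop_consistency(done_set: set[str], parquet_rows: int, recovered: int) -> None:
    """Raise if .done lists more clips than parquet plus staging can account for."""
    deficit = len(done_set) - (parquet_rows + recovered)
    if deficit > 0:
        raise RuntimeError(
            f"CONSISTENCY ERROR: .done lists {len(done_set)} clips but only "
            f"{parquet_rows + recovered} rows are recoverable ({deficit} missing)"
        )
```

With the reproducer's numbers (100 entries in `.done`, 50 parquet rows, no staging file), `check_noop_consistency` raises and reports a 50-clip gap rather than returning quietly.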
Failure-mode table
| # | Symptom | Where | Root cause | Fix |
|---|---|---|---|---|
| 1 | Silent JSONDecodeError on staging-tail truncation | _load_staging_rows L865–879 (pre-fix) | continue on exception, no count surfaced | WARNING with skipped-line count + recovered count |
| 2 | Restart no-op branch confirms .done-vs-parquet mismatch | main L1310–1329 (pre-fix) | No row-count comparison; manifest written status=complete-noop regardless | Raise RuntimeError when len(done_set) > parquet_rows + recovered |
| 3 | End-of-run row accounting unchecked | main L1405–1408 (pre-fix) | rows is built from two sources (recovered + as_completed) with no cardinality assert | Assert len(rows) == len(recovered_rows) + ok; raise + preserve staging on mismatch |
| 4 | Staging unlinked before parquet is durable | main L1408 (pre-fix) | staging_path.unlink() runs immediately after _write_parquet_from_rows returns; no fsync of parquet or parent dir between rename(2) and unlink(2) | New _fsync_path helper called before staging_path.unlink in both no-op and end-of-run paths |
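Rows 3 and 4 together describe the end-of-run ordering: check the row accounting, write the parquet, make it durable, and only then remove staging. A minimal sketch of that ordering follows, assuming a POSIX system where a directory can be opened read-only and passed to `fsync`. The names `fsync_path` and `finalize_run` and the `write_parquet` callable are hypothetical stand-ins, not the script's real `_fsync_path` or `_write_parquet_from_rows`.

```python
import os
from pathlib import Path
from typing import Callable


def fsync_path(path: Path) -> None:
    """fsync a file or directory so its data or directory entries reach disk.

    Assumption: POSIX semantics; opening a directory this way fails on Windows.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def finalize_run(
    rows: list[dict],
    recovered_rows: list[dict],
    ok: int,
    parquet_path: Path,
    staging_path: Path,
    write_parquet: Callable[[list[dict], Path], None],
) -> None:
    """Write the parquet durably, then (and only then) drop the staging file."""
    expected = len(recovered_rows) + ok
    if len(rows) != expected:
        # Staging is deliberately left in place so the rows stay recoverable.
        raise RuntimeError(
            f"row accounting mismatch: have {len(rows)} rows, expected {expected}"
        )
    write_parquet(rows, parquet_path)
    fsync_path(parquet_path)         # file contents
    fsync_path(parquet_path.parent)  # directory entry created by the rename
    staging_path.unlink(missing_ok=True)
```

The point of the ordering is that a crash at any step before the final `unlink` still leaves the staging file on disk, so a restart can recover from it.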
Reproducer
# Synthesize the failure mode
mkdir -p /tmp/k150k-rca
cd /tmp/k150k-rca
# 100-entry .done checkpoint
printf 'clip_%04d.mp4\n' {0..99} > k150k.done
# Parquet with 50 rows (the "lost half")
python -c "
import pandas as pd
pd.DataFrame([{'clip_name': f'clip_{i:04d}.mp4', 'vmaf': 95.0}
              for i in range(50)]).to_parquet('k150k.parquet')
"
# No staging file — operator-cleaned
# Run the script (pre-fix: prints 'nothing to do.' and exits 0
# with status=complete-noop in the manifest)
# Run the script (post-fix: raises RuntimeError with CONSISTENCY ERROR
# message naming the 50-clip gap and the recovery hint)
The post-fix behaviour is pinned by ai/tests/test_extract_k150k_consistency.py.
In-flight run guidance
The K150K extraction currently running (PID 307050, 98.3 % complete as of 2026-05-30, ETA ~1.5 h) must complete before this fix lands. The fix changes only the restart path's behaviour — once the in-flight run finishes its at-end parquet write, the operator can rebase + merge this PR; the new consistency check will succeed because the freshly completed parquet matches .done.
If the in-flight run also crashes mid-write before the fix lands, the operator will hit the existing silent-loss antipattern one more time. Acceptable: the next restart attempt after this PR lands will surface the deficit instead of confirming it.
References
- Source: req (operator RCA sketch + fix plan, 2026-05-30 13:54 UTC)
- ADR: ADR-0862
- Related: Research-0135 (parquet write-once-at-end optimisation, preserved)
- Test: ai/tests/test_extract_k150k_consistency.py