K150K extractor crash-restart row-loss RCA (2026-05-30)

Summary

ai/scripts/extract_k150k_features.py reported 152 265 completed clips in its .done checkpoint while the on-disk parquet held only 59 812 rows. This silent deficit of 92 453 rows (roughly 92 K, about 61 % of the checkpointed clips) was discovered on 2026-05-30 during a routine integrity check of the K150K feature corpus.

The root cause is a missing post-condition in the restart no-op branch: when pending == [], the script copies any recovered staging rows into the parquet and returns status=complete-noop without verifying that the resulting row count matches .done. A prior run killed mid-write leaves the parquet truncated and the staging file gone, either because the parquet was only partially flushed when the staging file was unlinked, or because rename(2) was reordered ahead of the data on a filesystem that was never fsync'd. Every subsequent invocation then silently confirms the loss.
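The missing post-condition amounts to a single comparison. The following is a minimal sketch, assuming one parquet row per .done entry; the function name and the message wording are illustrative and are not taken from the script itself.

```python
def check_restart_consistency(done_count: int, parquet_rows: int, recovered: int) -> None:
    """Raise if the checkpoint claims more clips than the parquet can account for.

    Hypothetical helper: the name, signature, and message are assumptions
    made for illustration, not the script's actual code.
    """
    accounted = parquet_rows + recovered
    if done_count > accounted:
        missing = done_count - accounted
        raise RuntimeError(
            f"CONSISTENCY ERROR: .done lists {done_count} clips but the parquet "
            f"holds {parquet_rows} rows (+{recovered} recovered from staging); "
            f"{missing} rows are unaccounted for"
        )
```

With the incident's counts and no recovered staging rows (an assumption, based on the staging file being gone), `check_restart_consistency(152_265, 59_812, 0)` raises and reports 92453 unaccounted rows instead of returning a no-op.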

The fix is three coordinated guards plus an fsync ordering fix.
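The fsync ordering part of the fix can be sketched as follows. This is a minimal illustration that assumes a POSIX filesystem; `fsync_path` and `retire_staging` are hypothetical names, and the body is not the script's actual `_fsync_path` implementation, which this document does not show.

```python
import os
from pathlib import Path


def fsync_path(path: Path) -> None:
    """Flush a file, then its parent directory, to stable storage (POSIX only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # file contents and metadata
    finally:
        os.close(fd)
    # Syncing the parent directory makes the directory entry created by
    # the preceding rename durable as well.
    dir_fd = os.open(path.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def retire_staging(parquet_path: Path, staging_path: Path) -> None:
    """Unlink the staging file only after the parquet has been fsync'd."""
    fsync_path(parquet_path)
    staging_path.unlink(missing_ok=True)
```

The point of the ordering is that the staging file remains the recovery source until the parquet is known to be on disk; a crash between the two steps leaves both files present rather than neither.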

Failure-mode table

#	Symptom	Where	Root cause	Fix
1	Silent JSONDecodeError on staging-tail truncation	`_load_staging_rows` L865–879 (pre-fix)	`continue` on exception, no count surfaced	WARNING with skipped-line count + recovered count
2	Restart no-op branch confirms `.done`-vs-parquet mismatch	`main` L1310–1329 (pre-fix)	No row-count comparison; manifest written `status=complete-noop` regardless	Raise `RuntimeError` when `len(done_set) > parquet_rows + recovered`
3	End-of-run row accounting unchecked	`main` L1405–1408 (pre-fix)	`rows` is built from two sources (recovered + as_completed) with no cardinality assert	Assert `len(rows) == len(recovered_rows) + ok`; raise + preserve staging on mismatch
4	Staging unlinked before parquet is durable	`main` L1408 (pre-fix)	`staging_path.unlink()` runs immediately after `_write_parquet_from_rows` returns; no `fsync` of parquet or parent dir between rename(2) and unlink(2)	New `_fsync_path` helper called before `staging_path.unlink` in both no-op and end-of-run paths
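Row 1 of the table can be illustrated with a loader that counts what it skips. This sketch assumes the staging file is JSON Lines (one JSON object per line), which the JSONDecodeError handling suggests but this document does not state; `load_staging_rows` and the logger name are hypothetical stand-ins for `_load_staging_rows`.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("k150k.staging")  # hypothetical logger name


def load_staging_rows(staging_path: Path) -> tuple[list[dict], int]:
    """Return (recovered rows, number of undecodable lines) from a JSON Lines file."""
    rows: list[dict] = []
    skipped = 0
    if not staging_path.exists():
        return rows, skipped
    with staging_path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                # Typically a tail line truncated by a mid-write kill.
                skipped += 1
    if skipped:
        log.warning(
            "staging %s: skipped %d undecodable line(s), recovered %d row(s)",
            staging_path, skipped, len(rows),
        )
    return rows, skipped
```

Returning the skipped count alongside the rows lets the caller feed both numbers into the later row-count checks instead of relying on the log alone.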

Reproducer

# Synthesize the failure mode
mkdir -p /tmp/k150k-rca
cd /tmp/k150k-rca

# 100-entry .done checkpoint
printf 'clip_%04d.mp4\n' {0..99} > k150k.done

# Parquet with only the first 50 rows (the other 50 are the "lost half");
# needs pandas plus a parquet engine such as pyarrow
python -c "
import pandas as pd
pd.DataFrame([{'clip_name': f'clip_{i:04d}.mp4', 'vmaf': 95.0}
              for i in range(50)]).to_parquet('k150k.parquet')
"

# No staging file is created: this simulates one already cleaned up by the operator.

# Run the script against these files.
# Pre-fix: prints 'nothing to do.' and exits 0 with status=complete-noop
# in the manifest.
# Post-fix: raises RuntimeError with a CONSISTENCY ERROR message naming
# the 50-clip gap and the recovery hint.

The post-fix behaviour is pinned by ai/tests/test_extract_k150k_consistency.py:

pytest ai/tests/test_extract_k150k_consistency.py -v
# 4 passed in 0.25 s

In-flight run guidance

The K150K extraction currently running (PID 307050, 98.3 % complete as of 2026-05-30, ETA ~1.5 h) must complete before this fix lands. The fix changes only the restart path's behaviour. Once the in-flight run finishes its at-end parquet write, the operator can rebase and merge this PR; the new consistency check will then succeed because the freshly completed parquet matches .done.

If the in-flight run also crashes mid-write before the fix lands, the operator will hit the existing silent-loss antipattern one more time. This is acceptable because the next restart attempt after this PR lands will surface the deficit instead of confirming it.

References

Source: req (operator RCA sketch + fix plan, 2026-05-30 13:54 UTC)
ADR: ADR-0862
Related: Research-0135 (parquet write-once-at-end optimisation — preserved)
Test: ai/tests/test_extract_k150k_consistency.py