ADR-1097: Atomic file writes for AI-script cache and output files
- Status: Accepted
- Date: 2026-06-07
- Deciders: Lusoris
- Tags: ai, correctness, reliability
Context
Several AI pipeline scripts write intermediate cache files (per-clip JSON) and final output files (Parquet, JSONL) using non-atomic Path.write_text() or df.to_parquet() calls. If the process is interrupted mid-write (OOM kill, SIGKILL, power loss, disk full) the destination file is left partially truncated. On the next resume attempt the file exists with a non-zero size, so the resume logic treats it as a valid cached result and either crashes on a json.loads parse error or silently produces corrupt output rows.
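The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not code from the affected scripts: naive_is_cached and simulate_interrupted_write are hypothetical helpers, and the interruption is simulated by writing only a prefix of the serialized payload.

```python
import json
import tempfile
from pathlib import Path


def naive_is_cached(path: Path) -> bool:
    """Hypothetical resume check: trusts any existing, non-empty file."""
    return path.exists() and path.stat().st_size > 0


def simulate_interrupted_write(path: Path, payload: dict, keep_bytes: int) -> None:
    """Write only the first keep_bytes bytes, as if the process died mid-write."""
    data = json.dumps(payload).encode("utf-8")
    path.write_bytes(data[:keep_bytes])


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = Path(tmp_dir) / "clip_0001.json"
    simulate_interrupted_write(cache_path, {"vmaf": 93.4, "frames": 250}, keep_bytes=10)

    # The resume check accepts the file because it exists and is non-empty...
    looks_cached = naive_is_cached(cache_path)

    # ...but the content is not valid JSON, so the reader fails later.
    try:
        json.loads(cache_path.read_text(encoding="utf-8"))
        parse_ok = True
    except json.JSONDecodeError:
        parse_ok = False
```

Here the existence check passes while parsing fails, which is the mismatch that poisons a cache entry until an operator deletes the file by hand.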
The specific patterns observed:
- chug_extract_features.py: cache_path.write_text(...) for the per-pair feature JSON (line 653) and visual_cache_path.write_text(...) for the visual-signal JSON (line 669). Both files are tested for existence at line 611 before skipping re-extraction; a partial write permanently poisons the cache entry.
- bvi_dvc_to_full_features.py: two (cache_dir / key.json).write_text(...) calls (lines 308, 358) plus a non-atomic df.to_parquet(out_path) at the final write (line 602).
- konvid_to_full_features.py: cache_path.write_text(vmaf_json.read_text()) (line 274) and three df.to_parquet(...) calls in _write_outputs (lines 297, 307, 315).
- extract_full_features.py: cache_path.write_text(json.dumps(payload)) (line 100) and df.to_parquet(args.out) (line 232).
- aggregate_corpora.py: output JSONL opened as "w" before any rows are written; a crash during iteration leaves a truncated file.
- write_manifest_json in aiutils/run_manifest.py: path.write_text(...) used by every script's provenance manifest; a crash during the write corrupts the manifest sidecar.
extract_k150k_features.py already has a correct atomic-parquet pattern (_write_parquet_from_rows using .tmp + tmp.rename); the fix brings the remaining scripts to the same standard.
Decision
We will add write_text_atomic(path, text) to aiutils/file_utils.py, implement it as tempfile.mkstemp(dir=path.parent) + write + os.replace, and use it in all AI-script cache and output write sites that previously used Path.write_text() directly. Final Parquet outputs in bvi_dvc_to_full_features, konvid_to_full_features, and extract_full_features are migrated to write_parquet_atomic (which already implements the tmp-rename pattern). write_manifest_json in aiutils/run_manifest.py is also made atomic via the same helper. The aggregate_corpora.py JSONL write is wrapped with tempfile.mkstemp + os.replace inline.
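The helper named in the decision could look roughly like the sketch below. It follows the stated recipe (tempfile.mkstemp(dir=path.parent), write, os.replace). The encoding parameter, the temp-file prefix and suffix, the fsync call, and the cleanup on failure are assumptions added for illustration; the actual aiutils/file_utils.py implementation may differ.

```python
import os
import tempfile
from pathlib import Path


def write_text_atomic(path: Path, text: str, encoding: str = "utf-8") -> None:
    """Write text to path so readers see either the old file or the new one."""
    path = Path(path)
    # Create the temp file next to the destination so os.replace stays on one
    # filesystem and is therefore an atomic rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(text)
            handle.flush()
            # Assumption: flush to disk before the rename so a power loss
            # cannot expose an empty renamed file. Not stated in the ADR.
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Best-effort cleanup of the staging file; the destination is untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "pair_0001.json"
    write_text_atomic(target, '{"status": "old"}')
    write_text_atomic(target, '{"status": "new"}')
    final_text = target.read_text(encoding="utf-8")
    leftovers = sorted(p.name for p in Path(tmp_dir).iterdir())
```

One design point to keep in mind: mkstemp creates the file readable and writable only by the creating user, and os.replace carries those permissions over to the destination. If other users need to read the outputs, the helper would need an explicit os.chmod before the rename.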
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Catch json.JSONDecodeError in cache-read and delete corrupt entry | Recovers from corruption at read time | Does not prevent corruption; adds logic at every read site | Treats a symptom, not the cause |
| Use fcntl.flock to mark files in-progress | Works on most Linux filesystems | Complex locking semantics; still does not prevent partial writes on kill | Over-engineered for a single-writer pipeline |
| Keep existing behaviour and rely on operator to delete corrupt files | Zero code change | Silently corrupts extraction runs; production impact | Unacceptable data-quality risk |
Consequences
- Positive: cache-file corruption on interrupt is eliminated; resume after kill/OOM is safe; all final outputs are either old-or-new, never partial.
- Negative: requires path.parent to be on the same filesystem as the temp file (guaranteed by dir=path.parent in mkstemp); a very small extra inode allocation per write.
- Neutral / follow-ups: write_text_atomic exported from aiutils so future scripts can adopt it without copying the pattern.
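The old-or-new property can be illustrated for a streamed JSONL output like the one in aggregate_corpora.py. This is a sketch under stated assumptions: write_jsonl_atomic and failing_rows are hypothetical names, the ADR applies the pattern inline rather than through a helper, and the crash is simulated with an exception rather than a real kill.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl_atomic(path: Path, rows: Iterable[dict]) -> None:
    """Stream rows to a same-directory temp file, then rename over path."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        # Only reached if every row was written successfully.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def failing_rows():
    """Hypothetical row source that fails partway through iteration."""
    yield {"id": 1}
    raise RuntimeError("simulated crash during iteration")


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "corpus.jsonl"
    write_jsonl_atomic(out_path, [{"id": 0}])  # first, successful run
    try:
        write_jsonl_atomic(out_path, failing_rows())  # second run fails
    except RuntimeError:
        pass
    surviving = out_path.read_text(encoding="utf-8")
    remaining_files = sorted(p.name for p in Path(tmp_dir).iterdir())
```

After the failed second run the destination still holds the complete output of the first run. A real SIGKILL would skip the cleanup branch and could leave a stray .tmp file in the directory, but the destination would be equally intact.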
References
- extract_k150k_features.py existing atomic pattern: _write_parquet_from_rows (same-directory tmp + rename, plus _fsync_path before staging unlink).
- aiutils/parquet_utils.py:write_parquet_atomic: already implements atomic rename for parquet outputs.
- req: "ai-script-resume-correctness: checkpoint/resume correctness; partial-write file corruption protection; atomic file operations for state files"