Research-0118: external-bench wrapper schema validation
- Status: Active
- Workstream: external benchmark harness robustness
- Last updated: 2026-05-14
Question
Should tools/external-bench/compare.py continue trusting wrapper JSON until aggregation, or validate each wrapper payload at the subprocess boundary?
Sources
- tools/external-bench/compare.py parsed wrapper JSON and returned it directly to aggregation.
- tools/external-bench/README.md documents the wrapper schema as the licence-boundary contract.
- tools/external-bench/AGENTS.md marks that output schema as the load-bearing contract between wrapper scripts and compare.py.
Findings
The wrapper seam is where schema errors should be reported. Without a validator, a malformed wrapper output reaches aggregate() and fails later as a generic KeyError or TypeError, losing the wrapper name, the output file context, and the exact missing field. The existing main loop already catches RuntimeError from run_wrapper() and skips the bad (competitor, corpus item) pair, so moving schema checks into run_wrapper() fits the existing failure path.
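To make the diagnostic loss concrete, here is a minimal sketch of a late failure. The body of aggregate() and the mean_ms field are hypothetical stand-ins, not taken from compare.py; only summary.competitor is named in this note.

```python
# Hypothetical aggregation step. The mean_ms field is an assumption made
# for illustration; only summary.competitor is named in this note.
def aggregate(payloads):
    totals = {}
    for payload in payloads:
        summary = payload["summary"]
        name = summary["competitor"]
        # A payload missing the numeric field fails here, far from the wrapper.
        totals[name] = totals.get(name, 0.0) + summary["mean_ms"]
    return totals


good = {"summary": {"competitor": "tool-a", "mean_ms": 12.5}}
bad = {"summary": {"competitor": "tool-b"}}  # wrapper omitted mean_ms

try:
    aggregate([good, bad])
except KeyError as exc:
    # The exception names the missing key, but not the wrapper invocation
    # or the output file that produced the payload.
    late_error = repr(exc)
```

In this sketch the only information left in late_error is the key name, which is the situation the finding describes.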
Alternatives considered
Validating only in aggregate() was rejected because aggregation sees only a list of payloads and no longer knows which wrapper invocation produced the bad JSON.
Adding a JSON Schema dependency was rejected for this small fixed contract. A typed in-tree validator keeps the harness dependency-free and is enough to pin required keys, numeric fields, and summary.competitor identity.
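A dependency-free validator of this kind might look like the sketch below. The function name validate_payload, the mean_ms and runs fields, and the choice of ValueError are all assumptions for illustration; the note itself only fixes the three kinds of check (required keys, numeric fields, and summary.competitor identity).

```python
from numbers import Real

# Hypothetical contract: only summary.competitor appears in this note,
# the other field names are placeholders.
REQUIRED_SUMMARY_KEYS = ("competitor", "mean_ms", "runs")
NUMERIC_SUMMARY_KEYS = ("mean_ms", "runs")


def validate_payload(payload, expected_competitor):
    """Raise ValueError describing the first schema violation found."""
    if not isinstance(payload, dict):
        raise ValueError("top-level value must be an object")
    summary = payload.get("summary")
    if not isinstance(summary, dict):
        raise ValueError("missing or non-object field: summary")
    for key in REQUIRED_SUMMARY_KEYS:
        if key not in summary:
            raise ValueError(f"missing required field: summary.{key}")
    for key in NUMERIC_SUMMARY_KEYS:
        value = summary[key]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise ValueError(f"field must be numeric: summary.{key}")
    if summary["competitor"] != expected_competitor:
        raise ValueError(
            f"summary.competitor is {summary['competitor']!r}, "
            f"expected {expected_competitor!r}"
        )
    # Keys outside the required set are never inspected, so extra
    # fields pass through untouched.
    return payload
```

Because the checks only look at the keys they need, a wrapper can add fields without tripping the validator, and the whole contract stays readable in a few dozen lines.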
Decision
Add validate_wrapper_output() and call it immediately after JSON parsing in run_wrapper(). Malformed JSON raises a RuntimeError reporting "invalid JSON"; schema violations raise a RuntimeError reporting "invalid schema". Extra fields remain allowed so wrappers can carry debug metadata without changing the aggregation contract.
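The boundary behaviour can be sketched as follows. This is an illustration under stated assumptions, not the code in compare.py: parse_wrapper_stdout is a hypothetical helper standing in for the parsing half of run_wrapper(), the validator body is a reduced stand-in that checks only summary.competitor, and the exact message layout is a guess. Wrapper output is simulated with strings instead of a subprocess.

```python
import json


def validate_wrapper_output(payload, expected_competitor):
    # Reduced stand-in: checks only summary.competitor. The full contract
    # is not reproduced here.
    summary = payload.get("summary") if isinstance(payload, dict) else None
    if not isinstance(summary, dict) or "competitor" not in summary:
        raise ValueError("missing required field: summary.competitor")
    if summary["competitor"] != expected_competitor:
        raise ValueError("summary.competitor does not match the wrapper")


def parse_wrapper_stdout(raw, competitor, item):
    """Hypothetical helper for the parsing half of run_wrapper()."""
    context = f"{competitor} on {item}"
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{context}: invalid JSON: {exc}") from exc
    try:
        validate_wrapper_output(payload, competitor)
    except ValueError as exc:
        raise RuntimeError(f"{context}: invalid schema: {exc}") from exc
    return payload


# Simulated wrapper outputs keyed by (competitor, corpus item).
outputs = {
    ("tool-a", "item-1"): '{"summary": {"competitor": "tool-a"}, "debug": {"pid": 1}}',
    ("tool-b", "item-1"): '{"summary": {}}',
    ("tool-c", "item-1"): "not json",
}

payloads, skipped = [], []
for (competitor, item), raw in outputs.items():
    try:
        payloads.append(parse_wrapper_stdout(raw, competitor, item))
    except RuntimeError as exc:
        # Mirrors the existing failure path: the bad pair is skipped.
        skipped.append(str(exc))
```

In this sketch the payload with an extra debug field is accepted, while the two bad outputs are skipped with messages that still carry the competitor and corpus item.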