
Research-0118: external-bench wrapper schema validation

  • Status: Active
  • Workstream: external benchmark harness robustness
  • Last updated: 2026-05-14

Question

Should tools/external-bench/compare.py continue trusting wrapper JSON until aggregation, or validate each wrapper payload at the subprocess boundary?

Sources

Findings

The wrapper seam is where schema errors should be reported. Without a validator, a malformed wrapper output reaches aggregate() and fails later as a generic KeyError or TypeError, losing the wrapper name, the output file context, and the exact missing field. The existing main loop already catches RuntimeError from run_wrapper() and skips the bad (competitor, corpus item) pair, so moving schema checks into run_wrapper() fits the existing failure path.
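The late-failure mode described above can be reproduced with a small sketch. The payload shape here is an assumption: only summary.competitor is named in this note, and mean_ms is a hypothetical numeric field chosen for illustration. The aggregate() below is a stand-in, not the harness's real function.

```python
# Hypothetical payload shape: "mean_ms" is an assumed field, not the real contract.
def aggregate(payloads):
    """Stand-in aggregator: average an assumed numeric field across payloads."""
    total = sum(p["summary"]["mean_ms"] for p in payloads)
    return total / len(payloads)


good = {"summary": {"competitor": "alpha", "mean_ms": 12.0}}
bad = {"summary": {"competitor": "beta"}}  # wrapper omitted mean_ms

try:
    aggregate([good, bad])
except KeyError as exc:
    # The error names the missing key but not the wrapper, the output file,
    # or the corpus item that produced the bad payload.
    late_error = repr(exc)
```

By the time the KeyError surfaces, the aggregator holds only a list of dicts, so the message cannot say which invocation was at fault.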

Alternatives considered

Validating only in aggregate() was rejected because aggregation sees only a list of payloads and no longer knows which wrapper invocation produced the bad JSON.

Adding a JSON Schema dependency was rejected for this small fixed contract. A typed in-tree validator keeps the harness dependency-free and is enough to pin required keys, numeric fields, and summary.competitor identity.
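A minimal sketch of what such an in-tree validator might look like follows. The key names in REQUIRED_TOP_LEVEL and NUMERIC_SUMMARY_FIELDS are hypothetical placeholders, since this note does not list the real contract beyond summary.competitor, and the expected_competitor parameter is likewise an assumption about the signature.

```python
from numbers import Real

# Assumed contract for illustration only; the real key names may differ.
REQUIRED_TOP_LEVEL = ("summary", "runs")
NUMERIC_SUMMARY_FIELDS = ("mean_ms", "stddev_ms")


def validate_wrapper_output(payload, expected_competitor):
    """Raise RuntimeError("invalid schema: ...") on the first violation found."""

    def fail(reason):
        raise RuntimeError(f"invalid schema: {reason}")

    if not isinstance(payload, dict):
        fail(f"top level must be an object, got {type(payload).__name__}")
    for key in REQUIRED_TOP_LEVEL:
        if key not in payload:
            fail(f"missing required key {key!r}")
    summary = payload["summary"]
    if not isinstance(summary, dict):
        fail("summary must be an object")
    for field in NUMERIC_SUMMARY_FIELDS:
        if field not in summary:
            fail(f"missing summary.{field}")
        value = summary[field]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            fail(f"summary.{field} must be a number, got {type(value).__name__}")
    actual = summary.get("competitor")
    if actual != expected_competitor:
        fail(f"summary.competitor is {actual!r}, expected {expected_competitor!r}")
    # Unknown keys are never inspected, so extra debug metadata passes through.
    return payload
```

The validator only checks the keys it knows about, which is what lets wrappers attach extra fields without tripping it.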

Decision

Add validate_wrapper_output() and call it immediately after JSON parsing in run_wrapper(). Malformed JSON raises a RuntimeError whose message contains "invalid JSON"; schema violations raise a RuntimeError whose message contains "invalid schema". Extra fields remain allowed so wrappers can carry debug metadata without changing the aggregation contract.
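The boundary check and the existing skip-on-RuntimeError path could fit together roughly as sketched below. Everything here is illustrative: parse_wrapper_stdout() is a hypothetical helper standing in for the parsing step inside run_wrapper(), the validator is a reduced placeholder that checks only summary.competitor, and the outputs dict replaces real subprocess calls.

```python
import json


def validate_wrapper_output(payload, expected_competitor):
    """Placeholder validator: checks only summary.competitor."""
    summary = payload.get("summary") if isinstance(payload, dict) else None
    if not isinstance(summary, dict) or summary.get("competitor") != expected_competitor:
        raise RuntimeError(
            f"invalid schema: summary.competitor must be {expected_competitor!r}"
        )
    return payload


def parse_wrapper_stdout(raw, competitor, item):
    """Parse and validate one wrapper's output, tagging errors with context."""
    context = f"{competitor} on {item}"
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{context}: invalid JSON: {exc}") from exc
    try:
        return validate_wrapper_output(payload, competitor)
    except RuntimeError as exc:
        raise RuntimeError(f"{context}: {exc}") from exc


# Stand-in for the main loop: canned stdout instead of real wrapper subprocesses.
outputs = {
    ("alpha", "doc1"): '{"summary": {"competitor": "alpha"}, "debug": {"pid": 1}}',
    ("beta", "doc1"): '{"summary": ',  # truncated JSON
    ("gamma", "doc1"): '{"summary": {"competitor": "other"}}',  # wrong identity
}
payloads, skipped = [], []
for (competitor, item), raw in outputs.items():
    try:
        payloads.append(parse_wrapper_stdout(raw, competitor, item))
    except RuntimeError as exc:
        skipped.append(str(exc))  # skip this (competitor, corpus item) pair
```

Both failure kinds arrive as RuntimeError, so the loop needs no new except clause, and each message names the wrapper and corpus item that aggregate() would have lost.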