ADR-0169: ONNX op-allowlist: admit Loop + If with recursive subgraph scan (T6-5)
- Status: Accepted
- Date: 2026-04-25
- Deciders: Lusoris, Claude (Anthropic)
- Tags: tiny-ai, onnx, security, op-allowlist
Context
BACKLOG T6-5 calls for "Op-allowlist expansion (Loop, If with bounded-iteration guard). Unlocks MUSIQ / RAFT / small VLMs. Scan stays rejected." The existing allowlist in core/src/dnn/op_allowlist.c contains pure feed-forward ops (arithmetic, conv, normalisation, activations, dense, dropout, QDQ, structural, constants). Every control-flow op was rejected outright.
The cost of rejecting all control flow is concrete: well-known small-model architectures rely on Loop for iterative refinement (diffusion-style or RAFT optical-flow), and If for conditional branches (resolution-dependent backbones). Without these ops, the allowlist hard-stops anyone wanting to ship those models through the fork's tiny-AI surface.
The risk of admitting Loop and If is also concrete: their subgraphs (Loop.body, If.then_branch, If.else_branch) can contain arbitrary ops. A naive allowlist edit that just adds Loop to the top-level scanner would let an attacker hide forbidden ops inside a Loop body — defeating the whole point of the allowlist.
The fork's existing scanner (onnx_scan.c) walks the ModelProto wire format three levels deep (Model → Graph → Node → op_type). Per ADR D39's bounded-auditable-scope justification for not pulling in libprotobuf-c, it explicitly does not recurse into NodeProto.attribute, which is where the embedded subgraphs live.
Decision
1. Add Loop and If to the C allowlist; keep Scan rejected
op_allowlist.c gains two new entries under a /* control flow */ section. Scan stays off the list — its variant-typed input/output binding (sequence-typed inputs/outputs, axis specifications, scan_input_directions) makes static bound-checking impractical for a wire-format scanner; revisit if a concrete consumer model surfaces.
2. Recursive subgraph scan in the C wire-format scanner
onnx_scan.c gains a scan_attribute helper that walks AttributeProto and, for every AttributeProto.g (single embedded GraphProto, field 6) or AttributeProto.graphs (repeated GraphProto, field 11), recursively invokes scan_graph on the embedded subgraph. scan_node is extended to descend into NodeProto.attribute (field 5) and call scan_attribute on each.
The recursion is depth-bounded by VMAF_DNN_MAX_SUBGRAPH_DEPTH = 8. Real models keep nesting shallow (Loop within If at most). The cap is a defence-in-depth bound against pathological recursion in malformed input — not a feature ceiling.
The scan is structural — it visits every NodeProto and verifies its op_type. It does not care about op_type when deciding to recurse: it always recurses into AttributeProto graph fields. This means the allowlist contract is uniformly applied at every depth: a forbidden op nested anywhere fails the scan regardless of the enclosing op_type.
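The recursion described above can be modelled in a few lines of pure Python. This is a minimal sketch, not the onnx_scan.c implementation: the function names (read_varint, fields, scan_graph, scan_model, ld, node) and the ALLOWED subset are assumptions made for illustration. Only the field numbers (ModelProto.graph = 7, GraphProto.node = 1, NodeProto.op_type = 4, NodeProto.attribute = 5, AttributeProto.g = 6, AttributeProto.graphs = 11) and the depth cap of 8 come from the text and onnx.proto.

```python
# Sketch only: a pure-Python model of the recursive wire-format scan.
MAX_SUBGRAPH_DEPTH = 8  # mirrors VMAF_DNN_MAX_SUBGRAPH_DEPTH
ALLOWED = {"Conv", "Relu", "Add", "Loop", "If"}  # illustrative subset, not the real list


def read_varint(buf, pos):
    """Decode a base-128 varint; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def fields(buf):
    """Yield (field_number, payload) for length-delimited fields; skip scalars."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire = key >> 3, key & 7
        if wire == 0:        # varint
            _, pos = read_varint(buf, pos)
        elif wire == 1:      # fixed64
            pos += 8
        elif wire == 5:      # fixed32
            pos += 4
        elif wire == 2:      # length-delimited
            size, pos = read_varint(buf, pos)
            if pos + size > len(buf):
                raise ValueError("truncated field")
            yield number, buf[pos:pos + size]
            pos += size
        else:
            raise ValueError("unsupported wire type")


def scan_graph(graph, depth=0):
    """Return the first disallowed op_type in a GraphProto, or None."""
    if depth > MAX_SUBGRAPH_DEPTH:
        raise ValueError("subgraph nesting too deep")
    for g_field, node_bytes in fields(graph):
        if g_field != 1:                      # GraphProto.node
            continue
        for n_field, payload in fields(node_bytes):
            if n_field == 4:                  # NodeProto.op_type
                op = payload.decode("utf-8", "replace")
                if op not in ALLOWED:
                    return op
            elif n_field == 5:                # NodeProto.attribute
                for a_field, sub in fields(payload):
                    if a_field in (6, 11):    # AttributeProto.g / .graphs
                        bad = scan_graph(sub, depth + 1)
                        if bad is not None:
                            return bad
    return None


def scan_model(model):
    """Scan the top-level graph of a serialized ModelProto."""
    for m_field, graph in fields(model):
        if m_field == 7:                      # ModelProto.graph
            return scan_graph(graph)
    return None


# Hypothetical encoders, used only to build tiny test messages.
def encode_varint(n):
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)


def ld(number, payload):
    """Encode one length-delimited field."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def node(op, *subgraphs):
    """Encode a GraphProto.node entry; each subgraph becomes an attribute.g."""
    body = ld(4, op.encode())
    for sub in subgraphs:
        body += ld(5, ld(6, sub))
    return ld(1, body)
```

Under these assumptions, scan_model(ld(7, node("Conv") + node("Loop", node("Fake")))) reports "Fake", because the walker descends into every graph-typed attribute without looking at the enclosing op_type. A chain of ten nested If nodes raises ValueError once the depth cap is exceeded.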
3. Mirror the recursion in the Python export-time check
vmaf_train.op_allowlist gains a _collect_op_types helper that walks GraphProto.node and recurses into AttributeProto.g / .graphs whenever the attribute type is GRAPH / GRAPHS. Both check_model and check_graph use it. This keeps the export-time check (catches the issue at vmaf-train export time) and the runtime load-time check (catches the issue at vmaf_dnn_load) in lockstep.
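A helper of this shape might look roughly like the sketch below. To keep it runnable without the onnx package, it uses small stand-in dataclasses (Attribute, Node, Graph) that copy only the field names the walk needs; these stand-ins, the ALLOWED_OPS subset, and the forbidden_ops wrapper are assumptions for illustration, not the fork's actual code. The GRAPH and GRAPHS constants match the AttributeProto.AttributeType values in onnx.proto.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

GRAPH, GRAPHS = 5, 10  # AttributeProto.AttributeType values in onnx.proto


@dataclass
class Attribute:          # stand-in for onnx.AttributeProto
    type: int = 0
    g: Optional["Graph"] = None
    graphs: List["Graph"] = field(default_factory=list)


@dataclass
class Node:               # stand-in for onnx.NodeProto
    op_type: str
    attribute: List[Attribute] = field(default_factory=list)


@dataclass
class Graph:              # stand-in for onnx.GraphProto
    node: List[Node] = field(default_factory=list)


ALLOWED_OPS = frozenset({"Conv", "Relu", "Add", "Loop", "If"})  # illustrative subset


def collect_op_types(graph: Graph) -> Set[str]:
    """Gather every op_type in the graph, including all embedded subgraphs."""
    ops: Set[str] = set()
    for node in graph.node:
        ops.add(node.op_type)
        for attr in node.attribute:
            if attr.type == GRAPH and attr.g is not None:
                ops |= collect_op_types(attr.g)
            elif attr.type == GRAPHS:
                for sub in attr.graphs:
                    ops |= collect_op_types(sub)
    return ops


def forbidden_ops(graph: Graph) -> List[str]:
    """Hypothetical check: sorted list of op types that are not allowlisted."""
    return sorted(collect_op_types(graph) - ALLOWED_OPS)
```

With these stand-ins, a Loop whose body holds a single Relu yields an empty list, while an If whose branches hold FakeOp and Scan yields both names, even though the If wrapper itself is allowed.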
4. Bounded-iteration guard: explicitly deferred
The BACKLOG row called for "bounded-iteration guard" alongside the allowlist expansion. This is deferred to a follow-up PR. Proving that a Loop's trip-count input is statically bounded requires backwards data-flow analysis (the trip-count input must trace to a Constant node with a known int64_data value ≤ some cap), which is non-trivial in a wire-format scanner. The follow-up PR can either:
- (a) Extend onnx_scan.c to enforce that any Loop node's first input traces to a Constant ≤ MAX_LOOP_ITERATIONS, with MAX_LOOP_ITERATIONS a build-time constant (suggested: 65536).
- (b) Add a runtime guard via an ORT custom kernel that wraps Loop with an iteration counter.
- (c) Punt entirely: accept that Loop consumers must self-bound their iteration counts, and document this as a contract.
Path (a) keeps enforcement at the load-time scanner; path (b) moves it to runtime; path (c) trusts the consumer. Tracking the choice as a separate ADR keeps this PR's scope honest.
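To make option (a) concrete, here is a rough sketch of the simplest possible trace: a Loop passes only if its first input (the trip count M) is produced directly by a Constant node in the same graph with a value within the cap. Nodes are plain dicts invented for this example, and the helper names are hypothetical. A real check would also need to handle initializers, pass-through ops such as Identity or Cast, and Loop nodes inside subgraphs, none of which this sketch attempts.

```python
from typing import Dict, List, Optional

MAX_LOOP_ITERATIONS = 65536  # cap suggested in option (a)


def constant_producers(nodes: List[dict]) -> Dict[str, int]:
    """Map output name -> integer value for every stand-in Constant node."""
    table: Dict[str, int] = {}
    for n in nodes:
        if n["op_type"] == "Constant" and isinstance(n.get("value"), int):
            for out in n["outputs"]:
                table[out] = n["value"]
    return table


def first_unbounded_loop(nodes: List[dict]) -> Optional[str]:
    """Return the name of the first Loop whose trip count is not provably bounded."""
    consts = constant_producers(nodes)
    for n in nodes:
        if n["op_type"] != "Loop":
            continue
        # ONNX Loop takes the max trip count M as input 0; an empty name means "no limit".
        trip = n["inputs"][0] if n["inputs"] else ""
        bound = consts.get(trip)
        if bound is None or not 0 <= bound <= MAX_LOOP_ITERATIONS:
            return n.get("name", "<unnamed>")
    return None
```

A Loop fed by a Constant of 16 passes. The same Loop with an empty trip-count input, or with a Constant of 10**9, is reported by name. The one-hop trace is deliberately conservative: anything it cannot prove bounded is rejected.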
Alternatives considered
- Reject Scan explicitly via a deny-list rather than relying on default-deny. Rejected: the allowlist is already a closed list; listing Scan separately as denied would suggest there's a carve-out, which there isn't. Better to document in the comment block above the new "Loop", "If" entries that Scan is intentionally absent.
- Walk AttributeProto subgraph fields (g / graphs) only when the parent op_type is Loop or If. Rejected: other ONNX ops may also embed subgraphs (e.g. SequenceMap). Always recursing into any AttributeProto subgraph field is forward-compatible and never wrong: a non-control-flow op simply has no embedded graph attribute, so the recursion never fires.
- Defer until ORT's OptimizedModel introspection API matures. Rejected: ORT 1.22 still does not expose a stable C API for per-node iteration that includes subgraph descent. Pinning the scanner to wire-format parsing (per ADR D39) is the long-game choice.
- Add a --allow-control-flow CLI flag that gates the relaxation per call site. Rejected: complicates the trust-root model. The allowlist is the contract; either an op is on it or it isn't. Per-call gates create surface area for bypasses.
- Ship the bounded-iteration guard in the same PR. Rejected: the data-flow analysis for Loop.M → Constant doubles the scope and the existing T6-5 sizing was "S". Better to ship the allowlist + recursive-scan now and queue the iteration guard separately.
Consequences
Positive:
- Tiny-AI surface admits Loop + If baselines that were previously blocked. MUSIQ / RAFT / small-VLM-class models become exportable through the fork's pipeline.
- The kind: "filter" enum from ADR-0168 is now compatible with stateful filter models (e.g. recurrent denoisers using Loop).
- Recursive-scan invariant is uniform: a forbidden op cannot hide inside any embedded subgraph at any depth.
- Python and C scanners stay symmetric — the export-time check catches issues before the model ever leaves the trainer.
Negative:
- Without the bounded-iteration guard, a malicious or poorly-written Loop model could enter an unbounded compute loop at runtime. ORT's process-level inference timeout is the only defence today. Operators consuming untrusted models should set an inference timeout via ORT's RunOptions. Tracked as follow-up T6-5b in BACKLOG.
- VMAF_DNN_MAX_SUBGRAPH_DEPTH = 8 is an arbitrary heuristic. If a future legitimate model nests deeper than 8 levels, the cap would need bumping. No model in the fork's current consumer list comes close.
- ABI-additive change to op_allowlist.c only; no header-level surface change for downstream consumers. A few existing tests that asserted "Loop / If are rejected" needed flipping (already updated in this PR).
Tests
Test changes in this PR:
- core/test/dnn/test_op_allowlist.c
  - Renamed test_custom_ops_rejected to keep just the NULL/empty/unknown checks.
  - New test_control_flow_ops_allowed asserts Loop + If accepted, Scan still rejected.
- core/test/dnn/test_onnx_scan.c
  - test_disallowed_op_loop → test_loop_top_level_allowed (flipped expectation).
  - test_disallowed_op_if_after_allowed → test_if_after_allowed_now_accepted (flipped expectation).
  - New test_scan_still_rejected covers Scan rejection.
  - New test_loop_with_allowed_subgraph (Loop body = Conv → ok).
  - New test_loop_with_forbidden_subgraph (Loop body = "Fake" → rejected with first_bad="Fake").
- ai/tests/test_op_allowlist.py
  - test_control_flow_ops_are_not_allowed → test_loop_and_if_now_allowed + test_scan_still_rejected.
  - test_forbidden_op_rejected switched fixture from Loop to Scan.
  - New test_loop_body_with_allowed_op_passes (Loop body = Relu → ok).
  - New test_loop_body_with_forbidden_op_rejected (Loop body = FakeOp → rejected; Loop wrapper itself still allowed).
References
- BACKLOG T6-5 — backlog row.
- Wave 1 roadmap § Op-allowlist expansion.
- ADR D39 — onnx_scan.c bounded-scope rationale (no libprotobuf-c).
- ADR-0020: Tiny-AI four-capabilities surface.
- ADR-0022 — ONNX Runtime as the inference backend.
- ADR-0168 — sister ADR landing the C2/C3 baselines.
- ONNX wire format reference: https://github.com/onnx/onnx/blob/main/onnx/onnx.proto
- req: user popup choice 2026-04-25: "T6-5 op-allowlist expansion (S, Recommended)".