ADR-0140: SIMD DX framework — header macros + scaffolding skill¶
- Status: Accepted
- Date: 2026-04-21
- Deciders: Lusoris, Claude (Anthropic)
- Tags: simd, dx, build, agents
Context¶
After ADR-0138 and ADR-0139, the fork-local SIMD surface has grown large enough that the per-feature boilerplate is visible as a tax on future SIMD work. Three concrete costs surfaced:
- Repeated intrinsic patterns. The bit-exactness invariants from ADR-0138 (single-rounded
float * float→_mm256_cvtps_pdwiden →_mm256_add_pd, no FMA) and ADR-0139 (per-lane scalar double reduction for double-promoted scalar C) are written verbatim in multiple TUs. Every new SIMD kernel that needs the same pattern re-derives it from scratch, risking silent drift. - Repeated dispatch scaffolding. Every new SIMD path requires a matching
_*_set_dispatchsetter, a function-pointer typedef in a_simd.hheader, and CPU-flag plumbing incpu.c. The shape is mechanical but error-prone — missing one step gives a scalar-only dispatch that silently eats the SIMD code at link time. - Repeated test + meson scaffolding. Each new SIMD TU needs a matching entry in
core/src/meson.buildunder the correctis_asm_enabled/ AVX-512 guard, plus acore/test/meson.buildtest target with the rightplatform_specific_cpu_objectsinclusion andhost_machine.cpu_familyfilter.
The inventory audit surfaced nine remaining SIMD gaps (convolve NEON; ssim NEON bit-exactness audit; ssimulacra2 AVX2 + NEON; motion_v2 NEON; vif_statistic AVX-512 + NEON; float_ansnr; moment; luminance_tools; DX-level tooling). Cutting the boilerplate cost before PR #B unlocks them at higher throughput than grinding each one separately.
Two fork constraints shape the design:
- User memory
feedback_simd_dx_scope.md— "SIMD framework" in this fork means fork-internal DX (macros / helpers / codegen), NOT Highway / simde / xsimd adoption. No cross-ISA portability layer. - ADR-0125 + ADR-0138 + ADR-0139 — bit-exactness with the scalar reference is non-negotiable. Any DX macro MUST make the bit-exactness invariant easier to preserve, not easier to lose.
Decision¶
We will ship a two-part SIMD DX framework under PR #A (this ADR):
- A new header
core/src/feature/simd_dx.hcontaining ISA-specific macros for the recurring patterns identified in ADR-0138 / ADR-0139. Macros are per-ISA by name (no cross-ISA abstraction): e.g.SIMD_WIDEN_ADD_F32_F64_AVX2(acc_pd, a_ps, b_ps),SIMD_PER_LANE_SCALAR_DOUBLE_REDUCE_AVX2(block),SIMD_WIDEN_ADD_F32_F64_NEON,SIMD_PER_LANE_SCALAR_DOUBLE_REDUCE_NEON. Compiler-inlined; zero runtime overhead. Each macro is documented with its scalar C equivalent and a link to the governing ADR. - An upgraded
/add-simd-pathskill (see.claude/skills/add-simd-path/SKILL.md) that scaffolds a new SIMD TU (.c+.h), a matching unit-test file, and the meson.build row edits from a short kernel-spec declaration. The scaffolding includes the right copyright header, the dispatch setter boilerplate, the CPU-flag guard, and theplatform_specific_cpu_objectsinclude in the test target.
Both are demonstrated in the same PR on two real kernels:
- Demo 1 — convolve NEON. Uses
SIMD_WIDEN_ADD_F32_F64_NEONthe per-lane-scalar-double pattern to match ADR-0138's bit-exactness invariant on aarch64. Runs under QEMU (qemu-aarch64-static+aarch64-linux-gnu-gcccross toolchain). - Demo 2 — ssim NEON bit-exactness audit. Research-0012 follow-up. Cross-compile + run under QEMU; diff
--precision maxoutput against scalar; if the AVX2/AVX-512-style drift repeats, apply the same per-lane-scalar-double fix using the DX macros.
Neither demo introduces a new metric — both are existing kernels retargeted or verified. The intent is to prove the DX layer on real code before PR #B consumes it at scale.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Header-only macros (no skill upgrade) | Simpler scope; skill stays as-is | Per-feature bootstrap cost (meson / test / dispatch setter / header typedef) still dominates; 5 of 9 inventory gaps are full-new kernels where scaffolding is the bottleneck | Rejected — leaves the per-feature cost on the table |
| Skill-only codegen (no macros) | No new runtime .h surface; no pre-processor indirection | The bit-exactness patterns stay copy-pasted; ADR-0138 / ADR-0139 invariants are easy to lose on next port | Rejected — the riskiest cost (silent bit-exactness drift) stays uncut |
| Both — header macros + skill upgrade (chosen) | Cuts both the per-pattern and per-feature costs; demonstrable on two real kernels in the same PR; each future SIMD PR reuses both layers | ~40% larger PR #A scope; two artefacts to maintain | Decision — per user popup Q1.3 2026-04-21, the combined form is the only shape that unlocks PR #B at full throughput |
| Cross-ISA thin abstraction (Highway / simde / xsimd style) | Write-once-run-everywhere; tiny per-kernel LoC | Adds a portability-layer dependency + review surface; ISA-specific bit-exactness rules (FMA availability, rounding mode, lane ordering) hide inside the abstraction; user memory feedback_simd_dx_scope.md explicitly rules it out | Rejected — contradicts the fork's documented SIMD policy |
| ISA-specific macros without cross-ISA abstraction (chosen for macros) | Each macro's bit-exactness invariant is explicit and audit-able; names include the ISA (no #if __AVX2__ dispatch inside one macro name) | More macro names to remember | Decision — per user popup Q1.3-Q2 2026-04-21 |
Consequences¶
- Positive:
- ADR-0138 / ADR-0139 bit-exactness patterns survive future SIMD ports via the macros' documented scalar equivalents.
- PR #B's four gaps (convolve NEON, ssim NEON audit, ssimulacra2 AVX2+NEON, motion_v2 NEON + vif_statistic AVX-512/NEON) reuse both the macros and the skill scaffolding; expected ~40% LoC reduction on the mechanical parts.
- The
/add-simd-pathskill upgrade means new SIMD kernels start from a known-good scaffold instead of a cold copy-paste from the nearest feature; missing steps (dispatch setter, meson row, test target) are harder to forget. - Negative:
- One more header (
simd_dx.h) to maintain; macro names become part of the fork's internal vocabulary. - The skill's
kernel-specDSL is a small new surface area that needs documentation + examples. - PR #A ships no user-visible speedup — its value is in PR #B and beyond.
- Neutral / follow-ups:
core/src/feature/AGENTS.mdrebase invariant:simd_dx.his fork-local and must not conflict with upstream. On rebase, keep the fork's version.docs/rebase-notes.mdentry: document the DX framework and the skill upgrade so future rebase conflicts onadd-simd-path/SKILL.mdare resolved in favour of the fork's version.CHANGELOG.mdentry under Added.-
Reproducer for PR #A:
# Cross-compile + run NEON audit under QEMU meson setup build-aarch64 \ --cross-file=build-aux/aarch64-linux-gnu.ini \ -Denable_cuda=false -Denable_sycl=false ninja -C build-aarch64 qemu-aarch64-static -L /usr/aarch64-linux-gnu \ build-aarch64/tools/vmaf --cpumask 255 \ --reference REF --distorted DIS --feature float_ssim \ --feature float_ms_ssim --precision max -o /tmp/scalar.xml qemu-aarch64-static ... --cpumask 0 ... -o /tmp/neon.xml diff <(grep -v '<fyi fps' /tmp/scalar.xml) \ <(grep -v '<fyi fps' /tmp/neon.xml) # must be empty
References¶
- Source: user popup (2026-04-21, this session) — PR #B scope selection + DX form selection (
Q1.2-Q2"Both — header + skill",Q1.3-Q2"ISA-specific macros"). - User memory:
feedback_simd_dx_scope.md— rules out Highway / simde / xsimd; fork-internal DX only. - Related ADRs: ADR-0125 — MS-SSIM decimate SIMD bit-exactness contract; ADR-0138 — convolve widen-then-add pattern; ADR-0139 — per-lane scalar double reduction pattern; ADR-0108 — six deliverables apply to PR #A.
Status update 2026-05-08: Accepted¶
Audited as part of the 2026-05-08 ADR Proposed sweep (Research-0086).
Acceptance criteria verified in tree at HEAD 0a8b539e:
core/src/feature/simd_dx.h— present./add-simd-pathskill carries the kernel-spec flags documented in.claude/skills/add-simd-path/SKILL.md.- Demo kernels (convolve NEON, ssim NEON bit-exactness audit) are tracked under follow-up ADRs (ADR-0145 motion_v2 NEON, ADR-0159 / 0160 psnr_hvs AVX2 / NEON, ADR-0161 / 0162 / 0163 ssimulacra2 SIMD), all of which cite the framework.
- Verification command:
ls core/src/feature/simd_dx.h.