ADR-0197: float_vif GPU kernels — 4-scale pyramid with mirror-asymmetry fix¶
- Status: Accepted
- Date: 2026-04-27
- Deciders: Lusoris, Claude (Anthropic)
- Tags: vulkan, cuda, sycl, gpu, feature-extractor, fork-local, places-4
Context¶
ADR-0192 lists float_vif as the third Group B float twin. CPU reference: float_vif.c + vif.c::compute_vif + vif_tools.c.
The algorithm: 4-scale Gaussian pyramid with separable filters (widths {17, 9, 5, 3} for the production default vif_kernelscale = 1.0), 2x decimation between scales (vif_dec2_s), per-pixel vif_stat_one_pixel computing (num, den) from (mu1, mu2, ref²_filt, dis²_filt, ref·dis_filt). Final score is num / den per scale.
vif_options.h defines VIF_OPT_HANDLE_BORDERS, so per-scale dims = prev / 2 (no border crop). Decimation samples mu1 at (2i, 2j) — mirror padding handles taps near the edge.
Decision¶
Ship float_vif_{vulkan,cuda,sycl} as single-pass-per-stage kernels driven by 7 dispatches per frame: 4 compute (one per scale) + 3 decimate (scales 1, 2, 3). v1 restricts vif_kernelscale to the production default of 1.0 (the CPU supports 11 alternative scales but they require a much larger filter table; rejected at init for the GPU extractor).
The mirror asymmetry — both H mirrors exist in CPU¶
CPU vif_tools.c::vif_mirror_tap_h returns 2 * extent - idx - 1 for the right border. convolution_internal.h::convolution_edge_s (horizontal branch) returns 2 * width - idx - 2 for the same condition. They differ by one.
The mirrors collide in the CPU's hot path:
vif_filter1d_s/vif_filter1d_sq_s/vif_filter1d_xy_s(the scalar fallback) callvif_hpass_row_sfor every column, border or not — usingvif_mirror_tap_h(-1 form).- The AVX2 fast path (
convolution_f32_avx_s) callsconvolution_edge_s(-2 form) for the small border region and a vectorized scanline for the interior.
For vif_sigma_nsq == 2.0 (the default) the AVX2 fast path is taken, so the border H mirror that runs in production is the -2 form. The GPU's first attempt followed vif_mirror_tap_h (the scalar shape) — which produced an empirical 5.46e-4 relative drift at scale 1 versus AVX2 production. Switching the GPU's H mirror to the -2 formula brought the drift to 1.40e-5 at scale 1, well under the places=4 threshold (5e-5).
Mirror padding on GPU (final)¶
| axis | formula |
|---|---|
| vertical (V), idx >= sup | 2 * sup - idx - 2 (matches both CPU paths) |
| horizontal (H), idx >= sup | 2 * sup - idx - 2 (matches AVX2 production path, NOT scalar) |
A short comment block in each backend's mirror routine cites ADR-0197 so future maintainers don't "fix" the asymmetry by re- matching scalar's mirror.
Per-frame dispatch flow¶
scale 0 compute: read raw ref/dis → (num0, den0)
scale 1 decimate: read raw → buf_A (filter @ HFW=4, sample @ 2x)
scale 1 compute: read buf_A → (num1, den1)
scale 2 decimate: read buf_A → buf_B
scale 2 compute: read buf_B → (num2, den2)
scale 3 decimate: read buf_B → buf_A
scale 3 compute: read buf_A → (num3, den3)
Float ping-pong (ref_buf[0], ref_buf[1]) + (dis_buf[0], dis_buf[1]). Per-scale (num, den) partials at wg_count[scale] × 4 × 2 floats (8 buffers total).
Precision contract: places=4 across all 4 scales, 8-bit and 10-bit¶
| Backend / bit-depth | scale 0 | scale 1 | scale 2 | scale 3 |
|---|---|---|---|---|
| Vulkan (Mesa anv + Arc A380) — 8-bit | 1e-6 | 1.4e-5 | 1.8e-5 | 3.7e-5 |
| Vulkan — 10-bit | 1e-6 | 1e-6 | 7e-6 | 2e-6 |
| CUDA (RTX 4090) — 8-bit | 1e-6 | 1.4e-5 | 1.8e-5 | 3.7e-5 |
| SYCL (Arc A380, oneAPI 2025.3) — 8-bit | 1e-6 | 1.4e-5 | 1.8e-5 | 3.7e-5 |
Identical numbers across backends. All under places=4 threshold (5e-5).
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Match scalar CPU's mirror (-1) | Matches the documented vif_mirror_tap_h | Doesn't match AVX2 production; drift 5.46e-4 at scale 1 — fails places=4 | Production runs AVX2; we match what production runs |
| Use double-precision V→H accumulators on GPU | Higher precision; might converge to "ground truth" | Doubles aren't natively supported on Intel Arc; emulated; large perf hit. AVX2 has its own deviation from ground truth (~5e-4 vs scalar) so doubles wouldn't help close the AVX2 gap anyway | Mirror fix solves the actual deviation source for cheaper |
Loosen the gate to places=3 | Quick green | Weakens the test; doesn't expose real bugs | Per CLAUDE.md / project memory: never weaken correctness gates to make red turn green |
Support all 11 vif_kernelscale values in v1 | Full CPU feature parity | 11 × 4 = 44 distinct filter coef arrays; large compile-time table; the cross-backend gate only exercises kernelscale=1.0 | v1 ships the production default; alternative scales TBD as a focused follow-up |
Consequences¶
- Positive: third Group B float twin shipped at the same
places=4contract asfloat_psnr/float_motion. Mirror-fix documented for future maintainers (and for the eventualfloat_admport — which uses the sameconvolution_avx.cinfrastructure and likely has the same asymmetry). - Positive: identical numerical output across all three GPU backends — strong correctness signal.
- Negative: the kernel restricts
vif_kernelscaleto 1.0 only. Callers that pass other values get-EINVALat init. Documented in the extractor's option table; matches thevif_kernelscale = 1.0default that production uses. - Negative: 7 dispatches per frame is heavier than batch 3's earlier kernels (1 dispatch each). Per-scale resolution decreases geometrically so the total compute is bounded; per-frame wall-clock on Arc A380 at 576×324 is roughly 4× a single-dispatch kernel.
- Neutral / follow-ups:
- CHANGELOG + features.md updates ship in the same PR per ADR-0100.
- Lavapipe lane gains a
float_vifstep atplaces=4. - Next batch 3 metric:
float_adm(most complex; closes Group B).
References¶
- Parent: ADR-0192 — batch 3 scope.
- Sibling: ADR-0193, ADR-0194, ADR-0195, ADR-0196.
- CPU reference:
float_vif.c,vif.c,vif_tools.c. - AVX2 fast path:
convolution_avx.c,convolution_internal.h. - Verification: cross-backend gate
scripts/ci/cross_backend_vif_diff.pywith--feature float_vif --places 4. New step in the lavapipe lane oftests-and-quality-gates.yml. - User direction: 2026-04-27 — "fix everything to fucking 4 places min".