ADR-1103 — Fix integer_vif_hip boundary condition: clamp_i → mirror2_i¶

Field	Value
Status	Accepted
Date	2026-06-13
Deciders	lusoris
Tags	hip, vif, parity, boundary, correctness, fork-local
Supersedes	ADR-0566 §"clamp matches CPU within places=4" claim (now incorrect)

Context¶

ADR-0563 fixed the catastrophic carry-bit corruption in integer_vif_hip (VMAF ~57 → ~76) by reverting to per-thread atomicAdd. After that fix, a residual parity gap of places~2.75 (max |HIP−CPU| ≈ 0.0018 per VIF scale) remained on the Netflix src01 576×324 reference pair, violating ADR-0214's places=4 gate.

The root cause was identified by measuring the actual delta on gfx1030 (RDNA2, wave32) with a current-master build:

| Scale | Max |HIP−CPU| (pre-fix) | Places | |-------|------------------------|--------| | scale0 | 0.001788 | ~2.75 | | scale1 | 0.005541 | ~2.26 | | scale2 | 0.005375 | ~2.27 | | scale3 | 0.005368 | ~2.27 |

Root cause: boundary condition mismatch¶

The CPU reference (integer_vif.c) uses PADDING_SQ_DATA before each horizontal pass — a symmetric reflect at the edges:

index -1 reads pixel 1 (mirror at 0)
index w reads pixel w-2 (mirror at w-1)

The CUDA twin (filter1d.cu lines 121-128) implements the same policy via a "Two-bounce mirror" in its shared-memory load stage.

The HIP kernel used clamp_i(idx, 0, dim-1) — replicate-edge boundary. This disagrees with both CPU and CUDA for the filter_half_width pixels at each edge. For scale 0 (filter width 17, half 8) on a 576-pixel-wide frame, this affects 16 of 576 columns per row (~2.8%), producing the observed delta.

The log computation (log_generate vs CPU LUT log2_32/log2_64) was verified bit-identical for the same normalized 16-bit inputs — it is NOT the source of the gap.

Fix: replace clamp_i with mirror2_i in all filter reads¶

A mirror2_i(v, dim) device helper was added, implementing the same two-bounce mirror as CUDA:

__device__ __forceinline__ static int mirror2_i(int v, int dim)
{
    if (v < 0)
        v = -v;
    if (v >= dim)
        v = 2 * dim - v - 2;
    /* Safety clamp for very small frames. */
    if (v < 0)  v = 0;
    if (v >= dim) v = dim - 1;
    return v;
}

All six filter-loop boundary reads in vif_statistics.hip were updated:

Vertical pass (8-bit): ky = mirror2_i(y - HALF + fi, h) (× 1)
Vertical pass (16-bit): ky = mirror2_i(y - HALF + fi, h) (× 1)
Horizontal pass (8-bit): kx = mirror2_i(x - HALF + fj, w) and kx = mirror2_i(x - HALF_RD + fj, w) (× 2)
Horizontal pass (16-bit template): same two replacements (× 2)

Device verification (gfx1030, wave32, current master)¶

Measured on the Netflix src01 576×324 8-bit YUV pair (48 frames), using a current-master build with enable_hipcc=true:

| Scale | Max |HIP−CPU| (post-fix) | Places | Frames perfect | |-------|-------------------------|--------|----------------| | scale0 | 0.0000010 | ~6.00 | 46 / 48 | | scale1 | 0.0000010 | ~6.00 | 43 / 48 | | scale2 | 0.0000010 | ~6.00 | 39 / 48 | | scale3 | 0.0000010 | ~6.00 | 32 / 48 | | All | all 48/48 < 1e-4 | ≥ 4 | |

Pooled VIF scores (48 frames): HIP matches CPU to 7 significant figures. Pooled VMAF (VIF-only features): HIP=76.667848, CPU=76.667831, delta=0.000017.

Decision¶

Replace clamp_i with mirror2_i in all filter-loop boundary reads in core/src/feature/hip/integer_vif/vif_statistics.hip. Tighten the in-repo HIP VIF parity test (test_hip_vif_parity.c) from PARITY_TOL=1e-3 (places=3) to PARITY_TOL=1e-4 (places=4) per ADR-0214 and ADR-0566.

ADR-0566's claim that "clamp matches CPU within places=4" was incorrect — the correct statement is that mirror-reflect matches CPU within places=6, and clamp produces places~2.75.

Alternatives considered¶

Option	Notes	Decision
Keep clamp_i, accept places=3	Violates ADR-0214 and ADR-0566.	Rejected
LUT-based log (match CPU exactly)	Python simulation showed log computation is bit-identical for the same normalized 16-bit inputs; it is not the source of the delta.	Not applicable
Shared-memory tiling (match CUDA exactly)	CUDA uses smem tiles to amortize the mirror boundary; HIP scalar-per-thread avoids the smem complexity. mirror2_i achieves the same numerical result at scalar per-thread cost.	Deferred (performance optimization, not correctness)
Accept places=4 VMAF-score gate only (not per-feature)	Per-feature places=4 is the upstream gate per ADR-0566; tightening to per-feature ensures VMAF-score places=4 is achievable via SVM amplification analysis (ADR-0566 §SVM amplification).	Rejected

Consequences¶

integer_vif_hip achieves per-feature places=4 (and in practice places=6) on the Netflix golden pair, satisfying ADR-0214 and ADR-0566.
The in-repo test test_hip_vif_parity.c now asserts places=4, catching any future regression to clamp-style boundary.
clamp_i is removed from the file (no remaining callers). The mirror2_i helper remains inline, incurring no additional register pressure vs clamp.

References¶

ADR-0214 — Cross-backend parity gate (places=4 at VMAF-score level)
ADR-0537 — integer_vif_hip crash fixes
ADR-0552 — Superseded wavefront XOR-reduce (carry-bit bug)
ADR-0563 — Per-thread atomicAdd fix (carry-bit fix for integer_vif_hip)
ADR-0566 — HIP VIF per-feature places=4 gate (this ADR confirms it is achievable)
CUDA twin: core/src/feature/cuda/integer_vif/filter1d.cu lines 121-128
CPU reference: core/src/feature/integer_vif.h PADDING_SQ_DATA lines 95-118
Verified: gfx1030 (RDNA2 wave32), ROCm 7.2.53211, 48 Netflix src01 frames