ADR-0537: HIP integer VIF kernel crash fix — filter upload, bounds, HtoD staging¶
- Status: Accepted
- Date: 2026-05-18
- Deciders: lusoris, Claude (Anthropic)
- Tags:
hip,gpu,kernel,vif,bug-fix
Context¶
ADR-0530 (PR #1287) wired compute_fex_flags() to add VMAF_FEATURE_EXTRACTOR_HIP whenever a HIP state is imported, so the model-driven dispatch finally picks HIP-flagged extractors instead of falling through to the CPU twins. As a known consequence ADR-0530 explicitly left vmaf_fex_integer_vif_hip un-flagged with an inline citation: "speculatively flagged in batch-1 (ADR-0254) but crashes with a GPU memory access fault on the first frame". This ADR closes that follow-up.
Investigation found four separate defects in the integer-VIF HIP path:
-
Filter table passed as host pointer. The kernel signature was
const uint16_t filter[18](a pointer parameter); the host launch args contained(void *)vif_filter1d_table[0]— the address of a host-side static array. HIP'shipModuleLaunchKernelhappily copied that 8-byte value into the kernel argument region and the GPU dereferenced it on the first load, faulting immediately with "Memory access fault by GPU node-1 ... Reason: Page not present". -
Wrong filter half-widths. The kernel hard-coded
HALF = 9(etc.) from the second number in the kernel-name suffix (filter1d_8_vertical_kernel_uint32_t_17_9→HALF=9). The suffix actually encodesfwidth_0 = 17andfwidth_1 = 9— two distinct widths, not a half-width. The correct half-widths arevif_filter1d_width[scale] / 2 = {8, 4, 2, 1}. The pre-fix kernel read 19 filter coefficients (k = -9..9) from an 18-entry table, reading one past the end every scale-0 launch. -
No downsample write for next scale. The CUDA twin's scale-0/1/2 kernels compute the
*_convolrd-filter residual and write half-resolution ref / dis planes intobuf.ref/buf.dis. The pre-fix HIP kernel skipped this step entirely, so scales 1–3 read uninitialised device memory as their input frame. Even with the host pointer faulted around, the resulting numerics would have been garbage. -
Picture buffer reads from host memory. The HIP picture pool isn't wired yet (per the ADR-0530 docblock — "HIP TUs still accept HOST buffers and do their own HtoD copy"). The pre-fix submit passed
ref_pic->data[0](a host pointer) directly to the kernel. Same fault class as (1), separately reachable once (1) was fixed.
After PR #1287 promoted the dispatch flag bit, ADR-0530 immediately un-flagged the VIF extractor with an inline citation pointing to this follow-up — the fix wasn't ready to ship in the same PR. This ADR records the closing of that follow-up.
Decision¶
Rewrite core/src/feature/hip/integer_vif/vif_statistics.hip and core/src/feature/hip/integer_vif_hip.c to correct all four defects, then re-enable VMAF_FEATURE_EXTRACTOR_HIP on vmaf_fex_integer_vif_hip:
-
Filter upload. Allocate a 144-byte device buffer at
init()andhipMemcpythe fullvif_filter1d_table[4][18]into it once per extractor lifetime. Every kernel launch now passes&s->vif_filt_dev(address of the variable storing the device pointer) so the kernel receives a valid device pointer. -
Correct half-widths. Use
vif_filter1d_width[scale] / 2 = {8, 4, 2, 1}inside the kernel, expressed as#define VIF_FW_S0 17 / VIF_FW_S1 9 / VIF_FW_S2 5 / VIF_FW_S3 3withHALF = FW / 2derivation. -
Downsample write path. Each scale-0/1/2 horizontal kernel now applies the
FW_RD-tap rd filter totmp.ref_convol / tmp.dis_convoland writes the decimated-by-2 half-resolutionuint16_tplanes tobuf.ref / buf.dis, matching the CPUsubsample_rd_*+decimate_and_padreference paths. -
HtoD picture staging. Allocate per-frame device staging buffers
s->ref_in_dev/s->dis_in_devat init; before each scale-0 launch,hipMemcpy2DAsynccopies the host Y plane into them (mirrors the integer_motion_hip.c pattern at lines 372–373). Scales 1–3 read the half-res rd-buffers written by the previous horizontal pass.
The kernel is intentionally written as a scalar-per-thread implementation: it drops the CUDA twin's shared-memory tiling and warp-reduce micro-optimisations in favour of a small, verifiable surface (~540 lines vs. ~850 in the CUDA twin). Correctness first; perf optimisation is deferred to a follow-up ADR once the cross- backend parity gate confirms the baseline is stable.
Adjacent fixes bundled into this PR¶
The same PR also closes three small adjacent gaps surfaced by the enable_hipcc=true build:
-
Missing HSACO entries in
hip_kernel_sources. ADR-0533 (PR #1292) wired the HIP-extractor registration sweep but did not add the corresponding_hsacoblobs to the meson dict. Seven consumers (motion_score,ms_ssim_score,psnr_hvs_score,integer_ssim_score,float_vif_score,ssimulacra2_blur,ssimulacra2_mul) referenced<name>_hsacosymbols under#ifdef HAVE_HIPCCthat the build never compiled. These are added. -
ADM kernels carry CUDA-helper dependencies. Four ADM .hip files (
adm_dwt2,adm_csf,adm_csf_den,adm_cm) reference CUDA-specific types (uint64_cu,cta_size,warp_reduce,VMAF_CUDA_THREADS_PER_WARP) that don't compile underhipcc --genco's minimal include set. A weak-stub TU (hip_hsaco_stubs.c) provides empty fallback blobs for those four symbols so the link succeeds; the hosthipModuleLoadDatacall against the empty blob returns an error and the extractor falls back to CPU through the ADR-0530 dispatch fallback. When any individual ADM kernel is ported off the CUDA helpers, adding it tohip_kernel_sourceswill override the weak symbol at link time. -
HIPcc include path missing
build_dir+feature/hip+hip. Thehipcc --gencocustom_target only hadmeson.current_source_dir().../featurein the include list, so kernels includinghip/integer_*_hip.h(which transitively pull inconfig.hfrom the build directory) failed to find headers. Three more-Ientries fix it.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
| Port the CUDA twin verbatim (templates + shared-memory + warp-reduce) | Best perf; closer parity story | ~3× more code; HIP wavefront-size differences (RDNA2=32, gfx9XX=64) need separate tuning; risk of subtle numerical divergence in the warp-reduce path | Scalar-per-thread is sufficient correctness baseline; perf optimisation is its own ADR once the parity gate is stable |
| Pass the filter table as a by-value struct (CUDA pattern) | Avoids the device-pointer indirection | HIP module-launch struct-by-value works but adds an aliasing surface for the kernel's (const uint32_t *)&mu1 reinterpret pattern; the 144-byte upload is one-shot at init so the perf cost is amortised | Device-pointer pattern is the more portable HIP idiom and matches the integer_motion_hip / float_psnr_hip precedent |
| Keep the flag cleared and defer the kernel fix | Smallest delta; lowest risk of new regression | Leaves the model-driven --backend hip path inconsistent (some flagged, some cleared); blocks the cross-backend parity gate from running VIF on HIP | The flag flip is the entire point of ADR-0530's follow-up; deferring it leaves the gap permanently |
Consequences¶
Positive:
vmaf --backend hip --feature vif_hipno longer crashes on gfx1036; produces VMAF scores end-to-end on the Netflix golden pair (src01_hrc00 vs src01_hrc01 at 576×324).- The HIP dispatch path is now consistent: every HIP-registered extractor whose kernel works has
VMAF_FEATURE_EXTRACTOR_HIPset; the ADR-0530 fallback handles unflagged misses. - The
enable_hipcc=truebuild path links cleanly under ADR-0533's extractor-registration sweep.
Negative:
- HIP VIF scores are within ~0.003 of CPU per scale on the Netflix golden pair (places=3 — not places=4 of ADR-0214). Sources of divergence: clamp boundary inside the kernel vs CPU's pre-padded mirror buffer; cumulative downsampling error compounding across scales. Cross-backend parity gate currently passes at places=3 for VIF on HIP; a follow-up ADR will tighten to places=4 by porting the CPU's pre-pad boundary semantics into the kernel.
- The scalar-per-thread kernel is correctness-first, not perf-first. Throughput will be ~5–10× slower than the CUDA twin on equivalent silicon. Acceptable for now (HIP path is developer-only); the perf optimisation is a separate ADR.
Neutral / follow-ups:
- Tighten cross-backend parity from places=3 to places=4 for HIP VIF by porting the CPU boundary semantics.
- Port the CUDA twin's shared-memory tiling for perf parity.
- Port the four ADM .hip kernels off the CUDA helper macros so the weak HSACO stubs can be removed.
- Real HIP picture pool (so HtoD staging in submit can be eliminated).
References¶
req: "Fix thevmaf_fex_integer_vif_hipGPU memory access fault."- ADR-0530: HIP feature-flag promotion + HIP_DEVICE picture-buffer type.
- ADR-0533: HIP-extractor registration sweep (PR #1292).
- ADR-0254: Original HIP feature-flag scaffold (batch-1).
- ADR-0214: Cross-backend parity gate (places=4).
- ADR-0468: integer_motion_v2_hip scaffold (HtoD staging precedent).
- AMD_LOG_LEVEL=3 trace confirming kernel launches succeed:
/tmp/amd.login the PR description.