HIP (AMD ROCm) compute backend¶
Status (2026-05-18): vmaf --backend hip is end-to-end working on AMD ROCm hosts following ADR-0519. The library-side vmaf_hip_import_state was promoted from -ENOSYS to a real implementation; the CLI now produces a valid VMAF JSON on any AMD GPU visible to ROCm. Verified on AMD gfx1036 (Radeon 680M) inside the vmaf-dev-mcp container: VMAF = 76.66783 on the Netflix golden src01 pair, a bit-exact match against the CPU backend (delta = 0; meets the places=4 cross-backend gate from ADR-0214 with room to spare). HIP joins CUDA / SYCL / Metal as a fully working runtime-selected backend. (The Vulkan backend was removed in ADR-0726.)

Dispatch posture (2026-05-18, updated per ADR-0530): vmaf_fex_integer_motion_hip now carries VMAF_FEATURE_EXTRACTOR_HIP and is selectable from the model-driven dispatch when a HIP state has been imported (vmaf --backend hip implies that import). The VMAF_PICTURE_BUFFER_TYPE_HIP_DEVICE enum entry has been added for the future HIP picture pool; pictures still arrive as VMAF_PICTURE_BUFFER_TYPE_HOST for now and the HIP feature TUs perform their own HtoD copies (hipMemcpy2DAsync). End-to-end verification: --backend hip --feature integer_motion produces a clean VMAF JSON with VMAF = 76.71 on the Netflix src01 pair (vs CPU 76.67, well inside the places=4 cross-backend gate from ADR-0214); 48 hipModuleLaunchKernel(calculate_motion_score_kernel_8bpc) launches per 48-frame clip confirm the HIP kernel is actually dispatching.

The other HIP extractors (vif_hip, psnr_hip, ciede_hip, float_moment_hip, motion_v2_hip, float_motion_hip, float_ssim_hip, float_psnr_hip, cambi_hip, float_adm_hip, ssimulacra2_hip) remain unflagged pending per-extractor end-to-end verification. Each will be promoted by its own follow-up PR + ADR after a successful reproducer + cross-backend numerical check.

Status (2026-05-29): 19 distinct HIP VmafFeatureExtractor descriptors are registered in feature_extractor_list[] and resolve via vmaf_get_feature_extractor_by_name(<name>). (The table below lists 21 rows: two of them, integer_ciede_hip and integer_moment_hip, are alias names that resolve to the canonical ciede_hip / float_moment_hip descriptors, not separate registrations.) ADR-0523 wired the first long-missing entry (integer_motion_hip); ADR-0533 swept in the remaining six (float_vif_hip, integer_adm_hip → adm_hip, integer_ms_ssim_hip, psnr_hvs_hip, integer_ssim_hip, ssimulacra2_hip). The three legacy-API plumbing TUs (adm_hip.c, vif_hip.c, motion_hip.c) carry no VmafFeatureExtractor struct; they hold older _init/_run/_destroy helpers and are intentionally not registered. Two stale rename-scaffold duplicates (integer_ciede_hip.c, integer_moment_hip.c) are unwired in hip_sources because the canonical TUs (ciede_hip.c, float_moment_hip.c) already register the same extractor.

| Extractor | Feature name | Added in |
|---|---|---|
| integer_psnr_hip | psnr_hip | ADR-0241 |
| float_psnr_hip | float_psnr_hip | ADR-0254 |
| ciede_hip (now integer_ciede_hip) | ciede_hip | ADR-0259 / PR #1016 |
| float_moment_hip | float_moment_hip | ADR-0260 |
| integer_motion_v2_hip | motion_v2_hip | ADR-0267 |
| float_motion_hip | float_motion_hip | ADR-0373 |
| float_ssim_hip | float_ssim_hip | ADR-0375 |
| float_vif_hip | float_vif_hip | ADR-0379 |
| integer_psnr_hvs_hip | psnr_hvs_hip | PR #995 |
| integer_cambi_hip | cambi_hip | PR #996 |
| ssimulacra2_hip | ssimulacra2_hip | PR #1000 |
| integer_vif_hip | integer_vif_hip | PR #1001 |
| integer_motion_hip | integer_motion_hip | PR #1004 |
| integer_adm_hip | integer_adm_hip | PR #1007 |
| integer_ciede_hip | ciede_hip (alias) | PR #1016 |
| integer_moment_hip | integer_moment_hip | PR #1017 |
| integer_ms_ssim_hip | ms_ssim_hip | ADR-0285 / PR #1013 |
| integer_ssim_hip | integer_ssim_hip | PR #999 |
| float_adm_hip | float_adm_hip | ADR-0468 / PR #1024 |
| speed_chroma_hip | speed_chroma_hip | ADR-0567 / ADR-0852 |
| speed_temporal_hip | speed_temporal_hip | ADR-0567 / ADR-0852 |

All 19 registered kernels require enable_hip=true + enable_hipcc=true. (float_ansnr_hip was removed together with the CPU extractor in commit 70ed8b3ce3 / PR #38; it no longer appears in this table.) Without enable_hipcc, the scaffold -ENOSYS posture is preserved. The three stubs (adm_hip, vif_hip, motion_hip) use an older _init/_run/_destroy API shape that predates the HSACO kernel template; they remain at -ENOSYS pending an API redesign.
Building¶
ROCm 7.0 or later is required; 7.2.4 is the version tested in CI and the dev container.
meson setup build -Denable_cuda=false -Denable_sycl=false \
-Denable_hip=true -Denable_hipcc=true
ninja -C build
meson test -C build
enable_hipcc=false (the default) compiles the HIP C host runtime but skips the hipcc-compiled kernel objects; every extractor returns -ENOSYS at init(). Set both flags to true to compile and link the real device kernels.
The scaffold has zero hard runtime dependencies — no ROCm SDK, no hipcc, no amdhip64. The Meson build files include an optional dependency('hip-lang', required: false) probe so a host that already has ROCm installed will see the dependency resolve; the scaffold compiles cleanly without it.
-Dhip_gfx_targets (HSACO fat-binary targets)¶
hipcc --genco produces an HSACO blob for each --offload-arch target. The Meson build discovers the targets in this order:
1. The -Dhip_gfx_targets=<csv> operator override (explicit).
2. rocm_agent_enumerator (filters to gfx* lines).
3. hipconfig --amdgpu-target.
4. The hard-coded fallback list gfx90a,gfx1030,gfx1036,gfx1100 (CDNA2 server + RDNA2 desktop + AMD Raphael APU iGPU + RDNA3).
Steps 2 and 3 only succeed when the build host can see a real GPU. Inside a no-GPU build sandbox (BuildKit, CI) both probes return empty and the build falls through to step 4. The fallback was gfx90a only until ADR-0546; that narrow fallback shipped libvmaf.so binaries that failed at runtime on the fork's own dev host (AMD Raphael APU gfx1036, override-mapped to gfx1030 via HSA_OVERRIDE_GFX_VERSION=10.3.0) with hip_fatbin.cpp: No compatible code objects found for: gfx1030. Widening the fallback closed that failure mode without changing what an operator with a configured GPU sees.
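The discovery order above can be modelled as a simple first-non-empty chain. The following Python sketch is illustrative only and is not the Meson logic itself: the function name resolve_gfx_targets, its parameters, and the way probe output is tokenised are assumptions made for the example; only the ordering and the fallback list come from this page.

```python
# Hypothetical model of the documented probe order; not the real Meson code.
FALLBACK_TARGETS = ["gfx90a", "gfx1030", "gfx1036", "gfx1100"]


def _unique(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def resolve_gfx_targets(override_csv="", enumerator_output="", hipconfig_output=""):
    """Return the first non-empty target list, probing in the documented order."""
    # Step 1: explicit -Dhip_gfx_targets=<csv> override.
    targets = [t.strip() for t in override_csv.split(",") if t.strip()]
    if targets:
        return _unique(targets)

    # Step 2: rocm_agent_enumerator output, keeping only gfx* lines.
    targets = [ln.strip() for ln in enumerator_output.splitlines()
               if ln.strip().startswith("gfx")]
    if targets:
        return _unique(targets)

    # Step 3: hipconfig --amdgpu-target output (assumed comma/space separated).
    targets = [t for t in hipconfig_output.replace(",", " ").split()
               if t.startswith("gfx")]
    if targets:
        return _unique(targets)

    # Step 4: no GPU visible (BuildKit, CI): use the hard-coded fallback list.
    return list(FALLBACK_TARGETS)


if __name__ == "__main__":
    print(resolve_gfx_targets())                        # sandbox build
    print(resolve_gfx_targets(override_csv="gfx1100"))  # pinned single target
```

In a no-GPU sandbox both probe strings are empty, so the call falls through to the four-target fallback, which is the situation the widened list was introduced to cover.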
Operators that need a smaller fat binary (image size, build time) can pin a single target:
Multi-target operator overrides take a comma-separated list (one --offload-arch per target):
The HIP HSACO targets: line in the Meson configure output shows the resolved list for the current build.
Runtime¶
When built with HIP and device kernels, the backend is available for explicit opt-in:
./build/tools/vmaf --feature psnr_hip --reference ref.yuv ...
./build/tools/vmaf --feature float_psnr_hip --reference ref.yuv ...
./build/tools/vmaf --feature integer_vif_hip --reference ref.yuv ...
./build/tools/vmaf --feature integer_adm_hip:adm_skip_scale0=true --reference ref.yuv ...
FFmpeg backend selector: hip_device=N (patch 0011-libvmaf-wire-hip-backend-selector.patch in ffmpeg-patches/; see ADR-0380).
Source layout¶
core/src/hip/ # HIP runtime (common, picture_hip, dispatch_strategy)
core/src/feature/hip/ # per-feature kernels
integer_psnr_hip.c # uint64 atomic-SSE warp-64 __shfl_down
float_psnr_hip.c # float (ref-dis)^2 reduction per block
float_motion_hip.c # 5x5 Gaussian blur + per-block float SAD
float_moment_hip.c # four uint64 atomic accumulator kernel
float_ssim_hip.c # two-pass separable 11-tap Gaussian kernel
float_vif_hip.c # multi-scale VIF float pipeline
float_adm_hip.c # ADM float pipeline (ADR-0468)
ciede_hip.c # legacy alias for integer_ciede_hip
integer_ciede_hip.c # YUV->Lab, CIEDE2000 dE, warp-64 shfl_down
integer_motion_v2_hip.c # raw-pixel ping-pong, 5-tap Gaussian diff
integer_motion_hip.c # 5-tap Gaussian blur + warp-reduced SAD
integer_moment_hip.c # four uint64 atomic accumulator (integer)
integer_psnr_hvs_hip.c # PSNR-HVS frequency-weighted distortion
integer_ssim_hip.c # two-pass separable Gaussian + SSIM combine
integer_ms_ssim_hip.c # multi-scale SSIM (5 scales, biorthogonal LPF)
integer_adm_hip.c # ADM DWT2 + CSF + CM + decouple pipeline
integer_vif_hip.c # multi-scale VIF integer pyramid
integer_cambi_hip.c # CAMBI banding detection
ssimulacra2_hip.c # SSIMULACRA2 (host YUV->XYB + GPU IIR blur)
adm_hip.c # stub — returns -ENOSYS (legacy API)
vif_hip.c # stub — returns -ENOSYS (legacy API)
motion_hip.c # stub — returns -ENOSYS (legacy API)
Kernel notes¶
- integer_psnr_hip: uint64 atomic-SSE kernel, warp-64 __shfl_down reduction. Emits psnr_y.
- float_psnr_hip: float (ref-dis)² reduction per block. Emits float_psnr.
- float_motion_hip: temporal extractor. 5×5 separable Gaussian blur + per-block float SAD partials, blur ping-pong (blur[2]), first-frame compute_sad=0 short-circuit, motion2 tail emission in flush(). Emits VMAF_feature_motion_score + VMAF_feature_motion2_score.
- float_moment_hip: four uint64 atomic accumulator kernel (ref1st, dis1st, ref2nd, dis2nd), warp-64 two-uint32-shuffle reduction. Host divides by w×h. Emits four float_moment_* features.
- float_ssim_hip: two-pass separable 11-tap Gaussian kernel. Pass 1 (horiz): five intermediate float buffers over (W-10)×H. Pass 2 (vert + SSIM combine): per-block float partial sum over (W-10)×(H-10). Host accumulates in double. Emits float_ssim.
- integer_ciede_hip: HtoD copies of all 6 YUV planes (ref + dis Y/U/V), per-pixel YUV→Lab conversion, CIEDE2000 ΔE accumulation per block, host log10 transform. Emits ciede2000. Warp-64 __shfl_down without mask.
- integer_motion_v2_hip: temporal extractor. Raw-pixel ping-pong (pix[2]), separable 5-tap Gaussian diff filter with arithmetic right-shift (critical for bit-exactness vs CPU; see ADR-0138/0139 and PR #587 AVX2 srlv_epi64 regression), single int64 atomic SAD accumulator, host-side min(cur, next) fold in flush(). Emits VMAF_integer_feature_motion_v2_sad_score + VMAF_integer_feature_motion2_v2_score.
- integer_motion_hip: 5-tap Gaussian blur + warp-reduced SAD, ping-pong frame buffer; mirrors the integer_motion_cuda.c call-graph. Emits VMAF_feature_motion2_score.
- integer_psnr_hvs_hip: frequency-weighted distortion per 8×8 block, porting the CUDA twin. Emits psnr_hvs + per-channel variants.
- integer_ssim_hip: two-pass separable 11-tap Gaussian SSIM, GCN/RDNA warp-size-64 adaptation. Emits integer_ssim.
- integer_ms_ssim_hip: multi-scale SSIM over 5 pyramid levels; 9-tap biorthogonal LPF decimation + separable 11-tap Gaussian per scale. Emits float_ms_ssim. Per ADR-0285.
- integer_adm_hip: full ADM DWT2 + CSF + CM + decouple pipeline (five kernel files). Mirrors integer_adm_cuda.c. Emits adm2 + per-scale values.
- integer_vif_hip: multi-scale VIF integer pyramid; respects vif_skip_scale0 (PR #1063) and vif_enhn_gain_limit. Emits vif_scale0..3.
- integer_cambi_hip: CAMBI banding detection; full HIP port per PR #996 (ADR-0345 Phase 3). Emits cambi.
- ssimulacra2_hip: host-side YUV→XYB + GPU IIR blur + host double-precision combine; mirrors the CUDA twin. Emits ssimulacra2.
- float_adm_hip: ADM float pipeline, ninth kernel-template consumer (ADR-0468). Mirrors float_adm_cuda.c. Emits float_adm2.
- float_vif_hip: multi-scale VIF float pipeline; respects vif_skip_scale0 (PR #1180). Emits float_vif_scale0..3.
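The integer_motion_v2_hip note singles out the arithmetic right-shift as critical for bit-exactness. The Python sketch below only illustrates why the two shift flavours disagree on negative intermediates; it models 64-bit lanes with plain integers and is not the kernel's code. The helper names are hypothetical.

```python
# Illustration only: arithmetic vs logical right shift on 64-bit signed values.
MASK64 = (1 << 64) - 1


def to_signed64(x):
    """Reinterpret the low 64 bits of x as a two's-complement signed value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def arithmetic_shift_right(value, shift):
    """Sign-preserving shift (Python's >> on ints already behaves this way)."""
    return value >> shift


def logical_shift_right(value, shift):
    """Zero-filling shift on the 64-bit bit pattern, as a logical shift would do."""
    return to_signed64((value & MASK64) >> shift)


if __name__ == "__main__":
    for v in (12, -5):
        print(v, arithmetic_shift_right(v, 1), logical_shift_right(v, 1))
```

For non-negative inputs the two helpers agree; for a negative filter difference the logical variant returns a huge positive number, which is the kind of divergence a SAD accumulator would then sum into the score.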
Remaining stubs¶
adm_hip, vif_hip, and motion_hip use the older _init/_run/_destroy API shape that requires a separate VmafFeatureExtractor redesign before promotion. Each returns -ENOSYS at init(). Tracked in docs/state.md.
Caveats¶
- enable_hip is boolean defaulting to false. enable_hipcc (also boolean, default false) controls whether hipcc-compiled kernel objects are linked. Both must be true for real GPU computation.
- HIP runtime types (hipDevice_t, hipStream_t) cross the public ABI as uintptr_t. This keeps libvmaf_hip.h free of <hip/hip_runtime.h>, mirroring the pattern Vulkan adopted in ADR-0184.
- No CI runner with a real AMD GPU exists on GitHub-hosted infrastructure. The CI compile lane (Build — Ubuntu HIP) runs with -Denable_hip=true but -Denable_hipcc=false, so kernels are not compiled or exercised on CI.
References¶
- ADR-0212: the original scaffold.
- ADR-0241: first consumer (integer_psnr_hip).
- ADR-0254: second consumer (float_psnr_hip).
- ADR-0259: third consumer.
- ADR-0260: fourth consumer (float_moment_hip).
- ADR-0266: fifth consumer (float_ansnr_hip), retained for historical traceability. The kernel and its CPU twin were removed in ADR-0709 (PR #38); ANSNR is no longer a registered feature on any backend.
- ADR-0267: sixth consumer (motion_v2_hip).
- ADR-0372: batch-1 kernels.
- ADR-0373: batch-2 kernels.
- ADR-0375: batch-3 kernels.
- ADR-0377: batch-4 kernels.
- ADR-0379: float_vif_hip.
- ADR-0380: FFmpeg selector.
- ADR-0468: float_adm_hip.
- ADR-0523: register vmaf_fex_integer_motion_hip.
- ADR-0533: full HIP-extractor registration sweep (six more TUs wired into hip_sources + feature_extractor_list[]).
- Research-0033: AMD market-share + ROCm Linux maturity survey.
ADR-0537: integer_vif_hip kernel fix (2026-05-18)¶
The integer VIF HIP extractor now runs end-to-end on AMD gfx1036 inside the vmaf-dev-mcp container:
docker exec vmaf-dev-mcp vmaf \
--reference /workspace/python/test/resource/yuv/src01_hrc00_576x324.yuv \
--distorted /workspace/python/test/resource/yuv/src01_hrc01_576x324.yuv \
--width 576 --height 324 --pixel_format 420 --bitdepth 8 \
--backend hip --feature vif_hip --json --output /tmp/vif_hip.json
Reports the four VIF scale scores within places=3 of CPU on the Netflix golden pair. The places=4 parity target is tracked as an ADR-0537 follow-up — the residual ~0.001–0.003 per-scale delta comes from the kernel's edge-clamp boundary vs CPU's pre-padded mirror boundary (cumulative across downsamples).
Four defects fixed (see ADR-0537):
- The 4×18 vif_filter1d_table is uploaded to a device buffer at init (the pre-fix kernel was handed a host pointer that the GPU faulted on).
- Filter half-widths corrected from {9,5,3,0} (parsed from the kernel-name suffix, which is the wrong number) to {8,4,2,1} (= vif_filter1d_width[scale] / 2). Pre-fix read 19/11/7/1 coefficients per output pixel from an 18-entry table.
- Added the rd-filter downsample-write path so scales 1–3 read the half-resolution planes the previous horizontal pass produced. Pre-fix left them uninitialised.
- Picture buffers are staged into device memory via hipMemcpy2DAsync before scale-0 reads them (mirrors the integer_motion_hip.c pattern).
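The half-width arithmetic behind the second defect can be checked by hand: a symmetric filter with half-width h reads 2h + 1 coefficients per output pixel. The Python sketch below reproduces the 19/11/7/1 figure quoted above and shows the corrected counts; it assumes odd filter widths (so that width / 2 truncates to the half-width), and the helper name taps_read is hypothetical.

```python
# Worked arithmetic for the half-width fix; values are taken from the list above.
TABLE_ROW_LEN = 18            # entries per scale in the 4x18 vif_filter1d_table

BUGGY_HALF_WIDTHS = [9, 5, 3, 0]   # parsed from the kernel-name suffix
FIXED_HALF_WIDTHS = [8, 4, 2, 1]   # vif_filter1d_width[scale] / 2


def taps_read(half_width):
    """Coefficients touched by a symmetric filter loop over [-h, +h]."""
    return 2 * half_width + 1


if __name__ == "__main__":
    for scale, (bad, good) in enumerate(zip(BUGGY_HALF_WIDTHS, FIXED_HALF_WIDTHS)):
        overrun = taps_read(bad) > TABLE_ROW_LEN
        print(f"scale{scale}: pre-fix {taps_read(bad)} taps"
              f"{' (past the 18-entry row)' if overrun else ''},"
              f" post-fix {taps_read(good)} taps")
```

With the pre-fix values, scale 0 reads 19 coefficients from an 18-entry row, while scale 3 reads a single coefficient and so applies no filtering at all.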
Adjacent fixes bundled in the same PR:
- Missing HSACO entries (motion_score, ms_ssim_score, psnr_hvs_score, integer_ssim_score, float_vif_score, ssimulacra2_blur, ssimulacra2_mul) added to hip_kernel_sources; ADR-0533 wired the extractor registration sweep but not the corresponding kernel compilation.
- Weak-stub TU hip_hsaco_stubs.c provides empty fallback _hsaco symbols for the four ADM kernels (adm_dwt2, adm_csf, adm_csf_den, adm_cm) that don't yet build standalone via hipcc --genco because they reference CUDA-specific helper macros. As individual kernels port to standalone-buildable .hip sources, their weak-stub line is deleted from hip_hsaco_stubs.c in the same PR. ADR-0539 establishes that pattern, starting with float_vif_score_hsaco, whose real kernel has shipped at core/src/feature/hip/float_vif/float_vif_score.hip since ADR-0379 / PR #1025.
- The hipcc --genco include path adds meson.current_build_dir() + feature/hip + hip so kernel sources can resolve config.h / integer_*_hip.h headers.
Re-enables VMAF_FEATURE_EXTRACTOR_HIP on vmaf_fex_integer_vif_hip — ADR-0530 had cleared it pending this fix.
ADR-0539 — integer_moment HIP kernel registration (2026-05-18)¶
Closes the last unresolved-symbol gap in the enable_hipcc=true HIP build. Adds a new entry to hip_kernel_sources:
The key is distinct from the pre-existing moment_score key (which points at hip/float_moment/moment_score.hip). The two keys emit different _hsaco symbols (integer_moment_score_hsaco vs moment_score_hsaco) consumed by integer_moment_hip.c and float_moment_hip.c respectively.
End-to-end verification on the Netflix src01_hrc00 ↔ src01_hrc01 576×324 pair (HIP vs CPU, --backend hip|cpu --feature psnr|psnr_hvs|float_moment):
| Feature | HIP | CPU | delta |
|---|---|---|---|
| psnr_y mean | 30.755064 | 30.755064 | 0.000000 |
| psnr_cb mean | 38.449441 | 38.449441 | 0.000000 |
| psnr_cr mean | 40.991910 | 40.991910 | 0.000000 |
| psnr_hvs mean | 31.330446 | 31.330446 | 0.000000 |
| psnr_hvs_y mean | 30.578766 | 30.578766 | 0.000000 |
| psnr_hvs_cb mean | 37.258498 | 37.258498 | 0.000000 |
| psnr_hvs_cr mean | 38.200260 | 38.200260 | 0.000000 |
| float_moment_ref1st mean | 59.788567 | 59.788567 | 0.000000 |
| float_moment_dis1st mean | 61.332007 | 61.332007 | 0.000000 |
| float_moment_ref2nd mean | 4696.668388 | 4696.668388 | 0.000000 |
| float_moment_dis2nd mean | 4798.659574 | 4798.659574 | 0.000000 |
All within places=4 of CPU (in fact bit-exact: delta=0.000000).
After this PR no weak HSACO stubs back any of the three integer-domain PSNR / PSNR-HVS / moment extractors — only the four ADM kernels remain on the ADR-0536 stub path pending their own CUDA-helper-macro port.
Per-kernel hipcc flag dispatch (ADR-0539)¶
core/src/meson.build defines a hip_cu_extra_flags dict alongside hip_kernel_sources so individual HSACO compilations can opt into non-default flags without changing the global hipcc command line. The fall-through (hip_cu_extra_flags.get(name, [])) is byte-identical to the prior command line for any kernel not listed.
| Kernel | Extra hipcc flags | Reason |
|---|---|---|
| ssimulacra2_blur | -ffp-contract=off | Recursive Gaussian IIR pole-tracking depends on IEEE-754 add/mul ordering; allowing FMA fusion of n2 * sum - d1 * prev drifts the cascade past the places=2 parity gate within a handful of pyramid levels. Mirrors the CUDA twin's --fmad=false. |
Mirrors the established cuda_cu_extra_flags dict in the same meson file (used by float_adm_score and ssimulacra2_blur on the CUDA side). When porting float_adm device code to HIP, add a matching entry.
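To see why FMA contraction matters for a kernel like ssimulacra2_blur, it helps to compare a multiply-add that rounds twice with one that rounds once. The Python sketch below is a generic IEEE-754 illustration, not the IIR kernel: it emulates a fused multiply-add with exact rational arithmetic followed by a single rounding, and the input values are chosen purely to make the difference visible.

```python
# Illustration only: separate multiply+add rounds twice, an FMA rounds once.
from fractions import Fraction


def mul_add_unfused(a, b, c):
    """a*b is rounded to double first, then the sum is rounded again."""
    return a * b + c


def mul_add_fused(a, b, c):
    """Emulated FMA: exact rational a*b + c, then one rounding to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -27          # exactly representable
    c = -(1.0 + 2.0 ** -26)       # cancels the rounded product exactly
    print("unfused:", mul_add_unfused(a, a, c))
    print("fused:  ", mul_add_fused(a, a, c))
```

The exact product is 1 + 2^-26 + 2^-54; the unfused path rounds away the 2^-54 term before the subtraction, the fused path keeps it. In a recursive filter such per-step differences feed back into later samples, which is consistent with the drift described in the table row.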
ADR-0539: integer ADM HIP kernels — real implementation (2026-05-18)¶
The four ADM kernels (adm_dwt2, adm_csf, adm_csf_den, adm_cm) that the ADR-0537 sub-bundle had left as weak HSACO fallbacks now build standalone via hipcc --genco and are registered in hip_kernel_sources. The xxd-embedded strong symbols replace the weak slots in hip_hsaco_stubs.c (which is now ADM-stub-free).
End-to-end on AMD gfx1036 inside vmaf-dev-mcp:
docker exec vmaf-dev-mcp vmaf \
--reference /workspace/python/test/resource/yuv/src01_hrc00_576x324.yuv \
--distorted /workspace/python/test/resource/yuv/src01_hrc01_576x324.yuv \
--width 576 --height 324 --pixel_format 420 --bitdepth 8 \
--backend hip --feature adm --json --output /tmp/adm_hip.json
Bit-exact vs CPU on the Netflix golden src01 pair (delta = 0.000000):
| Feature | CPU | HIP | Diff |
|---|---|---|---|
| integer_adm | 0.934506 | 0.934506 | 0.000000 |
| integer_adm2 | 0.934506 | 0.934506 | 0.000000 |
| integer_adm3 | 0.953973 | 0.953973 | 0.000000 |
| integer_adm_scale0 | 0.907897 | 0.907897 | 0.000000 |
| integer_adm_scale1 | 0.893864 | 0.893864 | 0.000000 |
| integer_adm_scale2 | 0.929998 | 0.929998 | 0.000000 |
| integer_adm_scale3 | 0.964951 | 0.964951 | 0.000000 |
The CUDA twin's per-warp __shfl_down_sync reduction (cuda_helper.cuh::warp_reduce) is replaced by per-thread atomicAdd on the 64-bit unsigned accumulator. AMD wavefronts are 64 wide (not the 32 the CUDA shuffle mask hard-codes); per-thread atomicAdd is bit-exact since uint64 addition is associative and commutative. Same pattern vif_statistics.hip adopted (ADR-0537).
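The order-independence argument can be demonstrated on the host. The Python sketch below models a 64-bit unsigned accumulator with wrap-around and sums the same per-thread contributions in two different orders; it is an illustration of the arithmetic property only, not of the HIP atomicAdd intrinsic, and the helper name accumulate_u64 is hypothetical. A float counter-example is included for contrast.

```python
# Illustration only: uint64 accumulation is order-independent, float is not.
import random

MASK64 = (1 << 64) - 1


def accumulate_u64(values, order):
    """Add values in the given order with 64-bit wrap-around, like a uint64 accumulator."""
    acc = 0
    for idx in order:
        acc = (acc + values[idx]) & MASK64
    return acc


if __name__ == "__main__":
    rng = random.Random(0)
    # Large contributions so the accumulator wraps several times.
    values = [rng.getrandbits(63) for _ in range(256)]

    in_order = list(range(len(values)))
    shuffled = in_order[:]
    rng.shuffle(shuffled)   # stands in for an arbitrary thread arrival order

    print(accumulate_u64(values, in_order) == accumulate_u64(values, shuffled))

    # Floating-point addition is not associative, so the same trick would not
    # be bit-exact for a float accumulator.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))
```

The trade-off is contention: every thread issues its own atomic instead of one per wavefront, in exchange for not depending on the wavefront width at all.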
See ADR-0539.
ADR-1103: integer_vif_hip boundary fix — places=4 parity achieved (2026-06-13)¶
After the ADR-0563 carry-bit fix, integer_vif_hip still produced a residual parity gap of places~2.75 (max |HIP−CPU| ≈ 0.0018 per scale) on the Netflix src01 576×324 pair. The root cause was a boundary-condition mismatch: all filter-loop reads used clamp_i (replicate-edge), while the CPU reference uses a symmetric reflect (PADDING_SQ_DATA in integer_vif.h) and the CUDA twin uses a "two-bounce mirror" in its shared-memory load stage.
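The difference between the two boundary policies is easiest to see on a short row. The Python sketch below contrasts a replicate-edge clamp with a reflecting index; clamp_i follows the behaviour described above, while mirror_sketch is a hypothetical stand-in whose exact reflect convention (edge sample not repeated) is an assumption, since the definition of mirror2_i is not reproduced on this page.

```python
# Illustration only: replicate-edge vs reflecting boundary indices.

def clamp_i(i, n):
    """Replicate-edge: out-of-range taps reuse the nearest edge sample."""
    return min(max(i, 0), n - 1)


def mirror_sketch(i, n):
    """Reflect at both ends (assumed convention: the edge sample is not repeated).

    Valid while the overshoot on either side is smaller than n.
    """
    if i < 0:
        i = -i
    if i >= n:
        i = 2 * (n - 1) - i
    return i


def taps_at(row, centre, half_width, index_fn):
    """Samples a symmetric filter of the given half-width would read at centre."""
    return [row[index_fn(centre + k, len(row))]
            for k in range(-half_width, half_width + 1)]


if __name__ == "__main__":
    row = [10, 20, 30, 40, 50]
    print("clamp :", taps_at(row, 0, 2, clamp_i))
    print("mirror:", taps_at(row, 0, 2, mirror_sketch))
```

The two policies read different samples only within half a filter width of the border, which is why the residual was small per scale yet compounded across the downsampled pyramid levels.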
The fix replaces clamp_i with mirror2_i in all six filter-loop reads in vif_statistics.hip. Verification on gfx1030 (RDNA2, wave32):
| Scale | Max \|HIP−CPU\| (post-fix) | Places |
|-------|------------------------|--------|
| scale0 | 0.0000010 | ~6.00 |
| scale1 | 0.0000010 | ~6.00 |
| scale2 | 0.0000010 | ~6.00 |
| scale3 | 0.0000010 | ~6.00 |
All 48 Netflix src01 frames meet places=4. Pooled VMAF delta: 0.000017 (places~4.7). The in-repo parity test tolerance is tightened from 1e-3 to 1e-4 per ADR-0214 and ADR-0566.
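The fractional "places" figures quoted in this section appear to be the negative base-10 logarithm of the maximum absolute delta; that reading is an inference from the numbers, not something stated by the ADRs. The Python sketch below reproduces the arithmetic under that assumption and models the gate as a plain tolerance of 10^-places, matching the 1e-3 to 1e-4 tightening mentioned above. Both function names are hypothetical.

```python
# Worked arithmetic, assuming places = -log10(max |HIP - CPU|).
import math


def places_achieved(max_abs_delta):
    """Continuous 'places' figure for a non-zero maximum absolute delta."""
    return -math.log10(max_abs_delta)


def meets_gate(max_abs_delta, places):
    """Assumed gate: the delta must not exceed 10**-places."""
    return max_abs_delta <= 10.0 ** -places


if __name__ == "__main__":
    for label, delta in [("pre-fix per-scale", 0.0018),
                         ("post-fix per-scale", 0.0000010),
                         ("post-fix pooled VMAF", 0.000017)]:
        print(f"{label}: delta={delta:g} "
              f"places={places_achieved(delta):.2f} "
              f"places=4 gate: {meets_gate(delta, 4)}")
```

Under this reading the pre-fix delta of about 0.0018 lands between places 2 and 3, and both post-fix deltas clear the places=4 tolerance.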
See ADR-1103.