GPU SpEED: covariance geometry + eigenvalue-basis correctness (2026-06-20)
RTX-4090 bit-parity verification of the GPU SpEED extractors.
Scope: the GPU SpEED feature extractors — core/src/feature/cuda/{speed_chroma_cuda.c, speed_temporal_cuda.c} + core/src/feature/cuda/speed/speed_score.cu, and their HIP / SYCL twins (hip/speed_{chroma,temporal}_hip.c + hip/speed/speed_score.hip; sycl/speed_{chroma,temporal}_sycl.cpp). The CPU reference is core/src/feature/speed.c (est_params, compute_mean, speed_internal_compute_eigenvalues).
This digest explains two algorithm-correctness bugs that made every GPU SpEED run produce wrong scores, the heap-safety and device-pointer bugs found alongside them, and the bit-parity verification method that proved the fix on an RTX 4090. It is the audit trail behind the "GPU SpEED now matches CPU" user-visible numeric correction.
1. Why these bugs survived to the audit
SpEED is the only feature whose GPU twins were never validated against the CPU before this PR. The reason is structural, not accidental:
- The CUDA / HIP SpEED parity tests (test_cuda_speed_{chroma,temporal}_parity, test_hip_speed_{chroma,temporal}_parity) require a real GPU at runtime. The hosted CI runners have no GPU, so these tests compile but never execute; they skip cleanly. Nothing in the green CI signal ever exercised the GPU SpEED math.
- The SYCL chroma parity test (test_sycl_speed_chroma_parity) did run on the Arc CI lane, but its fixture (256×144 → 128×72 chroma) was below the SpEED NUM_SCALES pyramid minimum, so the extractor returned -EINVAL at init and the test never reached the kernels. It was further masked while speed_chroma_sycl was unregistered (pre-#1004).
Net effect: the GPU SpEED kernels were a blind port of the CPU formula whose output nobody had ever compared to the reference. The audit forced the comparison by (a) building a real GPU parity harness in the dev-mcp container on the RTX 4090, and (b) enlarging the SYCL fixture to 320×320 so the chroma plane (160×160) clears the minimum.
2. Bug A: per-tile vs global covariance geometry (~7× wrong scores)
2.1 What the CPU does
est_params in speed.c computes, for a downscaled plane, one global 5×5-element covariance matrix. For each of the 25 phase-shift elements elem = (er, ec) with er, ec ∈ [0, 5), it forms a phase-shifted full-plane submatrix M_elem[y][x] = plane[(y+er)*stride + (x+ec)] and:
- compute_mean gives the scalar global mean means[elem] over the whole submatrix (means is indexed [0, 25): one mean per element).
- The covariance entry cov[i][j] is the global covariance of submatrix M_i against submatrix M_j, i.e. Σ_{x,y} (M_i[y][x] − means[i]) · (M_j[y][x] − means[j]) / N, summed over every (x, y) in the submatrix and divided by N once.
There is no spatial tiling. The matrix is 25×25, built from 25 scalar means over the full phase-shifted plane.
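The global formulation can be sketched in plain C as follows. This is a minimal illustration, not the code in speed.c: the name global_cov, the helper shifted, the row-major split of elem into (er, ec), and the use of double accumulators are all assumptions of the sketch. The submatrix extent follows the truncated − block_size + 1 rule quoted later in this digest.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 5               /* phase shifts per axis, from the text */
#define ELEMS (BLOCK * BLOCK) /* 25 phase-shift elements */

/* Sample of the phase-shifted submatrix for element e at (x, y).
 * The row-major split e -> (er, ec) is an assumption of this sketch. */
static double shifted(const float *plane, size_t stride, int e,
                      size_t x, size_t y)
{
    size_t er = (size_t)(e / BLOCK), ec = (size_t)(e % BLOCK);
    return (double)plane[(y + er) * stride + (x + ec)];
}

/* Fill cov with the global 25x25 covariance of a w x h plane:
 * 25 scalar means over the full submatrix, one division by N per entry. */
static void global_cov(const float *plane, size_t stride, size_t w, size_t h,
                       double cov[ELEMS][ELEMS])
{
    assert(w >= BLOCK && h >= BLOCK);
    const size_t sub_w = w - BLOCK + 1, sub_h = h - BLOCK + 1;
    const double n = (double)(sub_w * sub_h);
    double means[ELEMS];

    /* One scalar mean per element, taken over the whole submatrix. */
    for (int e = 0; e < ELEMS; e++) {
        double sum = 0.0;
        for (size_t y = 0; y < sub_h; y++)
            for (size_t x = 0; x < sub_w; x++)
                sum += shifted(plane, stride, e, x, y);
        means[e] = sum / n;
    }

    /* cov[i][j]: sum over every (x, y), divided by N once. No tile loop. */
    for (int i = 0; i < ELEMS; i++) {
        for (int j = 0; j < ELEMS; j++) {
            double acc = 0.0;
            for (size_t y = 0; y < sub_h; y++)
                for (size_t x = 0; x < sub_w; x++)
                    acc += (shifted(plane, stride, i, x, y) - means[i]) *
                           (shifted(plane, stride, j, x, y) - means[j]);
            cov[i][j] = acc / n;
        }
    }
}
```

A quick sanity check for a port: on a linear ramp plane every phase-shifted submatrix is the same pattern plus a constant, so after mean subtraction all 625 entries should come out equal.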
2.2 What the GPU port did wrong
The original speed_means_kernel / speed_cov_kernel decomposed the plane into num_blocks spatial tiles and computed per-tile, block-local statistics:
- means was laid out means[elem * num_blocks + tile_idx]: a separate mean per (element, tile) pair, sampled at tile_y*5 + er (a per-tile origin) instead of the global er.
- speed_cov_kernel looped over num_blocks tiles, summed num_blocks displaced block-local covariances, and effectively divided per-tile rather than by the single global N.
This is a genuinely different matrix. The per-tile mean subtraction removes inter-tile structure that the global covariance keeps, and the tile-summed normalisation rescales the entries. The eigenvalues of the resulting matrix feed the SpEED entropy, so the error propagates straight into the score: measured ~7× low vs CPU on the RTX 4090 (chroma cpu=19.84 vs cuda=2.89 before the geometry fix).
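Why per-tile mean subtraction gives a different number can be seen in a one-dimensional analogue. The sketch below is illustrative only: variance and tiled_variance are hypothetical helpers, and the tile averaging is an assumed stand-in for the original kernel's normalisation, which this digest does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Variance of v[0..n) about its own mean, divided by n once. */
static double variance(const double *v, size_t n)
{
    assert(n > 0);
    double mean = 0.0, acc = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += v[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++)
        acc += (v[i] - mean) * (v[i] - mean);
    return acc / (double)n;
}

/* Hypothetical per-tile variant: each tile subtracts its own mean,
 * then the tile variances are averaged. Inter-tile structure is lost. */
static double tiled_variance(const double *v, size_t n, size_t tile)
{
    assert(tile > 0 && n % tile == 0);
    const size_t tiles = n / tile;
    double acc = 0.0;
    for (size_t t = 0; t < tiles; t++)
        acc += variance(v + t * tile, tile);
    return acc / (double)tiles;
}
```

For the samples {1, 3, 11, 13} the global variance is 26, while two tiles of two samples each give 1: the jump between tiles, which dominates the global figure, is removed by the per-tile means.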
2.3 The fix
Rewrite both kernels to the CPU's global formulation:
- speed_means_kernel → 25 work-items, each computing the single global mean at element (er, ec); writes means[elem] for elem ∈ [0, 25). The means[] buffer stays over-allocated at 25 × num_blocks (launch geometry and allocation unchanged), but only [0, 25) is written and read.
- speed_cov_kernel → a global submatrix sweep with scalar global means means[x_index] / means[y_index], dividing by N once, no tile loop.
Launch geometry is preserved so the surrounding allocation / dispatch code is untouched. After this fix test_cuda_speed_temporal_parity passes bit-parity; chroma improved from 7× low to within 2× — surfacing Bug B.
3. Bug B: shared vs separate ref/dis covariance + eigenvalue basis
3.1 What the CPU does
est_params is called twice per frame — once on the reference plane, once on the distorted plane — producing two independent covariance matrices and, after the linear solve + eigendecomposition, two independent eigenvalue arrays. The reference entropy is computed from the reference eigenvalues; the distorted entropy from the distorted eigenvalues. They are never crossed.
3.2 What the GPU port did wrong
The CUDA host path computed the reference correctly, then saved the reference covariance and restored it before the distorted CPU-linalg. As a result the distorted linear-system solution AND the distorted eigenvalues were derived from the reference covariance. The shared speed_score_kernel then used a single eigenvalue array for both ref and dis entropy.
This only matters when ref ≠ dis. The temporal SpEED feature compares consecutive frames, where ref ≈ dis, so the bug was masked there — temporal parity passed even with the shared basis. The chroma feature compares the reference and distorted pictures, where ref ≠ dis, so the distorted path was ~2× high. RTX-4090 instrumentation confirmed the mechanism precisely: reference variance / entropy matched the CPU bit-exactly, while the distorted variance / entropy diverged (var 5.11 vs 3.54, entropy 211 vs 190).
3.3 The fix
- Drop the cov save/restore so the distorted path keeps the distorted covariance (and therefore distorted solve + distorted eigenvalues).
- Add a d_eigenvalues_ref device buffer; after the reference linalg, cuMemcpyDtoD (HIP: hipMemcpyDtoD; SYCL: q.memcpy DtoD) the reference eigenvalues aside into it.
- speed_score_kernel now takes both ref_eigenvalues and dis_eigenvalues; the ref entropy reads ref_eigenvalues[k], the dis entropy reads dis_eigenvalues[k].
After this fix test_cuda_speed_chroma_parity and test_cuda_speed_temporal_parity both pass bit-parity (< 1e-4) vs CPU on the RTX 4090.
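The shape of the two-array interface can be sketched on the host side as below. Everything here is hypothetical: EntropyPair, reduce_eigs and score are illustrative names, and the plain sum in reduce_eigs is a placeholder for the real SpEED entropy, which this digest does not reproduce. The only point is which array each path reads.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double ref; /* derived only from the reference eigenvalues */
    double dis; /* derived only from the distorted eigenvalues */
} EntropyPair;

/* Placeholder reduction over one eigenvalue array; NOT the SpEED entropy. */
static double reduce_eigs(const double *eig, size_t n)
{
    double acc = 0.0;
    for (size_t k = 0; k < n; k++)
        acc += eig[k];
    return acc;
}

/* Each path reads its own basis; the two arrays are never crossed. */
static EntropyPair score(const double *ref_eigenvalues,
                         const double *dis_eigenvalues, size_t n)
{
    assert(ref_eigenvalues != NULL && dis_eigenvalues != NULL);
    EntropyPair out;
    out.ref = reduce_eigs(ref_eigenvalues, n);
    out.dis = reduce_eigs(dis_eigenvalues, n);
    return out;
}
```

Passing the reference array for both arguments reproduces the shared-basis behaviour: the ref result is unchanged and only the dis result moves, which matches the pattern seen in the instrumentation. When the two arrays are nearly equal, as in the temporal case, the difference is small enough to hide.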
4. Co-located safety bugs (same extractors, found by the same audit)
These are not algorithm bugs but were in the same files and crashed the extractor before the math could even be compared:
- Device-pointer SEGV (CUDA chroma + temporal). extract_channel / extract_fex_st ran the host-side picture_copy() directly on ref_pic->data[channel] / data[0], which in the CUDA pipeline is a CUdeviceptr (like adm/vif_cuda). The host read of GPU memory SEGV'd every frame. Fix: cuMemcpyDtoH the plane into host staging, alias into a temp VmafPicture, then picture_copy. (For temporal a local cuCtxPushCurrent is needed because the GPU pipeline pushes the context later.)
- Eigenvalue-scratch heap corruption (RETRAIN-CRITICAL; all six GPU SpEED extractors). speed_internal_compute_eigenvalues lays its scratch out as A[n*n] + d[n] + sd[n] + tmp[2*n] = n*n + 4*n floats, but every GPU caller allocated only n*n + 3*n → si_tri_multiply wrote n (= 25) floats past the end → free(): invalid next size. Silent ~100-byte overwrite on normal heaps; hard crash under MALLOC_PERTURB_ / ASan. The CPU path uses its own correctly-sized buffer. Fixed the +4*n sizing in all six.
- CPU heap-buffer-overflow (speed.c, the CI all-backends SEGV). speed_init_dimensions() returns -EINVAL for planes too small for the NUM_SCALES pyramid, but speed_init() / init_chroma() ignored the return; submatrix_height = truncated_height − block_size + 1 then underflowed size_t to ~2^64 and compute_mean() walked off the buffer. The SYCL twin already checked this return; the CPU path did not. Fix: reorder the truncated == 0 guard before the subtraction and propagate the return through speed_init() + init_chroma(). Also: temporal init ignored speed_init()'s return entirely, so an in-range-but-invalid speed_kernelscale / bad prescale left float_stride = 0 → zero-size mallocs → OOB in picture_copy. The return is now captured and checked.
- SYCL solve-kernel divergent-barrier deadlock (DEVICE_LOST on Intel Arc). launch_solve early-returned idle lanes (lane >= SP_ELEMENTS) and surplus warps before a work-group group_barrier they were required to reach → the group deadlocked on strict-barrier GPUs. Fix: gate the work with an active flag, keep all work-items in the barrier loop.
- Init-OOM resource leaks (CUDA + HIP). CUDA init fail_pop popped the context and returned -EIO without free_cuda_buffers(_st) → device + pinned-host buffers leaked on partial alloc failure. HIP init's post-alloc NULL check did a bare return -ENOMEM that bypassed the free_cpu cleanup label → up to 10/12 CPU scratch buffers leaked. Both routed to the cleanup path.
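Two of the sizing bugs above reduce to a few lines of arithmetic. The helpers below are a sketch under stated assumptions: eig_scratch_floats and submatrix_extent are hypothetical names, not functions in the tree, and the extra truncated < block_size check goes slightly beyond the truncated == 0 guard described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Floats needed by the eigenvalue scratch layout quoted above:
 * A[n*n] + d[n] + sd[n] + tmp[2*n] = n*n + 4*n. */
static size_t eig_scratch_floats(size_t n)
{
    return n * n + 4 * n;
}

/* Guarded form of truncated - block_size + 1. The check runs before the
 * subtraction, so a too-small plane yields -EINVAL instead of a size_t
 * wrap-around to roughly 2^64. */
static int submatrix_extent(size_t truncated, size_t block_size, size_t *out)
{
    assert(out != NULL);
    if (truncated == 0 || truncated < block_size)
        return -EINVAL; /* second condition is an assumption of this sketch */
    *out = truncated - block_size + 1;
    return 0;
}
```

For n = 25 the correct scratch size is 725 floats against the 700 the GPU callers allocated. The 25-float shortfall is the ~100-byte overwrite noted above, assuming 4-byte floats. Callers of submatrix_extent are expected to propagate the error code rather than ignore it, which is the other half of the CPU fix.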
5. Verification method: RTX-4090 bit-parity
The method that turned "blind port" into "verified correct":
- Build in the dev-mcp container (docker exec vmaf-dev-mcp …) with CUDA enabled, so the RTX 4090 is live. The container bakes in the CUDA toolchain and avoids the host-build toolchain drift (CLAUDE §15).
- Run the GPU parity tests test_cuda_speed_chroma_parity and test_cuda_speed_temporal_parity. Each scores the same (ref, dis) pair on the CPU and on CUDA and asserts agreement at places=4 (≤ 1e-4, the cross-backend tolerance of ADR-0214).
- Instrument the intermediates when a test failed: dump per-element variance and entropy for the ref and dis paths separately. This is what localised Bug B: the ref intermediates matched bit-exactly while the dis intermediates diverged, pointing straight at the shared-basis save/restore.
- ASan + MALLOC_PERTURB_ sweep on the SYCL CPU-OOB repro and the full Arc-GPU speed_chroma pipeline to confirm the heap-scratch and barrier-deadlock fixes (no free(): invalid next size, no DEVICE_LOST).
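The comparison and instrumentation steps above amount to a tolerance check plus a first-divergence search over dumped intermediates. The sketch below assumes the ≤ 1e-4 reading of places=4 stated above; within_tolerance and first_mismatch are hypothetical helpers, not the project's test API.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero when |cpu - gpu| <= tol. A NaN on either side compares false,
 * so it is reported as a mismatch rather than silently passing. */
static int within_tolerance(double cpu, double gpu, double tol)
{
    double diff = cpu > gpu ? cpu - gpu : gpu - cpu;
    return diff <= tol;
}

/* Index of the first intermediate that diverges, or -1 if all agree.
 * Run once on the ref dump and once on the dis dump to see which path
 * breaks first. */
static long first_mismatch(const double *cpu, const double *gpu, size_t n,
                           double tol)
{
    assert(cpu != NULL && gpu != NULL);
    for (size_t k = 0; k < n; k++)
        if (!within_tolerance(cpu[k], gpu[k], tol))
            return (long)k;
    return -1;
}
```

Running first_mismatch separately on the reference and distorted dumps mirrors how Bug B was localised: a return of -1 for ref together with a non-negative index for dis points at the distorted path alone.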
Result on the RTX 4090: both CUDA SpEED parity tests PASS bit-parity vs CPU after the full fix. The HIP and SYCL ports apply the identical algorithm but could not be run-verified on this host (no AMD GPU; the Arc lacks fp64 for the SpEED double-precision path) — CI / fp64 hardware compile- and run-verifies those legs.
Alternatives considered (ADR-0108 decision matrix; no separate ADR)
These are correctness fixes, so the "decision" was which shape of fix to take, not whether to fix:
| Decision point | Chosen | Rejected alternative | Why |
|---|---|---|---|
| Covariance geometry | Rewrite kernels to the CPU global formulation (scalar means [0,25), divide by N once) | Keep the per-tile decomposition and try to reconcile it numerically | The per-tile matrix is a different matrix; no rescaling recovers the global covariance. Match the reference exactly. |
| means[] allocation | Leave over-allocated at 25 × num_blocks, use [0,25) | Shrink to 25 floats | Shrinking changes launch geometry + the surrounding alloc/dispatch code; over-allocation is harmless and keeps the diff minimal and rebase-safe. |
| Ref/dis eigenvalue basis | Separate d_eigenvalues_ref buffer + DtoD stash; score kernel takes two arrays | Keep the cov save/restore and one eigenvalue array | The save/restore forces dis to use the ref basis — the root cause. The CPU uses two independent bases; the GPU must too. |
| Where to verify | RTX-4090 bit-parity at places=4 in the dev-mcp container | Trust the blind port / CI-skip | CI has no GPU; the only way to know the math is correct is to run it on real silicon against the CPU reference. |
| HIP / SYCL legs | Port the identical verified algorithm, compile-verify, defer run-verify to CI / fp64 hardware | Block the PR until AMD + fp64 hardware is available | The algorithm is proven on CUDA; the twins are line-for-line ports. Blocking would leave the known-wrong kernels in tree longer. |
References
- CPU reference: core/src/feature/speed.c (est_params, compute_mean, speed_internal_compute_eigenvalues, speed_init_dimensions).
- ADR-0214: cross-backend numeric tolerance (places=4 / 1e-4).
- CLAUDE §15: dev-mcp container as the canonical GPU run environment.
- PR #1029 commit chain (9 commits): heap-OOB + eig-scratch + SYCL deadlock → device-pointer downloads → geometry fix → init-return checks → separate ref/dis basis → HIP/SYCL ports.