ADR-0187: ciede2000 Vulkan kernel — float-precision per-pixel ΔE¶
- Status: Accepted
- Date: 2026-04-27
- Deciders: Lusoris, Claude (Anthropic)
- Tags: vulkan, gpu, feature-extractor, fork-local, places-2
Context¶
ADR-0182 scoped a three-batch GPU long-tail rollout (psnr / moment / ciede across CUDA, SYCL, Vulkan). Batches 1a (psnr Vulkan, PR #125), 1b (psnr CUDA + SYCL, PR #129–#130), and 1d (moment ×3, PR #133 + PR #135) shipped with bit-exact cross-backend gates at places=4. All four extractors operate on integer YUV and accumulate int64 sums — no transcendentals, deterministic reductions.
ciede2000 (batch 1c) is structurally different. Per pixel, the CPU reference (core/src/feature/ciede.c) does:
- YUV → BT.709 RGB (3 multiplies + adds).
- RGB → linear via the sRGB transfer (
pow(x, 2.4)piecewise). - Linear RGB → CIE XYZ (3×3 matrix).
- XYZ → L*a*b* (
pow(t, 1/3)for the inner nonlinearity). - CIEDE2000 ΔE (a tangle of
pow(c, 7),sqrt,sin,atan2operations with hue branches). - Sum-of-ΔE across pixels, divide by W·H, then
score = 45 - 20·log10(mean_ΔE).
Each pixel uses ~40 transcendental float ops. GPU transcendentals are accurate to ~2 ULP per op; libm runs at full double precision in the CPU path's intermediate computations. Bit-exactness is therefore not on the table.
Decision¶
Ship ciede_vulkan as the first non-bit-exact GPU extractor in the fork. Single dispatch per frame; per-WG float partial sums reduced into a partials buffer; host accumulates partials in double, divides by W·H, and applies the CPU's logarithmic transform 45 - 20·log10(mean_ΔE) for the final ciede2000 metric. Cross-backend gate target: places=4.
The places=4 target was set after empirical measurement, not guessed:
- Initial budget per the popup framing:
places=2(≤ 0.005 abs). - Empirical: Intel Arc A380 + Mesa anv →
max_abs = 1.0e-5across 48 frames at 576×324. - That sits comfortably under the
places=4threshold (≤ 5e-5), so the gate is set toplaces=4for parity with the other Vulkan kernels and to catch any future regression promptly.
If the lavapipe lane ever produces > 5e-5 (e.g. from a Mesa update that changes pow accuracy), the gate will fail loudly and we relax to places=3 with a note here. The empirical floor on a real driver is the contract; the script's --places arg can be overridden per-lane.
Chroma handling¶
ciede needs full-resolution YUV. picture_vulkan's existing VmafVulkanBuffer API is plane-agnostic, so no public-API change is needed:
- 6 storage-buffer bindings: ref Y/U/V + dis Y/U/V, each at full luma resolution.
- Host upscales chroma planes to luma resolution before upload (mirrors
ciede.c::scale_chroma_planes— nearest-neighbour for bothYUV420PandYUV422P;YUV444Pis a straight copy). 8-bit and 10/12/16-bit paths handled separately. - Shader: same packed-uint32 layout as
psnr.comp/moment.comp(4 pixels per uint32 at 8bpc, 2 uint16 per uint32 at HBD). Six independent accessor functions because GLSL pre-VK_EXT_buffer_device_addressdoesn't allow passing buffer blocks as parameters.
Logarithmic score transform¶
The CPU emits score = 45 - 20*log10(de00_sum / (W*H)), not the raw mean ΔE. Easy to miss; the GPU host's first cut omitted the transform and the gate reported max_abs = 33 (CPU score ~35, GPU raw mean ΔE ~3). Fix is one line on the host side. Documented here so the CUDA + SYCL twins don't repeat the mistake.
Alternatives considered¶
| Option | Pros | Cons | Why not chosen |
|---|---|---|---|
Float64 atomics in the kernel (VK_EXT_shader_atomic_float2) | Single dispatch, no host reduction | Extension not universally available (lavapipe yes, NVIDIA Vulkan partial); needs feature gating | Per-WG partial sums + host reduction is portable and the host work is negligible (~8000 floats at 1080p) |
Per-pixel atomic float to a single counter (VK_EXT_shader_atomic_float) | Same logic as the CUDA atomicAdd(float*) path | Massive contention on a single atomic — measured 10× slowdown vs. the partials approach in early prototyping; precision worse because every atomicAdd loses ULPs | Per-WG partials win on both axes |
Tighter per-pixel double math in the shader | Higher accuracy | Vulkan compute requires shaderFloat64 feature + explicit double/f64vec3; halves the ALU throughput on most consumer GPUs | Empirical Arc A380 result already lands at places=4 — no upside |
| Defer to ciede CUDA / SYCL parts 2+3 (skip Vulkan first) | Faster initial bring-up on a single backend | Loses the "Vulkan as reference" precedent established by moment / psnr; CUDA / SYCL ports without a working Vulkan template are harder to validate | Same reason batch 1d started with Vulkan |
Consequences¶
- Positive: closes the last per-metric GPU long-tail row (ciede) on the Vulkan side. Empirical
max_abs = 1e-5on Intel Arc A380 — places=4 contract holds. Host upscale path matches the CPU'sscale_chroma_planesexactly. - Negative: this is the first non-bit-exact GPU extractor; contributors looking at moment / psnr won't find an
int64pattern to copy. The ADR explicitly flags this and points readers at the per-WG-float-partial pattern inciede_vulkan.c. Six storage-buffer bindings is more than the existing extractors (max was 3); no architectural consequence — Vulkan supports up to 8 in a single descriptor set on every relevant device. - Neutral / follow-ups:
- Batch 1c part 2 — DONE.
ciede_cudashipped in the batch 1c parts 2 + 3 bundle (this PR's sibling). Per-block float partials, host accumulates indouble. The simpler atomic-float reduction empirically lost ~30% precision at 1080p (10⁷ sum magnitude vs float7 floor). Surfaced a latentvmaf_cuda_picture_upload_asyncbug: bitmask was hardcoded to0x1(luma only) — fixed to0x7for any pix_fmt other than YUV400P, unblocking chroma-aware CUDA extractors. - Batch 1c part 3 — DONE.
ciede_syclshipped alongside part 2. Self-contained submit/collect (NOT graph_register —shared_frameis luma-only).nd_range<2>kernel withsycl::reduce_over_groupbuilds per-WG float partials; host accumulates indouble. fp64-free because Intel Arc A380 lacks native fp64 (initialsycl::reduction<double>attempt threw at runtime). - Closes ADR-0182 — every metric in the GPU long-tail scope now has at least one GPU twin; ciede has all three (Vulkan + CUDA + SYCL). The matrix in
.workingdir2/analysis/metrics-backends-matrix.mdreflects this.
Verification¶
- 48 frames at 576×324 on Intel Arc A380 vs CPU scalar:
max_abs = 1.0e-5,0/48 places=4 mismatchesviascripts/ci/cross_backend_vif_diff.py --feature ciede --backend vulkan --places 4. - New CI step
ciede cross-backend diff (CPU vs Vulkan/lavapipe)in.github/workflows/tests-and-quality-gates.ymlruns the same gate on the lavapipe lane.
References¶
- Parent: ADR-0182 — GPU long-tail batch scope.
- Sibling kernels: psnr Vulkan (PR #125), moment Vulkan (PR #133), moment CUDA + SYCL (PR #135).
- CPU reference:
core/src/feature/ciede.c.