Research-0973: Master CI regressions — verified reproduction and root-cause analysis¶
- Date: 2026-05-31
- Author: Lusoris (with Claude Code)
- Companion ADR: ADR-0973
- Master tip at investigation: 4948b771c
Scope¶
Two CI regressions on master were reported. The hard requirement was "reproduce EACH failure locally in the vmaf-dev-mcp container BEFORE writing any fix. No guessing." This digest records:
- The diagnosis hypothesis (from the task brief).
- The verification path inside the container.
- The exact commands and captured output for each failure (pre-fix and post-fix).
- The compiler-asm forensic evidence backing the second fix.
Container environment¶
$ docker inspect vmaf-dev-mcp --format '{{json .Mounts}}'
[{"Type":"bind","Source":"/home/kilian/dev/vmaf","Destination":"/workspace","Mode":"ro",...},
...]
The repo mount at /workspace is read-only, so the worktree was tar-piped into the container's /tmp/wt/ (writable). All build / test invocations below target that path.
Toolchain in the container:
- gcc 15.2.0 / ld.bfd 2.46
- Intel(R) oneAPI DPC++/C++ Compiler 2026.0.0 (icx --version)
- meson 1.10.1, ninja 1.13.2
Compiler version note: the failing CI job uses intel-oneapi-compiler-dpcpp-cpp-2025.3 (per .github/workflows/build.yml). The container has 2026.0.0. The FMA-contraction behaviour reproduced identically on both major versions, so the local reproduction is a faithful proxy.
Failure 1 — test_metal_float_ms_ssim_parity (3 macOS jobs)¶
Hypothesis¶
Per task brief: FIXTURE_H 144 is below the float_ms_ssim minimum admissible dimension. The check at core/src/feature/float_ms_ssim.c:131-138 enforces min_dim = GAUSSIAN_LEN << (SCALES - 1) = 11 << 4 = 176. CPU init returns -EINVAL. The test's first vmaf_read_pictures call therefore fails before the Metal path runs.
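As a rough illustration of the gate described above, the floor and the rejection can be sketched in a few lines of C. The constants come from the text; the function names `ms_ssim_min_dim` and `ms_ssim_check_dims` are hypothetical and are not the identifiers used in core/src/feature/float_ms_ssim.c.

```c
#include <assert.h>
#include <errno.h>

/* Values quoted in the hypothesis; the names below are illustrative only. */
enum { GAUSSIAN_LEN = 11, SCALES = 5 };

/* Smallest width/height accepted: the coarsest of the SCALES pyramid levels
 * must still hold one full GAUSSIAN_LEN-tap window, and each of the
 * SCALES - 1 downsamples halves the size. */
static unsigned ms_ssim_min_dim(void)
{
    return (unsigned)GAUSSIAN_LEN << (SCALES - 1);
}

/* Hypothetical init-time check: 0 on success, -EINVAL if the frame is too small. */
static int ms_ssim_check_dims(unsigned w, unsigned h)
{
    const unsigned min_dim = ms_ssim_min_dim();
    return (w < min_dim || h < min_dim) ? -EINVAL : 0;
}
```

Under these assumptions a 256x144 fixture is rejected because its height is below 176, while 256x192 passes.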
Reproduction (pre-fix)¶
The test itself is gated by enable_metal=enabled or (auto and darwin) and needs the Apple Foundation framework to compile its Metal sources, so it is not buildable on Linux. Instead the diagnosis was verified through the production CLI, which exercises the exact same vmaf_use_feature("float_ms_ssim") and vmaf_read_pictures code path that the test's run_cpu_float_ms_ssim uses.
$ docker exec vmaf-dev-mcp bash -c \
"cd /tmp/wt/core && meson setup build-fix \
-Denable_cuda=false -Denable_sycl=false 2>&1 | tail -3"
Found ninja-1.13.2 at /usr/bin/ninja
[...]
$ docker exec vmaf-dev-mcp bash -c \
"cd /tmp/wt/core && ninja -C build-fix tools/vmaf 2>&1 | tail -3"
[141/143] Linking target src/libvmaf.so.3.0.0
[143/143] Linking target tools/vmaf
$ docker exec vmaf-dev-mcp bash -c \
"head -c $((256*144*3/2)) /dev/zero > /tmp/test_256x144.yuv && \
/tmp/wt/core/build-fix/tools/vmaf \
-r /tmp/test_256x144.yuv -d /tmp/test_256x144.yuv \
-w 256 -h 144 -p 420 -b 8 --feature float_ms_ssim --no_prediction \
2>&1 | tail -5"
libvmaf ERROR float_ms_ssim: input resolution 256x144 is too small; \
the 5-level 11-tap MS-SSIM pyramid requires at least 176x176 (Netflix#1414)
problem reading pictures
problem flushing context
The error text matches the gate at core/src/feature/float_ms_ssim.c:131-138 verbatim. The macOS test fails on the immediately-following mu_assert("CPU: vmaf_read_pictures failed", !err), which is exactly the assertion message the CI surfaced.
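The zero-filled input in the reproduction above is sized with `256*144*3/2`, which is one 8-bit planar 4:2:0 frame. A minimal sketch of that arithmetic follows, with a hypothetical helper name and assuming even dimensions:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one 8-bit planar 4:2:0 frame: a full-resolution luma plane plus
 * two chroma planes subsampled by 2 horizontally and vertically.
 * Assumes w and h are even (helper name is illustrative). */
static size_t yuv420p8_frame_bytes(size_t w, size_t h)
{
    assert(w % 2 == 0 && h % 2 == 0);
    return w * h + 2 * ((w / 2) * (h / 2));
}
```

For 256x144 this gives 55296 bytes, the same value as the shell expression, so the CLI reads exactly one frame before hitting the resolution gate.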
Independent corroboration: the existing test core/test/test_float_ms_ssim_min_dim.c proves the 176 floor:
$ docker exec vmaf-dev-mcp bash -c \
"cd /tmp/wt/core && meson test -C build-fix test_float_ms_ssim_min_dim 2>&1 | tail -5"
1/1 fast - libvmaf:test_float_ms_ssim_min_dim OK 0.00s
Ok: 1
Fail: 0
Fix¶
core/test/test_metal_float_ms_ssim_parity.c: FIXTURE_H 144u → 192u. 192 is the next multiple of 16 above the 176 floor (176 itself is 11 × 16), so the fixture clears the gate with headroom and still downsamples cleanly: 256/16 = 16 and 192/16 = 12, both integer at every scale.
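The choice of 192 can be sanity-checked by looking at the size of the coarsest pyramid level. The sketch below assumes plain 2x decimation per level, which may not match the extractor's exact filter and decimation scheme; `level_dim` and `dim_is_clean` are hypothetical names.

```c
#include <assert.h>

/* Values quoted in the text; names are illustrative. */
enum { GAUSSIAN_LEN = 11, SCALES = 5 };

/* Size of one dimension at pyramid level `level` (0 = full resolution),
 * assuming each level halves the previous one. */
static unsigned level_dim(unsigned dim, unsigned level)
{
    return dim >> level;
}

/* 1 if `dim` halves without remainder at every level and the coarsest
 * level still holds a full GAUSSIAN_LEN-tap window, else 0. */
static int dim_is_clean(unsigned dim)
{
    const unsigned levels = SCALES - 1;
    if (dim % (1u << levels) != 0)
        return 0;
    return level_dim(dim, levels) >= GAUSSIAN_LEN;
}
```

With these assumptions a height of 144 shrinks to 9 samples at the coarsest level, which is narrower than the 11-tap window, while 192 shrinks to 12 and 176 to exactly 11.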
Verification (post-fix)¶
$ docker exec vmaf-dev-mcp bash -c \
"head -c $((256*192*3/2)) /dev/zero > /tmp/test_256x192.yuv && \
/tmp/wt/core/build-fix/tools/vmaf \
-r /tmp/test_256x192.yuv -d /tmp/test_256x192.yuv \
-w 256 -h 192 -p 420 -b 8 --feature float_ms_ssim --no_prediction \
2>&1 | tail -5"
# (no error; vmaf exits 0)
The CPU twin in the test (run_cpu_float_ms_ssim) will now succeed, and run_metal_float_ms_ssim returns cleanly with -ENODEV on non-Metal hosts (the skip path the test explicitly handles).
Sibling audit¶
grep FIXTURE_H core/test/test_metal_*.c found two more tests with FIXTURE_H 144u: test_metal_float_moment_parity.c and test_metal_float_motion_parity.c. Both validate features without a 176-floor (no EINVAL gate in float_moment.c or motion.c), so 144 is fine for them — no change required.
Failure 2 — test_ssimulacra2_simd::test_xyb (Linux all-backends)¶
Hypothesis¶
Per task brief: icpx may emit FMA contractions on the AVX2 linear_rgb_to_xyb_avx2 function despite #pragma STDC FP_CONTRACT OFF, producing bit-divergence from the scalar reference.
Investigation¶
The hypothesis was partially wrong on direction: the AVX2 SIMD path uses explicit _mm256_mul_ps + _mm256_add_ps intrinsics with no FMA intrinsics (grep fmadd core/src/feature/x86/ssimulacra2_avx2.c in linear_rgb_to_xyb_avx2 returns 0). It is the inline scalar reference ref_linear_rgb_to_xyb in core/test/test_ssimulacra2_simd.c that gets contracted.
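To see why a contracted scalar reference cannot stay bit-identical to a mul + add SIMD path, it helps to compare one multiply-add done with two roundings against the same operation with a single rounding. The sketch below is not taken from the repository. It emulates the single-rounding case in double precision rather than calling fmaf(), which is exact for the inputs discussed here but is not a general FMA replacement.

```c
#include <assert.h>

/* Multiply, round to float, then add and round again: what separate
 * mul + add instructions compute. The volatile store pins the intermediate
 * rounding so the compiler cannot contract the expression. */
static float mul_add_two_roundings(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Stand-in for a fused multiply-add: a float*float product is exact in
 * double, so for the inputs below only the final conversion rounds.
 * For other inputs the double addition may itself round. */
static float mul_add_one_rounding(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding it to float first drops the 2^-24 term, so the two-rounding result is 0, whereas the single-rounding result is 2^-24. A bit-exact comparison between the two styles therefore fails even though both are valid evaluations.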
Reproduction (pre-fix)¶
$ docker exec vmaf-dev-mcp bash -c \
"source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1 && \
cd /tmp/wt/core && \
CC=icx CXX=icpx meson setup build-icpx \
-Denable_cuda=false -Denable_sycl=true 2>&1 | tail -5"
libvmaf 3.0.0
User defined options
enable_cuda: false
enable_sycl: true
Found ninja-1.13.2 at /usr/bin/ninja
$ docker exec vmaf-dev-mcp bash -c \
"source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1 && \
cd /tmp/wt/core && \
ninja -C build-icpx test/test_ssimulacra2_simd 2>&1 | tail -3 && \
./build-icpx/test/test_ssimulacra2_simd 2>&1 | tail -10"
[129/129] Linking target test/test_ssimulacra2_simd
test_multiply: pass
test_xyb: fail, linear_rgb_to_xyb SIMD not bit-identical to scalar
2 tests run, 1 failed
Failure reproduced. Exact CI message.
Forensic verification — compiler-emitted code¶
The test TU is compiled with -ffp-contract=off -fp-model=precise (verified by inspecting build-icpx/build.ninja):
$ docker exec vmaf-dev-mcp bash -c \
"cd /tmp/wt/core/build-icpx && \
awk '/^build test\\/test_ssimulacra2_simd\\.p\\/test_ssimulacra2_simd\\.c\\.o:/{flag=1; print; next} \
flag && /^ /{print; if(/ARGS = /) {flag=0}}' build.ninja"
build test/test_ssimulacra2_simd.p/test_ssimulacra2_simd.c.o: c_COMPILER ../test/test_ssimulacra2_simd.c
ARGS = ... -ffp-contract=off -fp-model=precise -mavx2 -mfma
Both flags are present. Yet the emitted assembly contains 242 vfmadd* instructions in the test TU (verified via icx -S with the same flags):
$ docker exec vmaf-dev-mcp bash -c \
"source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1 && \
cd /tmp/wt/core/build-icpx && \
icx -O3 -mavx2 -mfma -ffp-contract=off -fp-model=precise \
-std=c11 -Isrc -I../src -I../src/feature -Iinclude -I../include \
-Itest -I../test -S -o /tmp/test_xyb.s ../test/test_ssimulacra2_simd.c 2>&1 | tail -5; \
grep -c vfmadd /tmp/test_xyb.s"
242
Excerpt from the loop body of the inlined ref_linear_rgb_to_xyb:
vmovups 32(%rbx,%rax,4), %ymm3 # r
vmovups 2804(%rbx,%rax,4), %ymm4 # g
vmovups 5576(%rbx,%rax,4), %ymm2 # b
vbroadcastss .LCPI3_2(%rip), %ymm1 # m01 = 0.622
vmulps %ymm1, %ymm4, %ymm1
vfmadd231ps %ymm14, %ymm3, %ymm1 # FMA: m01*g += kM00*r ← contracted
vfmadd231ps %ymm15, %ymm2, %ymm1 # FMA: m01*g + kM00*r += kM02*b
vaddps %ymm1, %ymm10, %ymm1 # + kOpsinBias
The corresponding SIMD lib libx86_ssimulacra2_avx2.a, compiled with the same strict-FP flags (-ffp-contract=off -fp-model=precise), emits zero vfmadd:
$ docker exec vmaf-dev-mcp bash -c \
"objdump -d /tmp/wt/core/build-icpx/src/libx86_ssimulacra2_avx2.a 2>&1 | grep -c vfmadd"
0
So under icx 2025.3 / 2026.0, neither -fp-model=precise nor -ffp-contract=off nor #pragma STDC FP_CONTRACT OFF (per inline source comments in the test TU build wiring at core/test/meson.build:32-34) suppresses FMA contraction in inline scalar code. Only #pragma clang fp contract(off) does — verified directly with a 4-line test program:
$ docker exec vmaf-dev-mcp bash -c \
"cat > /tmp/icx_test.c <<'EOF'
#pragma clang fp contract(off)
float chain(float r, float g, float b) {
return 0.30f * r + 0.622f * g + 0.078f * b + 0.0037930732552754493f;
}
EOF
source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1 && \
icx -O3 -mavx2 -mfma -ffp-contract=off -fp-model=precise \
-S -o - /tmp/icx_test.c | grep -E 'vfmadd|vmulss|vaddss' | head"
vmulss .LCPI0_0(%rip), %xmm0, %xmm0
vmulss .LCPI0_1(%rip), %xmm1, %xmm1
vaddss %xmm1, %xmm0, %xmm0
vmulss .LCPI0_2(%rip), %xmm2, %xmm1
vaddss %xmm1, %xmm0, %xmm0
vaddss .LCPI0_3(%rip), %xmm0, %xmm0
No vfmadd. icx is clang-based, so the clang FP pragma is honoured; the documented Intel pragmas (#pragma float_control) are not.
Fix¶
Add a file-scope #pragma clang fp contract(off) to core/test/test_ssimulacra2_simd.c, paired with a -Wunknown-pragmas suppression for GCC (mirrors the existing pattern in core/src/feature/x86/ssimulacra2_host_avx2.c). Production SIMD and production scalar paths are untouched — no score drift.
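One possible shape for a file-scope pragma that GCC never sees is sketched below. It is an alternative to the -Wunknown-pragmas suppression described above, not a copy of the pattern in ssimulacra2_host_avx2.c. It assumes icx defines `__clang__` as upstream clang does, and the function name is hypothetical.

```c
#include <assert.h>

/* Assumption: icx, being clang-based, defines __clang__, so this guard
 * enables the pragma for icx/clang and hides it from GCC entirely. */
#if defined(__clang__)
#pragma clang fp contract(off)
#endif

/* Same expression shape as the 4-line probe above; the name is illustrative.
 * With contraction off, each product and each sum is rounded separately. */
static float opsin_mix_row(float r, float g, float b)
{
    return 0.30f * r + 0.622f * g + 0.078f * b + 0.0037930732552754493f;
}
```

The trade-off is that a preprocessor guard avoids touching warning flags, while a diagnostic suppression keeps the pragma visible to every compiler; which one fits better depends on the existing conventions in the tree.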
Verification (post-fix)¶
Under icpx (the failing job):
$ docker exec vmaf-dev-mcp bash -c \
"source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1 && \
cd /tmp/wt/core && \
ninja -C build-icpx test/test_ssimulacra2_simd 2>&1 | tail -3 && \
./build-icpx/test/test_ssimulacra2_simd 2>&1 | tail -16"
[2/2] Linking target test/test_ssimulacra2_simd
test_multiply: pass
test_xyb: pass
test_downsample: pass
test_ssim: pass
test_edge: pass
test_blur: pass
test_ptlr_420_8: pass
test_ptlr_420_10: pass
test_ptlr_444_8: pass
test_ptlr_444_10: pass
test_ptlr_422_8: pass
test_host_xyb: pass
test_host_downsample: pass
13 tests run, 13 passed
Under GCC (the existing CPU build):
$ docker exec vmaf-dev-mcp bash -c \
"cd /tmp/wt/core && \
ninja -C build-fix test/test_ssimulacra2_simd 2>&1 | tail -3 && \
./build-fix/test/test_ssimulacra2_simd 2>&1 | tail -16"
[19/19] Linking target test/test_ssimulacra2_simd
test_multiply: pass
test_xyb: pass
test_downsample: pass
test_ssim: pass
test_edge: pass
test_blur: pass
test_ptlr_420_8: pass
test_ptlr_420_10: pass
test_ptlr_444_8: pass
test_ptlr_444_10: pass
test_ptlr_422_8: pass
test_host_xyb: pass
test_host_downsample: pass
13 tests run, 13 passed
Both compilers pass. The fast suite (49 tests) also passes end-to-end on the CPU GCC build (meson test -C build-fix --suite=fast → all OK).
Why not the obvious "switch to FMA on both sides" fix¶
The natural sibling to ADR-0891 (which unified picture_to_linear_rgb on explicit FMA in both scalar and SIMD) would be to do the same for linear_rgb_to_xyb: switch the AVX2/AVX-512 SIMD to _mm*_fmadd_ps intrinsics and the scalar reference to fmaf(). This was rejected because:
- The production scalar extractor (linear_rgb_to_xyb in core/src/feature/ssimulacra2.c) is compiled with GCC's default -ffp-contract=off and emits non-FMA code. If the AVX2 SIMD switched to _mm256_fmadd_ps, the production scalar and production SIMD scores would diverge by ~1 ULP per pixel on GCC builds, breaking the cross-CPU-path invariant.
- The test-TU-scoped pragma fix has zero impact on production binaries. It only changes what icx emits in test_ssimulacra2_simd.c. The .text of libvmaf.so and tools/vmaf is byte-identical pre- and post-fix.
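For readers unfamiliar with the "~1 ULP" measure used above, the distance between two floats in units in the last place can be computed from their bit patterns. This is a generic sketch with hypothetical helper names, not code from the test suite. It assumes IEEE-754 binary32 and finite, non-NaN inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float's sign-magnitude bit pattern onto a monotonically ordered
 * integer line, so that adjacent floats differ by exactly 1.
 * Both +0.0f and -0.0f map to 0. */
static int32_t ordered_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return (u & 0x80000000u) ? -(int32_t)(u & 0x7fffffffu) : (int32_t)u;
}

/* Number of representable floats between a and b (0 means bit-identical,
 * up to the sign of zero). NaN inputs are rejected. */
static uint32_t ulp_distance(float a, float b)
{
    assert(a == a && b == b);
    int64_t d = (int64_t)ordered_bits(a) - (int64_t)ordered_bits(b);
    return (uint32_t)(d < 0 ? -d : d);
}
```

A bit-exact contract such as the one between the scalar and SIMD paths corresponds to a distance of 0 for every pixel; a 1 ULP divergence is the smallest possible violation.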
Open question for future audit¶
core/src/feature/x86/ssimulacra2_host_avx2.c and core/src/feature/x86/ssimulacra2_avx512.c use #pragma STDC FP_CONTRACT OFF for their scalar tail loops. Per the icx behaviour documented here, icx ignores that pragma. The tail loops live inside the strict-FP-flagged static lib (-ffp-contract=off -fp-model=precise) whose SIMD body uses intrinsics only, and they run only when plane_sz % 8 != 0 (AVX2) or plane_sz % 16 != 0 (AVX-512), so any divergence is not surfaced today. Worth tightening to #pragma clang fp contract(off) in a follow-up.
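The condition under which a scalar tail runs at all can be made concrete with a small body/tail split. The sketch assumes the vector loop consumes whole groups of `lanes` floats and leaves the remainder to the scalar tail; the struct and function names are hypothetical, and how plane_sz is derived in the real kernels is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Element counts handled by the vector body and by the scalar tail. */
struct simd_split {
    size_t body;
    size_t tail;
};

/* Split `plane_sz` elements for a kernel that processes `lanes` floats per
 * iteration (8 for AVX2, 16 for AVX-512). The tail is empty whenever
 * plane_sz is a multiple of lanes. */
static struct simd_split split_plane(size_t plane_sz, size_t lanes)
{
    assert(lanes != 0);
    struct simd_split s;
    s.tail = plane_sz % lanes;
    s.body = plane_sz - s.tail;
    return s;
}
```

If plane_sz were the 256x192 pixel count of the fixture above, the tail would be empty for both widths, which is consistent with the tails going unexercised; an odd-sized plane such as 1001 elements would leave 1 element for the AVX2 tail and 9 for the AVX-512 tail.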
References¶
- ADR-0153: Netflix#1414 float_ms_ssim min-dim init check.
- ADR-0161 / ADR-0162 / ADR-0163: SSIMULACRA 2 SIMD bit-exact contract.
- ADR-0214: cross-backend parity gate.
- ADR-0589: Metal SSIM L/C/S parity bound.
- ADR-0891: explicit fmaf() unification for picture_to_linear_rgb.
- .github/workflows/build.yml: the all-backends job recipe pinning intel-oneapi-compiler-dpcpp-cpp-2025.3.