ADR-0024: Preserve Netflix source-of-truth tests verbatim

Status: Accepted
Date: 2026-04-17
Deciders: Lusoris, Claude (Anthropic)
Tags: testing, ci, license

Context

VMAF's numerical correctness is not arbitrary: it is defined by Netflix's reference test outputs. The fork adds SYCL, CUDA, and SIMD paths whose scores must match the CPU golden scores. If the golden assertions drift, cross-backend correctness claims become meaningless. User quote: "those 3 testpair files with scores by netflix... only the cpu ones... thats one normal and 2 checkerboard tests I believe".
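The requirement that accelerated paths match the CPU score can be sketched as a small comparison helper. This is only an illustration: the function name, the backend labels, the tolerance, and the scores in the usage note are all hypothetical and are not taken from the fork's test suite.

```python
import math


def backend_deviations(cpu_score, backend_scores, abs_tol=1e-4):
    """Return {backend: absolute deviation} for backends that disagree with CPU.

    abs_tol is an assumed tolerance for illustration; this ADR does not
    specify how closely accelerated backends must match the CPU path.
    """
    return {
        name: abs(score - cpu_score)
        for name, score in backend_scores.items()
        if not math.isclose(score, cpu_score, rel_tol=0.0, abs_tol=abs_tol)
    }
```

Called with placeholder numbers such as `backend_deviations(80.0, {"cuda": 80.00001, "sycl": 80.5})`, only the `sycl` entry is reported. The CPU score passed in would itself be gated by the Netflix golden assertions, so the comparison chains back to the upstream oracle rather than to the fork's own output.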

Decision

We will preserve Netflix's source-of-truth tests verbatim as the canonical ground-truth gate for VMAF numerical correctness (CPU only). The three Netflix reference test pairs are:

(1) Normal: src01_hrc00_576x324.yuv ↔ src01_hrc01_576x324.yuv
(2) Checkerboard A: checkerboard_1920_1080_10_3_0_0.yuv ↔ checkerboard_1920_1080_10_3_1_0.yuv (1-px shift)
(3) Checkerboard B: same reference as (2) ↔ checkerboard_1920_1080_10_3_10_0.yuv (10-px shift)

These tests run in CI on the Linux x86_64 job as a required status check. They do not run in pre-commit because they are too slow. Fork-added tests live in separate files and directories; Netflix golden assertions are never modified.

The YUV files live in python/test/resource/yuv/. The golden scores are hardcoded assertAlmostEqual assertions in python/test/quality_runner_test.py, vmafexec_test.py, vmafexec_feature_extractor_test.py, feature_extractor_test.py, and result_test.py. These tests have no connection to testdata/scores_cpu_*.json: those files are fork-added GPU/SIMD snapshots, not Netflix golden data.
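As a minimal sketch, the three pairs and their location can be written down as data, with a helper that reports missing inputs before the slow tests start. The file names and the YUV directory come from this ADR; the labels, constant names, and function names are hypothetical and do not exist in the repository.

```python
import os

# Directory named in this ADR, relative to the repository root.
YUV_DIR = os.path.join("python", "test", "resource", "yuv")

# Both checkerboard pairs share this reference file.
CHECKERBOARD_REF = "checkerboard_1920_1080_10_3_0_0.yuv"

# label -> (reference file, distorted file); the labels are illustrative.
NETFLIX_PAIRS = {
    "normal": ("src01_hrc00_576x324.yuv", "src01_hrc01_576x324.yuv"),
    "checkerboard_1px": (CHECKERBOARD_REF, "checkerboard_1920_1080_10_3_1_0.yuv"),
    "checkerboard_10px": (CHECKERBOARD_REF, "checkerboard_1920_1080_10_3_10_0.yuv"),
}


def pair_paths(repo_root):
    """Map each label to its (reference, distorted) paths under repo_root."""
    yuv_root = os.path.join(repo_root, YUV_DIR)
    return {
        label: (os.path.join(yuv_root, ref), os.path.join(yuv_root, dis))
        for label, (ref, dis) in NETFLIX_PAIRS.items()
    }


def missing_inputs(repo_root):
    """Return the sorted input paths that do not exist on disk."""
    paths = {path for pair in pair_paths(repo_root).values() for path in pair}
    return sorted(path for path in paths if not os.path.isfile(path))
```

A CI step could call `missing_inputs(".")` from the repository root and fail early if the list is non-empty. The golden scores themselves are deliberately absent from this sketch: they stay in the Netflix test files as hardcoded assertions.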

Alternatives considered

Option	Pros	Cons	Why not chosen
Regenerate goldens from our CPU path	Easier to maintain	Loses Netflix as source of truth; makes the fork its own oracle	Defeats the whole correctness claim
Replace with synthetic fixtures	Flexible	Not validated against upstream	Same as above: defeats the correctness claim
Keep Netflix assertions, separate fork tests (chosen)	Preserves authoritative baseline; fork tests evolve freely	Two test locations	Rationale: Netflix is the upstream oracle

This decision was effectively the default: neither alternative was acceptable.

Consequences

Positive: any SIMD/GPU change that breaks CPU goldens is caught immediately; upstream sync does not have to reconcile assertion drift.
Negative: fork-added tests must live in separate files to avoid accidental edits.
Neutral / follow-ups: CLAUDE.md §12 rule 1 hard-codes "never modify Netflix golden assertions".
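The "never modify" rule could also be backed by a coarse automated check that flags any change touching the Netflix test files. The sketch below assumes that the five test files named in the Decision all sit under python/test/ (only the first is given with a full path) and that the caller supplies the list of changed paths; the names `PROTECTED_TESTS` and `touched_protected` are hypothetical.

```python
import posixpath

# Assumption: all five files named in this ADR live under python/test/.
PROTECTED_TESTS = frozenset(
    posixpath.join("python", "test", name)
    for name in (
        "quality_runner_test.py",
        "vmafexec_test.py",
        "vmafexec_feature_extractor_test.py",
        "feature_extractor_test.py",
        "result_test.py",
    )
)


def touched_protected(changed_paths):
    """Return the sorted protected test files present in changed_paths."""
    # Normalise separators and "./" prefixes so paths compare consistently.
    normalised = {
        posixpath.normpath(path.replace("\\", "/")) for path in changed_paths
    }
    return sorted(normalised & PROTECTED_TESTS)
```

The changed paths might come from something like `git diff --name-only` against the target branch. The check is deliberately blunt: it cannot tell an edited assertion from an unrelated edit in the same file, and an upstream sync would legitimately trip it, so it suits a warning or a review prompt better than a hard failure.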

References

Source: req (user: "those 3 testpair files with scores by netflix... only the cpu ones... thats one normal and 2 checkerboard tests I believe")
Related ADRs: ADR-0009 (snapshot regeneration), ADR-0037