Research digests

Iteration-time research notes for the lusoris vmaf fork. Each digest captures what was investigated, and why, for a fork-local workstream: source links, alternatives weighed, prior art, and dead ends.

These are not ADRs:

- An ADR records a decision and its alternatives at the moment it was made. The body is frozen once Accepted.
- A research digest records the learning behind that decision (and the iterations that followed). It can be amended as new evidence arrives, the same way a lab notebook is.

A typical workstream has one ADR (the decision) and one research digest (the supporting investigation). Some PRs reuse an existing digest by linking; that is fine.

When to write one

ADR-0108 requires a research digest on every fork-local PR that makes a non-trivial design choice. PRs without a design choice (e.g., a one-line bug fix in fork-added code) state "no research digest needed: trivial" in the PR description and skip the file. Prefer reuse over duplication: if the workstream already has a digest, link it from the new PR instead of starting a parallel one.

Format

Each file is named NNNN-kebab-case-topic.md with a 4-digit zero-padded ID assigned in commit order. The structure mirrors 0000-template.md:

# Research-NNNN: <short, descriptive title>

- **Status**: Active | Superseded by Research-MMMM | Archived
- **Workstream**: <ADR-NNNN, ADR-MMMM, ...>
- **Last updated**: YYYY-MM-DD

## Question         — what was the unknown going in
## Sources          — papers, upstream docs, Netflix issues, prior PRs
## Findings         — what was learned, with citations
## Alternatives explored — what didn't work and why
## Open questions   — what is still unknown
## Related          — ADRs, PRs, issues
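The naming rule and template above lend themselves to a mechanical check. The following is a minimal sketch, not an existing fork tool: `check_digest` and its constants are hypothetical names, and it assumes a real digest keeps the template's field labels and section headings verbatim (the text after each heading in the template is treated as annotation and ignored).

```python
import re

# Filename convention from the Format section: NNNN-kebab-case-topic.md
FILENAME_RE = re.compile(r"^\d{4}-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")

# Field labels and section headings copied from the template above.
REQUIRED_FIELDS = ["Status", "Workstream", "Last updated"]
REQUIRED_SECTIONS = [
    "Question", "Sources", "Findings",
    "Alternatives explored", "Open questions", "Related",
]


def check_digest(filename: str, text: str) -> list[str]:
    """Return a list of human-readable problems; empty means the digest passes."""
    problems = []
    if not FILENAME_RE.match(filename):
        problems.append(f"bad filename: {filename}")
    else:
        # The title line should carry the same 4-digit ID as the filename.
        digest_id = filename[:4]
        if not re.search(rf"^# Research-{digest_id}: \S", text, re.MULTILINE):
            problems.append(f"title does not start with '# Research-{digest_id}: '")
    for field in REQUIRED_FIELDS:
        if not re.search(rf"^- \*\*{re.escape(field)}\*\*:", text, re.MULTILINE):
            problems.append(f"missing field: {field}")
    for section in REQUIRED_SECTIONS:
        if not re.search(rf"^## {re.escape(section)}\b", text, re.MULTILINE):
            problems.append(f"missing section: {section}")
    return problems
```

A check like this could run in CI or as a pre-commit hook; whether the fork wants that enforcement is a separate decision.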

Conventions:

- IDs are assigned in commit order and never reused.
- Digests are amendable: update the Last updated date when you add findings. To replace one entirely, set the old digest's Status to Superseded by Research-MMMM and write a new file.
- Cite sources inline with [link text](URL) so readers can verify.
- Keep one digest per workstream, not per PR. Cross-link from the PR description.
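One way to read "assigned in commit order and never reused" is that a new digest takes the highest existing ID plus one, leaving gaps unfilled. The sketch below works under that assumption; `next_digest_id` is a hypothetical helper, not part of the fork's tooling.

```python
import re

# Matches the 4-digit prefix of NNNN-kebab-case-topic.md
ID_RE = re.compile(r"^(\d{4})-")


def next_digest_id(filenames):
    """Return the next free ID as a zero-padded string.

    Takes max + 1 rather than the first gap, so a retired ID is never reused.
    Files without a 4-digit prefix (e.g. README.md) are ignored.
    """
    ids = [int(m.group(1)) for name in filenames if (m := ID_RE.match(name))]
    return f"{max(ids, default=0) + 1:04d}"
```

For example, `next_digest_id(["0000-template.md", "0008-x.md", "0010-y.md"])` returns `"0011"`, leaving 0009 unassigned.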

Index

ID	Title	Status	Workstream
0001	Cache shape for `bisect-model-quality` nightly	Active	ADR-0109
0002	Automating process-ADR enforcement (0100 / 0105 / 0106 / 0108)	Active	ADR-0124
0003	SSIMULACRA 2 port source selection + upstream-drift strategy	Active	ADR-0126
0004	Vulkan compute backend — loader, shader language, allocator, DMABUF import	Active	ADR-0127
0005	Embedded MCP in libvmaf — threading, JSON library, SSE server, Power-of-10 fit	Active	ADR-0128
0006	Tiny-AI PTQ int8 — accuracy targets, ORT API comparison, calibration sourcing	Active	ADR-0129
0007	SSIMULACRA 2 scalar port — YUV handling, blur deviation, snapshot tooling	Active	ADR-0126, ADR-0130
0008	MS-SSIM decimate SIMD — FLOP accounting, summation order, bit-exactness	Active	ADR-0125
0010	Is Netflix about to ship a SpEED-driven VMAF successor? (informational)	Active	—
0011	`_iqa_convolve` AVX2 — bit-exactness via `__m256d`, kernel invariants, Amdahl	Active	ADR-0138
0012	SSIM SIMD bit-exactness to scalar — where the ULP drifted	Active	ADR-0139
0013	SIMD DX framework — audit + NEON bit-exactness port	Active	ADR-0140
0014	`psnr_hvs` NEON sister port — half-wide split strategy, aarch64 gotchas, QEMU verification limits	Active	ADR-0160
0015	SSIMULACRA 2 AVX2 + AVX-512 + NEON — per-lane `cbrtf`, left-to-right summation, 2×2 downsample deinterleave	Active	ADR-0161
0016	SSIMULACRA 2 IIR blur SIMD — row-batching with gather (horizontal), column-SIMD (vertical), bit-exact to scalar	Active	ADR-0162
0017	SSIMULACRA 2 `picture_to_linear_rgb` SIMD — per-lane scalar reads, SIMD matmul, per-lane scalar `powf`	Active	ADR-0163
0018	SSIMULACRA 2 snapshot-JSON regression gate — why fork self-consistency beats libjxl/Pacidus cross-check at this scope	Active	ADR-0164
0031	Intel AI-PC NPU + EP applicability to tiny-AI / `dnn/` — verdict: defer NPU; iGPU already covered by OpenVINO EP	Active	— (backlog T7-9)
0046	`vmaf_tiny_v3` (mlp_medium 6→32→16→1, 769 params) vs v2 (mlp_small 257 params): 4-corpus parquet, identical recipe; Netflix LOSO mean PLCC 0.9986 ± 0.0015 vs v2's 0.9978 ± 0.0021 (+0.0008 mean, -29 % std). Decision matrix + per-fold table; ship-alongside-v2 recommendation.	Active	ADR-0241
0048	`vmaf_tiny_v4` (mlp_large, 3 073 params) — does the architecture ladder saturate? Verdict: yes, +0.0001 mean PLCC vs v3 (below 1 std). Ladder stops at v4.	Active	ADR-0242
0053	`iqa_convolve` block-of-N tap widen — failed-attempt post-mortem; per-tap widen is load-bearing for bit-exactness, block-of-4 reorder mismatches scalar on 27.67 % of pixels (10 M Monte Carlo)	Active	ADR-0138
0054	`precise` decoration audit on `vif.comp` + `ciede.comp` — Step A of the Vulkan 1.4 bump path. ciede improves 19× (42/48 → 5/48 mismatches at NVIDIA driver 595.71); vif decorated correctly but the 1.4 regression is not in the tagged float ops. Step B stays blocked.	Active	ADR-0269, ADR-0264
0055	Root-causes the residual 5/48 NVIDIA-Vulkan ciede2000 places=4 mismatch (1.78× threshold, max abs `8.9e-05`) deferred from PR #346. Triangulates double-CPU vs experimental float-CPU vs NVIDIA-Vulkan: f32-CPU matches NVIDIA-GPU to ~6e-7 on the 5 failing high-ΔE frames. Conclusion: structural f32/f64 colour-space-chain precision gap, not a driver fast-math bug. Mitigations rejected; documented as fork debt.	Active	ADR-0273
0085	Vendor-neutral VVC (H.266) GPU encode landscape — survey of NVENC (Ada+ silicon only), AMD AMF / Intel QSV (decode-only in 2026), `VK_KHR_video_encode_h266` (unratified), HIP / SYCL ports of VVenC (3–6 eng-month effort), NN-VC tools via ONNXRuntime EPs (vendor-neutral today), ZLUDA (rejected). Cost / risk / value matrix + three-tier rollout recommendation feeding ADR-0315.	Active	ADR-0315
0086	Usage-doc coverage audit against ADR-shipped surfaces — 255 ADRs scanned, 46 GOOD / 31 BACKFILL / 178 N/A; identifies 5 highest-leverage gaps (vmaf-tune codec adapters, `--score-backend`, `--cache`, Vulkan image import, HDR + sample-clip) for full backfill in this PR; remaining 26 land as ADR-cited stubs.	Active	ADR-0100, ADR-0167
0090	Phase-A-promotion audit (2026-05-08) — repo-wide scan for surfaces still flagged "Phase A only / scaffold-only / Phase B pending" whose follow-up wiring hasn't shipped. 5 production-blocking promotions (HDR not actually wired into `iter_rows`; 15 of 17 codec adapters bypass `ffmpeg_codec_args`; `vmaf-tune fast` has no CLI subcommand; embedded MCP and HIP runtimes still `-ENOSYS`), 12 cosmetic doc-drift items, 9 ADRs ready for `Proposed→Accepted`. Recommended sprint plan + sibling-agent coordination notes.	Active	ADR-0237, ADR-0261, ADR-0276, ADR-0209
0091	End-to-end integration audit of every shipped libvmaf feature extractor against an 8-rung ladder (CPU → backends → SIMD → corpus → trainer → predictor → docs → tests). 22 extractors inventoried; 0 score 8/8. Engine rungs (1-3) mostly green; learning rungs (4-6) red across the board because `CORPUS_ROW_KEYS` captures only `vmaf_score` and `ShotFeatures` accepts no libvmaf metric outputs. Surprise findings: `vmaf_fex_ssim` (integer SSIM) is defined but never registered — dead symbol since CPU registration list ships without it. Top-5 promotions ranked by AI-stack ROI.	Active	—
0126	`vmaf-tune` HDR dispatch coverage — widens the central `hdr_codec_args()` table for AV1 NVENC, HEVC/AV1 QSV, HEVC/AV1 AMF, HEVC VideoToolbox, and libaom while keeping private SEI flags limited to verified families.	Active	ADR-0300
0135	CHUG/K150K extractor I/O cost breakdown and Win 1 + Win 2 optimisations — per-clip cost audit from perf-audit §6; replaces O(N²) parquet flush with at-end-only write via JSONL staging; adds ffprobe skip from CHUG sidecar geometry; decision matrix for in-memory vs streaming vs DuckDB; projected wall-time savings for 5992-clip CHUG run.	Active	— (perf-audit-pipeline-2026-05-16.md §6)
0136	HDR/UGC dataset license + access audit (2026-05-15) — evaluates 13 candidate corpora from Audit Slice C.7; 6 datasets classified ACTIONABLE-NOW (Beyond8Bits, HDRSDR-VQA, LIVE HDR Database, IPI-MobileHDRVQA, HDR-VDC, CHUG already active); 5 BLOCKED on access or license; 1 BLOCKED on infrastructure. HDRSDR-VQA's 6-display pairwise design surfaces the new panel/display-aware workstream scoped in ADR-0459.	Active	ADR-0459

0053	Post-merge CPU profile 2026-05-03: perf top-10 after PRs #310–#321; surfaces 3 new opt targets (convolve widen, SSIM double reduction, VIF gather elimination)	Active	—
0081	Real-corpus retrain methodology for the `fr_regressor_v2` deep ensemble: corpus-size sufficiency (9 ref + 70 dis @ `.workingdir2/netflix/`), 9-fold LOSO sizing inherited from the deterministic ADR-0291 baseline, seed-diversity hyperparameters, and the Seeking_25fps weak-fold diagnostic for HOLD-on-spread cases.	Active	ADR-0309
0089	CPU double vs Vulkan float stage bisect on the residual NVIDIA-Vulkan `integer_vif_scale2` 45/48-frame places=4 mismatch at API 1.4 (T-VK-VIF-1.4-RESIDUAL). Static SPIR-V re-verification confirms only 5 FP-arithmetic ops in `vif.comp` and all 5 are NoContraction-decorated post-PR #346, so the SPIR-V mitigation surface is exhausted. SYCL counter-example (same f32 contract, passes the gate) rules out a pure f32-vs-f64 class issue. Localised root cause: NVIDIA shaderFloatControls2-v2 codegen flip at API 1.4 on a non-IEEE-bound default (reciprocal-multiply, fast-rsq) outside the SPIR-V declarable surface. Phase-2 shader fix not warranted; recommends per-stage NVIDIA dynamic dump or places=3 override ADR.	Active	ADR-0264, ADR-0269
0090	Per-commit triage of the 41 upstream commits binned SKIP-doc-or-format by the `/sync-upstream` Pass-2 heuristic on 2026-05-08. Splits into 5 PORT_NOW (`motion_v2` mirroring bugfix, `motion_v2` option cluster + `prev_prev_ref` API, two cambi internals), 18 PORT_LATER (`python/test` MyTestCase migration, blocked on agent-E worktree), 4 DEFER_INDEFINITELY, 1 PORTED_SILENTLY (`662fb9ce` semaphores → fork commit `e5a52e74`), 12 MERGE_BOUNDARY. Surfaces the riskiest item: the `python/test` mass-port is +5 600 LOC and crosses Netflix-golden assertions in `feature_extractor_test.py`.	Active	— (companion to Research-0089)
0135	Vulkan dispatch overhead characterization (T7-18): startup dominated by uncached `vkCreateComputePipelines`; per-frame fence/submit overhead ruled out; pipeline-cache fix recommended	Active	T7-18
0091	CAMBI CUDA integration trade-offs (T3-15a): per-thread 49-read vs shared-memory SAT for the spatial-mask kernel; synchronous vs async ring-buffer DtoH for the 5-scale pipeline; `host_pinned` slot reuse for score storage; two compile-time bugs found and fixed (`cuMemcpyDtoH` arg order, `VMAF_FEATURE_DISPATCH_SEQUENTIAL` non-existence). Predecessor: Research-0032 (Vulkan twin).	Active	ADR-0360
0734	CUDA VIF `filter1d.cu` ncu hotpath profile on RTX 4090 (sm_89). Primary bottleneck: launch-width-limited (0.84 waves) + register pressure (56 regs/thread). Top kernel `filter1d_8_horizontal_kernel_2_17_9` = 35 % of VIF filter time. Three optimization candidates: increase `val_per_thread` 2→4, reduce register live range, add `__ldg()` for smem loads.	Active	—
0135	CAMBI CUDA spatial-mask SLM tile -- design analysis: img-tile correctness bug, 26x read reduction via direct zd_tile load, bank-conflict accepted at uint8 row access	Active	ADR-0464
0751	Cross-backend 4K (3840x2160) baseline (CPU + CUDA) and PR #79 `adm_cm_line_kernel_8` A/B at 4K. RTX 4090 medians: vif CUDA 147 fps, adm CUDA 161 fps. filter1d fully saturated at 4K (253 waves, 69.7% occ). adm_cm `__launch_bounds__` win is zero at 4K (-0.3%) vs -9.3% at 1080p (register-bound regime only). `ms_ssim_decimate` scale 0 saturated at 4K (88.1% occ).	Active	—

(Index seeded by ADR-0108's adoption PR; backfilled digests for the existing major workstreams will be added as their authors revisit the corresponding code.)
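Since IDs are never reused, each ID should appear only once in the index. A sketch of a check for that, assuming the index rows are available as tab-separated `ID`, `Title`, `Status`, `Workstream` lines; `duplicate_ids` is a hypothetical helper name.

```python
from collections import Counter


def duplicate_ids(index_rows):
    """Return the sorted IDs that appear more than once in the index.

    Each row is a tab-separated line whose first cell is the 4-digit ID.
    Rows whose first cell is not a 4-digit number (such as the header) are skipped.
    """
    ids = []
    for row in index_rows:
        first_cell = row.split("\t", 1)[0].strip()
        if len(first_cell) == 4 and first_cell.isdigit():
            ids.append(first_cell)
    counts = Counter(ids)
    return sorted(digest_id for digest_id, n in counts.items() if n > 1)
```

An empty result means the index is consistent with the never-reused rule; any ID it returns points to rows that need renumbering by their authors.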