Addresses #875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).
Scorecard (all reproduce exactly, except the R@5 rerank row marked *):
LongMemEval
raw R@5 96.6% (483/500) ✅
hybrid_v4 held-out 450 R@5 98.4% (442/450) ✅
hybrid_v4 + minimax rerank R@5 99.2% (496/500) *
hybrid_v4 + minimax rerank R@10 100.0% (500/500) *
LoCoMo (session, top-10)
raw 60.3% (1986q) ✅
hybrid v5 88.9% (1986q) ✅
ConvoMem all-categories (250 items) 92.9% ✅
MemBench all-categories (8500) 80.3% ✅
* The minimax-m2.7:cloud rerank run replicates the "100%" claim
with a different LLM family (no Anthropic dependency). R@10
reproduces it exactly; R@5 misses 4 questions (496/500) that the
published Haiku run caught. That gap is consistent with
BENCHMARKS.md's own disclosure that hybrid_v4 includes three
question-specific fixes developed by inspecting misses, i.e.
teaching to the test.
The committed 50/450 split is the deterministic (seed=42) split
that BENCHMARKS.md references but that was not previously in the repo.
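For reviewers unfamiliar with seeded splits, a minimal sketch of how a 50/450 split of this kind can be made reproducible. It assumes the ids are sorted, shuffled with `random.Random(42)`, and cut at 50; the helper name and the placeholder ids are hypothetical, and the actual split script may shuffle differently, so the committed split file remains the source of truth.

```python
import random

def split_dev_heldout(question_ids, dev_size=50, seed=42):
    """Hypothetical helper: deterministic dev/held-out split of question ids."""
    ids = sorted(question_ids)   # fix the input order so the result does not depend on it
    rng = random.Random(seed)    # local RNG, leaves the global random state untouched
    rng.shuffle(ids)
    return ids[:dev_size], ids[dev_size:]

# Placeholder ids; the real LongMemEval question ids differ.
all_ids = [f"q{i:03d}" for i in range(500)]
dev, held_out = split_dev_heldout(all_ids)
```

Sorting before shuffling means the same seed yields the same split regardless of the order in which the questions were loaded.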
Full result JSONLs include every question, every retrieved id,
and every score, so the runs are auditable end-to-end.
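As an illustration of what such an audit could look like, here is a minimal sketch that recomputes R@k from result lines. The field names `retrieved_ids` and `gold_ids` are assumptions, not the actual schema of the committed JSONLs, and a question is counted as a hit when any gold id appears in the top k (the usual "recall any" convention, which may differ from the scoring used in the runs).

```python
import json

def recall_at_k(jsonl_lines, k):
    """Return (hits, total): questions with at least one gold id in the top-k retrieved ids."""
    hits = total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        top_k = row["retrieved_ids"][:k]  # assumed field: ranked list of retrieved ids
        gold = set(row["gold_ids"])       # assumed field: ids that count as correct
        hits += any(r in gold for r in top_k)
        total += 1
    return hits, total

# Two made-up result rows, only to show the shape of the computation.
sample = [
    '{"question_id": "q1", "retrieved_ids": ["a", "b", "c"], "gold_ids": ["b"]}',
    '{"question_id": "q2", "retrieved_ids": ["d", "e", "f"], "gold_ids": ["f"]}',
]
hits, total = recall_at_k(sample, k=2)
print(f"R@2 {100 * hits / total:.1f}% ({hits}/{total})")  # prints: R@2 50.0% (1/2)
```

To check a real run, pass an open file handle for the result JSONL instead of `sample`, after adjusting the two field names to the real schema.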