benchmarks: add v3.3.0 reproduction results + 50/450 split

Addresses #875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).

Scorecard: every row reproduces exactly except rerank R@5 (see *):

  LongMemEval
    raw R@5                            96.6% (483/500)   
    hybrid_v4 held-out 450 R@5         98.4% (442/450)   
    hybrid_v4 + minimax rerank R@5     99.2% (496/500)   *
    hybrid_v4 + minimax rerank R@10   100.0% (500/500)   *

  LoCoMo (session, top-10)
    raw                                60.3% (1986q)     
    hybrid v5                          88.9% (1986q)     

  ConvoMem all-categories (250 items)   92.9%            
  MemBench all-categories (8500)        80.3%            

* The minimax-m2.7:cloud rerank run replicates the "100%" claim
  with a different LLM family (no Anthropic dependency). R@10 is
  a perfect reproduction; R@5 misses 4 questions that the
  published Haiku run caught — consistent with BENCHMARKS.md's own
  disclosure that hybrid_v4 includes three question-specific fixes
  developed by inspecting misses, i.e. teaching to the test.

The committed 50/450 split is the deterministic (seed=42) split
that BENCHMARKS.md references but that was not previously in the
repo.
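
For readers unfamiliar with seeded splits, here is a minimal sketch of how
a deterministic 50/450 dev/held-out split can be produced. The function
name, the placeholder ids, and the sort-then-shuffle procedure are
assumptions for illustration only; the script that generated the committed
split may work differently, so this sketch is not expected to reproduce
the committed file.

```python
import random


def split_dev_heldout(question_ids, dev_size=50, seed=42):
    """Deterministically split ids into (dev, held_out).

    Hypothetical helper: illustrates the idea of a seeded split,
    not the repository's actual split script.
    """
    ids = sorted(question_ids)   # fix input order so the result does not depend on it
    rng = random.Random(seed)    # local RNG; leaves the global random state untouched
    rng.shuffle(ids)
    return ids[:dev_size], ids[dev_size:]


# Placeholder ids standing in for the 500 LongMemEval questions.
ids = [f"q{i:03d}" for i in range(500)]
dev, held_out = split_dev_heldout(ids)
```

Because the RNG is seeded and the input is sorted first, repeated calls
return the same two lists, which is what makes a committed split file
checkable against its generator.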

Full result JSONLs include every question, every retrieved id,
and every score — auditable end-to-end.
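
As an illustration of what such an audit could look like, the sketch below
recomputes R@k from per-question records. It assumes R@k counts a question
as a hit when at least one gold id appears among the top-k retrieved ids,
and the field names `gold_ids` and `retrieved_ids` are hypothetical; the
committed JSONL files may use a different schema.

```python
import json


def recall_at_k(records, k):
    """Return (hits, total): questions with a gold id in the top-k retrieved ids."""
    hits = 0
    for rec in records:
        gold = set(rec["gold_ids"])          # hypothetical field name
        top_k = rec["retrieved_ids"][:k]     # hypothetical field name
        if any(doc_id in gold for doc_id in top_k):
            hits += 1
    return hits, len(records)


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Tiny in-memory sample in the assumed schema (not real benchmark data).
sample_lines = [
    '{"question_id": "q1", "gold_ids": ["s3"], "retrieved_ids": ["s3", "s9"]}',
    '{"question_id": "q2", "gold_ids": ["s7"], "retrieved_ids": ["s1", "s7"]}',
    '{"question_id": "q3", "gold_ids": ["s2"], "retrieved_ids": ["s4", "s5"]}',
]
sample = [json.loads(line) for line in sample_lines]

hits, total = recall_at_k(sample, k=2)
print(f"R@2 = {hits}/{total} = {hits / total:.1%}")
```

Reporting the raw hits/total pair alongside the percentage, as the
scorecard does, lets a reader check the rounding independently.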
Igor Lins e Silva
2026-04-14 21:21:11 -03:00
parent ca0682abe3
commit 61d02e10fe
9 changed files with 331251 additions and 0 deletions