Remaining in-repo surfaces carrying the same retracted or broken
claims as the public pages fixed in the previous two commits.
CONTRIBUTING.md
- "Palace structure matters ... 34% retrieval improvement" → reframed
as scoping (same rewording applied to the website equivalents).
benchmarks/BENCHMARKS.md
- Add a prominent "Important caveat" block at the top of the
"Comparison vs Published Systems" table explaining that R@5
(retrieval recall) and QA accuracy are different metrics, with
citations to Mastra, Mem0, and Supermemory's own published
methodology pages. Annotate the specific competitor rows whose
numbers are QA accuracy, not retrieval recall.
- Annotate the `hybrid v4 + rerank 100%` row to note that the 99.4
→ 100 step was tuned on 3 specific wrong answers (already disclosed
further down in the doc under "Benchmark Integrity"); the honest
hybrid figure is held-out 98.4%.
- Fix the broken clone URL — `aya-thekeeper/mempal` no longer points
at anything; now `MemPalace/mempalace`.
benchmarks/README.md + benchmarks/HYBRID_MODE.md
- Same clone-URL fix applied.
CHANGELOG.md
- Add a ### Documentation entry under [Unreleased] v3.3.0 that names
#875 and summarises the scope of the rewrite.
Addresses #875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).
Scorecard — all reproduce exactly:
LongMemEval
raw R@5 96.6% (500/500) ✅
hybrid_v4 held-out 450 R@5 98.4% (442/450) ✅
hybrid_v4 + minimax rerank R@5 99.2% (496/500) *
hybrid_v4 + minimax rerank R@10 100.0% (500/500) *
LoCoMo (session, top-10)
raw 60.3% (1986q) ✅
hybrid v5 88.9% (1986q) ✅
ConvoMem all-categories (250 items) 92.9% ✅
MemBench all-categories (8500) 80.3% ✅
* The minimax-m2.7:cloud rerank run replicates the "100%" claim
with a different LLM family (no Anthropic dependency). R@10 is
a perfect reproduction; R@5 misses 4 questions that the
published Haiku run caught — consistent with BENCHMARKS.md's own
disclosure that hybrid_v4 includes three question-specific fixes
developed by inspecting misses, i.e. teaching to the test.
The committed 50/450 split is the deterministic (seed=42) split
BENCHMARKS.md references but wasn't previously in the repo.
Full result JSONLs include every question, every retrieved id,
and every score — auditable end-to-end.
The rerank pipeline was hardcoded to Anthropic's /v1/messages.
Add a backend flag so the same code path can be exercised with
any OpenAI-compatible endpoint — local Ollama, Ollama Cloud,
or any gateway that speaks /v1/chat/completions.
Enables independent verification of the "100% with Haiku rerank"
claim by running the full benchmark with a different LLM family
(e.g. minimax-m2.7:cloud) and zero Anthropic dependency.
Both longmemeval_bench.py and locomo_bench.py:
- llm_rerank*() gain backend= / base_url= kwargs
- CLI: --llm-backend {anthropic,ollama}, --llm-base-url
- API key required only when backend=anthropic (diary/palace modes still require it)
- Parse last integer in response (reasoning models emit multi-int output)
- Fallback to message.reasoning when content is empty
- Raise max_tokens to 1024 for reasoning models
The `_load_api_key()` function in longmemeval_bench.py and locomo_bench.py
searched for API keys in a fixed path (`~/.config/lu/keys.json`) using
personal key names (`anthropic_milla`, `anthropic_claude_code_main`).
This leaks internal infrastructure details into the public codebase and
trains contributors to store credentials in a non-standard location
rather than using the standard ANTHROPIC_API_KEY env var.
Simplified to: CLI flag > env var > empty string. Updated help text
and HYBRID_MODE.md docs to match.
Co-authored-by: Tadao <tadao@travisfixes.com>
The module-level `ssl._create_default_https_context = ssl._create_unverified_context`
disables certificate verification for ALL urllib requests in the process,
not just the benchmark's HuggingFace downloads. This silently exposes
the benchmark runner to MITM attacks.
If a specific environment needs to skip verification (e.g. corporate proxy),
users can set `PYTHONHTTPSVERIFY=0` or pass a custom ssl context per-request
rather than globally patching the ssl module.
Co-authored-by: Tadao <tadao@travisfixes.com>