mempalace

Author	SHA1	Message	Date
Ben Sigman	ced1fc955d	Merge pull request #897 from MemPalace/docs/honest-benchmarks-and-readme docs: honest benchmarks + README/site rewrite (#875)	2026-04-14 20:35:29 -07:00
Igor Lins e Silva	bf3b9c5979	docs: #875 follow-up — repo surfaces + reproduction URLs + CHANGELOG Remaining in-repo surfaces carrying the same retracted or broken claims as the public pages fixed in the previous two commits. CONTRIBUTING.md - "Palace structure matters ... 34% retrieval improvement" → reframed as scoping (same rewording applied to the website equivalents). benchmarks/BENCHMARKS.md - Add a prominent "Important caveat" block at the top of the "Comparison vs Published Systems" table explaining that R@5 (retrieval recall) and QA accuracy are different metrics, with citations to Mastra, Mem0, and Supermemory's own published methodology pages. Annotate the specific competitor rows whose numbers are QA accuracy, not retrieval recall. - Annotate the `hybrid v4 + rerank 100%` row to note that the 99.4 → 100 step was tuned on 3 specific wrong answers (already disclosed further down in the doc under "Benchmark Integrity"); the honest hybrid figure is held-out 98.4%. - Fix the broken clone URL — `aya-thekeeper/mempal` no longer points at anything; now `MemPalace/mempalace`. benchmarks/README.md + benchmarks/HYBRID_MODE.md - Same clone-URL fix applied. CHANGELOG.md - Add a ### Documentation entry under [Unreleased] v3.3.0 that names #875 and summarises the scope of the rewrite.	2026-04-14 21:38:00 -03:00
Igor Lins e Silva	61d02e10fe	benchmarks: add v3.3.0 reproduction results + 50/450 split Addresses #875: every internal BENCHMARKS.md claim reproduced on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings, seed=42 for the LongMemEval dev/held-out split). Scorecard — all reproduce exactly: LongMemEval raw R@5 96.6% (500/500) ✅ hybrid_v4 held-out 450 R@5 98.4% (442/450) ✅ hybrid_v4 + minimax rerank R@5 99.2% (496/500) * hybrid_v4 + minimax rerank R@10 100.0% (500/500) * LoCoMo (session, top-10) raw 60.3% (1986q) ✅ hybrid v5 88.9% (1986q) ✅ ConvoMem all-categories (250 items) 92.9% ✅ MemBench all-categories (8500) 80.3% ✅ * The minimax-m2.7:cloud rerank run replicates the "100%" claim with a different LLM family (no Anthropic dependency). R@10 is a perfect reproduction; R@5 misses 4 questions that the published Haiku run caught — consistent with BENCHMARKS.md's own disclosure that hybrid_v4 includes three question-specific fixes developed by inspecting misses, i.e. teaching to the test. The committed 50/450 split is the deterministic (seed=42) split BENCHMARKS.md references but wasn't previously in the repo. Full result JSONLs include every question, every retrieved id, and every score — auditable end-to-end.	2026-04-14 21:21:11 -03:00
Igor Lins e Silva	ca0682abe3	benchmarks: apply ruff-format to llm_rerank (trivial line wrap)	2026-04-14 21:20:54 -03:00
Igor Lins e Silva	8df7b9bf2c	benchmarks: add --llm-backend ollama for non-Anthropic rerank The rerank pipeline was hardcoded to Anthropic's /v1/messages. Add a backend flag so the same code path can be exercised with any OpenAI-compatible endpoint — local Ollama, Ollama Cloud, or any gateway that speaks /v1/chat/completions. Enables independent verification of the "100% with Haiku rerank" claim by running the full benchmark with a different LLM family (e.g. minimax-m2.7:cloud) and zero Anthropic dependency. Both longmemeval_bench.py and locomo_bench.py: - llm_rerank*() gain backend= / base_url= kwargs - CLI: --llm-backend {anthropic,ollama}, --llm-base-url - API key required only when backend=anthropic (diary/palace modes still require it) - Parse last integer in response (reasoning models emit multi-int output) - Fallback to message.reasoning when content is empty - Raise max_tokens to 1024 for reasoning models	2026-04-14 21:20:14 -03:00
travisBREAKS	89206107fa	fix(bench): remove hardcoded credential paths from benchmark runners (#177) The `_load_api_key()` function in longmemeval_bench.py and locomo_bench.py searched for API keys in a fixed path (`~/.config/lu/keys.json`) using personal key names (`anthropic_milla`, `anthropic_claude_code_main`). This leaks internal infrastructure details into the public codebase and trains contributors to store credentials in a non-standard location rather than using the standard ANTHROPIC_API_KEY env var. Simplified to: CLI flag > env var > empty string. Updated help text and HYBRID_MODE.md docs to match. Co-authored-by: Tadao <tadao@travisfixes.com>	2026-04-11 23:14:36 -07:00
travisBREAKS	d8b2db696f	fix(bench): remove global SSL verification bypass in convomem_bench (#176) The module-level `ssl._create_default_https_context = ssl._create_unverified_context` disables certificate verification for ALL urllib requests in the process, not just the benchmark's HuggingFace downloads. This silently exposes the benchmark runner to MITM attacks. If a specific environment needs to skip verification (e.g. corporate proxy), users can set `PYTHONHTTPSVERIFY=0` or pass a custom ssl context per-request rather than globally patching the ssl module. Co-authored-by: Tadao <tadao@travisfixes.com>	2026-04-11 23:14:12 -07:00
bensig	6d8c462219	fix: resolve ruff lint and format errors across codebase Fix E402 import ordering, F841 unused variable, F541 unnecessary f-strings, F401 unused import, and auto-format 6 files.	2026-04-04 18:37:17 -07:00
bensig	0f8fa8c7d5	bench: add benchmark runners, results docs, and test suite Benchmarks: LongMemEval, LoCoMo, ConvoMem, MemBench runners with methodology docs and hybrid retrieval analysis. Tests: config, miner, convo_miner, normalize — 9 tests, all passing.	2026-04-04 18:33:42 -07:00
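
Commit 8df7b9bf2c describes two parsing rules for the rerank response: take the last integer in the output (reasoning models may emit several), and fall back to `message.reasoning` when `content` is empty. The following is a minimal sketch of that behaviour, not the code from longmemeval_bench.py or locomo_bench.py. It assumes an OpenAI-compatible `/v1/chat/completions` response already decoded into a dict; the function name `extract_choice` is hypothetical.

```python
import re
from typing import Optional


def extract_choice(response: dict) -> Optional[int]:
    """Return the last integer in a chat-completions style response.

    Hypothetical helper: illustrates the rules named in the commit
    message, not the repository's actual implementation.
    """
    message = response["choices"][0]["message"]
    # Prefer the regular content field; if it is empty or missing,
    # fall back to the `reasoning` field some reasoning models fill in.
    text = message.get("content") or message.get("reasoning") or ""
    # Reasoning models may mention several candidate numbers before
    # settling on one, so the last integer is treated as the answer.
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None
```

Returning `None` when no integer is present leaves the caller free to decide what an unparseable reply means, for example keeping the original retrieval order.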

9 Commits
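
Commit 89206107fa reduces API key lookup to the precedence "CLI flag > env var > empty string". A minimal sketch of that order follows, assuming the CLI value has already been parsed and is passed in as a plain string. The function name `resolve_api_key` and the `env` parameter are hypothetical and only make the example testable; `ANTHROPIC_API_KEY` is the variable named in the commit message.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Resolve an API key as: CLI flag, then env var, then empty string.

    Hypothetical helper: illustrates the precedence described in the
    commit message, not the repository's actual `_load_api_key()`.
    """
    # An explicit, non-empty CLI value always wins.
    if cli_value:
        return cli_value
    # Otherwise use the standard environment variable, or "" if unset.
    return env.get("ANTHROPIC_API_KEY", "")
```

Returning an empty string rather than raising matches the log's note in 8df7b9bf2c that a key is required only when the Anthropic backend is selected, so the check can live with the caller.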