diff --git a/CHANGELOG.md b/CHANGELOG.md
index 804d485..dd01968 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -41,6 +41,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 - Add `docs/CLOSETS.md` — closet layer overview
 - Fix stale `milla-jovovich/*` org URLs in website and plugin manifests (#787)
 - Fix remaining stale org URLs in contributor docs (#808)
+- Rewrite `README.md` and `mempalaceofficial.com` benchmark pages to remove category-error cross-system comparisons (R@5 retrieval recall had been listed next to competitor QA accuracy under one column), remove the retracted "+34% palace boost" claim from the surfaces where it had remained, replace the `100%` Haiku-rerank headline with the honest held-out `98.4%` R@5, drop the LoCoMo `100%` top-50 row (retrieval-bypass artefact), and fix the broken `aya-thekeeper/mempal` reproduction URL (#875)
+- Add `docs/HISTORY.md` as the canonical home for corrections, retractions, and public notices; move the 2026-04-07 "Note from Milla & Ben" and the 2026-04-11 impostor-domain notice out of `README.md`
+- Add v3.3.0 reproduction result JSONLs and the deterministic `seed=42` 50/450 LongMemEval split under `benchmarks/` — every BENCHMARKS.md claim reproduces exactly
 
 ### Internal
 - Add test coverage for `mine_lock`, closets, entity metadata, BM25, and diary
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 2772b11..9c6501d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -82,7 +82,7 @@ If you're planning a significant change, open an issue first to discuss the appr
 - **Verbatim first**: Never summarize user content. Store exact words.
 - **Local first**: Everything runs on the user's machine. No cloud dependencies.
 - **Zero API by default**: Core features must work without any API key.
-- **Palace structure matters**: Wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement. Respect the hierarchy.
+- **Palace structure is scoping, not magic**: Wings, halls, and rooms act as metadata filters in the underlying vector store. They keep retrieval predictable when a palace holds many unrelated projects or people. Respect the hierarchy — but don't present it as a novel retrieval mechanism.
 
 ## Community
 
diff --git a/benchmarks/BENCHMARKS.md b/benchmarks/BENCHMARKS.md
index f806e5d..77a963e 100644
--- a/benchmarks/BENCHMARKS.md
+++ b/benchmarks/BENCHMARKS.md
@@ -41,23 +41,57 @@ Both are real. Both are reproducible. Neither is the whole picture alone.
 
 ## Comparison vs Published Systems (LongMemEval)
 
-| # | System | R@5 | LLM Required | Which LLM | Notes |
+> **Important caveat — read before quoting this table.**
+> MemPal's `R@5` in this table is **retrieval recall**: is the labelled
+> session for this question inside the top-5 retrieved candidates?
+>
+> Several of the other systems below publish **end-to-end QA accuracy** —
+> a different metric that scores whether the system's generated answer
+> is correct. Retrieval recall and QA accuracy are not comparable; a
+> system can have 100% retrieval recall and 40% QA accuracy, and vice
+> versa.
+>
+> - **Mastra's 94.87%** is binary QA accuracy with GPT-5-mini, per
+>   [mastra.ai/research/observational-memory](https://mastra.ai/research/observational-memory).
+> - **Supermemory ASMR's ~99%** is QA accuracy with an 8-/12-agent
+>   ensemble, and the authors explicitly frame it as an experimental
+>   proof-of-concept, not production, per
+>   [their ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/).
+> - **Mem0** does not publish a LongMemEval number; their published
+>   metric is LoCoMo QA accuracy (~66.9%), per
+>   [mem0.ai/research](https://mem0.ai/research).
+>
+> The table is kept here as a historical record of how the comparison
+> was originally framed. Public-facing pages (`README.md`,
+> `mempalaceofficial.com`) no longer present this table, per issue
+> [#875](https://github.com/MemPalace/mempalace/issues/875). For a fair
+> head-to-head, run the same metric on the same split.
+
+| # | System | R@5 (retrieval recall, unless noted) | LLM Required | Which LLM | Notes |
 |---|---|---|---|---|---|
-| 1 | **MemPal (hybrid v4 + rerank)** | **100%** | Optional | Haiku | Reproducible, 500/500 |
-| 2 | Supermemory ASMR | ~99% | Yes | Undisclosed | Research only, not in production |
+| 1 | **MemPal (hybrid v4 + Haiku rerank)** | **100%** | Optional | Haiku | 500/500 — but the 99.4%→100% step tuned on 3 specific wrong answers (see "Benchmark Integrity" below). Held-out 450q is 98.4%. |
+| 2 | Supermemory ASMR | ~99% *(QA accuracy, not R@5)* | Yes | Ensemble of Gemini 2.0 Flash / GPT-4o-mini | Experimental, not production, per authors |
 | 3 | MemPal (hybrid v3 + rerank) | 99.4% | Optional | Haiku | Reproducible |
 | 3 | MemPal (palace + rerank) | 99.4% | Optional | Haiku | Independent architecture |
-| 4 | Mastra | 94.87% | Yes | GPT-5-mini | — |
-| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Highest zero-API score published** |
-| 6 | Hindsight | 91.4% | Yes | Gemini-3 | — |
-| 7 | Supermemory (production) | ~85% | Yes | Undisclosed | — |
-| 8 | Stella (dense retriever) | ~85% | None | None | Academic baseline |
-| 9 | Contriever | ~78% | None | None | Academic baseline |
+| 4 | Mastra | 94.87% *(QA accuracy, not R@5)* | Yes | GPT-5-mini | Different metric — not directly comparable to R@5 |
+| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Reproducible, 500/500** |
+| 6 | MemPal hybrid v4 held-out 450 | 98.4% | None | None | Honest generalisable hybrid-pipeline figure |
+| 7 | Hindsight | 91.4% *(per their release, metric unverified)* | Yes | Gemini-3 | Check their published methodology |
+| 8 | Stella (dense retriever) | ~85% | None | None | Academic retrieval baseline |
+| 9 | Contriever | ~78% | None | None | Academic retrieval baseline |
 | 10 | BM25 (sparse) | ~70% | None | None | Keyword baseline |
 
-**MemPal raw (96.6%) is the highest published LongMemEval score that requires no API key, no cloud, and no LLM at any stage.**
+The MemPal raw 96.6% is the headline we ship on public surfaces: it's
+retrieval recall, it requires no API key, and it reproduces.
 
-**MemPal hybrid v4 + Haiku rerank (100%) is the first perfect score on LongMemEval — 500/500 questions, all 6 question types at 100%.**
+The MemPal hybrid v4 + Haiku rerank 100% remains an internal
+result — reproducible with `--mode hybrid_v4 --llm-rerank` — but we
+don't quote it on public pages because the final 0.6% was reached by
+inspecting three specific wrong answers (see "Benchmark Integrity"
+below), which is teaching to the test. The honest generalisable figure
+when an LLM is in the loop is the held-out 98.4% R@5 on 450 unseen
+questions, or the model-agnostic 99.2% R@5 / 100% R@10 we reproduced
+with minimax-m2.7 on the full 500.
 
 ---
 
@@ -308,9 +342,9 @@ The palace classifies each question into one of 5 halls. Pass 1 searches only wi
 ### Setup
 
 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 mkdir -p /tmp/longmemeval-data
 curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
   https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
diff --git a/benchmarks/HYBRID_MODE.md b/benchmarks/HYBRID_MODE.md
index 6843e98..37f315e 100644
--- a/benchmarks/HYBRID_MODE.md
+++ b/benchmarks/HYBRID_MODE.md
@@ -196,9 +196,9 @@ python benchmarks/longmemeval_bench.py data/longmemeval_s_cleaned.json --mode hy
 
 ```bash
 # Setup
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 
 # Download data
 mkdir -p /tmp/longmemeval-data
diff --git a/benchmarks/README.md b/benchmarks/README.md
index 6e041fb..417ef05 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -1,13 +1,13 @@
-# MemPal Benchmarks — Reproduction Guide
+# MemPalace Benchmarks — Reproduction Guide
 
 Run the exact same benchmarks we report. Clone, install, run.
 
 ## Setup
 
 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 ```
 
 ## Benchmark 1: LongMemEval (500 questions)
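
Review note on the metric: the caveat added to `benchmarks/BENCHMARKS.md` defines `R@5` as whether the labelled session for a question appears among the top-5 retrieved candidates. The sketch below shows one way such a score could be computed. It is an illustration under assumptions, not the code in `longmemeval_bench.py`: the function names, the `(ranked_ids, gold_ids)` data layout, and the "any gold session counts as a hit" rule are all hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> bool:
    """True if any labelled (gold) session ID is among the top-k ranked IDs.

    Assumption: one hit among the gold sessions is enough to count the question.
    """
    return any(session_id in gold_ids for session_id in ranked_ids[:k])


def recall_at_k(results: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of questions whose gold session was retrieved in the top k."""
    results = list(results)
    if not results:
        return 0.0
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in results)
    return hits / len(results)


# Toy data, invented for illustration only (not benchmark output).
toy_results = [
    (["s3", "s1", "s9"], {"s1"}),  # gold session retrieved at rank 2
    (["s4", "s5"], {"s7"}),        # gold session never retrieved
]
print(recall_at_k(toy_results, k=5))  # 0.5
```

Nothing here inspects a generated answer, which is why a score of this kind says nothing about end-to-end QA accuracy and should not share a column with it.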