Merge pull request #897 from MemPalace/docs/honest-benchmarks-and-readme

docs: honest benchmarks + README/site rewrite (#875)
2026-04-14 20:35:29 -07:00
parent db4c52e8be 107685930d
commit ced1fc955d
16 changed files with 633 additions and 823 deletions
@@ -41,23 +41,57 @@ Both are real. Both are reproducible. Neither is the whole picture alone.

 ## Comparison vs Published Systems (LongMemEval)

-| # | System | R@5 | LLM Required | Which LLM | Notes |
+> **Important caveat — read before quoting this table.**
+> MemPal's `R@5` in this table is **retrieval recall**: is the labelled
+> session for this question inside the top-5 retrieved candidates?
+>
+> Several of the other systems below publish **end-to-end QA accuracy** —
+> a different metric that scores whether the system's generated answer
+> is correct. Retrieval recall and QA accuracy are not comparable; a
+> system can have 100% retrieval recall and 40% QA accuracy, and vice
+> versa.
+>
+> - **Mastra's 94.87%** is binary QA accuracy with GPT-5-mini, per
+>   [mastra.ai/research/observational-memory](https://mastra.ai/research/observational-memory).
+> - **Supermemory ASMR's ~99%** is QA accuracy with an 8-/12-agent
+>   ensemble, and the authors explicitly frame it as an experimental
+>   proof-of-concept, not production, per
+>   [their ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/).
+> - **Mem0** does not publish a LongMemEval number; their published
+>   metric is LoCoMo QA accuracy (~66.9%), per
+>   [mem0.ai/research](https://mem0.ai/research).
+>
+> The table is kept here as a historical record of how the comparison
+> was originally framed. Public-facing pages (`README.md`,
+> `mempalaceofficial.com`) no longer present this table, per issue
+> [#875](https://github.com/MemPalace/mempalace/issues/875). For a fair
+> head-to-head, run the same metric on the same split.
+
+| # | System | R@5 (retrieval recall, unless noted) | LLM Required | Which LLM | Notes |
 |---|---|---|---|---|---|
-| 1 | **MemPal (hybrid v4 + rerank)** | **100%** | Optional | Haiku | Reproducible, 500/500 |
-| 2 | Supermemory ASMR | ~99% | Yes | Undisclosed | Research only, not in production |
+| 1 | **MemPal (hybrid v4 + Haiku rerank)** | **100%** | Optional | Haiku | 500/500, but the 99.4%→100% step was tuned on 3 specific wrong answers (see "Benchmark Integrity" below). Held-out 450-question R@5 is 98.4%. |
+| 2 | Supermemory ASMR | ~99% *(QA accuracy, not R@5)* | Yes | Ensemble of Gemini 2.0 Flash / GPT-4o-mini | Experimental, not production, per authors |
 | 3 | MemPal (hybrid v3 + rerank) | 99.4% | Optional | Haiku | Reproducible |
 | 3 | MemPal (palace + rerank) | 99.4% | Optional | Haiku | Independent architecture |
-| 4 | Mastra | 94.87% | Yes | GPT-5-mini | — |
-| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Highest zero-API score published** |
-| 6 | Hindsight | 91.4% | Yes | Gemini-3 | — |
-| 7 | Supermemory (production) | ~85% | Yes | Undisclosed | — |
-| 8 | Stella (dense retriever) | ~85% | None | None | Academic baseline |
-| 9 | Contriever | ~78% | None | None | Academic baseline |
+| 4 | Mastra | 94.87% *(QA accuracy, not R@5)* | Yes | GPT-5-mini | Different metric — not directly comparable to R@5 |
+| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Reproducible, 500/500** |
+| 6 | MemPal hybrid v4 held-out 450 | 98.4% | None | None | Honest generalisable hybrid-pipeline figure |
+| 7 | Hindsight | 91.4% *(per their release, metric unverified)* | Yes | Gemini-3 | Check their published methodology |
+| 8 | Stella (dense retriever) | ~85% | None | None | Academic retrieval baseline |
+| 9 | Contriever | ~78% | None | None | Academic retrieval baseline |
 | 10 | BM25 (sparse) | ~70% | None | None | Keyword baseline |

-**MemPal raw (96.6%) is the highest published LongMemEval score that requires no API key, no cloud, and no LLM at any stage.**
+The MemPal raw 96.6% is the headline we ship on public surfaces: it's
+retrieval recall, it requires no API key, and it reproduces.

-**MemPal hybrid v4 + Haiku rerank (100%) is the first perfect score on LongMemEval — 500/500 questions, all 6 question types at 100%.**
+The MemPal hybrid v4 + Haiku rerank 100% remains an internal
+result — reproducible with `--mode hybrid_v4 --llm-rerank` — but we
+don't quote it on public pages because the final 0.6% was reached by
+inspecting three specific wrong answers (see "Benchmark Integrity"
+below), which is teaching to the test. The honest generalisable figure
+when an LLM is in the loop is the held-out 98.4% R@5 on 450 unseen
+questions, or the model-agnostic 99.2% R@5 / 100% R@10 we reproduced
+with minimax-m2.7 on the full 500.
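
To make the metric distinction in the caveat concrete, here is a minimal sketch of how retrieval recall (`R@k`) and end-to-end QA accuracy differ. It is not the code in `benchmarks/longmemeval_bench.py`. The function names, the toy data, the one-labelled-session-per-question simplification, and the exact-match scoring (a stand-in for whatever judge a given system uses) are all assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                gold_ids: Sequence[str],
                k: int = 5) -> float:
    """Fraction of questions whose labelled session is in the top-k candidates.

    Assumes exactly one labelled session per question (a simplification).
    """
    if not gold_ids or len(ranked_ids) != len(gold_ids):
        raise ValueError("need one non-empty ranking per labelled question")
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


def qa_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of generated answers that match the reference answer.

    Case-insensitive exact match is a hypothetical stand-in for a real judge.
    """
    if not reference or len(predicted) != len(reference):
        raise ValueError("need one prediction per reference answer")
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predicted, reference))
    return correct / len(reference)


# Toy data: four questions, hypothetical session ids and answers.
ranked = [["s3", "s1", "s9"], ["s2", "s7", "s4"],
          ["s8", "s5", "s6"], ["s1", "s4", "s2"]]
gold = ["s1", "s2", "s5", "s4"]
answers = ["Paris", "blue", "7", "Tuesday"]
references = ["Paris", "green", "7", "Monday"]

r5 = recall_at_k(ranked, gold, k=5)    # labelled session anywhere in top 5
r1 = recall_at_k(ranked, gold, k=1)    # labelled session ranked first
qa = qa_accuracy(answers, references)  # generated answer is correct

print(f"R@5={r5:.2f}  R@1={r1:.2f}  QA accuracy={qa:.2f}")
```

In this toy run every labelled session is retrieved within the top 5, yet only half of the generated answers are correct, which is why an `R@5` figure and a QA-accuracy figure should not be placed in the same ranking.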

 ---

@@ -308,9 +342,9 @@ The palace classifies each question into one of 5 halls. Pass 1 searches only wi
 ### Setup

 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 mkdir -p /tmp/longmemeval-data
 curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
@@ -196,9 +196,9 @@ python benchmarks/longmemeval_bench.py data/longmemeval_s_cleaned.json --mode hy

 ```bash
 # Setup
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"

 # Download data
 mkdir -p /tmp/longmemeval-data
@@ -1,13 +1,13 @@
-# MemPal Benchmarks — Reproduction Guide
+# MemPalace Benchmarks — Reproduction Guide

 Run the exact same benchmarks we report. Clone, install, run.

 ## Setup

 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 ```

 ## Benchmark 1: LongMemEval (500 questions)