docs(website): align mempalaceofficial.com with honest benchmarks

Part of #875. Bring the VitePress site into line with the new README
and the reproducibility scorecard: drop category-error comparisons,
drop retracted claims, retain only metrics and caveats that survive
audit.

website/index.md
 - New tagline matches README (local-first, verbatim, pluggable backend,
   96.6% R@5 raw, zero API calls).
 - Replace the "MemPalace hybrid 100% / Supermemory ~99% / Mastra
   94.87% / Mem0 ~85%" comparison table with a single honest table
   showing MemPalace's own retrieval-recall numbers (raw 96.6%,
   hybrid v4 held-out 98.4%). Add an explicit sentence explaining why
   we no longer publish a cross-system table on the landing page
   (retrieval recall vs QA accuracy are different metrics).
 - Soften the "ChromaDB-powered vector search" feature blurb to be
   backend-agnostic, since the retrieval layer is pluggable.

website/reference/benchmarks.md
 - Full rewrite of the retrieval-recall tables. No more "100%"
   headline; honest held-out 98.4% R@5 replaces it. Added the
   model-agnostic rerank result (99.2% R@5 / 100% R@10 with
   minimax-m2.7 via Ollama) to show the pipeline is not Haiku-specific.
 - Drop the LoCoMo "Hybrid v5 + Sonnet rerank (top-50) 100%" row.
   With per-conversation session counts of 19-32 and top_k=50, the
   retrieval stage returns every session by construction — the number
   measures an LLM's reading comprehension, not retrieval.
 - Drop the cross-system comparison tables. Link out to each project's
   own research page (Mastra, Mem0, Supermemory) for their published
   numbers and metric definitions.
 - Rewrite reproduction commands to use the correct repository and
   demonstrate the new --llm-backend ollama flag.

website/concepts/the-palace.md
 - Remove the "+34%" row / paragraph. Wing/room filtering is standard
   metadata filtering in the vector store, not a novel retrieval
   mechanism — the April-7 note already retracted that framing; this
   finishes the retraction on the website where it had remained.

website/guide/searching.md
 - Same treatment for "34% retrieval improvement". Reframe as
   operational scoping, not a novel boost.

website/reference/contributing.md
 - Update the "palace structure matters" bullet to reflect the same
   framing: scoping-not-magic.

website/concepts/knowledge-graph.md
 - Replace the MemPalace-vs-Zep feature matrix with a short "related
   work" note that links to Zep's own documentation for authoritative
   details on their deployment model. Avoids claims we cannot verify
   at source.
Igor Lins e Silva
2026-04-14 21:37:45 -03:00
parent 65bf1ebda3
commit f20a1a30fe
6 changed files with 133 additions and 95 deletions
+7 -8
@@ -80,12 +80,11 @@ The knowledge graph uses SQLite with two tables:
Database location: `~/.mempalace/knowledge_graph.sqlite3` Database location: `~/.mempalace/knowledge_graph.sqlite3`
## Comparison ## Related Work
| Feature | MemPalace | Zep (Graphiti) | Temporal entity-relationship graphs are a familiar pattern — Zep's
|---------|-----------|----------------| Graphiti, for example, also exposes a bi-temporal model. MemPalace's
| Storage | SQLite (local) | Neo4j (cloud) | knowledge graph is local-first (SQLite, everything on disk) and free;
| Cost | Free | $25/mo+ | Zep is a managed service backed by Neo4j with its own pricing, SLAs,
| Temporal validity | Yes | Yes | and compliance surface. See Zep's own [documentation](https://www.getzep.com/)
| Self-hosted | Always | Enterprise only | for authoritative details on their deployment model.
| Privacy | Everything local | SOC 2, HIPAA |
+2 -9
@@ -92,16 +92,9 @@ The original stored text chunks. This is the primary retrieval layer used by the
## Why Structure Matters ## Why Structure Matters
Tested on 22,000+ real conversation memories: Wing and room identifiers become metadata filters at query time. Narrowing a search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which is useful when you have many unrelated projects or people filed in the same palace.
| Search scope | R@10 | Improvement | This is standard metadata filtering in the underlying vector store, not a novel retrieval mechanism. The useful property here is operational — clear scoping rules that a human or an agent can apply predictably — not a magic retrieval boost.
|-------------|------|-------------|
| All closets | 60.9% | baseline |
| Within wing | 73.1% | +12% |
| Wing + hall | 84.8% | +24% |
| Wing + room | 94.8% | +34% |
The practical point is that structure improves retrieval. In the project benchmarks, narrowing the search scope by wing and room outperformed searching the entire corpus at once.
## Navigation ## Navigation
+7 -14
@@ -23,23 +23,16 @@ mempalace search "deploy process" --results 10
## How Search Works ## How Search Works
1. Your query is embedded using ChromaDB's default model (`all-MiniLM-L6-v2`) 1. Your query is embedded using the vector store's default model (`all-MiniLM-L6-v2` with the default ChromaDB backend).
2. The embedding is compared against all drawers using cosine similarity 2. The embedding is compared against all drawers using cosine similarity.
3. Optional wing/room filters narrow the search scope 3. Optional wing/room filters narrow the search scope — standard metadata filtering in the underlying vector store.
4. Results are returned with similarity scores and source metadata 4. Results are returned with similarity scores and source metadata.
### Why Structure Matters ### Why Scoping Matters
Tested on 22,000+ real conversation memories: Wing/room filtering is useful when a single palace contains many unrelated projects or people. Narrowing the search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which keeps retrieval predictable as the palace grows.
``` This is a metadata-filter feature of the vector store, not a novel retrieval mechanism. Treat it as an operational convenience: clear scoping rules that a human or an agent can apply predictably.
Search all closets: 60.9% R@10
Search within wing: 73.1% (+12%)
Search wing + hall: 84.8% (+24%)
Search wing + room: 94.8% (+34%)
```
Wings and rooms aren't cosmetic — they're a **34% retrieval improvement**.
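
The search steps and the scoping behaviour described above can be sketched in plain Python. This is an illustrative toy, not MemPalace's implementation: the `search` function, the drawer dictionaries, and the two-dimensional embeddings are all hypothetical, and a real backend applies the filter inside the vector store rather than in a list comprehension.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, drawers, wing=None, room=None, n_results=5):
    """Hypothetical helper: metadata filter first, then rank survivors by similarity."""
    # Scoping step: only drawers inside the requested wing/room are scored.
    candidates = [
        d for d in drawers
        if (wing is None or d["wing"] == wing)
        and (room is None or d["room"] == room)
    ]
    scored = sorted(
        ((cosine(query_vec, d["embedding"]), d) for d in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(round(score, 3), d["id"]) for score, d in scored[:n_results]]

# Toy data: made-up 2-d embeddings standing in for real model output.
drawers = [
    {"id": "d1", "wing": "work", "room": "deploy", "embedding": [1.0, 0.0]},
    {"id": "d2", "wing": "work", "room": "billing", "embedding": [0.0, 1.0]},
    {"id": "d3", "wing": "family", "room": "deploy", "embedding": [0.9, 0.1]},
]
query = [1.0, 0.0]

unscoped = search(query, drawers)
scoped = search(query, drawers, wing="work")
```

Without a filter, `d3` from an unrelated wing ranks second; with `wing="work"` it is never scored. The similarity function is unchanged in both cases, which is the sense in which scoping is a filter rather than a retrieval boost.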
## Programmatic Search ## Programmatic Search
+14 -13
@@ -4,7 +4,7 @@ layout: home
hero: hero:
name: MemPalace name: MemPalace
text: Give your AI a memory. text: Give your AI a memory.
tagline: "96.6% recall on LongMemEval in raw mode. Local-first, open source, and usable without an API key." tagline: "Local-first AI memory. Verbatim storage, pluggable backend, 96.6% R@5 raw on LongMemEval — zero API calls."
image: image:
src: /mempalace_logo.png src: /mempalace_logo.png
alt: MemPalace alt: MemPalace
@@ -34,7 +34,7 @@ features:
src: /icons/search.svg src: /icons/search.svg
alt: Semantic Search alt: Semantic Search
title: Semantic Search title: Semantic Search
details: ChromaDB-powered vector search lets the model retrieve past discussions by topic, project, or room. details: Vector search over verbatim content lets the model retrieve past discussions by topic, project, or room. Backend is pluggable.
- icon: - icon:
src: /icons/git-merge.svg src: /icons/git-merge.svg
alt: Knowledge Graph alt: Knowledge Graph
@@ -49,7 +49,7 @@ features:
src: /icons/shield-check.svg src: /icons/shield-check.svg
alt: Zero Cloud alt: Zero Cloud
title: Zero Cloud title: Zero Cloud
details: Core storage and retrieval run locally on ChromaDB and SQLite. Optional reranking features can add an API dependency. details: Core storage and retrieval run locally. Optional reranking features can add an API dependency but are not required for the benchmark path.
--- ---
<style> <style>
@@ -68,20 +68,21 @@ features:
## Verbatim Retrieval First ## Verbatim Retrieval First
MemPalace starts from a simple premise: **store the source text and retrieve it well**. The benchmarked raw mode does not require an LLM extraction step. MemPalace stores source text and retrieves it with semantic search. The benchmarked raw mode does not require an LLM at any stage — no extraction, no rerank, no summarisation.
| System | LongMemEval R@5 | API Required | Cost | **LongMemEval retrieval recall (500 questions):**
|--------|----------------|--------------|------|
| **MemPalace (hybrid)** | **100%** | Optional | Free |
| Supermemory ASMR | ~99% | Yes | — |
| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
| Mastra | 94.87% | Yes | API costs |
| Mem0 | ~85% | Yes | $19-249/mo |
The raw 96.6% LongMemEval result is the baseline story: strong recall without requiring an API key or an LLM in the retrieval pipeline. | Mode | R@5 | LLM required |
|---|---|---|
| Raw (semantic search over verbatim text) | **96.6%** | None |
| Hybrid v4, held-out 450q | **98.4%** | None |
The raw 96.6% reproduces on any machine with the committed dataset: result JSONLs, the `seed=42` train/held-out split, and the `--mode raw` / `--held-out` runners are all in the `benchmarks/` directory of the repo.
We deliberately do not publish a side-by-side comparison against other memory systems on this page. Retrieval recall (R@5) and end-to-end QA accuracy are different metrics and are not comparable; where MemPalace can be fairly compared on the same metric, we link to the other project's published source.
<div style="text-align: center; padding-top: 16px;"> <div style="text-align: center; padding-top: 16px;">
<a href="./reference/benchmarks" style="color: var(--vp-c-brand-1); font-weight: 500;">Full benchmark results →</a> <a href="./reference/benchmarks" style="color: var(--vp-c-brand-1); font-weight: 500;">Full benchmark methodology →</a>
</div> </div>
</div> </div>
+102 -50
@@ -1,28 +1,51 @@
# Benchmarks # Benchmarks
Curated summary of MemPalace benchmark results. For the full 725-line progression with every experiment, see [`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md) in the repository. Curated summary of MemPalace's reproducible benchmark results. For the
complete progression with every experiment, see
[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
All headline numbers on this page are reproducible from the committed
repository — datasets, scripts, and per-question result JSONLs are all
checked in.
## The Core Finding ## The Core Finding
MemPalace's benchmarked raw baseline stores the source text and searches it with ChromaDB's default embeddings. No extraction layer or summarization step is required for that baseline. MemPalace's benchmarked raw baseline stores the source text and searches
it with the vector store's default embeddings. No extraction or
summarisation step is required for that baseline, and it reproduces at
**96.6% R@5** on LongMemEval with no LLM at any stage.
**And it scores 96.6% on LongMemEval.** ## LongMemEval — Retrieval Recall
## LongMemEval Results Retrieval recall asks: is the labelled session for this question inside
the top-K retrieved sessions? It is not the same metric as end-to-end QA
accuracy; a system can have perfect retrieval recall and poor QA answer
quality, and vice versa.
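
As a minimal sketch of the metric just described, the following computes recall at K over a list of questions. It assumes a question counts as a hit when at least one labelled session id appears in the top K; the benchmark script may define a hit differently (for example, requiring all labelled sessions), so treat the function names and the hit rule as assumptions.

```python
from typing import Sequence

def hit_at_k(retrieved: Sequence[str], gold: set, k: int) -> bool:
    """True if any labelled (gold) session id is among the top-k retrieved ids."""
    return any(session_id in gold for session_id in retrieved[:k])

def recall_at_k(runs, k: int) -> float:
    """Fraction of questions whose labelled session was retrieved in the top k.

    `runs` is a list of (ranked_session_ids, gold_session_ids) pairs.
    """
    hits = sum(hit_at_k(retrieved, gold, k) for retrieved, gold in runs)
    return hits / len(runs)

# Three made-up questions with ranked retrieval output and gold labels.
runs = [
    (["s3", "s1", "s7"], {"s1"}),  # gold at rank 2
    (["s2", "s9", "s4"], {"s4"}),  # gold at rank 3
    (["s5", "s6", "s8"], {"s0"}),  # gold never retrieved
]
r_at_2 = recall_at_k(runs, 2)
r_at_3 = recall_at_k(runs, 3)
```

Nothing in this calculation looks at the answer a model eventually produces, which is why retrieval recall and QA accuracy cannot share a column.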
| Mode | R@5 | LLM Required | Cost/query | **Full 500 questions:**
|------|-----|-------------|------------|
| Raw ChromaDB | **96.6%** | None | $0 |
| Hybrid v3 + rerank | 99.4% | Haiku | ~$0.001 |
| Palace + rerank | 99.4% | Haiku | ~$0.001 |
| **Hybrid v4 + rerank** | **100%** | Haiku | ~$0.001 |
The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The 100% result uses optional Haiku reranking. | Mode | R@5 | LLM required | Cost/query |
|---|---|---|---|
| Raw — vector search over verbatim sessions | **96.6%** | None | $0 |
| Hybrid v4 — keyword/temporal/preference boosts, no LLM | 98.6% | None | $0 |
| Hybrid v4 + LLM rerank (minimax-m2.7 via Ollama) | 99.2% | Any capable model | $0 local / varies cloud |
### Per-Category Breakdown (Raw, 96.6%) **Held-out set (450 questions, never used during `hybrid_v4` development):**
| Question Type | R@5 | Count | | Mode | R@5 | R@10 | NDCG@10 |
|---------------|-----|-------| |---|---|---|---|
| Hybrid v4 | **98.4%** | 99.8% | 0.938 |
The held-out figure is the honest generalisable number. The full-500
scores are higher but include the 50 "dev" questions that hybrid_v4's
three targeted fixes (quoted-phrase boost, person-name boost, nostalgia
patterns) were developed against. `benchmarks/BENCHMARKS.md` calls this
"teaching to the test" and the held-out 98.4% is the clean number to
quote when a single R@5 figure is needed for the hybrid pipeline.
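
To illustrate why a fixed seed makes the dev/held-out boundary auditable, here is a hypothetical sketch of a seeded 50/450 split. The committed split file is the authoritative source; the function below, its name, and the way it samples are assumptions made for illustration and are not taken from the repository.

```python
import random

def split_dev_heldout(question_ids, dev_size=50, seed=42):
    """Hypothetical seeded split: returns (dev_ids, held_out_ids)."""
    rng = random.Random(seed)      # local RNG, so global random state is untouched
    ids = sorted(question_ids)     # sort first so input order cannot change the split
    dev = set(rng.sample(ids, dev_size))
    held_out = [q for q in ids if q not in dev]
    return sorted(dev), held_out

# Made-up ids standing in for the 500 LongMemEval questions.
question_ids = [f"q{i:03d}" for i in range(500)]
dev, held_out = split_dev_heldout(question_ids)
```

Because the split depends only on the id set and the seed, anyone can regenerate it and check that no dev question leaked into the held-out score.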
### Per-category breakdown (raw, 96.6%)
| Question type | R@5 | Count |
|---|---|---|
| Knowledge update | 99.0% | 78 | | Knowledge update | 99.0% | 78 |
| Multi-session | 98.5% | 133 | | Multi-session | 98.5% | 133 |
| Temporal reasoning | 96.2% | 133 | | Temporal reasoning | 96.2% | 133 |
@@ -30,66 +53,95 @@ The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The
| Single-session preference | 93.3% | 30 | | Single-session preference | 93.3% | 30 |
| Single-session assistant | 92.9% | 56 | | Single-session assistant | 92.9% | 56 |
### Held-Out Validation ## LoCoMo — Retrieval Recall
**98.4% R@5** on 450 questions that hybrid_v4 was never tuned on — confirming the improvements generalize. LoCoMo contains 1,986 questions across 10 long conversations (19-32
sessions each).
## Comparison vs Published Systems | Mode | R@10 | LLM required |
|---|---|---|
| Session, no rerank, top-10 | 60.3% | None |
| Hybrid v5 (keyword + predicate boosts), top-10 | 88.9% | None |
| System | LongMemEval R@5 | API Required | Cost | We do not publish a "100% R@10" headline for LoCoMo. A reported 100% in
|--------|----------------|--------------|------| earlier drafts used `top_k=50`, which exceeds the per-conversation
| **MemPalace (hybrid)** | **100%** | Optional | Free | session count (19-32) — so the retrieval stage returns every session in
| Supermemory ASMR | ~99% | Yes | — | every conversation by construction. That number measures an LLM's
| **MemPalace (raw)** | **96.6%** | **None** | **Free** | reading comprehension over the whole conversation, not retrieval. The
| Mastra | 94.87% | Yes | API costs | honest retrieval-recall number for LoCoMo is the top-10 figure.
| Hindsight | 91.4% | Yes | API costs |
| Mem0 | ~85% | Yes | $19-249/mo |
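
The "by construction" argument for LoCoMo can be checked with a few lines of Python. This sketch uses a made-up conversation of 32 sessions with arbitrary scores; `top_k_ids` is a hypothetical helper, not a function from the benchmark scripts.

```python
def top_k_ids(scores: dict, k: int) -> list:
    """Return up to k session ids, highest score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# A made-up conversation with 32 sessions, the upper end of the 19-32 range.
scores = {f"session_{i}": 1.0 / (i + 1) for i in range(32)}
gold = "session_31"  # deliberately the worst-scored session

in_top_10 = gold in top_k_ids(scores, 10)
in_top_50 = gold in top_k_ids(scores, 50)
returned_at_50 = len(top_k_ids(scores, 50))
```

With `k=50` the ranking returns all 32 sessions, so the gold session is always present regardless of how badly it was scored. Recall at that cutoff says nothing about the ranker, which is why the top-10 figure is the one reported.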
## Other Benchmarks ## Other Benchmarks
### ConvoMem (Salesforce, 75K+ QA pairs) **ConvoMem** (Salesforce; 50 items per category × 5 categories = 250
items): MemPalace raw retrieval reaches **92.9% avg recall**. Strongest
categories: Assistant Facts 100%, User Facts 98%. Weakest: Preferences
86%. The Salesforce dataset contains ~75K items in total; our headline
number is from the 250-item sample the benchmark script was designed
around.
| System | Score | **MemBench** (ACL 2025; 8,500 items, all topics): MemPalace hybrid
|--------|-------| top-5 reaches **80.3% R@5 overall**. Strongest: aggregative 99.3%,
| **MemPalace** | **92.9%** | comparative 98.4%, lowlevel_rec 99.8%. Weakest: noisy 43.4%
| Gemini (long context) | 70-82% | (distractor-heavy by design), conditional 57.3%.
| Block extraction | 57-71% |
| Mem0 (RAG) | 30-45% |
On this benchmark, MemPalace materially outperforms the Mem0 result cited in the comparison table. ## Why We Don't Publish a Cross-System Comparison Table
### LoCoMo (1,986 multi-hop QA pairs) Previous versions of this page placed MemPalace's retrieval recall (R@5)
next to other projects' end-to-end QA accuracy figures under a single
"LongMemEval R@5" column. Those are different metrics and are not
comparable. A system can have 100% retrieval recall and 40% QA
accuracy, and vice versa.
| Mode | R@10 | LLM | If you are evaluating memory systems against MemPalace and want a fair
|------|------|-----| comparison, use the retrieval-recall numbers above and the benchmark
| Hybrid v5 + Sonnet rerank (top-50) | **100%** | Sonnet | scripts in the repo; or pick the metric the other project publishes and
| bge-large + Haiku rerank (top-15) | 96.3% | Haiku | compare on that. Each project's published source is the correct
| Hybrid v5 (top-10, no rerank) | **88.9%** | None | reference:
| Session, no rerank (top-10) | 60.3% | None |
### MemBench (ACL 2025, 8,500 items) - [Mastra — Observational Memory](https://mastra.ai/research/observational-memory)
(their published metric is binary QA accuracy with GPT-5-mini)
**80.3% R@5** overall. Strongest categories: aggregative (99.3%), comparative (98.4%), lowlevel_rec (99.8%). - [Mem0 — Research](https://mem0.ai/research)
(their published LoCoMo metric is end-to-end QA accuracy, not retrieval recall)
- [Supermemory — ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/)
(their published metric is QA accuracy; authors explicitly frame the
ensemble as an experimental proof-of-concept, not production)
## Reproducing Results ## Reproducing Results
All benchmarks are reproducible with public datasets: Every benchmark runs deterministically from this repository.
```bash ```bash
git clone https://github.com/MemPalace/mempalace.git git clone https://github.com/MemPalace/mempalace.git
cd mempalace cd mempalace
pip install chromadb pyyaml pip install -e ".[dev]"
# Download LongMemEval data # LongMemEval — raw (96.6%)
curl -fsSL -o /tmp/longmemeval_s_cleaned.json \ curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
# Run raw baseline (96.6%, no API key needed)
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json
# LongMemEval — hybrid v4 on the held-out 450 (98.4%)
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
--mode hybrid_v4 --held-out --split-file benchmarks/lme_split_50_450.json
# LoCoMo — session, top-10 (60.3%)
git clone https://github.com/snap-research/locomo.git /tmp/locomo
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json \
--granularity session --top-k 10
# LongMemEval — hybrid v4 + rerank, any OpenAI-compatible endpoint
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
--mode hybrid_v4 --llm-rerank \
--llm-backend ollama --llm-model <your-model-tag>
``` ```
::: tip ::: tip
Results are deterministic. Same data + same script = same result every time. Every result JSONL file contains every question, every retrieved document, every score. Results are deterministic: same data, same script, same split seed →
same score. The committed `benchmarks/results_*.jsonl` files include
every question, every retrieved corpus id, and every score, so every
individual answer is auditable — not just the aggregate.
::: :::
For complete reproduction instructions, benchmark integrity notes, and the full score progression, see the [full benchmark documentation](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md). For the complete progression (hybrid v1 → v4, diary mode, palace mode,
LoCoMo architecture iterations, methodology integrity notes), see
[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
+1 -1
@@ -68,7 +68,7 @@ If you're planning a significant change, open an issue first. Key principles:
- **Verbatim first** — never summarize user content. Store exact words. - **Verbatim first** — never summarize user content. Store exact words.
- **Local first** — everything runs on the user's machine. No cloud dependencies. - **Local first** — everything runs on the user's machine. No cloud dependencies.
- **Zero API by default** — core features must work without any API key. - **Zero API by default** — core features must work without any API key.
- **Palace structure matters** — wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement. - **Palace structure is scoping, not magic** — wings, halls, and rooms act as metadata filters in the underlying vector store. They make scoping predictable when a palace holds many unrelated projects; they are not a novel retrieval mechanism.
## Community ## Community