docs(website): align mempalaceofficial.com with honest benchmarks
Part of #875.

Bring the VitePress site into line with the new README and the reproducibility scorecard: drop category-error comparisons, drop retracted claims, retain only metrics and caveats that survive audit.

website/index.md
- New tagline matches README (local-first, verbatim, pluggable backend, 96.6% R@5 raw, zero API calls).
- Replace the "MemPalace hybrid 100% / Supermemory ~99% / Mastra 94.87% / Mem0 ~85%" comparison table with a single honest table showing MemPalace's own retrieval-recall numbers (raw 96.6%, hybrid v4 held-out 98.4%). Add an explicit sentence explaining why we no longer publish a cross-system table on the landing page (retrieval recall vs QA accuracy are different metrics).
- Soften the "ChromaDB-powered vector search" feature blurb to be backend-agnostic, since the retrieval layer is pluggable.

website/reference/benchmarks.md
- Full rewrite of the retrieval-recall tables. No more "100%" headline; honest held-out 98.4% R@5 replaces it. Added the model-agnostic rerank result (99.2% R@5 / 100% R@10 with minimax-m2.7 via Ollama) to show the pipeline is not Haiku-specific.
- Drop the LoCoMo "Hybrid v5 + Sonnet rerank (top-50) 100%" row. With per-conversation session counts of 19-32 and top_k=50, the retrieval stage returns every session by construction; the number measures an LLM's reading comprehension, not retrieval.
- Drop the cross-system comparison tables. Link out to each project's own research page (Mastra, Mem0, Supermemory) for their published numbers and metric definitions.
- Rewrite reproduction commands to use the correct repository and demonstrate the new --llm-backend ollama flag.

website/concepts/the-palace.md
- Remove the "+34%" row / paragraph. Wing/room filtering is standard metadata filtering in the vector store, not a novel retrieval mechanism. The April-7 note already retracted that framing; this finishes the retraction on the website where it had remained.

website/guide/searching.md
- Same treatment for "34% retrieval improvement". Reframe as operational scoping, not a novel boost.

website/reference/contributing.md
- Update the "palace structure matters" bullet to reflect the same framing: scoping-not-magic.

website/concepts/knowledge-graph.md
- Replace the MemPalace-vs-Zep feature matrix with a short "related work" note that links to Zep's own documentation for authoritative details on their deployment model. Avoids claims we cannot verify at source.
@@ -80,12 +80,11 @@ The knowledge graph uses SQLite with two tables:
|
||||
|
||||
Database location: `~/.mempalace/knowledge_graph.sqlite3`
|
||||
|
||||
## Comparison
|
||||
## Related Work
|
||||
|
||||
| Feature | MemPalace | Zep (Graphiti) |
|---------|-----------|----------------|
| Storage | SQLite (local) | Neo4j (cloud) |
| Cost | Free | $25/mo+ |
| Temporal validity | Yes | Yes |
| Self-hosted | Always | Enterprise only |
| Privacy | Everything local | SOC 2, HIPAA |
|
||||
Temporal entity-relationship graphs are a familiar pattern — Zep's
|
||||
Graphiti, for example, also exposes a bi-temporal model. MemPalace's
|
||||
knowledge graph is local-first (SQLite, everything on disk) and free;
|
||||
Zep is a managed service backed by Neo4j with its own pricing, SLAs,
|
||||
and compliance surface. See Zep's own [documentation](https://www.getzep.com/)
|
||||
for authoritative details on their deployment model.
|
||||
|
||||
@@ -92,16 +92,9 @@ The original stored text chunks. This is the primary retrieval layer used by the
|
||||
|
||||
## Why Structure Matters
|
||||
|
||||
Tested on 22,000+ real conversation memories:
|
||||
Wing and room identifiers become metadata filters at query time. Narrowing a search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which is useful when you have many unrelated projects or people filed in the same palace.
|
||||
|
||||
| Search scope | R@10 | Improvement |
|-------------|------|-------------|
| All closets | 60.9% | baseline |
| Within wing | 73.1% | +12% |
| Wing + hall | 84.8% | +24% |
| Wing + room | 94.8% | +34% |
|
||||
|
||||
The practical point is that structure improves retrieval. In the project benchmarks, narrowing the search scope by wing and room outperformed searching the entire corpus at once.
|
||||
This is standard metadata filtering in the underlying vector store, not a novel retrieval mechanism. The useful property here is operational — clear scoping rules that a human or an agent can apply predictably — not a magic retrieval boost.
|
||||
|
||||
## Navigation
|
||||
|
||||
|
||||
@@ -23,23 +23,16 @@ mempalace search "deploy process" --results 10
|
||||
|
||||
## How Search Works
|
||||
|
||||
1. Your query is embedded using ChromaDB's default model (`all-MiniLM-L6-v2`)
|
||||
2. The embedding is compared against all drawers using cosine similarity
|
||||
3. Optional wing/room filters narrow the search scope
|
||||
4. Results are returned with similarity scores and source metadata
|
||||
1. Your query is embedded using the vector store's default model (`all-MiniLM-L6-v2` with the default ChromaDB backend).
|
||||
2. The embedding is compared against all drawers using cosine similarity.
|
||||
3. Optional wing/room filters narrow the search scope — standard metadata filtering in the underlying vector store.
|
||||
4. Results are returned with similarity scores and source metadata.
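
The four steps above can be sketched in plain Python. This is a minimal illustration and not MemPalace's code: the `search` function, the drawer dictionaries, and the two-dimensional embeddings are hypothetical stand-ins, and a real backend would delegate both the metadata filter and the similarity scoring to the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, drawers, wing=None, room=None, n_results=5):
    """Apply the optional wing/room filter, then rank what is left by similarity."""
    candidates = [
        d for d in drawers
        if (wing is None or d["wing"] == wing)
        and (room is None or d["room"] == room)
    ]
    scored = [(cosine(query_vec, d["embedding"]), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n_results]


# Hypothetical drawers with toy 2-d embeddings.
drawers = [
    {"id": "a", "wing": "project-x", "room": "deploy", "embedding": [1.0, 0.0]},
    {"id": "b", "wing": "project-x", "room": "billing", "embedding": [0.0, 1.0]},
    {"id": "c", "wing": "project-y", "room": "deploy", "embedding": [0.9, 0.1]},
]

hits = search([1.0, 0.0], drawers, wing="project-x")
print([d["id"] for _, d in hits])  # ['a', 'b']
```

Scoping to `project-x` never changes how a candidate is scored; it only removes drawer `c` from the candidate set, which is why the filter is best read as scoping and not as a ranking improvement.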
|
||||
|
||||
### Why Structure Matters
|
||||
### Why Scoping Matters
|
||||
|
||||
Tested on 22,000+ real conversation memories:
|
||||
Wing/room filtering is useful when a single palace contains many unrelated projects or people. Narrowing the search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which keeps retrieval predictable as the palace grows.
|
||||
|
||||
```
Search all closets: 60.9% R@10
Search within wing: 73.1% (+12%)
Search wing + hall: 84.8% (+24%)
Search wing + room: 94.8% (+34%)
```
|
||||
|
||||
Wings and rooms aren't cosmetic — they're a **34% retrieval improvement**.
|
||||
This is a metadata-filter feature of the vector store, not a novel retrieval mechanism. Treat it as an operational convenience: clear scoping rules that a human or an agent can apply predictably.
|
||||
|
||||
## Programmatic Search
|
||||
|
||||
|
||||
+14
-13
@@ -4,7 +4,7 @@ layout: home
|
||||
hero:
|
||||
name: MemPalace
|
||||
text: Give your AI a memory.
|
||||
tagline: "96.6% recall on LongMemEval in raw mode. Local-first, open source, and usable without an API key."
|
||||
tagline: "Local-first AI memory. Verbatim storage, pluggable backend, 96.6% R@5 raw on LongMemEval — zero API calls."
|
||||
image:
|
||||
src: /mempalace_logo.png
|
||||
alt: MemPalace
|
||||
@@ -34,7 +34,7 @@ features:
|
||||
src: /icons/search.svg
|
||||
alt: Semantic Search
|
||||
title: Semantic Search
|
||||
details: ChromaDB-powered vector search lets the model retrieve past discussions by topic, project, or room.
|
||||
details: Vector search over verbatim content lets the model retrieve past discussions by topic, project, or room. Backend is pluggable.
|
||||
- icon:
|
||||
src: /icons/git-merge.svg
|
||||
alt: Knowledge Graph
|
||||
@@ -49,7 +49,7 @@ features:
|
||||
src: /icons/shield-check.svg
|
||||
alt: Zero Cloud
|
||||
title: Zero Cloud
|
||||
details: Core storage and retrieval run locally on ChromaDB and SQLite. Optional reranking features can add an API dependency.
|
||||
details: Core storage and retrieval run locally. Optional reranking features can add an API dependency but are not required for the benchmark path.
|
||||
---
|
||||
|
||||
<style>
|
||||
@@ -68,20 +68,21 @@ features:
|
||||
|
||||
## Verbatim Retrieval First
|
||||
|
||||
MemPalace starts from a simple premise: **store the source text and retrieve it well**. The benchmarked raw mode does not require an LLM extraction step.
|
||||
MemPalace stores source text and retrieves it with semantic search. The benchmarked raw mode does not require an LLM at any stage — no extraction, no rerank, no summarisation.
|
||||
|
||||
| System | LongMemEval R@5 | API Required | Cost |
|
||||
|--------|----------------|--------------|------|
|
||||
| **MemPalace (hybrid)** | **100%** | Optional | Free |
|
||||
| Supermemory ASMR | ~99% | Yes | — |
|
||||
| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
|
||||
| Mastra | 94.87% | Yes | API costs |
|
||||
| Mem0 | ~85% | Yes | $19–249/mo |
|
||||
**LongMemEval retrieval recall (500 questions):**
|
||||
|
||||
The raw 96.6% LongMemEval result is the baseline story: strong recall without requiring an API key or an LLM in the retrieval pipeline.
|
||||
| Mode | R@5 | LLM required |
|---|---|---|
| Raw (semantic search over verbatim text) | **96.6%** | None |
| Hybrid v4, held-out 450q | **98.4%** | None |
|
||||
|
||||
The raw 96.6% reproduces on any machine with the committed dataset: result JSONLs, the `seed=42` train/held-out split, and the `--mode raw` / `--held-out` runners are all in the `benchmarks/` directory of the repo.
|
||||
|
||||
We deliberately do not publish a side-by-side comparison against other memory systems on this page. Retrieval recall (R@5) and end-to-end QA accuracy are different metrics and are not comparable; where MemPalace can be fairly compared on the same metric, we link to the other project's published source.
|
||||
|
||||
<div style="text-align: center; padding-top: 16px;">
|
||||
<a href="./reference/benchmarks" style="color: var(--vp-c-brand-1); font-weight: 500;">Full benchmark results →</a>
|
||||
<a href="./reference/benchmarks" style="color: var(--vp-c-brand-1); font-weight: 500;">Full benchmark methodology →</a>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
+102
-50
@@ -1,28 +1,51 @@
|
||||
# Benchmarks
|
||||
|
||||
Curated summary of MemPalace benchmark results. For the full 725-line progression with every experiment, see [`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md) in the repository.
|
||||
Curated summary of MemPalace's reproducible benchmark results. For the
|
||||
complete progression with every experiment, see
|
||||
[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
|
||||
All headline numbers on this page are reproducible from the committed
|
||||
repository — datasets, scripts, and per-question result JSONLs are all
|
||||
checked in.
|
||||
|
||||
## The Core Finding
|
||||
|
||||
MemPalace's benchmarked raw baseline stores the source text and searches it with ChromaDB's default embeddings. No extraction layer or summarization step is required for that baseline.
|
||||
MemPalace's benchmarked raw baseline stores the source text and searches
|
||||
it with the vector store's default embeddings. No extraction or
|
||||
summarisation step is required for that baseline, and it reproduces at
|
||||
**96.6% R@5** on LongMemEval with no LLM at any stage.
|
||||
|
||||
**And it scores 96.6% on LongMemEval.**
|
||||
## LongMemEval — Retrieval Recall
|
||||
|
||||
## LongMemEval Results
|
||||
Retrieval recall asks: is the labelled session for this question inside
|
||||
the top-K retrieved sessions? It is not the same metric as end-to-end QA
|
||||
accuracy; a system can have perfect retrieval recall and poor QA answer
|
||||
quality, and vice versa.
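
As a rough sketch of the metric described above, the function below counts a question as a hit when a labelled session appears in its top-K list. The function name, the data layout, and the "any labelled session counts" rule for questions with several labels are assumptions made for illustration; the benchmark scripts in the repository are the authoritative definition.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of questions whose top-k retrieved sessions include a labelled one.

    retrieved: question id -> ranked list of session ids (best first)
    gold:      question id -> collection of labelled session ids
    """
    hits = sum(
        1
        for qid, ranked in retrieved.items()
        if set(ranked[:k]) & set(gold[qid])
    )
    return hits / len(retrieved)


# Toy data: two questions, three retrieved sessions each.
retrieved = {
    "q1": ["s3", "s1", "s7"],
    "q2": ["s2", "s9", "s4"],
}
gold = {"q1": ["s1"], "q2": ["s4"]}

print(recall_at_k(retrieved, gold, k=2))  # 0.5
print(recall_at_k(retrieved, gold, k=3))  # 1.0
```

Nothing in this computation looks at a generated answer, which is why a retrieval-recall score cannot be read as a QA-accuracy score.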
|
||||
|
||||
| Mode | R@5 | LLM Required | Cost/query |
|------|-----|-------------|------------|
| Raw ChromaDB | **96.6%** | None | $0 |
| Hybrid v3 + rerank | 99.4% | Haiku | ~$0.001 |
| Palace + rerank | 99.4% | Haiku | ~$0.001 |
| **Hybrid v4 + rerank** | **100%** | Haiku | ~$0.001 |
|
||||
**Full 500 questions:**
|
||||
|
||||
The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The 100% result uses optional Haiku reranking.
|
||||
| Mode | R@5 | LLM required | Cost/query |
|
||||
|---|---|---|---|
|
||||
| Raw — vector search over verbatim sessions | **96.6%** | None | $0 |
|
||||
| Hybrid v4 — keyword/temporal/preference boosts, no LLM | 98.6% | None | $0 |
|
||||
| Hybrid v4 + LLM rerank (minimax-m2.7 via Ollama) | 99.2% | Any capable model | $0 local / varies cloud |
|
||||
|
||||
### Per-Category Breakdown (Raw, 96.6%)
|
||||
**Held-out set (450 questions, never used during `hybrid_v4` development):**
|
||||
|
||||
| Question Type | R@5 | Count |
|
||||
|---------------|-----|-------|
|
||||
| Mode | R@5 | R@10 | NDCG@10 |
|---|---|---|---|
| Hybrid v4 | **98.4%** | 99.8% | 0.938 |
|
||||
|
||||
The held-out figure is the honest generalisable number. The full-500
|
||||
scores are higher but include the 50 "dev" questions that hybrid_v4's
|
||||
three targeted fixes (quoted-phrase boost, person-name boost, nostalgia
|
||||
patterns) were developed against. `benchmarks/BENCHMARKS.md` calls this
|
||||
"teaching to the test" and the held-out 98.4% is the clean number to
|
||||
quote when a single R@5 figure is needed for the hybrid pipeline.
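
To illustrate why a fixed seed makes the dev/held-out partition reproducible, here is a hedged sketch of a seeded 50/450 split. It is not the project's splitting code: the function name, the placeholder question ids, and the sort-then-shuffle approach are assumptions, and the split actually used for the numbers above is the committed `benchmarks/lme_split_50_450.json`.

```python
import random


def split_dev_heldout(question_ids, dev_size=50, seed=42):
    """Partition ids into (dev, held_out) in a way that depends only on the seed."""
    ids = sorted(question_ids)          # normalise input order first
    random.Random(seed).shuffle(ids)    # private RNG, so global state is untouched
    return ids[:dev_size], ids[dev_size:]


# Placeholder ids standing in for the 500 LongMemEval questions.
qids = [f"q{i:03d}" for i in range(500)]
dev, held_out = split_dev_heldout(qids)

print(len(dev), len(held_out))          # 50 450
print(set(dev).isdisjoint(held_out))    # True
```

Because the ids are sorted before shuffling, the same seed yields the same partition regardless of the order in which the questions were loaded.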
|
||||
|
||||
### Per-category breakdown (raw, 96.6%)
|
||||
|
||||
| Question type | R@5 | Count |
|---|---|---|
| Knowledge update | 99.0% | 78 |
| Multi-session | 98.5% | 133 |
| Temporal reasoning | 96.2% | 133 |
|
||||
@@ -30,66 +53,95 @@ The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The
|
||||
| Single-session preference | 93.3% | 30 |
|
||||
| Single-session assistant | 92.9% | 56 |
|
||||
|
||||
### Held-Out Validation
|
||||
## LoCoMo — Retrieval Recall
|
||||
|
||||
**98.4% R@5** on 450 questions that hybrid_v4 was never tuned on — confirming the improvements generalize.
|
||||
LoCoMo contains 1,986 questions across 10 long conversations (19–32
|
||||
sessions each).
|
||||
|
||||
## Comparison vs Published Systems
|
||||
| Mode | R@10 | LLM required |
|---|---|---|
| Session, no rerank, top-10 | 60.3% | None |
| Hybrid v5 (keyword + predicate boosts), top-10 | 88.9% | None |
|
||||
|
||||
| System | LongMemEval R@5 | API Required | Cost |
|
||||
|--------|----------------|--------------|------|
|
||||
| **MemPalace (hybrid)** | **100%** | Optional | Free |
|
||||
| Supermemory ASMR | ~99% | Yes | — |
|
||||
| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
|
||||
| Mastra | 94.87% | Yes | API costs |
|
||||
| Hindsight | 91.4% | Yes | API costs |
|
||||
| Mem0 | ~85% | Yes | $19–249/mo |
|
||||
We do not publish a "100% R@10" headline for LoCoMo. A reported 100% in
|
||||
earlier drafts used `top_k=50`, which exceeds the per-conversation
|
||||
session count (19–32) — so the retrieval stage returns every session in
|
||||
every conversation by construction. That number measures an LLM's
|
||||
reading comprehension over the whole conversation, not retrieval. The
|
||||
honest retrieval-recall number for LoCoMo is the top-10 figure.
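
The "by construction" argument can be checked with a small simulation. The sketch below uses a retriever with no skill at all (a random shuffle) over a 32-session conversation, the upper end of the range quoted above; the session ids, the gold label, and the trial count are made up for illustration.

```python
import random


def hit_at_k(ranked, gold_session, k):
    """True when the labelled session is among the first k ranked sessions."""
    return gold_session in ranked[:k]


rng = random.Random(0)
sessions = [f"session_{i}" for i in range(32)]  # one conversation, 32 sessions
gold = "session_17"                             # hypothetical labelled session

trials = 1000
hits_50 = hits_10 = 0
for _ in range(trials):
    ranked = sessions[:]
    rng.shuffle(ranked)                         # random ranking, zero retrieval skill
    hits_50 += hit_at_k(ranked, gold, 50)
    hits_10 += hit_at_k(ranked, gold, 10)

print(hits_50 / trials)  # 1.0, since k=50 covers all 32 sessions
print(hits_10 / trials)  # close to 10/32; only here does ranking quality matter
```

With `k=50` even a random ranking scores perfectly, so a 100% figure at that setting carries no information about retrieval quality.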
|
||||
|
||||
## Other Benchmarks
|
||||
|
||||
### ConvoMem (Salesforce, 75K+ QA pairs)
|
||||
**ConvoMem** (Salesforce; 50 items per category × 5 categories = 250
|
||||
items): MemPalace raw retrieval reaches **92.9% avg recall**. Strongest
|
||||
categories: Assistant Facts 100%, User Facts 98%. Weakest: Preferences
|
||||
86%. The Salesforce dataset contains ~75K items in total; our headline
|
||||
number is from the 250-item sample the benchmark script was designed
|
||||
around.
|
||||
|
||||
| System | Score |
|--------|-------|
| **MemPalace** | **92.9%** |
| Gemini (long context) | 70–82% |
| Block extraction | 57–71% |
| Mem0 (RAG) | 30–45% |
|
||||
**MemBench** (ACL 2025; 8,500 items, all topics): MemPalace hybrid
|
||||
top-5 reaches **80.3% R@5 overall**. Strongest: aggregative 99.3%,
|
||||
comparative 98.4%, lowlevel_rec 99.8%. Weakest: noisy 43.4%
|
||||
(distractor-heavy by design), conditional 57.3%.
|
||||
|
||||
On this benchmark, MemPalace materially outperforms the Mem0 result cited in the comparison table.
|
||||
## Why We Don't Publish a Cross-System Comparison Table
|
||||
|
||||
### LoCoMo (1,986 multi-hop QA pairs)
|
||||
Previous versions of this page placed MemPalace's retrieval recall (R@5)
|
||||
next to other projects' end-to-end QA accuracy figures under a single
|
||||
"LongMemEval R@5" column. Those are different metrics and are not
|
||||
comparable. A system can have 100% retrieval recall and 40% QA
|
||||
accuracy, and vice versa.
|
||||
|
||||
| Mode | R@10 | LLM |
|------|------|-----|
| Hybrid v5 + Sonnet rerank (top-50) | **100%** | Sonnet |
| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
| Hybrid v5 (top-10, no rerank) | **88.9%** | None |
| Session, no rerank (top-10) | 60.3% | None |
|
||||
If you are evaluating memory systems against MemPalace and want a fair
|
||||
comparison, use the retrieval-recall numbers above and the benchmark
|
||||
scripts in the repo; or pick the metric the other project publishes and
|
||||
compare on that. Each project's published source is the correct
|
||||
reference:
|
||||
|
||||
### MemBench (ACL 2025, 8,500 items)
|
||||
|
||||
**80.3% R@5** overall. Strongest categories: aggregative (99.3%), comparative (98.4%), lowlevel_rec (99.8%).
|
||||
- [Mastra — Observational Memory](https://mastra.ai/research/observational-memory)
|
||||
(their published metric is binary QA accuracy with GPT-5-mini)
|
||||
- [Mem0 — Research](https://mem0.ai/research)
|
||||
(their published LoCoMo metric is end-to-end QA accuracy, not retrieval recall)
|
||||
- [Supermemory — ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/)
|
||||
(their published metric is QA accuracy; authors explicitly frame the
|
||||
ensemble as an experimental proof-of-concept, not production)
|
||||
|
||||
## Reproducing Results
|
||||
|
||||
All benchmarks are reproducible with public datasets:
|
||||
Every benchmark runs deterministically from this repository.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/MemPalace/mempalace.git
|
||||
cd mempalace
|
||||
pip install chromadb pyyaml
|
||||
pip install -e ".[dev]"
|
||||
|
||||
# Download LongMemEval data
|
||||
# LongMemEval — raw (96.6%)
|
||||
curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
|
||||
https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
|
||||
|
||||
# Run raw baseline (96.6%, no API key needed)
|
||||
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json
|
||||
|
||||
# LongMemEval — hybrid v4 on the held-out 450 (98.4%)
|
||||
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
|
||||
--mode hybrid_v4 --held-out --split-file benchmarks/lme_split_50_450.json
|
||||
|
||||
# LoCoMo — session, top-10 (60.3%)
|
||||
git clone https://github.com/snap-research/locomo.git /tmp/locomo
|
||||
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json \
|
||||
--granularity session --top-k 10
|
||||
|
||||
# LongMemEval — hybrid v4 + rerank, any OpenAI-compatible endpoint
|
||||
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
|
||||
--mode hybrid_v4 --llm-rerank \
|
||||
--llm-backend ollama --llm-model <your-model-tag>
|
||||
```
|
||||
|
||||
::: tip
|
||||
Results are deterministic. Same data + same script = same result every time. Every result JSONL file contains every question, every retrieved document, every score.
|
||||
Results are deterministic: same data, same script, same split seed →
|
||||
same score. The committed `benchmarks/results_*.jsonl` files include
|
||||
every question, every retrieved corpus id, and every score, so every
|
||||
individual answer is auditable — not just the aggregate.
|
||||
:::
|
||||
|
||||
For complete reproduction instructions, benchmark integrity notes, and the full score progression, see the [full benchmark documentation](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
|
||||
For the complete progression (hybrid v1 → v4, diary mode, palace mode,
|
||||
LoCoMo architecture iterations, methodology integrity notes), see
|
||||
[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
|
||||
|
||||
@@ -68,7 +68,7 @@ If you're planning a significant change, open an issue first. Key principles:
|
||||
- **Verbatim first** — never summarize user content. Store exact words.
|
||||
- **Local first** — everything runs on the user's machine. No cloud dependencies.
|
||||
- **Zero API by default** — core features must work without any API key.
|
||||
- **Palace structure matters** — wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement.
|
||||
- **Palace structure is scoping, not magic** — wings, halls, and rooms act as metadata filters in the underlying vector store. They make scoping predictable when a palace holds many unrelated projects; they are not a novel retrieval mechanism.
|
||||
|
||||
## Community
|
||||
|
||||
|
||||