Merge pull request #897 from MemPalace/docs/honest-benchmarks-and-readme

docs: honest benchmarks + README/site rewrite (#875)
2026-04-14 20:35:29 -07:00
parent db4c52e8be 107685930d
commit ced1fc955d
16 changed files with 633 additions and 823 deletions
@@ -1,6 +1,9 @@
 repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.9.0
+    # Keep in lock-step with the ruff version pinned in .github/workflows/ci.yml
+    # (>=0.4.0,<0.5). Using a newer rev here produces a different formatter
+    # output than CI and breaks `ruff format --check` in the lint job.
+    rev: v0.4.10
    hooks:
      - id: ruff
        args: [--fix]
@@ -41,6 +41,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 - Add `docs/CLOSETS.md` — closet layer overview
 - Fix stale `milla-jovovich/*` org URLs in website and plugin manifests (#787)
 - Fix remaining stale org URLs in contributor docs (#808)
+- Rewrite `README.md` and `mempalaceofficial.com` benchmark pages to remove category-error cross-system comparisons (R@5 retrieval recall had been listed next to competitor QA accuracy under one column), remove the retracted "+34% palace boost" claim from the surfaces where it had remained, replace the `100%` Haiku-rerank headline with the honest held-out `98.4%` R@5, drop the LoCoMo `100%` top-50 row (retrieval-bypass artefact), and fix the broken `aya-thekeeper/mempal` reproduction URL (#875)
+- Add `docs/HISTORY.md` as the canonical home for corrections, retractions, and public notices; move the 2026-04-07 "Note from Milla & Ben" and the 2026-04-11 impostor-domain notice out of `README.md`
+- Add v3.3.0 reproduction result JSONLs and the deterministic `seed=42` 50/450 LongMemEval split under `benchmarks/` — every BENCHMARKS.md claim reproduces exactly
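
A deterministic 50/450 split like the one named in the entry above can be sketched as follows. This is a minimal illustration, assuming the split comes from shuffling question IDs with a seeded generator; the function name `split_dev_heldout` and the ID format are hypothetical, and the split files committed under `benchmarks/` remain the source of truth.

```python
import random


def split_dev_heldout(question_ids, dev_size=50, seed=42):
    """Return (dev, held_out) lists; the same inputs always give the same split."""
    ids = sorted(question_ids)   # fix the starting order before shuffling
    rng = random.Random(seed)    # local generator, leaves global state untouched
    rng.shuffle(ids)
    return ids[:dev_size], ids[dev_size:]


# Hypothetical IDs standing in for the 500 LongMemEval questions.
all_ids = [f"q{i:03d}" for i in range(500)]
dev, held_out = split_dev_heldout(all_ids)
```

Sorting before shuffling means the result does not depend on the order in which the IDs were loaded, only on the seed.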

 ### Internal
 - Add test coverage for `mine_lock`, closets, entity metadata, BM25, and diary
@@ -82,7 +82,7 @@ If you're planning a significant change, open an issue first to discuss the appr
 - **Verbatim first**: Never summarize user content. Store exact words.
 - **Local first**: Everything runs on the user's machine. No cloud dependencies.
 - **Zero API by default**: Core features must work without any API key.
- **Palace structure matters**: Wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement. Respect the hierarchy.
+- **Palace structure is scoping, not magic**: Wings, halls, and rooms act as metadata filters in the underlying vector store. They keep retrieval predictable when a palace holds many unrelated projects or people. Respect the hierarchy — but don't present it as a novel retrieval mechanism.
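
The scoping idea in the last principle can be illustrated with a toy example. This is a sketch in plain Python, not MemPalace's code: the `Drawer` record, the `scoped_search` helper, and the two-dimensional embeddings are all hypothetical, chosen only to show a metadata filter narrowing the candidate set before similarity ranking.

```python
from dataclasses import dataclass
import math


@dataclass
class Drawer:
    text: str
    wing: str
    room: str
    embedding: tuple  # toy 2-D vector; real embeddings are much larger


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(drawers, query_vec, wing=None, room=None, k=5):
    """Filter by metadata first, then rank the survivors by similarity."""
    candidates = [
        d for d in drawers
        if (wing is None or d.wing == wing) and (room is None or d.room == room)
    ]
    candidates.sort(key=lambda d: cosine(d.embedding, query_vec), reverse=True)
    return candidates[:k]


drawers = [
    Drawer("Chose Postgres for concurrent writes", "orion", "database", (0.9, 0.1)),
    Drawer("Chose SQLite for the CLI cache", "nova", "database", (0.95, 0.05)),
    Drawer("Rate limiting via token bucket", "orion", "api", (0.1, 0.9)),
]

scoped = scoped_search(drawers, query_vec=(1.0, 0.0), wing="orion", k=1)
unscoped = scoped_search(drawers, query_vec=(1.0, 0.0), k=1)
```

Here the unscoped query ranks the `nova` drawer first, while the `wing="orion"` filter returns the `orion` one. That is the sense in which the hierarchy keeps retrieval predictable: it is a filter over metadata, not a different ranking method.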

 ## Community

@@ -1,744 +1,176 @@
 > [!CAUTION]
-> **SCAM ALERT:** The only official sources for MemPalace are this [GitHub repository](https://github.com/MemPalace/mempalace), the [PyPI package](https://pypi.org/project/mempalace/), and the docs site at **mempalaceofficial.com**. Any other domain claiming to be MemPalace — including `mempalace.tech` — is an impostor and may distribute malware. Never run install scripts from unofficial sites.
+> **Scam alert.** The only official sources for MemPalace are this
+> [GitHub repository](https://github.com/MemPalace/mempalace), the
+> [PyPI package](https://pypi.org/project/mempalace/), and the docs site at
+> **[mempalaceofficial.com](https://mempalaceofficial.com)**. Any other
+> domain — including `mempalace.tech` — is an impostor and may distribute
+> malware. Details and timeline: [docs/HISTORY.md](docs/HISTORY.md).

 <div align="center">

-<img src="assets/mempalace_logo.png" alt="MemPalace" width="280">
+<img src="assets/mempalace_logo.png" alt="MemPalace" width="240">

 # MemPalace

-### The highest-scoring AI memory system ever benchmarked. And it's free.
-
-<br>
-
-Every conversation you have with an AI — every decision, every debugging session, every architecture debate — disappears when the session ends. Six months of work, gone. You start over every time.
-
-Other memory systems try to fix this by letting AI decide what's worth remembering. It extracts "user prefers Postgres" and throws away the conversation where you explained *why*. MemPalace takes a different approach: **store everything, then make it findable.**
-
-**The Palace** — Ancient Greek orators memorized entire speeches by placing ideas in rooms of an imaginary building. Walk through the building, find the idea. MemPalace applies the same principle to AI memory: your conversations are organized into wings (people and projects), halls (types of memory), and rooms (specific ideas). No AI decides what matters — you keep every word, and the structure gives you a navigable map instead of a flat search index.
-
-**Raw verbatim storage** — MemPalace stores your actual exchanges in ChromaDB without summarization or extraction. The 96.6% LongMemEval result comes from this raw mode. We don't burn an LLM to decide what's "worth remembering" — we keep everything and let semantic search find it.
-
-**AAAK (experimental)** — A lossy abbreviation dialect for packing repeated entities into fewer tokens at scale. Readable by any LLM that reads text — Claude, GPT, Gemini, Llama, Mistral — no decoder needed. **AAAK is a separate compression layer, not the storage default**, and on the LongMemEval benchmark it currently regresses vs raw mode (84.2% vs 96.6%). We're iterating. See the [note above](#a-note-from-milla--ben--april-7-2026) for the honest status.
-
-**Local, open, adaptable** — MemPalace runs entirely on your machine, on any data you have locally, without using any external API or services. It has been tested on conversations — but it can be adapted for different types of datastores. This is why we're open-sourcing it.
-
-<br>
+Local-first AI memory. Verbatim storage, pluggable backend, 96.6% R@5 raw on LongMemEval — zero API calls.

 [![][version-shield]][release-link]
 [![][python-shield]][python-link]
 [![][license-shield]][license-link]
 [![][discord-shield]][discord-link]

-<br>
-
-[Quick Start](#quick-start) · [The Palace](#the-palace) · [AAAK Dialect](#aaak-dialect-experimental) · [Benchmarks](#benchmarks) · [MCP Tools](#mcp-server)
-
-<br>
-
-### Highest LongMemEval score ever published — free or paid.
-
-<table>
-<tr>
-<td align="center"><strong>96.6%</strong><br><sub>LongMemEval R@5<br><b>raw mode</b>, zero API calls</sub></td>
-<td align="center"><strong>500/500</strong><br><sub>questions tested<br>independently reproduced</sub></td>
-<td align="center"><strong>$0</strong><br><sub>No subscription<br>No cloud. Local only.</sub></td>
-</tr>
-</table>
-
-<sub>Reproducible — runners in <a href="benchmarks/">benchmarks/</a>. <a href="benchmarks/BENCHMARKS.md">Full results</a>. The 96.6% is from <b>raw verbatim mode</b>, not AAAK or rooms mode (those score lower — see <a href="#a-note-from-milla--ben--april-7-2026">note above</a>).</sub>
-
 </div>

 ---

-## A Note from Milla & Ben — April 7, 2026
+## What it is

-> The community caught real problems in this README within hours of launch and we want to address them directly.
->
-> **What we got wrong:**
->
-> - **The AAAK token example was incorrect.** We used a rough heuristic (`len(text)//3`) for token counts instead of an actual tokenizer. Real counts via OpenAI's tokenizer: the English example is 66 tokens, the AAAK example is 73. AAAK does not save tokens at small scales — it's designed for *repeated entities at scale*, and the README example was a bad demonstration of that. We're rewriting it.
->
-> - **"30x lossless compression" was overstated.** AAAK is a lossy abbreviation system (entity codes, sentence truncation). Independent benchmarks show AAAK mode scores **84.2% R@5 vs raw mode's 96.6%** on LongMemEval — a 12.4 point regression. The honest framing is: AAAK is an experimental compression layer that trades fidelity for token density, and **the 96.6% headline number is from RAW mode, not AAAK**.
->
-> - **"+34% palace boost" was misleading.** That number compares unfiltered search to wing+room metadata filtering. Metadata filtering is a standard ChromaDB feature, not a novel retrieval mechanism. Real and useful, but not a moat.
->
-> - **"Contradiction detection"** exists as a separate utility (`fact_checker.py`) but is not currently wired into the knowledge graph operations as the README implied.
->
-> - **"100% with Haiku rerank"** is real (we have the result files) but the rerank pipeline is not in the public benchmark scripts. We're adding it.
->
-> **What's still true and reproducible:**
->
-> - **96.6% R@5 on LongMemEval in raw mode**, on 500 questions, zero API calls — independently reproduced on M2 Ultra in under 5 minutes by [@gizmax](https://github.com/MemPalace/mempalace/issues/39).
-> - Local, free, no subscription, no cloud, no data leaving your machine.
-> - The architecture (wings, rooms, closets, drawers) is real and useful, even if it's not a magical retrieval boost.
->
-> **What we're doing:**
->
-> 1. Rewriting the AAAK example with real tokenizer counts and a scenario where AAAK actually demonstrates compression
-> 2. Adding `mode raw / aaak / rooms` clearly to the benchmark documentation so the trade-offs are visible
-> 3. Wiring `fact_checker.py` into the KG ops so the contradiction detection claim becomes true
-> 4. Pinning ChromaDB to a tested range (Issue #100), fixing the shell injection in hooks (#110), and addressing the macOS ARM64 segfault (#74)
->
-> **Thank you to everyone who poked holes in this.** Brutal honest criticism is exactly what makes open source work, and it's what we asked for. Special thanks to [@panuhorsmalahti](https://github.com/MemPalace/mempalace/issues/43), [@lhl](https://github.com/MemPalace/mempalace/issues/27), [@gizmax](https://github.com/MemPalace/mempalace/issues/39), and everyone who filed an issue or a PR in the first 48 hours. We're listening, we're fixing, and we'd rather be right than impressive.
->
-> — *Milla Jovovich & Ben Sigman*
+MemPalace stores your conversation history as verbatim text and retrieves
+it with semantic search. It does not summarize, extract, or paraphrase.
+The index is structured — people and projects become *wings*, topics
+become *rooms*, and original content lives in *drawers* — so searches
+can be scoped rather than run against a flat corpus.
+
+The retrieval layer is pluggable. The current default is ChromaDB; the
+interface is defined in [`mempalace/backends/base.py`](mempalace/backends/base.py)
+and alternative backends can be dropped in without touching the rest of
+the system.
+
+Nothing leaves your machine unless you opt in.
+
+Architecture, concepts, and mining flows:
+[mempalaceofficial.com/concepts/the-palace](https://mempalaceofficial.com/concepts/the-palace.html).

 ---

-## An important follow up note regarding fake MemPalace websites - April 11, 2026
-
-Several Community Members (#267, #326, #506) have pointed out there are fake MemPalace websites popping up, including ones with Malware.
-
-The only official MemPalace surfaces are this [GitHub repository](https://github.com/MemPalace/mempalace), the [PyPI package](https://pypi.org/project/mempalace/), and the docs site at [mempalaceofficial.com](https://mempalaceofficial.com). Any other domain — `mempalace.tech` being the one most commonly reported — is not ours.
-
-Thanks to our Community Members for letting us know about the problem.
-
-Stay safe out there.
-
---
-
-## Quick Start
+## Install

 ```bash
 pip install mempalace
-
-# Set up your world — who you work with, what your projects are
 mempalace init ~/projects/myapp
+```

-# Mine your data
-mempalace mine ~/projects/myapp                    # projects — code, docs, notes
-mempalace mine ~/chats/ --mode convos              # convos — Claude, ChatGPT, Slack exports
-mempalace mine ~/chats/ --mode convos --extract general  # general — classifies into decisions, milestones, problems
+## Quickstart

-# Search anything you've ever discussed
+```bash
+# Mine content into the palace
+mempalace mine ~/projects/myapp                    # project files
+mempalace mine ~/chats/ --mode convos              # conversation exports
+
+# Search
 mempalace search "why did we switch to GraphQL"

-# Your AI remembers
-mempalace status
+# Load context for a new session
+mempalace wake-up
 ```

-Three mining modes: **projects** (code and docs), **convos** (conversation exports), and **general** (auto-classifies into decisions, preferences, milestones, problems, and emotional context). Everything stays on your machine.
-
---
-
-## How You Actually Use It
-
-After the one-time setup (install → init → mine), you don't run MemPalace commands manually. Your AI uses it for you. There are two ways, depending on which AI you use.
-
-### With Claude Code (recommended)
-
-Native marketplace install:
-
-```bash
-claude plugin marketplace add MemPalace/mempalace
-claude plugin install --scope user mempalace
-```
-
-Restart Claude Code, then type `/skills` to verify "mempalace" appears.
-
-### With Claude, ChatGPT, Cursor, Gemini (MCP-compatible tools)
-
-```bash
-# Connect MemPalace once
-claude mcp add mempalace -- python -m mempalace.mcp_server
-```
-
-Now your AI has 29 tools available through MCP. Ask it anything:
-
-> *"What did we decide about auth last month?"*
-
-Claude calls `mempalace_search` automatically, gets verbatim results, and answers you. You never type `mempalace search` again. The AI handles it.
-
-MemPalace also works natively with **Gemini CLI** (which handles the server and save hooks automatically) — see the [Gemini CLI Integration Guide](examples/gemini_cli_setup.md).
-
-### With local models (Llama, Mistral, or any offline LLM)
-
-Local models generally don't speak MCP yet. Two approaches:
-
-**1. Wake-up command** — load your world into the model's context:
-
-```bash
-mempalace wake-up > context.txt
-# Paste context.txt into your local model's system prompt
-```
-
-This gives your local model ~600-900 tokens of critical facts (in AAAK if you prefer) before you ask a single question.
-
-**2. CLI search** — query on demand, feed results into your prompt:
-
-```bash
-mempalace search "auth decisions" > results.txt
-# Include results.txt in your prompt
-```
-
-Or use the Python API:
-
-```python
-from mempalace.searcher import search_memories
-results = search_memories("auth decisions", palace_path="~/.mempalace/palace")
-# Inject into your local model's context
-```
-
-Either way — your entire memory stack runs offline. ChromaDB on your machine, Llama on your machine, AAAK for compression, zero cloud calls.
-
---
-
-## The Problem
-
-Decisions happen in conversations now. Not in docs. Not in Jira. In conversations with Claude, ChatGPT, Copilot. The reasoning, the tradeoffs, the "we tried X and it failed because Y" — all trapped in chat windows that evaporate when the session ends.
-
-**Six months of daily AI use = 19.5 million tokens.** That's every decision, every debugging session, every architecture debate. Gone.
-
-| Approach | Tokens loaded | Annual cost |
-|----------|--------------|-------------|
-| Paste everything | 19.5M — doesn't fit any context window | Impossible |
-| LLM summaries | ~650K | ~$507/yr |
-| **MemPalace wake-up** | **~600-900 tokens** | **~$0.70/yr** |
-| **MemPalace + 5 searches** | **~13,500 tokens** | **~$10/yr** |
-
-MemPalace loads ~600-900 tokens of critical facts on wake-up — your team, your projects, your preferences. Then searches only when needed. $10/year to remember everything vs $507/year for summaries that lose context.
-
---
-
-## How It Works
-
-### The Palace
-
-The layout is fairly simple, though it took a long time to get there.
-
-It starts with a **wing**. Every project, person, or topic you're filing gets its own wing in the palace.
-
-Each wing has **rooms** connected to it, where information is divided into subjects that relate to that wing — so every room is a different element of what your project contains. Project ideas could be one room, employees could be another, financial statements another. There can be an endless number of rooms that split the wing into sections. The MemPalace install detects these for you automatically, and of course you can personalize it any way you feel is right.
-
-Every room has a **closet** connected to it, and here's where things get interesting. We've developed an AI language called **AAAK**. Don't ask — it's a whole story of its own. Your agent learns the AAAK shorthand every time it wakes up. Because AAAK is essentially English, but a very truncated version, your agent understands how to use it in seconds. It comes as part of the install, built into the MemPalace code. In our next update, we'll add AAAK directly to the closets, which will be a real game changer — the amount of info in the closets will be much bigger, but it will take up far less space and far less reading time for your agent.
-
-Inside those closets are **drawers**, and those drawers are where your original files live. In this first version, we haven't used AAAK as a closet tool, but even so, the summaries have shown **96.6% recall** in all the benchmarks we've done across multiple benchmarking platforms. Once the closets use AAAK, searches will be even faster while keeping every word exact. But even now, the closet approach has been a huge boon to how much info is stored in a small space — it's used to easily point your AI agent to the drawer where your original file lives. You never lose anything, and all this happens in seconds.
-
-There are also **halls**, which connect rooms within a wing, and **tunnels**, which connect rooms from different wings to one another. So finding things becomes truly effortless — we've given the AI a clean and organized way to know where to start searching, without having to look through every keyword in huge folders.
-
-You say what you're looking for and boom, it already knows which wing to go to. Just *that* in itself would have made a big difference. But this is beautiful, elegant, organic, and most importantly, efficient.
-
-```
-  +------------------------------------------------------------+
-  |  WING: Person                                              |
-  |                                                            |
-  |    +----------+            +----------+                    |
-  |    |  Room A  |  --hall--  |  Room B  |                    |
-  |    +----------+            +----------+                    |
-  |         |                                                  |
-  |         v                                                  |
-  |    +----------+      +----------+                          |
-  |    |  Closet  | ---> |  Drawer  |                          |
-  |    +----------+      +----------+                          |
-  +---------+--------------------------------------------------+
-            |
-          tunnel
-            |
-  +---------+--------------------------------------------------+
-  |  WING: Project                                             |
-  |         |                                                  |
-  |    +----------+            +----------+                    |
-  |    |  Room A  |  --hall--  |  Room C  |                    |
-  |    +----------+            +----------+                    |
-  |         |                                                  |
-  |         v                                                  |
-  |    +----------+      +----------+                          |
-  |    |  Closet  | ---> |  Drawer  |                          |
-  |    +----------+      +----------+                          |
-  +------------------------------------------------------------+
-```
-
-**Wings** — a person or project. As many as you need.
-**Rooms** — specific topics within a wing. Auth, billing, deploy — endless rooms.
-**Halls** — connections between related rooms *within* the same wing. If Room A (auth) and Room B (security) are related, a hall links them.
-**Tunnels** — connections *between* wings. When Person A and a Project both have a room about "auth," a tunnel cross-references them automatically.
-**Closets** — summaries that point to the original content. (In v3.0.0 these are plain-text summaries; AAAK-encoded closets are coming in a future update — see [Task #30](https://github.com/MemPalace/mempalace/issues/30).)
-**Drawers** — the original verbatim files. The exact words, never summarized.
-
-**Halls** are memory types — the same in every wing, acting as corridors:
- `hall_facts` — decisions made, choices locked in
- `hall_events` — sessions, milestones, debugging
- `hall_discoveries` — breakthroughs, new insights
- `hall_preferences` — habits, likes, opinions
- `hall_advice` — recommendations and solutions
-
-**Rooms** are named ideas — `auth-migration`, `graphql-switch`, `ci-pipeline`. When the same room appears in different wings, it creates a **tunnel** — connecting the same topic across domains:
-
-```
-wing_kai       / hall_events / auth-migration  → "Kai debugged the OAuth token refresh"
-wing_driftwood / hall_facts  / auth-migration  → "team decided to migrate auth to Clerk"
-wing_priya     / hall_advice / auth-migration  → "Priya approved Clerk over Auth0"
-```
-
-Same room. Three wings. The tunnel connects them.
-
-### Why Structure Matters
-
-Tested on 22,000+ real conversation memories:
-
-```
-Search all closets:          60.9%  R@10
-Search within wing:          73.1%  (+12%)
-Search wing + hall:          84.8%  (+24%)
-Search wing + room:          94.8%  (+34%)
-```
-
-Wings and rooms aren't cosmetic. They're a **34% retrieval improvement**. The palace structure is the product.
-
-### The Memory Stack
-
-| Layer | What | Size | When |
-|-------|------|------|------|
-| **L0** | Identity — who is this AI? | ~50 tokens | Always loaded |
-| **L1** | Critical facts — team, projects, preferences | ~120 tokens (AAAK) | Always loaded |
-| **L2** | Room recall — recent sessions, current project | On demand | When topic comes up |
-| **L3** | Deep search — semantic query across all closets | On demand | When explicitly asked |
-
-Your AI wakes up with L0 + L1 (~600-900 tokens) and knows your world. Searches only fire when needed.
-
-### AAAK Dialect (experimental)
-
-AAAK is a lossy abbreviation system — entity codes, structural markers, and sentence truncation — designed to pack repeated entities and relationships into fewer tokens at scale. It is **readable by any LLM that reads text** (Claude, GPT, Gemini, Llama, Mistral) without a decoder, so a local model can use it without any cloud dependency.
-
-**Honest status (April 2026):**
-
- **AAAK is lossy, not lossless.** It uses regex-based abbreviation, not reversible compression.
- **It does not save tokens at small scales.** Short text already tokenizes efficiently. AAAK overhead (codes, separators) costs more than it saves on a few sentences.
- **It can save tokens at scale** — in scenarios with many repeated entities (a team mentioned hundreds of times, the same project across thousands of sessions), the entity codes amortize.
- **AAAK currently regresses LongMemEval** vs raw verbatim retrieval (84.2% R@5 vs 96.6%). The 96.6% headline number is from **raw mode**, not AAAK mode.
- **The MemPalace storage default is raw verbatim text in ChromaDB** — that's where the benchmark wins come from. AAAK is a separate compression layer for context loading, not the storage format.
-
-We're iterating on the dialect spec, adding a real tokenizer for stats, and exploring better break points for when to use it. Track progress in [Issue #43](https://github.com/MemPalace/mempalace/issues/43) and [#27](https://github.com/MemPalace/mempalace/issues/27).
-
-### Contradiction Detection (experimental, not yet wired into KG)
-
-A separate utility (`fact_checker.py`) can check assertions against entity facts. It's not currently called automatically by the knowledge graph operations — this is being fixed (track in [Issue #27](https://github.com/MemPalace/mempalace/issues/27)). When enabled it catches things like:
-
-```
-Input:  "Soren finished the auth migration"
-Output: 🔴 AUTH-MIGRATION: attribution conflict — Maya was assigned, not Soren
-
-Input:  "Kai has been here 2 years"
-Output: 🟡 KAI: wrong_tenure — records show 3 years (started 2023-04)
-
-Input:  "The sprint ends Friday"
-Output: 🟡 SPRINT: stale_date — current sprint ends Thursday (updated 2 days ago)
-```
-
-Facts checked against the knowledge graph. Ages, dates, and tenures calculated dynamically — not hardcoded.
-
---
-
-## Real-World Examples
-
-### Solo developer across multiple projects
-
-```bash
-# Mine each project's conversations
-mempalace mine ~/chats/orion/  --mode convos --wing orion
-mempalace mine ~/chats/nova/   --mode convos --wing nova
-mempalace mine ~/chats/helios/ --mode convos --wing helios
-
-# Six months later: "why did I use Postgres here?"
-mempalace search "database decision" --wing orion
-# → "Chose Postgres over SQLite because Orion needs concurrent writes
-#    and the dataset will exceed 10GB. Decided 2025-11-03."
-
-# Cross-project search
-mempalace search "rate limiting approach"
-# → finds your approach in Orion AND Nova, shows the differences
-```
-
-### Team lead managing a product
-
-```bash
-# Mine Slack exports and AI conversations
-mempalace mine ~/exports/slack/ --mode convos --wing driftwood
-mempalace mine ~/.claude/projects/ --mode convos
-
-# "What did Soren work on last sprint?"
-mempalace search "Soren sprint" --wing driftwood
-# → 14 closets: OAuth refactor, dark mode, component library migration
-
-# "Who decided to use Clerk?"
-mempalace search "Clerk decision" --wing driftwood
-# → "Kai recommended Clerk over Auth0 — pricing + developer experience.
-#    Team agreed 2026-01-15. Maya handling the migration."
-```
-
-### Before mining: split mega-files
-
-Some transcript exports concatenate multiple sessions into one huge file:
-
-```bash
-mempalace split ~/chats/                      # split into per-session files
-mempalace split ~/chats/ --dry-run            # preview first
-mempalace split ~/chats/ --min-sessions 3     # only split files with 3+ sessions
-```
-
---
-
-## Knowledge Graph
-
-Temporal entity-relationship triples — like Zep's Graphiti, but SQLite instead of Neo4j. Local and free.
-
-```python
-from mempalace.knowledge_graph import KnowledgeGraph
-
-kg = KnowledgeGraph()
-kg.add_triple("Kai", "works_on", "Orion", valid_from="2025-06-01")
-kg.add_triple("Maya", "assigned_to", "auth-migration", valid_from="2026-01-15")
-kg.add_triple("Maya", "completed", "auth-migration", valid_from="2026-02-01")
-
-# What's Kai working on?
-kg.query_entity("Kai")
-# → [Kai → works_on → Orion (current), Kai → recommended → Clerk (2026-01)]
-
-# What was true in January?
-kg.query_entity("Maya", as_of="2026-01-20")
-# → [Maya → assigned_to → auth-migration (active)]
-
-# Timeline
-kg.timeline("Orion")
-# → chronological story of the project
-```
-
-Facts have validity windows. When something stops being true, invalidate it:
-
-```python
-kg.invalidate("Kai", "works_on", "Orion", ended="2026-03-01")
-```
-
-Now queries for Kai's current work won't return Orion. Historical queries still will.
-
-| Feature | MemPalace | Zep (Graphiti) |
-|---------|-----------|----------------|
-| Storage | SQLite (local) | Neo4j (cloud) |
-| Cost | Free | $25/mo+ |
-| Temporal validity | Yes | Yes |
-| Self-hosted | Always | Enterprise only |
-| Privacy | Everything local | SOC 2, HIPAA |
-
---
-
-## Specialist Agents
-
-Create agents that focus on specific areas. Each agent gets its own wing and diary in the palace — not in your CLAUDE.md. Add 50 agents, your config stays the same size.
-
-```
-~/.mempalace/agents/
-  ├── reviewer.json       # code quality, patterns, bugs
-  ├── architect.json      # design decisions, tradeoffs
-  └── ops.json            # deploys, incidents, infra
-```
-
-Your CLAUDE.md just needs one line:
-
-```
-You have MemPalace agents. Run mempalace_list_agents to see them.
-```
-
-The AI discovers its agents from the palace at runtime. Each agent:
-
- **Has a focus** — what it pays attention to
- **Keeps a diary** — written in AAAK, persists across sessions
- **Builds expertise** — reads its own history to stay sharp in its domain
-
-```
-# Agent writes to its diary after a code review
-mempalace_diary_write("reviewer",
-    "PR#42|auth.bypass.found|missing.middleware.check|pattern:3rd.time.this.quarter|★★★★")
-
-# Agent reads back its history
-mempalace_diary_read("reviewer", last_n=10)
-# → last 10 findings, compressed in AAAK
-```
-
-Each agent is a specialist lens on your data. The reviewer remembers every bug pattern it's seen. The architect remembers every design decision. The ops agent remembers every incident. They don't share a scratchpad — they each maintain their own memory.
-
-Letta charges $20–200/mo for agent-managed memory. MemPalace does it with a wing.
-
---
-
-## MCP Server
-
-```bash
-# Via plugin (recommended)
-claude plugin marketplace add MemPalace/mempalace
-claude plugin install --scope user mempalace
-
-# Or manually
-claude mcp add mempalace -- python -m mempalace.mcp_server
-```
-
-### 29 Tools
-
-**Palace (read)**
-
-| Tool | What |
-|------|------|
-| `mempalace_status` | Palace overview + AAAK spec + memory protocol |
-| `mempalace_list_wings` | Wings with counts |
-| `mempalace_list_rooms` | Rooms within a wing |
-| `mempalace_get_taxonomy` | Full wing → room → count tree |
-| `mempalace_search` | Semantic search with wing/room filters |
-| `mempalace_check_duplicate` | Check before filing |
-| `mempalace_get_aaak_spec` | AAAK dialect reference |
-
-**Palace (write)**
-
-| Tool | What |
-|------|------|
-| `mempalace_add_drawer` | File verbatim content |
-| `mempalace_delete_drawer` | Remove by ID |
-
-**Knowledge Graph**
-
-| Tool | What |
-|------|------|
-| `mempalace_kg_query` | Entity relationships with time filtering |
-| `mempalace_kg_add` | Add facts |
-| `mempalace_kg_invalidate` | Mark facts as ended |
-| `mempalace_kg_timeline` | Chronological entity story |
-| `mempalace_kg_stats` | Graph overview |
-
-**Navigation**
-
-| Tool | What |
-|------|------|
-| `mempalace_traverse` | Walk the graph from a room across wings |
-| `mempalace_find_tunnels` | Find rooms bridging two wings |
-| `mempalace_graph_stats` | Graph connectivity overview |
-| `mempalace_create_tunnel` | Create explicit cross-wing link between two rooms |
-| `mempalace_list_tunnels` | List all explicit tunnels, filter by wing |
-| `mempalace_delete_tunnel` | Remove a tunnel by ID |
-| `mempalace_follow_tunnels` | Follow tunnels from a room to connected rooms in other wings |
-
-**Drawer Management**
-
-| Tool | What |
-|------|------|
-| `mempalace_get_drawer` | Fetch a single drawer by ID |
-| `mempalace_list_drawers` | Paginated drawer listing |
-| `mempalace_update_drawer` | Update drawer content or metadata |
-
-**Agent Diary**
-
-| Tool | What |
-|------|------|
-| `mempalace_diary_write` | Write AAAK diary entry |
-| `mempalace_diary_read` | Read recent diary entries |
-
-**System**
-
-| Tool | What |
-|------|------|
-| `mempalace_hook_settings` | Get/set hook behavior (silent save, toast) |
-| `mempalace_memories_filed_away` | Check if recent checkpoint was saved |
-| `mempalace_reconnect` | Force DB reconnect after external writes |
-
-The AI learns AAAK and the memory protocol automatically from the `mempalace_status` response. No manual configuration.
-
---
-
-## Auto-Save Hooks
-
-Two hooks for Claude Code that automatically save memories during work:
-
-**Save Hook** — every 15 messages, triggers a structured save. Topics, decisions, quotes, code changes. Also regenerates the critical facts layer.
-
-**PreCompact Hook** — fires before context compression. Emergency save before the window shrinks.
-
-```json
-{
-  "hooks": {
-    "Stop": [{"matcher": "", "hooks": [{"type": "command", "command": "/path/to/mempalace/hooks/mempal_save_hook.sh"}]}],
-    "PreCompact": [{"matcher": "", "hooks": [{"type": "command", "command": "/path/to/mempalace/hooks/mempal_precompact_hook.sh"}]}]
-  }
-}
-```
-
-**Optional auto-ingest:** Set the `MEMPAL_DIR` environment variable to a directory path and the hooks will automatically run `mempalace mine` on that directory during each save trigger (background on stop, synchronous on precompact).
+For Claude Code, Gemini CLI, MCP-compatible tools, and local models, see
+[mempalaceofficial.com/guide/getting-started](https://mempalaceofficial.com/guide/getting-started.html).

 ---

 ## Benchmarks

-Tested on standard academic benchmarks — reproducible, published datasets.
+All numbers below are reproducible from this repository with the commands
+in [`benchmarks/BENCHMARKS.md`](benchmarks/BENCHMARKS.md). Full
+per-question result files are committed under `benchmarks/results_*`.

-| Benchmark | Mode | Score | API Calls |
-|-----------|------|-------|-----------|
-| **LongMemEval R@5** | Raw (ChromaDB only) | **96.6%** | Zero |
-| **LongMemEval R@5** | Hybrid + Haiku rerank | **100%** (500/500) | ~500 |
-| **LoCoMo R@10** | Raw, session level | **60.3%** | Zero |
-| **Personal palace R@10** | Heuristic bench | **85%** | Zero |
-| **Palace structure impact** | Wing+room filtering | **+34%** R@10 | Zero |
+**LongMemEval — retrieval recall (R@5, 500 questions):**

-The 96.6% raw score is the highest published LongMemEval result requiring no API key, no cloud, and no LLM at any stage.
+| Mode | R@5 | LLM required |
+|---|---|---|
+| Raw (semantic search, no heuristics, no LLM) | **96.6%** | None |
+| Hybrid v4, held-out 450q (tuned on the 50-question dev split; these 450 were never used for tuning) | **98.4%** | None |
+| Hybrid v4 + LLM rerank (full 500) | ≥99% | Any capable model |

-### vs Published Systems
+The raw 96.6% requires no API key, no cloud, and no LLM at any stage. The
+hybrid pipeline adds keyword boosting, temporal-proximity boosting, and
+preference-pattern extraction; the held-out 98.4% is the honest
+generalisable figure.
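A minimal sketch of how a deterministic 50/450 dev/held-out split can be produced, assuming a seeded shuffle over the question ids. The helper name `split_dev_heldout` and the placeholder ids are hypothetical; the split actually used for the 98.4% figure is the one committed under `benchmarks/`, and this block only illustrates why a fixed seed makes the held-out set reproducible.

```python
import random


def split_dev_heldout(question_ids, dev_size=50, seed=42):
    """Return (dev, held_out) lists; the same seed always gives the same split."""
    ordered = sorted(question_ids)   # fix the input order before shuffling
    rng = random.Random(seed)        # private RNG, unaffected by global state
    rng.shuffle(ordered)
    return ordered[:dev_size], ordered[dev_size:]


# Placeholder ids standing in for the 500 LongMemEval questions (assumption).
question_ids = [f"q{i:03d}" for i in range(500)]
dev, held_out = split_dev_heldout(question_ids)

print(len(dev), len(held_out))
```

Heuristics are tuned only against `dev`; the score reported for `held_out` then reflects questions the tuning never saw.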

-| System | LongMemEval R@5 | API Required | Cost |
-|--------|----------------|--------------|------|
-| **MemPalace (hybrid)** | **100%** | Optional | Free |
-| Supermemory ASMR | ~99% | Yes | — |
-| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
-| Mastra | 94.87% | Yes (GPT) | API costs |
-| Mem0 | ~85% | Yes | $19–249/mo |
-| Zep | ~85% | Yes | $25/mo+ |
+The rerank pipeline promotes the best candidate out of the top-20
+retrieved sessions using an LLM reader. It works with any reasonably
+capable model — we have reproduced it with Claude Haiku, Claude Sonnet,
+and minimax-m2.7 via Ollama Cloud (no Anthropic dependency). The gap
+between raw and reranked is model-agnostic; we do not headline a "100%"
+number because the last 0.6% was reached by inspecting specific wrong
+answers, which `benchmarks/BENCHMARKS.md` flags as teaching to the test.

---
+**Other benchmarks (full results in [`benchmarks/BENCHMARKS.md`](benchmarks/BENCHMARKS.md)):**

-## All Commands
+| Benchmark | Metric | Score | Notes |
+|---|---|---|---|
+| LoCoMo (session, top-10, no rerank) | R@10 | 60.3% | 1,986 questions |
+| LoCoMo (hybrid v5, top-10, no rerank) | R@10 | 88.9% | Same set |
+| ConvoMem (all categories, 250 items) | Avg recall | 92.9% | 50 per category |
+| MemBench (ACL 2025, 8,500 items) | R@5 | 80.3% | All categories |
+
+We deliberately do not include a side-by-side comparison against Mem0,
+Mastra, Hindsight, Supermemory, or Zep. Those projects publish different
+metrics on different splits, and placing retrieval recall next to
+end-to-end QA accuracy is not an honest comparison. See each project's
+own research page for their published numbers.
+
+**Reproducing every result:**

 ```bash
-# Setup
-mempalace init <dir>                              # guided onboarding + AAAK bootstrap
-
-# Mining
-mempalace mine <dir>                              # mine project files
-mempalace mine <dir> --mode convos                # mine conversation exports
-mempalace mine <dir> --mode convos --wing myapp   # tag with a wing name
-
-# Splitting
-mempalace split <dir>                             # split concatenated transcripts
-mempalace split <dir> --dry-run                   # preview
-
-# Search
-mempalace search "query"                          # search everything
-mempalace search "query" --wing myapp             # within a wing
-mempalace search "query" --room auth-migration    # within a room
-
-# Memory stack
-mempalace wake-up                                 # load L0 + L1 context
-mempalace wake-up --wing driftwood                # project-specific
-
-# Compression
-mempalace compress --wing myapp                   # AAAK compress
-
-# Status
-mempalace status                                  # palace overview
-
-# MCP
-mempalace mcp                                     # show MCP setup command
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
+# see benchmarks/README.md for dataset download commands
+python benchmarks/longmemeval_bench.py /path/to/longmemeval_s_cleaned.json
 ```

-All commands accept `--palace <path>` to override the default location.
-
 ---

-## Configuration
+## Knowledge graph

-### Global (`~/.mempalace/config.json`)
+MemPalace includes a temporal entity-relationship graph with validity
+windows — add, query, invalidate, timeline — backed by local SQLite.
+Usage and tool reference:
+[mempalaceofficial.com/concepts/knowledge-graph](https://mempalaceofficial.com/concepts/knowledge-graph.html).

-```json
-{
-  "palace_path": "/custom/path/to/palace",
-  "collection_name": "mempalace_drawers",
-  "people_map": {"Kai": "KAI", "Priya": "PRI"}
-}
-```
+## MCP server

-### Wing config (`~/.mempalace/wing_config.json`)
+29 MCP tools cover palace reads/writes, knowledge-graph operations,
+cross-wing navigation, drawer management, and agent diaries. Installation
+and the full tool list:
+[mempalaceofficial.com/reference/mcp-tools](https://mempalaceofficial.com/reference/mcp-tools.html).

-Generated by `mempalace init`. Maps your people and projects to wings:
+## Agents

-```json
-{
-  "default_wing": "wing_general",
-  "wings": {
-    "wing_kai": {"type": "person", "keywords": ["kai", "kai's"]},
-    "wing_driftwood": {"type": "project", "keywords": ["driftwood", "analytics", "saas"]}
-  }
-}
-```
+Each specialist agent gets its own wing and diary in the palace.
+Discoverable at runtime via `mempalace_list_agents` — no bloat in your
+system prompt:
+[mempalaceofficial.com/concepts/agents](https://mempalaceofficial.com/concepts/agents.html).

-### Identity (`~/.mempalace/identity.txt`)
+## Auto-save hooks

-Plain text. Becomes Layer 0 — loaded every session.
-
---
-
-## File Reference
-
-| File | What |
-|------|------|
-| `cli.py` | CLI entry point |
-| `config.py` | Configuration loading and defaults |
-| `normalize.py` | Converts 5 chat formats to standard transcript |
-| `mcp_server.py` | MCP server — 29 tools, AAAK auto-teach, memory protocol |
-| `miner.py` | Project file ingest |
-| `convo_miner.py` | Conversation ingest — chunks by exchange pair |
-| `searcher.py` | Semantic search via ChromaDB |
-| `layers.py` | 4-layer memory stack |
-| `dialect.py` | AAAK index format for closet pointers |
-| `knowledge_graph.py` | Temporal entity-relationship graph (SQLite) |
-| `palace_graph.py` | Room-based navigation graph |
-| `onboarding.py` | Guided setup — generates AAAK bootstrap + wing config |
-| `entity_registry.py` | Entity code registry |
-| `entity_detector.py` | Auto-detect people and projects from content |
-| `split_mega_files.py` | Split concatenated transcripts into per-session files |
-| `hooks/mempal_save_hook.sh` | Auto-save every N messages |
-| `hooks/mempal_precompact_hook.sh` | Emergency save before compaction |
-
---
-
-## Project Structure
-
-```
-mempalace/
-├── README.md                  ← you are here
-├── mempalace/                 ← core package (README)
-│   ├── cli.py                 ← CLI entry point
-│   ├── mcp_server.py          ← MCP server (29 tools)
-│   ├── knowledge_graph.py     ← temporal entity graph
-│   ├── palace_graph.py        ← room navigation graph
-│   ├── dialect.py             ← AAAK compression
-│   ├── miner.py               ← project file ingest
-│   ├── convo_miner.py         ← conversation ingest
-│   ├── searcher.py            ← semantic search
-│   ├── onboarding.py          ← guided setup
-│   └── ...                    ← see mempalace/README.md
-├── benchmarks/                ← reproducible benchmark runners
-│   ├── README.md              ← reproduction guide
-│   ├── BENCHMARKS.md          ← full results + methodology
-│   ├── longmemeval_bench.py   ← LongMemEval runner
-│   ├── locomo_bench.py        ← LoCoMo runner
-│   └── membench_bench.py      ← MemBench runner
-├── hooks/                     ← Claude Code auto-save hooks
-│   ├── README.md              ← hook setup guide
-│   ├── mempal_save_hook.sh    ← save every N messages
-│   └── mempal_precompact_hook.sh ← emergency save
-├── examples/                  ← usage examples
-│   ├── basic_mining.py
-│   ├── convo_import.py
-│   └── mcp_setup.md
-├── tests/                     ← test suite (README)
-├── assets/                    ← logo + brand assets
-└── pyproject.toml             ← package config (v3.3.0)
-```
+Two Claude Code hooks save periodically and before context compression:
+[mempalaceofficial.com/guide/hooks](https://mempalaceofficial.com/guide/hooks.html).

 ---

 ## Requirements

 - Python 3.9+
- `chromadb>=0.4.0`
- `pyyaml>=6.0`
+- A vector-store backend (ChromaDB by default)
+- ~300 MB disk for the default embedding model

-No API key. No internet after install. Everything local.
+No API key is required for the core benchmark path.

-```bash
-pip install mempalace
-```
+## Docs

---
+- Getting started → [mempalaceofficial.com/guide/getting-started](https://mempalaceofficial.com/guide/getting-started.html)
+- CLI reference → [mempalaceofficial.com/reference/cli](https://mempalaceofficial.com/reference/cli.html)
+- Python API → [mempalaceofficial.com/reference/python-api](https://mempalaceofficial.com/reference/python-api.html)
+- Full benchmark methodology → [benchmarks/BENCHMARKS.md](benchmarks/BENCHMARKS.md)
+- Release notes → [CHANGELOG.md](CHANGELOG.md)
+- Corrections and public notices → [docs/HISTORY.md](docs/HISTORY.md)

 ## Contributing

-PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for setup and guidelines.
+PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).

 ## License

@@ -41,23 +41,57 @@ Both are real. Both are reproducible. Neither is the whole picture alone.

 ## Comparison vs Published Systems (LongMemEval)

-| # | System | R@5 | LLM Required | Which LLM | Notes |
+> **Important caveat — read before quoting this table.**
+> MemPal's `R@5` in this table is **retrieval recall**: is the labelled
+> session for this question inside the top-5 retrieved candidates?
+>
+> Several of the other systems below publish **end-to-end QA accuracy** —
+> a different metric that scores whether the system's generated answer
+> is correct. Retrieval recall and QA accuracy are not comparable; a
+> system can have 100% retrieval recall and 40% QA accuracy, and vice
+> versa.
+>
+> - **Mastra's 94.87%** is binary QA accuracy with GPT-5-mini, per
+>   [mastra.ai/research/observational-memory](https://mastra.ai/research/observational-memory).
+> - **Supermemory ASMR's ~99%** is QA accuracy with an 8-/12-agent
+>   ensemble, and the authors explicitly frame it as an experimental
+>   proof-of-concept, not production, per
+>   [their ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/).
+> - **Mem0** does not publish a LongMemEval number; their published
+>   metric is LoCoMo QA accuracy (~66.9%), per
+>   [mem0.ai/research](https://mem0.ai/research).
+>
+> The table is kept here as a historical record of how the comparison
+> was originally framed. Public-facing pages (`README.md`,
+> `mempalaceofficial.com`) no longer present this table, per issue
+> [#875](https://github.com/MemPalace/mempalace/issues/875). For a fair
+> head-to-head, run the same metric on the same split.
+
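The difference between the two metrics in the caveat above can be shown with a toy example. This is an illustrative sketch, not the benchmark code: the data is invented, and `exact_match` is a deliberately simple stand-in for whatever answer scorer a QA benchmark uses.

```python
def recall_at_k(ranked_ids, gold_id, k=5):
    """True if the labelled session is among the top-k retrieved ids."""
    return gold_id in ranked_ids[:k]


def exact_match(answer, reference):
    """Toy QA scorer: case-insensitive exact match on the generated answer."""
    return answer.strip().lower() == reference.strip().lower()


# Hypothetical items: a retrieved ranking, the labelled session,
# the generated answer, and the reference answer.
items = [
    {"ranked": ["s3", "s1", "s9"], "gold": "s1", "answer": "Lisbon", "reference": "Lisbon"},
    {"ranked": ["s4", "s7", "s2"], "gold": "s2", "answer": "in March", "reference": "April"},
    {"ranked": ["s8", "s5", "s6"], "gold": "s8", "answer": "unknown", "reference": "a red bike"},
    {"ranked": ["s2", "s0", "s1"], "gold": "s0", "answer": "Two", "reference": "two"},
]

retrieval_recall = sum(recall_at_k(i["ranked"], i["gold"]) for i in items) / len(items)
qa_accuracy = sum(exact_match(i["answer"], i["reference"]) for i in items) / len(items)

print(f"R@5 = {retrieval_recall:.2f}, QA accuracy = {qa_accuracy:.2f}")
```

Every labelled session is retrieved, yet only half of the generated answers are correct, which is why the two columns cannot be ranked against each other.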
+| # | System | R@5 (retrieval recall, unless noted) | LLM Required | Which LLM | Notes |
 |---|---|---|---|---|---|
-| 1 | **MemPal (hybrid v4 + rerank)** | **100%** | Optional | Haiku | Reproducible, 500/500 |
-| 2 | Supermemory ASMR | ~99% | Yes | Undisclosed | Research only, not in production |
+| 1 | **MemPal (hybrid v4 + Haiku rerank)** | **100%** | Optional | Haiku | 500/500, but the 99.4%→100% step was tuned on 3 specific wrong answers (see "Benchmark Integrity" below). Held-out 450q is 98.4%. |
+| 2 | Supermemory ASMR | ~99% *(QA accuracy, not R@5)* | Yes | Ensemble of Gemini 2.0 Flash / GPT-4o-mini | Experimental, not production, per authors |
 | 3 | MemPal (hybrid v3 + rerank) | 99.4% | Optional | Haiku | Reproducible |
 | 3 | MemPal (palace + rerank) | 99.4% | Optional | Haiku | Independent architecture |
-| 4 | Mastra | 94.87% | Yes | GPT-5-mini | — |
-| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Highest zero-API score published** |
-| 6 | Hindsight | 91.4% | Yes | Gemini-3 | — |
-| 7 | Supermemory (production) | ~85% | Yes | Undisclosed | — |
-| 8 | Stella (dense retriever) | ~85% | None | None | Academic baseline |
-| 9 | Contriever | ~78% | None | None | Academic baseline |
+| 4 | Mastra | 94.87% *(QA accuracy, not R@5)* | Yes | GPT-5-mini | Different metric — not directly comparable to R@5 |
+| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Reproducible, 500/500** |
+| 6 | MemPal hybrid v4 held-out 450 | 98.4% | None | None | Honest generalisable hybrid-pipeline figure |
+| 7 | Hindsight | 91.4% *(per their release, metric unverified)* | Yes | Gemini-3 | Check their published methodology |
+| 8 | Stella (dense retriever) | ~85% | None | None | Academic retrieval baseline |
+| 9 | Contriever | ~78% | None | None | Academic retrieval baseline |
 | 10 | BM25 (sparse) | ~70% | None | None | Keyword baseline |

-**MemPal raw (96.6%) is the highest published LongMemEval score that requires no API key, no cloud, and no LLM at any stage.**
+The MemPal raw 96.6% is the headline we ship on public surfaces: it's
+retrieval recall, it requires no API key, and it reproduces.

-**MemPal hybrid v4 + Haiku rerank (100%) is the first perfect score on LongMemEval — 500/500 questions, all 6 question types at 100%.**
+The MemPal hybrid v4 + Haiku rerank 100% remains an internal
+result — reproducible with `--mode hybrid_v4 --llm-rerank` — but we
+don't quote it on public pages because the final 0.6% was reached by
+inspecting three specific wrong answers (see "Benchmark Integrity"
+below), which is teaching to the test. The honest generalisable figure
+when an LLM is in the loop is the held-out 98.4% R@5 on 450 unseen
+questions, or the model-agnostic 99.2% R@5 / 100% R@10 we reproduced
+with minimax-m2.7 on the full 500.

 ---

@@ -308,9 +342,9 @@ The palace classifies each question into one of 5 halls. Pass 1 searches only wi
 ### Setup

 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 mkdir -p /tmp/longmemeval-data
 curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
@@ -196,9 +196,9 @@ python benchmarks/longmemeval_bench.py data/longmemeval_s_cleaned.json --mode hy

 ```bash
 # Setup
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"

 # Download data
 mkdir -p /tmp/longmemeval-data
@@ -1,13 +1,13 @@
-# MemPal Benchmarks — Reproduction Guide
+# MemPalace Benchmarks — Reproduction Guide

 Run the exact same benchmarks we report. Clone, install, run.

 ## Setup

 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 ```

 ## Benchmark 1: LongMemEval (500 questions)
@@ -0,0 +1,144 @@
+# MemPalace — History, Corrections, and Public Notices
+
+This file is the canonical record of post-launch corrections, public notices,
+and retractions that affect MemPalace's public claims. Newest first.
+
+---
+
+## 2026-04-14 — Benchmark table rewrite (issue [#875](https://github.com/MemPalace/mempalace/issues/875))
+
+A community audit identified a category error in the public benchmark tables
+on `README.md` and `mempalaceofficial.com`: MemPalace's retrieval recall
+numbers (R@5, R@10) were listed in the same columns as competitors'
+end-to-end QA accuracy numbers. They are different metrics and are not
+comparable — a system can have 100% retrieval recall and 40% QA accuracy.
+
+The audit also found that the retracted "+34% palace boost" claim (see the
+April 7 note below) was still present on multiple surfaces despite that
+retraction, and that two competitor numbers (`Mem0 ~85%`, `Zep ~85%`) had no
+published source and did not match the metrics those projects actually
+publish.
+
+What changed in this PR:
+
+- The headline number on all surfaces is now **96.6% R@5 on LongMemEval in
+  raw mode**, independently reproduced on Linux x86_64 against the tagged
+  v3.3.0 release on 2026-04-14. Result JSONLs are committed under
+  `benchmarks/results_*.jsonl` (see PR description for the scorecard).
+- The **"100% with Haiku rerank"** claim has been removed from all public
+  comparison tables. It reproduces on our machines and with a different LLM
+  family (minimax-m2.7 via Ollama Cloud: 99.2% R@5 / 100.0% R@10 on the full
+  500-question LongMemEval set) — but the 99.4% → 100% step was developed
+  by inspecting three specific wrong answers (`benchmarks/BENCHMARKS.md` has
+  called this "teaching to the test" since February). It belongs in the
+  methodology document, not in a headline.
+- The **honest held-out number** for the hybrid pipeline — 98.4% R@5 on 450
+  questions that `hybrid_v4` was never tuned on, deterministic seed — is now
+  the comparable figure when an LLM rerank is involved.
+- The **retracted "+34% palace boost"** has been removed from
+  `README.md`, `website/concepts/the-palace.md`,
+  `website/guide/searching.md`, and `website/reference/contributing.md`.
+  Wing and room filters remain useful — they're standard metadata filters —
+  but they are not presented as a novel retrieval improvement.
+- **Competitor comparison tables** mixing retrieval recall with QA accuracy
+  have been removed from `README.md` and `website/reference/benchmarks.md`.
+  Where MemPalace can be fairly compared on the same metric, we link to the
+  cited source. Otherwise we report our own numbers and let readers draw
+  their own conclusions.
+- **Reproduction instructions** in `benchmarks/BENCHMARKS.md` and
+  `benchmarks/README.md` were pointing at a defunct branch
+  (`ben/benchmarking` on `aya-thekeeper/mempal`); they now point at
+  `MemPalace/mempalace`.
+- The **LoCoMo 100% R@10 with top-50 rerank** row has been removed from
+  public comparison surfaces. With per-conversation session counts of 19–32
+  and `top_k=50`, the retrieval stage returns every session in the
+  conversation by construction, so the number measures an LLM's
+  reading comprehension over the whole conversation, not retrieval.
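The LoCoMo point in the last bullet can be checked with a small simulation. This is a sketch under stated assumptions: the retriever below is deliberately useless (it returns sessions in random order), and only the session counts of 19 to 32 and `top_k=50` are taken from the text above.

```python
import random


def recall_at_k(ranked_ids, gold_id, k):
    """True if the labelled session is among the top-k retrieved ids."""
    return gold_id in ranked_ids[:k]


def random_retriever(session_ids, rng):
    """A retriever with no signal at all: sessions in random order."""
    shuffled = list(session_ids)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)
hits_top50 = hits_top10 = trials = 0
for n_sessions in range(19, 33):  # 19..32 sessions per conversation
    sessions = [f"session_{i}" for i in range(n_sessions)]
    for gold in sessions:
        ranked = random_retriever(sessions, rng)
        hits_top50 += recall_at_k(ranked, gold, k=50)
        hits_top10 += recall_at_k(ranked, gold, k=10)
        trials += 1

recall_top50 = hits_top50 / trials
recall_top10 = hits_top10 / trials
print(f"R@50 = {recall_top50:.3f}, R@10 = {recall_top10:.3f}")
```

With `k` larger than the number of sessions, even a random ranking reaches 100% recall, so the score says nothing about retrieval quality; at `k=10` the same retriever falls well short.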
+
+Thanks to [@dial481](https://github.com/MemPalace/mempalace/issues/875) for
+the detailed audit and to [@rohitg00](https://github.com/rohitg00) for the
+parallel write-up in Discussion #747.
+
+---
+
+## 2026-04-11 — Impostor domains and malware
+
+Several community members (issues #267, #326, #506) reported fake MemPalace
+websites distributing malware. The only official surfaces for this project
+are:
+
+- This GitHub repository: [github.com/MemPalace/mempalace](https://github.com/MemPalace/mempalace)
+- The PyPI package: [pypi.org/project/mempalace](https://pypi.org/project/mempalace/)
+- The docs site: [mempalaceofficial.com](https://mempalaceofficial.com)
+
+Any other domain — `mempalace.tech` being the one most commonly reported —
+is not ours. Never run install scripts from unofficial sites.
+
+Thanks to our community members for flagging the problem.
+
+---
+
+## 2026-04-07 — A Note from Milla & Ben
+
+> The community caught real problems in this README within hours of launch
+> and we want to address them directly.
+>
+> **What we got wrong:**
+>
+> - **The AAAK token example was incorrect.** We used a rough heuristic
+>   (`len(text)//3`) for token counts instead of an actual tokenizer. Real
+>   counts via OpenAI's tokenizer: the English example is 66 tokens, the
+>   AAAK example is 73. AAAK does not save tokens at small scales — it's
+>   designed for *repeated entities at scale*, and the README example was a
+>   bad demonstration of that. We're rewriting it.
+>
+> - **"30x lossless compression" was overstated.** AAAK is a lossy
+>   abbreviation system (entity codes, sentence truncation). Independent
+>   benchmarks show AAAK mode scores **84.2% R@5 vs raw mode's 96.6%** on
+>   LongMemEval — a 12.4 point regression. The honest framing is: AAAK is
+>   an experimental compression layer that trades fidelity for token
+>   density, and **the 96.6% headline number is from RAW mode, not AAAK**.
+>
+> - **"+34% palace boost" was misleading.** That number compares unfiltered
+>   search to wing+room metadata filtering. Metadata filtering is a
+>   standard feature of the underlying vector store, not a novel retrieval
+>   mechanism. Real and useful, but not a moat.
+>
+> - **"Contradiction detection"** exists as a separate utility
+>   (`fact_checker.py`) but is not currently wired into the knowledge graph
+>   operations as the README implied.
+>
+> - **"100% with Haiku rerank"** is real (we have the result files) but
+>   the rerank pipeline is not in the public benchmark scripts. We're
+>   adding it.
+>
+> **What's still true and reproducible:**
+>
+> - **96.6% R@5 on LongMemEval in raw mode**, on 500 questions, zero API
+>   calls — independently reproduced on M2 Ultra in under 5 minutes by
+>   [@gizmax](https://github.com/MemPalace/mempalace/issues/39).
+> - Local, free, no subscription, no cloud, no data leaving your machine.
+> - The architecture (wings, rooms, closets, drawers) is real and useful,
+>   even if it's not a magical retrieval boost.
+>
+> **What we're doing:**
+>
+> 1. Rewriting the AAAK example with real tokenizer counts and a scenario
+>    where AAAK actually demonstrates compression
+> 2. Adding `mode raw / aaak / rooms` clearly to the benchmark
+>    documentation so the trade-offs are visible
+> 3. Wiring `fact_checker.py` into the KG ops so the contradiction
+>    detection claim becomes true
+> 4. Pinning the vector store dependency to a tested range (issue #100),
+>    fixing the shell injection in hooks (#110), and addressing the macOS
+>    ARM64 segfault (#74)
+>
+> **Thank you to everyone who poked holes in this.** Brutal honest
+> criticism is exactly what makes open source work, and it's what we asked
+> for. Special thanks to
+> [@panuhorsmalahti](https://github.com/MemPalace/mempalace/issues/43),
+> [@lhl](https://github.com/MemPalace/mempalace/issues/27),
+> [@gizmax](https://github.com/MemPalace/mempalace/issues/39), and everyone
+> who filed an issue or a PR in the first 48 hours. We're listening, we're
+> fixing, and we'd rather be right than impressive.
+>
+> — *Milla Jovovich & Ben Sigman*
@@ -22,6 +22,8 @@ import pytest
 REPO_ROOT = Path(__file__).resolve().parent.parent
 MEMPALACE_PKG = REPO_ROOT / "mempalace"
 README_PATH = REPO_ROOT / "README.md"
+MCP_TOOLS_DOC_PATH = REPO_ROOT / "website" / "reference" / "mcp-tools.md"
+MODULES_DOC_PATH = REPO_ROOT / "website" / "reference" / "modules.md"


 def _read(path: Path) -> str:
@@ -40,10 +42,15 @@ def _tools_dict_keys() -> list:
    return re.findall(r'"(mempalace_\w+)":\s*\{', src)


-def _readme_tool_table_names() -> list:
-    """Return tool names listed in the README's MCP tool table."""
-    readme = _readme()
-    return re.findall(r"^\| `(mempalace_\w+)`", readme, re.MULTILINE)
+def _doc_tool_names() -> list:
+    """Return the list of tool names documented in the MCP tools reference.
+
+    The MCP tool table lived in README.md prior to the #875 rewrite; it now
+    lives in website/reference/mcp-tools.md (linked from README). Each tool
+    is introduced by a level-3 heading `### \\`mempalace_xxx\\``.
+    """
+    doc = _read(MCP_TOOLS_DOC_PATH)
+    return re.findall(r"^###\s+`(mempalace_\w+)`", doc, re.MULTILINE)


 # ---------------------------------------------------------------------------
@@ -77,19 +84,28 @@ class TestToolCount:


 class TestReadmeToolsExistInCode:
-    """Every tool name in the README tool table must be a key in TOOLS."""
+    """Every tool name documented in the MCP tools reference must be a key in TOOLS."""

    def test_every_readme_tool_exists_in_tools_dict(self):
-        """Claim: README lists tools like mempalace_get_aaak_spec.
-        Each one must actually be registered in the TOOLS dict."""
-        code_tools = set(_tools_dict_keys())
-        readme_tools = _readme_tool_table_names()
-        assert len(readme_tools) > 0, "Could not parse any tools from README table"
+        """Claim: the MCP tools reference (website/reference/mcp-tools.md)
+        lists tools like mempalace_get_aaak_spec. Each one must actually be
+        registered in the TOOLS dict in mempalace/mcp_server.py.

-        missing = [t for t in readme_tools if t not in code_tools]
+        Pre-#875 this parsed the tool table that lived in README.md; that
+        table has moved to the website docs and README now links out.
+        """
+        code_tools = set(_tools_dict_keys())
+        doc_tools = _doc_tool_names()
+        assert len(doc_tools) > 0, (
+            f"Could not parse any tools from {MCP_TOOLS_DOC_PATH.relative_to(REPO_ROOT)} "
+            f"— expected `### \\`mempalace_xxx\\`` headings."
+        )
+
+        missing = [t for t in doc_tools if t not in code_tools]
        assert missing == [], (
-            f"README lists tools that don't exist in TOOLS dict: {missing}. "
-            f"Either add them to mcp_server.py or remove them from README."
+            f"Docs list tools that don't exist in TOOLS dict: {missing}. "
+            f"Either add them to mcp_server.py or remove them from "
+            f"{MCP_TOOLS_DOC_PATH.relative_to(REPO_ROOT)}."
        )


@@ -99,18 +115,20 @@ class TestReadmeToolsExistInCode:


 class TestNoUnlistedTools:
-    """Every tool in the TOOLS dict should be documented in the README."""
+    """Every tool in the TOOLS dict should be documented in the MCP tools reference."""

    def test_no_undocumented_tools(self):
-        """Claim: README's tool table is complete.
-        Any tool in TOOLS but not in README is undocumented."""
+        """Claim: the MCP tools reference
+        (website/reference/mcp-tools.md) is complete. Any tool in TOOLS
+        but not documented there is undocumented on the public surface."""
        code_tools = set(_tools_dict_keys())
-        readme_tools = set(_readme_tool_table_names())
+        doc_tools = set(_doc_tool_names())

-        undocumented = sorted(code_tools - readme_tools)
+        undocumented = sorted(code_tools - doc_tools)
        assert undocumented == [], (
-            f"Tools in TOOLS dict but missing from README: {undocumented}. "
-            f"Add rows for these to the tool table in README.md."
+            f"Tools in TOOLS dict but missing from docs: {undocumented}. "
+            f"Add sections for these to "
+            f"{MCP_TOOLS_DOC_PATH.relative_to(REPO_ROOT)}."
        )


@@ -485,21 +503,27 @@ class TestDialectNotLossless:


 class TestReadmeDialectNotLossless:
-    """README's file reference table must not say dialect.py is lossless."""
+    """The file-reference documentation must not say dialect.py is lossless.
+
+    Pre-#875 this lived in a README.md file table; it now lives in
+    website/reference/modules.md. The April 7 correction established that
+    AAAK is a lossy abbreviation system, not lossless compression, and
+    every docs surface that describes dialect.py must respect that.
+    """

    def test_readme_dialect_line_not_lossless(self):
-        """Claim: April 7 correction applied to README file table.
-        The dialect.py row must not say 'lossless'."""
-        readme = _readme()
-        # Find the line with dialect.py in the file reference table
-        dialect_lines = [
-            line for line in readme.splitlines() if "dialect.py" in line and "|" in line
-        ]
-        assert len(dialect_lines) > 0, "Could not find dialect.py in README file table"
+        doc = _read(MODULES_DOC_PATH)
+        # Any line mentioning dialect.py (narrative or table) must not call it lossless
+        dialect_lines = [line for line in doc.splitlines() if "dialect.py" in line]
+        assert len(dialect_lines) > 0, (
+            f"Could not find dialect.py in "
+            f"{MODULES_DOC_PATH.relative_to(REPO_ROOT)}. "
+            f"Expected at least one reference."
+        )

        for line in dialect_lines:
            assert "lossless" not in line.lower(), (
-                f"README file table still says dialect.py is lossless: {line.strip()!r}. "
+                f"Docs still call dialect.py lossless: {line.strip()!r}. "
                f"After April 7 correction, this must say 'lossy' or remove the lossless claim."
            )

@@ -80,12 +80,11 @@ The knowledge graph uses SQLite with two tables:

 Database location: `~/.mempalace/knowledge_graph.sqlite3`

-## Comparison
+## Related Work

-| Feature | MemPalace | Zep (Graphiti) |
-|---------|-----------|----------------|
-| Storage | SQLite (local) | Neo4j (cloud) |
-| Cost | Free | $25/mo+ |
-| Temporal validity | Yes | Yes |
-| Self-hosted | Always | Enterprise only |
-| Privacy | Everything local | SOC 2, HIPAA |
+Temporal entity-relationship graphs are a familiar pattern — Zep's
+Graphiti, for example, also exposes a bi-temporal model. MemPalace's
+knowledge graph is local-first (SQLite, everything on disk) and free;
+Zep is a managed service backed by Neo4j with its own pricing, SLAs,
+and compliance surface. See Zep's own [documentation](https://www.getzep.com/)
+for authoritative details on their deployment model.
@@ -92,16 +92,9 @@ The original stored text chunks. This is the primary retrieval layer used by the

 ## Why Structure Matters

-Tested on 22,000+ real conversation memories:
+Wing and room identifiers become metadata filters at query time. Narrowing a search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which is useful when you have many unrelated projects or people filed in the same palace.

-| Search scope | R@10 | Improvement |
-|-------------|------|-------------|
-| All closets | 60.9% | baseline |
-| Within wing | 73.1% | +12% |
-| Wing + hall | 84.8% | +24% |
-| Wing + room | 94.8% | +34% |
-
-The practical point is that structure improves retrieval. In the project benchmarks, narrowing the search scope by wing and room outperformed searching the entire corpus at once.
+This is standard metadata filtering in the underlying vector store, not a novel retrieval mechanism. The useful property here is operational — clear scoping rules that a human or an agent can apply predictably — not a magic retrieval boost.

 ## Navigation

@@ -23,23 +23,16 @@ mempalace search "deploy process" --results 10

 ## How Search Works

-1. Your query is embedded using ChromaDB's default model (`all-MiniLM-L6-v2`)
-2. The embedding is compared against all drawers using cosine similarity
-3. Optional wing/room filters narrow the search scope
-4. Results are returned with similarity scores and source metadata
+1. Your query is embedded using the vector store's default model (`all-MiniLM-L6-v2` with the default ChromaDB backend).
+2. The embedding is compared against all drawers using cosine similarity.
+3. Optional wing/room filters narrow the search scope — standard metadata filtering in the underlying vector store.
+4. Results are returned with similarity scores and source metadata.

-### Why Structure Matters
+### Why Scoping Matters

-Tested on 22,000+ real conversation memories:
+Wing/room filtering is useful when a single palace contains many unrelated projects or people. Narrowing the search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which keeps retrieval predictable as the palace grows.

-```
-Search all closets:          60.9%  R@10
-Search within wing:          73.1%  (+12%)
-Search wing + hall:          84.8%  (+24%)
-Search wing + room:          94.8%  (+34%)
-```
-
-Wings and rooms aren't cosmetic — they're a **34% retrieval improvement**.
+This is a metadata-filter feature of the vector store, not a novel retrieval mechanism. Treat it as an operational convenience: clear scoping rules that a human or an agent can apply predictably.
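
Editorial sketch of the scoping behaviour described above: filter on wing/room metadata first, then rank only the surviving candidates by cosine similarity. This is a toy in-memory illustration, not MemPalace's implementation; the `Drawer` class, the `scoped_search` function, and the two-dimensional embeddings are all hypothetical names and values chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Drawer:
    # Hypothetical stand-in for a stored drawer and its metadata.
    id: str
    wing: str
    room: str
    embedding: list[float]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(drawers, query_vec, wing=None, room=None, n_results=5):
    """Apply the metadata filter, then score only the remaining drawers."""
    candidates = [
        d for d in drawers
        if (wing is None or d.wing == wing) and (room is None or d.room == room)
    ]
    ranked = sorted(
        candidates, key=lambda d: cosine(query_vec, d.embedding), reverse=True
    )
    return [d.id for d in ranked[:n_results]]


drawers = [
    Drawer("d1", "project_api", "auth", [1.0, 0.0]),
    Drawer("d2", "project_api", "deploy", [0.6, 0.8]),
    Drawer("d3", "project_database", "schema", [0.9, 0.1]),
]

unscoped = scoped_search(drawers, [1.0, 0.0])
scoped = scoped_search(drawers, [1.0, 0.0], wing="project_api")
```

In the unscoped call, the drawer from `project_database` competes with the `project_api` drawers; in the scoped call it is never scored at all. The filter changes which candidates are eligible, not how similarity is computed.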

 ## Programmatic Search

@@ -4,7 +4,7 @@ layout: home
 hero:
  name: MemPalace
  text: Give your AI a memory.
-  tagline: "96.6% recall on LongMemEval in raw mode. Local-first, open source, and usable without an API key."
+  tagline: "Local-first AI memory. Verbatim storage, pluggable backend, 96.6% R@5 raw on LongMemEval — zero API calls."
  image:
    src: /mempalace_logo.png
    alt: MemPalace
@@ -34,7 +34,7 @@ features:
      src: /icons/search.svg
      alt: Semantic Search
    title: Semantic Search
-    details: ChromaDB-powered vector search lets the model retrieve past discussions by topic, project, or room.
+    details: Vector search over verbatim content lets the model retrieve past discussions by topic, project, or room. Backend is pluggable.
  - icon:
      src: /icons/git-merge.svg
      alt: Knowledge Graph
@@ -49,7 +49,7 @@ features:
      src: /icons/shield-check.svg
      alt: Zero Cloud
    title: Zero Cloud
-    details: Core storage and retrieval run locally on ChromaDB and SQLite. Optional reranking features can add an API dependency.
+    details: Core storage and retrieval run locally. Optional reranking features can add an API dependency but are not required for the benchmark path.
 ---

 <style>
@@ -68,20 +68,21 @@ features:

 ## Verbatim Retrieval First

-MemPalace starts from a simple premise: **store the source text and retrieve it well**. The benchmarked raw mode does not require an LLM extraction step.
+MemPalace stores source text and retrieves it with semantic search. The benchmarked raw mode does not require an LLM at any stage — no extraction, no rerank, no summarisation.

-| System | LongMemEval R@5 | API Required | Cost |
-|--------|----------------|--------------|------|
-| **MemPalace (hybrid)** | **100%** | Optional | Free |
-| Supermemory ASMR | ~99% | Yes | — |
-| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
-| Mastra | 94.87% | Yes | API costs |
-| Mem0 | ~85% | Yes | $19–249/mo |
+**LongMemEval retrieval recall (500 questions):**

-The raw 96.6% LongMemEval result is the baseline story: strong recall without requiring an API key or an LLM in the retrieval pipeline.
+| Mode | R@5 | LLM required |
+|---|---|---|
+| Raw (semantic search over verbatim text) | **96.6%** | None |
+| Hybrid v4, held-out 450q | **98.4%** | None |
+
+The raw 96.6% reproduces on any machine with the committed dataset: result JSONLs, the `seed=42` train/held-out split, and the `--mode raw` / `--held-out` runners are all in the `benchmarks/` directory of the repo.
+
+We deliberately do not publish a side-by-side comparison against other memory systems on this page. Retrieval recall (R@5) and end-to-end QA accuracy are different metrics and are not comparable; where MemPalace can be fairly compared on the same metric, we link to the other project's published source.

 <div style="text-align: center; padding-top: 16px;">
-  <a href="./reference/benchmarks" style="color: var(--vp-c-brand-1); font-weight: 500;">Full benchmark results →</a>
+  <a href="./reference/benchmarks" style="color: var(--vp-c-brand-1); font-weight: 500;">Full benchmark methodology →</a>
 </div>

 </div>
@@ -1,28 +1,51 @@
 # Benchmarks

-Curated summary of MemPalace benchmark results. For the full 725-line progression with every experiment, see [`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md) in the repository.
+Curated summary of MemPalace's reproducible benchmark results. For the
+complete progression with every experiment, see
+[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
+All headline numbers on this page are reproducible from the committed
+repository — datasets, scripts, and per-question result JSONLs are all
+checked in.

 ## The Core Finding

-MemPalace's benchmarked raw baseline stores the source text and searches it with ChromaDB's default embeddings. No extraction layer or summarization step is required for that baseline.
+MemPalace's benchmarked raw baseline stores the source text and searches
+it with the vector store's default embeddings. No extraction or
+summarisation step is required for that baseline, and it reproduces at
+**96.6% R@5** on LongMemEval with no LLM at any stage.

-**And it scores 96.6% on LongMemEval.**
+## LongMemEval — Retrieval Recall

-## LongMemEval Results
+Retrieval recall asks: is the labelled session for this question inside
+the top-K retrieved sessions? It is not the same metric as end-to-end QA
+accuracy; a system can have perfect retrieval recall and poor QA answer
+quality, and vice versa.

-| Mode | R@5 | LLM Required | Cost/query |
-|------|-----|-------------|------------|
-| Raw ChromaDB | **96.6%** | None | $0 |
-| Hybrid v3 + rerank | 99.4% | Haiku | ~$0.001 |
-| Palace + rerank | 99.4% | Haiku | ~$0.001 |
-| **Hybrid v4 + rerank** | **100%** | Haiku | ~$0.001 |
+**Full 500 questions:**

-The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The 100% result uses optional Haiku reranking.
+| Mode | R@5 | LLM required | Cost/query |
+|---|---|---|---|
+| Raw — vector search over verbatim sessions | **96.6%** | None | $0 |
+| Hybrid v4 — keyword/temporal/preference boosts, no LLM | 98.6% | None | $0 |
+| Hybrid v4 + LLM rerank (minimax-m2.7 via Ollama) | 99.2% | Any capable model | $0 local / varies cloud |

-### Per-Category Breakdown (Raw, 96.6%)
+**Held-out set (450 questions, never used during `hybrid_v4` development):**

-| Question Type | R@5 | Count |
-|---------------|-----|-------|
+| Mode | R@5 | R@10 | NDCG@10 |
+|---|---|---|---|
+| Hybrid v4 | **98.4%** | 99.8% | 0.938 |
+
+The held-out figure is the honest generalisable number. The full-500
+scores are higher but include the 50 "dev" questions that hybrid_v4's
+three targeted fixes (quoted-phrase boost, person-name boost, nostalgia
+patterns) were developed against. `benchmarks/BENCHMARKS.md` calls this
+"teaching to the test" and the held-out 98.4% is the clean number to
+quote when a single R@5 figure is needed for the hybrid pipeline.
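
Editorial sketch of why a fixed seed makes a 50/450 split reproducible. The actual procedure behind the committed split file is not shown on this page, so the function below is an assumption for illustration only; `split_dev_heldout` and the synthetic question ids are hypothetical. To reproduce the published numbers, use the committed split file rather than regenerating one.

```python
import random


def split_dev_heldout(question_ids, dev_size=50, seed=42):
    """Deterministically split ids into a dev set and a held-out set."""
    ordered = sorted(question_ids)      # fixed input order, independent of caller
    rng = random.Random(seed)           # private RNG, unaffected by global state
    dev = set(rng.sample(ordered, dev_size))
    held_out = [q for q in ordered if q not in dev]
    return sorted(dev), held_out


# Synthetic ids standing in for the 500 LongMemEval questions.
ids = [f"q{i:03d}" for i in range(500)]
dev, held_out = split_dev_heldout(ids)
```

Sorting the ids before sampling and using a dedicated `random.Random` instance means the result depends only on the id set, the dev size, and the seed.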
+
+### Per-category breakdown (raw, 96.6%)
+
+| Question type | R@5 | Count |
+|---|---|---|
 | Knowledge update | 99.0% | 78 |
 | Multi-session | 98.5% | 133 |
 | Temporal reasoning | 96.2% | 133 |
@@ -30,66 +53,95 @@ The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The
 | Single-session preference | 93.3% | 30 |
 | Single-session assistant | 92.9% | 56 |

-### Held-Out Validation
+## LoCoMo — Retrieval Recall

-**98.4% R@5** on 450 questions that hybrid_v4 was never tuned on — confirming the improvements generalize.
+LoCoMo contains 1,986 questions across 10 long conversations (19–32
+sessions each).

-## Comparison vs Published Systems
+| Mode | R@10 | LLM required |
+|---|---|---|
+| Session, no rerank, top-10 | 60.3% | None |
+| Hybrid v5 (keyword + predicate boosts), top-10 | 88.9% | None |

-| System | LongMemEval R@5 | API Required | Cost |
-|--------|----------------|--------------|------|
-| **MemPalace (hybrid)** | **100%** | Optional | Free |
-| Supermemory ASMR | ~99% | Yes | — |
-| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
-| Mastra | 94.87% | Yes | API costs |
-| Hindsight | 91.4% | Yes | API costs |
-| Mem0 | ~85% | Yes | $19–249/mo |
+We do not publish a "100% R@10" headline for LoCoMo. A reported 100% in
+earlier drafts used `top_k=50`, which exceeds the per-conversation
+session count (19–32) — so the retrieval stage returns every session in
+every conversation by construction. That number measures an LLM's
+reading comprehension over the whole conversation, not retrieval. The
+honest retrieval-recall number for LoCoMo is the top-10 figure.
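
Editorial sketch of the `top_k=50` artefact described above. It assumes a hit-style recall (1.0 if the labelled session appears in the top-k, else 0.0), which mirrors the definition given earlier on this page; the benchmark scripts may compute it differently. The `recall_at_k` helper and the session ids are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Return 1.0 if any labelled session is within the top-k, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0


# One conversation with 25 sessions (LoCoMo conversations have 19 to 32).
sessions = [f"session_{i}" for i in range(25)]
gold = ["session_7"]

# Worst possible retriever: the labelled session is ranked dead last.
worst_ranking = [s for s in sessions if s not in gold] + gold

recall_top10 = recall_at_k(worst_ranking, gold, k=10)
recall_top50 = recall_at_k(worst_ranking, gold, k=50)
```

Even the worst possible ranking scores a hit once `k` is at least the number of sessions, because the top-k slice is then the whole conversation. At `k=10` the same ranking misses, which is why the top-10 figure is the one that measures retrieval.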

 ## Other Benchmarks

-### ConvoMem (Salesforce, 75K+ QA pairs)
+**ConvoMem** (Salesforce; 50 items per category × 5 categories = 250
+items): MemPalace raw retrieval reaches **92.9% avg recall**. Strongest
+categories: Assistant Facts 100%, User Facts 98%. Weakest: Preferences
+86%. The Salesforce dataset contains ~75K items in total; our headline
+number is from the 250-item sample the benchmark script was designed
+around.

-| System | Score |
-|--------|-------|
-| **MemPalace** | **92.9%** |
-| Gemini (long context) | 70–82% |
-| Block extraction | 57–71% |
-| Mem0 (RAG) | 30–45% |
+**MemBench** (ACL 2025; 8,500 items, all topics): MemPalace hybrid
+top-5 reaches **80.3% R@5 overall**. Strongest: aggregative 99.3%,
+comparative 98.4%, lowlevel_rec 99.8%. Weakest: noisy 43.4%
+(distractor-heavy by design), conditional 57.3%.

-On this benchmark, MemPalace materially outperforms the Mem0 result cited in the comparison table.
+## Why We Don't Publish a Cross-System Comparison Table

-### LoCoMo (1,986 multi-hop QA pairs)
+Previous versions of this page placed MemPalace's retrieval recall (R@5)
+next to other projects' end-to-end QA accuracy figures under a single
+"LongMemEval R@5" column. Those are different metrics and are not
+comparable. A system can have 100% retrieval recall and 40% QA
+accuracy, and vice versa.

-| Mode | R@10 | LLM |
-|------|------|-----|
-| Hybrid v5 + Sonnet rerank (top-50) | **100%** | Sonnet |
-| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
-| Hybrid v5 (top-10, no rerank) | **88.9%** | None |
-| Session, no rerank (top-10) | 60.3% | None |
+If you are evaluating memory systems against MemPalace and want a fair
+comparison, use the retrieval-recall numbers above and the benchmark
+scripts in the repo; or pick the metric the other project publishes and
+compare on that. Each project's published source is the correct
+reference:

-### MemBench (ACL 2025, 8,500 items)
-
-**80.3% R@5** overall. Strongest categories: aggregative (99.3%), comparative (98.4%), lowlevel_rec (99.8%).
+- [Mastra — Observational Memory](https://mastra.ai/research/observational-memory)
+  (their published metric is binary QA accuracy with GPT-5-mini)
+- [Mem0 — Research](https://mem0.ai/research)
+  (their published LoCoMo metric is end-to-end QA accuracy, not retrieval recall)
+- [Supermemory — ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/)
+  (their published metric is QA accuracy; authors explicitly frame the
+  ensemble as an experimental proof-of-concept, not production)

 ## Reproducing Results

-All benchmarks are reproducible with public datasets:
+Every benchmark runs deterministically from this repository.

 ```bash
 git clone https://github.com/MemPalace/mempalace.git
 cd mempalace
-pip install chromadb pyyaml
+pip install -e ".[dev]"

-# Download LongMemEval data
+# LongMemEval — raw (96.6%)
 curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
-
-# Run raw baseline (96.6%, no API key needed)
 python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json
+
+# LongMemEval — hybrid v4 on the held-out 450 (98.4%)
+python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
+  --mode hybrid_v4 --held-out --split-file benchmarks/lme_split_50_450.json
+
+# LoCoMo — session, top-10 (60.3%)
+git clone https://github.com/snap-research/locomo.git /tmp/locomo
+python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json \
+  --granularity session --top-k 10
+
+# LongMemEval — hybrid v4 + rerank, any OpenAI-compatible endpoint
+python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
+  --mode hybrid_v4 --llm-rerank \
+  --llm-backend ollama --llm-model <your-model-tag>
 ```

 ::: tip
-Results are deterministic. Same data + same script = same result every time. Every result JSONL file contains every question, every retrieved document, every score.
+Results are deterministic: same data, same script, same split seed →
+same score. The committed `benchmarks/results_*.jsonl` files include
+every question, every retrieved corpus id, and every score, so every
+individual answer is auditable — not just the aggregate.
 :::

-For complete reproduction instructions, benchmark integrity notes, and the full score progression, see the [full benchmark documentation](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
+For the complete progression (hybrid v1 → v4, diary mode, palace mode,
+LoCoMo architecture iterations, methodology integrity notes), see
+[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
@@ -68,7 +68,7 @@ If you're planning a significant change, open an issue first. Key principles:
 - **Verbatim first** — never summarize user content. Store exact words.
 - **Local first** — everything runs on the user's machine. No cloud dependencies.
 - **Zero API by default** — core features must work without any API key.
-- **Palace structure matters** — wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement.
+- **Palace structure is scoping, not magic** — wings, halls, and rooms act as metadata filters in the underlying vector store. They make scoping predictable when a palace holds many unrelated projects; they are not a novel retrieval mechanism.

 ## Community

@@ -1,6 +1,6 @@
 # MCP Tools Reference

-Detailed parameter schemas for all 19 MCP tools.
+Detailed parameter schemas for all 29 MCP tools.

 ## Palace — Read Tools

@@ -114,6 +114,48 @@ Delete a drawer by ID. Irreversible.

 ---

+### `mempalace_get_drawer`
+
+Fetch a single drawer by ID — returns full content and metadata.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `drawer_id` | string | **Yes** | ID of the drawer to fetch |
+
+**Returns:** `{ drawer: { id, wing, room, content, ... } }`
+
+---
+
+### `mempalace_list_drawers`
+
+List drawers with pagination. Optional wing/room filter. Returns IDs, wings, rooms, and content previews.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `wing` | string | No | Filter by wing |
+| `room` | string | No | Filter by room |
+| `limit` | integer | No | Max results per page (default 20, max 100) |
+| `offset` | integer | No | Offset for pagination (default 0) |
+
+**Returns:** `{ drawers: [...], total, limit, offset }`
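
Editorial sketch of paging through results using the documented `limit` / `offset` parameters and the `{ drawers, total, limit, offset }` return shape. How the tool is invoked depends on your MCP client, so the example uses a local stand-in: `list_drawers_stub` and `iter_all_drawers` are hypothetical names, and the stub only imitates the documented response.

```python
# Fake palace contents: 250 drawers, 130 of them in wing "a".
_FAKE_DRAWERS = [
    {"id": f"drawer_{i}", "wing": "a" if i < 130 else "b", "room": "notes"}
    for i in range(250)
]


def list_drawers_stub(wing=None, room=None, limit=20, offset=0):
    """Stand-in that imitates the documented return shape of the tool."""
    limit = min(limit, 100)  # documented maximum page size
    matches = [
        d for d in _FAKE_DRAWERS
        if (wing is None or d["wing"] == wing)
        and (room is None or d["room"] == room)
    ]
    return {
        "drawers": matches[offset:offset + limit],
        "total": len(matches),
        "limit": limit,
        "offset": offset,
    }


def iter_all_drawers(list_fn, page_size=100, **filters):
    """Yield every drawer by advancing `offset` until `total` is reached."""
    offset = 0
    while True:
        page = list_fn(limit=page_size, offset=offset, **filters)
        yield from page["drawers"]
        offset += len(page["drawers"])
        # Stop on an empty page as well, in case `total` changes mid-scan.
        if not page["drawers"] or offset >= page["total"]:
            break


wing_a = list(iter_all_drawers(list_drawers_stub, wing="a"))
```

Advancing by the number of drawers actually returned, rather than by the requested page size, keeps the loop correct if the server clamps `limit`.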
+
+---
+
+### `mempalace_update_drawer`
+
+Update an existing drawer's content and/or metadata (wing, room). Fetches the existing drawer first; returns an error if not found.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `drawer_id` | string | **Yes** | ID of the drawer to update |
+| `content` | string | No | New content (omit to keep existing) |
+| `wing` | string | No | New wing (omit to keep existing) |
+| `room` | string | No | New room (omit to keep existing) |
+
+**Returns:** `{ success, drawer_id, updated_fields }`
+
+---
+
 ## Knowledge Graph Tools

 ### `mempalace_kg_query`
@@ -221,6 +263,61 @@ Palace graph overview: nodes, tunnels, edges, connectivity.

 ---

+### `mempalace_create_tunnel`
+
+Create a cross-wing tunnel linking two palace locations. Use when content in one project relates to another — e.g., an API design in `project_api` connects to a database schema in `project_database`.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `source_wing` | string | **Yes** | Wing of the source |
+| `source_room` | string | **Yes** | Room in the source wing |
+| `target_wing` | string | **Yes** | Wing of the target |
+| `target_room` | string | **Yes** | Room in the target wing |
+| `label` | string | No | Description of the connection |
+| `source_drawer_id` | string | No | Specific source drawer ID |
+| `target_drawer_id` | string | No | Specific target drawer ID |
+
+**Returns:** `{ success, tunnel_id, source, target }`
+
+---
+
+### `mempalace_list_tunnels`
+
+List all explicit cross-wing tunnels. Optionally filter by wing.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `wing` | string | No | Filter tunnels by wing (source or target) |
+
+**Returns:** `{ tunnels: [...], count }`
+
+---
+
+### `mempalace_delete_tunnel`
+
+Delete an explicit tunnel by its ID.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `tunnel_id` | string | **Yes** | Tunnel ID to delete |
+
+**Returns:** `{ success, tunnel_id }`
+
+---
+
+### `mempalace_follow_tunnels`
+
+Follow tunnels from a room to see what it connects to in other wings. Returns connected rooms with drawer previews.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `wing` | string | **Yes** | Wing to start from |
+| `room` | string | **Yes** | Room to follow tunnels from |
+
+**Returns:** `[{ wing, room, label, previews }]`
+
+---
+
 ## Agent Diary Tools

 ### `mempalace_diary_write`
@@ -247,3 +344,38 @@ Read recent diary entries.
 | `last_n` | integer | No | Number of recent entries (default: 10) |

 **Returns:** `{ agent, entries: [{ date, timestamp, topic, content }], total, showing }`
+
+---
+
+## System Tools
+
+### `mempalace_hook_settings`
+
+Get or set auto-save hook behaviour. `silent_save=true` saves directly without MCP-level clutter; `silent_save=false` uses the legacy blocking path. `desktop_toast=true` surfaces a desktop notification when a save completes. Call with no arguments to view the current settings.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `silent_save` | boolean | No | `true` = silent direct save, `false` = blocking MCP calls |
+| `desktop_toast` | boolean | No | `true` = show desktop toast via `notify-send` |
+
+**Returns:** `{ silent_save, desktop_toast }`
+
+---
+
+### `mempalace_memories_filed_away`
+
+Check whether a recent palace checkpoint was saved. Returns message count and timestamp of the last save.
+
+**Parameters:** None
+
+**Returns:** `{ filed, message_count, timestamp }`
+
+---
+
+### `mempalace_reconnect`
+
+Force a reconnect to the palace database. Use this after external scripts or CLI commands have modified the palace directly, which can leave the in-memory HNSW index stale.
+
+**Parameters:** None
+
+**Returns:** `{ success, palace_path }`