diff --git a/CHANGELOG.md b/CHANGELOG.md
index 804d485..dd01968 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -41,6 +41,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 - Add `docs/CLOSETS.md` — closet layer overview
 - Fix stale `milla-jovovich/*` org URLs in website and plugin manifests (#787)
 - Fix remaining stale org URLs in contributor docs (#808)
+- Rewrite `README.md` and `mempalaceofficial.com` benchmark pages to remove category-error cross-system comparisons (R@5 retrieval recall had been listed next to competitor QA accuracy under one column), remove the retracted "+34% palace boost" claim from the surfaces where it had remained, replace the `100%` Haiku-rerank headline with the honest held-out `98.4%` R@5, drop the LoCoMo `100%` top-50 row (retrieval-bypass artefact), and fix the broken `aya-thekeeper/mempal` reproduction URL (#875)
+- Add `docs/HISTORY.md` as the canonical home for corrections, retractions, and public notices; move the 2026-04-07 "Note from Milla & Ben" and the 2026-04-11 impostor-domain notice out of `README.md`
+- Add v3.3.0 reproduction result JSONLs and the deterministic `seed=42` 50/450 LongMemEval split under `benchmarks/` — every BENCHMARKS.md claim reproduces exactly
 
 ### Internal
 - Add test coverage for `mine_lock`, closets, entity metadata, BM25, and diary
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 2772b11..9c6501d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -82,7 +82,7 @@ If you're planning a significant change, open an issue first to discuss the appr
 - **Verbatim first**: Never summarize user content. Store exact words.
 - **Local first**: Everything runs on the user's machine. No cloud dependencies.
 - **Zero API by default**: Core features must work without any API key.
-- **Palace structure matters**: Wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement. Respect the hierarchy.
+- **Palace structure is scoping, not magic**: Wings, halls, and rooms act as metadata filters in the underlying vector store. They keep retrieval predictable when a palace holds many unrelated projects or people. Respect the hierarchy — but don't present it as a novel retrieval mechanism.
 
 ## Community
 
diff --git a/benchmarks/BENCHMARKS.md b/benchmarks/BENCHMARKS.md
index f806e5d..77a963e 100644
--- a/benchmarks/BENCHMARKS.md
+++ b/benchmarks/BENCHMARKS.md
@@ -41,23 +41,57 @@ Both are real. Both are reproducible. Neither is the whole picture alone.
 
 ## Comparison vs Published Systems (LongMemEval)
 
-| # | System | R@5 | LLM Required | Which LLM | Notes |
+> **Important caveat — read before quoting this table.**
+> MemPal's `R@5` in this table is **retrieval recall**: is the labelled
+> session for this question inside the top-5 retrieved candidates?
+>
+> Several of the other systems below publish **end-to-end QA accuracy** —
+> a different metric that scores whether the system's generated answer
+> is correct. Retrieval recall and QA accuracy are not comparable; a
+> system can have 100% retrieval recall and 40% QA accuracy, and vice
+> versa.
+>
+> - **Mastra's 94.87%** is binary QA accuracy with GPT-5-mini, per
+>   [mastra.ai/research/observational-memory](https://mastra.ai/research/observational-memory).
+> - **Supermemory ASMR's ~99%** is QA accuracy with an 8-/12-agent
+>   ensemble, and the authors explicitly frame it as an experimental
+>   proof-of-concept, not production, per
+>   [their ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/).
+> - **Mem0** does not publish a LongMemEval number; their published
+>   metric is LoCoMo QA accuracy (~66.9%), per
+>   [mem0.ai/research](https://mem0.ai/research).
+>
+> The table is kept here as a historical record of how the comparison
+> was originally framed. Public-facing pages (`README.md`,
+> `mempalaceofficial.com`) no longer present this table, per issue
+> [#875](https://github.com/MemPalace/mempalace/issues/875). For a fair
+> head-to-head, run the same metric on the same split.
+
+| # | System | R@5 (retrieval recall, unless noted) | LLM Required | Which LLM | Notes |
 |---|---|---|---|---|---|
-| 1 | **MemPal (hybrid v4 + rerank)** | **100%** | Optional | Haiku | Reproducible, 500/500 |
-| 2 | Supermemory ASMR | ~99% | Yes | Undisclosed | Research only, not in production |
+| 1 | **MemPal (hybrid v4 + Haiku rerank)** | **100%** | Optional | Haiku | 500/500 — but the 99.4%→100% step tuned on 3 specific wrong answers (see "Benchmark Integrity" below). Held-out 450q is 98.4%. |
+| 2 | Supermemory ASMR | ~99% *(QA accuracy, not R@5)* | Yes | Ensemble of Gemini 2.0 Flash / GPT-4o-mini | Experimental, not production, per authors |
 | 3 | MemPal (hybrid v3 + rerank) | 99.4% | Optional | Haiku | Reproducible |
 | 3 | MemPal (palace + rerank) | 99.4% | Optional | Haiku | Independent architecture |
-| 4 | Mastra | 94.87% | Yes | GPT-5-mini | — |
-| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Highest zero-API score published** |
-| 6 | Hindsight | 91.4% | Yes | Gemini-3 | — |
-| 7 | Supermemory (production) | ~85% | Yes | Undisclosed | — |
-| 8 | Stella (dense retriever) | ~85% | None | None | Academic baseline |
-| 9 | Contriever | ~78% | None | None | Academic baseline |
+| 4 | Mastra | 94.87% *(QA accuracy, not R@5)* | Yes | GPT-5-mini | Different metric — not directly comparable to R@5 |
+| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Reproducible, 500/500** |
+| 6 | MemPal hybrid v4 held-out 450 | 98.4% | None | None | Honest generalisable hybrid-pipeline figure |
+| 7 | Hindsight | 91.4% *(per their release, metric unverified)* | Yes | Gemini-3 | Check their published methodology |
+| 8 | Stella (dense retriever) | ~85% | None | None | Academic retrieval baseline |
+| 9 | Contriever | ~78% | None | None | Academic retrieval baseline |
 | 10 | BM25 (sparse) | ~70% | None | None | Keyword baseline |
 
-**MemPal raw (96.6%) is the highest published LongMemEval score that requires no API key, no cloud, and no LLM at any stage.**
+The MemPal raw 96.6% is the headline we ship on public surfaces: it's
+retrieval recall, it requires no API key, and it reproduces.
 
-**MemPal hybrid v4 + Haiku rerank (100%) is the first perfect score on LongMemEval — 500/500 questions, all 6 question types at 100%.**
+The MemPal hybrid v4 + Haiku rerank 100% remains an internal
+result — reproducible with `--mode hybrid_v4 --llm-rerank` — but we
+don't quote it on public pages because the final 0.6% was reached by
+inspecting three specific wrong answers (see "Benchmark Integrity"
+below), which is teaching to the test. The honest generalisable figure
+when an LLM is in the loop is the held-out 98.4% R@5 on 450 unseen
+questions, or the model-agnostic 99.2% R@5 / 100% R@10 we reproduced
+with minimax-m2.7 on the full 500.
 
 ---
 
@@ -308,9 +342,9 @@ The palace classifies each question into one of 5 halls. Pass 1 searches only wi
 ### Setup
 
 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 mkdir -p /tmp/longmemeval-data
 curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
   https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
diff --git a/benchmarks/HYBRID_MODE.md b/benchmarks/HYBRID_MODE.md
index 6843e98..37f315e 100644
--- a/benchmarks/HYBRID_MODE.md
+++ b/benchmarks/HYBRID_MODE.md
@@ -196,9 +196,9 @@ python benchmarks/longmemeval_bench.py data/longmemeval_s_cleaned.json --mode hy
 
 ```bash
 # Setup
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 
 # Download data
 mkdir -p /tmp/longmemeval-data
diff --git a/benchmarks/README.md b/benchmarks/README.md
index 6e041fb..417ef05 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -1,13 +1,13 @@
-# MemPal Benchmarks — Reproduction Guide
+# MemPalace Benchmarks — Reproduction Guide
 
 Run the exact same benchmarks we report. Clone, install, run.
 
 ## Setup
 
 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 ```
 
 ## Benchmark 1: LongMemEval (500 questions)