From bf3b9c5979227fe6f37b1b21203e7e3ca6e5d820 Mon Sep 17 00:00:00 2001
From: Igor Lins e Silva <4753812+igorls@users.noreply.github.com>
Date: Tue, 14 Apr 2026 21:38:00 -0300
Subject: [PATCH] =?UTF-8?q?docs:=20#875=20follow-up=20=E2=80=94=20repo=20s?=
 =?UTF-8?q?urfaces=20+=20reproduction=20URLs=20+=20CHANGELOG?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix the remaining in-repo surfaces that carry the same retracted or
broken claims as the public pages fixed in the previous two commits.

CONTRIBUTING.md
 - "Palace structure matters ... 34% retrieval improvement" → reframed
   as scoping (same rewording applied to the website equivalents).

benchmarks/BENCHMARKS.md
 - Add a prominent "Important caveat" block at the top of the
   "Comparison vs Published Systems" table explaining that R@5
   (retrieval recall) and QA accuracy are different metrics, with
   citations to Mastra, Mem0, and Supermemory's own published
   methodology pages. Annotate the specific competitor rows whose
   numbers are QA accuracy, not retrieval recall.
 - Annotate the `hybrid v4 + rerank 100%` row to note that the 99.4
   → 100 step was tuned on 3 specific wrong answers (already disclosed
   further down in the doc under "Benchmark Integrity"); the honest
   hybrid figure is held-out 98.4%.
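 - Fix the broken clone URL: `aya-thekeeper/mempal` no longer points
   at anything, so clone `MemPalace/mempalace` instead. Drop the
   `-b ben/benchmarking` branch flag and install with
   `pip install -e ".[dev]"` rather than `pip install chromadb pyyaml`.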

benchmarks/README.md + benchmarks/HYBRID_MODE.md
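 - Same clone-URL fix applied, plus `pip install -e ".[dev]"`.
 - README title renamed from "MemPal Benchmarks" to "MemPalace
   Benchmarks".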

CHANGELOG.md
 - Add three entries under [Unreleased] v3.3.0 (### Documentation):
   the #875 rewrite and its scope, the new `docs/HISTORY.md`, and the
   v3.3.0 reproduction JSONLs with the `seed=42` 50/450 split.
---
 CHANGELOG.md              |  3 ++
 CONTRIBUTING.md           |  2 +-
 benchmarks/BENCHMARKS.md  | 62 ++++++++++++++++++++++++++++++---------
 benchmarks/HYBRID_MODE.md |  6 ++--
 benchmarks/README.md      |  8 ++---
 5 files changed, 59 insertions(+), 22 deletions(-)
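
Note for reviewers on the R@5 vs QA accuracy caveat added to
benchmarks/BENCHMARKS.md: the sketch below is a minimal illustration of
why the two metrics cannot share a column. It is not the repository's
benchmark code. The function names, session ids, and judgements are
hypothetical toy values chosen only to show that a retrieval hit does
not imply a correct generated answer.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5) -> float:
    """Fraction of questions whose labelled session id is in the top-k retrieved ids."""
    hits = sum(g in r[:k] for r, g in zip(retrieved, gold))
    return hits / len(gold)


def qa_accuracy(judged_correct: Sequence[bool]) -> float:
    """Fraction of generated answers judged correct (binary, end-to-end)."""
    return sum(judged_correct) / len(judged_correct)


# Hypothetical toy data: three questions, made-up session ids.
retrieved = [["s1", "s7"], ["s4", "s2"], ["s9", "s3"]]
gold = ["s1", "s2", "s3"]
# Hypothetical judgements: the answer can still be wrong after a retrieval hit.
judged = [True, False, False]

r_at_5 = recall_at_k(retrieved, gold, k=5)
qa = qa_accuracy(judged)
print(f"R@5 = {r_at_5:.1%}, QA accuracy = {qa:.1%}")
```

On this toy input every labelled session is retrieved while only one of
three answers is judged correct, so the same system reports two very
different percentages depending on which metric is quoted.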

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 804d485..dd01968 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -41,6 +41,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 - Add `docs/CLOSETS.md` — closet layer overview
 - Fix stale `milla-jovovich/*` org URLs in website and plugin manifests (#787)
 - Fix remaining stale org URLs in contributor docs (#808)
+- Rewrite `README.md` and `mempalaceofficial.com` benchmark pages to remove category-error cross-system comparisons (R@5 retrieval recall had been listed next to competitor QA accuracy under one column), remove the retracted "+34% palace boost" claim from the surfaces where it had remained, replace the `100%` Haiku-rerank headline with the honest held-out `98.4%` R@5, drop the LoCoMo `100%` top-50 row (retrieval-bypass artefact), and fix the broken `aya-thekeeper/mempal` reproduction URL (#875)
+- Add `docs/HISTORY.md` as the canonical home for corrections, retractions, and public notices; move the 2026-04-07 "Note from Milla & Ben" and the 2026-04-11 impostor-domain notice out of `README.md`
+- Add v3.3.0 reproduction result JSONLs and the deterministic `seed=42` 50/450 LongMemEval split under `benchmarks/` — every BENCHMARKS.md claim reproduces exactly
 
 ### Internal
 - Add test coverage for `mine_lock`, closets, entity metadata, BM25, and diary
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 2772b11..9c6501d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -82,7 +82,7 @@ If you're planning a significant change, open an issue first to discuss the appr
 - **Verbatim first**: Never summarize user content. Store exact words.
 - **Local first**: Everything runs on the user's machine. No cloud dependencies.
 - **Zero API by default**: Core features must work without any API key.
-- **Palace structure matters**: Wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement. Respect the hierarchy.
+- **Palace structure is scoping, not magic**: Wings, halls, and rooms act as metadata filters in the underlying vector store. They keep retrieval predictable when a palace holds many unrelated projects or people. Respect the hierarchy — but don't present it as a novel retrieval mechanism.
 
 ## Community
 
diff --git a/benchmarks/BENCHMARKS.md b/benchmarks/BENCHMARKS.md
index f806e5d..77a963e 100644
--- a/benchmarks/BENCHMARKS.md
+++ b/benchmarks/BENCHMARKS.md
@@ -41,23 +41,57 @@ Both are real. Both are reproducible. Neither is the whole picture alone.
 
 ## Comparison vs Published Systems (LongMemEval)
 
-| # | System | R@5 | LLM Required | Which LLM | Notes |
+> **Important caveat — read before quoting this table.**
+> MemPal's `R@5` in this table is **retrieval recall**: is the labelled
+> session for this question inside the top-5 retrieved candidates?
+>
+> Several of the other systems below publish **end-to-end QA accuracy** —
+> a different metric that scores whether the system's generated answer
+> is correct. Retrieval recall and QA accuracy are not comparable; a
+> system can have 100% retrieval recall and 40% QA accuracy, and vice
+> versa.
+>
+> - **Mastra's 94.87%** is binary QA accuracy with GPT-5-mini, per
+>   [mastra.ai/research/observational-memory](https://mastra.ai/research/observational-memory).
+> - **Supermemory ASMR's ~99%** is QA accuracy with an 8-/12-agent
+>   ensemble, and the authors explicitly frame it as an experimental
+>   proof-of-concept, not production, per
+>   [their ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/).
+> - **Mem0** does not publish a LongMemEval number; their published
+>   metric is LoCoMo QA accuracy (~66.9%), per
+>   [mem0.ai/research](https://mem0.ai/research).
+>
+> The table is kept here as a historical record of how the comparison
+> was originally framed. Public-facing pages (`README.md`,
+> `mempalaceofficial.com`) no longer present this table, per issue
+> [#875](https://github.com/MemPalace/mempalace/issues/875). For a fair
+> head-to-head, run the same metric on the same split.
+
+| # | System | R@5 (retrieval recall, unless noted) | LLM Required | Which LLM | Notes |
 |---|---|---|---|---|---|
-| 1 | **MemPal (hybrid v4 + rerank)** | **100%** | Optional | Haiku | Reproducible, 500/500 |
-| 2 | Supermemory ASMR | ~99% | Yes | Undisclosed | Research only, not in production |
+| 1 | **MemPal (hybrid v4 + Haiku rerank)** | **100%** | Optional | Haiku | 500/500, but the 99.4%→100% step was tuned on 3 specific wrong answers (see "Benchmark Integrity" below). Held-out 450q is 98.4%. |
+| 2 | Supermemory ASMR | ~99% *(QA accuracy, not R@5)* | Yes | Ensemble of Gemini 2.0 Flash / GPT-4o-mini | Experimental, not production, per authors |
 | 3 | MemPal (hybrid v3 + rerank) | 99.4% | Optional | Haiku | Reproducible |
 | 3 | MemPal (palace + rerank) | 99.4% | Optional | Haiku | Independent architecture |
-| 4 | Mastra | 94.87% | Yes | GPT-5-mini | — |
-| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Highest zero-API score published** |
-| 6 | Hindsight | 91.4% | Yes | Gemini-3 | — |
-| 7 | Supermemory (production) | ~85% | Yes | Undisclosed | — |
-| 8 | Stella (dense retriever) | ~85% | None | None | Academic baseline |
-| 9 | Contriever | ~78% | None | None | Academic baseline |
+| 4 | Mastra | 94.87% *(QA accuracy, not R@5)* | Yes | GPT-5-mini | Different metric — not directly comparable to R@5 |
+| 5 | **MemPal (raw, no LLM)** | **96.6%** | **None** | **None** | **Reproducible, 500/500** |
+| 6 | MemPal hybrid v4 held-out 450 | 98.4% | None | None | Honest generalisable hybrid-pipeline figure |
+| 7 | Hindsight | 91.4% *(per their release, metric unverified)* | Yes | Gemini-3 | Check their published methodology |
+| 8 | Stella (dense retriever) | ~85% | None | None | Academic retrieval baseline |
+| 9 | Contriever | ~78% | None | None | Academic retrieval baseline |
 | 10 | BM25 (sparse) | ~70% | None | None | Keyword baseline |
 
-**MemPal raw (96.6%) is the highest published LongMemEval score that requires no API key, no cloud, and no LLM at any stage.**
+The MemPal raw 96.6% is the headline we ship on public surfaces: it's
+retrieval recall, it requires no API key, and it reproduces.
 
-**MemPal hybrid v4 + Haiku rerank (100%) is the first perfect score on LongMemEval — 500/500 questions, all 6 question types at 100%.**
+The MemPal hybrid v4 + Haiku rerank 100% remains an internal
+result — reproducible with `--mode hybrid_v4 --llm-rerank` — but we
+don't quote it on public pages because the final 0.6 percentage points came from
+inspecting three specific wrong answers (see "Benchmark Integrity"
+below), which is teaching to the test. The honest generalisable figure
+when an LLM is in the loop is the held-out 98.4% R@5 on 450 unseen
+questions, or the model-agnostic 99.2% R@5 / 100% R@10 we reproduced
+with minimax-m2.7 on the full 500.
 
 ---
 
@@ -308,9 +342,9 @@ The palace classifies each question into one of 5 halls. Pass 1 searches only wi
 ### Setup
 
 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 mkdir -p /tmp/longmemeval-data
 curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
   https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
diff --git a/benchmarks/HYBRID_MODE.md b/benchmarks/HYBRID_MODE.md
index 6843e98..37f315e 100644
--- a/benchmarks/HYBRID_MODE.md
+++ b/benchmarks/HYBRID_MODE.md
@@ -196,9 +196,9 @@ python benchmarks/longmemeval_bench.py data/longmemeval_s_cleaned.json --mode hy
 
 ```bash
 # Setup
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 
 # Download data
 mkdir -p /tmp/longmemeval-data
diff --git a/benchmarks/README.md b/benchmarks/README.md
index 6e041fb..417ef05 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -1,13 +1,13 @@
-# MemPal Benchmarks — Reproduction Guide
+# MemPalace Benchmarks — Reproduction Guide
 
 Run the exact same benchmarks we report. Clone, install, run.
 
 ## Setup
 
 ```bash
-git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
-cd mempal
-pip install chromadb pyyaml
+git clone https://github.com/MemPalace/mempalace.git
+cd mempalace
+pip install -e ".[dev]"
 ```
 
 ## Benchmark 1: LongMemEval (500 questions)