Files
mempalace/benchmarks/README.md
T
Igor Lins e Silva bf3b9c5979 docs: #875 follow-up — repo surfaces + reproduction URLs + CHANGELOG
Remaining in-repo surfaces carrying the same retracted or broken
claims as the public pages fixed in the previous two commits.

CONTRIBUTING.md
 - "Palace structure matters ... 34% retrieval improvement" → reframed
   as scoping (same rewording applied to the website equivalents).

benchmarks/BENCHMARKS.md
 - Add a prominent "Important caveat" block at the top of the
   "Comparison vs Published Systems" table explaining that R@5
   (retrieval recall) and QA accuracy are different metrics, with
   citations to Mastra, Mem0, and Supermemory's own published
   methodology pages. Annotate the specific competitor rows whose
   numbers are QA accuracy, not retrieval recall.
 - Annotate the `hybrid v4 + rerank 100%` row to note that the 99.4
   → 100 step was tuned on 3 specific wrong answers (already disclosed
   further down in the doc under "Benchmark Integrity"); the honest
   hybrid figure is held-out 98.4%.
 - Fix the broken clone URL — `aya-thekeeper/mempal` no longer points
   at anything; now `MemPalace/mempalace`.

benchmarks/README.md + benchmarks/HYBRID_MODE.md
 - Same clone-URL fix applied.

CHANGELOG.md
 - Add a ### Documentation entry under [Unreleased] v3.3.0 that names
   #875 and summarises the scope of the rewrite.
2026-04-14 21:38:00 -03:00

125 lines
4.1 KiB
Markdown

# MemPalace Benchmarks — Reproduction Guide
Run the exact same benchmarks we report. Clone, install, run.
## Setup
```bash
git clone https://github.com/MemPalace/mempalace.git
cd mempalace
pip install -e ".[dev]"
```
## Benchmark 1: LongMemEval (500 questions)
Tests retrieval across ~53 conversation sessions per question. The standard benchmark for AI memory.
```bash
# Download data
mkdir -p /tmp/longmemeval-data
curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
# Run (raw mode — our headline 96.6% result)
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json
# Run with AAAK compression (84.2%)
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --mode aaak
# Run with room-based boosting (89.4%)
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --mode rooms
# Quick test on 20 questions first
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --limit 20
# Turn-level granularity
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --granularity turn
```
**Expected output (raw mode, full 500):**
```
Recall@5: 0.966
Recall@10: 0.982
NDCG@10: 0.889
Time: ~5 minutes on Apple Silicon
```
## Benchmark 2: LoCoMo (1,986 QA pairs)
Tests multi-hop reasoning across 10 long conversations (19-32 sessions each, 400-600 dialog turns).
```bash
# Clone LoCoMo
git clone https://github.com/snap-research/locomo.git /tmp/locomo
# Run (session granularity — our 60.3% result)
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --granularity session
# Dialog granularity (harder — 48.0%)
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --granularity dialog
# Higher top-k (77.8% at top-50)
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --top-k 50
# Quick test on 1 conversation
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --limit 1
```
**Expected output (session, top-10, full 10 conversations):**
```
Avg Recall: 0.603
Temporal: 0.692
Time: ~2 minutes
```
## Benchmark 3: ConvoMem (Salesforce, 75K+ QA pairs)
Tests six categories of conversational memory. Downloads from HuggingFace automatically.
```bash
# Run all categories, 50 items each (our 92.9% result)
python benchmarks/convomem_bench.py --category all --limit 50
# Single category
python benchmarks/convomem_bench.py --category user_evidence --limit 100
# Quick test
python benchmarks/convomem_bench.py --category user_evidence --limit 10
```
**Categories available:** `user_evidence`, `assistant_facts_evidence`, `changing_evidence`, `abstention_evidence`, `preference_evidence`, `implicit_connection_evidence`
**Expected output (all categories, 50 each):**
```
Avg Recall: 0.929
Assistant Facts: 1.000
User Facts: 0.980
Time: ~2 minutes
```
## What Each Benchmark Tests
| Benchmark | What it measures | Why it matters |
|---|---|---|
| **LongMemEval** | Can you find a fact buried in 53 sessions? | Tests basic retrieval quality — the "needle in a haystack" |
| **LoCoMo** | Can you connect facts across conversations over weeks? | Tests multi-hop reasoning and temporal understanding |
| **ConvoMem** | Does your memory system work at scale? | Tests all memory types: facts, preferences, changes, abstention |
## Results Files
Raw results are in `benchmarks/results_*.jsonl` and `benchmarks/results_*.json`. Each file contains every question, every retrieved document, and every score — fully auditable.
## Requirements
- Python 3.9+
- `chromadb` (the only dependency)
- ~300MB disk for LongMemEval data
- ~5 minutes for each full benchmark run
- No API key. No internet during benchmark (after data download). No GPU.
## Next Benchmarks (Planned)
- **Scale testing** — ConvoMem at 50/100/300 conversations per item
- **Hybrid AAAK** — search raw text, deliver AAAK-compressed results
- **End-to-end QA** — retrieve + generate answer + measure F1 (needs LLM API key)