End-user installs now lead with `uv tool install mempalace`, with
`pip install mempalace` kept as a fallback. Dev/contributor docs lead
with `uv sync --extra dev` and `uv run` for tests/benchmarks/lint, with
the equivalent pip recipe kept inline. The shipped `/mempalace:init`
skill instructions (mempalace/instructions/init.md) try `uv tool install`
first when uv is on PATH, then fall back through the pip variants.
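A minimal sketch of the "uv first, pip fallback" order described above. The helper name `pick_install_command` is hypothetical, and the several pip variants mentioned in the init instructions are collapsed here into a single `python -m pip install` fallback for brevity.

```python
import shutil
import sys
from typing import Callable, List, Optional


def pick_install_command(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the install command to try first (hypothetical helper)."""
    # Prefer uv when it is on PATH, mirroring the documented order.
    if which("uv"):
        return ["uv", "tool", "install", "mempalace"]
    # Otherwise fall back to pip, run through the current interpreter.
    return [sys.executable, "-m", "pip", "install", "mempalace"]


if __name__ == "__main__":
    print(" ".join(pick_install_command()))
```

Passing `which` as a parameter keeps the PATH lookup replaceable, so the decision can be exercised without uv installed.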
Adds a .python-version pin at 3.12 because the lockfile's
onnxruntime==1.24.3 only ships wheels for Python >=3.11. Without the
pin, `uv sync` on a host where uv prefers 3.10 fails because no
matching wheel exists and no source distribution is available, which
would make the documented command a footgun. pyproject's
`requires-python = ">=3.9"` is unchanged, so pip users on 3.9/3.10 are
unaffected.
Files updated: README.md, CONTRIBUTING.md, CLAUDE.md, the gemini-cli
guide and example, the .claude-plugin / .codex-plugin READMEs, the
mempalace SKILL, the openclaw SKILL, tools/save.md, the three
benchmarks docs, and the corresponding website mirrors.
The miner upserted one drawer per ChromaDB call, paying tokenizer +
ONNX session setup per chunk. The embedding device was CPU-only because
no EmbeddingFunction was ever wired through the backend.
Two changes, each a speedup in its own right; stacked they give ~10x
end-to-end on a medium corpus (20 files, 568 drawers):
1. Batched upsert. `process_file` and `_file_chunks_locked` now collect
   all chunks of a file into a single `collection.upsert(...)` so the
   embedding model runs one forward pass per file instead of N.
2. Hardware-accelerated embedding function. New `mempalace/embedding.py`
   wraps `ONNXMiniLM_L6_V2` with configurable `preferred_providers`.
   `MEMPALACE_EMBEDDING_DEVICE` (or `embedding_device` in config.json)
   selects auto / cpu / cuda / coreml / dml. Unavailable accelerators
   log a warning and fall back to CPU.
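The batching in item 1 can be sketched roughly as follows. This is an illustration, not the code in `process_file`: the helper name, the id scheme, and the metadata keys are assumptions, and `RecordingCollection` is a stand-in that only mimics the keyword arguments of ChromaDB's `collection.upsert(...)`.

```python
from typing import Any, Dict, List, Sequence


def upsert_file_chunks(collection: Any, source_file: str, chunks: Sequence[str]) -> int:
    """Send every chunk of one file in a single upsert call (hypothetical helper)."""
    if not chunks:
        return 0
    # Hypothetical id scheme: "<file>:<chunk index>".
    ids = [f"{source_file}:{i}" for i in range(len(chunks))]
    metadatas = [{"source_file": source_file, "chunk_index": i} for i in range(len(chunks))]
    # One call per file, so the embedding function sees the whole batch at once.
    collection.upsert(ids=ids, documents=list(chunks), metadatas=metadatas)
    return len(chunks)


class RecordingCollection:
    """Stand-in for a ChromaDB collection that records upsert calls."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def upsert(self, ids, documents, metadatas=None) -> None:
        self.calls.append({"ids": ids, "documents": documents, "metadatas": metadatas})


if __name__ == "__main__":
    col = RecordingCollection()
    n = upsert_file_chunks(col, "notes.md", ["alpha", "beta", "gamma"])
    print(f"{n} chunks in {len(col.calls)} upsert call(s)")
```

The per-chunk variant would call `upsert` once per element of `chunks`; the only change is where the loop sits relative to the call.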
The factory subclasses `ONNXMiniLM_L6_V2` and spoofs its `name()` to
`"default"` so the persisted EF identity matches existing palaces
created with ChromaDB's bare `DefaultEmbeddingFunction` -- same
model, same 384-dim vectors, no rebuild needed when turning GPU on.
`ChromaBackend.get_collection` / `create_collection` now pass the
resolved EF on every call so miner writes and searcher reads agree.
Benchmarks (i9-12900KF + RTX 3090, medium scenario, 568 drawers):

  mode               time     throughput   speedup
  per-chunk + CPU    19.77s    29 drw/s    baseline
  batched + CPU       8.07s    70 drw/s    2.4x
  batched + CUDA      2.15s   264 drw/s    9.2x
Reproducible via `benchmarks/mine_bench.py`.
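As a quick arithmetic check, the throughput and speedup columns follow directly from the wall-clock times and the 568-drawer count reported above. The snippet below only recomputes those figures; it does not run the benchmark.

```python
DRAWERS = 568

# Wall-clock seconds per configuration, as reported in the benchmark table.
runs = {
    "per-chunk + CPU": 19.77,
    "batched + CPU": 8.07,
    "batched + CUDA": 2.15,
}

baseline = runs["per-chunk + CPU"]

# (drawers per second, speedup vs. the per-chunk CPU baseline)
summary = {
    name: (round(DRAWERS / secs), round(baseline / secs, 1))
    for name, secs in runs.items()
}

if __name__ == "__main__":
    for name, (rate, speedup) in summary.items():
        print(f"{name:<16} {rate:>4} drw/s  {speedup}x")
```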
Install paths (quoted so shells such as zsh do not glob the brackets):

  pip install "mempalace[gpu]"     # NVIDIA CUDA
  pip install "mempalace[dml]"     # DirectML (Windows)
  pip install "mempalace[coreml]"  # macOS Neural Engine
Mine header now prints `Device: cpu|cuda|...` so users can confirm the
accelerator engaged.
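One way the device setting could map onto ONNX Runtime execution providers, with the CPU fallback described above. This is a sketch under assumptions: the function name and the "auto" policy (first available accelerator, then CPU) are hypothetical, and the caller supplies the list of available providers, which in practice would come from `onnxruntime.get_available_providers()`.

```python
import logging
from typing import List, Optional, Sequence

logger = logging.getLogger("mempalace.embedding")

# Device names from the commit message mapped to ONNX Runtime provider names.
_PROVIDERS = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "dml": "DmlExecutionProvider",
}


def resolve_providers(device: Optional[str], available: Sequence[str]) -> List[str]:
    """Return an ordered provider list for the requested device (hypothetical helper)."""
    cpu = _PROVIDERS["cpu"]
    device = (device or "auto").lower()
    if device == "auto":
        # Assumed policy: take the first accelerator that is actually available.
        accel = [p for name, p in _PROVIDERS.items() if name != "cpu" and p in available]
        return accel[:1] + [cpu]
    if device == "cpu":
        return [cpu]
    wanted = _PROVIDERS.get(device)
    if wanted is None or wanted not in available:
        logger.warning("embedding device %r unavailable, falling back to CPU", device)
        return [cpu]
    # Keep CPU last so unsupported ops can still run.
    return [wanted, cpu]


if __name__ == "__main__":
    import os

    requested = os.environ.get("MEMPALACE_EMBEDDING_DEVICE", "auto")
    print(resolve_providers(requested, ["CPUExecutionProvider"]))
```

The resulting list is the shape `preferred_providers` expects: an ordered preference, with CPU kept as the last resort.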
Fixes the remaining in-repo surfaces that carried the same retracted or
broken claims as the public pages fixed in the previous two commits.
CONTRIBUTING.md
- "Palace structure matters ... 34% retrieval improvement" → reframed
as scoping (same rewording applied to the website equivalents).
benchmarks/BENCHMARKS.md
- Add a prominent "Important caveat" block at the top of the
"Comparison vs Published Systems" table explaining that R@5
(retrieval recall) and QA accuracy are different metrics, with
citations to Mastra, Mem0, and Supermemory's own published
methodology pages. Annotate the specific competitor rows whose
numbers are QA accuracy, not retrieval recall.
- Annotate the `hybrid v4 + rerank 100%` row to note that the 99.4
→ 100 step was tuned on 3 specific wrong answers (already disclosed
further down in the doc under "Benchmark Integrity"); the honest
hybrid figure is the held-out 98.4%.
- Fix the broken clone URL: `aya-thekeeper/mempal` no longer points
at anything, so it now reads `MemPalace/mempalace`.
benchmarks/README.md + benchmarks/HYBRID_MODE.md
- Same clone-URL fix applied.
CHANGELOG.md
- Add a ### Documentation entry under [Unreleased] v3.3.0 that names
#875 and summarises the scope of the rewrite.
Addresses #875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).
Scorecard (✅ = reproduces exactly; * = see note below):

  LongMemEval
    raw                          R@5    96.6%  (500/500)  ✅
    hybrid_v4, held-out 450      R@5    98.4%  (442/450)  ✅
    hybrid_v4 + minimax rerank   R@5    99.2%  (496/500)  *
    hybrid_v4 + minimax rerank   R@10  100.0%  (500/500)  *
  LoCoMo (session, top-10)
    raw                                 60.3%  (1986q)    ✅
    hybrid v5                           88.9%  (1986q)    ✅
  ConvoMem all-categories (250 items)   92.9%             ✅
  MemBench all-categories (8500)        80.3%             ✅

* The minimax-m2.7:cloud rerank run replicates the "100%" claim
with a different LLM family (no Anthropic dependency). R@10 is
a perfect reproduction; R@5 misses 4 questions that the
published Haiku run caught — consistent with BENCHMARKS.md's own
disclosure that hybrid_v4 includes three question-specific fixes
developed by inspecting misses, i.e. teaching to the test.
The committed 50/450 split is the deterministic (seed=42) split that
BENCHMARKS.md references but that was not previously in the repo.
Full result JSONLs include every question, every retrieved id,
and every score — auditable end-to-end.
The rerank pipeline was hardcoded to Anthropic's /v1/messages.
Add a backend flag so the same code path can be exercised with
any OpenAI-compatible endpoint — local Ollama, Ollama Cloud,
or any gateway that speaks /v1/chat/completions.
Enables independent verification of the "100% with Haiku rerank"
claim by running the full benchmark with a different LLM family
(e.g. minimax-m2.7:cloud) and zero Anthropic dependency.
Both longmemeval_bench.py and locomo_bench.py:
- llm_rerank*() gain backend= / base_url= kwargs
- CLI: --llm-backend {anthropic,ollama}, --llm-base-url
- API key required only when backend=anthropic (diary/palace modes still require it)
- Parse last integer in response (reasoning models emit multi-int output)
- Fallback to message.reasoning when content is empty
- Raise max_tokens to 1024 for reasoning models
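The response-parsing bullets above can be sketched as below. It assumes an OpenAI-compatible chat-completion payload already decoded into a dict; the function name is hypothetical, and the `reasoning` field is taken from the bullet above rather than from any standard schema.

```python
import re
from typing import Any, Dict, Optional


def extract_rerank_choice(response: Dict[str, Any]) -> Optional[int]:
    """Pull the chosen index out of a chat-completion response (hypothetical helper)."""
    message = response["choices"][0]["message"]
    # Fall back to the reasoning field when the visible content is empty.
    text = message.get("content") or message.get("reasoning") or ""
    # Reasoning models may mention several integers; the final one is the answer.
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None


if __name__ == "__main__":
    demo = {"choices": [{"message": {"content": "Candidates 2 and 5 fit; best is 5"}}]}
    print(extract_rerank_choice(demo))
```

Returning `None` when no integer is present leaves it to the caller to decide whether to keep the original ranking or retry.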
The `_load_api_key()` function in longmemeval_bench.py and locomo_bench.py
searched for API keys in a fixed path (`~/.config/lu/keys.json`) using
personal key names (`anthropic_milla`, `anthropic_claude_code_main`).
This leaks internal infrastructure details into the public codebase and
trains contributors to store credentials in a non-standard location
rather than using the standard ANTHROPIC_API_KEY env var.
Simplified to: CLI flag > env var > empty string. Updated help text
and HYBRID_MODE.md docs to match.
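A minimal sketch of the simplified precedence (CLI flag, then env var, then empty string). The function name and signature here are illustrative and are not the actual `_load_api_key()`; only the `ANTHROPIC_API_KEY` variable name comes from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the API key: CLI flag > env var > empty string (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # An explicit CLI value always wins.
    if cli_value:
        return cli_value
    # Otherwise use the standard environment variable, or nothing at all.
    return env.get("ANTHROPIC_API_KEY", "")


if __name__ == "__main__":
    print(bool(resolve_api_key()))
```

No file paths are consulted, so nothing about a contributor's local key storage ends up in the codebase.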
Co-authored-by: Tadao <tadao@travisfixes.com>
The module-level `ssl._create_default_https_context = ssl._create_unverified_context`
disables certificate verification for ALL urllib requests in the process,
not just the benchmark's HuggingFace downloads. This silently exposes
the benchmark runner to MITM attacks.
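If a specific environment needs to skip verification (e.g. corporate proxy),
users can pass a custom ssl context per-request rather than globally
patching the ssl module. `PYTHONHTTPSVERIFY=0` only helps on Python builds
that honor it (PEP 493 defines it for Python 2.7; upstream CPython 3 does not read it).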
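A per-request context might look like the sketch below, using only the standard library. The helper names are hypothetical; the point is that the unverified context is scoped to a single `urlopen` call and the process-wide default stays untouched.

```python
import ssl
import urllib.request


def make_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Build a context for one request; verification stays on unless asked otherwise."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url: str, verify: bool = True, timeout: float = 30.0) -> bytes:
    """Download a URL with a context that applies to this call only."""
    with urllib.request.urlopen(url, context=make_ssl_context(verify), timeout=timeout) as resp:
        return resp.read()
```

A caller behind an intercepting proxy would opt in explicitly with `fetch(url, verify=False)`, while every other request in the process keeps full certificate checking.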
Co-authored-by: Tadao <tadao@travisfixes.com>