The BLOB-seq_id migration shim (PR #664) ran int.from_bytes(..., 'big')
over every BLOB in max_seq_id, including chromadb 1.5.x's own native
format (b'\x11\x11' + 6 ASCII digits). That conversion yields a ~1.23e18
integer that silently suppresses every subsequent embeddings_queue write
for the affected segment (queue filter is seq_id > start), causing
silent drawer-write drops after a 1.5.x upgrade.
Two-part fix:
1. Shim narrowing (mempalace/backends/chroma.py)
- Drop max_seq_id from the shim loop. chromadb owns that column's
format; we don't reinterpret it.
- Defense-in-depth: skip rows in embeddings whose seq_id BLOB has the
sysdb-10 b'\x11\x11' prefix rather than misconvert.
2. Recovery command (mempalace/repair.py, mempalace/cli.py)
- mempalace repair --mode max-seq-id [--segment <uuid>]
[--from-sidecar <path>] [--dry-run] [--yes] [--no-backup]
- Detects poisoned rows via threshold (seq_id > 2**53).
- Default heuristic: MAX(embeddings.seq_id) over the collection owning
the poisoned segment. Matches METADATA max exactly; VECTOR segments
get a few seq_ids ahead (queue skips an already-indexed window — an
acceptable loss vs. resetting to 0 and re-processing everything).
- --from-sidecar copies clean values from a pre-corruption sqlite db.
- Backs up chroma.sqlite3, closes chroma handles, atomic UPDATEs,
post-repair verification that raises MaxSeqIdVerificationError if
any row is still above threshold.
Tests: 8 new in tests/test_repair.py (detection, heuristic, sidecar,
dry-run, segment filter, no-op, backup, rollback-on-verify-failure).
3 new in tests/test_backends.py (max_seq_id untouched by shim,
sysdb-10 prefix skipped in embeddings, legacy big-endian u64 BLOBs
still convert). Full suite: 1103 passed.
`bash` on the Windows CI runner resolves to `wsl.exe` which fails with
"Windows Subsystem for Linux has no installed distributions." The shell
hooks themselves are POSIX-only — Windows users use the Python entry
point — so the bash-subprocess exercise is non-applicable on win32.
The static-grep validator tests still run on every platform, so the
shell-side validation is still asserted under Windows; only the live
bash invocation is skipped.
Address Copilot review on #1231:
1. Stop double-mining the transcript on the Python side. ``_get_mine_targets``
now returns only the ``MEMPAL_DIR`` projects target — the convos target
for the transcript dir is dropped because ``_ingest_transcript`` already
handles it on every hook fire. The duplicate spawn was using
``sys.executable`` (vs ``_mempalace_python()``) and a different ``--wing``,
so each Stop/PreCompact event was writing the same transcript into two
wings under asymmetric interpreters and overwriting the single
``_MINE_PID_FILE`` lock.
2. ``_maybe_auto_ingest`` and ``_mine_sync`` now spawn via
``_mempalace_python()`` so the resolved interpreter matches the venv
that owns mempalace (matters under GUI-launched harnesses where
``sys.executable`` may resolve to a system Python without chromadb).
3. Replace ``eval $(...)`` in both shell hooks with a ``mapfile``-based
reader. Sanitized values are still emitted by the same Python parser,
but the shell now does plain variable assignment instead of executing
the parser's stdout — smaller blast radius if the sanitizer is ever
bypassed.
4. Mirror ``_validate_transcript_path`` in the shell hooks via a
``is_valid_transcript_path`` helper — extension + traversal-segment
rejection, parity with the Python validator. The convos mine in each
shell hook is now gated on the validator instead of bare ``-f``.
5. Tighten the ``..`` traversal test that previously exercised the
suffix gate by mistake (``../../etc/passwd`` lacks ``.json[l]``).
Use ``.jsonl`` paths with traversal segments to actually hit the
``..`` rejection branch.
6. README: add a one-liner pointing at ``mempalace sweep`` for users
who want per-message recall on top of the file-level chunks the
hooks produce. The sweeper was undiscoverable previously.
Tests: 1418 passed, 1 skipped (full suite minus benchmarks).
#1230 fixed --mode convos for the case where MEMPAL_DIR was unset, but
left two configurations broken:
- MEMPAL_DIR set to a project dir: convos never mined (MEMPAL_DIR
overrode the transcript path); only project files were ingested.
- MEMPAL_DIR set to a conversations dir per the old hooks/README: the
projects miner ran on JSONL — same wrong-miner behaviour.
The shell hooks (mempal_save_hook.sh, mempal_precompact_hook.sh) had
the same MEMPAL_DIR-overrides-transcript bug AND were missing --mode
on every spawned `mempalace mine` call.
Make the auto-ingest *additive*. _get_mine_dir → _get_mine_targets,
returning a list of (dir, mode) pairs:
- MEMPAL_DIR (when valid) contributes (dir, "projects")
- A valid transcript JSONL contributes (parent, "convos")
- Both can appear together; the hook spawns one ingest per target
Same change applied to the shell save and precompact hooks. Precompact
also gained transcript_path parsing so it can run the convos mine
synchronously before context is compressed. hooks/README.md updated to
describe MEMPAL_DIR as a project-files target, never a convos target.
- Normalize MEMPAL_DIR via Path.expanduser().resolve() so ~/proj paths
are correctly accepted instead of falling through to transcript fallback
- Replace bare Path.expanduser().is_file() transcript check with the
existing _validate_transcript_path() which adds .resolve(), enforces
.jsonl/.json extension, and rejects '..' path-traversal components
- Update tests to compare resolved paths (cross-platform correctness)
- Add tests for tilde expansion, path-traversal rejection, and
non-jsonl extension rejection in _get_mine_dir
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/f69176c7-d752-40ef-ba71-d0e4adc3a689
Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
The Stop and PreCompact hooks spawn `mempalace mine <dir>` with no
`--mode` flag, which defaults to `projects` in cli.py. When MEMPAL_DIR
is unset, _get_mine_dir falls back to the parent of the transcript
JSONL — and miner.py's READABLE_EXTENSIONS includes `.jsonl`, so the
projects miner happily ingests Claude Code session JSONL as if it were
source code instead of conversation.
Make _get_mine_dir return (dir, mode): MEMPAL_DIR keeps `projects`,
the JSONL fallback yields `convos`. Both _maybe_auto_ingest and
_mine_sync now thread the mode into the spawned command.
Five Copilot review issues + the Python 3.9 CI failure rolled into one
follow-up:
* Replace ``dict | None`` annotated assignment with a type-comment so
module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red).
* Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace
repair`` (rebuild) and ``mempalace repair-status``. Updated all
user-facing messages, docstrings, and test assertions.
* Replace ``_get_client()`` in ``tool_search`` with the safe
``_refresh_vector_disabled_flag`` probe so the fallback isn't
defeated by the very chromadb client load it's trying to avoid.
* Short-circuit ``tool_status`` to a pure-sqlite reader
(``_tool_status_via_sqlite``) when divergence is detected so wing /
room counts come back without ever opening the persistent client.
* Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an
``id``-ordered fallback so legacy schemas missing ``created_at``
don't break BM25 search.
New test covers the sqlite-status short-circuit. 1409 tests pass.
When chromadb's HNSW segment freezes at a stale max_elements while
sqlite keeps accumulating embeddings, the next chromadb open segfaults
the MCP server on every tool call. Adds a pure-filesystem capacity probe
(zero chromadb interaction), a `mempalace repair-status` read-only
health check, and a BM25-only sqlite fallback so the palace stays
reachable even when vector search is unavailable.
* `hnsw_capacity_status` reads sqlite + index_metadata.pickle directly
via a tight-allowlist unpickler — no hnswlib import, no segment load.
* MCP server runs the probe at startup and after every reconnect; sets
`_vector_disabled` and routes search to the sqlite FTS5 + BM25 path.
* `tool_status` and `tool_reconnect` surface the fallback state.
* Threshold tuned for chromadb 1.5.x async-flush lag (2× sync_threshold).
2 files changed, 60 insertions, 0 deletions. 2 new tests (RED-first).
Follow-up to #1224's privacy warning. The URL-based heuristic in
``mempalace.llm_client._endpoint_is_local`` shipped without recognizing
Tailscale's CGNAT range (100.64.0.0/10), so a user running LM Studio,
Ollama, or any local LLM accessible via a Tailscale-assigned 100.x.x.x
address would currently get a wrong privacy warning — Tailscale
addresses are network-private (only reachable inside the user's
Tailnet) but they're not RFC1918, so the heuristic was treating them
as external.
This PR adds CGNAT recognition: when the hostname starts with ``100.``
AND the second octet is between 64 and 127 inclusive, it's classified
as local. Addresses in 100.x.x.x outside that range (i.e. second octet
< 64 or > 127) are regular allocated public space and remain external,
so a user pointing at a public 100.0.0.1 still gets the warning.
Concrete user impact:
Before: ``mempalace init --llm-provider openai-compat --llm-endpoint http://100.100.50.50:1234``
(LM Studio on Tailnet) → triggers privacy warning incorrectly.
After: same command → no warning. data stays inside the user's
Tailnet, which is what the warning is supposed to protect against.
TDD: 2 tests added in ``tests/test_llm_client.py``, both RED-first.
1. ``test_openai_compat_provider_tailscale_cgnat_endpoint_is_local``
— covers three Tailscale CGNAT addresses (start, middle, near-end
of the range) and pins they're all classified local. This was the
RED that drove the implementation.
2. ``test_openai_compat_provider_outside_tailscale_cgnat_is_external``
— pins the boundary on both sides: addresses with second octet 0-63
and 128-255 stay external. Prevents future "treat all 100.x.x.x as
local" overcorrection.
Tests: 1388 total mempalace tests pass. 2 pre-existing environmental
failures unrelated to this change (chromadb optional dep). Ruff check
+ format both clean.
Backwards compatible: only widens the local-recognition set. Anything
classified local before is still classified local; anything classified
external before remains so unless it's specifically in the CGNAT range.
Out of scope (tracked for future iteration based on real user feedback,
not built speculatively): pre-init confirmation prompt before sending
to external API, persistent ``private-only`` config flag that refuses
external endpoints entirely, explicit cloud-provider name detection
("Using Anthropic's hosted API at ..." vs the current generic warning).
- cli.py: stringify each evidence entry exactly once before the
startswith check (was calling str(e) twice per element).
- tests: replace brittle `confidence != 0.90` assertion with an
equality check against detect_origin_heuristic on the same samples.
The original would have spuriously fired if the heuristic ever
legitimately produced 0.90 for these samples; the new form pins the
contract directly.
Two follow-ups to PR #1221's merge-fields behavior, both raised by the
Copilot review on that PR:
- Evidence merge now prefixes each entry with `Tier-1 heuristic: ` or
`Tier-2 LLM: ` so the on-disk `origin.json` audit record retains tier
provenance. The pre-#1221 code labeled heuristic evidence; the
merge-fields refactor flattened that. Re-prefixing is idempotent.
- Tests now assert that the merged `confidence` is the heuristic's, not
the LLM's. Added inline assertions to the two existing
contradiction/disagreement tests, plus a dedicated
`test_merge_tier_fields_confidence_matches_heuristic_call` that
compares to `detect_origin_heuristic` directly so a future regression
letting Tier 2 confidence leak through cannot pass silently.
Tests: 1378 pass. Ruff check + format both clean (CI-pinned 0.4.x).
4 files changed, 248 insertions, 0 deletions. 7 new tests (4 unit + 3 integration), all RED-first.
Per @milla-jovovich's question to @igorls during PR #1221 review: users
running `mempalace init` with an external LLM provider (Anthropic API,
OpenAI hosted, etc.) need a clear, explicit warning that their folder
content will be sent to the provider, that MemPalace doesn't control
how the provider logs/retains/uses that data, and how to opt out.
@igorls confirmed this should be a small follow-up PR scoped to the
warning itself, before the v3.3.4 tag.
This PR adds:
- `_endpoint_is_local(url)` helper in `mempalace/llm_client.py` —
URL-based heuristic returning True if the hostname is on the user's
machine or private network. Covers: localhost, 127.0.0.1, ::1,
hostnames ending in .local (mDNS/Bonjour), IPv4 RFC1918 ranges
(10/8, 172.16-31/12, 192.168/16), and IPv6 unique-local addresses
(fc00::/7).
- `is_external_service` property on the `LLMProvider` base class.
Subclasses inherit; the URL determines (no provider-specific
hardcoding). This means: Ollama on localhost = local. LM Studio on
LAN = local. Anthropic with default `https://api.anthropic.com` =
external. A user proxying Anthropic through localhost (advanced
setup) = local, no false-positive warning.
- One-line warning print in `cmd_init` after successful provider
acquisition, gated on `is_external_service`:
⚠ {provider_name} is an EXTERNAL API. Your folder content will be
sent to the provider during init. MemPalace does not control how
the provider logs, retains, or uses your data. Pass --no-llm to
keep init fully local.
The warning fires AFTER `LLM enabled: ...` so users see both that
the LLM is engaged AND the privacy implications of where it lives,
before Pass 0 / entity detection actually runs.
LOCAL providers (Ollama on localhost, LM Studio on localhost or LAN,
llama.cpp on localhost, vLLM on localhost) DO NOT trigger the warning —
nothing leaves the user's machine/network in those configurations.
TDD: 7 tests added across 2 files.
Unit tests in `tests/test_llm_client.py` (4 tests, all RED-first):
1. test_ollama_provider_default_endpoint_is_local — pins that the
default `http://localhost:11434` is classified local.
2. test_openai_compat_provider_localhost_endpoint_is_local — covers
the LM Studio / llama.cpp / vLLM common case (localhost,
127.0.0.1, and 192.168.x LAN).
3. test_openai_compat_provider_cloud_endpoint_is_external — pins
that pointing openai-compat at https://api.openai.com (or any
non-local URL) classifies as external.
4. test_anthropic_provider_default_endpoint_is_external — pins that
AnthropicProvider's default endpoint is external (the dominant
user-facing case for `--llm-provider anthropic`).
Integration tests in `tests/test_corpus_origin_integration.py` (3 tests,
RED-first; 1 was the critical RED — the other 2 passed by accident
since nothing printed "EXTERNAL API" before this PR):
5. test_init_prints_privacy_warning_when_provider_is_external —
captures stdout from cmd_init with a mocked external provider,
asserts the warning text contains "EXTERNAL API" + "--no-llm" +
language about MemPalace not controlling provider behavior.
6. test_init_no_privacy_warning_when_provider_is_local — same flow
with a mocked local provider, asserts the warning text does NOT
appear.
7. test_init_no_privacy_warning_with_no_llm_flag — pins the --no-llm
path: no provider acquisition attempted, no warning fires.
Tests: 1382 total mempalace tests pass. 2 pre-existing environmental
failures unrelated to this change (chromadb optional dep). Ruff check +
format both clean.
Backwards compatible: `is_external_service` is a new property; existing
callers don't reference it. The warning is a new print statement that
fires only when an external endpoint is acquired. The `--no-llm` opt-out
existed before this PR and continues to work identically.
Out of scope for follow-up (deliberately not in this PR per Igor's
"small PR" guidance): Tailscale CGNAT (100.64.0.0/10) treatment,
pre-init confirmation prompt, persistent privacy-mode config flag,
explicit cloud-provider name detection. Tracked for future iteration.
2 files changed, 260 insertions, 7 deletions. 4 new tests (all RED-first).
Per @igorls's review of PR #1211 (https://github.com/MemPalace/mempalace/pull/1211#issuecomment-4322762236):
the corpus-origin Pass 0 currently lets a Tier 2 LLM result REPLACE the
heuristic result wholesale. With ``--llm`` default-on (since #1211) and a
small local model like Ollama gemma4:e4b, the LLM can return a wrong
``likely_ai_dialogue=False, confidence=0.90`` that overrides a confident
heuristic ``True``. Tier 2's persona/user/platform extraction is the whole
reason to run it; the YES/NO call should stay with the heuristic.
This PR changes ``_run_pass_zero`` in ``mempalace/cli.py`` to merge fields
instead of replacing:
- ``likely_ai_dialogue`` → KEEP heuristic's (don't let weak LLM flip)
- ``confidence`` → KEEP heuristic's (paired with the bool above)
- ``primary_platform`` → TAKE LLM's when LLM provides one
- ``user_name`` → TAKE LLM's when LLM provides one
- ``agent_persona_names`` → TAKE LLM's when LLM provides any
- ``evidence`` → COMBINE both signal trails
This preserves the persona-extraction value of Tier 2 (the whole point of
running it) while preventing a weak local model from flipping a confident
heuristic.
TDD: 4 tests added in tests/test_corpus_origin_integration.py covering
the four state combinations:
1. test_merge_tier_fields_heuristic_yes_llm_no_keeps_heuristic_bool —
The exact failure mode Igor caught. Heuristic confidently flags
AI-dialogue; mocked LLM contradicts. Asserts merged result keeps
heuristic's True AND merges LLM's persona/user/platform fields.
This test was the RED that drove the implementation.
2. test_merge_tier_fields_heuristic_no_no_personas_leak —
Both tiers agree NOT AI-dialogue, both report empty personas. Pins
that the merge doesn't accidentally introduce personas.
3. test_merge_tier_fields_heuristic_yes_llm_yes_combines_evidence —
Both tiers agree AI-dialogue, LLM extracts personas. Pins that
evidence from BOTH tiers ends up in the merged audit trail and
persona/user/platform come from LLM.
4. test_merge_tier_fields_no_llm_provider_returns_heuristic_only —
Backwards compat: with no LLM provider (``--no-llm`` path), the
merge logic doesn't fire and behavior is identical to v3.3.4.
Tests: 1367 pass on the full mempalace suite. 2 pre-existing
environmental failures unrelated to this change (chromadb optional
dep). Ruff check + format both clean.
Three new tests cover the marker contract:
- test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is
written after a successful BLOB → INTEGER conversion.
- test_fix_blob_seq_ids_writes_marker_when_already_integer — marker
is written even on a no-op (already-INTEGER) path. The point of
the marker is to skip sqlite3.connect() on subsequent calls, not
to record that a conversion happened.
- test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies
the load-bearing property via monkeypatched sqlite3.connect: when
the marker exists, the function never opens sqlite. This is the
bug #1090 fix — opening Python's sqlite3 against a live ChromaDB
1.5.x WAL DB corrupts the next PersistentClient call, and we
must never do it again post-migration.
Also folded in @igorls's style nit:
- Path(marker).touch() instead of open(marker, "a").close().
Same effect, reads cleaner. Moved Path import to the top of the
module since it didn't yet exist there.
35/35 backend tests pass. The "Production: fresh MCP server + stop
hook + mempalace mine subprocess" checkbox in the PR body is checked
in the PR reply — the canonical 151K palace has been running this
fix since #1177 was filed (via Syncthing-replicated source on the
disks daemon at v1.7.0); zero PersistentClient corruption since.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore three `_pin_hnsw_threads` tests that the previous integrity-gate
commit deleted. The function is still live code on develop (defined at
chroma.py:207, called from chroma.py:705 + mcp_server.py), so the
deletion left the import unused (ruff F401) and dropped coverage on a
function unrelated to this PR's scope. Restored verbatim from main.
Plus three nits @igorls flagged:
- **Thread-safety doc**: `_quarantined_paths` mutation is lock-free;
documented that idempotency of `quarantine_stale_hnsw` is the safety
property (concurrent same-palace calls produce a benign redundant
rename attempt that no-ops, no need for a lock).
- **Pickle protocol assumption**: `_segment_appears_healthy` requires
PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today,
and a future protocol-0/1 emission would conservatively quarantine
+ lazy-rebuild rather than mis-classify as healthy.
- Class-level vs module-level scope: keeping class-level — the
conftest reset is the controlled case, and module-level wouldn't
remove the foot-gun, just relocate it. Conftest reset documented in
the existing comment is the right pattern for test isolation.
Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`)
deferred — that pattern lives in #1177's territory, not #1173's.
37/37 tests pass on the PR branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.
This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.
The headline change is that `mempalace init` now understands what kind of folder you're pointing it at — AI conversations, regular writing, code, narrative — and adapts how it classifies entities accordingly. The same folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) used to dump those into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.
But the broader change is that `mempalace init` got upgraded across the board — smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.
## What changes for users (in order, from `pip install` onwards)
**Install** — `pip install mempalace` is unchanged. The package itself didn't shift.
**First run — `mempalace init <folder>`:**
1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step — it dove straight into entity detection with no corpus context.
2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3 — three buckets are supported by `mempalace.llm_client`:
- **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`) — the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~\$0.01 per `init`.
- **Ollama** (`--llm-provider ollama` — the default) — local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
- **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`) — per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.
3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. `--no-llm` is the new explicit opt-out.
4. **`init` shows you what it detected.** A one-line banner — `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)` — tells you at a glance whether mempalace understood your folder.
5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.
6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.
7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it.
8. **Re-running `init` is now idempotent.** Bug fix — running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.
**Later — when your folder grows:**
9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again — `init` is now idempotent (item 8), so re-running it on the same folder is safe.
## Behind the changes
- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection: regex heuristic with co-occurrence rule (suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, llama-rancher journals don't false-positive), and LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure with belt-and-suspenders user-vs-agent disambiguation.
- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`.
- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.
- **`init` overhaul.** Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). `--llm` flipped to default-on. `--no-llm` added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing `llm_client` module.
- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.
- **`CLAUDE.md` design principle reworded** — "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.
## Cost story
- **Anthropic (verified path):** ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.
## Backwards compatibility
- All public function signatures gained the `corpus_origin` kwarg as optional (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged — no `agent_personas` key, no behavioral change.
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.
## Test coverage
- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection` — chromadb optional dep) unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.
## Hygiene guardrail
This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. RED if anything slips in. Allowlist for legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in #1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.
Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.
Fix: two-stage gate.
1. mtime gate (existing) — gap > stale_seconds is necessary.
2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
tail) and a non-trivial size, WITHOUT deserializing the file.
Healthy segment with mtime drift → keep in place; truncated /
zero-filled / partial-flush → quarantine.
Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.
The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.
4 new tests in test_backends.py:
- test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
- test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
(drift + valid meta — the production case at 06:24)
- test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
(fresh / never-flushed, no meta file)
- test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
size, partial-flush shape)
Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.
Suite 1366/1366 pass.
Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.
CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom on the canonical disks daemon: drift quarantines firing every
10–30 minutes throughout the day under steady write load. Logs show
.drift-* directories accumulating despite the daemon being the only
writer (no Syncthing replication of palace data).
Root cause is a false-positive thrash in the quarantine heuristic:
- chroma.sqlite3 mtime bumps on every write (millisecond cadence).
- HNSW segment files (data_level0.bin) only flush to disk on
chromadb's internal cadence, which can lag minutes behind sqlite
under continuous write load.
Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames
a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and
the cycle repeats as soon as the next batch of writes lands. The 300s
threshold (lowered from 3600s in PR #1173 after a 0.96h-drift production
segfault) is correct for the cross-machine-replication failure mode it
was designed for, but wrong for a daemon-strict deployment whose only
"drift" source is its own benign flush lag.
Fix: gate the proactive quarantine check to the first ``make_client()``
invocation per palace per process (``ChromaBackend._quarantined_paths``
set). Real cold-start drift (replication, partial restore, crashed-mid-
write) still gets caught — that's exactly when a fresh daemon process
opens the palace. Real runtime drift on observed HNSW errors still gets
caught via palace-daemon's ``_auto_repair`` which calls
``quarantine_stale_hnsw`` directly, bypassing this gate.
Two new tests in test_backends.py verify single-fire-per-palace and
per-palace independence. Conftest clears the gate between tests.
Suite 1362/1362 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
make_client() called _fix_blob_seq_ids but skipped quarantine_stale_hnsw,
so every fresh process (stop hook, precompact hook, CLI) opened a drifted
palace and segfaulted in chromadb_rust_bindings before any write-path
protection could fire.
#1062 wires the quarantine call at MCP server startup (covers long-lived
server processes). This fix adds it to make_client() itself — the call
site that all callers (server, hooks, CLI, tests) pass through — so every
fresh PersistentClient open is protected regardless of entry point.
Also lowers stale_seconds default from 3600 to 300: a 0.96h-drifted
segment caused production segfaults before the 1h threshold fired.
ChromaDB's HNSW flush cadence means legitimate drift is seconds to low
minutes; 5min gives headroom without admitting clearly corrupt segments.
The user-reported case in #1208: a palace with 67,580 drawers had its
HNSW files manually quarantined to recover from corruption. ``mempalace
repair`` then ran cleanly and reported "Drawers found: 10000 ... Repair
complete. 10000 drawers rebuilt." Backup was the v3.3.3 chroma.sqlite3
that did contain the full 67,580 — but the rebuilt collection only had
the first 10K. 85% data loss, no warning.
Root cause: ChromaDB's collection-layer get() silently caps at
``CHROMADB_DEFAULT_GET_LIMIT = 10_000`` rows when reading from a
collection whose segment metadata is stale (typical post-quarantine
state). col.count() returns the same capped value, so neither the
loop bound nor the extraction count flagged the truncation.
Fix is defense-in-depth, not a recovery mechanism. Repair now:
1. After extraction, queries chroma.sqlite3 directly via a read-only
sqlite3 connection: COUNT(*) FROM embeddings JOIN segments JOIN
collections WHERE name='mempalace_drawers'. If that count exceeds
the extracted count, abort with a clear message before any
destructive operation.
2. Falls back to a weaker check when the SQLite query can't run
(chromadb schema drift, locked file): if extracted exactly equals
CHROMADB_DEFAULT_GET_LIMIT, that's a strong-enough cap signal to
refuse without explicit acknowledgement.
3. Adds ``--confirm-truncation-ok`` (CLI) and ``confirm_truncation_ok``
(rebuild_index kwarg) to override after independent verification.
Useful for the rare case of a palace genuinely sized at exactly
10,000 drawers.
The guard logic lives in ``repair.check_extraction_safety()`` so the
two extraction paths (CLI ``cmd_repair`` and the lower-level
``rebuild_index``) share a single implementation. Raises
``TruncationDetected`` carrying the printable message.
Tests: 9 new cases covering the safe path (counts match, SQLite
unreadable but well under cap), both abort paths (SQLite higher than
extracted, unreadable + at cap), the override flag, and end-to-end
behavior of ``rebuild_index`` with the guard wired in. Plus two
``sqlite_drawer_count`` tests for the missing-file and bad-schema
cases.
What's NOT in this PR: actually recovering the missing 57,580
drawers from the user's case. The on-disk SQLite still holds them;
recovery is a separate flow (direct-extract from chroma.sqlite3,
bypass the chromadb collection layer entirely). This PR's job is
to stop repair from making it worse.
Refs #1208.
3.11 is mid-life; 3.13 is already on the Linux matrix and gives ~3.5
years of upstream support. Aligns the single-version platform jobs with
the top of the Linux matrix. requires-python and lint job left alone.
Refs #1192 (Option A).
ChromaDB can return None for drawers without metadata (legacy data,
partial writes — same root cause as upstream #1020 / our PR #1094).
build_graph at line 95 called meta.get("room", "") unconditionally,
which AttributeErrors on None and takes out every consumer of
build_graph for the whole call path: graph_stats, find_tunnels,
traverse, and (most visibly) the daemon's /stats endpoint.
Caught 2026-04-25 by palace-daemon's verify-routes.sh smoke test
against the canonical 151K-drawer palace — /stats was 500-ing on a
single None drawer.
Adds `if meta is None: continue` guard. Closes the same gap upstream's
#999 None-metadata audit closed in searcher.py / mcp_server.py /
miner.status, just in a different file the audit didn't reach. The
graph-build is recoverable: skipping a single None drawer doesn't
distort the graph since build_graph already filters
`room and room != "general" and wing` — a missing-metadata drawer was
never going to participate anyway.
Test: TestBuildGraph::test_none_metadata_does_not_crash mixes a None
entry into a 3-drawer fixture and asserts the two real drawers are
processed normally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opening chroma.sqlite3 via Python's sqlite3.connect() against a live
ChromaDB 1.5.x WAL-mode database leaves state that segfaults the next
PersistentClient call — the same failure mode tracked at #1090.
_fix_blob_seq_ids runs unconditionally on every make_client() call, so
every fresh process (MCP server, stop hook, CLI) re-triggers the sqlite
open → corrupt → segfault cycle on palaces that have already completed
the 0.6.x → 1.5.x seq_id migration.
Guard with a .blob_seq_ids_migrated marker file in the palace directory:
- If marker exists, return immediately — skip sqlite entirely
- After successful migration (or confirmation that no BLOBs remain),
write the marker so subsequent opens take the fast path
- Palaces that never had BLOB seq_ids also get the marker on first open,
so they too avoid the redundant sqlite open after that
- Already-migrated palaces can touch the marker manually to opt in
Test plan: Direct test — run _fix_blob_seq_ids twice against a fresh
palace; second call returns immediately because marker exists. 1094
existing tests pass.
CI lint job runs `ruff format --check`; the new tests in TestBM25NoneSafety
needed the standard "blank line after import-inside-function" + line-length
wrap. No logic change — formatter pass only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_tokenize` calls `text.lower()` unconditionally; when ChromaDB returns a
drawer with `documents` containing `None`, the hybrid-rerank path raises
`AttributeError: 'NoneType' object has no attribute 'lower'`.
Observed in production daemon log (2026-04-24 21:07:05) during a search
that triggered `_hybrid_rank → _bm25_scores → _tokenize`:
File "mempalace/searcher.py", line 81, in _bm25_scores
tokenized = [_tokenize(d) for d in documents]
File "mempalace/searcher.py", line 52, in _tokenize
return _TOKEN_RE.findall(text.lower())
AttributeError: 'NoneType' object has no attribute 'lower'
Closes the gap left by the upstream None-metadata audit (#999), which
covered metadata loops but not BM25 helpers. Returns `[]` for falsy input
so a None doc gets score 0.0 while the rest of the corpus reranks normally.
Three regression tests in TestBM25NoneSafety lock the behavior and reference
the production trace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After rebasing onto current develop:
- chroma.py: keep develop's quarantine_stale_hnsw + UnsupportedFilterError
validation alongside this PR's _pin_hnsw_threads retrofit.
- tests/test_backends.py: combine quarantine_stale_hnsw and
_pin_hnsw_threads test sections; ruff format.
- miner.py: propagate the new `files=` kwarg (added on develop in #1183
for the init -> mine flow) through _mine_impl so the caller can pass
a pre-scanned file list under the global lock.
Addresses remaining PR #976 review items after rebase on develop.
`get_collection(create=False)` previously returned existing collections without
re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the
unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls
`collection.modify(configuration=UpdateCollectionConfiguration(
hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every
`get_collection` call (including the MCP server's `_get_collection`).
In chromadb 1.5.x the runtime config does not persist to disk across
`PersistentClient` reopens, so the retrofit is re-applied each process start
rather than being a one-shot migration. Fresh palaces keep the metadata-based
pin as primary defense; legacy palaces now also get per-session protection
without requiring `mempalace nuke` + re-mine.
After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and
no longer emits `decision: block`, so the attempt-cap constant was orphaned.
Grep confirms 0 usages in the repo — remove it.
- `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1)
- `_pin_hnsw_threads` swallows all errors (never raises)
- `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace
- 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)
PR #863 on develop eliminated precompact blocking entirely. After rebasing,
the attempt-cap tests (test_precompact_first_two_attempts_block,
test_precompact_passes_through_after_cap, test_precompact_counter_is_per_session)
would always fail because hook_precompact now mines synchronously and
passes through unconditionally. Remove them to keep the suite green.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses the two actionable Copilot comments from the 2nd review pass.
tests/test_palace_locks.py (#7, #8)
multiprocessing.get_context("fork") is unavailable on Windows, so the
cross-process tests would crash the Windows CI runner. Added
`_get_mp_context()` that picks "spawn" on Windows and "fork" elsewhere.
Spawn re-imports the module in the child; it inherits os.environ
(including the monkeypatched HOME), which is all these tests need.
mempalace/palace.py (#10)
The per-palace lock key was computed from os.path.abspath(palace_path).
On Windows the filesystem is case-insensitive, so `C:\\Palace` and
`c:\\palace` would hash to different keys and two concurrent mines
could touch the same on-disk palace. Switched to
`os.path.normcase(os.path.realpath(...))` so:
* realpath resolves symlinks and `..` segments
* normcase folds case on Windows (no-op on POSIX)
Testing
pytest tests/test_palace_locks.py tests/test_hooks_cli.py
tests/test_backends.py tests/test_cli.py
→ 98 passed, 0 failed.
Addresses the six Copilot review comments on the initial commit.
1) #6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend
The MCP server creates its palace collection directly via
`chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
not through `ChromaBackend.get_collection`. That path was missing the
`hnsw:num_threads=1` metadata, so the primary crash surface for #974
and #965 was untouched by the original patch. Fixed by passing
`hnsw:num_threads=1` at the mcp_server create site too. Documented
in a code comment that the setting is only honored at creation
time — existing palaces created before this fix still need a
`mempalace nuke` + re-mine to gain the protection.
2) #3 — mine_global_lock over-serialized mines across unrelated palaces
Replaced the single global lock file `mine_global.lock` with a
per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
(`mine_palace_<hash>.lock`). Mines against the same palace still
collapse to a single runner (the correctness boundary), but mines
against *different* palaces are now free to run in parallel.
`mine_global_lock` is kept as a backward-compatible alias for
`mine_palace_lock` so any external callers that imported the
previous name keep working.
3) #1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired
`subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
palaces. The previous `except OSError` clause didn't catch it, so
the hook could raise and fail to emit any JSON decision — leaving
the harness without a block/passthrough signal. Fixed by catching
`(OSError, subprocess.TimeoutExpired)` together and always falling
through to the block decision so the hook reliably emits a response.
4) #2 + #4 — tests
- tests/test_hooks_cli.py: added
`test_precompact_first_two_attempts_block`,
`test_precompact_passes_through_after_cap`, and
`test_precompact_counter_is_per_session` to lock in the #955
deadlock fix.
- tests/test_palace_locks.py (new): covers `mine_palace_lock`
single-acquire, reuse-after-release, cross-process serialization
on the same palace, non-interference across different palaces,
path normalization, and the `mine_global_lock` back-compat alias.
5) #5 — known limitation, documented but not auto-fixed
Copilot suggested detecting collections missing `hnsw:num_threads=1`
and calling `collection.modify(metadata=...)` to retrofit existing
palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
replaces metadata rather than merging, and re-passing
`hnsw:space="cosine"` then raises `ValueError: Changing the
distance function of a collection once it is created is not
supported currently.` The HNSW runtime configuration
(`configuration_json`) also does not expose `num_threads` in
chromadb 1.5.x, so the flag appears to be read only at creation
time. Rather than paper over the limitation with a best-effort
`modify` that silently drops `hnsw:space`, documented in the
mcp_server comment that pre-existing palaces need a
`mempalace nuke` + re-mine to gain the protection. Fresh palaces
are always protected.
Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
- Different palaces → both complete in parallel ✓
- Same palace → one completes, the other exits with
"another `mine` is already running against <palace> — exiting
cleanly." ✓