Adds a fifth format adapter to mempalace.normalize alongside the
existing Claude Code, Codex, Claude.ai, ChatGPT, and Slack parsers.
After this lands, mempalace mine --mode convos ingests Gemini CLI
session history without manual export.
Why now: Claude Code and Codex CLI are already supported by convo_miner;
adding Gemini closes the major-CLI-tool coverage gap. After this lands,
the README's "verbatim conversation history" promise is honestly
delivered for all three top-tier API-keyed coding CLIs (Claude Code,
Codex CLI, Gemini CLI), not just two of them. This is the third leg
of the trio Aya pushed for so the public claim matches the actual
ingest pipeline.
Gemini CLI stores sessions at ~/.gemini/tmp/<project_hash>/chats/ as
JSONL. The on-disk schema (per google-gemini/gemini-cli#15292):
{"type":"session_metadata","sessionId":"...","projectHash":"...",...}
{"type":"user","id":"msg1","content":[{"text":"Hello"}]}
{"type":"gemini","id":"msg2","content":[{"text":"Hi"}]}
{"type":"message_update","id":"msg2","tokens":{"input":10,"output":5}}
The new _try_gemini_jsonl parser:
- requires a session_metadata record so it does not false-positive
against Claude Code or Codex JSONL passing through the dispatch
chain in _try_normalize_json
- extracts user/gemini message text from each entry's content array
of {"text": "..."} blocks, joining multiple blocks per message
in order
- skips message_update entries (token-count deltas with no message
text) and any other unknown record types
- returns None when fewer than two conversational messages are
present, mirroring the codex parser's >=2-message guard
Test coverage: 9 new unit tests in tests/test_normalize.py mirroring
the codex test pattern - happy path, multi-turn, missing session
metadata, message_update skip, single-message rejection, multi-block
content concatenation, empty content skip, malformed-line resilience,
and explicit no-match against codex JSONL fixtures. Schema-level only;
real Gemini CLI session fixtures are a follow-up once a real user file
is available.
Closes part of #59 (the Gemini CLI portion of the umbrella request).
Adds api_key_source provenance ('flag' | 'env' | None) to LLMProvider
so cmd_init can distinguish a key passed via --llm-api-key (explicit
opt-in) from one silently picked up via OPENAI_API_KEY / ANTHROPIC_API_KEY
shell env (stray credential).
When the endpoint is external AND api_key_source == 'env', init now
prints a blocking [y/N] prompt before any data is sent. Anything other
than 'y' drops the LLM and falls back to heuristics-only.
Adds --accept-external-llm flag for CI / non-interactive bypass.
Completes the UX gap in #1224: the URL-based warning was informational
and init kept running, so a user who didn't notice the line had already
leaked. The consent prompt is the actual gate; explicit flag-passed keys
remain treated as already-consented.
``def _normalize_wing(wing: str | None) -> str | None`` uses PEP 604
union syntax which requires Python 3.10+ at runtime. The project still
declares ``python_requires=">=3.9"`` and CI runs the test-linux (3.9)
matrix, where every test in ``tests/test_palace_graph*`` errors out
before collection with ``TypeError: unsupported operand type(s) for |``.
Added ``from __future__ import annotations`` so all annotations in
this module are evaluated lazily as strings — the union syntax is then
accepted on 3.9 without needing to rewrite to ``Optional[str]``.
Surfaced after rebasing this PR onto current develop.
cmd_init now invokes ``_run_pass_zero`` unconditionally (#1221, #1223
landed on develop after this PR's branch point). The pass reads sample
content via ``builtins.open``; with that mocked to MagicMock, the
downstream ``"\\n\\n".join(samples)`` in
``corpus_origin.detect_origin_heuristic`` raises
``TypeError: expected str instance, MagicMock found``.
This test only cares about the wing-slug write to the registry, so
stub the pass-zero call directly rather than try to satisfy its full
sample-gathering contract.
Two follow-ups against the review on this PR:
1. ``miner.load_config`` no-yaml fallback was returning the raw dirname
as the wing, while ``cmd_init`` writes ``topics_by_wing`` under the
normalized slug. A hyphenated project mined without a ``mempalace.yaml``
file silently lost every topic tunnel — same key-miss class as #1194,
just down the no-yaml branch (raised by Qodo on this PR).
2. ``convo_miner`` was applying the lower/replace rule inline at one
call site. Now folded through ``normalize_wing_name`` so all wing-slug
producers — ``cmd_init``, ``room_detector_local``, ``miner.load_config``
fallback, ``convo_miner`` — share a single source of truth. No
behavior change for any input; pure consolidation.
Added ``test_load_config_no_yaml_normalizes_hyphenated_wing`` to lock
the fallback path to the normalized slug — fails on develop without
the miner change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`init` was recording `topics_by_wing[<raw-dirname>]` while `mempalace.yaml`
got the lower-cased separator-collapsed slug. At mine time the miner
read the slug from the yaml and missed the registry key, so
`_compute_topic_tunnels_for_wing` returned 0 silently for every project
whose folder contained a `-` or a space — the most common shape in the
wild.
Extracted the rule into `config.normalize_wing_name()` and routed both
`cli.cmd_init` (registry write) and `room_detector_local.detect_rooms_local`
(yaml write) through it. Added a regression test in `test_cli.py`
asserting the registry call uses the normalized slug, plus four direct
unit tests for the helper.
Refs #1180.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The narrowed _fix_blob_seq_ids returned early when safe_rows was empty,
but #1177's marker contract requires the marker to be written on every
successful pass — even no-op — so subsequent opens skip the sqlite3
connection entirely. Without this, palaces that have no genuine 0.6.x
BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every
call, defeating the #1090 corruption guard.
Restructured the conditional so the marker write is unconditional after
a successful sqlite scan, regardless of whether any rows were updated.
Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer
during the develop-rebase of this PR. The author's branch predates the
marker contract from #1177 (merged 2026-04-26), so this is a rebase-edge
fix-up rather than a logic change to their narrowing behaviour.
The BLOB-seq_id migration shim (PR #664) ran int.from_bytes(..., 'big')
over every BLOB in max_seq_id, including chromadb 1.5.x's own native
format (b'\x11\x11' + 6 ASCII digits). That conversion yields a ~1.23e18
integer that silently suppresses every subsequent embeddings_queue write
for the affected segment (queue filter is seq_id > start), causing
silent drawer-write drops after a 1.5.x upgrade.
Two-part fix:
1. Shim narrowing (mempalace/backends/chroma.py)
- Drop max_seq_id from the shim loop. chromadb owns that column's
format; we don't reinterpret it.
- Defense-in-depth: skip rows in embeddings whose seq_id BLOB has the
sysdb-10 b'\x11\x11' prefix rather than misconvert.
2. Recovery command (mempalace/repair.py, mempalace/cli.py)
- mempalace repair --mode max-seq-id [--segment <uuid>]
[--from-sidecar <path>] [--dry-run] [--yes] [--no-backup]
- Detects poisoned rows via threshold (seq_id > 2**53).
- Default heuristic: MAX(embeddings.seq_id) over the collection owning
the poisoned segment. Matches METADATA max exactly; VECTOR segments
get a few seq_ids ahead (queue skips an already-indexed window — an
acceptable loss vs. resetting to 0 and re-processing everything).
- --from-sidecar copies clean values from a pre-corruption sqlite db.
- Backs up chroma.sqlite3, closes chroma handles, atomic UPDATEs,
post-repair verification that raises MaxSeqIdVerificationError if
any row is still above threshold.
Tests: 8 new in tests/test_repair.py (detection, heuristic, sidecar,
dry-run, segment filter, no-op, backup, rollback-on-verify-failure).
3 new in tests/test_backends.py (max_seq_id untouched by shim,
sysdb-10 prefix skipped in embeddings, legacy big-endian u64 BLOBs
still convert). Full suite: 1103 passed.
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every
collection-creation call site:
* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the
legacy `create_collection()` path. Preserves existing `hnsw:space`,
`hnsw:num_threads=1` (race fix from #976), and `**ef_kwargs`
(embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection`
path used when a palace is first opened by the MCP server. Without this
third site, MCP-bootstrapped palaces would skip the guard and could
still trigger the original bloat.
Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows into
hundreds of GB sparse, after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.
Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. Final palace 376 MB, link_lists.bin stays
at 0 bytes across both Chroma collection dirs, status and search both
return cleanly. Same workload without the fix bloated the palace to
565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.
Migration note: chromadb 1.5.x exposes a
`collection.modify(configuration={"hnsw": {...}})` retrofit path for
already-created collections (`UpdateHNSWConfiguration`), but this PR
doesn't pursue it — by the time link_lists.bin has bloated the index
is already corrupt and the only known recovery is a fresh mine.
Tests assert both keys land on the persisted collection metadata in
both `ChromaBackend` code paths, which also covers the #1161 "config
silently dropped" concern at CI time. A separate smoke test was used
to verify the metadata round-trips through `chromadb.PersistentClient`
reopen on chromadb 1.5.8.
Closes#344
Supersedes #346
Co-authored-by: robot-rocket-science <robot-rocket-science@users.noreply.github.com>
`bash` on the Windows CI runner resolves to `wsl.exe` which fails with
"Windows Subsystem for Linux has no installed distributions." The shell
hooks themselves are POSIX-only — Windows users use the Python entry
point — so the bash-subprocess exercise is non-applicable on win32.
The static-grep validator tests still run on every platform, so the
shell-side validation is still asserted under Windows; only the live
bash invocation is skipped.
Address Copilot review on #1231:
1. Stop double-mining the transcript on the Python side. ``_get_mine_targets``
now returns only the ``MEMPAL_DIR`` projects target — the convos target
for the transcript dir is dropped because ``_ingest_transcript`` already
handles it on every hook fire. The duplicate spawn was using
``sys.executable`` (vs ``_mempalace_python()``) and a different ``--wing``,
so each Stop/PreCompact event was writing the same transcript into two
wings under asymmetric interpreters and overwriting the single
``_MINE_PID_FILE`` lock.
2. ``_maybe_auto_ingest`` and ``_mine_sync`` now spawn via
``_mempalace_python()`` so the resolved interpreter matches the venv
that owns mempalace (matters under GUI-launched harnesses where
``sys.executable`` may resolve to a system Python without chromadb).
3. Replace ``eval $(...)`` in both shell hooks with a ``mapfile``-based
reader. Sanitized values are still emitted by the same Python parser,
but the shell now does plain variable assignment instead of executing
the parser's stdout — smaller blast radius if the sanitizer is ever
bypassed.
4. Mirror ``_validate_transcript_path`` in the shell hooks via a
``is_valid_transcript_path`` helper — extension + traversal-segment
rejection, parity with the Python validator. The convos mine in each
shell hook is now gated on the validator instead of bare ``-f``.
5. Tighten the ``..`` traversal test that previously exercised the
suffix gate by mistake (``../../etc/passwd`` lacks ``.json[l]``).
Use ``.jsonl`` paths with traversal segments to actually hit the
``..`` rejection branch.
6. README: add a one-liner pointing at ``mempalace sweep`` for users
who want per-message recall on top of the file-level chunks the
hooks produce. The sweeper was undiscoverable previously.
Tests: 1418 passed, 1 skipped (full suite minus benchmarks).
#976 protects `mempalace mine`, but MCP/direct backend writers still call
ChromaCollection.add/upsert/update/delete without the palace lock. This
moves the lock boundary to the Chroma backend seam so all Chroma writes
share the same palace-level serialization, with a re-entrant guard for
miner paths that already hold the lock.
mine_palace_lock(palace_path) gains a per-thread re-entrant guard
(threading.local + pid-tag against fork inheritance) so
ChromaCollection write methods can take the lock without
self-deadlocking when called from inside miner.mine()'s outer hold.
ChromaCollection.__init__ accepts an optional palace_path; when set,
add/upsert/update/delete wrap their underlying chromadb call with
mine_palace_lock(palace_path). palace_path=None preserves the legacy
no-lock behaviour for direct callers and tests. ChromaBackend's
get_collection/create_collection pass palace_path through;
mcp_server._get_collection forwards _config.palace_path so all MCP
write tools inherit the wrapping.
Tests: 5 new in tests/test_chroma_collection_lock.py covering opt-in,
writer-blocks-during-mine, re-entrant-inside-mine, two-process
serialization, and a source-level read-path-not-locked pin. Plus 1 new
+ 1 rewritten in tests/test_palace_locks.py for the re-entrant
semantics. 52 passed in 1.01s including the existing test_backends.py
regression suite.
Refs #1161.
#1230 fixed --mode convos for the case where MEMPAL_DIR was unset, but
left two configurations broken:
- MEMPAL_DIR set to a project dir: convos never mined (MEMPAL_DIR
overrode the transcript path); only project files were ingested.
- MEMPAL_DIR set to a conversations dir per the old hooks/README: the
projects miner ran on JSONL — same wrong-miner behaviour.
The shell hooks (mempal_save_hook.sh, mempal_precompact_hook.sh) had
the same MEMPAL_DIR-overrides-transcript bug AND were missing --mode
on every spawned `mempalace mine` call.
Make the auto-ingest *additive*. _get_mine_dir → _get_mine_targets,
returning a list of (dir, mode) pairs:
- MEMPAL_DIR (when valid) contributes (dir, "projects")
- A valid transcript JSONL contributes (parent, "convos")
- Both can appear together; the hook spawns one ingest per target
Same change applied to the shell save and precompact hooks. Precompact
also gained transcript_path parsing so it can run the convos mine
synchronously before context is compressed. hooks/README.md updated to
describe MEMPAL_DIR as a project-files target, never a convos target.
- Normalize MEMPAL_DIR via Path.expanduser().resolve() so ~/proj paths
are correctly accepted instead of falling through to transcript fallback
- Replace bare Path.expanduser().is_file() transcript check with the
existing _validate_transcript_path() which adds .resolve(), enforces
.jsonl/.json extension, and rejects '..' path-traversal components
- Update tests to compare resolved paths (cross-platform correctness)
- Add tests for tilde expansion, path-traversal rejection, and
non-jsonl extension rejection in _get_mine_dir
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/f69176c7-d752-40ef-ba71-d0e4adc3a689
Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
The Stop and PreCompact hooks spawn `mempalace mine <dir>` with no
`--mode` flag, which defaults to `projects` in cli.py. When MEMPAL_DIR
is unset, _get_mine_dir falls back to the parent of the transcript
JSONL — and miner.py's READABLE_EXTENSIONS includes `.jsonl`, so the
projects miner happily ingests Claude Code session JSONL as if it were
source code instead of conversation.
Make _get_mine_dir return (dir, mode): MEMPAL_DIR keeps `projects`,
the JSONL fallback yields `convos`. Both _maybe_auto_ingest and
_mine_sync now thread the mode into the spawned command.
Five Copilot review issues + the Python 3.9 CI failure rolled into one
follow-up:
* Replace ``dict | None`` annotated assignment with a type-comment so
module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red).
* Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace
repair`` (rebuild) and ``mempalace repair-status``. Updated all
user-facing messages, docstrings, and test assertions.
* Replace ``_get_client()`` in ``tool_search`` with the safe
``_refresh_vector_disabled_flag`` probe so the fallback isn't
defeated by the very chromadb client load it's trying to avoid.
* Short-circuit ``tool_status`` to a pure-sqlite reader
(``_tool_status_via_sqlite``) when divergence is detected so wing /
room counts come back without ever opening the persistent client.
* Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an
``id``-ordered fallback so legacy schemas missing ``created_at``
don't break BM25 search.
New test covers the sqlite-status short-circuit. 1409 tests pass.
When chromadb's HNSW segment freezes at a stale max_elements while
sqlite keeps accumulating embeddings, the next chromadb open segfaults
the MCP server on every tool call. Adds a pure-filesystem capacity probe
(zero chromadb interaction), a `mempalace repair-status` read-only
health check, and a BM25-only sqlite fallback so the palace stays
reachable even when vector search is unavailable.
* `hnsw_capacity_status` reads sqlite + index_metadata.pickle directly
via a tight-allowlist unpickler — no hnswlib import, no segment load.
* MCP server runs the probe at startup and after every reconnect; sets
`_vector_disabled` and routes search to the sqlite FTS5 + BM25 path.
* `tool_status` and `tool_reconnect` surface the fallback state.
* Threshold tuned for chromadb 1.5.x async-flush lag (2× sync_threshold).
2 files changed, 60 insertions, 0 deletions. 2 new tests (RED-first).
Follow-up to #1224's privacy warning. The URL-based heuristic in
``mempalace.llm_client._endpoint_is_local`` shipped without recognizing
Tailscale's CGNAT range (100.64.0.0/10), so a user running LM Studio,
Ollama, or any local LLM accessible via a Tailscale-assigned 100.x.x.x
address would currently get a wrong privacy warning — Tailscale
addresses are network-private (only reachable inside the user's
Tailnet) but they're not RFC1918, so the heuristic was treating them
as external.
This PR adds CGNAT recognition: when the hostname starts with ``100.``
AND the second octet is between 64 and 127 inclusive, it's classified
as local. Addresses in 100.x.x.x outside that range (i.e. second octet
< 64 or > 127) are regular allocated public space and remain external,
so a user pointing at a public 100.0.0.1 still gets the warning.
Concrete user impact:
Before: ``mempalace init --llm-provider openai-compat --llm-endpoint http://100.100.50.50:1234``
(LM Studio on Tailnet) → triggers privacy warning incorrectly.
After: same command → no warning. data stays inside the user's
Tailnet, which is what the warning is supposed to protect against.
TDD: 2 tests added in ``tests/test_llm_client.py``, both RED-first.
1. ``test_openai_compat_provider_tailscale_cgnat_endpoint_is_local``
— covers three Tailscale CGNAT addresses (start, middle, near-end
of the range) and pins they're all classified local. This was the
RED that drove the implementation.
2. ``test_openai_compat_provider_outside_tailscale_cgnat_is_external``
— pins the boundary on both sides: addresses with second octet 0-63
and 128-255 stay external. Prevents future "treat all 100.x.x.x as
local" overcorrection.
Tests: 1388 total mempalace tests pass. 2 pre-existing environmental
failures unrelated to this change (chromadb optional dep). Ruff check
+ format both clean.
Backwards compatible: only widens the local-recognition set. Anything
classified local before is still classified local; anything classified
external before remains so unless it's specifically in the CGNAT range.
Out of scope (tracked for future iteration based on real user feedback,
not built speculatively): pre-init confirmation prompt before sending
to external API, persistent ``private-only`` config flag that refuses
external endpoints entirely, explicit cloud-provider name detection
("Using Anthropic's hosted API at ..." vs the current generic warning).
- cli.py: stringify each evidence entry exactly once before the
startswith check (was calling str(e) twice per element).
- tests: replace brittle `confidence != 0.90` assertion with an
equality check against detect_origin_heuristic on the same samples.
The original would have spuriously fired if the heuristic ever
legitimately produced 0.90 for these samples; the new form pins the
contract directly.
Two follow-ups to PR #1221's merge-fields behavior, both raised by the
Copilot review on that PR:
- Evidence merge now prefixes each entry with `Tier-1 heuristic: ` or
`Tier-2 LLM: ` so the on-disk `origin.json` audit record retains tier
provenance. The pre-#1221 code labeled heuristic evidence; the
merge-fields refactor flattened that. Re-prefixing is idempotent.
- Tests now assert that the merged `confidence` is the heuristic's, not
the LLM's. Added inline assertions to the two existing
contradiction/disagreement tests, plus a dedicated
`test_merge_tier_fields_confidence_matches_heuristic_call` that
compares to `detect_origin_heuristic` directly so a future regression
letting Tier 2 confidence leak through cannot pass silently.
Tests: 1378 pass. Ruff check + format both clean (CI-pinned 0.4.x).
4 files changed, 248 insertions, 0 deletions. 7 new tests (4 unit + 3 integration), all RED-first.
Per @milla-jovovich's question to @igorls during PR #1221 review: users
running `mempalace init` with an external LLM provider (Anthropic API,
OpenAI hosted, etc.) need a clear, explicit warning that their folder
content will be sent to the provider, that MemPalace doesn't control
how the provider logs/retains/uses that data, and how to opt out.
@igorls confirmed this should be a small follow-up PR scoped to the
warning itself, before the v3.3.4 tag.
This PR adds:
- `_endpoint_is_local(url)` helper in `mempalace/llm_client.py` —
URL-based heuristic returning True if the hostname is on the user's
machine or private network. Covers: localhost, 127.0.0.1, ::1,
hostnames ending in .local (mDNS/Bonjour), IPv4 RFC1918 ranges
(10/8, 172.16-31/12, 192.168/16), and IPv6 unique-local addresses
(fc00::/7).
- `is_external_service` property on the `LLMProvider` base class.
Subclasses inherit; the URL determines (no provider-specific
hardcoding). This means: Ollama on localhost = local. LM Studio on
LAN = local. Anthropic with default `https://api.anthropic.com` =
external. A user proxying Anthropic through localhost (advanced
setup) = local, no false-positive warning.
- One-line warning print in `cmd_init` after successful provider
acquisition, gated on `is_external_service`:
⚠ {provider_name} is an EXTERNAL API. Your folder content will be
sent to the provider during init. MemPalace does not control how
the provider logs, retains, or uses your data. Pass --no-llm to
keep init fully local.
The warning fires AFTER `LLM enabled: ...` so users see both that
the LLM is engaged AND the privacy implications of where it lives,
before Pass 0 / entity detection actually runs.
LOCAL providers (Ollama on localhost, LM Studio on localhost or LAN,
llama.cpp on localhost, vLLM on localhost) DO NOT trigger the warning —
nothing leaves the user's machine/network in those configurations.
TDD: 7 tests added across 2 files.
Unit tests in `tests/test_llm_client.py` (4 tests, all RED-first):
1. test_ollama_provider_default_endpoint_is_local — pins that the
default `http://localhost:11434` is classified local.
2. test_openai_compat_provider_localhost_endpoint_is_local — covers
the LM Studio / llama.cpp / vLLM common case (localhost,
127.0.0.1, and 192.168.x LAN).
3. test_openai_compat_provider_cloud_endpoint_is_external — pins
that pointing openai-compat at https://api.openai.com (or any
non-local URL) classifies as external.
4. test_anthropic_provider_default_endpoint_is_external — pins that
AnthropicProvider's default endpoint is external (the dominant
user-facing case for `--llm-provider anthropic`).
Integration tests in `tests/test_corpus_origin_integration.py` (3 tests,
RED-first; 1 was the critical RED — the other 2 passed by accident
since nothing printed "EXTERNAL API" before this PR):
5. test_init_prints_privacy_warning_when_provider_is_external —
captures stdout from cmd_init with a mocked external provider,
asserts the warning text contains "EXTERNAL API" + "--no-llm" +
language about MemPalace not controlling provider behavior.
6. test_init_no_privacy_warning_when_provider_is_local — same flow
with a mocked local provider, asserts the warning text does NOT
appear.
7. test_init_no_privacy_warning_with_no_llm_flag — pins the --no-llm
path: no provider acquisition attempted, no warning fires.
Tests: 1382 total mempalace tests pass. 2 pre-existing environmental
failures unrelated to this change (chromadb optional dep). Ruff check +
format both clean.
Backwards compatible: `is_external_service` is a new property; existing
callers don't reference it. The warning is a new print statement that
fires only when an external endpoint is acquired. The `--no-llm` opt-out
existed before this PR and continues to work identically.
Out of scope for follow-up (deliberately not in this PR per Igor's
"small PR" guidance): Tailscale CGNAT (100.64.0.0/10) treatment,
pre-init confirmation prompt, persistent privacy-mode config flag,
explicit cloud-provider name detection. Tracked for future iteration.
2 files changed, 260 insertions, 7 deletions. 4 new tests (all RED-first).
Per @igorls's review of PR #1211 (https://github.com/MemPalace/mempalace/pull/1211#issuecomment-4322762236):
the corpus-origin Pass 0 currently lets a Tier 2 LLM result REPLACE the
heuristic result wholesale. With ``--llm`` default-on (since #1211) and a
small local model like Ollama gemma4:e4b, the LLM can return a wrong
``likely_ai_dialogue=False, confidence=0.90`` that overrides a confident
heuristic ``True``. Tier 2's persona/user/platform extraction is the whole
reason to run it; the YES/NO call should stay with the heuristic.
This PR changes ``_run_pass_zero`` in ``mempalace/cli.py`` to merge fields
instead of replacing:
- ``likely_ai_dialogue`` → KEEP heuristic's (don't let weak LLM flip)
- ``confidence`` → KEEP heuristic's (paired with the bool above)
- ``primary_platform`` → TAKE LLM's when LLM provides one
- ``user_name`` → TAKE LLM's when LLM provides one
- ``agent_persona_names`` → TAKE LLM's when LLM provides any
- ``evidence`` → COMBINE both signal trails
This preserves the persona-extraction value of Tier 2 (the whole point of
running it) while preventing a weak local model from flipping a confident
heuristic.
TDD: 4 tests added in tests/test_corpus_origin_integration.py covering
the four state combinations:
1. test_merge_tier_fields_heuristic_yes_llm_no_keeps_heuristic_bool —
The exact failure mode Igor caught. Heuristic confidently flags
AI-dialogue; mocked LLM contradicts. Asserts merged result keeps
heuristic's True AND merges LLM's persona/user/platform fields.
This test was the RED that drove the implementation.
2. test_merge_tier_fields_heuristic_no_no_personas_leak —
Both tiers agree NOT AI-dialogue, both report empty personas. Pins
that the merge doesn't accidentally introduce personas.
3. test_merge_tier_fields_heuristic_yes_llm_yes_combines_evidence —
Both tiers agree AI-dialogue, LLM extracts personas. Pins that
evidence from BOTH tiers ends up in the merged audit trail and
persona/user/platform come from LLM.
4. test_merge_tier_fields_no_llm_provider_returns_heuristic_only —
Backwards compat: with no LLM provider (``--no-llm`` path), the
merge logic doesn't fire and behavior is identical to v3.3.4.
Tests: 1367 pass on the full mempalace suite. 2 pre-existing
environmental failures unrelated to this change (chromadb optional
dep). Ruff check + format both clean.
Three new tests cover the marker contract:
- test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is
written after a successful BLOB → INTEGER conversion.
- test_fix_blob_seq_ids_writes_marker_when_already_integer — marker
is written even on a no-op (already-INTEGER) path. The point of
the marker is to skip sqlite3.connect() on subsequent calls, not
to record that a conversion happened.
- test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies
the load-bearing property via monkeypatched sqlite3.connect: when
the marker exists, the function never opens sqlite. This is the
bug #1090 fix — opening Python's sqlite3 against a live ChromaDB
1.5.x WAL DB corrupts the next PersistentClient call, and we
must never do it again post-migration.
Also folded in @igorls's style nit:
- Path(marker).touch() instead of open(marker, "a").close().
Same effect, reads cleaner. Moved Path import to the top of the
module since it didn't yet exist there.
35/35 backend tests pass. The "Production: fresh MCP server + stop
hook + mempalace mine subprocess" checkbox in the PR body is checked
in the PR reply — the canonical 151K palace has been running this
fix since #1177 was filed (via Syncthing-replicated source on the
disks daemon at v1.7.0); zero PersistentClient corruption since.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore three `_pin_hnsw_threads` tests that the previous integrity-gate
commit deleted. The function is still live code on develop (defined at
chroma.py:207, called from chroma.py:705 + mcp_server.py), so the
deletion left the import unused (ruff F401) and dropped coverage on a
function unrelated to this PR's scope. Restored verbatim from main.
Plus three nits @igorls flagged:
- **Thread-safety doc**: `_quarantined_paths` mutation is lock-free;
documented that idempotency of `quarantine_stale_hnsw` is the safety
property (concurrent same-palace calls produce a benign redundant
rename attempt that no-ops, no need for a lock).
- **Pickle protocol assumption**: `_segment_appears_healthy` requires
PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today,
and a future protocol-0/1 emission would conservatively quarantine
+ lazy-rebuild rather than mis-classify as healthy.
- Class-level vs module-level scope: keeping class-level — the
conftest reset is the controlled case, and module-level wouldn't
remove the foot-gun, just relocate it. Conftest reset documented in
the existing comment is the right pattern for test isolation.
Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`)
deferred — that pattern lives in #1177's territory, not #1173's.
37/37 tests pass on the PR branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.
This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.
The headline change is that `mempalace init` now understands what kind of folder you're pointing it at — AI conversations, regular writing, code, narrative — and adapts how it classifies entities accordingly. The same folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) used to dump those into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.
But the broader change is that `mempalace init` got upgraded across the board — smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.
## What changes for users (in order, from `pip install` onwards)
**Install** — `pip install mempalace` is unchanged. The package itself didn't shift.
**First run — `mempalace init <folder>`:**
1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step — it dove straight into entity detection with no corpus context.
2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3 — three buckets are supported by `mempalace.llm_client`:
- **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`) — the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~\$0.01 per `init`.
- **Ollama** (`--llm-provider ollama` — the default) — local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
- **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`) — per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.
3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. `--no-llm` is the new explicit opt-out.
4. **`init` shows you what it detected.** A one-line banner — `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)` — tells you at a glance whether mempalace understood your folder.
5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.
6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.
7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it.
8. **Re-running `init` is now idempotent.** Bug fix — running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.
**Later — when your folder grows:**
9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again — `init` is now idempotent (item 8), so re-running it on the same folder is safe.
## Behind the changes
- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection: regex heuristic with co-occurrence rule (suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, llama-rancher journals don't false-positive), and LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure with belt-and-suspenders user-vs-agent disambiguation.
- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`.
- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.
- **`init` overhaul.** Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). `--llm` flipped to default-on. `--no-llm` added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing `llm_client` module.
- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.
- **`CLAUDE.md` design principle reworded** — "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.
## Cost story
- **Anthropic (verified path):** ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.
## Backwards compatibility
- All public function signatures gained the `corpus_origin` kwarg as optional (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged — no `agent_personas` key, no behavioral change.
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.
## Test coverage
- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection` — chromadb optional dep) unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.
## Hygiene guardrail
This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. RED if anything slips in. Allowlist for legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in #1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.
Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.
Fix: two-stage gate.
1. mtime gate (existing) — gap > stale_seconds is necessary.
2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
tail) and a non-trivial size, WITHOUT deserializing the file.
Healthy segment with mtime drift → keep in place; truncated /
zero-filled / partial-flush → quarantine.
Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.
The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.
4 new tests in test_backends.py:
- test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
- test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
(drift + valid meta — the production case at 06:24)
- test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
(fresh / never-flushed, no meta file)
- test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
size, partial-flush shape)
Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.
Suite 1366/1366 pass.
Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.
CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>