merge: develop into hnsw-repair (resolve chroma.py + test_backends.py conflicts)

Develop (post-#1162 lock-plumbing era) refactored the per-open quarantine pass into ChromaBackend._prepare_palace_for_open. This branch's inline-expansion form added quarantine_invalid_hnsw_metadata as a third check, plus a "discard from _quarantined_paths on inode swap" guard so re-opens against a different physical DB re-run quarantine.

Resolution merges both:
- _prepare_palace_for_open now also calls quarantine_invalid_hnsw_metadata, gated by the same _quarantined_paths set.
- _client keeps the inode_changed -> _quarantined_paths.discard() guard before calling the helper, so a fresh inode triggers a fresh pass.
- make_client collapses to a single _prepare_palace_for_open() call.
- test_backends.py keeps both the pickle (#1285) and shutil (develop) imports; both are used.
2026-05-07 07:48:45 -03:00
parent 0e32b9643c 435f0ad348
commit be6dc033fd
45 changed files with 3380 additions and 165 deletions
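The resolution described in the commit message (a once-per-process gate that is reset when the inode changes) can be sketched as below. This is a minimal illustration, not the code from `chroma.py`: the `OpenGate` class, its `prepare` method, and the explicit `inode` argument are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Set


class OpenGate:
    """Hypothetical stand-in for the per-process ``_quarantined_paths`` gate."""

    def __init__(self, quarantine: Callable[[str], None]) -> None:
        self._quarantine = quarantine       # the expensive pre-open checks
        self._seen: Set[str] = set()        # paths already checked in this process
        self._inodes: Dict[str, int] = {}   # last inode observed per path

    def prepare(self, path: str, inode: int) -> None:
        # A different inode means a different physical DB at the same path,
        # so the earlier "already checked" credit no longer applies.
        previous = self._inodes.get(path)
        if previous is not None and previous != inode:
            self._seen.discard(path)
        self._inodes[path] = inode
        if path not in self._seen:
            self._quarantine(path)
            self._seen.add(path)


calls = []
gate = OpenGate(calls.append)
gate.prepare("/tmp/palace", inode=1)  # first open: checks run
gate.prepare("/tmp/palace", inode=1)  # same inode: gated, no re-run
gate.prepare("/tmp/palace", inode=2)  # inode swap: checks run again
```

In this sketch the checks run twice for three opens: once on the first open and once after the inode swap.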
@@ -6,7 +6,7 @@
        "hooks": [
          {
            "type": "command",
-            "command": "bash ${CLAUDE_PLUGIN_ROOT}/hooks/mempal-stop-hook.sh"
+            "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/mempal-stop-hook.sh\""
          }
        ]
      }
@@ -16,7 +16,7 @@
        "hooks": [
          {
            "type": "command",
-            "command": "bash ${CLAUDE_PLUGIN_ROOT}/hooks/mempal-precompact-hook.sh"
+            "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/mempal-precompact-hook.sh\""
          }
        ]
      }
@@ -6,7 +6,7 @@
        "hooks": [
          {
            "type": "command",
-            "command": "${CODEX_PLUGIN_ROOT}/hooks/mempal-hook.sh session-start"
+            "command": "\"${CODEX_PLUGIN_ROOT}/hooks/mempal-hook.sh\" session-start"
          }
        ]
      }
@@ -17,7 +17,7 @@
        "hooks": [
          {
            "type": "command",
-            "command": "${CODEX_PLUGIN_ROOT}/hooks/mempal-hook.sh stop"
+            "command": "\"${CODEX_PLUGIN_ROOT}/hooks/mempal-hook.sh\" stop"
          }
        ]
      }
@@ -28,7 +28,7 @@
        "hooks": [
          {
            "type": "command",
-            "command": "${CODEX_PLUGIN_ROOT}/hooks/mempal-hook.sh precompact"
+            "command": "\"${CODEX_PLUGIN_ROOT}/hooks/mempal-hook.sh\" precompact"
          }
        ]
      }
@@ -6,6 +6,21 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 ---
 ## [3.3.5] — unreleased
 ### Bug Fixes
 - **`mempalace_diary_read` silently dropped entries on agent-name case mismatch.** `tool_diary_write` stored the `agent` metadata verbatim after `sanitize_name`, which preserves case, while `tool_diary_read` filtered by exact match. Writing as `"Claude"` and reading as `"claude"` (or vice-versa) returned zero rows. Both endpoints now lowercase `agent_name` immediately after sanitization, so reads are case-insensitive and the default per-agent wing slug is stable across casings. **Behavior change:** entries written prior to this fix under mixed-case agent names will not match the new lowercase filter; run `mempalace repair` if you need to migrate legacy diary metadata. (#1243)
 - **Knowledge-graph triples with `valid_to < valid_from` were silently invisible.** `KnowledgeGraph.query_entity()` filters with `valid_from <= as_of AND valid_to >= as_of`, so an inverted interval matches no `as_of` and the row is durably stored but unreachable — a P0 data-integrity foot-gun any caller that mixes up the two date params can hit. `add_triple()` now rejects inverted intervals at write time with a clear `ValueError` naming both bounds. Open intervals (one bound only) and point-in-time facts (`valid_from == valid_to`) remain accepted unchanged. (#1214)
 - **`ChromaBackend.close_palace()` / `close()` did not release the SQLite file lock.** Evicted clients sat in `_clients` without `close()`, and chromadb 1.5.x retains the rust-side SQLite lock until GC. Reopening the same palace path after `shutil.rmtree` + recreate within one process failed with `SQLITE_READONLY_DBMOVED` (code 1032). New `_close_client()` helper now calls `PersistentClient.close()` (with a try/except fallback for older chromadb) on `close_palace()`, on whole-backend `close()`, and on the `_client()` invalidation path that detects a missing `chroma.sqlite3`. The mtime/inode auto-invalidation branch is intentionally left alone — callers there may still hold a live `ChromaCollection`. (#1067, #1105)
 - **`EntityRegistry.save()` could leave a corrupt or empty `entity_registry.json` on crash.** `Path.write_text()` is not atomic — kernel sees `open('w')` (truncate), `write`, `close`, and any failure between truncate and full-flush (power loss, OOM, FS-full, kill -9) wipes the months-of-mining people/projects map silently (the registry's `load()` swallows `JSONDecodeError`). Save now writes to a sibling `.tmp` in the same directory, `fsync`s, `chmod 0o600`s, then `os.replace()`s into place — atomic on POSIX and Windows. The previous registry stays intact on any crash before the rename returns. (#1215)
 - **`mempalace compress` crashed on large palaces.** `regenerate_closets` fetched all closet_llm drawers in a single `col.get()`, which trips `SQLITE_MAX_VARIABLE_NUMBER` on palaces above ~32k drawers. Mirrors the #851 fix in `miner.py`: drawer fetch is now paginated at `batch_size=5000`. Per-source aggregation works across batches, so the LLM regeneration call still groups chunks correctly. (#1073, #1107)
 - **CLI and `fact_checker --stdin` mojibaked non-ASCII content on Windows.** Python defaults `sys.stdin`/`stdout`/`stderr` to the system ANSI codepage (cp1252/cp1251/cp950), so `mempalace search > out.txt` and piped fact_checker invocations corrupted Cyrillic / CJK drawer text at the process boundary. New `mempalace/_stdio.py` helper reconfigures all three streams to UTF-8 on `sys.platform == "win32"`, with per-stream `errors` policy: `surrogateescape` on stdin (preserves bad bytes from redirected files for the consumer's parser), `replace` on stdout/stderr (substitutes `?` instead of raising `UnicodeEncodeError` mid-print). With this, all three user-facing console_scripts (`mcp_server`, `hooks_cli`, `cli`/`fact_checker`) now reconfigure through the shared helper on Windows. (#1282)
 - **MCP knowledge-graph tools forwarded malformed date strings to SQLite.** `tool_kg_query` (`as_of`), `tool_kg_add` (`valid_from`), and `tool_kg_invalidate` (`ended`) accepted any string and produced empty result sets on natural-language inputs like `"March 2026"` or `"yesterday"`, so callers (especially LLM agents) could not distinguish "no fact at this time" from "your date format was unrecognized." New `sanitize_iso_date()` validator in `config.py` accepts only `YYYY-MM-DD` (and passes through `None`/`""`); all three tools call it before values reach the storage layer. **Behavior change:** previously-silent date typos now raise a clear `ValueError` naming the offending field. Partial dates (`YYYY`, `YYYY-MM`) are rejected because KG queries compare TEXT dates lexicographically, and full ISO-8601 with time (`YYYY-MM-DDTHH:MM:SS`, timezone offsets) is not yet accepted; file an issue if you have a use case. (#1164, #1167)
 - **MCP server's `_kg` was a module-level singleton.** Multi-tenant hosts that rotate `MEMPALACE_PALACE_PATH` between tool calls hit the wrong sqlite file, because the KG was constructed once at import time while the ChromaDB side was already per-call via `_get_client()`. The KG is now resolved per-call through a lazy per-path cache (`_kg_by_path` keyed by `os.path.abspath`, with a double-checked-locking init under `_kg_cache_lock`). `tool_reconnect` drains and `close()`s cached KGs alongside the existing chroma reconnect. A `_call_kg` retry guard catches `sqlite3.ProgrammingError` once after a reconnect race. (#1136, #1160)
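The atomic-save sequence listed above for `EntityRegistry.save()` (sibling `.tmp`, `fsync`, `chmod 0o600`, `os.replace`) can be sketched roughly as follows. The helper name `atomic_write_json` and the sample payload are hypothetical; the real method is not shown in this part of the diff and may differ in detail.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(target: Path, payload: dict) -> None:
    """Write ``payload`` so a crash never leaves a truncated ``target``."""
    tmp = target.with_name(target.name + ".tmp")  # sibling, so same filesystem
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
        fh.flush()
        os.fsync(fh.fileno())   # push the bytes to disk before the rename
    os.chmod(tmp, 0o600)        # owner-only permissions on POSIX
    os.replace(tmp, target)     # swap into place in one step


with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "entity_registry.json"
    atomic_write_json(path, {"people": ["Ada"]})
    atomic_write_json(path, {"people": ["Ada", "Grace"]})
    result = json.loads(path.read_text(encoding="utf-8"))
    leftover = (Path(d) / "entity_registry.json.tmp").exists()
```

Until `os.replace` returns, readers still see the previous file, which is the property the changelog entry relies on.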
 ---
 ## [3.3.4] — unreleased
 ### Added
@@ -19,6 +34,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 ### Bug Fixes
 - **MCP server `tool_diary_write` SIGSEGV when default EF provider differs.** `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x persists the EF *identity* (its `name()`) with the collection but not the EF *instance/configuration*, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` — its `name()` matches `mempalace.embedding`'s spoofed `"default"` so the identity check passes, but its provider list is chromadb's default rather than the user's resolved device. The miner / Stop hook ingest path routes through the backend helper and binds the configured EF instead. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon) the default provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. `_get_collection` now reuses `ChromaBackend._resolve_embedding_function()` on the reopen branches that actually open a collection (warm-cache reads stay zero-cost), matching the miner/backend path. (#1299, follow-up to #1262 / #1289)
 - **Hooks no longer recreate `~/.mempalace/` after the user removes it.** When `~/.mempalace/` is deleted (a strong "do not auto-capture" signal), the next `Stop`, `PreCompact`, or `SessionStart` hook would silently rebuild the dir hierarchy and ingest existing transcripts: `_log()` called `STATE_DIR.mkdir(parents=True, exist_ok=True)` unconditionally, so the very act of writing `[HH:MM] SESSION START …` recreated `~/.mempalace/hook_state/`; subsequent calls in the save path then materialized `palace/`, `wal/`, `knowledge_graph.sqlite3`, and N drawers from `~/.claude/projects/*.jsonl`. All four entry points (`hook_stop`, `hook_precompact`, `hook_session_start`, and `_log` itself) now check a new module-level `PALACE_ROOT = Path.home() / ".mempalace"` constant first and short-circuit (returning `{}` on stdout, never logging) when the directory is absent. The user-removable directory becomes a kill-switch — `rm -rf ~/.mempalace` is now a stable state. Net: 23 lines added in `mempalace/hooks_cli.py`, 5 unit tests in `tests/test_hooks_cli.py`. (#1305)
 - **Cross-wing topic tunnels for hyphenated dir names.** `mempalace init` recorded the `topics_by_wing` registry key under the raw directory name (e.g. `mempalace-public`), while `mempalace.yaml`'s `wing` field used the lower-cased + separator-collapsed slug (`mempalace_public`). At mine time the miner read the slug from the yaml and missed the registry, so `_compute_topic_tunnels_for_wing` returned `0` silently. Real-world: any project whose folder contained a hyphen or space lost every topic tunnel. Now both call sites route through a shared `normalize_wing_name()` in `config.py`. (#1194, follow-up to #1180)
 - **CLI `mempalace search` retrieval quality.** The CLI was using pure ChromaDB cosine distance with no BM25 rerank, so drawers containing every query term but embedding as noise (directory listings, diff output, shell logs) scored `Match: 0.0` alongside genuinely irrelevant results with no way to tell them apart. Wired the CLI through the same `_hybrid_rank` the `mempalace_search` MCP tool already used, and surfaced both `cosine=` and `bm25=` scores in the output so users see which component of the match is firing. MCP search was unaffected; this fixes the human-facing CLI parity gap.
 - **Legacy-palace distance-metric warning.** CLI search now detects palaces created before `hnsw:space=cosine` was consistently set and prints a one-line notice pointing at `mempalace repair`. Without the warning such palaces silently used L2 distance, under which the similarity display floored every result to `Match: 0.0`. New palaces mined today already set cosine correctly and now have invariant tests pinning that behavior so future refactors can't silently regress it. (#1179)
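The wing-name mismatch described in the list above (registry keyed by `mempalace-public`, yaml holding `mempalace_public`) goes away once both call sites derive the key through one function. A minimal sketch, assuming a lower-case plus separator-collapse rule; `slugify_wing` is a hypothetical stand-in, and the real `normalize_wing_name()` may apply different rules.

```python
import re


def slugify_wing(name: str) -> str:
    """Hypothetical slug rule: lower-case, collapse hyphens and spaces to '_'."""
    return re.sub(r"[\s\-]+", "_", name.strip().lower())


# Both the registry writer and the miner derive the key the same way,
# so a lookup by the yaml slug finds the registry entry.
registry = {slugify_wing("mempalace-public"): ["topic-a"]}
wing_from_yaml = slugify_wing("MemPalace Public")
topics = registry.get(wing_from_yaml, [])
```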
@@ -6,6 +6,10 @@
 > domain — including `mempalace.tech` — is an impostor and may distribute
 > malware. Details and timeline: [docs/HISTORY.md](docs/HISTORY.md).
 > [!IMPORTANT]
 > **🚨 Claude Code sessions expire in 30 days without auto-save hooks wired!** **[Read this →](https://github.com/MemPalace/mempalace/discussions/1388)**
 <div align="center">
 <img src="assets/mempalace_logo.png" alt="MemPalace" width="240">
@@ -0,0 +1,71 @@
 """Stdio UTF-8 reconfiguration helper for Windows entry points.
 Python on Windows defaults stdio to the system ANSI codepage
 (cp1252/cp1251/cp950 depending on locale), which mojibakes UTF-8 input
 or output the moment a non-Latin character shows up. Every console
 entry point that touches stdio needs to fix this on Windows -- the MCP
 server, the CLI, the fact_checker `--stdin` mode -- so the
 reconfigure code lives here in one place to keep the per-stream
 errors policies aligned across them.
 Per-stream errors policy is caller-chosen:
 * MCP server uses ``strict`` on stdout/stderr because everything written
  there is server-controlled JSON-RPC; any encode failure is a real bug
  the operator wants loud.
 * CLI / fact_checker use ``replace`` on stdout/stderr because they print
  verbatim drawer text that may contain surrogate halves round-tripped
  from filenames -- ``strict`` would crash mid-print.
 * All callers use ``surrogateescape`` on stdin so a malformed byte from
  a redirected file or a misbehaving client survives as a lone surrogate
  the consumer's parser surfaces, instead of ``UnicodeDecodeError``
  killing the read loop on the first bad byte.
 """
 from __future__ import annotations
 import sys
 from typing import Callable, Optional
 def reconfigure_stdio_utf8_on_windows(
    *,
    stdin_errors: str = "surrogateescape",
    stdout_errors: str = "strict",
    stderr_errors: str = "strict",
    on_failure: Optional[Callable[[str, BaseException], None]] = None,
 ) -> None:
    """Reconfigure stdio to UTF-8 on Windows. No-op elsewhere.
    Args:
        stdin_errors: errors= policy for stdin.reconfigure().
        stdout_errors: errors= policy for stdout.reconfigure().
        stderr_errors: errors= policy for stderr.reconfigure().
        on_failure: optional ``(stream_name, exc) -> None`` callback for
            streams whose ``reconfigure`` raises (e.g. Jupyter-replaced
            streams that lack the method-shape we expect). Defaults to a
            ``WARNING:`` line on the original sys.stderr.
    """
    if sys.platform != "win32":
        return
    policies = (
        ("stdin", stdin_errors),
        ("stdout", stdout_errors),
        ("stderr", stderr_errors),
    )
    for name, errors in policies:
        stream = getattr(sys, name, None)
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is None:
            continue
        try:
            reconfigure(encoding="utf-8", errors=errors)
        except Exception as exc:  # noqa: BLE001 -- last-resort guard
            if on_failure is not None:
                on_failure(name, exc)
            else:
                print(
                    f"WARNING: Could not reconfigure {name} to UTF-8: {exc}",
                    file=sys.stderr,
                )
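The three `errors` policies named in the module docstring above behave differently on a malformed byte. The sketch below exercises them with plain `bytes.decode` / `str.encode` rather than real stdio streams, so it runs on any platform; the sample bytes are invented for the illustration.

```python
# "caf" + a stray 0xE9 byte (invalid as UTF-8 here) + " " + Cyrillic text
raw = b"caf\xe9 \xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"

# stdin-style decode: the bad byte survives as a lone surrogate (U+DCE9)
text = raw.decode("utf-8", errors="surrogateescape")

# surrogateescape is reversible: encoding restores the original bytes
roundtrip = text.encode("utf-8", errors="surrogateescape")

# stdout-style encode with "replace": the lone surrogate becomes "?"
printable = text.encode("utf-8", errors="replace")

# "strict" raises on the same input instead of substituting
try:
    text.encode("utf-8", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

This is why a caller that prints untrusted text may prefer `replace`, while a caller that only emits its own JSON can afford `strict`.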
@@ -1,5 +1,6 @@
 """ChromaDB-backed MemPalace storage backend (RFC 001 reference implementation)."""
 import contextlib
 import datetime as _dt
 import logging
 import os
@@ -764,11 +765,58 @@ def _as_list(v: Any) -> list:
    return [v]
-class ChromaCollection(BaseCollection):
-    """Thin adapter translating ChromaDB dict returns into typed results."""
-    def __init__(self, collection):
+def _close_client(client) -> None:
+    """Call ``PersistentClient.close()`` if available, swallow otherwise.
+    chromadb 1.5.x exposes ``Client.close()`` to release rust-side SQLite
+    file locks; older versions relied on GC. Try/except keeps forward-compat.
+    """
+    if client is None:
+        return
+    try:
+        client.close()
+    except Exception:
+        logger.debug("client.close() unavailable or failed", exc_info=True)
 class ChromaCollection(BaseCollection):
    """Thin adapter translating ChromaDB dict returns into typed results.
    When ``palace_path`` is set, all write methods (``add``, ``upsert``,
    ``update``, ``delete``) acquire ``mine_palace_lock(palace_path)`` for the
    duration of the underlying chromadb call. This serializes MCP and other
    direct-backend writers against ``mempalace mine`` and against each other,
    closing the race between concurrent writers that triggers ChromaDB's
    multi-threaded HNSW corruption (#974/#965).
    The lock is the same primitive used by ``miner.mine()`` so re-entrant
    acquisition from inside the mine pipeline (mine -> _mine_body ->
    collection.upsert) is short-circuited by the per-thread guard inside
    ``mine_palace_lock`` — no self-deadlock.
    ``palace_path=None`` disables the wrapping, preserving the legacy
    no-lock behaviour for callers that construct a ``ChromaCollection``
    directly without going through ``ChromaBackend``.
    """
    def __init__(self, collection, palace_path: Optional[str] = None):
        self._collection = collection
        self._palace_path = palace_path
    @contextlib.contextmanager
    def _write_lock(self):
        """Acquire ``mine_palace_lock`` for the configured palace, if any.
        No-op (yields immediately) when ``self._palace_path`` is None.
        """
        if self._palace_path is None:
            yield
            return
        # Late import — palace.py imports ChromaBackend from this module.
        from ..palace import mine_palace_lock
        with mine_palace_lock(self._palace_path):
            yield
    # ------------------------------------------------------------------
    # Writes
@@ -780,7 +828,8 @@ class ChromaCollection(BaseCollection):
            kwargs["metadatas"] = metadatas
        if embeddings is not None:
            kwargs["embeddings"] = embeddings
-        self._collection.add(**kwargs)
+        with self._write_lock():
+            self._collection.add(**kwargs)
    def upsert(self, *, documents, ids, metadatas=None, embeddings=None):
        kwargs: dict[str, Any] = {"documents": documents, "ids": ids}
@@ -788,7 +837,8 @@ class ChromaCollection(BaseCollection):
            kwargs["metadatas"] = metadatas
        if embeddings is not None:
            kwargs["embeddings"] = embeddings
-        self._collection.upsert(**kwargs)
+        with self._write_lock():
+            self._collection.upsert(**kwargs)
    def update(
        self,
@@ -807,7 +857,8 @@ class ChromaCollection(BaseCollection):
            kwargs["metadatas"] = metadatas
        if embeddings is not None:
            kwargs["embeddings"] = embeddings
-        self._collection.update(**kwargs)
+        with self._write_lock():
+            self._collection.update(**kwargs)
    # ------------------------------------------------------------------
    # Reads
@@ -951,7 +1002,8 @@ class ChromaCollection(BaseCollection):
            kwargs["ids"] = ids
        if where is not None:
            kwargs["where"] = where
-        self._collection.delete(**kwargs)
+        with self._write_lock():
+            self._collection.delete(**kwargs)
    def count(self):
        return self._collection.count()
@@ -1065,7 +1117,7 @@ class ChromaBackend(BaseBackend):
        db_path = os.path.join(palace_path, "chroma.sqlite3")
        # DB was present when cache was built but is now missing → invalidate.
        if cached is not None and not os.path.isfile(db_path):
-            self._clients.pop(palace_path, None)
+            _close_client(self._clients.pop(palace_path, None))
            self._freshness.pop(palace_path, None)
            cached = None
            cached_inode, cached_mtime = 0, 0.0
@@ -1081,13 +1133,14 @@ class ChromaBackend(BaseBackend):
        )
        if cached is None or inode_changed or mtime_changed or mtime_appeared:
-            _fix_blob_seq_ids(palace_path)
+            # An inode swap means we are reopening a different physical DB
+            # (post-restore, fresh palace at the same path, etc.); drop the
+            # per-process gate so the quarantine pre-checks run again
+            # against the new disk state instead of trusting cached "we
+            # already cleaned this path" credit from the prior inode.
            if inode_changed:
                ChromaBackend._quarantined_paths.discard(palace_path)
-            if palace_path not in ChromaBackend._quarantined_paths:
-                quarantine_invalid_hnsw_metadata(palace_path)
-                quarantine_stale_hnsw(palace_path)
-                ChromaBackend._quarantined_paths.add(palace_path)
+            ChromaBackend._prepare_palace_for_open(palace_path)
            cached = chromadb.PersistentClient(path=palace_path)
            self._clients[palace_path] = cached
            # Re-stat after the client constructor runs: chromadb creates
@@ -1123,6 +1176,36 @@ class ChromaBackend(BaseBackend):
    # property; locking would add cost without correctness gain.
    _quarantined_paths: set[str] = set()
    @staticmethod
    def _prepare_palace_for_open(palace_path: str) -> None:
        """Run the pre-open safety pass shared by :meth:`make_client` and
        :meth:`_client`.
        Three steps, all required before constructing a ``PersistentClient``:
        1. ``_fix_blob_seq_ids`` — repairs the BLOB seq_id quirk that bites
           certain chromadb migrations.
        2. ``quarantine_invalid_hnsw_metadata`` — renames aside any HNSW
           ``index_metadata.pickle`` that fails to load, so chromadb opens
           against an empty index instead of crashing on the unloadable
           pickle (#1266 / PR #1285).
        3. ``quarantine_stale_hnsw`` — also gated by :attr:`_quarantined_paths`
           so it fires once per palace per process. This is the SIGSEGV
           prevention path for stale HNSW segments (see #1121, #1132, #1263);
           wiring it through this helper means CLI mining, search, repair,
           and status all benefit, not just the legacy ``make_client``
           callers.
        Idempotent: safe to call from any code path that is about to open or
        re-open a palace. The ``_quarantined_paths`` gate prevents thrash on
        hot paths (e.g. ``_client()`` is called on every backend operation).
        """
        _fix_blob_seq_ids(palace_path)
        if palace_path not in ChromaBackend._quarantined_paths:
            quarantine_invalid_hnsw_metadata(palace_path)
            quarantine_stale_hnsw(palace_path)
            ChromaBackend._quarantined_paths.add(palace_path)
    @staticmethod
    def make_client(palace_path: str):
        """Create a fresh ``PersistentClient`` (fixes BLOB seq_ids first).
@@ -1135,11 +1218,7 @@ class ChromaBackend(BaseBackend):
        :attr:`_quarantined_paths` for the rationale (cold-start protection
        vs. runtime thrash on steady-write daemons).
        """
-        _fix_blob_seq_ids(palace_path)
-        if palace_path not in ChromaBackend._quarantined_paths:
-            quarantine_invalid_hnsw_metadata(palace_path)
-            quarantine_stale_hnsw(palace_path)
-            ChromaBackend._quarantined_paths.add(palace_path)
+        ChromaBackend._prepare_palace_for_open(palace_path)
        return chromadb.PersistentClient(path=palace_path)
    @staticmethod
@@ -1205,17 +1284,25 @@ class ChromaBackend(BaseBackend):
        else:
            collection = client.get_collection(collection_name, **ef_kwargs)
        _pin_hnsw_threads(collection)
-        return ChromaCollection(collection)
+        return ChromaCollection(collection, palace_path=palace_path)
    def close_palace(self, palace) -> None:
-        """Drop cached handles for ``palace``. Accepts ``PalaceRef`` or legacy path str."""
+        """Drop cached handles for ``palace`` and release its SQLite file lock.
+        Accepts ``PalaceRef`` or legacy path str. chromadb's rust-side file
+        lock is held until ``PersistentClient.close()`` is called, so plain
+        dict eviction would leave the palace path unreopenable and
+        unremovable in the same process.
+        """
        path = palace.local_path if isinstance(palace, PalaceRef) else palace
        if path is None:
            return
-        self._clients.pop(path, None)
+        _close_client(self._clients.pop(path, None))
        self._freshness.pop(path, None)
    def close(self) -> None:
+        for client in self._clients.values():
+            _close_client(client)
        self._clients.clear()
        self._freshness.clear()
        self._closed = True
@@ -1256,7 +1343,7 @@ class ChromaBackend(BaseBackend):
            },
            **ef_kwargs,
        )
-        return ChromaCollection(collection)
+        return ChromaCollection(collection, palace_path=palace_path)
 def _normalize_get_collection_args(args, kwargs):
@@ -232,6 +232,13 @@ def cmd_init(args):
    from .project_scanner import discover_entities
    from .room_detector_local import detect_rooms_local
    # Honor --palace (issue #1313): without this, init silently ignored the
    # flag and always used ~/.mempalace. Mirror the env-var pattern used by
    # mcp_server.py so every downstream read of ``cfg.palace_path`` (Pass 0,
    # cfg.init(), the post-init mine) routes to the user-specified location.
    if getattr(args, "palace", None):
        os.environ["MEMPALACE_PALACE_PATH"] = os.path.abspath(os.path.expanduser(args.palace))
    cfg = MempalaceConfig()
    # Resolve entity-detection languages: --lang overrides config.
@@ -310,8 +317,7 @@ def cmd_init(args):
                )
        except LLMError as e:
            print(
-                f"  LLM init failed ({e}). "
-                f"Running heuristics-only — pass --no-llm to silence this."
+                f"  LLM init failed ({e}). Running heuristics-only — pass --no-llm to silence this."
            )
    # Pass 0: detect whether the corpus is AI-dialogue. Writes
@@ -912,7 +918,7 @@ def cmd_compress(args):
    # Store compressed versions (unless dry-run)
    if not args.dry_run:
        try:
-            comp_col = backend.get_or_create_collection(palace_path, "mempalace_compressed")
+            comp_col = backend.get_or_create_collection(palace_path, "mempalace_closets")
            for doc_id, compressed, meta, stats in compressed_entries:
                comp_meta = dict(meta)
                comp_meta["compression_ratio"] = round(stats["size_ratio"], 1)
@@ -923,7 +929,7 @@ def cmd_compress(args):
                    metadatas=[comp_meta],
                )
            print(
-                f"  Stored {len(compressed_entries)} compressed drawers in 'mempalace_compressed' collection."
+                f"  Stored {len(compressed_entries)} compressed drawers in 'mempalace_closets' collection."
            )
        except Exception as e:
            print(f"  Error storing compressed drawers: {e}")
@@ -939,7 +945,25 @@ def cmd_compress(args):
        print("  (dry run -- nothing stored)")
 def _reconfigure_stdio_utf8_on_windows():
    """Decode stdio as UTF-8 on Windows for the primary `mempalace` CLI.
    Thin wrapper around the shared helper in ``mempalace._stdio``. The CLI
    overrides stdout/stderr to ``replace`` because ``mempalace search``
    prints verbatim drawer text that may carry surrogate halves
    round-tripped from filenames -- ``strict`` would crash mid-print and
    lose the rest of the search result block. stdin keeps the default
    ``surrogateescape`` so a redirected non-UTF-8 file does not kill the
    read on the first bad byte.
    """
    from ._stdio import reconfigure_stdio_utf8_on_windows
    reconfigure_stdio_utf8_on_windows(stdout_errors="replace", stderr_errors="replace")
 def main():
    _reconfigure_stdio_utf8_on_windows()
    version_label = f"MemPalace {__version__}"
    parser = argparse.ArgumentParser(
        description="MemPalace — Give your AI a memory. No API key required.",
@@ -40,6 +40,7 @@ import json
 import os
 import re
 import time
 import urllib.parse
 import urllib.request
 import urllib.error
 from datetime import datetime
@@ -101,6 +102,14 @@ class LLMConfig:
        self.endpoint = (endpoint or os.environ.get("LLM_ENDPOINT", "")).rstrip("/")
        self.key = key or os.environ.get("LLM_KEY", "")
        self.model = model or os.environ.get("LLM_MODEL", "")
        if self.endpoint:
            # Privacy-by-architecture: reject file:// and other non-HTTP schemes
            # so a misconfigured endpoint cannot exfiltrate local files.
            scheme = urllib.parse.urlparse(self.endpoint).scheme.lower()
            if scheme not in ("http", "https"):
                raise ValueError(
                    f"LLM_ENDPOINT must use http:// or https:// (got scheme {scheme!r})"
                )
    def missing(self) -> list:
        missing = []
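The scheme check added to `LLMConfig.__init__` above can be tried on its own. A minimal sketch: `check_endpoint` is a hypothetical free function that mirrors the same `urlparse` test, and the example URLs are invented.

```python
import urllib.parse


def check_endpoint(endpoint: str) -> str:
    """Return ``endpoint`` unchanged if its scheme is http or https."""
    scheme = urllib.parse.urlparse(endpoint).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(
            f"endpoint must use http:// or https:// (got scheme {scheme!r})"
        )
    return endpoint


accepted = check_endpoint("https://llm.example.invalid/v1")

rejected = []
for candidate in ("file:///etc/passwd", "ftp://host/x", "localhost:8080"):
    try:
        check_endpoint(candidate)
    except ValueError:
        rejected.append(candidate)
```

Note that a bare `host:port` value is rejected as well, so the scheme has to be written out explicitly.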
@@ -221,17 +230,28 @@ def regenerate_closets(
        print("No drawers in palace.")
        return {"processed": 0}
-    all_data = drawers_col.get(limit=total, include=["documents", "metadatas"])
-    by_source = {}
-    for doc_id, doc, meta in zip(all_data["ids"], all_data["documents"], all_data["metadatas"]):
-        source = meta.get("source_file", "unknown")
-        w = meta.get("wing", "")
-        if wing and w != wing:
-            continue
-        if source not in by_source:
-            by_source[source] = {"drawer_ids": [], "content": [], "meta": meta}
-        by_source[source]["drawer_ids"].append(doc_id)
-        by_source[source]["content"].append(doc)
+    # Paginate the fetch — a single get(limit=total, ...) blows through
+    # SQLite's SQLITE_MAX_VARIABLE_NUMBER (32766) on large palaces and
+    # crashes inside chromadb (see #802, #850, #1073).
+    by_source: dict = {}
+    batch_size = 5000
+    offset = 0
+    while offset < total:
+        batch = drawers_col.get(limit=batch_size, offset=offset, include=["documents", "metadatas"])
+        ids = batch["ids"]
+        if not ids:
+            break
+        for doc_id, doc, meta in zip(ids, batch["documents"], batch["metadatas"]):
+            meta = meta or {}
+            source = meta.get("source_file", "unknown")
+            w = meta.get("wing", "")
+            if wing and w != wing:
+                continue
+            if source not in by_source:
+                by_source[source] = {"drawer_ids": [], "content": [], "meta": meta}
+            by_source[source]["drawer_ids"].append(doc_id)
+            by_source[source]["content"].append(doc)
+        offset += len(ids)
    sources = list(by_source.keys())
    if sample > 0:
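The paginated fetch in the hunk above keeps per-source grouping intact across batch boundaries. The sketch below reproduces the loop against a hypothetical in-memory `FakeCollection` (not a chromadb class) with a tiny `batch_size`, so the boundary behaviour is easy to see.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in that mimics get(limit=, offset=)."""

    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)

    def get(self, limit, offset=0):
        page = self._rows[offset:offset + limit]
        return {
            "ids": [r[0] for r in page],
            "metadatas": [r[1] for r in page],
        }


def group_by_source(col, batch_size=2):
    """Group ids by source_file, fetching at most batch_size rows per call."""
    by_source = {}
    total = col.count()
    offset = 0
    while offset < total:
        batch = col.get(limit=batch_size, offset=offset)
        ids = batch["ids"]
        if not ids:
            break  # collection shrank while paging; stop instead of looping
        for doc_id, meta in zip(ids, batch["metadatas"]):
            source = (meta or {}).get("source_file", "unknown")
            by_source.setdefault(source, []).append(doc_id)
        offset += len(ids)
    return by_source


rows = [
    ("d1", {"source_file": "a.md"}),
    ("d2", {"source_file": "b.md"}),
    ("d3", {"source_file": "a.md"}),
    ("d4", None),
    ("d5", {"source_file": "b.md"}),
]
grouped = group_by_source(FakeCollection(rows))
```

Advancing by `len(ids)` rather than by `batch_size` keeps the offset correct even if a backend returns a short page.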
@@ -81,6 +81,38 @@ def sanitize_kg_value(value: str, field_name: str = "value") -> str:
    return value
 # ISO-8601 date validator for knowledge-graph temporal parameters
 # (as_of, valid_from, valid_to, ended). Parameterized queries already
 # prevent SQL injection, but unvalidated date strings silently miss
 # every row — callers cannot distinguish "no fact at this time" from
 # "your date format was unrecognized." Require full YYYY-MM-DD: KG
 # queries compare TEXT dates lexicographically, so partials like "2026"
 # would re-introduce silent empty results (e.g. "2026-01-01" <= "2026"
 # is False), defeating the purpose of validation.
 _ISO_DATE_RE = re.compile(r"^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$")
 def sanitize_iso_date(value, field_name: str = "date"):
    """Validate an ISO-8601 date string, accepting None or empty as-is.
    Accepts only ``YYYY-MM-DD``. Raises ValueError on any other
    non-empty input so the MCP layer can surface a clear error to the
    caller instead of silently returning empty results. Partial dates
    (``YYYY``, ``YYYY-MM``) are rejected because KG queries compare
    TEXT dates lexicographically and would silently exclude valid facts.
    """
    if value is None or value == "":
        return value
    if not isinstance(value, str):
        raise ValueError(f"{field_name} must be a string")
    value = value.strip()
    if not _ISO_DATE_RE.match(value):
        raise ValueError(
            f"{field_name}={value!r} is not a valid ISO-8601 date (expected YYYY-MM-DD)"
        )
    return value
 def sanitize_content(value: str, max_length: int = 100_000) -> str:
    """Validate drawer/diary content length."""
    if not isinstance(value, str) or not value.strip():
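The rationale for rejecting partial dates can be checked in isolation. This is a minimal sketch, assuming the KG predicate boils down to a plain string comparison; `covers` is a hypothetical stand-in, not a function from this repo, and the pattern is copied from `_ISO_DATE_RE` above.

```python
import re

# Same pattern as _ISO_DATE_RE in the hunk above.
ISO_DATE_RE = re.compile(r"^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$")


def covers(valid_from: str, as_of: str) -> bool:
    """Hypothetical stand-in for the KG's TEXT comparison ``valid_from <= as_of``."""
    return valid_from <= as_of


# A full date compares the way a reader expects.
full_ok = covers("2026-01-01", "2026-06-15")

# A partial date is a strict prefix of the stored value, so it sorts first
# and the fact is silently excluded.
partial_miss = covers("2026-01-01", "2026")

# The pattern checks shape only, not calendar validity.
shape_only = bool(ISO_DATE_RE.match("2026-02-31"))
```

Note that the pattern accepts impossible days such as `2026-02-31`; if calendar validity matters, an extra `datetime.date.fromisoformat` check would be one option.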
@@ -11,6 +11,7 @@ Same palace as project mining. Different ingest strategy.
 import os
 import sys
 import hashlib
 import logging
 from pathlib import Path
 from datetime import datetime
 from collections import defaultdict
@@ -24,6 +25,8 @@ from .palace import (
    mine_lock,
 )
 logger = logging.getLogger("mempalace_mcp")
 # Cached hall keywords — avoids re-reading config per drawer
 _HALL_KEYWORDS_CACHE = None
@@ -331,7 +334,7 @@ def _file_chunks_locked(collection, source_file, chunks, wing, room, agent, extr
        try:
            collection.delete(where={"source_file": source_file})
        except Exception:
-            pass
+            logger.debug("Stale-drawer purge failed for %s", source_file, exc_info=True)
        # Batch chunks into bounded upserts so large transcripts keep most of
        # the embedding speedup without one huge Chroma/SQLite request. Keep
@@ -89,7 +89,7 @@ def dedup_source_group(col, drawer_ids, threshold=DEFAULT_THRESHOLD, dry_run=Tru
    kept = []
    to_delete = []
-    for did, doc, meta in items:
+    for did, doc, _meta in items:
        if not doc or len(doc) < 20:
            to_delete.append(did)
            continue
@@ -873,7 +873,7 @@ class Dialect:
        for date_key in sorted(by_date.keys()):
            lines.append(f"=MOMENTS[{date_key}]=")
-            for z, fnum in by_date[date_key]:
+            for z, _fnum in by_date[date_key]:
                entities = []
                for p in z.get("people", []):
                    code = self.encode_entity(p)
@@ -16,6 +16,7 @@ Usage:
 """
 import json
 import os
 import re
 import urllib.request
 import urllib.parse
@@ -320,11 +321,35 @@ class EntityRegistry:
            self._path.parent.chmod(0o700)
        except (OSError, NotImplementedError):
            pass
-        self._path.write_text(json.dumps(self._data, indent=2), encoding="utf-8")
+        # Atomic write: serialize to a sibling temp file in the same dir
        # (so os.replace stays on one filesystem), fsync, then rename over
        # the target. A crash mid-write leaves the previous registry intact
        # instead of a half-written file or an empty file from the truncate.
        payload = json.dumps(self._data, indent=2)
        tmp_path = self._path.with_name(self._path.name + ".tmp")
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        try:
-            self._path.chmod(0o600)
+            tmp_path.chmod(0o600)
        except (OSError, NotImplementedError):
            pass
        os.replace(tmp_path, self._path)
        # On ext4 (and similar) the rename's durability across power loss
        # requires an additional fsync on the parent directory. Without it,
        # the kernel can ack the rename and a crash reverts to the state
        # where the temp file is present and the target is at the old version.
        try:
            dir_fd = os.open(str(self._path.parent), os.O_RDONLY)
            try:
                os.fsync(dir_fd)
            finally:
                os.close(dir_fd)
        except OSError:
            # Windows and some special filesystems reject directory fds — they
            # have different durability semantics on rename anyway.
            pass
    @staticmethod
    def _empty() -> dict:
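The write-temp, fsync, rename, fsync-directory sequence from the hunk above, pulled out as a standalone sketch. `atomic_write_json` is a hypothetical helper name used for illustration only; the real code lives inline in `EntityRegistry` and also applies `chmod`.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Replace ``path`` with ``data`` so readers never see a partial file."""
    payload = json.dumps(data, indent=2)
    # Sibling temp file: os.replace must stay on one filesystem.
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    # Best-effort fsync of the parent directory so the rename itself is durable.
    try:
        dir_fd = os.open(str(path.parent), os.O_RDONLY)
    except OSError:
        return  # e.g. Windows, where directories cannot be opened this way
    try:
        os.fsync(dir_fd)
    except OSError:
        pass
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "registry.json"
    atomic_write_json(target, {"version": 1})
    atomic_write_json(target, {"version": 2})
    result = json.loads(target.read_text(encoding="utf-8"))
    leftover = (target.parent / "registry.json.tmp").exists()
```

Because the temp name is fixed, two concurrent writers to the same registry could still collide on it; the sketch, like the hunk, assumes a single writer.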
@@ -27,6 +27,7 @@ Usage:
 from __future__ import annotations
 import logging
 import os
 import re
 from datetime import datetime, timezone
@@ -35,6 +36,8 @@ from datetime import datetime, timezone
 # ~/.mempalace/known_entities.json on every check_text call.
 from .miner import _load_known_entities_raw
 logger = logging.getLogger("mempalace_mcp")
 # Narrow detection patterns — parse "X is Y's Z" and "X's Z is Y".
 # Names are captured greedily as word sequences (letters + optional
@@ -214,6 +217,7 @@ def _check_kg_contradictions(text: str, palace_path: str) -> list:
        try:
            facts = kg.query_entity(subject, direction="outgoing")
        except Exception:
            logger.debug("KG lookup failed for subject %r", subject, exc_info=True)
            continue
        if not facts:
            continue
@@ -303,11 +307,27 @@ def _edit_distance(s1: str, s2: str) -> int:
    return prev[-1]
 def _reconfigure_stdio_utf8_on_windows():
    """Decode --stdin payload as UTF-8 on Windows.
    Thin wrapper around the shared helper in ``mempalace._stdio``. Mirrors
    the primary CLI policy: stdout/stderr use ``replace`` because
    extracted fact text can include surrogate halves round-tripped from
    filenames -- ``strict`` would raise UnicodeEncodeError mid-print.
    stdin keeps the default ``surrogateescape``.
    """
    from ._stdio import reconfigure_stdio_utf8_on_windows
    reconfigure_stdio_utf8_on_windows(stdout_errors="replace", stderr_errors="replace")
 if __name__ == "__main__":
    import argparse
    import json
    import sys
    _reconfigure_stdio_utf8_on_windows()
    parser = argparse.ArgumentParser(
        description="Check text against known facts in the MemPalace palace.",
        epilog="Exits 0 when no issues found, 1 when one or more issues detected.",
@@ -16,6 +16,23 @@ from pathlib import Path
 SAVE_INTERVAL = 15
 STATE_DIR = Path.home() / ".mempalace" / "hook_state"
 PALACE_ROOT = Path.home() / ".mempalace"
 def _palace_root_exists() -> bool:
    """User-removable kill-switch.
    If ~/.mempalace/ does not exist, the user has explicitly cleared it.
    All hook side effects (logging, state dir creation, mining, ingestion)
    must respect this and short-circuit BEFORE touching disk — including
    before logging the short-circuit itself.
    Uses ``is_dir()`` rather than ``exists()`` so a stray regular file at
    ``~/.mempalace`` is treated as absent (a broken symlink is absent under
    either check). Otherwise the kill-switch would be bypassed and
    ``STATE_DIR.mkdir()`` would later crash on ``NotADirectoryError``.
    """
    return PALACE_ROOT.is_dir()
 def _mempalace_python() -> str:
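The difference between `exists()` and `is_dir()` that the kill-switch relies on can be seen with a throwaway directory. A small sketch; the paths are temporary stand-ins for `~/.mempalace`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A stray regular file where the palace directory should be.
    stray = Path(tmp) / ".mempalace"
    stray.write_text("not a directory", encoding="utf-8")
    file_exists = stray.exists()
    file_is_dir = stray.is_dir()

    # A path the user has removed entirely.
    removed = Path(tmp) / "removed"
    removed_is_dir = removed.is_dir()
```

With `exists()` the stray file would count as "palace present"; `is_dir()` treats both cases as absent, which is what the hooks want.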
@@ -142,6 +159,8 @@ _state_dir_initialized = False
 def _log(message: str):
    """Append to hook state log file."""
    if not _palace_root_exists():
        return  # User removed the palace; do not recreate by logging
    global _state_dir_initialized
    try:
        if not _state_dir_initialized:
@@ -550,6 +569,9 @@ def _wing_from_transcript_path(transcript_path: str) -> str:
 def hook_stop(data: dict, harness: str):
    """Stop hook: block every N messages for auto-save."""
    if not _palace_root_exists():
        _output({})
        return
    parsed = _parse_harness_input(data, harness)
    session_id = parsed["session_id"]
    stop_hook_active = parsed["stop_hook_active"]
@@ -659,6 +681,9 @@ def hook_stop(data: dict, harness: str):
 def hook_session_start(data: dict, harness: str):
    """Session start hook: initialize session tracking state."""
    if not _palace_root_exists():
        _output({})
        return
    parsed = _parse_harness_input(data, harness)
    session_id = parsed["session_id"]
@@ -673,6 +698,9 @@ def hook_session_start(data: dict, harness: str):
 def hook_precompact(data: dict, harness: str):
    """Precompact hook: mine transcript synchronously, then allow compaction."""
    if not _palace_root_exists():
        _output({})
        return
    parsed = _parse_harness_input(data, harness)
    session_id = parsed["session_id"]
    transcript_path = parsed["transcript_path"]
@@ -171,6 +171,15 @@ class KnowledgeGraph:
            add_triple("Max", "does", "swimming", valid_from="2025-01-01")
            add_triple("Alice", "worried_about", "Max injury", valid_from="2026-01-01", valid_to="2026-02-01")
        """
        # Reject inverted intervals: a triple with valid_to < valid_from
        # would never satisfy `valid_from <= as_of AND valid_to >= as_of`,
        # so it would be invisible to every query — silently corrupt.
        if valid_from is not None and valid_to is not None and valid_to < valid_from:
            raise ValueError(
                f"valid_to={valid_to!r} is before valid_from={valid_from!r}; "
                "an inverted interval would be invisible to every KG query"
            )
        sub_id = self._entity_id(subject)
        obj_id = self._entity_id(obj)
        pred = predicate.lower().replace(" ", "_")
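Why an inverted interval is unreachable follows directly from the predicate quoted in the comment. A minimal sketch, assuming queries use exactly `valid_from <= as_of AND valid_to >= as_of` on full `YYYY-MM-DD` strings; `visible` is a hypothetical helper, not part of `KnowledgeGraph`.

```python
def visible(valid_from: str, valid_to: str, as_of: str) -> bool:
    """Hypothetical form of the query predicate quoted in the comment above."""
    return valid_from <= as_of and valid_to >= as_of


probe_dates = ["2025-12-31", "2026-01-15", "2026-02-01", "2026-03-01"]

# A well-formed interval matches the dates inside it (bounds inclusive).
normal_hits = [d for d in probe_dates if visible("2026-01-01", "2026-02-01", d)]

# Swapping the bounds leaves no date that satisfies both comparisons.
inverted_hits = [d for d in probe_dates if visible("2026-02-01", "2026-01-01", d)]
```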
@@ -124,6 +124,8 @@ class Layer1:
        # Score each drawer: prefer high importance, recent filing
        scored = []
        for doc, meta in zip(docs, metas):
            meta = meta or {}
            doc = doc or ""
            importance = 3
            # Try multiple metadata keys that might carry weight info
            for key in ("importance", "emotional_weight", "weight"):
@@ -155,7 +157,7 @@ class Layer1:
            lines.append(room_line)
            total_len += len(room_line)
-            for imp, meta, doc in entries:
+            for _imp, meta, doc in entries:
                source = Path(meta.get("source_file", "")).name if meta.get("source_file") else ""
                # Truncate doc to keep L1 compact
@@ -222,6 +224,8 @@ class Layer2:
        lines = [f"## L2 — ON-DEMAND ({len(docs)} drawers)"]
        for doc, meta in zip(docs[:n_results], metas[:n_results]):
            meta = meta or {}
            doc = doc or ""
            room_name = meta.get("room", "?")
            source = Path(meta.get("source_file", "")).name if meta.get("source_file") else ""
            snippet = doc.strip().replace("\n", " ")
@@ -283,7 +287,7 @@ class Layer3:
        for i, (doc, meta, dist) in enumerate(zip(docs, metas, dists), 1):
            meta = meta or {}
            doc = doc or ""
-            similarity = round(1 - dist, 3)
+            similarity = round(max(0.0, 1 - dist), 3)
            wing_name = meta.get("wing", "?")
            room_name = meta.get("room", "?")
            source = Path(meta.get("source_file", "")).name if meta.get("source_file") else ""
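The clamp matters because cosine distance ranges over [0, 2], so `1 - dist` goes negative for vectors that point away from each other. A short sketch of the same expression; `similarity` is a hypothetical wrapper name.

```python
def similarity(dist: float) -> float:
    """Map a cosine distance to a non-negative similarity, as in Layer3."""
    return round(max(0.0, 1 - dist), 3)


close = similarity(0.25)      # near neighbours keep their score
identical = similarity(0.0)   # zero distance maps to full similarity
opposed = similarity(1.4)     # would be -0.4 without the clamp
```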
@@ -46,8 +46,10 @@ import argparse  # noqa: E402  (deferred until after stdio protection above)
 import json  # noqa: E402
 import logging  # noqa: E402
 import hashlib  # noqa: E402
 import sqlite3  # noqa: E402
 import threading  # noqa: E402
 import time  # noqa: E402
-from datetime import datetime  # noqa: E402
+from datetime import date, datetime  # noqa: E402
 from pathlib import Path  # noqa: E402
 from .config import (  # noqa: E402
@@ -55,6 +57,7 @@ from .config import (  # noqa: E402
    sanitize_kg_value,
    sanitize_name,
    sanitize_content,
    sanitize_iso_date,
 )
 from .version import __version__  # noqa: E402
 from chromadb.errors import NotFoundError as _ChromaNotFoundError  # noqa: E402
@@ -78,7 +81,7 @@ from .palace_graph import (  # noqa: E402
    follow_tunnels,
 )
-from .knowledge_graph import KnowledgeGraph  # noqa: E402
+from .knowledge_graph import KnowledgeGraph, DEFAULT_KG_PATH  # noqa: E402
 logging.basicConfig(level=logging.INFO, format="%(message)s", stream=sys.stderr)
 logger = logging.getLogger("mempalace_mcp")
@@ -103,12 +106,61 @@ if _args.palace:
    os.environ["MEMPALACE_PALACE_PATH"] = os.path.abspath(_args.palace)
 _config = MempalaceConfig()
-# Only override KG path when --palace is explicitly provided; otherwise use
+
-# KnowledgeGraph's default (~/.mempalace/knowledge_graph.sqlite3).
+_kg_by_path: dict[str, KnowledgeGraph] = {}
-if _args.palace:
+_kg_cache_lock = threading.Lock()
-    _kg = KnowledgeGraph(db_path=os.path.join(_config.palace_path, "knowledge_graph.sqlite3"))
+_palace_flag_given: bool = bool(_args.palace)
-else:
+
-    _kg = KnowledgeGraph()
+
 def _resolve_kg_path() -> str:
    if _palace_flag_given:
        return os.path.join(_config.palace_path, "knowledge_graph.sqlite3")
    return DEFAULT_KG_PATH
 def _get_kg() -> KnowledgeGraph:
    path = os.path.abspath(_resolve_kg_path())
    kg = _kg_by_path.get(path)
    if kg is not None:
        return kg
    with _kg_cache_lock:
        kg = _kg_by_path.get(path)
        if kg is None:
            kg = KnowledgeGraph(db_path=path)
            _kg_by_path[path] = kg
    return kg
 def _call_kg(op):
    """Run ``op(kg)`` against the cached KG with one-shot retry on close.
    Race we're guarding against: a handler grabs ``kg = _get_kg()`` and is
    about to call ``kg.add_triple(...)`` when ``tool_reconnect`` fires on
    another thread, drains ``_kg_by_path``, and closes the underlying
    sqlite3.Connection. The handler's call then raises
    ``sqlite3.ProgrammingError: Cannot operate on a closed database`` and
    bubbles up as a -32000 to the MCP client even though the user just
    asked for a reconnect.
    Catch that single class of error, evict the stale entry from the
    cache (only if it still points at the closed instance — another
    thread may have already replaced it), and try once more with a fresh
    KG. Beyond one retry give up: a second close means we're losing a
    sustained race we won't win in this loop, and a hung loop is worse
    than a clear failure surface.
    """
    for attempt in range(2):
        kg = _get_kg()
        try:
            return op(kg)
        except sqlite3.ProgrammingError:
            if attempt == 0:
                path = os.path.abspath(_resolve_kg_path())
                with _kg_cache_lock:
                    if _kg_by_path.get(path) is kg:
                        _kg_by_path.pop(path, None)
                continue
            raise
 _client_cache = None
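The cache-plus-retry shape of `_get_kg` / `_call_kg` can be exercised without the real `KnowledgeGraph`. This is a sketch under stated assumptions: it caches raw `sqlite3.Connection` objects instead of `KnowledgeGraph` instances, and `get_conn` / `call_with_retry` are hypothetical names.

```python
import sqlite3
import threading

_cache: dict = {}
_lock = threading.Lock()


def get_conn(path: str) -> sqlite3.Connection:
    """Double-checked lookup: lock-free fast path, locked slow path."""
    conn = _cache.get(path)
    if conn is not None:
        return conn
    with _lock:
        conn = _cache.get(path)
        if conn is None:
            conn = sqlite3.connect(path, check_same_thread=False)
            _cache[path] = conn
    return conn


def call_with_retry(path: str, op):
    """Run ``op(conn)``; on a closed handle, evict it and try once more."""
    for attempt in range(2):
        conn = get_conn(path)
        try:
            return op(conn)
        except sqlite3.ProgrammingError:
            if attempt == 0:
                with _lock:
                    # Evict only if the cache still holds the closed instance.
                    if _cache.get(path) is conn:
                        _cache.pop(path, None)
                continue
            raise


# Simulate the race: the cached handle is closed behind the caller's back.
first = get_conn(":memory:")
first.close()
value = call_with_retry(":memory:", lambda c: c.execute("SELECT 41 + 1").fetchone()[0])
replaced = _cache[":memory:"] is not first
```

The identity check before eviction is what keeps one thread from throwing away a fresh handle that another thread has already installed.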
@@ -274,47 +326,94 @@ def _get_client():
 def _get_collection(create=False):
-    """Return the ChromaDB collection, caching the client between calls."""
+    """Return the ChromaDB collection, caching the client between calls.
-    global _collection_cache, _metadata_cache, _metadata_cache_time
+
-    try:
+    On failure, log the exception and retry once after clearing the client
-        client = _get_client()
+    and collection caches. Tools were silently returning ``None`` when a
-        if create:
+    cached client/collection went stale — typically after the chromadb
-            # hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor
+    rust bindings invalidated a handle following an out-of-band write —
-            # HNSW insert path, which has a race in repairConnectionsForUpdate /
+    leaving the LLM with no diagnostic and no recovery path. The retry
-            # addPoint (see issues #974, #965). Set via metadata on fresh
+    forces ``_get_client()`` to rebuild from scratch (which re-runs
-            # collections and re-applied via _pin_hnsw_threads() for legacy
+    ``quarantine_stale_hnsw`` per #1322), so the second attempt heals the
-            # palaces whose collections were created before this fix (the
+    common stale-handle / stale-HNSW case automatically.
-            # runtime config does not persist cross-process in chromadb 1.5.x,
+    """
-            # so the retrofit runs every time _get_collection opens a cache).
+    global _client_cache, _collection_cache, _metadata_cache, _metadata_cache_time
-            #
+    for attempt in range(2):
-            # ChromaDB 1.5.x's Rust binding SIGSEGVs when get_or_create_collection
+        try:
-            # is called with metadata that differs from what's stored. The split
+            client = _get_client()
-            # below skips the metadata-comparison codepath for existing
+            # ChromaDB 1.x persists the EF *identity* (its ``name()``) with the
-            # collections, mirroring the backend-layer fix from #1262.
+            # collection but not the EF *instance/configuration*. So a reader or
-            try:
+            # writer that omits ``embedding_function=`` silently gets chromadb's
-                raw = client.get_collection(_config.collection_name)
+            # built-in ``DefaultEmbeddingFunction`` — its ``name()`` matches the
-            except _ChromaNotFoundError:
+            # one we spoof in ``mempalace.embedding`` (both report ``"default"``,
-                raw = client.create_collection(
+            # the identity check passes), but the *provider list* is chromadb's
-                    _config.collection_name,
+            # default rather than the user's resolved device. On bleeding-edge
-                    metadata={
+            # interpreters (#1299: python 3.14 + chromadb 1.5.x on Apple Silicon)
-                        "hnsw:space": "cosine",
+            # that default provider selection can SIGSEGV the host process on
-                        "hnsw:num_threads": 1,
+            # first ``col.add()``. The miner / Stop hook ingest path avoids this
-                        **_HNSW_BLOAT_GUARD,
+            # because it routes through ``ChromaBackend.get_collection``, which
-                    },
+            # resolves the EF via ``ChromaBackend._resolve_embedding_function``;
-                )
+            # the MCP server bypassed that abstraction. Resolve the EF inside the
-            _pin_hnsw_threads(raw)
+            # branches that actually open a collection so warm-cache reads stay
-            _collection_cache = ChromaCollection(raw)
+            # zero-cost. Reuse the backend helper so the two call sites can't
-            _metadata_cache = None
+            # drift on logging or fallback semantics.
-            _metadata_cache_time = 0
+            if create:
-        elif _collection_cache is None:
+                ef = ChromaBackend._resolve_embedding_function()
-            raw = client.get_collection(_config.collection_name)
+                ef_kwargs = {"embedding_function": ef} if ef is not None else {}
-            _pin_hnsw_threads(raw)
+                # hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor
-            _collection_cache = ChromaCollection(raw)
+                # HNSW insert path, which has a race in repairConnectionsForUpdate /
-            _metadata_cache = None
+                # addPoint (see issues #974, #965). Set via metadata on fresh
-            _metadata_cache_time = 0
+                # collections and re-applied via _pin_hnsw_threads() for legacy
-        return _collection_cache
+                # palaces whose collections were created before this fix (the
-    except Exception:
+                # runtime config does not persist cross-process in chromadb 1.5.x,
-        return None
+                # so the retrofit runs every time _get_collection opens a cache).
                #
                # ChromaDB 1.5.x's Rust binding SIGSEGVs when get_or_create_collection
                # is called with metadata that differs from what's stored. The split
                # below skips the metadata-comparison codepath for existing
                # collections, mirroring the backend-layer fix from #1262.
                try:
                    raw = client.get_collection(_config.collection_name, **ef_kwargs)
                except _ChromaNotFoundError:
                    raw = client.create_collection(
                        _config.collection_name,
                        metadata={
                            "hnsw:space": "cosine",
                            "hnsw:num_threads": 1,
                            **_HNSW_BLOAT_GUARD,
                        },
                        **ef_kwargs,
                    )
                _pin_hnsw_threads(raw)
                _collection_cache = ChromaCollection(raw, palace_path=_config.palace_path)
                _metadata_cache = None
                _metadata_cache_time = 0
            elif _collection_cache is None:
                ef = ChromaBackend._resolve_embedding_function()
                ef_kwargs = {"embedding_function": ef} if ef is not None else {}
                raw = client.get_collection(_config.collection_name, **ef_kwargs)
                _pin_hnsw_threads(raw)
                _collection_cache = ChromaCollection(raw, palace_path=_config.palace_path)
                _metadata_cache = None
                _metadata_cache_time = 0
            return _collection_cache
        except Exception:
            logger.exception(
                "_get_collection attempt %d/2 failed (palace=%s, create=%s)",
                attempt + 1,
                _config.palace_path,
                create,
            )
            if attempt == 0:
                # Reset all caches so the next attempt forces _get_client()
                # to rebuild the chromadb client from scratch — that path
                # re-runs quarantine_stale_hnsw (#1322) and reopens the
                # collection cleanly, healing the common stale-handle case.
                _client_cache = None
                _collection_cache = None
                _metadata_cache = None
                _metadata_cache_time = 0
    return None
 def _no_palace():
@@ -433,7 +532,6 @@ def _tool_status_via_sqlite() -> dict:
        "total_drawers": total,
        "wings": wings,
        "rooms": rooms,
        "palace_path": _config.palace_path,
        "protocol": PALACE_PROTOCOL,
        "aaak_dialect": AAAK_SPEC,
        "vector_disabled": True,
@@ -472,7 +570,6 @@ def tool_status():
        "total_drawers": count,
        "wings": wings,
        "rooms": rooms,
        "palace_path": _config.palace_path,
        "protocol": PALACE_PROTOCOL,
        "aaak_dialect": AAAK_SPEC,
    }
@@ -656,7 +753,7 @@ def tool_check_duplicate(content: str, threshold: float = 0.9):
            "vector_disabled": True,
            "vector_disabled_reason": _vector_disabled_reason,
            "hint": (
-                "duplicate detection requires vector search; run " "`mempalace repair` to restore"
+                "duplicate detection requires vector search; run `mempalace repair` to restore"
            ),
        }
    try:
@@ -669,10 +766,12 @@ def tool_check_duplicate(content: str, threshold: float = 0.9):
        if results["ids"] and results["ids"][0]:
            for i, drawer_id in enumerate(results["ids"][0]):
                dist = results["distances"][0][i]
-                similarity = round(1 - dist, 3)
+                similarity = round(max(0.0, 1 - dist), 3)
                if similarity >= threshold:
-                    meta = results["metadatas"][0][i]
+                    # Chroma 1.5.x can return None for partially-flushed rows;
-                    doc = results["documents"][0][i]
+                    # coerce to empty sentinels so downstream .get() is safe.
                    meta = results["metadatas"][0][i] or {}
                    doc = results["documents"][0][i] or ""
                    duplicates.append(
                        {
                            "id": drawer_id,
@@ -827,7 +926,7 @@ def tool_add_drawer(
        if existing and existing["ids"]:
            return {"success": True, "reason": "already_exists", "drawer_id": drawer_id}
    except Exception:
-        pass
+        logger.debug("Idempotency pre-check failed for %s", drawer_id, exc_info=True)
    try:
        col.upsert(
@@ -893,12 +992,21 @@ def tool_get_drawer(drawer_id: str):
            return {"error": f"Drawer not found: {drawer_id}"}
        meta = result["metadatas"][0]
        doc = result["documents"][0]
        # source_file is the absolute filesystem path written by the
        # miners. Reduce to its basename before handing it to the MCP
        # client — same threat model as the palace_path leak fix:
        # nested-agent / multi-server topologies treat the client as a
        # separate trust domain. Basename preserves citation utility.
        # Mirrors the searcher.search_memories() return shape.
        safe_meta = dict(meta) if meta else {}
        if safe_meta.get("source_file"):
            safe_meta["source_file"] = Path(safe_meta["source_file"]).name
        return {
            "drawer_id": drawer_id,
            "content": doc,
-            "wing": meta.get("wing", ""),
+            "wing": safe_meta.get("wing", ""),
-            "room": meta.get("room", ""),
+            "room": safe_meta.get("room", ""),
-            "metadata": meta,
+            "metadata": safe_meta,
        }
    except Exception as e:
        return {"error": str(e)}
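The basename reduction in `tool_get_drawer` amounts to the following. A minimal sketch; `redact_source_file` is a hypothetical name, and it uses `PurePosixPath` so the result is the same on every platform, whereas the hunk uses `Path`.

```python
from pathlib import PurePosixPath


def redact_source_file(meta):
    """Copy ``meta`` and reduce ``source_file`` to its basename."""
    safe = dict(meta) if meta else {}
    if safe.get("source_file"):
        safe["source_file"] = PurePosixPath(safe["source_file"]).name
    return safe


original = {"source_file": "/home/user/notes/plan.md", "wing": "wing_notes"}
redacted = redact_source_file(original)
empty = redact_source_file(None)
```

Copying before mutating keeps the stored metadata untouched; only the response handed to the MCP client loses the directory part.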
@@ -933,6 +1041,13 @@ def tool_list_drawers(wing: str = None, room: str = None, limit: int = 20, offse
            kwargs["where"] = where
        result = col.get(**kwargs)
        # Compute total matching drawers for pagination.
        if where:
            total_result = col.get(where=where, include=[])
            total = len(total_result["ids"])
        else:
            total = col.count()
        drawers = []
        for i, did in enumerate(result["ids"]):
            meta = result["metadatas"][i]
@@ -947,6 +1062,7 @@ def tool_list_drawers(wing: str = None, room: str = None, limit: int = 20, offse
            )
        return {
            "drawers": drawers,
            "total": total,
            "count": len(drawers),
            "offset": offset,
            "limit": limit,
@@ -1031,22 +1147,41 @@ def tool_kg_query(entity: str, as_of: str = None, direction: str = "both"):
    """Query the knowledge graph for an entity's relationships."""
    try:
        entity = sanitize_kg_value(entity, "entity")
        as_of = sanitize_iso_date(as_of, "as_of")
    except ValueError as e:
        return {"error": str(e)}
    if direction not in ("outgoing", "incoming", "both"):
        return {"error": "direction must be 'outgoing', 'incoming', or 'both'"}
-    results = _kg.query_entity(entity, as_of=as_of, direction=direction)
+    results = _call_kg(lambda kg: kg.query_entity(entity, as_of=as_of, direction=direction))
    return {"entity": entity, "as_of": as_of, "facts": results, "count": len(results)}
 def tool_kg_add(
-    subject: str, predicate: str, object: str, valid_from: str = None, source_closet: str = None
+    subject: str,
    predicate: str,
    object: str,
    valid_from: str = None,
    valid_to: str = None,
    source_closet: str = None,
    source_file: str = None,
    source_drawer_id: str = None,
 ):
-    """Add a relationship to the knowledge graph."""
+    """Add a relationship to the knowledge graph.
    All temporal and provenance fields are optional. ``valid_to`` lets callers
    backfill historical facts with a known end date in a single call (instead
    of a separate ``kg_invalidate``). ``source_file`` and ``source_drawer_id``
    are RFC 002 provenance fields populated by adapters / bulk importers.
    ``valid_from`` and ``valid_to`` are validated with ``sanitize_iso_date``
    so malformed dates fail fast at the MCP boundary instead of silently
    producing empty query results.
    """
    try:
        subject = sanitize_kg_value(subject, "subject")
        predicate = sanitize_name(predicate, "predicate")
        object = sanitize_kg_value(object, "object")
        valid_from = sanitize_iso_date(valid_from, "valid_from")
        valid_to = sanitize_iso_date(valid_to, "valid_to")
    except ValueError as e:
        return {"success": False, "error": str(e)}
@@ -1057,32 +1192,59 @@ def tool_kg_add(
            "predicate": predicate,
            "object": object,
            "valid_from": valid_from,
            "valid_to": valid_to,
            "source_closet": source_closet,
            "source_file": source_file,
            "source_drawer_id": source_drawer_id,
        },
    )
-    triple_id = _kg.add_triple(
+    triple_id = _call_kg(
-        subject, predicate, object, valid_from=valid_from, source_closet=source_closet
+        lambda kg: kg.add_triple(
            subject,
            predicate,
            object,
            valid_from=valid_from,
            valid_to=valid_to,
            source_closet=source_closet,
            source_file=source_file,
            source_drawer_id=source_drawer_id,
        )
    )
    return {"success": True, "triple_id": triple_id, "fact": f"{subject} → {predicate} → {object}"}
 def tool_kg_invalidate(subject: str, predicate: str, object: str, ended: str = None):
-    """Mark a fact as no longer true (set end date)."""
+    """Mark a fact as no longer true (set end date).
    Returns the actual ``ended`` date that was stored — when the caller omits
    ``ended``, the underlying graph stamps ``date.today()``, and the response
    reflects that resolved value (instead of the literal string ``"today"``)
    so callers can verify what was persisted.
    ``ended`` is validated with ``sanitize_iso_date`` before it is stored.
    """
    try:
        subject = sanitize_kg_value(subject, "subject")
        predicate = sanitize_name(predicate, "predicate")
        object = sanitize_kg_value(object, "object")
        ended = sanitize_iso_date(ended, "ended")
    except ValueError as e:
        return {"success": False, "error": str(e)}
    resolved_ended = ended or date.today().isoformat()
    _wal_log(
        "kg_invalidate",
-        {"subject": subject, "predicate": predicate, "object": object, "ended": ended},
+        {
            "subject": subject,
            "predicate": predicate,
            "object": object,
            "ended": resolved_ended,
        },
    )
-    _kg.invalidate(subject, predicate, object, ended=ended)
+    _call_kg(lambda kg: kg.invalidate(subject, predicate, object, ended=resolved_ended))
    return {
        "success": True,
        "fact": f"{subject} → {predicate} → {object}",
-        "ended": ended or "today",
+        "ended": resolved_ended,
    }
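The point of `resolved_ended` is that the WAL entry, the graph call, and the response all carry the same concrete date. A small sketch of that resolution step; `resolve_ended` is a hypothetical helper, the hunk does this inline.

```python
import re
from datetime import date


def resolve_ended(ended):
    """Return the caller's date, or today's date in ISO form when omitted."""
    return ended or date.today().isoformat()


explicit = resolve_ended("2026-03-01")
defaulted = resolve_ended(None)
from_empty = resolve_ended("")
```

An empty string falls through to today's date as well, which matches `sanitize_iso_date` passing `""` through unchanged.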
@@ -1093,13 +1255,13 @@ def tool_kg_timeline(entity: str = None):
            entity = sanitize_kg_value(entity, "entity")
        except ValueError as e:
            return {"error": str(e)}
-    results = _kg.timeline(entity)
+    results = _call_kg(lambda kg: kg.timeline(entity))
    return {"entity": entity or "all", "timeline": results, "count": len(results)}
 def tool_kg_stats():
    """Knowledge graph overview: entities, triples, relationship types."""
-    return _kg.stats()
+    return _call_kg(lambda kg: kg.stats())
 # ==================== AGENT DIARY ====================
@@ -1112,9 +1274,13 @@ def tool_diary_write(agent_name: str, entry: str, topic: str = "general", wing:
    This is the agent's personal journal — observations, thoughts,
    what it worked on, what it noticed, what it thinks matters.
    Note: ``agent_name`` is normalized to lowercase before storage so
    that diary reads are case-insensitive (see #1243). "Claude",
    "claude", and "CLAUDE" all resolve to the same agent.
    """
    try:
-        agent_name = sanitize_name(agent_name, "agent_name")
+        agent_name = sanitize_name(agent_name, "agent_name").lower()
        entry = sanitize_content(entry)
        topic = sanitize_name(topic, "topic")
    except ValueError as e:
@@ -1123,7 +1289,7 @@ def tool_diary_write(agent_name: str, entry: str, topic: str = "general", wing:
    if wing:
        wing = sanitize_name(wing)
    else:
-        wing = f"wing_{agent_name.lower().replace(' ', '_')}"
+        wing = f"wing_{agent_name.replace(' ', '_')}"
    room = "diary"
    col = _get_collection(create=True)
    if not col:
@@ -1188,9 +1354,14 @@ def tool_diary_read(agent_name: str, last_n: int = 10, wing: str = ""):
    written to. Diary writes from hooks land in project-derived wings
    (``wing_<project>``), so requiring a specific wing on read would
    silo those entries from agent-initiated reads.
    Note: ``agent_name`` is normalized to lowercase before filtering so
    that reads are case-insensitive (see #1243). Entries written under
    pre-fix mixed-case agent names will not match the lowercase filter;
    use ``mempalace repair`` to migrate legacy data if needed.
    """
    try:
-        agent_name = sanitize_name(agent_name, "agent_name")
+        agent_name = sanitize_name(agent_name, "agent_name").lower()
        if wing:
            wing = sanitize_name(wing)
    except ValueError as e:
@@ -1273,7 +1444,7 @@ def tool_hook_settings(silent_save: bool = None, desktop_toast: bool = None):
    try:
        config = MempalaceConfig()
    except Exception:
-        pass
+        logger.debug("Could not re-read config after update", exc_info=True)
    result = {
        "success": True,
@@ -1322,10 +1493,11 @@ def tool_memories_filed_away():
 def tool_reconnect():
-    """Force the MCP server to drop the cached ChromaDB collection and reconnect.
+    """Force the MCP server to drop cached ChromaDB + KnowledgeGraph state.
    Use after external scripts or CLI commands modify the palace database
-    directly, which can leave the in-memory HNSW index stale.
+    or replace ``knowledge_graph.sqlite3`` directly, which can leave the
    in-memory HNSW index stale or pin a closed-on-disk SQLite connection.
    """
    global \
        _client_cache, \
@@ -1343,6 +1515,15 @@ def tool_reconnect():
    # still applies after the reconnect.
    _vector_disabled = False
    _vector_disabled_reason = ""
    # Drain the per-path KnowledgeGraph cache so a replaced sqlite file is
    # reopened on the next tool call rather than served from a stale handle.
    with _kg_cache_lock:
        for kg in _kg_by_path.values():
            try:
                kg.close()
            except Exception:
                pass
        _kg_by_path.clear()
    try:
        col = _get_collection()
        if col is None:
@@ -1419,7 +1600,7 @@ TOOLS = {
        "handler": tool_kg_query,
    },
    "mempalace_kg_add": {
-        "description": "Add a fact to the knowledge graph. Subject → predicate → object with optional time window. E.g. ('Max', 'started_school', 'Year 7', valid_from='2026-09-01').",
+        "description": "Add a fact to the knowledge graph. Subject → predicate → object with optional time window. E.g. ('Max', 'started_school', 'Year 7', valid_from='2026-09-01'). Pass valid_to to backfill an already-ended historical fact in a single call.",
        "input_schema": {
            "type": "object",
            "properties": {
@@ -1433,10 +1614,22 @@ TOOLS = {
                    "type": "string",
                    "description": "When this became true (YYYY-MM-DD, optional)",
                },
                "valid_to": {
                    "type": "string",
                    "description": "When this stopped being true (YYYY-MM-DD, optional). Use for backfilling already-ended historical facts.",
                },
                "source_closet": {
                    "type": "string",
                    "description": "Closet ID where this fact appears (optional)",
                },
                "source_file": {
                    "type": "string",
                    "description": "Source file path the fact was extracted from (optional)",
                },
                "source_drawer_id": {
                    "type": "string",
                    "description": "Drawer ID the fact was extracted from (optional, RFC 002 provenance)",
                },
            },
            "required": ["subject", "predicate", "object"],
        },
@@ -1660,7 +1853,7 @@ TOOLS = {
        "handler": tool_get_drawer,
    },
    "mempalace_list_drawers": {
-        "description": "List drawers with pagination. Optional wing/room filter. Returns IDs, wings, rooms, and content previews.",
+        "description": "List drawers with pagination. Optional wing/room filter. Returns IDs, wings, rooms, content previews, and total matching count for pagination.",
        "input_schema": {
            "type": "object",
            "properties": {
@@ -1801,6 +1994,12 @@ SUPPORTED_PROTOCOL_VERSIONS = [
 def handle_request(request):
    if not isinstance(request, dict):
        return {
            "jsonrpc": "2.0",
            "id": None,
            "error": {"code": -32600, "message": "Invalid Request"},
        }
    method = request.get("method") or ""
    params = request.get("params") or {}
    req_id = request.get("id")
@@ -1838,6 +2037,15 @@ def handle_request(request):
            },
        }
    elif method == "tools/call":
        if not isinstance(params, dict) or "name" not in params:
            return {
                "jsonrpc": "2.0",
                "id": req_id,
                "error": {
                    "code": -32602,
                    "message": "Invalid params: 'name' is required for tools/call",
                },
            }
        tool_name = params.get("name")
        tool_args = params.get("arguments") or {}
        if tool_name not in TOOLS:
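Reviewer sketch: the two validation guards added to handle_request can be exercised in isolation. The helper name `validate_request` is hypothetical and not part of this change; it mirrors only the validation branches shown in the diff, using the same error codes (-32600 and -32602).

```python
def validate_request(request):
    """Return a JSON-RPC error response, or None when the request passes.

    Hypothetical helper: it copies only the guard clauses from the diff and
    does not dispatch to any tool.
    """
    if not isinstance(request, dict):
        return {
            "jsonrpc": "2.0",
            "id": None,
            "error": {"code": -32600, "message": "Invalid Request"},
        }
    params = request.get("params") or {}
    if request.get("method") == "tools/call":
        if not isinstance(params, dict) or "name" not in params:
            return {
                "jsonrpc": "2.0",
                "id": request.get("id"),
                "error": {
                    "code": -32602,
                    "message": "Invalid params: 'name' is required for tools/call",
                },
            }
    return None
```

A non-dict payload (for example a JSON array) cannot carry an id, so the sketch answers with `"id": None`, as the diff does.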
@@ -1886,7 +2094,11 @@ def handle_request(request):
            return {
                "jsonrpc": "2.0",
                "id": req_id,
-                "result": {"content": [{"type": "text", "text": json.dumps(result, indent=2)}]},
+                "result": {
                    "content": [
                        {"type": "text", "text": json.dumps(result, indent=2, ensure_ascii=False)}
                    ]
                },
            }
        except Exception:
            logger.exception(f"Tool error in {tool_name}")
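Reviewer sketch: the effect of passing `ensure_ascii=False` to `json.dumps` can be seen with a small payload. The sample text is illustrative only. Both forms are valid JSON and decode to the same object; the difference is whether non-ASCII characters are escaped on the wire.

```python
import json

payload = {"text": "Привет, 世界"}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(payload)

# As in the diff: characters are emitted as-is and rely on UTF-8 stdio.
readable = json.dumps(payload, ensure_ascii=False)

# Round-tripping either form yields the original object.
round_trip_ok = json.loads(escaped) == json.loads(readable) == payload
```

Because the readable form contains raw non-ASCII text, it depends on the stream encoding being UTF-8, which is what the `reconfigure` call in `main()` is there to ensure.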
@@ -1921,6 +2133,16 @@ def _restore_stdout():
 def main():
    _restore_stdout()
    # Force UTF-8 on stdio. MCP JSON-RPC is UTF-8, but Python on Windows
    # defaults stdin/stdout to the system codepage (e.g. cp1251), which
    # corrupts non-ASCII payloads and surfaces as generic -32000 errors on
    # Cyrillic/CJK content. See PEP 540.
    for stream in (sys.stdin, sys.stdout):
        if hasattr(stream, "reconfigure"):
            try:
                stream.reconfigure(encoding="utf-8", errors="replace")
            except (AttributeError, OSError):
                pass
    logger.info("MemPalace MCP Server starting...")
    # Pre-flight: probe HNSW capacity before any tool call so the warning
    # is visible at startup rather than on first use (#1222). Pure
@@ -1937,7 +2159,7 @@ def main():
            request = json.loads(line)
            response = handle_request(request)
            if response is not None:
-                sys.stdout.write(json.dumps(response) + "\n")
+                sys.stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
                sys.stdout.flush()
        except KeyboardInterrupt:
            break
@@ -12,6 +12,7 @@ import sys
 import shlex
 import hashlib
 import fnmatch
 import logging
 from pathlib import Path
 from datetime import datetime
 from collections import defaultdict
@@ -31,6 +32,8 @@ from .palace import (
    upsert_closet_lines,
 )
 logger = logging.getLogger("mempalace_mcp")
 READABLE_EXTENSIONS = {
    ".txt",
    ".md",
@@ -842,7 +845,7 @@ def process_file(
        try:
            collection.delete(where={"source_file": source_file})
        except Exception:
-            pass
+            logger.debug("Stale-drawer purge failed for %s", source_file, exc_info=True)
        # Batch chunks into bounded upserts so the embedding model sees many
        # chunks per forward pass without building one huge Chroma/SQLite
@@ -118,14 +118,14 @@ def normalize(filepath: str) -> str:
    try:
        file_size = os.path.getsize(filepath)
    except OSError as e:
-        raise IOError(f"Could not read {filepath}: {e}")
+        raise IOError(f"Could not read {filepath}: {e}") from e
    if file_size > 500 * 1024 * 1024:  # 500 MB safety limit
        raise IOError(f"File too large ({file_size // (1024 * 1024)} MB): {filepath}")
    try:
        with open(filepath, "r", encoding="utf-8", errors="replace") as f:
            content = f.read()
    except OSError as e:
-        raise IOError(f"Could not read {filepath}: {e}")
+        raise IOError(f"Could not read {filepath}: {e}") from e
    if not content.strip():
        return content
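Reviewer sketch: what `raise ... from e` changes, shown on a reduced version of the size check. The function names `file_size` and `cause_of` are hypothetical and exist only for this illustration. In Python 3, `IOError` is an alias of `OSError`.

```python
import os


def file_size(filepath: str) -> int:
    """Return the file size, re-raising OS errors with an explicit cause."""
    try:
        return os.path.getsize(filepath)
    except OSError as e:
        # "from e" records the original exception on __cause__, so the
        # traceback shows it as the direct cause of the new error.
        raise IOError(f"Could not read {filepath}: {e}") from e


def cause_of(filepath: str):
    """Return the chained cause of a failed file_size() call, if any."""
    try:
        file_size(filepath)
    except IOError as err:
        return err.__cause__
    return None
```

With the explicit `from e`, the original `OSError` is reported as the direct cause rather than as an exception that happened during handling.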
@@ -6,11 +6,15 @@ Consolidates collection access patterns used by both miners and the MCP server.
 import contextlib
 import hashlib
 import logging
 import os
 import re
 import threading
 from .backends.chroma import ChromaBackend
 logger = logging.getLogger("mempalace_mcp")
 SKIP_DIRS = {
    ".git",
    "node_modules",
@@ -228,7 +232,7 @@ def purge_file_closets(closets_col, source_file: str) -> None:
    try:
        closets_col.delete(where={"source_file": source_file})
    except Exception:
-        pass
+        logger.debug("Closet purge failed for %s", source_file, exc_info=True)
 def upsert_closet_lines(closets_col, closet_id_base, lines, metadata):
@@ -306,7 +310,7 @@ def mine_lock(source_file: str):
                fcntl.flock(lf, fcntl.LOCK_UN)
        except Exception:
-            pass
+            logger.debug("Mine-lock release failed", exc_info=True)
        lf.close()
@@ -314,6 +318,47 @@ class MineAlreadyRunning(RuntimeError):
    """Raised when another `mempalace mine` already holds the per-palace lock."""
 # Per-thread record of palaces this thread already holds the lock for. Used by
 # `mine_palace_lock` to short-circuit re-entrant acquisition from the same
 # thread (e.g. miner.mine() acquires the outer lock then calls
 # ChromaCollection.upsert which now also tries to acquire). Without this guard
 # the inner call would block on its own outer flock (flock(2) locks belong to
 # the open file description, so a second open of the lock file from the same
 # thread is a distinct lock and self-deadlocks).
 #
 # The holder set is tagged with ``pid`` so that a forked child does NOT
 # inherit re-entrant credit from its parent. Python's ``threading.local``
 # state IS copied across fork, but the child must not treat the parent's
 # flock as its own and has to reacquire it. The pid check clears the stale
 # state so a forked child correctly hits the flock path.
 _palace_lock_holders = threading.local()
 def _holder_state():
    """Return the per-thread (pid, keys) record, refreshing after fork."""
    keys = getattr(_palace_lock_holders, "keys", None)
    pid = getattr(_palace_lock_holders, "pid", None)
    current_pid = os.getpid()
    if keys is None or pid != current_pid:
        keys = set()
        _palace_lock_holders.keys = keys
        _palace_lock_holders.pid = current_pid
    return keys
 def _held_by_this_thread(lock_key: str) -> bool:
    """Return True if this thread already holds ``mine_palace_lock`` for ``lock_key``."""
    return lock_key in _holder_state()
 def _mark_held(lock_key: str) -> None:
    _holder_state().add(lock_key)
 def _mark_released(lock_key: str) -> None:
    _holder_state().discard(lock_key)
 @contextlib.contextmanager
 def mine_palace_lock(palace_path: str):
    """Per-palace non-blocking lock around the full `mine` pipeline.
@@ -338,6 +383,12 @@ def mine_palace_lock(palace_path: str):
    Non-blocking: if another `mine` is already writing to this palace,
    raise MineAlreadyRunning so the caller can exit cleanly instead of
    piling up as a waiting worker.
    Re-entrant: if the current thread already holds the lock for the same
    palace, the context manager passes through without re-acquiring. This
    lets ChromaCollection write methods (which acquire the lock themselves
    to protect MCP/direct callers) compose with miner.mine() (which holds
    the outer lock for the entire mine pipeline) without self-deadlock.
    """
    lock_dir = os.path.join(os.path.expanduser("~"), ".mempalace", "locks")
    os.makedirs(lock_dir, exist_ok=True)
@@ -346,6 +397,11 @@ def mine_palace_lock(palace_path: str):
    palace_key = hashlib.sha256(lock_key_source.encode()).hexdigest()[:16]
    lock_path = os.path.join(lock_dir, f"mine_palace_{palace_key}.lock")
    if _held_by_this_thread(palace_key):
        # Same thread already holds the lock for this palace — pass through.
        yield
        return
    lf = open(lock_path, "w")
    acquired = False
    try:
@@ -369,7 +425,11 @@ def mine_palace_lock(palace_path: str):
                raise MineAlreadyRunning(
                    f"another `mempalace mine` is already running against {resolved}"
                ) from exc
-        yield
+        _mark_held(palace_key)
        try:
            yield
        finally:
            _mark_released(palace_key)
    finally:
        if acquired:
            try:
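Reviewer sketch: the pid-tagged, per-thread holder record behind the re-entrancy guard, reduced to a standalone form. The name `holder_keys` and its `getpid` parameter are hypothetical; the parameter exists only so a fork can be simulated without actually forking.

```python
import os
import threading

_holders = threading.local()


def holder_keys(getpid=os.getpid):
    """Return this thread's set of held lock keys.

    The set is replaced with an empty one when it has never been created
    for this thread or when the recorded pid differs from the current pid
    (the state was copied into a forked child).
    """
    keys = getattr(_holders, "keys", None)
    pid = getattr(_holders, "pid", None)
    current = getpid()
    if keys is None or pid != current:
        keys = set()
        _holders.keys = keys
        _holders.pid = current
    return keys


def keys_seen_by_other_thread():
    """Return a copy of the holder set as observed from a fresh thread."""
    seen = []
    t = threading.Thread(target=lambda: seen.append(set(holder_keys())))
    t.start()
    t.join()
    return seen[0]
```

A key marked in one thread is invisible to other threads, and a pid change discards it, so only the thread that really acquired the lock in the current process gets the pass-through.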
@@ -575,7 +575,7 @@ def follow_tunnels(wing: str, room: str, col=None, config=None):
                    if did and did in drawer_map:
                        c["drawer_preview"] = drawer_map[did][:300]
            except Exception:
-                pass
+                logger.debug("Drawer preview hydration failed", exc_info=True)
    return connections
@@ -202,7 +202,7 @@ def detect_rooms_from_files(project_dir: str) -> list:
    SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "venv", "dist", "build"}
-    for root, dirs, filenames in os.walk(project_path):
+    for _root, dirs, filenames in os.walk(project_path):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for filename in filenames:
            name_lower = filename.lower().replace("-", "_").replace(" ", "_")
@@ -134,6 +134,11 @@ def _hybrid_rank(
      themselves. Since the absolute scale is unbounded, BM25 is min-max
      normalized within the candidate set so weights are commensurable.
    Candidates with ``distance=None`` are treated as vector-unknown
    (no vector signal available) and scored on BM25 contribution alone.
    Used by candidate-union mode to merge BM25-only candidates that the
    vector index didn't surface.
    Mutates each result dict to add ``bm25_score`` and reorders the list
    in place. Returns the same list for convenience.
    """
@@ -147,7 +152,11 @@ def _hybrid_rank(
    scored = []
    for r, raw, norm in zip(results, bm25_raw, bm25_norm):
-        vec_sim = max(0.0, 1.0 - r.get("distance", 1.0))
+        distance = r.get("distance")
        if distance is None:
            vec_sim = 0.0
        else:
            vec_sim = max(0.0, 1.0 - distance)
        r["bm25_score"] = round(raw, 3)
        scored.append((vector_weight * vec_sim + bm25_weight * norm, r))
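Reviewer sketch: the scoring rule for candidates without a vector distance, as a standalone function. The name `hybrid_score` and the weight defaults (0.6 and 0.4) are assumptions for illustration; the real defaults are not visible in this hunk.

```python
def hybrid_score(distance, bm25_norm, vector_weight=0.6, bm25_weight=0.4):
    """Blend vector similarity with a min-max normalized BM25 score.

    A distance of None means no vector signal is available, so the
    candidate is scored on its BM25 contribution alone.
    """
    if distance is None:
        vec_sim = 0.0
    else:
        vec_sim = max(0.0, 1.0 - distance)
    return vector_weight * vec_sim + bm25_weight * bm25_norm
```

The explicit `None` check matters because `dict.get("distance", 1.0)` only falls back to the default when the key is absent. A key that is present with the value `None` is returned as-is, and `1.0 - None` raises `TypeError`.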
@@ -236,7 +245,7 @@ def _expand_with_neighbors(drawers_col, matched_doc: str, matched_meta: dict, ra
        all_meta = drawers_col.get(where={"source_file": src}, include=["metadatas"])
        total_drawers = len(all_meta.ids) if all_meta.ids else None
    except Exception:
-        pass
+        logger.debug("total_drawers lookup failed for %s", src, exc_info=True)
    return {
        "text": combined_text,
@@ -288,10 +297,10 @@ def search(query: str, palace_path: str, wing: str = None, room: str = None, n_r
    """
    try:
        col = get_collection(palace_path, create=False)
-    except Exception:
+    except Exception as e:
        print(f"\n  No palace found at {palace_path}")
        print("  Run: mempalace init <dir> then mempalace mine <dir>")
-        raise SearchError(f"No palace found at {palace_path}")
+        raise SearchError(f"No palace found at {palace_path}") from e
    # Alert the user if this palace predates hnsw:space=cosine being set on
    # creation — their similarity scores will be junk until they run repair.
@@ -331,7 +340,7 @@ def search(query: str, palace_path: str, wing: str = None, room: str = None, n_r
    # `_hybrid_rank`; do the same here so CLI results match what agents
    # see via `mempalace_search`.
    hits = [
-        {"text": doc, "distance": float(dist), "metadata": meta or {}}
+        {"text": doc or "", "distance": float(dist), "metadata": meta or {}}
        for doc, meta, dist in zip(docs, metas, dists)
    ]
    hits = _hybrid_rank(hits, query)
@@ -372,6 +381,7 @@ def _bm25_only_via_sqlite(
    room: str = None,
    n_results: int = 5,
    max_candidates: int = 500,
    _include_internal: bool = False,
 ) -> dict:
    """BM25-only search reading drawers directly from chroma.sqlite3.
@@ -540,17 +550,25 @@ def _bm25_only_via_sqlite(
            continue
        if room and meta.get("room") != room:
            continue
        full_source = meta.get("source_file", "") or ""
        candidates.append(
            {
                "text": d["text"],
                "wing": meta.get("wing", "unknown"),
                "room": meta.get("room", "unknown"),
-                "source_file": Path(meta.get("source_file", "?") or "?").name,
+                "source_file": Path(full_source).name if full_source else "?",
                "created_at": meta.get("filed_at", "unknown"),
                # No vector distance available in BM25-only mode.
                "similarity": None,
                "distance": None,
                "matched_via": "bm25_sqlite",
                # Internal: full path + chunk_index let callers (notably
                # candidate_strategy="union") dedupe at chunk granularity
                # rather than basename — two files in different directories
                # may share a basename, and one source_file is split across
                # multiple chunks. Stripped before this helper returns.
                "_source_file_full": full_source,
                "_chunk_index": meta.get("chunk_index"),
            }
        )
@@ -565,6 +583,12 @@ def _bm25_only_via_sqlite(
    hits = candidates[:n_results]
    for h in hits:
        h.pop("_score", None)
        # Strip internal fields by default so the public BM25-only fallback
        # response stays clean. Callers that need chunk-precise dedup
        # (notably the union-merge path) opt in via _include_internal.
        if not _include_internal:
            h.pop("_source_file_full", None)
            h.pop("_chunk_index", None)
    return {
        "query": query,
@@ -576,6 +600,117 @@ def _bm25_only_via_sqlite(
    }
 def _merge_bm25_union_candidates(
    hits: list,
    query: str,
    palace_path: str,
    wing: str,
    room: str,
    n_results: int,
    max_distance: float = 0.0,
 ) -> None:
    """Append top-K BM25-only candidates from sqlite into ``hits`` in place.
    Used by ``search_memories(..., candidate_strategy="union")`` to widen
    the rerank pool's *source* (not just its size) — vector-only candidate
    selection skips docs whose embeddings are far from the query even when
    BM25 signal is strong.
    Dedup is chunk-precise: the key is ``(_source_file_full, _chunk_index)``
    so two files sharing a basename in different directories don't collide,
    and a vector hit on chunk N of a file doesn't block BM25 from
    contributing chunk M of the same file. Falls back to ``source_file``
    only when full-path/chunk metadata is absent.
    BM25-only additions carry ``distance=None`` so ``_hybrid_rank`` scores
    them on BM25 contribution alone.
    When ``max_distance > 0.0`` (a strict vector-distance threshold is
    set), BM25-only candidates are skipped entirely — they have no vector
    distance to satisfy the threshold, and silently injecting them would
    break the existing ``max_distance`` guarantee that hybrid results lie
    within the requested vector-distance bound.
    """
    if max_distance > 0.0:
        return
    try:
        bm25_extra = _bm25_only_via_sqlite(
            query,
            palace_path,
            wing=wing,
            room=room,
            n_results=n_results * 3,
            _include_internal=True,
        ).get("results", [])
    except Exception:
        logger.debug("candidate_strategy=union: BM25 fetch failed", exc_info=True)
        return
    def _dedup_key(entry: dict):
        full = entry.get("_source_file_full")
        ci = entry.get("_chunk_index")
        if full and ci is not None:
            return (full, ci)
        # Fall back to basename only when richer metadata is missing —
        # avoids silently dropping candidates on legacy data while still
        # giving chunk-precise dedup whenever the metadata is present.
        return entry.get("source_file")
    seen = {_dedup_key(h) for h in hits}
    for bh in bm25_extra:
        key = _dedup_key(bh)
        if not key or key == "?" or key in seen:
            continue
        bh["distance"] = None
        bh["effective_distance"] = None
        bh["closet_boost"] = 0.0
        hits.append(bh)
        seen.add(key)
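Reviewer sketch: the chunk-precise dedup described in the docstring above, run on made-up data. The paths and entries are illustrative; `dedup_key` follows the nested helper in the diff.

```python
def dedup_key(entry: dict):
    """Prefer (full path, chunk index); fall back to the basename."""
    full = entry.get("_source_file_full")
    ci = entry.get("_chunk_index")
    if full and ci is not None:
        return (full, ci)
    return entry.get("source_file")


vector_hits = [
    {"source_file": "notes.md", "_source_file_full": "/a/notes.md", "_chunk_index": 0},
]
bm25_hits = [
    # Same file and chunk as the vector hit: skipped.
    {"source_file": "notes.md", "_source_file_full": "/a/notes.md", "_chunk_index": 0},
    # Same file, different chunk: kept.
    {"source_file": "notes.md", "_source_file_full": "/a/notes.md", "_chunk_index": 1},
    # Same basename in another directory: kept.
    {"source_file": "notes.md", "_source_file_full": "/b/notes.md", "_chunk_index": 0},
    # No usable source information: skipped.
    {"source_file": "?"},
]

merged = list(vector_hits)
seen = {dedup_key(h) for h in merged}
for bh in bm25_hits:
    key = dedup_key(bh)
    if not key or key == "?" or key in seen:
        continue
    merged.append(bh)
    seen.add(key)
```

Keying on the basename alone would have collapsed all three `notes.md` entries into one.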
 # Strategy dispatch — keeps search_memories' branch count under the
 # project's complexity ceiling (C901 max-complexity=25). New strategies
 # register here.
 _CANDIDATE_MERGERS = {
    "vector": None,  # default no-op
    "union": _merge_bm25_union_candidates,
 }
 def _validate_candidate_strategy(strategy: str) -> None:
    """Raise ``ValueError`` for unknown strategies.
    Called eagerly at the top of ``search_memories`` so invalid values
    fail consistently regardless of whether the call routes through the
    vector path, the BM25-only fallback, or returns an early error dict.
    """
    if strategy not in _CANDIDATE_MERGERS:
        raise ValueError(
            f"candidate_strategy must be one of {tuple(_CANDIDATE_MERGERS)}, got {strategy!r}"
        )
 def _apply_candidate_strategy(
    strategy: str,
    hits: list,
    query: str,
    palace_path: str,
    wing: str,
    room: str,
    n_results: int,
    max_distance: float = 0.0,
 ) -> None:
    """Dispatch to the registered merger for ``strategy``.
    Strategy validity is assumed (``_validate_candidate_strategy`` runs
    earlier); ``"vector"`` is a no-op.
    """
    merger = _CANDIDATE_MERGERS[strategy]
    if merger is not None:
        merger(hits, query, palace_path, wing, room, n_results, max_distance=max_distance)
 def search_memories(
    query: str,
    palace_path: str,
@@ -584,6 +719,7 @@ def search_memories(
    n_results: int = 5,
    max_distance: float = 0.0,
    vector_disabled: bool = False,
    candidate_strategy: str = "vector",
 ) -> dict:
    """Programmatic search — returns a dict instead of printing.
@@ -603,7 +739,30 @@ def search_memories(
            (#1222). Set by the MCP server when the HNSW capacity probe
            detects a divergence that would segfault chromadb on segment
            load.
        candidate_strategy: How candidates for the hybrid re-rank are gathered.
            * ``"vector"`` (default) — preserves historical behavior: top
              ``n_results * 3`` rows from the vector index are the rerank pool.
              Cheap; works well when query and target docs agree in the
              embedding space.
            * ``"union"`` — also pull top ``n_results * 3`` BM25 candidates
              from the sqlite FTS5 index and merge them into the rerank pool
              (deduped by full source path and chunk index, falling back to
              source_file when that metadata is absent). Catches docs with strong
              BM25 signal
              that are vector-distant from the query (e.g. terminology guides
              looked up by narrative-shaped queries; policy clauses surfaced
              by scenario descriptions). Adds one sqlite open + FTS5 MATCH
              per query; perf cost is small but unmeasured at corpus scale.
              Opt in until the cost is characterized.
              When ``max_distance > 0.0`` is also set, BM25-only candidates
              are skipped — they have no vector distance and would silently
              violate the requested distance threshold.
    """
    # Validate the strategy eagerly so invalid values fail the same way
    # regardless of whether the call routes through the vector path or
    # the BM25-only fallback below.
    _validate_candidate_strategy(candidate_strategy)
    if vector_disabled:
        return _bm25_only_via_sqlite(
            query,
@@ -667,7 +826,8 @@ def search_memories(
            if source and source not in closet_boost_by_source:
                closet_boost_by_source[source] = (rank, cdist, cdoc[:200])
    except Exception:
-        pass  # no closets yet — hybrid degrades to pure drawer search
+        # No closets yet — hybrid degrades to pure drawer search.
        logger.debug("Closet collection unavailable; using drawer-only search", exc_info=True)
    # Rank-based boost. The ordinal signal ("which closet matched best") is
    # more reliable than absolute distance on narrative content, where
@@ -681,6 +841,8 @@ def search_memories(
        _first_or_empty(drawer_results, "metadatas"),
        _first_or_empty(drawer_results, "distances"),
    ):
        meta = meta or {}
        doc = doc or ""
        # Filter on raw distance before rounding to avoid precision loss.
        if max_distance > 0.0 and dist > max_distance:
            continue
@@ -697,7 +859,12 @@ def search_memories(
                matched_via = "drawer+closet"
                closet_preview = c_preview
-        effective_dist = dist - boost
+        # Clamp to the valid cosine-distance range [0, 2]. When a strong
        # closet boost (up to 0.40) exceeds the raw distance, the subtraction
        # can go negative — which (a) yields ``similarity > 1.0`` downstream
        # and (b) makes the sort key land *below* ordinary positive distances,
        # inverting the ranking so the best hybrid matches sort last.
        effective_dist = max(0.0, min(2.0, dist - boost))
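Reviewer sketch: the clamp on its own, with a few boundary values. The helper names are hypothetical, and the `similarity` formula (one minus the effective distance) is an assumption about the downstream code, which this hunk does not show.

```python
def effective_distance(dist: float, boost: float) -> float:
    """Subtract the closet boost, then clamp to the cosine range [0, 2]."""
    return max(0.0, min(2.0, dist - boost))


def similarity(dist: float, boost: float) -> float:
    """Assumed downstream conversion: similarity = 1 - effective distance."""
    return 1.0 - effective_distance(dist, boost)
```

With a raw distance of 0.25 and a boost of 0.40, the unclamped value would be negative and the assumed similarity would exceed 1.0; the clamp pins the distance at 0.0 instead.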
        entry = {
            "text": doc,
            "wing": meta.get("wing", "unknown"),
@@ -742,6 +909,7 @@ def search_memories(
                include=["documents", "metadatas"],
            )
        except Exception:
            logger.debug("Neighbor fetch failed for %s", full_source, exc_info=True)
            continue
        docs = source_drawers.documents
        metas_ = source_drawers.metadatas
@@ -779,8 +947,29 @@ def search_memories(
        h["drawer_index"] = best_idx
        h["total_drawers"] = len(ordered_docs)
-    # BM25 hybrid re-rank within the final candidate set.
-    hits = _hybrid_rank(hits, query)
+    # Candidate strategy hook: optionally widen the rerank pool's *source*
+    # before ranking. Default ("vector") is a no-op; "union" merges top-K
    # BM25 candidates from sqlite. See `_apply_candidate_strategy`.
    # ``max_distance`` is forwarded so union mode can refuse to inject
    # BM25-only (distance=None) candidates that would silently bypass the
    # caller's strict distance threshold.
    _apply_candidate_strategy(
        candidate_strategy,
        hits,
        query,
        palace_path,
        wing,
        room,
        n_results,
        max_distance=max_distance,
    )
    # BM25 hybrid re-rank within the final candidate set, then trim back
    # to the requested size. Without the trim, ``candidate_strategy="union"``
    # would return up to 4× ``n_results`` (vector hits + BM25 union pool),
    # breaking the existing ``search_memories`` size contract that the MCP
    # ``limit`` parameter is built on.
    hits = _hybrid_rank(hits, query)[:n_results]
    for h in hits:
        h.pop("_sort_key", None)
        h.pop("_source_file_full", None)
@@ -40,8 +40,9 @@ def _patch_mcp_config(monkeypatch, palace_path, tmp_path):
    import mempalace.mcp_server as mcp_mod
    kg = KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3"))
    monkeypatch.setattr(mcp_mod, "_config", cfg)
-    monkeypatch.setattr(mcp_mod, "_kg", KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3")))
+    monkeypatch.setattr(mcp_mod, "_get_kg", lambda: kg)
 def _get_rss_mb():
@@ -84,8 +84,9 @@ class TestToolStatusMemoryProfile:
        cfg = MempalaceConfig(config_dir=str(tmp_path / "cfg"))
        monkeypatch.setattr(cfg, "_file_config", {"palace_path": palace_path})
        kg = KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3"))
        monkeypatch.setattr(mcp_mod, "_config", cfg)
-        monkeypatch.setattr(mcp_mod, "_kg", KnowledgeGraph(db_path=str(tmp_path / "kg.sqlite3")))
+        monkeypatch.setattr(mcp_mod, "_get_kg", lambda: kg)
        from mempalace.mcp_server import tool_status
@@ -1,5 +1,6 @@
 import os
 import pickle
 import shutil
 import sqlite3
 from pathlib import Path
@@ -208,6 +209,52 @@ def test_query_empty_preserves_embeddings_outer_shape_when_requested():
    assert not_requested.embeddings is None
 def test_chroma_close_palace_releases_sqlite_lock_for_reopen(tmp_path):
    """close_palace must release chromadb's rust-side SQLite file lock so
    a fresh PersistentClient on the same path after shutil.rmtree can
    write without hitting SQLITE_READONLY_DBMOVED."""
    backend = ChromaBackend()
    palace_path = tmp_path / "palace-a"
    ref = PalaceRef(id=str(palace_path), local_path=str(palace_path))
    col = backend.get_collection(palace=ref, collection_name="mempalace_drawers", create=True)
    col.upsert(documents=["hello"], ids=["a"], metadatas=[{"k": "v"}])
    backend.close_palace(ref)
    shutil.rmtree(palace_path)
    col = backend.get_collection(palace=ref, collection_name="mempalace_drawers", create=True)
    col.upsert(documents=["world"], ids=["b"], metadatas=[{"k": "v2"}])
    assert col.count() == 1
 def test_chroma_close_releases_all_cached_clients(tmp_path):
    """close() must release every cached client's SQLite file lock so any
    of their palace paths can be reopened by a fresh backend in the same
    process."""
    backend = ChromaBackend()
    palace_a = tmp_path / "palace-a"
    palace_b = tmp_path / "palace-b"
    ref_a = PalaceRef(id=str(palace_a), local_path=str(palace_a))
    ref_b = PalaceRef(id=str(palace_b), local_path=str(palace_b))
    for ref in (ref_a, ref_b):
        backend.get_collection(palace=ref, collection_name="mempalace_drawers", create=True).upsert(
            documents=["x"], ids=["x"], metadatas=[{"k": "v"}]
        )
    backend.close()
    for path in (palace_a, palace_b):
        shutil.rmtree(path)
        ref = PalaceRef(id=str(path), local_path=str(path))
        fresh = ChromaBackend()
        col = fresh.get_collection(palace=ref, collection_name="mempalace_drawers", create=True)
        col.upsert(documents=["y"], ids=["y"], metadatas=[{"k": "v2"}])
        assert col.count() == 1
        fresh.close()
 def test_chroma_cache_invalidates_when_db_file_missing(tmp_path):
    """A palace rebuild that removes chroma.sqlite3 must drop the stale cache.
@@ -735,9 +782,9 @@ def test_make_client_quarantines_only_on_first_call_per_palace(tmp_path, monkeyp
    ChromaBackend.make_client(palace_path)
    ChromaBackend.make_client(palace_path)
-    assert calls == [palace_path], (
-        "quarantine_stale_hnsw should fire once per palace per process, not on every reconnect"
-    )
+    assert calls == [
+        palace_path
+    ], "quarantine_stale_hnsw should fire once per palace per process, not on every reconnect"
 def test_make_client_gates_invalid_metadata_on_first_call(tmp_path, monkeypatch):
@@ -797,6 +844,67 @@ def test_make_client_quarantines_each_palace_independently(tmp_path, monkeypatch
    assert calls == [palace_a, palace_b]
 # ── _client() cold-start gate (#1121, #1132, #1263) ──────────────────────
 def test_client_quarantines_corrupt_segment_on_first_open(tmp_path, monkeypatch):
    """The instance ``_client()`` path must run ``quarantine_stale_hnsw``
    on first open, mirroring the ``make_client()`` static helper. Before
    PR #1173's wiring was extended here, CLI mining / search / repair /
    status all skipped the quarantine pass and would SIGSEGV on a stale
    HNSW segment (#1121, #1132, #1263)."""
    now = 1_700_000_000.0
    palace, seg = _make_palace_with_segment(
        tmp_path,
        hnsw_mtime=now - 7200,
        sqlite_mtime=now,
        meta_bytes=_CORRUPT_META,
    )
    monkeypatch.setattr(ChromaBackend, "_quarantined_paths", set())
    backend = ChromaBackend()
    try:
        backend._client(str(palace))
    finally:
        backend.close()
    assert not seg.exists(), "_client() should have quarantined the corrupt segment"
    drift_dirs = [p for p in palace.iterdir() if ".drift-" in p.name]
    assert len(drift_dirs) == 1
 def test_client_quarantines_only_on_first_call_per_palace(tmp_path, monkeypatch):
    """Repeated ``_client()`` calls for the same palace re-run quarantine
    at most once — the ``_quarantined_paths`` gate prevents runtime
    thrash on hot paths (``_client()`` is hit on every backend op)."""
    palace_path = str(tmp_path / "palace")
    os.makedirs(palace_path, exist_ok=True)
    (Path(palace_path) / "chroma.sqlite3").write_text("")
    monkeypatch.setattr(ChromaBackend, "_quarantined_paths", set())
    calls: list[str] = []
    def _spy(path, stale_seconds=300.0):
        calls.append(path)
        return []
    monkeypatch.setattr("mempalace.backends.chroma.quarantine_stale_hnsw", _spy)
    backend = ChromaBackend()
    try:
        backend._client(palace_path)
        backend._client(palace_path)
        backend._client(palace_path)
    finally:
        backend.close()
    assert (
        calls == [palace_path]
    ), "quarantine_stale_hnsw should fire once per palace per process from _client(), not on every call"
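# Illustrative sketch only, not the project's code: the once-per-palace gate
# exercised above can be modelled as a class-level set that is checked before
# the expensive pass runs. ``OncePerPathGate``, ``run_once`` and ``forget`` are
# hypothetical names; ``forget`` mirrors the "discard on inode swap" guard the
# commit message describes.

```python
from typing import Callable, List, Set


class OncePerPathGate:
    """Hypothetical stand-in for a ``_quarantined_paths``-style gate."""

    # Class-level set: shared by every instance in the process.
    _seen: Set[str] = set()

    @classmethod
    def run_once(cls, path: str, action: Callable[[str], object]) -> bool:
        """Run ``action(path)`` only the first time ``path`` is seen."""
        if path in cls._seen:
            return False
        action(path)
        cls._seen.add(path)
        return True

    @classmethod
    def forget(cls, path: str) -> None:
        """Drop ``path`` so the next ``run_once`` runs the action again."""
        cls._seen.discard(path)


calls: List[str] = []
for _ in range(3):
    OncePerPathGate.run_once("/tmp/palace", calls.append)
OncePerPathGate.run_once("/tmp/other", calls.append)

# Simulate a swapped database file: forgetting the path re-arms the gate.
OncePerPathGate.forget("/tmp/palace")
OncePerPathGate.run_once("/tmp/palace", calls.append)
```

# Three calls for the same path trigger one action; a second path and a
# forgotten path each trigger one more.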
 # ── _pin_hnsw_threads (per-process retrofit, separate from this PR's gate) ──
@@ -0,0 +1,321 @@
 """Tests for ChromaCollection's palace-write-lock integration.
 Closes the gap left by ``mine_palace_lock`` only protecting the
 ``mempalace mine`` pipeline: MCP/direct writers that call
 ``ChromaCollection.add/upsert/update/delete`` must also serialize against
 mine and against each other to avoid the multi-threaded HNSW corruption
 documented in #974/#965.
 Properties tested:
 * ``ChromaCollection(c, palace_path=p)`` wraps every write with
  ``mine_palace_lock(p)``.
 * Writes raise ``MineAlreadyRunning`` when another holder owns the lock
  (instead of silently racing into the underlying chromadb call).
 * Re-entrant composition with ``miner.mine()`` does not self-deadlock:
  ``with mine_palace_lock(p): col.upsert(...)`` runs to completion.
 * ``ChromaCollection(c)`` (no palace_path) preserves legacy no-lock
  behaviour for tests/callers that build the adapter directly without
  going through ``ChromaBackend``.
 POSIX-only: ``mine_palace_lock`` uses ``fcntl`` on Unix and ``msvcrt`` on
 Windows; the contention semantics differ enough that the cross-process
 tests are skipped on Windows runners.
 """
 from __future__ import annotations
 import multiprocessing
 import os
 import time
 import pytest
 from mempalace.backends.chroma import ChromaCollection
 from mempalace.palace import MineAlreadyRunning, mine_palace_lock
 def _get_mp_context():
    """Same start-method picker as test_palace_locks.py."""
    start_method = "spawn" if os.name == "nt" else "fork"
    return multiprocessing.get_context(start_method)
 # ---------------------------------------------------------------------------
 # Fakes
 # ---------------------------------------------------------------------------
 class _FakeChromaCollection:
    """Records calls; never blocks. Stand-in for chromadb.Collection."""
    def __init__(self):
        self.adds: list[dict] = []
        self.upserts: list[dict] = []
        self.updates: list[dict] = []
        self.deletes: list[dict] = []
    def add(self, **kwargs):
        self.adds.append(kwargs)
    def upsert(self, **kwargs):
        self.upserts.append(kwargs)
    def update(self, **kwargs):
        self.updates.append(kwargs)
    def delete(self, **kwargs):
        self.deletes.append(kwargs)
 # ---------------------------------------------------------------------------
 # Helpers
 # ---------------------------------------------------------------------------
 def _hold_lock(palace_path: str, ready_flag: str, release_flag: str) -> int:
    """Acquire ``mine_palace_lock``, signal readiness, wait for release.
    Mirrors the helper in ``test_palace_locks.py`` so the contention
    semantics match across both test files.
    """
    try:
        with mine_palace_lock(palace_path):
            open(ready_flag, "w").close()
            for _ in range(500):
                if os.path.exists(release_flag):
                    return 0
                time.sleep(0.01)
            return 0
    except MineAlreadyRunning:
        return 1
 # ---------------------------------------------------------------------------
 # Tests — opt-in lock wiring
 # ---------------------------------------------------------------------------
 def test_palace_path_none_skips_lock(tmp_path, monkeypatch):
    """Legacy callers (``ChromaCollection(c)``) keep no-lock behaviour.
    A ``ChromaCollection`` built without ``palace_path`` must not touch the
    lock infrastructure at all. This guards against regressions where a
    test or third-party caller relies on the historical bare-write path.
    """
    monkeypatch.setenv("HOME", str(tmp_path))
    fake = _FakeChromaCollection()
    col = ChromaCollection(fake)  # no palace_path -> no lock
    # Hold the lock in a child process. Without palace_path, the parent
    # write must still succeed (the lock does not gate this caller).
    palace = str(tmp_path / "palace")
    ready = str(tmp_path / "ready")
    release = str(tmp_path / "release")
    ctx = _get_mp_context()
    holder = ctx.Process(target=_hold_lock, args=(palace, ready, release))
    holder.start()
    try:
        for _ in range(500):
            if os.path.exists(ready):
                break
            time.sleep(0.01)
        assert os.path.exists(ready), "holder failed to acquire lock"
        col.upsert(documents=["doc"], ids=["id-1"])
        assert fake.upserts == [{"documents": ["doc"], "ids": ["id-1"]}]
    finally:
        open(release, "w").close()
        holder.join(timeout=5)
 def test_writer_blocks_during_mine(tmp_path, monkeypatch):
    """A held ``mine_palace_lock`` causes ``ChromaCollection`` writes to raise.
    This is the property that closes the MCP-bypass gap: when a mine is in
    flight, MCP/direct writes raise ``MineAlreadyRunning`` rather than
    silently entering chromadb's write path concurrent with mine.
    """
    monkeypatch.setenv("HOME", str(tmp_path))
    palace = str(tmp_path / "palace")
    ready = str(tmp_path / "ready")
    release = str(tmp_path / "release")
    ctx = _get_mp_context()
    holder = ctx.Process(target=_hold_lock, args=(palace, ready, release))
    holder.start()
    try:
        for _ in range(500):
            if os.path.exists(ready):
                break
            time.sleep(0.01)
        assert os.path.exists(ready), "holder failed to acquire lock"
        fake = _FakeChromaCollection()
        col = ChromaCollection(fake, palace_path=palace)
        with pytest.raises(MineAlreadyRunning):
            col.upsert(documents=["doc"], ids=["id-1"])
        with pytest.raises(MineAlreadyRunning):
            col.add(documents=["doc"], ids=["id-2"])
        with pytest.raises(MineAlreadyRunning):
            col.update(ids=["id-3"], documents=["doc"])
        with pytest.raises(MineAlreadyRunning):
            col.delete(ids=["id-4"])
        # The fake must have received NO calls — the lock must gate
        # before reaching the underlying chromadb layer.
        assert fake.upserts == []
        assert fake.adds == []
        assert fake.updates == []
        assert fake.deletes == []
    finally:
        open(release, "w").close()
        holder.join(timeout=5)
 def test_reentrant_inside_mine_passes_through(tmp_path, monkeypatch):
    """``ChromaCollection.upsert`` inside ``mine_palace_lock`` does not deadlock.
    ``miner.mine()`` already holds ``mine_palace_lock(palace_path)`` for the
    full mine pipeline; ``_mine_body`` then calls
    ``collection.upsert(...)``. With the per-thread re-entrant guard in
    ``mine_palace_lock``, the inner acquire is a pass-through and the
    underlying chromadb call runs immediately.
    """
    monkeypatch.setenv("HOME", str(tmp_path))
    palace = str(tmp_path / "palace")
    fake = _FakeChromaCollection()
    col = ChromaCollection(fake, palace_path=palace)
    with mine_palace_lock(palace):
        # If the re-entrant guard were missing, this would self-deadlock on
        # the underlying flock. We rely on pytest-timeout (configured in
        # pyproject.toml) to enforce this in CI; the assertion just confirms
        # the call landed.
        col.upsert(documents=["d"], ids=["i"], metadatas=[{"k": "v"}])
        col.add(documents=["d2"], ids=["i2"])
        col.update(ids=["i"], documents=["d-updated"])
        col.delete(ids=["i2"])
    assert len(fake.upserts) == 1
    assert len(fake.adds) == 1
    assert len(fake.updates) == 1
    assert len(fake.deletes) == 1
 class _SlowFakeChromaCollection(_FakeChromaCollection):
    """Fake whose write methods hold the caller for ``hold_seconds``.
    Used to keep ``mine_palace_lock`` acquired long enough for a sibling
    process to contend deterministically.
    """
    def __init__(self, hold_seconds: float = 0.3):
        super().__init__()
        self._hold = hold_seconds
    def upsert(self, **kwargs):
        time.sleep(self._hold)
        super().upsert(**kwargs)
 def _slow_writer_target(palace_path, tmp_path_str, pid, result_q):
    """Subprocess target: try a slow upsert, report ok/busy."""
    os.environ["HOME"] = tmp_path_str
    # Import inside the child so the HOME override above routes the lock dir.
    from mempalace.backends.chroma import ChromaCollection as _CC
    from mempalace.palace import MineAlreadyRunning as _MAR
    fake = _SlowFakeChromaCollection(hold_seconds=0.3)
    col = _CC(fake, palace_path=palace_path)
    try:
        col.upsert(documents=[f"d{pid}"], ids=[f"i{pid}"])
        result_q.put(("ok", pid))
    except _MAR:
        result_q.put(("busy", pid))
 def test_concurrent_writers_serialize(tmp_path, monkeypatch):
    """Two processes calling ``ChromaCollection.upsert`` against the same
    palace must be serialized: at most one enters chromadb at a time, the
    other raises ``MineAlreadyRunning``.
    This is the property that prevents the parallel HNSW insert race that
    drives #974/#965 — under concurrent MCP write fan-out, exactly one
    writer reaches chromadb and the rest fail loudly instead of corrupting
    the index.
    The slow fake holds the lock for 0.3s per writer, large enough for the
    second process to contend even on slow CI runners.
    """
    monkeypatch.setenv("HOME", str(tmp_path))
    palace = str(tmp_path / "palace")
    ctx = _get_mp_context()
    result_q = ctx.Queue()
    p1 = ctx.Process(target=_slow_writer_target, args=(palace, str(tmp_path), 1, result_q))
    p2 = ctx.Process(target=_slow_writer_target, args=(palace, str(tmp_path), 2, result_q))
    p1.start()
    # Tiny stagger so p1 wins the race deterministically; without it the
    # OS scheduler can pick either, which is also a valid outcome but
    # makes the assertion brittle on slow CI.
    time.sleep(0.05)
    p2.start()
    p1.join(timeout=5)
    p2.join(timeout=5)
    outcomes = [result_q.get(timeout=1) for _ in range(2)]
    statuses = sorted(o[0] for o in outcomes)
    assert statuses == ["busy", "ok"], f"expected one ok + one busy, got {outcomes}"
 def test_read_path_does_not_acquire_lock(tmp_path, monkeypatch):
    """``query`` / ``get`` / ``count`` must not be gated by the write lock.
    Read traffic is the dominant workload (semantic search, MCP get, etc.)
    and serializing it against mine would tank latency for no correctness
    benefit. This test pins that property: with another process holding
    the write lock, reads must still complete instantly.
    """
    monkeypatch.setenv("HOME", str(tmp_path))
    palace = str(tmp_path / "palace")
    ready = str(tmp_path / "ready")
    release = str(tmp_path / "release")
    ctx = _get_mp_context()
    holder = ctx.Process(target=_hold_lock, args=(palace, ready, release))
    holder.start()
    try:
        for _ in range(500):
            if os.path.exists(ready):
                break
            time.sleep(0.01)
        assert os.path.exists(ready), "holder failed to acquire lock"
        # _FakeChromaCollection doesn't implement query/get/count, so no
        # read is actually issued here. Instead we inspect the source of
        # each ChromaCollection method: writes must reference _write_lock
        # and reads must not. Source inspection is more honest than
        # mocking the entire chroma surface.
        import inspect
        from mempalace.backends.chroma import ChromaCollection as _CC
        for write_attr in ("add", "upsert", "update", "delete"):
            src = inspect.getsource(getattr(_CC, write_attr))
            assert "_write_lock" in src, f"{write_attr} should acquire write lock"
        for read_attr in ("query", "get", "count"):
            method = getattr(_CC, read_attr, None)
            if method is None:
                continue
            src = inspect.getsource(method)
            assert (
                "_write_lock" not in src
            ), f"{read_attr} must NOT acquire the write lock (read path)"
    finally:
        open(release, "w").close()
        holder.join(timeout=5)
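# Illustrative sketch only: the properties tested in this file (writes fail
# fast when the lock is held elsewhere, nested acquires pass through, reads are
# never gated) can be modelled in-process. This assumes a ``threading.Lock`` as
# a stand-in for the real cross-process file lock; ``Busy``, ``palace_lock``
# and ``LockedCollection`` are hypothetical names, not the mempalace API.

```python
import threading
from contextlib import contextmanager


class Busy(RuntimeError):
    """Hypothetical analogue of ``MineAlreadyRunning``."""


_lock = threading.Lock()    # stand-in for the cross-process file lock
_held = threading.local()   # per-thread re-entrancy depth


@contextmanager
def palace_lock():
    if getattr(_held, "depth", 0) > 0:
        # This thread already owns the lock: pass through.
        _held.depth += 1
        try:
            yield
        finally:
            _held.depth -= 1
        return
    if not _lock.acquire(blocking=False):
        raise Busy("another writer holds the lock")
    _held.depth = 1
    try:
        yield
    finally:
        _held.depth = 0
        _lock.release()


class LockedCollection:
    """Wraps writes in ``palace_lock``; reads go straight through."""

    def __init__(self, inner):
        self._inner = inner

    def upsert(self, **kwargs):
        with palace_lock():
            return self._inner.upsert(**kwargs)

    def count(self):
        return self._inner.count()


class Recorder:
    def __init__(self):
        self.upserts = []

    def upsert(self, **kwargs):
        self.upserts.append(kwargs)

    def count(self):
        return len(self.upserts)


rec = Recorder()
col = LockedCollection(rec)

with palace_lock():
    col.upsert(ids=["a"])  # nested acquire on the same thread passes through

acquired, release = threading.Event(), threading.Event()


def holder():
    with palace_lock():
        acquired.set()
        release.wait(timeout=5)


t = threading.Thread(target=holder)
t.start()
acquired.wait(timeout=5)
try:
    col.upsert(ids=["b"])
    outcome = "ok"
except Busy:
    outcome = "busy"
read_result = col.count()  # reads are not gated by the lock
release.set()
t.join()
```

# The non-blocking acquire is what turns contention into a loud failure
# rather than a silent race into the wrapped collection.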
@@ -175,6 +175,61 @@ def test_cmd_init_normalizes_wing_name_for_topics_registry(mock_config_cls, tmp_
        assert mock_register.call_args.kwargs["wing"] == "my_cool_app"
 def test_cmd_init_honors_palace_flag(tmp_path, monkeypatch):
    """Regression for #1313: ``cmd_init`` must honor ``--palace`` instead of
    silently writing to ``~/.mempalace``. Mirrors the env-var pattern used
    by ``cmd_mine`` / ``cmd_status`` / ``mcp_server`` so every downstream
    read of ``cfg.palace_path`` (Pass 0, ``cfg.init()``, post-init mine)
    routes to the user-specified location.
    """
    project = tmp_path / "project"
    project.mkdir()
    palace = tmp_path / "custom_palace"
    # Make sure no leftover env var from another test leaks in — we want to
    # verify that --palace ALONE drives the resolution. Prime monkeypatch's
    # undo list with setenv first so that the env var ``cmd_init`` writes
    # below is rolled back at teardown (``delenv(raising=False)`` on a
    # missing key registers no undo entry, which would leak into the next
    # test).
    monkeypatch.setenv("MEMPALACE_PALACE_PATH", "")
    monkeypatch.setenv("MEMPAL_PALACE_PATH", "")
    monkeypatch.delenv("MEMPALACE_PALACE_PATH")
    monkeypatch.delenv("MEMPAL_PALACE_PATH")
    args = argparse.Namespace(
        dir=str(project),
        palace=str(palace),
        yes=True,
        auto_mine=False,
    )
    captured = {}
    def fake_pass_zero(project_dir, palace_dir, llm_provider):
        # Capture the palace_dir Pass 0 sees — this is the smoking-gun
        # value for the bug. Pre-fix it was always ~/.mempalace.
        captured["pass_zero_palace_dir"] = palace_dir
        return None
    with (
        patch("mempalace.entity_detector.scan_for_detection", return_value=[]),
        patch("mempalace.room_detector_local.detect_rooms_local"),
        patch("mempalace.cli._run_pass_zero", side_effect=fake_pass_zero),
        patch("mempalace.cli._maybe_run_mine_after_init"),
    ):
        cmd_init(args)
    expected = str(palace)
    # Pass 0 must have been handed the --palace location, not ~/.mempalace.
    assert captured["pass_zero_palace_dir"] == expected
    # And the env var must point at the custom palace so any downstream
    # ``cfg.palace_path`` read in this process resolves correctly too.
    import os
    assert os.environ.get("MEMPALACE_PALACE_PATH") == os.path.abspath(expected)
@patch("mempalace.cli.MempalaceConfig")
 def test_cmd_init_with_entities_zero_total(mock_config_cls, tmp_path, capsys):
    """When entities detected but total is 0, prints 'No entities' message."""
@@ -934,7 +989,7 @@ def test_cmd_compress_with_config(mock_config_cls, tmp_path, capsys):
@patch("mempalace.cli.MempalaceConfig")
 def test_cmd_compress_stores_results(mock_config_cls, capsys):
-    """Non-dry-run compress stores to mempalace_compressed collection."""
+    """Non-dry-run compress stores to mempalace_closets collection (#1244)."""
    mock_config_cls.return_value.palace_path = "/fake/palace"
    args = argparse.Namespace(palace=None, wing=None, dry_run=False, config=None)
    mock_col = MagicMock()
@@ -972,6 +1027,53 @@ def test_cmd_compress_stores_results(mock_config_cls, capsys):
    assert "Stored" in out
    assert "Total:" in out
    mock_comp_col.upsert.assert_called_once()
    # Verify the compress output goes to the closets collection so that
    # palace.get_closets_collection() / searcher can read it back (#1244).
    (call_args, _kwargs) = mock_backend.get_or_create_collection.call_args
    assert (
        call_args[1] == "mempalace_closets"
    ), f"compress should write to mempalace_closets, got {call_args[1]!r}"
    assert "mempalace_closets" in out
 def test_cmd_compress_output_readable_via_get_closets_collection(tmp_path, capsys):
    """End-to-end: cmd_compress output must be readable via the same code
    path palace.py uses (`get_closets_collection`). Regression for #1244."""
    from mempalace.backends.chroma import ChromaBackend
    from mempalace.palace import get_closets_collection, get_collection
    palace_path = str(tmp_path / "palace")
    # Seed a drawer in the palace so cmd_compress has something to compress.
    drawers = get_collection(palace_path, "mempalace_drawers", create=True)
    drawers.upsert(
        ids=["drawer-1"],
        documents=["The quick brown fox jumps over the lazy dog."],
        metadatas=[{"wing": "test", "room": "demo", "source_file": "fox.txt"}],
    )
    args = argparse.Namespace(palace=palace_path, wing=None, dry_run=False, config=None)
    with patch("mempalace.cli.MempalaceConfig") as mock_config_cls:
        mock_config_cls.return_value.palace_path = palace_path
        # Use a real ChromaBackend so the write actually lands on disk and
        # the read-side helper can find it.
        with patch("mempalace.backends.chroma.ChromaBackend", side_effect=ChromaBackend):
            cmd_compress(args)
    out = capsys.readouterr().out
    assert "Stored" in out
    # Now read via the *same* code path palace.py / searcher uses.
    closets = get_closets_collection(palace_path, create=False)
    got = closets.get(ids=["drawer-1"], include=["documents", "metadatas"])
    assert got["ids"] == ["drawer-1"], (
        "compressed drawer not found in mempalace_closets — "
        "cmd_compress wrote to the wrong collection (#1244)"
    )
    assert got["documents"] and got["documents"][0], "empty compressed doc"
    meta = got["metadatas"][0]
    assert meta.get("wing") == "test"
    assert "compression_ratio" in meta
 def test_cmd_repair_trailing_slash_does_not_recurse():
@@ -985,3 +1087,58 @@ def test_cmd_repair_trailing_slash_does_not_recurse():
    palace_path = os.path.expanduser(args.palace).rstrip(os.sep)
    backup_path = palace_path + ".backup"
    assert not backup_path.startswith(palace_path + os.sep)
 # ── stdio reconfigure on Windows ─────────────────────────────────────
 class _ReconfigurableStringIO:
    def __init__(self):
        self.reconfigure_calls = []
    def reconfigure(self, **kwargs):
        self.reconfigure_calls.append(kwargs)
 def test_reconfigures_stdio_to_utf8_on_windows():
    """Windows `mempalace` CLI must decode/encode stdio as UTF-8.
    Without this, piped non-ASCII input (`mempalace search ... < q.txt`)
    or piped non-ASCII output (`mempalace search "..." > out.txt`) is
    mojibaked through the system ANSI codepage on Windows locales whose
    codepage is not UTF-8 (e.g. cp1252/cp1251/cp950).
    """
    from mempalace.cli import _reconfigure_stdio_utf8_on_windows
    stdin = _ReconfigurableStringIO()
    stdout = _ReconfigurableStringIO()
    stderr = _ReconfigurableStringIO()
    with (
        patch.object(sys, "platform", "win32"),
        patch.object(sys, "stdin", stdin),
        patch.object(sys, "stdout", stdout),
        patch.object(sys, "stderr", stderr),
    ):
        _reconfigure_stdio_utf8_on_windows()
    # Per-stream errors policy: stdin survives bad bytes via
    # surrogateescape so a redirected non-UTF-8 file does not crash
    # the read; stdout/stderr use replace so a drawer carrying a
    # round-tripped surrogate half does not crash mid-print.
    assert stdin.reconfigure_calls == [{"encoding": "utf-8", "errors": "surrogateescape"}]
    assert stdout.reconfigure_calls == [{"encoding": "utf-8", "errors": "replace"}]
    assert stderr.reconfigure_calls == [{"encoding": "utf-8", "errors": "replace"}]
 def test_reconfigure_stdio_is_noop_off_windows():
    """Linux/macOS already default to UTF-8 stdio -- helper must not touch streams."""
    from mempalace.cli import _reconfigure_stdio_utf8_on_windows
    stdin = _ReconfigurableStringIO()
    with (
        patch.object(sys, "platform", "linux"),
        patch.object(sys, "stdin", stdin),
    ):
        _reconfigure_stdio_utf8_on_windows()
    assert stdin.reconfigure_calls == []
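# Illustrative sketch only: one way a helper with the behaviour asserted above
# could look. ``reconfigure_stdio_utf8`` and its ``platform`` / ``streams``
# parameters are hypothetical (added here so the sketch runs without patching
# ``sys``); the real ``_reconfigure_stdio_utf8_on_windows`` is not shown in
# this diff and may differ.

```python
import sys
from typing import Dict, List, Optional

# Per-stream error policy mirrored from the assertions in the tests above.
_ERRORS_BY_STREAM = {"stdin": "surrogateescape", "stdout": "replace", "stderr": "replace"}


def reconfigure_stdio_utf8(platform: Optional[str] = None, streams: Optional[dict] = None) -> None:
    """Force UTF-8 stdio on Windows; do nothing on other platforms."""
    if (platform or sys.platform) != "win32":
        return
    for name, errors in _ERRORS_BY_STREAM.items():
        stream = streams[name] if streams is not None else getattr(sys, name, None)
        reconfigure = getattr(stream, "reconfigure", None)
        if callable(reconfigure):  # replaced streams may lack reconfigure()
            reconfigure(encoding="utf-8", errors=errors)


class RecordingStream:
    """Records reconfigure() calls instead of touching a real stream."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, str]] = []

    def reconfigure(self, **kwargs: str) -> None:
        self.calls.append(kwargs)


win = {name: RecordingStream() for name in _ERRORS_BY_STREAM}
reconfigure_stdio_utf8(platform="win32", streams=win)

posix = {name: RecordingStream() for name in _ERRORS_BY_STREAM}
reconfigure_stdio_utf8(platform="linux", streams=posix)
```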
@@ -296,6 +296,182 @@ class TestRegenerateClosets:
            assert meta.get("generated_by", "").startswith("llm:")
            assert meta.get("normalize_version") == NORMALIZE_VERSION
    def test_regen_paginates_drawer_fetch(self, tmp_path):
        """Regression for #1073: drawers_col.get must be paginated at
        batch_size=5000. A single get(limit=total, ...) on a palace with
        more than SQLite's SQLITE_MAX_VARIABLE_NUMBER (32766) drawers
        blows up inside chromadb. Matches the miner.status pattern
        introduced in #851 (see #802, #850, #1073)."""
        from mempalace import closet_llm as closet_llm_mod
        palace = str(tmp_path / "palace")
        # Build a fake collection: 12_000 drawers across 3 source files,
        # enough to force 3 batches of batch_size=5000 (5000 + 5000 + 2000).
        n_drawers = 12_000
        ids = [f"d{i:05d}" for i in range(n_drawers)]
        docs = [f"doc body {i}" for i in range(n_drawers)]
        metas = [
            {
                "wing": "w",
                "room": "r",
                "source_file": f"/src/file_{i % 3}.md",
                "entities": "",
            }
            for i in range(n_drawers)
        ]
        get_calls: list = []
        class FakeDrawersCol:
            def count(self):
                return n_drawers
            def get(self, limit=None, offset=0, include=None, **kwargs):
                get_calls.append({"limit": limit, "offset": offset, "include": include})
                end = min(offset + (limit or n_drawers), n_drawers)
                return {
                    "ids": ids[offset:end],
                    "documents": docs[offset:end],
                    "metadatas": metas[offset:end],
                }
        class FakeClosetsCol:
            """Accept the purge + upsert calls the success path makes."""
            def get(self, *a, **kw):
                return {"ids": [], "documents": [], "metadatas": []}
            def delete(self, *a, **kw):
                return None
            def upsert(self, *a, **kw):
                return None
        fake_drawers = FakeDrawersCol()
        fake_closets = FakeClosetsCol()
        def fake_urlopen(req, timeout=None):
            return _FakeResp(
                {
                    "choices": [
                        {"message": {"content": '{"topics":["t1"],"quotes":[],"summary":""}'}}
                    ],
                    "usage": {"prompt_tokens": 1, "completion_tokens": 1},
                }
            )
        cfg = LLMConfig(endpoint="http://local/v1", model="m")
        with (
            patch.object(closet_llm_mod, "get_collection", return_value=fake_drawers),
            patch.object(closet_llm_mod, "get_closets_collection", return_value=fake_closets),
            patch.object(closet_llm_mod, "purge_file_closets", return_value=None),
            patch.object(closet_llm_mod, "upsert_closet_lines", return_value=None),
            patch("urllib.request.urlopen", side_effect=fake_urlopen),
        ):
            result = regenerate_closets(palace, cfg=cfg, dry_run=True)
        # Three paginated calls: (limit=5000, offset=0), (5000, 5000), (5000, 10000).
        assert len(get_calls) == 3, f"expected 3 batched fetches, got {len(get_calls)}"
        for call in get_calls:
            assert (
                call["limit"] == 5000
            ), f"batch must be 5000 — got {call['limit']} (would risk SQLITE_MAX_VARIABLE_NUMBER)"
            # include must still request both documents and metadatas
            assert "documents" in call["include"]
            assert "metadatas" in call["include"]
        assert [c["offset"] for c in get_calls] == [0, 5000, 10_000]
        # by_source aggregation must be preserved exactly across batches:
        # 12_000 drawers, 3 source files → 4_000 drawers each.
        # dry_run=True short-circuits LLM calls but still walks by_source.
        assert result.get("processed", 0) == 0  # dry_run
        # Verify no single call tried to pull more than batch_size.
        assert max(c["limit"] for c in get_calls) <= 5000
    def test_regen_by_source_aggregates_across_batches(self, tmp_path):
        """Pagination must not change the by_source grouping — drawers for
        the same source_file split across batches still land in one group."""
        from mempalace import closet_llm as closet_llm_mod
        palace = str(tmp_path / "palace")
        # 7_500 drawers, alternating between two source files → forces
        # splits across the 5000/2500 boundary. Each source ends up with
        # 3_750 drawers after regrouping.
        n_drawers = 7_500
        ids = [f"d{i:05d}" for i in range(n_drawers)]
        docs = [f"body-{i}" for i in range(n_drawers)]
        metas = [
            {
                "wing": "w",
                "room": "r",
                "source_file": f"/src/file_{i % 2}.md",
                "entities": "",
            }
            for i in range(n_drawers)
        ]
        captured_sources: dict = {}
        class FakeDrawersCol:
            def count(self):
                return n_drawers
            def get(self, limit=None, offset=0, include=None, **kwargs):
                end = min(offset + (limit or n_drawers), n_drawers)
                return {
                    "ids": ids[offset:end],
                    "documents": docs[offset:end],
                    "metadatas": metas[offset:end],
                }
        class FakeClosetsCol:
            def get(self, *a, **kw):
                return {"ids": [], "documents": [], "metadatas": []}
            def delete(self, *a, **kw):
                return None
            def upsert(self, *a, **kw):
                return None
        # Hook _call_llm to inspect what regenerate_closets aggregated
        # per source before the HTTP boundary.
        real_call_llm = closet_llm_mod._call_llm
        def spying_call_llm(cfg, source_file, wing, room, content):
            captured_sources[source_file] = content
            return (
                {"topics": ["t"], "quotes": [], "summary": ""},
                {"prompt_tokens": 1, "completion_tokens": 1},
            )
        cfg = LLMConfig(endpoint="http://local/v1", model="m")
        with (
            patch.object(closet_llm_mod, "get_collection", return_value=FakeDrawersCol()),
            patch.object(closet_llm_mod, "get_closets_collection", return_value=FakeClosetsCol()),
            patch.object(closet_llm_mod, "purge_file_closets", return_value=None),
            patch.object(closet_llm_mod, "upsert_closet_lines", return_value=None),
            patch.object(closet_llm_mod, "_call_llm", side_effect=spying_call_llm),
        ):
            regenerate_closets(palace, cfg=cfg)
        # Both sources survived the pagination boundary.
        assert set(captured_sources.keys()) == {"/src/file_0.md", "/src/file_1.md"}
        # Each source accumulated exactly 3_750 drawer bodies, concatenated
        # with the "\n\n" separator the regenerate path uses.
        for source, content in captured_sources.items():
            assert content.count("\n\n") == 3_749, (
                f"{source}: expected 3_750 chunks joined (3_749 separators), "
                f"got {content.count(chr(10) + chr(10)) + 1}"
            )
        # Silence unused-var lint.
        assert real_call_llm is not None
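# Illustrative sketch only: the pagination and regrouping behaviour the two
# tests above pin down, written against a hypothetical in-memory collection.
# ``InMemoryCollection``, ``iter_batches`` and ``group_by_source`` are assumed
# names for this example; ``regenerate_closets`` itself is not shown in this
# diff and may be structured differently.

```python
from typing import Any, Dict, Iterator, List

BATCH_SIZE = 5000  # batch size the tests above assert on


class InMemoryCollection:
    """Hypothetical fake exposing count() and get(limit, offset, include)."""

    def __init__(self, docs: List[str], metas: List[dict]) -> None:
        self._docs, self._metas = docs, metas
        self.offsets: List[int] = []

    def count(self) -> int:
        return len(self._docs)

    def get(self, limit: int, offset: int = 0, include=None) -> Dict[str, list]:
        self.offsets.append(offset)
        end = min(offset + limit, len(self._docs))
        return {
            "ids": [f"d{i}" for i in range(offset, end)],
            "documents": self._docs[offset:end],
            "metadatas": self._metas[offset:end],
        }


def iter_batches(col: Any, batch_size: int = BATCH_SIZE) -> Iterator[Dict[str, list]]:
    """Yield fixed-size pages instead of issuing one get(limit=total) call."""
    total = col.count()
    offset = 0
    while offset < total:
        batch = col.get(limit=batch_size, offset=offset, include=["documents", "metadatas"])
        if not batch["ids"]:
            break  # defensive: the collection shrank mid-scan
        yield batch
        offset += len(batch["ids"])


def group_by_source(col: Any) -> Dict[str, List[str]]:
    """Group document bodies by source_file across page boundaries."""
    by_source: Dict[str, List[str]] = {}
    for batch in iter_batches(col):
        for doc, meta in zip(batch["documents"], batch["metadatas"]):
            by_source.setdefault(meta["source_file"], []).append(doc)
    return by_source


n = 12_000
col = InMemoryCollection(
    docs=[f"doc {i}" for i in range(n)],
    metas=[{"source_file": f"/src/file_{i % 3}.md"} for i in range(n)],
)
groups = group_by_source(col)
```

# Because the grouping dict lives outside the page loop, drawers from one
# source file that straddle a page boundary still end up in a single group.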
    def test_regen_uses_basename_not_split_slash(self, tmp_path, monkeypatch):
        """Regression: the old closet_id base used ``source.split('/')[-1]``
        which silently degrades on Windows paths (``C:\\proj\\a.md`` →
@@ -3,7 +3,13 @@ import json
 import tempfile
 import pytest
-from mempalace.config import MempalaceConfig, normalize_wing_name, sanitize_kg_value, sanitize_name
+from mempalace.config import (
    MempalaceConfig,
    normalize_wing_name,
    sanitize_iso_date,
    sanitize_kg_value,
    sanitize_name,
 )
 def test_default_config():
@@ -212,3 +218,69 @@ def test_kg_value_rejects_null_bytes():
 def test_kg_value_rejects_over_length():
    with pytest.raises(ValueError):
        sanitize_kg_value("a" * 129)
 # --- sanitize_iso_date ---
 def test_iso_date_rejects_year_only():
    # Partial dates re-introduce silent empty result sets via lexicographic
    # TEXT comparison in KG queries (e.g. "2026-01-01" <= "2026" is False).
    with pytest.raises(ValueError):
        sanitize_iso_date("2026")
 def test_iso_date_rejects_year_month():
    with pytest.raises(ValueError):
        sanitize_iso_date("2026-03")
 def test_iso_date_accepts_full_date():
    assert sanitize_iso_date("2026-03-15") == "2026-03-15"
 def test_iso_date_passes_through_none():
    assert sanitize_iso_date(None) is None
 def test_iso_date_passes_through_empty_string():
    assert sanitize_iso_date("") == ""
 def test_iso_date_strips_whitespace():
    assert sanitize_iso_date("  2026-03-15  ") == "2026-03-15"
 def test_iso_date_rejects_natural_language():
    with pytest.raises(ValueError):
        sanitize_iso_date("March 2026")
 def test_iso_date_rejects_abbreviated_month():
    with pytest.raises(ValueError):
        sanitize_iso_date("Jan 2025")
 def test_iso_date_rejects_us_format():
    with pytest.raises(ValueError):
        sanitize_iso_date("03/15/2026")
 def test_iso_date_rejects_invalid_month():
    with pytest.raises(ValueError):
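        # Use a full YYYY-MM-DD value so the rejection is due to month 13,
        # not the partial-date rule already covered above.
        sanitize_iso_date("2026-13-01")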
 def test_iso_date_rejects_invalid_day():
    with pytest.raises(ValueError):
        sanitize_iso_date("2026-02-32")
 def test_iso_date_rejects_non_string():
    with pytest.raises(ValueError):
        sanitize_iso_date(20260315)
 def test_iso_date_error_names_field():
    with pytest.raises(ValueError, match="valid_from"):
        sanitize_iso_date("yesterday", "valid_from")
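# Illustrative sketch only: a validator that satisfies the cases tested above,
# assuming a strict ``YYYY-MM-DD`` shape check followed by a calendar check.
# ``strict_iso_date`` and ``is_rejected`` are hypothetical names; the real
# ``sanitize_iso_date`` is not shown in this diff and may be implemented
# differently.

```python
import re
from datetime import date
from typing import Optional

# ASCII digits only, zero-padded, no partial dates.
_ISO_DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def strict_iso_date(value: Optional[str], field_name: str = "date") -> Optional[str]:
    """Return a stripped YYYY-MM-DD string, or raise ValueError."""
    if value is None or value == "":
        return value  # "not provided" passes through unchanged
    if not isinstance(value, str):
        raise ValueError(f"{field_name} must be a string in YYYY-MM-DD form")
    text = value.strip()
    if not _ISO_DATE_RE.fullmatch(text):
        raise ValueError(f"{field_name} must be a full YYYY-MM-DD date, got {text!r}")
    try:
        date.fromisoformat(text)  # rejects month 13, Feb 32, and similar
    except ValueError as exc:
        raise ValueError(f"{field_name} is not a real calendar date: {text!r}") from exc
    return text


def is_rejected(value) -> bool:
    """Helper for the examples: True when the value raises ValueError."""
    try:
        strict_iso_date(value)
    except ValueError:
        return True
    return False


try:
    strict_iso_date("yesterday", "valid_from")
    error_text = ""
except ValueError as err:
    error_text = str(err)
```

# Partial dates are rejected outright because, as the first test notes, a
# lexicographic TEXT comparison against "2026" silently matches nothing.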
@@ -2,6 +2,8 @@
 from unittest.mock import patch
 import pytest
 from mempalace.entity_registry import (
    COMMON_ENGLISH_WORDS,
    PERSON_CONTEXT_PATTERNS,
@@ -71,6 +73,50 @@ def test_save_creates_file(tmp_path):
    assert (tmp_path / "entity_registry.json").exists()
 def test_save_is_atomic_does_not_leave_tmp(tmp_path):
    # Atomic write must not leave the .tmp sidecar file after a successful save.
    registry = EntityRegistry.load(config_dir=tmp_path)
    registry.save()
    leftover = list(tmp_path.glob("entity_registry.json.tmp*"))
    assert leftover == [], f"atomic write leaked tmp file(s): {leftover}"
 def test_save_preserves_previous_on_serialization_failure(tmp_path, monkeypatch):
    # If serialization fails mid-write, the previous registry must remain
    # intact — this is the whole point of atomic write vs truncating in place.
    registry = EntityRegistry.load(config_dir=tmp_path)
    registry.seed(
        mode="personal",
        people=[{"name": "Alice", "relationship": "friend", "context": "personal"}],
        projects=[],
    )
    registry.save()
    target = tmp_path / "entity_registry.json"
    original = target.read_text(encoding="utf-8")
    # Force os.replace to raise — simulates filesystem full / permission flip
    # AFTER the temp file is written but BEFORE the rename completes.
    import os as _os
    real_replace = _os.replace
    def boom(src, dst):
        raise OSError("simulated rename failure")
    monkeypatch.setattr(_os, "replace", boom)
    with pytest.raises(OSError):
        registry.seed(
            mode="personal",
            people=[{"name": "Bob", "relationship": "friend", "context": "personal"}],
            projects=[],
        )
        registry.save()
    # Restore os.replace before reading so the assertion can rely on it.
    monkeypatch.setattr(_os, "replace", real_replace)
    assert target.read_text(encoding="utf-8") == original
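# Illustrative sketch only, not the EntityRegistry implementation. The two
# tests above assume save() writes a temp sidecar and then os.replace()s it
# onto the target. `atomic_write_json_sketch` is a hypothetical helper name.
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json_sketch(target: Path, payload: dict) -> None:
    # Create the sidecar next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(prefix=target.name + ".tmp", dir=str(target.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        # On any failure, remove the sidecar; the previous target is untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise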
 # ── seed ────────────────────────────────────────────────────────────────
@@ -286,3 +286,66 @@ class TestCLI:
        assert "similar_name" in out
        # Silence unused import warning.
        _ = (MagicMock, patch, fact_checker)
    def test_reconfigures_stdio_to_utf8_on_windows(self):
        """Windows fact_checker --stdin must decode payload as UTF-8.
        Without this, Python defaults stdio to the system ANSI codepage
        (cp1252/cp1251/cp950), which mojibakes non-ASCII text before
        pattern parsing sees it.
        """
        import io
        import sys
        from mempalace.fact_checker import _reconfigure_stdio_utf8_on_windows
        class _ReconfigurableStringIO(io.StringIO):
            def __init__(self, initial_value=""):
                super().__init__(initial_value)
                self.reconfigure_calls = []
            def reconfigure(self, **kwargs):
                self.reconfigure_calls.append(kwargs)
        stdin = _ReconfigurableStringIO()
        stdout = _ReconfigurableStringIO()
        stderr = _ReconfigurableStringIO()
        with (
            patch.object(sys, "platform", "win32"),
            patch.object(sys, "stdin", stdin),
            patch.object(sys, "stdout", stdout),
            patch.object(sys, "stderr", stderr),
        ):
            _reconfigure_stdio_utf8_on_windows()
        # Per-stream errors policy: stdin uses surrogateescape so a stray
        # malformed byte from a redirected file does not crash the read,
        # stdout/stderr use replace so an extracted fact carrying a
        # surrogate half does not crash mid-print.
        assert stdin.reconfigure_calls == [{"encoding": "utf-8", "errors": "surrogateescape"}]
        assert stdout.reconfigure_calls == [{"encoding": "utf-8", "errors": "replace"}]
        assert stderr.reconfigure_calls == [{"encoding": "utf-8", "errors": "replace"}]
    def test_reconfigure_stdio_is_noop_off_windows(self):
        """Linux/macOS already default to UTF-8 stdio -- helper must not touch streams."""
        import io
        import sys
        from mempalace.fact_checker import _reconfigure_stdio_utf8_on_windows
        class _ReconfigurableStringIO(io.StringIO):
            def __init__(self):
                super().__init__()
                self.reconfigure_calls = []
            def reconfigure(self, **kwargs):
                self.reconfigure_calls.append(kwargs)
        stdin = _ReconfigurableStringIO()
        with (
            patch.object(sys, "platform", "linux"),
            patch.object(sys, "stdin", stdin),
        ):
            _reconfigure_stdio_utf8_on_windows()
        assert stdin.reconfigure_calls == []
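# Illustrative sketch of the behavior the two tests above pin down. The real
# helper lives in mempalace.fact_checker; `reconfigure_stdio_sketch` is a
# hypothetical stand-in written only from what the assertions require.
import sys


def reconfigure_stdio_sketch() -> None:
    if sys.platform != "win32":
        return  # off Windows the streams are left alone
    policies = (
        (sys.stdin, "surrogateescape"),  # tolerate stray malformed input bytes
        (sys.stdout, "replace"),  # do not crash mid-print on a lone surrogate
        (sys.stderr, "replace"),
    )
    for stream, errors in policies:
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:  # a replaced stream may lack reconfigure()
            reconfigure(encoding="utf-8", errors=errors)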
@@ -8,6 +8,7 @@ from unittest.mock import MagicMock, patch
 import pytest
 import mempalace.hooks_cli as hooks_cli_mod
 from mempalace.hooks_cli import (
    SAVE_INTERVAL,
    _count_human_messages,
@@ -959,3 +960,108 @@ def test_stop_hook_rejects_injected_stop_hook_active(tmp_path):
    # The injected value is not "true"/"1"/"yes", so the hook should NOT pass through.
    # Save must have been attempted.
    assert mock_save.called
 # --- Absent palace root: hooks must not recreate ~/.mempalace ---
 #
 # When the user removes ~/.mempalace (e.g. `rm -rf`), that is the strongest
 # possible "do not auto-capture" signal. Hooks must short-circuit BEFORE
 # touching disk — including before the log-line that previously triggered
 # STATE_DIR.mkdir() on its own.
 def _redirect_palace_root(monkeypatch, tmp_path):
    """Point PALACE_ROOT and STATE_DIR at a tmp location that does NOT exist."""
    fake_root = tmp_path / "absent-mempalace"
    monkeypatch.setattr(hooks_cli_mod, "PALACE_ROOT", fake_root)
    monkeypatch.setattr(hooks_cli_mod, "STATE_DIR", fake_root / "hook_state")
    monkeypatch.setattr(hooks_cli_mod, "_state_dir_initialized", False)
    return fake_root
 def test_hook_stop_does_not_create_palace_dir_when_absent(tmp_path, monkeypatch):
    fake_root = _redirect_palace_root(monkeypatch, tmp_path)
    transcript = tmp_path / "t.jsonl"
    transcript.write_text("")
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        hook_stop(
            {"session_id": "absent", "transcript_path": str(transcript), "stop_hook_active": False},
            "claude-code",
        )
    assert json.loads(buf.getvalue() or "{}") == {}
    assert not fake_root.exists()
 def test_hook_precompact_does_not_create_palace_dir_when_absent(tmp_path, monkeypatch):
    fake_root = _redirect_palace_root(monkeypatch, tmp_path)
    transcript = tmp_path / "t.jsonl"
    transcript.write_text("")
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        hook_precompact(
            {"session_id": "absent", "transcript_path": str(transcript)},
            "claude-code",
        )
    assert json.loads(buf.getvalue() or "{}") == {}
    assert not fake_root.exists()
 def test_hook_session_start_does_not_create_palace_dir_when_absent(tmp_path, monkeypatch):
    fake_root = _redirect_palace_root(monkeypatch, tmp_path)
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        hook_session_start({"session_id": "absent"}, "claude-code")
    assert json.loads(buf.getvalue() or "{}") == {}
    assert not fake_root.exists()
 def test_log_does_not_create_palace_dir_when_absent(tmp_path, monkeypatch):
    fake_root = _redirect_palace_root(monkeypatch, tmp_path)
    _log("test message")
    assert not fake_root.exists()
 def test_existing_dir_proceeds_normally(tmp_path, monkeypatch):
    """Regression: when PALACE_ROOT exists, hooks must proceed (no short-circuit)."""
    fake_root = tmp_path / "present-mempalace"
    fake_root.mkdir()
    monkeypatch.setattr(hooks_cli_mod, "PALACE_ROOT", fake_root)
    monkeypatch.setattr(hooks_cli_mod, "STATE_DIR", fake_root / "hook_state")
    monkeypatch.setattr(hooks_cli_mod, "_state_dir_initialized", False)
    _log("test message")
    # _log should have created the state dir under the existing palace root
    assert (fake_root / "hook_state").exists()
    assert (fake_root / "hook_state" / "hook.log").is_file()
 def test_regular_file_at_palace_root_treated_as_absent(tmp_path, monkeypatch):
    """A regular file at ~/.mempalace must be treated the same as absent.
    ``Path.exists()`` returns True for a regular file, which would bypass
    the kill-switch and crash later when ``STATE_DIR.mkdir()`` raises
    ``NotADirectoryError``. ``_palace_root_exists()`` must use
    ``is_dir()`` so a stray file (or broken symlink) short-circuits cleanly.
    """
    fake_root = tmp_path / "file-not-dir"
    fake_root.write_text("oops, this is a file not a directory")
    monkeypatch.setattr(hooks_cli_mod, "PALACE_ROOT", fake_root)
    monkeypatch.setattr(hooks_cli_mod, "STATE_DIR", fake_root / "hook_state")
    monkeypatch.setattr(hooks_cli_mod, "_state_dir_initialized", False)
    # _palace_root_exists() is the source of truth — it must return False.
    assert hooks_cli_mod._palace_root_exists() is False
    # Hooks must short-circuit (return {} on stdout) and not touch disk.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        hook_session_start({"session_id": "file-at-root"}, "claude-code")
    assert json.loads(buf.getvalue() or "{}") == {}
    # _log must also short-circuit — it must NOT try to mkdir a path under a
    # regular file (which would raise NotADirectoryError).
    _log("test message")  # would raise if not short-circuited
    # The stray file is left untouched; we never try to convert it.
    assert fake_root.is_file()
    assert fake_root.read_text() == "oops, this is a file not a directory"
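# Illustrative sketch of why the guard is built on is_dir() rather than
# exists(). `palace_root_usable_sketch` is a hypothetical name, not the
# hooks_cli helper.
from pathlib import Path


def palace_root_usable_sketch(root: Path) -> bool:
    # is_dir() is False for a missing path, a regular file, and a broken
    # symlink, so all three cases short-circuit the same way.
    return root.is_dir()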
@@ -0,0 +1,234 @@
 """Tests for ``candidate_strategy="union"`` in ``search_memories``.
 The default ``"vector"`` strategy gathers candidates from the vector index
 only. Docs with strong BM25 signal but vector embeddings far from the query
 get skipped — terminology guides looked up by narrative-shaped queries are
 the canonical case.
 The ``"union"`` strategy also pulls top-K BM25-only candidates from sqlite
 FTS5 and merges them into the rerank pool. Both signal sources contribute
 candidates; the hybrid rerank picks the best from a richer pool.
 Default behavior is unchanged ("vector") — these tests exercise opt-in
 "union" mode.
 """
 from mempalace.palace import get_collection
 from mempalace.searcher import search_memories
 def _seed_drawers(palace_path):
    """Seed a corpus where the right doc for one query is BM25-strong but
    vector-distant.
    D1-D3 are short narrative tickets that semantically cluster around
    "customer support / order / shipped" vocabulary. D4 is a meta-document
    of bullet rules ("brand voice") that contains rare keywords like
    "Absolutely" and "apologize" the query repeats verbatim — strong BM25
    signal but stylistically far from the narrative tickets.
    """
    col = get_collection(palace_path, create=True)
    col.upsert(
        ids=["D1", "D2", "D3", "D4"],
        documents=[
            "Customer wrote in asking why their order shipped without "
            "the promo sticker. Standard reply explaining the threshold.",
            "Order delivery delayed three days; customer requested a "
            "refund. Support agent processed return via ticket queue.",
            "Customer asked about the missing freebie; the reply "
            "explained the campaign mechanics and shipped status.",
            "Brand voice rules: dry, sturdy, never effusive. "
            "Never 'Absolutely!' Never apologize for policy — explain it. "
            "Avoid premium / curated / elevated vocabulary.",
        ],
        metadatas=[
            {"wing": "shop", "room": "support", "source_file": "ticket_D1.md"},
            {"wing": "shop", "room": "support", "source_file": "ticket_D2.md"},
            {"wing": "shop", "room": "support", "source_file": "ticket_D3.md"},
            {"wing": "shop", "room": "guides", "source_file": "brand_voice_D4.md"},
        ],
    )
 _NARRATIVE_QUERY = (
    "A support agent is drafting a reply to a customer asking why their "
    "order shipped without a free sticker. Draft the reply, but never say "
    "'Absolutely!' and do not apologize for policy."
 )
 class TestCandidateUnion:
    def test_default_vector_strategy_unchanged(self, tmp_path):
        """Default behavior must be identical to omitting the parameter."""
        palace = str(tmp_path / "palace")
        _seed_drawers(palace)
        without = search_memories(_NARRATIVE_QUERY, palace, n_results=5)
        with_default = search_memories(
            _NARRATIVE_QUERY, palace, n_results=5, candidate_strategy="vector"
        )
        ids_a = [h["source_file"] for h in without["results"]]
        ids_b = [h["source_file"] for h in with_default["results"]]
        assert ids_a == ids_b, "explicit candidate_strategy='vector' must match default"
    def test_union_surfaces_bm25_strong_vector_distant_doc(self, tmp_path):
        """The brand-voice doc has strong BM25 signal for the query but is
        stylistically far from the narrative tickets. Union mode must
        retrieve it; vector-only mode is allowed to miss it."""
        palace = str(tmp_path / "palace")
        _seed_drawers(palace)
        result = search_memories(_NARRATIVE_QUERY, palace, n_results=5, candidate_strategy="union")
        ids = [h["source_file"] for h in result["results"]]
        assert "brand_voice_D4.md" in ids, (
            "union mode must surface BM25-strong docs even when vector signal "
            f"is weak; got {ids}"
        )
    def test_union_preserves_vector_hits(self, tmp_path):
        """Union mode must not drop docs that vector-only mode finds —
        the rerank pool grows, it doesn't shrink."""
        palace = str(tmp_path / "palace")
        _seed_drawers(palace)
        vector = search_memories(_NARRATIVE_QUERY, palace, n_results=5, candidate_strategy="vector")
        union = search_memories(_NARRATIVE_QUERY, palace, n_results=5, candidate_strategy="union")
        vec_ids = {h["source_file"] for h in vector["results"]}
        union_ids = {h["source_file"] for h in union["results"]}
        # In a 4-doc corpus with n_results=5, both should return all 4.
        # The invariant is: union should not lose anything vector found.
        missing = vec_ids - union_ids
        assert not missing, f"union dropped docs that vector found: {missing}"
    def test_union_handles_empty_palace(self, tmp_path):
        """No drawers — union mode should return empty results, not crash."""
        palace = str(tmp_path / "palace")
        get_collection(palace, create=True)  # create empty collection
        result = search_memories("anything", palace, n_results=5, candidate_strategy="union")
        assert result.get("results", []) == []
    def test_invalid_candidate_strategy_raises(self, tmp_path):
        """Bad arg should raise rather than silently fall back."""
        palace = str(tmp_path / "palace")
        _seed_drawers(palace)
        import pytest
        with pytest.raises(ValueError, match="candidate_strategy"):
            search_memories("anything", palace, n_results=5, candidate_strategy="bogus")
    def test_invalid_strategy_raises_even_when_vector_disabled(self, tmp_path):
        """Validation must happen before the ``vector_disabled`` early return —
        invalid values must fail consistently regardless of routing."""
        palace = str(tmp_path / "palace")
        _seed_drawers(palace)
        import pytest
        with pytest.raises(ValueError, match="candidate_strategy"):
            search_memories(
                "anything",
                palace,
                n_results=5,
                vector_disabled=True,
                candidate_strategy="bogus",
            )
    def test_union_respects_n_results_limit(self, tmp_path):
        """When the merged candidate set is larger than ``n_results``, the
        result must be trimmed back to the requested size — the MCP
        ``limit`` contract depends on this invariant."""
        palace = str(tmp_path / "palace")
        _seed_drawers(palace)
        # 4-doc corpus, n_results=2 → union pool can grow to ~8 candidates,
        # rerank reorders them, but final list must respect the cap.
        result = search_memories(_NARRATIVE_QUERY, palace, n_results=2, candidate_strategy="union")
        assert (
            len(result["results"]) <= 2
        ), f"union must trim to n_results=2; got {len(result['results'])} results"
    def test_union_skipped_when_max_distance_set(self, tmp_path):
        """``max_distance`` is a vector-distance threshold; BM25-only
        candidates have ``distance=None`` and cannot satisfy it. Union
        must not silently inject them when a strict threshold is set,
        otherwise the existing ``max_distance`` guarantee regresses."""
        palace = str(tmp_path / "palace")
        _seed_drawers(palace)
        # Sanity: without max_distance, union surfaces the BM25-strong doc.
        unfiltered = search_memories(
            _NARRATIVE_QUERY, palace, n_results=5, candidate_strategy="union"
        )
        assert "brand_voice_D4.md" in {h["source_file"] for h in unfiltered["results"]}
        # With a tight max_distance, union must NOT inject BM25-only hits —
        # every returned hit must have a real (non-None) distance.
        filtered = search_memories(
            _NARRATIVE_QUERY,
            palace,
            n_results=5,
            candidate_strategy="union",
            max_distance=0.5,
        )
        for h in filtered["results"]:
            assert h.get("distance") is not None, (
                f"union under max_distance must not inject BM25-only "
                f"(distance=None) candidates; offending hit: {h}"
            )
            assert h["distance"] <= 0.5, f"hit violates max_distance=0.5: distance={h['distance']}"
    def test_union_dedup_is_chunk_precise_not_basename(self, tmp_path):
        """Two files with the same basename in different directories must
        not collide — union must dedup on full path (or chunk-level key),
        not on basename alone. Otherwise a BM25-strong README from one
        directory silently shadows a BM25-strong README from another.
        """
        palace = str(tmp_path / "palace")
        col = get_collection(palace, create=True)
        col.upsert(
            ids=["A_README", "B_README", "narrative"],
            documents=[
                # Both README files share the basename README.md but live
                # in different directories. Each contains distinctive
                # terminology a query might surface via BM25.
                "PROJECT ALPHA: configuration for the Frobnitz subsystem. "
                "Set FROBNITZ_TIMEOUT=30 to enable widget rotation.",
                "PROJECT BETA: configuration for the Wibble subsystem. "
                "Set WIBBLE_THRESHOLD=0.5 to enable signal smoothing.",
                "Engineers occasionally chat about how the legacy "
                "subsystems all need their config knobs tweaked.",
            ],
            metadatas=[
                {"wing": "code", "room": "docs", "source_file": "alpha/README.md"},
                {"wing": "code", "room": "docs", "source_file": "beta/README.md"},
                {"wing": "code", "room": "docs", "source_file": "chat.md"},
            ],
        )
        # Query that hits BM25 for BOTH READMEs (distinct vocab from each).
        # Vector-only might pick the chat doc as semantically "closest";
        # union must surface both READMEs without basename collision.
        result = search_memories(
            "FROBNITZ_TIMEOUT WIBBLE_THRESHOLD configuration",
            palace,
            n_results=5,
            candidate_strategy="union",
        )
        sources = [h["source_file"] for h in result["results"]]
        readme_count = sum(1 for s in sources if s == "README.md")
        assert readme_count >= 2, (
            f"union must surface both README.md files from different dirs "
            f"(basename collision would drop one); got sources={sources}"
        )
 class TestHybridRankTolerantOfMissingDistance:
    """``_hybrid_rank`` accepts ``distance=None`` — required for BM25-only
    candidates injected by union mode."""
    def test_distance_none_scored_as_zero_vector_sim(self):
        from mempalace.searcher import _hybrid_rank
        results = [
            {"text": "alpha beta gamma", "distance": 0.2},  # close vector match
            {"text": "alpha alpha alpha", "distance": None},  # BM25-only — heavy term repetition
        ]
        # Query matches "alpha" heavily; the BM25-only candidate with no
        # vector signal should still rank competitively on BM25 alone.
        ranked = _hybrid_rank(results, "alpha")
        assert all("bm25_score" in r for r in ranked), "rerank should add bm25_score"
        # Both must survive — neither should crash on distance=None.
        assert len(ranked) == 2
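# Illustrative sketch of the candidate merge the TestCandidateUnion cases
# describe. This is not the searcher's code: `merge_candidates_sketch` and
# the "id" key are assumptions made for the example. Reranking and trimming
# to n_results would happen after this step.
from typing import Optional


def merge_candidates_sketch(
    vector_hits: list,
    bm25_hits: list,
    max_distance: Optional[float] = None,
) -> list:
    # Vector hits go in first and are never dropped by the merge.
    merged = {hit["id"]: hit for hit in vector_hits}
    if max_distance is not None:
        # BM25-only hits carry no distance and cannot satisfy the threshold.
        return list(merged.values())
    for hit in bm25_hits:
        # Dedup on the chunk-level id, never on the file basename.
        merged.setdefault(hit["id"], {**hit, "distance": None})
    return list(merged.values())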
@@ -5,6 +5,8 @@ Covers: entity CRUD, triple CRUD, temporal queries, invalidation,
 timeline, stats, and edge cases (duplicate triples, ID collisions).
 """
 import pytest
 class TestEntityOperations:
    def test_add_entity(self, kg):
@@ -45,6 +47,38 @@ class TestTripleOperations:
        tid2 = kg.add_triple("Alice", "works_at", "Acme")
        assert tid1 != tid2  # new triple since old one was closed
    def test_add_triple_rejects_inverted_interval(self, kg):
        # valid_to before valid_from would never satisfy
        # `valid_from <= as_of AND valid_to >= as_of` — silently invisible
        # to every query. Reject at write time instead.
        with pytest.raises(ValueError, match="before valid_from"):
            kg.add_triple(
                "Alice",
                "worked_at",
                "Acme",
                valid_from="2026-03-01",
                valid_to="2026-02-01",
            )
    def test_add_triple_accepts_equal_dates(self, kg):
        # Same-day intervals are valid (point-in-time facts).
        tid = kg.add_triple(
            "Alice",
            "joined",
            "Acme",
            valid_from="2026-03-15",
            valid_to="2026-03-15",
        )
        assert tid.startswith("t_alice_joined_acme_")
    def test_add_triple_allows_only_one_bound(self, kg):
        # The guard only fires when BOTH bounds are set.
        tid1 = kg.add_triple("Alice", "knows", "Bob", valid_from="2026-01-01")
        assert tid1.startswith("t_alice_knows_bob_")
        kg.invalidate("Alice", "knows", "Bob", ended="2026-02-01")
        tid2 = kg.add_triple("Alice", "knew", "Bob", valid_to="2026-03-01")
        assert tid2.startswith("t_alice_knew_bob_")
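# Illustrative sketch of the write-time guard the three tests above exercise.
# `check_interval_sketch` is a hypothetical name; the error text is chosen
# only so that it matches the "before valid_from" pattern asserted above.
from datetime import date
from typing import Optional


def check_interval_sketch(valid_from: Optional[str], valid_to: Optional[str]) -> None:
    if valid_from is None or valid_to is None:
        return  # open-ended on either side: nothing to compare
    # Equal dates pass; only a strictly earlier valid_to is rejected.
    if date.fromisoformat(valid_to) < date.fromisoformat(valid_from):
        raise ValueError(f"valid_to {valid_to!r} is before valid_from {valid_from!r}")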
 class TestQueries:
    def test_query_outgoing(self, seeded_kg):
@@ -655,3 +655,72 @@ def test_memory_stack_status_with_palace(tmp_path):
    assert result["total_drawers"] == 42
    assert result["L0_identity"]["exists"] is True
 # ── Layer1 / Layer2 None-metadata guards ───────────────────────────────
 #
 # Chroma 1.5.x can return ``None`` inside the ``metadatas`` / ``documents``
 # lists for partially-flushed rows. The Layer1.generate() and
 # Layer2.retrieve() loops previously called ``meta.get(...)`` without
 # coercing, raising ``AttributeError: 'NoneType' object has no attribute
 # 'get'`` and blowing up the whole wake-up render. These tests guard that
 # the loops tolerate the None entries and render the rest of the result.
 def test_layer1_handles_none_metadata():
    """Layer1.generate tolerates None entries in the metadatas list."""
    docs = ["important memory", "another memory"]
    metas = [{"room": "decisions", "source_file": "a.txt"}, None]
    mock_col = _mock_chromadb_for_layer(docs, metas)
    with (
        patch("mempalace.layers.MempalaceConfig") as mock_cfg,
        patch("mempalace.layers._get_collection", return_value=mock_col),
    ):
        mock_cfg.return_value.palace_path = "/fake"
        layer = Layer1(palace_path="/fake")
        # Should not raise AttributeError on the None entry.
        result = layer.generate()
    assert "ESSENTIAL STORY" in result
    assert "important memory" in result
 def test_layer1_handles_none_document():
    """Layer1.generate tolerates None entries in the documents list."""
    docs = ["first doc", None]
    metas = [
        {"room": "r", "source_file": "a.txt"},
        {"room": "r", "source_file": "b.txt"},
    ]
    mock_col = _mock_chromadb_for_layer(docs, metas)
    with (
        patch("mempalace.layers.MempalaceConfig") as mock_cfg,
        patch("mempalace.layers._get_collection", return_value=mock_col),
    ):
        mock_cfg.return_value.palace_path = "/fake"
        layer = Layer1(palace_path="/fake")
        result = layer.generate()
    assert result  # Render succeeded despite the None document.
 def test_layer2_handles_none_metadata():
    """Layer2.retrieve tolerates None entries in the metadatas list."""
    mock_col = MagicMock()
    mock_col.get.return_value = {
        "documents": ["first doc", "second doc"],
        "metadatas": [{"room": "r", "source_file": "a.txt"}, None],
    }
    with (
        patch("mempalace.layers.MempalaceConfig") as mock_cfg,
        patch("mempalace.layers._get_collection", return_value=mock_col),
    ):
        mock_cfg.return_value.palace_path = "/fake"
        layer = Layer2(palace_path="/fake")
        # Should not raise AttributeError on the None entry.
        result = layer.retrieve()
    assert "L2 — ON-DEMAND" in result
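# Illustrative sketch of the None-coercion the Layer1/Layer2 tests above rely
# on. `render_rows_sketch` is hypothetical and the "?" fallbacks are
# assumptions; the real loops live in mempalace.layers.
def render_rows_sketch(documents: list, metadatas: list) -> list:
    lines = []
    for doc, meta in zip(documents, metadatas):
        meta = meta or {}  # a None metadata row becomes an empty dict
        doc = doc or ""  # a None document becomes an empty string
        lines.append(f"[{meta.get('room', '?')}] {doc} ({meta.get('source_file', '?')})")
    return lines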
@@ -8,7 +8,9 @@ via monkeypatch to avoid touching real data.
 from datetime import datetime
 import json
 import os
 import sys
 from unittest.mock import MagicMock
 import pytest
@@ -18,7 +20,7 @@ def _patch_mcp_server(monkeypatch, config, kg):
    from mempalace import mcp_server
    monkeypatch.setattr(mcp_server, "_config", config)
-    monkeypatch.setattr(mcp_server, "_kg", kg)
+    monkeypatch.setattr(mcp_server, "_get_kg", lambda: kg)
 def _get_collection(palace_path, create=False):
@@ -146,6 +148,20 @@ class TestHandleRequest:
        )
        assert resp["error"]["code"] == -32601
    def test_tools_call_missing_params(self):
        from mempalace.mcp_server import handle_request
        for bad_params in [None, {}, {"arguments": {}}]:
            resp = handle_request(
                {
                    "method": "tools/call",
                    "id": 15,
                    "params": bad_params,
                }
            )
            assert resp["error"]["code"] == -32602
            assert "Invalid params" in resp["error"]["message"]
    def test_unknown_method(self):
        from mempalace.mcp_server import handle_request
@@ -188,6 +204,17 @@ class TestHandleRequest:
        resp = handle_request({"method": None, "id": 99, "params": {}})
        assert resp["error"]["code"] == -32601
    @pytest.mark.parametrize("payload", [None, [], "plain", 42, True])
    def test_handle_request_invalid_payload_returns_jsonrpc_error(self, payload):
        from mempalace.mcp_server import handle_request
        resp = handle_request(payload)
        assert resp == {
            "jsonrpc": "2.0",
            "id": None,
            "error": {"code": -32600, "message": "Invalid Request"},
        }
    def test_tools_call_dispatches(self, monkeypatch, config, palace_path, seeded_kg):
        _patch_mcp_server(monkeypatch, config, seeded_kg)
        from mempalace.mcp_server import handle_request
@@ -495,6 +522,41 @@ class TestWriteTools:
        result = tool_delete_drawer("nonexistent_drawer")
        assert result["success"] is False
    def test_check_duplicate_handles_none_metadata(self, monkeypatch, config, kg):
        """tool_check_duplicate must tolerate None entries in the result lists
        that ChromaDB 1.5.x returns for partially-flushed rows.
        Previously ``meta = results["metadatas"][0][i]`` was unguarded and
        raised ``AttributeError: 'NoneType' object has no attribute 'get'``
        the moment the first matching drawer came back with None metadata —
        surfacing to the MCP client as the uninformative
        ``"Duplicate check failed"`` because the broad ``except Exception``
        wrapper swallows the real cause.
        """
        _patch_mcp_server(monkeypatch, config, kg)
        from mempalace import mcp_server
        mock_col = MagicMock()
        mock_col.query.return_value = {
            "ids": [["d1", "d2"]],
            "distances": [[0.05, 0.05]],
            "metadatas": [[{"wing": "w", "room": "r"}, None]],
            "documents": [["first doc", None]],
        }
        monkeypatch.setattr(mcp_server, "_get_collection", lambda: mock_col)
        result = mcp_server.tool_check_duplicate("any content", threshold=0.5)
        # Both entries land in matches (above threshold); the entry with None
        # metadata is rendered with sentinel values instead of crashing the
        # whole response.
        assert result.get("is_duplicate") is True
        assert len(result["matches"]) == 2
        # The None-metadata entry falls back to sentinels.
        none_entry = result["matches"][1]
        assert none_entry["wing"] == "?"
        assert none_entry["room"] == "?"
        assert none_entry["content"] == ""
    def test_check_duplicate(self, monkeypatch, config, palace_path, seeded_collection, kg):
        _patch_mcp_server(monkeypatch, config, kg)
        from mempalace.mcp_server import tool_check_duplicate
@@ -531,6 +593,45 @@ class TestWriteTools:
        result = tool_get_drawer("nonexistent_drawer")
        assert "error" in result
    def test_get_drawer_does_not_leak_absolute_source_file_path(
        self, monkeypatch, config, palace_path, collection, kg
    ):
        """tool_get_drawer must not expose the absolute filesystem path
        that the miners write into ``source_file``. Same threat class as
        the palace_path leak in mempalace_status: in nested-agent or
        multi-server MCP topologies the client is a separate trust
        domain, and the directory layout of the host has no documented
        client-side use. Basename is enough for citation."""
        _patch_mcp_server(monkeypatch, config, kg)
        secret_dir = "/private/home/alice/secret-research/2026"
        absolute_source = f"{secret_dir}/notes.md"
        collection.add(
            ids=["drawer_leak_probe"],
            documents=["verbatim drawer body for leak probe"],
            metadatas=[
                {
                    "wing": "research",
                    "room": "notes",
                    "source_file": absolute_source,
                    "chunk_index": 0,
                    "added_by": "miner",
                    "filed_at": "2026-05-03T00:00:00",
                }
            ],
        )
        from mempalace.mcp_server import tool_get_drawer
        result = tool_get_drawer("drawer_leak_probe")
        assert result["drawer_id"] == "drawer_leak_probe"
        assert result["metadata"]["source_file"] == "notes.md"
        # Defense-in-depth: no field anywhere in the response should
        # contain the absolute path or its parent directory.
        serialized = json.dumps(result)
        assert absolute_source not in serialized
        assert secret_dir not in serialized
    def test_list_drawers(self, monkeypatch, config, palace_path, seeded_collection, kg):
        _patch_mcp_server(monkeypatch, config, kg)
        from mempalace.mcp_server import tool_list_drawers
@@ -650,6 +751,90 @@ class TestKGTools:
            ended="2026-03-01",
        )
        assert result["success"] is True
        # Regression #1314: response must echo the actual ended date,
        # not silently drop it and return the literal string "today".
        assert result["ended"] == "2026-03-01"
    def test_kg_add_forwards_valid_to(self, monkeypatch, config, palace_path, kg):
        """Regression #1314 case 1: valid_to must round-trip through kg_add."""
        _patch_mcp_server(monkeypatch, config, kg)
        from mempalace.mcp_server import tool_kg_add
        result = tool_kg_add(
            subject="_test_temporal",
            predicate="had_value",
            object="probe",
            valid_from="2026-01-01",
            valid_to="2026-04-28",
        )
        assert result["success"] is True
        facts = kg.query_entity("_test_temporal")
        assert len(facts) == 1
        assert facts[0]["valid_from"] == "2026-01-01"
        assert facts[0]["valid_to"] == "2026-04-28"
        # An already-ended fact must not be reported as still current.
        assert facts[0]["current"] is False
    def test_kg_add_forwards_source_provenance(self, monkeypatch, config, palace_path, kg):
        """Regression #1314 case 3: source_file / source_drawer_id reach storage."""
        _patch_mcp_server(monkeypatch, config, kg)
        from mempalace.mcp_server import tool_kg_add
        result = tool_kg_add(
            subject="operating-verb",
            predicate="candidate",
            object="husbandry",
            valid_from="2026-04-28",
            source_closet="closet-42",
            source_file="docs/decisions.md",
            source_drawer_id="drawer_abc123",
        )
        assert result["success"] is True
        triple_id = result["triple_id"]
        # Read raw row to verify all provenance columns persisted.
        with kg._lock:
            row = (
                kg._conn()
                .execute(
                    "SELECT source_closet, source_file, source_drawer_id FROM triples WHERE id = ?",
                    (triple_id,),
                )
                .fetchone()
            )
        assert row is not None
        assert row["source_closet"] == "closet-42"
        assert row["source_file"] == "docs/decisions.md"
        assert row["source_drawer_id"] == "drawer_abc123"
    def test_kg_invalidate_returns_actual_ended_date(
        self, monkeypatch, config, palace_path, seeded_kg
    ):
        """Regression #1314 case 2: response reports the resolved date, not 'today'."""
        from datetime import date as _date
        _patch_mcp_server(monkeypatch, config, seeded_kg)
        from mempalace.mcp_server import tool_kg_invalidate
        # Caller-supplied date round-trips into the response.
        explicit = tool_kg_invalidate(
            subject="Max",
            predicate="does",
            object="swimming",
            ended="2026-04-28",
        )
        assert explicit["ended"] == "2026-04-28"
        # Caller-omitted date resolves to today's ISO date — never the
        # literal string "today" the buggy implementation used to return.
        implicit = tool_kg_invalidate(
            subject="Max",
            predicate="loves",
            object="Chess",
        )
        assert implicit["ended"] != "today"
        assert implicit["ended"] == _date.today().isoformat()
    def test_kg_timeline(self, monkeypatch, config, palace_path, seeded_kg):
        _patch_mcp_server(monkeypatch, config, seeded_kg)
@@ -665,6 +850,59 @@ class TestKGTools:
        result = tool_kg_stats()
        assert result["entities"] >= 4
    # --- Date validation at the MCP boundary (issue #1164) ---
    def test_kg_add_rejects_invalid_valid_from(self, monkeypatch, config, palace_path, kg):
        _patch_mcp_server(monkeypatch, config, kg)
        from mempalace.mcp_server import tool_kg_add
        result = tool_kg_add(
            subject="Alice",
            predicate="likes",
            object="coffee",
            valid_from="Jan 2025",
        )
        assert result["success"] is False
        assert "valid_from" in result["error"]
        assert "ISO-8601" in result["error"]
    def test_kg_query_rejects_invalid_as_of(self, monkeypatch, config, palace_path, seeded_kg):
        _patch_mcp_server(monkeypatch, config, seeded_kg)
        from mempalace.mcp_server import tool_kg_query
        result = tool_kg_query(entity="Max", as_of="March 2026")
        assert "error" in result
        assert "as_of" in result["error"]
    def test_kg_invalidate_rejects_invalid_ended(self, monkeypatch, config, palace_path, seeded_kg):
        _patch_mcp_server(monkeypatch, config, seeded_kg)
        from mempalace.mcp_server import tool_kg_invalidate
        result = tool_kg_invalidate(
            subject="Max",
            predicate="does",
            object="chess",
            ended="yesterday",
        )
        assert result["success"] is False
        assert "ended" in result["error"]
    def test_kg_query_rejects_partial_iso_dates(self, monkeypatch, config, palace_path, seeded_kg):
        _patch_mcp_server(monkeypatch, config, seeded_kg)
        from mempalace.mcp_server import tool_kg_query
        # Partial ISO dates are rejected: KG queries compare TEXT dates
        # lexicographically, so "2026-01-01" <= "2026" is False, which
        # silently excludes facts. Reject at the boundary — only YYYY-MM-DD
        # produces correct results.
        for value in ("2026", "2026-03"):
            result = tool_kg_query(entity="Max", as_of=value)
            assert "error" in result, f"accepted partial date {value!r}: {result}"
        # Full ISO-8601 dates still pass.
        result = tool_kg_query(entity="Max", as_of="2026-03-15")
        assert "error" not in result, f"rejected valid date: {result}"
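The lexicographic pitfall described in the comment above can be reproduced in isolation. This is a minimal sketch, not the validator mempalace ships: `is_full_iso_date` is a hypothetical helper, and it assumes that only the zero-padded `YYYY-MM-DD` form should be accepted.

```python
import re
from datetime import date

# Assumption for this sketch: exactly four, two and two ASCII digits.
_FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_full_iso_date(value: str) -> bool:
    """Hypothetical check: accept only a real calendar date as YYYY-MM-DD."""
    if not _FULL_DATE.fullmatch(value):
        return False
    try:
        # Rejects shapes that look right but are not dates, e.g. 2026-02-30.
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Text comparison is character by character, so a full date sorts after
# its own year prefix and a "valid_from <= as_of" filter drops the fact.
fact_valid_from = "2026-01-01"
excluded_by_partial = not (fact_valid_from <= "2026")
included_by_full = fact_valid_from <= "2026-03-15"
```

Rejecting at the boundary keeps the storage layer's plain text comparison correct without having to pad partial dates.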
 # ── Diary Tools ─────────────────────────────────────────────────────────
@@ -682,7 +920,8 @@ class TestDiaryTools:
            topic="architecture",
        )
        assert w["success"] is True
-        assert w["agent"] == "TestAgent"
+        # agent_name is normalized to lowercase on write (#1243).
        assert w["agent"] == "testagent"
        r = tool_diary_read(agent_name="TestAgent")
        assert r["total"] == 1
@@ -774,6 +1013,50 @@ class TestDiaryTools:
        assert r_scoped["total"] == 1
        assert r_scoped["entries"][0]["content"] == "project-wing entry"
    def test_diary_read_case_insensitive_agent(self, monkeypatch, config, palace_path, kg):
        """Regression for #1243: diary_read must be case-insensitive over
        agent_name. Writing as "Claude" and reading as "claude" (or vice
        versa) must surface the same entries — sanitize_name preserved
        case, which silently dropped reads when the agent name's casing
        differed from the write."""
        _patch_mcp_server(monkeypatch, config, kg)
        _client, _col = _get_collection(palace_path, create=True)
        del _client
        from mempalace.mcp_server import tool_diary_read, tool_diary_write
        # Write as "Claude" → read as "claude" should match.
        w1 = tool_diary_write(
            agent_name="Claude",
            entry="entry written as Claude",
            topic="general",
        )
        assert w1["success"]
        r1 = tool_diary_read(agent_name="claude")
        assert "entries" in r1, r1
        contents1 = {e["content"] for e in r1["entries"]}
        assert "entry written as Claude" in contents1
        # Write as "CLAUDE" → read as "Claude" should also match the
        # same agent. After normalization both writes target the same
        # lowercase agent identity, so both entries are returned.
        w2 = tool_diary_write(
            agent_name="CLAUDE",
            entry="entry written as CLAUDE",
            topic="general",
        )
        assert w2["success"]
        r2 = tool_diary_read(agent_name="Claude")
        contents2 = {e["content"] for e in r2["entries"]}
        assert "entry written as Claude" in contents2
        assert "entry written as CLAUDE" in contents2
        # The stored agent metadata is the lowercase form, and the
        # default wing is derived from that lowercase form too.
        assert w1["agent"] == "claude"
        assert w2["agent"] == "claude"
 # ── Cache Invalidation (inode/mtime) ──────────────────────────────────
@@ -919,3 +1202,352 @@ class TestCacheInvalidation:
        col2 = mcp_server._get_collection(create=True)
        assert col2 is not None
        assert calls == [], f"get_or_create_collection was called: {calls}"
    def test_get_collection_passes_embedding_function(self, monkeypatch, config, palace_path, kg):
        """Regression for #1299.
        ``mcp_server._get_collection`` must pass ``embedding_function=`` into
        both ``client.get_collection`` and ``client.create_collection``,
        mirroring ``ChromaBackend.get_collection``. Without it, ChromaDB 1.x
        falls back to its built-in ``DefaultEmbeddingFunction`` (whose lazy
        ONNX provider selection has SIGSEGV'd on python 3.14 + Apple Silicon),
        and writers/readers can disagree with the miner about which EF is
        bound to the collection. The miner / Stop hook ingest path routes
        through ``ChromaBackend.get_collection`` which does this correctly;
        the MCP server must match.
        """
        _patch_mcp_server(monkeypatch, config, kg)
        from mempalace import mcp_server
        client = mcp_server._get_client()
        client_cls = type(client)
        captured: dict[str, list[dict]] = {"get": [], "create": []}
        real_get = client_cls.get_collection
        real_create = client_cls.create_collection
        def _spy_get(self, name, **kwargs):
            captured["get"].append(dict(kwargs))
            return real_get(self, name, **kwargs)
        def _spy_create(self, name, **kwargs):
            captured["create"].append(dict(kwargs))
            return real_create(self, name, **kwargs)
        monkeypatch.setattr(client_cls, "get_collection", _spy_get)
        monkeypatch.setattr(client_cls, "create_collection", _spy_create)
        mcp_server._collection_cache = None
        col = mcp_server._get_collection(create=True)
        assert col is not None
        all_calls = captured["get"] + captured["create"]
        assert all_calls, "expected get_collection or create_collection to be called"
        for kwargs in all_calls:
            assert (
                "embedding_function" in kwargs
            ), f"missing embedding_function= in chromadb call: {kwargs}"
            assert kwargs["embedding_function"] is not None
        # Same expectation on the create=False (cache-miss) reopen path.
        mcp_server._collection_cache = None
        captured["get"].clear()
        captured["create"].clear()
        col2 = mcp_server._get_collection()
        assert col2 is not None
        assert captured["get"], "expected get_collection on cache-miss reopen"
        for kwargs in captured["get"]:
            assert "embedding_function" in kwargs
            assert kwargs["embedding_function"] is not None
    def test_get_collection_retries_once_on_exception(self, monkeypatch, config, palace_path, kg):
        """Regression: a transient failure inside _get_collection must trigger
        one retry after clearing the client/collection caches, not silently
        return None.
        Before this fix, a stale chromadb handle (e.g. the rust bindings
        invalidating after an out-of-band write) would raise inside the
        single ``try`` block, get swallowed by ``except Exception: return
        None``, and every subsequent tool call would hit the same poisoned
        cache returning None. The retry forces ``_get_client()`` to rebuild
        the client (which re-runs ``quarantine_stale_hnsw`` per #1322), so
        the second attempt heals the common stale-handle case.
        """
        _patch_mcp_server(monkeypatch, config, kg)
        _client, _col = _get_collection(palace_path, create=True)
        del _client
        from mempalace import mcp_server
        # Force a cold cache so the first call goes through the open path.
        mcp_server._client_cache = None
        mcp_server._collection_cache = None
        real_get_client = mcp_server._get_client
        attempts = {"count": 0}
        def flaky_get_client():
            attempts["count"] += 1
            if attempts["count"] == 1:
                raise RuntimeError("simulated transient chromadb failure")
            return real_get_client()
        monkeypatch.setattr(mcp_server, "_get_client", flaky_get_client)
        col = mcp_server._get_collection()
        # Both attempts ran and the second succeeded.
        assert attempts["count"] == 2
        assert col is not None
    def test_get_collection_returns_none_after_two_failures(
        self, monkeypatch, config, palace_path, kg
    ):
        """If both attempts fail, return None (matches the prior contract for
        permanent failures — only the transient case is now self-healing)."""
        _patch_mcp_server(monkeypatch, config, kg)
        _client, _col = _get_collection(palace_path, create=True)
        del _client
        from mempalace import mcp_server
        mcp_server._client_cache = None
        mcp_server._collection_cache = None
        attempts = {"count": 0}
        def always_fails():
            attempts["count"] += 1
            raise RuntimeError("permanent chromadb failure")
        monkeypatch.setattr(mcp_server, "_get_client", always_fails)
        col = mcp_server._get_collection()
        assert attempts["count"] == 2
        assert col is None
 class TestKGLazyCache:
    """Lazy per-path KnowledgeGraph cache (issue #1136)."""
    def test_lazy_init_no_import_side_effect(self, tmp_path):
        """Importing mcp_server must not create knowledge_graph.sqlite3.
        Runs in a fresh subprocess with HOME pointed at tmp_path so the
        assertion targets a clean filesystem, independent of conftest's
        session-level HOME patch.
        """
        import subprocess
        import sys
        kg_file = tmp_path / ".mempalace" / "knowledge_graph.sqlite3"
        env = {k: v for k, v in os.environ.items() if not k.startswith("MEMPAL")}
        env["HOME"] = str(tmp_path)
        env["USERPROFILE"] = str(tmp_path)
        result = subprocess.run(
            [sys.executable, "-c", "import mempalace.mcp_server"],
            env=env,
            capture_output=True,
            text=True,
            timeout=30,
        )
        assert result.returncode == 0, f"import failed: {result.stderr}"
        assert not kg_file.exists(), f"import created sqlite file at {kg_file} as a side effect"
    def test_get_kg_returns_same_instance(self, tmp_path, monkeypatch):
        """Two calls with the same resolved path return the same KG."""
        from mempalace import mcp_server
        monkeypatch.setattr(mcp_server, "_kg_by_path", {})
        monkeypatch.setattr(mcp_server, "_palace_flag_given", True)
        monkeypatch.setenv("MEMPALACE_PALACE_PATH", str(tmp_path))
        kg1 = mcp_server._get_kg()
        kg2 = mcp_server._get_kg()
        assert kg1 is kg2
        assert len(mcp_server._kg_by_path) == 1
    def test_get_kg_different_paths_different_instances(self, tmp_path, monkeypatch):
        """Different palace paths map to different KG instances."""
        from mempalace import mcp_server
        tmp_a = tmp_path / "a"
        tmp_b = tmp_path / "b"
        tmp_a.mkdir()
        tmp_b.mkdir()
        monkeypatch.setattr(mcp_server, "_kg_by_path", {})
        monkeypatch.setattr(mcp_server, "_palace_flag_given", True)
        monkeypatch.setenv("MEMPALACE_PALACE_PATH", str(tmp_a))
        kg_a = mcp_server._get_kg()
        monkeypatch.setenv("MEMPALACE_PALACE_PATH", str(tmp_b))
        kg_b = mcp_server._get_kg()
        assert kg_a is not kg_b
        assert len(mcp_server._kg_by_path) == 2
    def test_multi_tenant_env_switch(self, tmp_path, monkeypatch):
        """The issue #1136 acceptance scenario.
        Rotating MEMPALACE_PALACE_PATH between MCP tool calls must route
        each call to the correct tenant's KG sqlite file.
        """
        from mempalace import mcp_server
        tmp_a = tmp_path / "tenant_a"
        tmp_b = tmp_path / "tenant_b"
        tmp_a.mkdir()
        tmp_b.mkdir()
        monkeypatch.setattr(mcp_server, "_kg_by_path", {})
        monkeypatch.setattr(mcp_server, "_palace_flag_given", True)
        monkeypatch.setenv("MEMPALACE_PALACE_PATH", str(tmp_a))
        add_result = mcp_server.tool_kg_add(
            subject="alice_secret",
            predicate="owns",
            object="repo_a",
        )
        assert add_result.get("success") is True, add_result
        monkeypatch.setenv("MEMPALACE_PALACE_PATH", str(tmp_b))
        query_b = mcp_server.tool_kg_query(entity="alice_secret")
        assert query_b.get("count", 0) == 0, f"tenant B leaked tenant A's fact: {query_b}"
        monkeypatch.setenv("MEMPALACE_PALACE_PATH", str(tmp_a))
        query_a = mcp_server.tool_kg_query(entity="alice_secret")
        assert query_a.get("count", 0) >= 1, f"tenant A lost its own fact: {query_a}"
    def test_cache_thread_safe(self, tmp_path, monkeypatch):
        """Concurrent _get_kg() for the same path yields one instance."""
        import concurrent.futures
        from mempalace import mcp_server
        monkeypatch.setattr(mcp_server, "_kg_by_path", {})
        monkeypatch.setattr(mcp_server, "_palace_flag_given", True)
        monkeypatch.setenv("MEMPALACE_PALACE_PATH", str(tmp_path))
        with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
            results = list(pool.map(lambda _: mcp_server._get_kg(), range(16)))
        ids = {id(kg) for kg in results}
        assert len(ids) == 1, f"expected 1 unique instance, got {len(ids)}"
        assert len(mcp_server._kg_by_path) == 1
    def test_tool_reconnect_drains_kg_cache(self, monkeypatch):
        """``tool_reconnect`` must close cached KG instances and clear the dict.
        Without this, an external replacement of ``knowledge_graph.sqlite3``
        leaves the server pinned to a stale ``sqlite3.Connection``.
        """
        from mempalace import mcp_server
        class _FakeKG:
            def __init__(self):
                self.closed = False
            def close(self):
                self.closed = True
        fake_a = _FakeKG()
        fake_b = _FakeKG()
        monkeypatch.setattr(mcp_server, "_kg_by_path", {"/a": fake_a, "/b": fake_b})
        # Bypass real ChromaDB so the test isolates KG-cache behaviour.
        monkeypatch.setattr(mcp_server, "_get_collection", lambda: None)
        mcp_server.tool_reconnect()
        assert fake_a.closed is True
        assert fake_b.closed is True
        assert mcp_server._kg_by_path == {}
    def test_tool_reconnect_swallows_kg_close_errors(self, monkeypatch):
        """A failing ``close()`` on one cached KG must not block cache clearing."""
        from mempalace import mcp_server
        class _BoomKG:
            def close(self):
                raise RuntimeError("boom")
        monkeypatch.setattr(mcp_server, "_kg_by_path", {"/a": _BoomKG()})
        monkeypatch.setattr(mcp_server, "_get_collection", lambda: None)
        mcp_server.tool_reconnect()
        assert mcp_server._kg_by_path == {}
    def test_call_kg_retries_after_concurrent_close(self, monkeypatch):
        """A KG closed mid-handler must trigger a one-shot retry with a fresh
        instance — not surface a -32000 to the MCP client."""
        import sqlite3 as _sqlite3
        from mempalace import mcp_server
        path = "/fake/palace/knowledge_graph.sqlite3"
        monkeypatch.setattr(mcp_server, "_resolve_kg_path", lambda: path)
        class _ClosedKG:
            def query_entity(self, entity, **kwargs):
                raise _sqlite3.ProgrammingError("Cannot operate on a closed database")
        class _FreshKG:
            def query_entity(self, entity, **kwargs):
                return [{"entity": entity}]
        cache = {os.path.abspath(path): _ClosedKG()}
        monkeypatch.setattr(mcp_server, "_kg_by_path", cache)
        # Second _get_kg() call (after the cache eviction) constructs a new
        # KG. Patch the constructor so we don't open a real sqlite file.
        monkeypatch.setattr(mcp_server, "KnowledgeGraph", lambda **_: _FreshKG())
        result = mcp_server._call_kg(lambda kg: kg.query_entity("Alice"))
        assert result == [{"entity": "Alice"}]
        # The closed instance must be evicted; the fresh one must be cached.
        assert isinstance(cache[os.path.abspath(path)], _FreshKG)
    def test_call_kg_does_not_retry_on_other_errors(self, monkeypatch):
        """Non-ProgrammingError exceptions must propagate without retry —
        we don't want the retry guard masking real bugs."""
        from mempalace import mcp_server
        path = "/fake/palace/knowledge_graph.sqlite3"
        monkeypatch.setattr(mcp_server, "_resolve_kg_path", lambda: path)
        calls = {"count": 0}
        class _FailingKG:
            def query_entity(self, entity, **kwargs):
                calls["count"] += 1
                raise ValueError("bad input")
        monkeypatch.setattr(mcp_server, "_kg_by_path", {os.path.abspath(path): _FailingKG()})
        monkeypatch.setattr(mcp_server, "KnowledgeGraph", lambda **_: _FailingKG())
        with pytest.raises(ValueError, match="bad input"):
            mcp_server._call_kg(lambda kg: kg.query_entity("Alice"))
        assert calls["count"] == 1, "non-ProgrammingError must not trigger retry"
    def test_call_kg_gives_up_after_one_retry(self, monkeypatch):
        """If the second attempt also hits a closed DB, give up rather than
        loop forever — a sustained close-stream is a different bug."""
        import sqlite3 as _sqlite3
        from mempalace import mcp_server
        path = "/fake/palace/knowledge_graph.sqlite3"
        monkeypatch.setattr(mcp_server, "_resolve_kg_path", lambda: path)
        calls = {"count": 0}
        class _AlwaysClosedKG:
            def query_entity(self, entity, **kwargs):
                calls["count"] += 1
                raise _sqlite3.ProgrammingError("closed again")
        cache = {}
        monkeypatch.setattr(mcp_server, "_kg_by_path", cache)
        monkeypatch.setattr(mcp_server, "KnowledgeGraph", lambda **_: _AlwaysClosedKG())
        with pytest.raises(_sqlite3.ProgrammingError):
            mcp_server._call_kg(lambda kg: kg.query_entity("Alice"))
        assert calls["count"] == 2, "expected exactly one retry beyond the initial attempt"
@@ -135,19 +135,77 @@ def test_different_palaces_dont_conflict(tmp_path, monkeypatch):
 def test_palace_path_is_normalized(tmp_path, monkeypatch):
-    """Relative and absolute forms of the same path must use the same lock."""
+    """Relative and absolute forms of the same path must use the same lock.
    Cross-process variant: a child holds the absolute form, a relative form
    in the parent must hash to the same lock key and raise
    ``MineAlreadyRunning``. (The same-thread case is now a re-entrant
    pass-through by design — see ``test_reentrant_same_thread_passes_through``
    — so we exercise the normalization invariant across a process boundary
    where re-entrance does not apply.)
    """
    monkeypatch.setenv("HOME", str(tmp_path))
    monkeypatch.chdir(tmp_path)
    os.makedirs(tmp_path / "palace", exist_ok=True)
    absolute = str(tmp_path / "palace")
-    relative = "palace"
+    ready = str(tmp_path / "ready")
    release = str(tmp_path / "release")
-    # Hold the lock with the absolute form; attempting to re-acquire with
+    ctx = _get_mp_context()
-    # the relative form (which resolves to the same absolute path) must fail.
+    holder = ctx.Process(target=_hold_lock, args=(absolute, ready, release))
-    with mine_palace_lock(absolute):
+    holder.start()
    try:
        for _ in range(500):
            if os.path.exists(ready):
                break
            time.sleep(0.01)
        assert os.path.exists(ready), "holder failed to acquire lock in time"
        # Parent holds CWD = tmp_path so "palace" is the same on-disk dir as
        # the absolute form. The lock key is sha256(realpath+normcase) so the
        # two forms must collide.
        with pytest.raises(MineAlreadyRunning):
-            with mine_palace_lock(relative):
+            with mine_palace_lock("palace"):
                pytest.fail("normalized path collision should have raised")
    finally:
        open(release, "w").close()
        holder.join(timeout=5)
 def test_reentrant_same_thread_passes_through(tmp_path, monkeypatch):
    """Same thread re-acquiring the same palace lock must not deadlock or raise.
    This is the invariant that makes ``ChromaCollection`` write methods (which
    take ``mine_palace_lock`` for MCP/direct-writer protection) compose with
    ``miner.mine()`` (which already holds the lock for the entire mine
    pipeline). Without the per-thread re-entrant guard the inner acquire
    would self-deadlock on the outer flock.
    """
    monkeypatch.setenv("HOME", str(tmp_path))
    palace = str(tmp_path / "palace")
    with mine_palace_lock(palace):
        # Re-enter from the same thread — must yield without raising or hanging.
        with mine_palace_lock(palace):
            pass
        # After the inner exits, the outer is still held: confirm via a
        # subprocess that tries to acquire and reports back.
        ctx = _get_mp_context()
        result_q = ctx.Queue()
        child = ctx.Process(target=_try_acquire_expect_busy, args=(palace, result_q))
        child.start()
        child.join(timeout=5)
        assert (
            result_q.get(timeout=1) == "busy"
        ), "outer lock should still be held by parent after inner re-entrant exit"
 def _try_acquire_expect_busy(palace_path, result_q):
    """Helper: try to acquire, push 'busy' (raised) or 'free' (acquired) into queue."""
    try:
        with mine_palace_lock(palace_path):
            result_q.put("free")
    except MineAlreadyRunning:
        result_q.put("busy")
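The normalization invariant tested above ("palace" and its absolute form must collide) follows from hashing a canonical path. The sketch below assumes the key is the SHA-256 of `normcase(realpath(path))`, as the test comment suggests; `lock_key` is a hypothetical name and the project's exact derivation may differ.

```python
import hashlib
import os


def lock_key(palace_path: str) -> str:
    """Hypothetical key derivation: hash the canonical form of the path."""
    # realpath resolves relative segments and symlinks against the CWD;
    # normcase folds case and separators on Windows and is a no-op on POSIX.
    canonical = os.path.normcase(os.path.realpath(palace_path))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With the working directory set to the parent of `palace`, `lock_key("palace")` and `lock_key` of the absolute path yield the same digest, which is why the relative form in the parent process contends with the lock held by the child.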
 def test_mine_global_lock_is_alias_for_back_compat(tmp_path, monkeypatch):
@@ -120,6 +120,65 @@ class TestSearchMemories:
        assert none_hit["wing"] == "unknown"
        assert none_hit["room"] == "unknown"
    def test_effective_distance_clamped_to_valid_cosine_range(self):
        """A strong closet boost (up to 0.40) applied to a low-distance drawer
        can drive ``dist - boost`` negative. That violates the cosine-distance
        invariant ``[0, 2]``: the API returns ``similarity > 1.0`` and the
        internal ``_sort_key`` sinks below ordinary positive distances,
        inverting the ranking so the best hybrid matches sort last.
        With the clamp, ``effective_distance`` stays in ``[0, 2]``,
        ``similarity`` stays in ``[0, 1]``, and the sort order is stable.
        """
        # Drawer a.md gets a tiny base distance (0.08) — nearly exact match.
        # Drawer b.md gets a larger base distance (0.35).
        drawers_col = MagicMock()
        drawers_col.query.return_value = {
            "documents": [["doc-a", "doc-b"]],
            "metadatas": [
                [
                    {"source_file": "a.md", "wing": "w", "room": "r", "chunk_index": 0},
                    {"source_file": "b.md", "wing": "w", "room": "r", "chunk_index": 0},
                ]
            ],
            "distances": [[0.08, 0.35]],
            "ids": [["d-a", "d-b"]],
        }
        # A strong closet at rank 0 points at a.md → boost = 0.40,
        # which exceeds a.md's base distance and would go negative without
        # the clamp. No closet for b.md.
        closets_col = MagicMock()
        closets_col.query.return_value = {
            "documents": [["closet-preview-a"]],
            "metadatas": [[{"source_file": "a.md"}]],
            "distances": [[0.2]],  # within CLOSET_DISTANCE_CAP (1.5)
            "ids": [["c-a"]],
        }
        with (
            patch("mempalace.searcher.get_collection", return_value=drawers_col),
            patch("mempalace.searcher.get_closets_collection", return_value=closets_col),
        ):
            result = search_memories("query", "/fake/path", n_results=5)
        hits = result["results"]
        assert hits, "should return results"
        # Invariants on every hit.
        for h in hits:
            assert (
                0.0 <= h["similarity"] <= 1.0
            ), f"similarity out of range: {h['similarity']} for {h['source_file']}"
            assert 0.0 <= h["effective_distance"] <= 2.0, (
                f"effective_distance out of range: {h['effective_distance']} "
                f"for {h['source_file']}"
            )
        # With the clamp, the closet-boosted a.md still ranks ahead of b.md —
        # the boost still wins, but it no longer flips the ranking.
        assert hits[0]["source_file"] == "a.md"
        assert hits[0]["matched_via"] == "drawer+closet"
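The clamp described in the docstring above can be shown with the test's own numbers. This is a sketch under assumptions: the helper names are invented here, and the `1 - distance` similarity mapping floored at zero is a guess at how the searcher keeps similarity within `[0, 1]`.

```python
def clamp_effective_distance(base_distance: float, boost: float) -> float:
    """Subtract a boost, then clamp to the cosine-distance range [0, 2]."""
    return max(0.0, min(2.0, base_distance - boost))


def similarity_from_distance(distance: float) -> float:
    """Assumed mapping: similarity = 1 - distance, floored at 0."""
    return max(0.0, 1.0 - distance)


# Values from the test above: a.md is boosted by 0.40, b.md is not.
a = clamp_effective_distance(0.08, 0.40)  # unclamped difference is negative
b = clamp_effective_distance(0.35, 0.0)
ranked = sorted([("b.md", b), ("a.md", a)], key=lambda item: item[1])
```

After clamping, the boosted drawer sits at the lower bound of the range instead of below it, so it still sorts first and its similarity cannot exceed 1.0.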
 # ── BM25 internals: None / empty document safety ─────────────────────
@@ -0,0 +1,39 @@
 #!/usr/bin/env bash
 # backup_claude_jsonls.sh
 #
 # Claude Code stores every conversation as a JSONL transcript at
 #   ~/.claude/projects/<encoded-project>/<session-uuid>.jsonl
 # Anthropic auto-deletes those files after 30 DAYS:
 #   https://docs.claude.com/en/docs/claude-code/data-usage
 #
 # This script copies them, read-only, into ~/Documents/Claude_JSONL_Backup/
 # so the 30-day clock no longer applies. Re-run any time — rsync is incremental.
 # It NEVER deletes, modifies, or touches files inside ~/.claude/.
 set -eu
 SRC="${HOME}/.claude/projects/"
 DST="${HOME}/Documents/Claude_JSONL_Backup/"
 [ -d "$SRC" ] || { echo "ERROR: $SRC does not exist."; exit 1; }
 mkdir -p "$DST"
 echo "Backing up $SRC -> $DST"
 rsync -a --times "$SRC" "$DST"
 src_count=$(find "$SRC" -type f -name '*.jsonl' | wc -l | tr -d ' ')
 dst_count=$(find "$DST" -type f -name '*.jsonl' | wc -l | tr -d ' ')
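 # `find -exec ... \;` exits 0 even when stat fails, so an `||` fallback never
 # runs on GNU systems. Probe for BSD stat once and choose the branch up front.
 if stat -f '%Sm' -t '%Y-%m-%d' "$DST" >/dev/null 2>&1; then
  dates=$(find "$DST" -type f -name '*.jsonl' -exec stat -f '%Sm' -t '%Y-%m-%d' {} \;)
 else
  dates=$(find "$DST" -type f -name '*.jsonl' -printf '%TY-%Tm-%Td\n')
 fi
 oldest_date=$(echo "$dates" | sort | head -n 1)
 newest_date=$(echo "$dates" | sort | tail -n 1)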
 echo "Source JSONL count : $src_count"
 echo "Backup JSONL count : $dst_count"
 echo "Oldest backup file : ${oldest_date:-n/a}"
 echo "Newest backup file : ${newest_date:-n/a}"
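 # rsync runs without --delete, so the backup keeps transcripts that have since
 # been removed from the source. Only a smaller backup signals a failed copy.
 if [ "$dst_count" -lt "$src_count" ]; then
  echo "FAIL: backup has fewer files than source ($dst_count vs $src_count)"; exit 2
 fi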
 echo "OK: backup verified."
@@ -0,0 +1,115 @@
 #!/usr/bin/env bash
 # find_orphan_claude_jsonls.sh — v3 (multi-line shape + verb-aware preview)
 # -----------------------------------------------------------------------------
 # Finds Claude Code conversation transcripts (.jsonl) that may have survived in
 # backup/sync locations. Claude Code stores transcripts at
 # ~/.claude/projects/<encoded>/<session>.jsonl and auto-deletes them locally
 # after 30 days. If your machine syncs to iCloud, Dropbox, Google Drive,
 # OneDrive, Time Machine, or you copied transcripts elsewhere manually, those
 # copies still exist. This script finds them and shows a topic preview from
 # the first substantive user message — strips leading filler interjections
 # ("ok so", "oh", "well", "hey") so previews surface the actual content.
 #
 # Read-only. Safe to re-run.
 # -----------------------------------------------------------------------------
 set -eu
 LOCATIONS=(
  "$HOME/Library/Mobile Documents" "$HOME/Dropbox" "$HOME/Google Drive"
  "$HOME/OneDrive" "$HOME/Documents" "$HOME/Desktop" "/Volumes"
 )
 TMP="$(mktemp)"; trap 'rm -f "$TMP" "$TMP.s"' EXIT
 printf "Scanning backup locations" >&2
 for loc in "${LOCATIONS[@]}"; do
  [ -d "$loc" ] || continue
  printf "." >&2
  while IFS= read -r -d '' f; do
    # Combined: shape detection (multi-line) + verb-aware topic preview
    if preview="$(python3 - "$f" 2>/dev/null <<'PYEOF'
 import json, sys, re
 # Single-word/short greetings — message gets skipped entirely if it is just one of these
 GREETINGS = {'hi','hey','hello','thanks','thank you','ok','okay','yes','no',
             'sure','cool','great','good','done','yep','nope','perfect','copy'}
 # Leading filler — interjections that get STRIPPED from the start of a message
 # before the preview is taken. Iterative — handles "ok so well, then..." → "then..."
 LEADING_FILLER = re.compile(
    r'^(?:ok(?:ay)?|so|oh|well|anyway|btw|hmm+|um+|uh+|hey|hi|hello|right|'
    r'yes|no|sure|cool|great|good|listen|look|wait|actually|alright|gotcha|'
    r'yeah|yep|nope|nah)\b[\s,!.?:;-]*',
    re.IGNORECASE
 )
 path = sys.argv[1]
 shape_ok = False
 preview = ""
 try:
    with open(path, 'r', errors='replace') as fh:
        for i, line in enumerate(fh):
            if i >= 30: break
            try:
                d = json.loads(line)
            except Exception:
                continue
            if not isinstance(d, dict): continue
            # Shape check — accept if any line in first 30 has session fields
            if not shape_ok and 'sessionId' in d and 'timestamp' in d and 'message' in d:
                shape_ok = True
            # Preview — first user message after stripping leading filler
            if not preview:
                role = d.get('type', '') or d.get('message', {}).get('role', '')
                if role == 'user':
                    content = d.get('message', {}).get('content', '')
                    if isinstance(content, list):
                        text = ' '.join(
                            c.get('text', '') for c in content
                            if isinstance(c, dict) and c.get('type') == 'text'
                        )
                    elif isinstance(content, str):
                        text = content
                    else:
                        text = ''
                    text = re.sub(r'\s+', ' ', text).strip()
                    # Skip messages that are pure greetings
                    if text.lower() in GREETINGS:
                        continue
                    # Iteratively strip leading filler tokens until stable
                    prev_text = None
                    while prev_text != text:
                        prev_text = text
                        text = LEADING_FILLER.sub('', text).strip()
                    # Skip if what remains is too short
                    if len(text) < 20:
                        continue
                    preview = text[:80] + ('...' if len(text) > 80 else '')
            if shape_ok and preview: break
 except Exception:
    pass
 if shape_ok:
    print(preview if preview else "(no preview — first 30 lines were greetings or short)")
    sys.exit(0)
 sys.exit(1)
 PYEOF
 )"; then
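      # Try GNU stat first: under GNU coreutils, `stat -f` means --file-system and prints unrelated text to stdout.
      mtime="$( { stat -c '%y' "$f" 2>/dev/null || stat -f '%Sm' -t '%Y-%m-%d' "$f" 2>/dev/null; } | cut -d' ' -f1)"
      size="$(stat -c '%s' "$f" 2>/dev/null || stat -f '%z' "$f" 2>/dev/null)"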
      printf '%s\t%s\t%s\t%s\n' "$mtime" "$size" "$f" "$preview" >>"$TMP"
    fi
  done < <(find "$loc" -type f -name '*.jsonl' -print0 2>/dev/null)
 done
 printf "\n" >&2
 count=$(wc -l <"$TMP" | tr -d ' ')
 if [ "$count" -eq 0 ]; then
  echo "No orphan Claude Code transcripts found in scanned backup locations."
  exit 0
 fi
 sort -k1,1 "$TMP" >"$TMP.s"
 oldest="$(head -n 1 "$TMP.s" | cut -f1)"
 newest="$(tail -n 1 "$TMP.s" | cut -f1)"
 echo "Found $count orphan Claude Code transcript(s). Oldest: $oldest  Newest: $newest"
 echo "----------------------------------------------------------------------"
 awk -F'\t' '{ printf "%s  %10s  %s\n              \"%s\"\n\n", $1, $2, $3, $4 }' "$TMP.s"
@@ -0,0 +1,80 @@
 #!/usr/bin/env python3
 """render_jsonl.py — turn one Claude Code JSONL transcript into readable text.
 Claude Code stores conversations at ~/.claude/projects/<proj>/<uuid>.jsonl, and
 the client keeps them locally for a limited period (30 days by default; see
 https://docs.claude.com/en/docs/claude-code/data-usage). This script renders a
 JSONL into a clean .txt so you can keep / read / share it without the tooling.
 Usage:
    python3 render_jsonl.py <input.jsonl> [output.txt]
 Stdlib only. Python 3.9+. Read-only on the input.
 """
 import json
 import sys
 from pathlib import Path
 def extract_text(content):
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts = []
        for blk in content:
            if isinstance(blk, dict) and blk.get("type") == "text":
                t = (blk.get("text") or "").strip()
                if t:
                    parts.append(t)
        return "\n".join(parts)
    return ""
 def main():
    if len(sys.argv) < 2:
        print(__doc__)
        sys.exit(1)
    src = Path(sys.argv[1])
    if not src.is_file():
        print(f"ERROR: not a file: {src}", file=sys.stderr)
        sys.exit(1)
    out = open(sys.argv[2], "w", encoding="utf-8") if len(sys.argv) > 2 else sys.stdout
    turns, stamps = [], []
    for raw in src.read_text(encoding="utf-8", errors="replace").splitlines():
        if not raw.strip():
            continue
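        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # Skip valid JSON that is not an object (e.g. a bare list or string).
        if not isinstance(obj, dict):
            continue
        # Fall back to the record itself when "message" is missing or not an object.
        msg = obj.get("message") if isinstance(obj.get("message"), dict) else obj
        role = obj.get("type") or msg.get("role")
        if role not in ("user", "assistant"):
            continue
        text = extract_text(msg.get("content"))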
        if not text:
            continue
        ts = obj.get("timestamp") or ""
        if ts:
            stamps.append(ts)
        turns.append((ts, role, text))
    header = [
        f"# Claude Code transcript: {src}",
        f"# Total turns: {len(turns)}",
        f"# Date range : {min(stamps) if stamps else 'n/a'}  ->  {max(stamps) if stamps else 'n/a'}",
        "#" + "-" * 70,
        "",
    ]
    out.write("\n".join(header))
    for ts, role, text in turns:
        out.write(f"\n[{ts}] {role.upper()}\n{text}\n\n{'-'*72}\n")
    if out is not sys.stdout:
        out.close()
        print(f"Wrote {len(turns)} turns to {sys.argv[2]}")
 if __name__ == "__main__":
    main()
@@ -0,0 +1,26 @@
 ---
 description: Save the current Claude Code session into MemPalace. Idempotent, so re-running creates no duplicates.
 ---
 # /save
 Save the current Claude Code session into MemPalace. Run this when you
 want a checkpoint. Safe to run repeatedly — drawer IDs are content-hashed
 so re-running on the same session overwrites in place, no duplicates.
 Behavior:
 1. Find the current session's JSONL transcript path (Claude Code passes
   it via the conversation context — look for `~/.claude/projects/` paths).
 2. Run via bash:
   ```
   mempalace mine "<TRANSCRIPT_PATH>" --mode convos --wing claude_imports
   ```
 3. If the user supplied an argument after `/save`, use it as the wing name
   instead of `claude_imports` (e.g. `/save my_research` →
   `--wing my_research`).
 4. Report back: how many drawers were filed, into which wing/room.
 Requires `mempalace` to be installed (`pip install mempalace`).
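The idempotency described above follows from deriving each drawer ID from its content. The sketch below illustrates that idea only. The `drawer_id` and `save` helpers, the SHA-256 scheme, and the in-memory `store` dict are assumptions made for illustration, not MemPalace's actual implementation.

```python
import hashlib


def drawer_id(wing: str, content: str) -> str:
    """Hypothetical ID scheme: hash of the wing name plus the chunk text."""
    digest = hashlib.sha256(f"{wing}\x00{content}".encode("utf-8")).hexdigest()
    return f"drawer_{digest[:16]}"


def save(store: dict, wing: str, chunks: list) -> int:
    """Upsert each chunk under its content-derived ID; return the store size."""
    for chunk in chunks:
        # Same content yields the same key, so a re-run overwrites in place.
        store[drawer_id(wing, chunk)] = chunk
    return len(store)


store = {}
chunks = ["user: how do I rotate the key?", "assistant: run the rotate command."]
first = save(store, "claude_imports", chunks)
second = save(store, "claude_imports", chunks)  # re-run on the same session
print(first, second)
```

Because the key depends only on the wing and the text, saving the same session twice leaves the drawer count unchanged, while the same text saved under a different wing gets a different ID.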
@@ -10,7 +10,7 @@ Palace overview: total drawers, wing and room counts, AAAK spec, and memory prot
 **Parameters:** None
-**Returns:** `{ total_drawers, wings, rooms, palace_path, protocol, aaak_dialect }`
+**Returns:** `{ total_drawers, wings, rooms, protocol, aaak_dialect }`
 ---
@@ -122,7 +122,7 @@ Fetch a single drawer by ID — returns full content and metadata.
 |-----------|------|----------|-------------|
 | `drawer_id` | string | **Yes** | ID of the drawer to fetch |
-**Returns:** `{ drawer: { id, wing, room, content, ... } }`
+**Returns:** `{ drawer_id, content, wing, room, metadata }` where `metadata.source_file`, when present, is the basename only — the absolute path written by the miners is reduced before the dict is returned to MCP clients.
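A minimal sketch of the basename reduction described above, assuming the drawer is a plain dict shaped like the documented return value. The helper name `reduce_source_file` and the sample values are hypothetical; the real server code may do this differently.

```python
import os


def reduce_source_file(drawer: dict) -> dict:
    """Return a copy of `drawer` with metadata.source_file cut to its basename."""
    metadata = dict(drawer.get("metadata") or {})
    source = metadata.get("source_file")
    if isinstance(source, str) and source:
        metadata["source_file"] = os.path.basename(source)
    # Build a new dict so the stored record keeps its absolute path.
    return {**drawer, "metadata": metadata}


stored = {
    "drawer_id": "drawer_example",
    "content": "notes on key rotation",
    "wing": "claude_imports",
    "room": "general",
    "metadata": {"source_file": "/home/alice/projects/notes/plan.md"},
}
returned = reduce_source_file(stored)
print(returned["metadata"]["source_file"])
```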
 ---
@@ -378,4 +378,4 @@ Force a reconnect to the palace database. Use this after external scripts or CLI
 **Parameters:** None
-**Returns:** `{ success, palace_path }`
+**Returns:** `{ success, message, drawers, vector_disabled[, vector_disabled_reason] }` (on no-palace: `{ success: false, message, drawers, vector_disabled }`; on exception: `{ success: false, error }`)
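The reconnect call can therefore return three shapes: success, no palace, and exception. A client might branch on them as in the sketch below. The function name `describe_reconnect` and the sample payloads are assumptions for illustration; only the field names come from the return description above.

```python
def describe_reconnect(resp: dict) -> str:
    """Summarize a reconnect response based on which documented fields it has."""
    if not resp.get("success"):
        if "error" in resp:
            # Exception shape: { success: false, error }
            return f"reconnect failed: {resp['error']}"
        # No-palace shape: { success: false, message, drawers, vector_disabled }
        return f"no palace: {resp.get('message', '')}"
    if resp.get("vector_disabled"):
        # vector_disabled_reason is optional, so supply a fallback.
        reason = resp.get("vector_disabled_reason", "no reason given")
        return f"reconnected, {resp.get('drawers')} drawers, vector search off ({reason})"
    return f"reconnected, {resp.get('drawers')} drawers"


# Hypothetical payloads, one per documented shape.
ok = {"success": True, "message": "ok", "drawers": 12, "vector_disabled": False}
degraded = {"success": True, "message": "ok", "drawers": 12,
            "vector_disabled": True, "vector_disabled_reason": "index quarantined"}
missing = {"success": False, "message": "no palace found", "drawers": 0,
           "vector_disabled": False}
failed = {"success": False, "error": "database is locked"}

for resp in (ok, degraded, missing, failed):
    print(describe_reconnect(resp))
```

Checking `success` first and `error` second keeps the exception shape, which carries no `message` or `drawers`, from being mistaken for the no-palace shape.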