fix: address PR review — per-palace lock, MCP server path, hook timeout, tests
Addresses the six Copilot review comments on the initial commit.

1) #6 (critical): mcp_server.py `_get_collection` bypassed ChromaBackend

The MCP server creates its palace collection directly via `chromadb.PersistentClient.get_or_create_collection` in `_get_collection`, not through `ChromaBackend.get_collection`. That path was missing the `hnsw:num_threads=1` metadata, so the primary crash surface for #974 and #965 was untouched by the original patch. Fixed by passing `hnsw:num_threads=1` at the mcp_server create site too. Documented in a code comment that the setting is only honored at creation time; existing palaces created before this fix still need a `mempalace nuke` + re-mine to gain the protection.

2) #3: mine_global_lock over-serialized mines across unrelated palaces

Replaced the single global lock file `mine_global.lock` with a per-palace lock keyed by `sha256(os.path.abspath(palace_path))` (`mine_palace_<hash>.lock`). Mines against the same palace still collapse to a single runner (the correctness boundary), but mines against *different* palaces are now free to run in parallel. `mine_global_lock` is kept as a backward-compatible alias for `mine_palace_lock` so any external callers that imported the previous name keep working.

3) #1: hook_precompact swallowed OSError but not subprocess.TimeoutExpired

`subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow palaces. The previous `except OSError` clause didn't catch it, so the hook could raise and fail to emit any JSON decision, leaving the harness without a block/passthrough signal. Fixed by catching `(OSError, subprocess.TimeoutExpired)` together and always falling through to the block decision so the hook reliably emits a response.

4) #2 + #4: tests

- tests/test_hooks_cli.py: added `test_precompact_first_two_attempts_block`, `test_precompact_passes_through_after_cap`, and `test_precompact_counter_is_per_session` to lock in the #955 deadlock fix.
- tests/test_palace_locks.py (new): covers `mine_palace_lock` single-acquire, reuse-after-release, cross-process serialization on the same palace, non-interference across different palaces, path normalization, and the `mine_global_lock` back-compat alias.

5) #5: known limitation, documented but not auto-fixed

Copilot suggested detecting collections missing `hnsw:num_threads=1` and calling `collection.modify(metadata=...)` to retrofit existing palaces. Verified against chromadb 1.5.7: `modify(metadata=...)` replaces metadata rather than merging, and re-passing `hnsw:space="cosine"` then raises `ValueError: Changing the distance function of a collection once it is created is not supported currently.` The HNSW runtime configuration (`configuration_json`) also does not expose `num_threads` in chromadb 1.5.x, so the flag appears to be read only at creation time. Rather than paper over the limitation with a best-effort `modify` that silently drops `hnsw:space`, documented in the mcp_server comment that pre-existing palaces need a `mempalace nuke` + re-mine to gain the protection. Fresh palaces are always protected.

Testing

- pytest tests/test_palace_locks.py tests/test_hooks_cli.py tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
  - Different palaces → both complete in parallel ✓
  - Same palace → one completes, the other exits with "another `mine` is already running against <palace> — exiting cleanly." ✓
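
The hook fix described in item 3 can be illustrated with a minimal sketch. The function name `run_precompact_mine`, the command it runs, and the shape of the decision payload are hypothetical stand-ins, not the project's actual hook code; only the combined `(OSError, subprocess.TimeoutExpired)` handling and the fall-through to a block decision mirror what the message describes.

```python
import json
import subprocess
import sys


def run_precompact_mine(cmd, timeout=60):
    """Run a mine command and always return a JSON decision string.

    Hypothetical helper: the payload shape is assumed for illustration.
    """
    try:
        subprocess.run(cmd, timeout=timeout, capture_output=True, check=False)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary (OSError) and a slow palace (TimeoutExpired) are
        # both treated as non-fatal: log and fall through to the decision.
        print(f"precompact: mine did not finish: {exc!r}", file=sys.stderr)
    # Reached on success and on failure, so a decision is always emitted.
    return json.dumps({"decision": "block"})
```

Catching both exception types in one clause keeps a single exit path, which is what guarantees the harness receives a response even when the subprocess never completes.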
Committed by Igor Lins e Silva
Parent: 7e18a70796
Commit: 99b820cb42
@@ -217,9 +217,16 @@ def _get_collection(create=False):
     try:
         client = _get_client()
         if create:
+            # hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor
+            # HNSW insert path, which has a race in repairConnectionsForUpdate /
+            # addPoint (see issues #974, #965). The setting is only honored at
+            # collection creation time — pre-existing palaces created before
+            # this fix keep the unsafe default; users must `mempalace nuke` +
+            # re-mine to get the protection on legacy palaces.
             _collection_cache = ChromaCollection(
                 client.get_or_create_collection(
-                    _config.collection_name, metadata={"hnsw:space": "cosine"}
+                    _config.collection_name,
+                    metadata={"hnsw:space": "cosine", "hnsw:num_threads": 1},
                 )
             )
             _metadata_cache = None
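
As a rough illustration of the create-site change above, the following sketch passes both metadata keys when the collection is first created. The helper names `palace_collection_metadata` and `open_palace_collection` are hypothetical, and the example assumes `chromadb` is installed; per the commit message, the thread setting appears to be read only at creation time, so this does not retrofit an existing collection.

```python
def palace_collection_metadata():
    """Metadata passed at creation time (both keys appear in the diff above)."""
    return {"hnsw:space": "cosine", "hnsw:num_threads": 1}


def open_palace_collection(palace_path, name):
    """Create or open a collection with single-threaded HNSW inserts.

    Hypothetical helper; chromadb is imported lazily so the metadata
    helper above can be used without the dependency installed.
    """
    import chromadb

    client = chromadb.PersistentClient(path=palace_path)
    return client.get_or_create_collection(
        name, metadata=palace_collection_metadata()
    )
```

Keeping the metadata in one helper is one way to avoid the original problem, where a second create site drifted from the backend's settings.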
Changes in this file: +4 -3
@@ -25,8 +25,8 @@ from .palace import (
     file_already_mined,
     get_closets_collection,
     get_collection,
-    mine_global_lock,
     mine_lock,
+    mine_palace_lock,
     purge_file_closets,
     upsert_closet_lines,
 )
@@ -1008,7 +1008,7 @@ def mine(
         )

     try:
-        with mine_global_lock():
+        with mine_palace_lock(palace_path):
             return _mine_impl(
                 project_dir,
                 palace_path,
@@ -1021,7 +1021,8 @@ def mine(
             )
     except MineAlreadyRunning:
         print(
-            "mempalace: another `mine` is already running — exiting cleanly.",
+            f"mempalace: another `mine` is already running against "
+            f"{palace_path} — exiting cleanly.",
             file=sys.stderr,
         )
         return
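
The caller pattern in this file (try to take the lock, exit quietly if it is busy) can be sketched in isolation. This is a POSIX-only illustration using `fcntl.flock`; the diff shows a separate non-`fcntl` branch that is omitted here, and the names `nonblocking_lock` and `mine_once` are hypothetical.

```python
import contextlib
import fcntl  # POSIX only; the non-fcntl branch of the real code is omitted
import os
import tempfile


class MineAlreadyRunning(RuntimeError):
    """Raised when the lock is already held by someone else."""


@contextlib.contextmanager
def nonblocking_lock(lock_path):
    """Hold an exclusive flock on lock_path, or raise instead of waiting."""
    lf = open(lock_path, "w")
    acquired = False
    try:
        try:
            fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError as exc:
            raise MineAlreadyRunning(f"lock busy: {lock_path}") from exc
        yield
    finally:
        if acquired:
            fcntl.flock(lf, fcntl.LOCK_UN)
        lf.close()


def mine_once(lock_path):
    """Return True if the work ran, False if another holder was detected."""
    try:
        with nonblocking_lock(lock_path):
            return True  # stand-in for the real mining work
    except MineAlreadyRunning:
        return False
```

Because the lock is requested with `LOCK_NB`, a second caller fails immediately rather than queueing, which matches the "exiting cleanly" behavior described in the commit message.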
Changes in this file: +23 -11
@@ -311,27 +311,33 @@ def mine_lock(source_file: str):


 class MineAlreadyRunning(RuntimeError):
-    """Raised when another `mempalace mine` process already holds the global lock."""
+    """Raised when another `mempalace mine` already holds the per-palace lock."""


 @contextlib.contextmanager
-def mine_global_lock():
-    """Process-wide non-blocking lock around the full `mine` pipeline.
+def mine_palace_lock(palace_path: str):
+    """Per-palace non-blocking lock around the full `mine` pipeline.

     The per-file `mine_lock` only protects delete+insert interleave for a
     single source; it does not prevent N copies of `mempalace mine <dir>`
     from being spawned concurrently by hooks. When that happens, each copy
-    drives ChromaDB HNSW inserts in parallel, which (combined with
-    chromadb's multi-threaded ParallelFor) can corrupt the HNSW graph and
-    produce sparse link_lists.bin blowups.
+    drives ChromaDB HNSW inserts in parallel against the same palace,
+    which (combined with chromadb's multi-threaded ParallelFor) can
+    corrupt the HNSW graph and produce sparse link_lists.bin blowups.

-    This lock is non-blocking: if another `mine` is already running, we
+    The lock file is keyed by sha256(palace_path) so mines against
+    *different* palaces can still run in parallel — we only serialize
+    writes into the same palace, which is the correctness boundary.
+
+    Non-blocking: if another `mine` is already writing to this palace,
     raise MineAlreadyRunning so the caller can exit cleanly instead of
-    piling up waiting workers.
+    piling up as a waiting worker.
     """
     lock_dir = os.path.join(os.path.expanduser("~"), ".mempalace", "locks")
     os.makedirs(lock_dir, exist_ok=True)
-    lock_path = os.path.join(lock_dir, "mine_global.lock")
+    resolved = os.path.abspath(os.path.expanduser(palace_path))
+    palace_key = hashlib.sha256(resolved.encode()).hexdigest()[:16]
+    lock_path = os.path.join(lock_dir, f"mine_palace_{palace_key}.lock")

     lf = open(lock_path, "w")
     acquired = False
@@ -344,7 +350,7 @@ def mine_global_lock():
                 acquired = True
             except OSError as exc:
                 raise MineAlreadyRunning(
-                    "another `mempalace mine` is already running"
+                    f"another `mempalace mine` is already running against {resolved}"
                 ) from exc
         else:
             import fcntl
@@ -354,7 +360,7 @@ def mine_global_lock():
                 acquired = True
             except BlockingIOError as exc:
                 raise MineAlreadyRunning(
-                    "another `mempalace mine` is already running"
+                    f"another `mempalace mine` is already running against {resolved}"
                 ) from exc
         yield
     finally:
@@ -373,6 +379,12 @@ def mine_global_lock():
         lf.close()


+# Backward-compatible alias (previous patch iteration used a single global
+# lock). Kept so third-party callers that imported it continue to work; new
+# code should use `mine_palace_lock(palace_path)` for per-palace scoping.
+mine_global_lock = mine_palace_lock
+
+
 def file_already_mined(collection, source_file: str, check_mtime: bool = False) -> bool:
     """Check if a file has already been filed in the palace.

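
The lock-file naming used by `mine_palace_lock` can be pulled out as a small pure function to show why different spellings of the same palace path land on the same lock. The function name `palace_lock_path` and the explicit `lock_dir` parameter are assumptions for illustration; the three lines of path handling follow the diff.

```python
import hashlib
import os


def palace_lock_path(palace_path, lock_dir):
    """Map a palace path to its lock file, normalizing the path first."""
    resolved = os.path.abspath(os.path.expanduser(palace_path))
    palace_key = hashlib.sha256(resolved.encode()).hexdigest()[:16]
    return os.path.join(lock_dir, f"mine_palace_{palace_key}.lock")
```

Note that `os.path.abspath` collapses `..` segments and trailing slashes but does not resolve symlinks, so two paths that reach the same palace through a symlink would presumably hash to different lock files.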