os.path.expanduser("~") reads HOME on POSIX but USERPROFILE on Windows;
the lock-body bound test was monkeypatching HOME only, so on
test-windows the lock file landed in the runner's real ~/.mempalace
and the tmp_path glob found nothing.
Patch USERPROFILE in addition to HOME, and read the body as bytes so
the byte-0 sentinel doesn't trip a UTF-8 decode warning. Assertion
shifts from line-count to size-bound (still detects unbounded growth
across re-acquires).
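A minimal sketch of that test setup, assuming a pytest-style `monkeypatch` object. The helper names `redirect_home` and `lock_body_size` are hypothetical and are not taken from the actual test file.

```python
import os
from pathlib import Path


def redirect_home(monkeypatch, home):
    """Point os.path.expanduser("~") at `home` on both POSIX and Windows."""
    monkeypatch.setenv("HOME", str(home))         # consulted on POSIX
    monkeypatch.setenv("USERPROFILE", str(home))  # consulted on Windows


def lock_body_size(lock_file):
    """Measure the lock body in bytes, so no text decoding is involved."""
    return len(Path(lock_file).read_bytes())
```

A test built on these helpers would re-acquire the lock several times and assert that `lock_body_size(...)` stays under a fixed bound.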
When a `mempalace mine` collided with another writer (live mcp_server,
another mine, anything taking mine_palace_lock), the operator saw a
generic "another `mempalace mine` is already running" message and the
CLI exited 0 — making the contention invisible to nohup or scripts
checking $?. The reporter ran a `nohup mempalace mine ... & disown`
and got a 200-byte log with only the auto-defaults warning, no clue
that an MCP server was holding the store.
palace.py: the lock file now records the holder's PID + first three
argv tokens on acquire. A failed acquire reads the file and surfaces
"palace <path> is held by PID N (mempalace mcp_server); wait for it
to finish or stop the holder before retrying" in the
MineAlreadyRunning message. Open mode changes from "w" to "a+" so the
prior holder's identity survives long enough to be read.
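The holder bookkeeping could look roughly like the sketch below. The function names, the one-line "PID argv..." body layout, and the fallback message are assumptions for illustration; the real palace.py may store the identity differently.

```python
import os
import sys


def write_holder(fh):
    """Overwrite the lock body with this PID and the first three argv tokens."""
    fh.seek(0)
    fh.truncate()  # keeps the body bounded across re-acquires
    fh.write(f"{os.getpid()} {' '.join(sys.argv[:3])}\n")
    fh.flush()


def read_holder(lock_path):
    """Return (pid, command) for the recorded holder, or None if unreadable."""
    try:
        with open(lock_path, "r", encoding="utf-8", errors="replace") as fh:
            pid, _, command = fh.readline().strip().partition(" ")
    except OSError:
        return None
    return (int(pid), command) if pid.isdecimal() else None


def contention_message(palace_path, lock_path):
    holder = read_holder(lock_path)
    if holder is None:
        # Hypothetical fallback wording when no identity was recorded.
        return f"palace {palace_path} is held by another process"
    pid, command = holder
    return (f"palace {palace_path} is held by PID {pid} ({command}); wait for "
            "it to finish or stop the holder before retrying")
```

Opening with `"a+"` matters here because `"w"` would truncate the file at open time, before the contending process has had a chance to read who holds the lock.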
miner.mine() now lets MineAlreadyRunning propagate. cmd_mine catches
it, prints the holder-aware message to stderr, and exits non-zero so
shell wrappers detect the contention.
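As an illustration of the new control flow, here is a sketch with a stand-in exception class and a simplified `cmd_mine` signature. The exit status `1` is an assumption; the commit only promises a non-zero status.

```python
import sys


class MineAlreadyRunning(RuntimeError):
    """Stand-in for the exception raised when the palace lock is held."""


def cmd_mine(mine, stderr=None):
    """Run `mine()` and translate lock contention into a non-zero status."""
    try:
        mine()
    except MineAlreadyRunning as exc:
        # The holder-aware message goes to stderr so it reaches nohup logs.
        print(f"mempalace: {exc}", file=stderr or sys.stderr)
        return 1  # assumed value; any non-zero status signals contention
    return 0
```

A real entry point would pass the return value to `sys.exit`, so that `$?` reflects the contention.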
Note: this is a behavior change for in-process callers that depended
on miner.mine() silently swallowing MineAlreadyRunning. The silent
swallow was the bug.
Closes #1264
#976 protects `mempalace mine`, but MCP/direct backend writers still call
ChromaCollection.add/upsert/update/delete without the palace lock. This
moves the lock boundary to the Chroma backend seam so all Chroma writes
share the same palace-level serialization, with a re-entrant guard for
miner paths that already hold the lock.
mine_palace_lock(palace_path) gains a per-thread re-entrant guard
(threading.local + pid-tag against fork inheritance) so
ChromaCollection write methods can take the lock without
self-deadlocking when called from inside miner.mine()'s outer hold.
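One way to build such a guard is sketched below. The names `reentrant_lock`, `acquire`, and `release` are hypothetical, and the real lock works on a lock file rather than injected callables.

```python
import os
import threading
from contextlib import contextmanager

_held = threading.local()


def _depths():
    """Per-thread hold counts, reset if the state was inherited through fork()."""
    if getattr(_held, "pid", None) != os.getpid():
        _held.pid = os.getpid()
        _held.depths = {}
    return _held.depths


@contextmanager
def reentrant_lock(key, acquire, release):
    depths = _depths()
    if depths.get(key, 0) > 0:
        # This thread already holds the lock: count the nesting, skip the OS lock.
        depths[key] += 1
        try:
            yield
        finally:
            depths[key] -= 1
        return
    acquire()
    depths[key] = 1
    try:
        yield
    finally:
        depths.pop(key, None)
        release()
```

The PID tag keeps a forked child from believing it holds a lock that only its parent acquired.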
ChromaCollection.__init__ accepts an optional palace_path; when set,
add/upsert/update/delete wrap their underlying chromadb call with
mine_palace_lock(palace_path). palace_path=None preserves the legacy
no-lock behaviour for direct callers and tests. ChromaBackend's
get_collection/create_collection pass palace_path through;
mcp_server._get_collection forwards _config.palace_path so all MCP
write tools inherit the wrapping.
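The opt-in wrapping can be pictured with a small stand-in class. `LockedCollection` and `lock_factory` are hypothetical names; the real `ChromaCollection` calls `mine_palace_lock` directly and wraps a chromadb collection.

```python
from contextlib import nullcontext


class LockedCollection:
    """Sketch: serialize writes under a palace lock, leave reads unlocked."""

    def __init__(self, inner, palace_path=None, lock_factory=None):
        self._inner = inner
        self._palace_path = palace_path
        self._lock_factory = lock_factory  # e.g. mine_palace_lock

    def _guard(self):
        # palace_path=None keeps the legacy no-lock behaviour.
        if self._palace_path is None:
            return nullcontext()
        return self._lock_factory(self._palace_path)

    def _write(self, method, **kwargs):
        with self._guard():
            return getattr(self._inner, method)(**kwargs)

    def add(self, **kwargs):
        return self._write("add", **kwargs)

    def upsert(self, **kwargs):
        return self._write("upsert", **kwargs)

    def update(self, **kwargs):
        return self._write("update", **kwargs)

    def delete(self, **kwargs):
        return self._write("delete", **kwargs)

    def query(self, **kwargs):
        return self._inner.query(**kwargs)  # read path takes no lock
```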
Tests: 5 new in tests/test_chroma_collection_lock.py covering opt-in,
writer-blocks-during-mine, re-entrant-inside-mine, two-process
serialization, and a source-level read-path-not-locked pin. Plus 1 new
+ 1 rewritten in tests/test_palace_locks.py for the re-entrant
semantics. 52 passed in 1.01s including the existing test_backends.py
regression suite.
Refs #1161.
Addresses the two actionable Copilot comments from the 2nd review pass.
tests/test_palace_locks.py (#7, #8)
multiprocessing.get_context("fork") is unavailable on Windows, so the
cross-process tests would crash the Windows CI runner. Added
`_get_mp_context()` that picks "spawn" on Windows and "fork" elsewhere.
Spawn re-imports the module in the child, and the child inherits os.environ
(including the monkeypatched HOME), which is all these tests need.
mempalace/palace.py (#10)
The per-palace lock key was computed from os.path.abspath(palace_path).
On Windows the filesystem is case-insensitive, so `C:\\Palace` and
`c:\\palace` would hash to different keys and two concurrent mines
could touch the same on-disk palace. Switched to
`os.path.normcase(os.path.realpath(...))` so:
* realpath resolves symlinks and `..` segments
* normcase folds case on Windows (no-op on POSIX)
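The effect of the normalization can be checked with a short sketch; `normalize_palace_path` is a hypothetical name for the expression quoted above.

```python
import os


def normalize_palace_path(palace_path):
    """Collapse symlinks, `..` segments and (on Windows) case differences."""
    return os.path.normcase(os.path.realpath(palace_path))
```

Two spellings of the same directory, such as `base/palace` and `base/sub/../palace`, then normalize to the same string and therefore to the same lock key.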
Testing
pytest tests/test_palace_locks.py tests/test_hooks_cli.py
tests/test_backends.py tests/test_cli.py
→ 98 passed, 0 failed.
Addresses the six Copilot review comments on the initial commit.
1) #6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend
The MCP server creates its palace collection directly via
`chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
not through `ChromaBackend.get_collection`. That path was missing the
`hnsw:num_threads=1` metadata, so the primary crash surface for #974
and #965 was untouched by the original patch. Fixed by passing
`hnsw:num_threads=1` at the mcp_server create site too. Documented
in a code comment that the setting is only honored at creation
time — existing palaces created before this fix still need a
`mempalace nuke` + re-mine to gain the protection.
2) #3 — mine_global_lock over-serialized mines across unrelated palaces
Replaced the single global lock file `mine_global.lock` with a
per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
(`mine_palace_<hash>.lock`). Mines against the same palace still
collapse to a single runner (the correctness boundary), but mines
against *different* palaces are now free to run in parallel.
`mine_global_lock` is kept as a backward-compatible alias for
`mine_palace_lock` so any external callers that imported the
previous name keep working.
3) #1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired
`subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
palaces. The previous `except OSError` clause didn't catch it, so
the hook could raise and fail to emit any JSON decision — leaving
the harness without a block/passthrough signal. Fixed by catching
`(OSError, subprocess.TimeoutExpired)` together and always falling
through to the block decision so the hook reliably emits a response.
4) #2 + #4 — tests
- tests/test_hooks_cli.py: added
`test_precompact_first_two_attempts_block`,
`test_precompact_passes_through_after_cap`, and
`test_precompact_counter_is_per_session` to lock in the #955
deadlock fix.
- tests/test_palace_locks.py (new): covers `mine_palace_lock`
single-acquire, reuse-after-release, cross-process serialization
on the same palace, non-interference across different palaces,
path normalization, and the `mine_global_lock` back-compat alias.
5) #5 — known limitation, documented but not auto-fixed
Copilot suggested detecting collections missing `hnsw:num_threads=1`
and calling `collection.modify(metadata=...)` to retrofit existing
palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
replaces metadata rather than merging, and re-passing
`hnsw:space="cosine"` then raises `ValueError: Changing the
distance function of a collection once it is created is not
supported currently.` The HNSW runtime configuration
(`configuration_json`) also does not expose `num_threads` in
chromadb 1.5.x, so the flag appears to be read only at creation
time. Rather than paper over the limitation with a best-effort
`modify` that silently drops `hnsw:space`, documented in the
mcp_server comment that pre-existing palaces need a
`mempalace nuke` + re-mine to gain the protection. Fresh palaces
are always protected.
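The timeout handling from item 3 can be sketched as follows. `run_precompact` and the shape of the returned decision are assumptions for illustration, not the hook's actual JSON schema.

```python
import subprocess


def run_precompact(cmd, timeout=60):
    """Run the hook's subprocess and always return a decision."""
    try:
        subprocess.run(cmd, timeout=timeout, check=False, capture_output=True)
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a slow palace must not prevent a response.
        pass
    # Assumed decision payload; the real hook prints its own JSON.
    return {"decision": "block"}
```

`subprocess.TimeoutExpired` derives from `subprocess.SubprocessError`, not `OSError`, which is why the earlier `except OSError` clause let it escape.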
Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
- Different palaces → both complete in parallel ✓
- Same palace → one completes, the other exits with
"another `mine` is already running against <palace> — exiting
cleanly." ✓
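The parallel-versus-serialized behaviour above follows from how the lock file is named in item 2. A sketch of that naming, assuming the full hex digest is used and that the lock directory is passed in (both are assumptions; `mine_palace_lock_path` is a hypothetical helper):

```python
import hashlib
import os


def mine_palace_lock_path(palace_path, lock_dir):
    """Map a palace to its own lock file: mine_palace_<hash>.lock."""
    # This commit keys on abspath; a later commit in this log switches the
    # key to normcase(realpath(...)).
    key = os.path.abspath(palace_path)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, f"mine_palace_{digest}.lock")
```

Different palaces hash to different lock files and so never contend, while any spelling that `abspath` collapses to the same path shares one lock.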