Three conflicts, all from develop landing #1285/#1310/#1312 after this
branch was authored:
- mempalace/cli.py: keep both import sets — this branch's
maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's
RebuildCollectionError / _close_chroma_handles / _extract_drawers /
_rebuild_collection_via_temp added in #1285.
- mempalace/repair.py: keep this branch's
maybe_repair_poisoned_max_seq_id_before_rebuild definition; use
develop's rebuild_index signature with the collection_name parameter
added in #1312. Normalized print indent to 2 spaces matching the
rest of the file.
- tests/test_repair.py: keep both this branch's max_seq_id preflight
tests and develop's rebuild_from_sqlite + configured-collection-name
tests; they exercise distinct code paths and don't overlap.
Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
The collection_name plumbing rebase produced a few unformatted blocks
in test_mcp_server.py and test_searcher.py; bringing them in line with
the 0.4.x CI pin so test-windows / lint stay green.
Three small changes that together address the failure modes in #1296:
1. Add pnpm-lock.yaml and yarn.lock to SKIP_FILENAMES, mirroring the
existing package-lock.json rule. A 24K-line pnpm-lock.yaml produced
~1124 chunks in one batch and tripped onnxruntime bad_alloc on
Windows; pnpm/yarn lockfiles are no more useful to mine than npm's.
2. Skip any file that produces more than MAX_CHUNKS_PER_FILE (500)
chunks, with a clear log line. Catches the broader class — generated
CSV/JSON, build artifacts, etc. — that the named-file SKIP list will
never fully cover. The cap is conservative (500 chunks * 800 chars ≈
400 KB of source) so legitimate hand-written content still mines.
3. Print a partial-progress summary on any exception in _mine_impl, not
just KeyboardInterrupt, then re-raise. Without this, an arbitrary
exception (ONNX bad_alloc, chromadb HNSW error, OS fault) propagates
silently — the operator sees only the last progress line and assumes
the mine succeeded. The new path mirrors the KeyboardInterrupt
summary (files_processed, drawers_filed, last_file) plus the
exception type and message, then re-raises so the original traceback
surfaces and the exit code is non-zero.
Tests cover: SKIP_FILENAMES contents, the chunk-cap path returning
(0, room) with no upserts, and the new mine-aborted summary surfacing
both the partial counters and the exception class.
`backend: ChromaBackend | None = None` evaluates the X | None union
eagerly at function-definition time, which Python 3.9 rejects with
TypeError: unsupported operand type(s) for |: 'ABCMeta' and 'NoneType'
since the new union syntax is 3.10+. Quoting matches the existing
forward-reference style in repair.py (sqlite_drawer_count, etc.) and
defers evaluation, restoring 3.9 compatibility.
Develop (post-#1162 lock-plumbing era) refactored the per-open quarantine
pass into ChromaBackend._prepare_palace_for_open. This branch's
inline-expansion form added quarantine_invalid_hnsw_metadata as a third
check, plus a "discard from _quarantined_paths on inode swap" guard so
re-opens against a different physical DB re-run quarantine.
Resolution merges both:
- _prepare_palace_for_open now also calls quarantine_invalid_hnsw_metadata,
gated by the same _quarantined_paths set.
- _client keeps the inode_changed -> _quarantined_paths.discard() guard
before calling the helper, so a fresh inode triggers a fresh pass.
- make_client collapses to a single _prepare_palace_for_open() call.
- test_backends.py keeps both the pickle (#1285) and shutil (develop)
imports — both are used.
The helper opened a chromadb PersistentClient via ChromaBackend and never
closed it, leaving rust-side SQLite/HNSW file locks alive after the
helper returned. On Windows that blocks the in-place archive rename
inside rebuild_from_sqlite with WinError 32 on data_level0.bin,
causing test_rebuild_from_sqlite_in_place_archives_when_opted_in and
test_rebuild_from_sqlite_raises_on_upsert_failure to fail in the
test-windows CI job. No test consumes the returned collection, so
closing the backend in a try/finally is safe and drops the return.
Earlier commit fixed ruff lint but missed the formatter check.
This applies `ruff format` — adds standard PEP8 blank lines between
functions, splits one inline list. No behavior change.
Verified: both `ruff format --check` and `ruff check` pass cleanly.
Tool still renders correctly.
Addresses CI lint feedback on PR #1391. No behavior change.
- Split `import json, sys` into separate lines (E401)
- Split chained `print(...); sys.exit(1)` into two lines (E702, two occurrences)
- Split inline `if ts: stamps.append(ts)` into two lines (E701)
Verified: `ruff check tools/render_jsonl.py` reports "All checks passed!"
Tool still renders correctly (3 turns from a real JSONL test, identical output to pre-fix).
Adds a brief [!IMPORTANT] callout at the top of the README pointing
users to the urgent announcement at #1388. Claude Code auto-deletes
local JSONL transcripts after 30 days; users without the auto-save
hooks wired are losing transcript data off the rolling window.
Ships 4 small standalone tools at tools/:
- backup_claude_jsonls.sh — rsync ~/.claude/projects/ to a safe folder
- render_jsonl.py — convert JSONL transcripts to readable text
- find_orphan_claude_jsonls.sh — scan backup locations for orphan
Claude Code transcripts (multi-line shape detection + topic preview)
- save.md — Claude Code slash command for manual /save into MemPalace
Tools verified by independent agent against v3.3.4 source.
Read-only on user data. POSIX bash + Python stdlib only.
A transient chromadb exception inside `_get_collection` was swallowed by
the bare `except Exception: return None`, leaving every subsequent tool
call hitting the same poisoned cache silently. The fix wraps the body
in a `for attempt in range(2)` loop: on attempt 0 failure, log via
`logger.exception(...)` and clear `_client_cache` / `_collection_cache`
/ `_metadata_cache` so the next iteration forces `_get_client()` to
rebuild from scratch — that path now re-runs `quarantine_stale_hnsw`
(per #1322), so the second attempt heals the common stale-handle case
automatically. If both attempts fail, return `None` (matches the prior
contract for permanent failures).
Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:
- `test_get_collection_retries_once_on_exception` — first attempt raises
via a monkeypatched `_get_client`, second attempt succeeds; assert the
caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures` — both attempts
fail, assert we exhaust the loop and return None (no infinite retry).
Surgical extraction from PR #1286, which carried the same fix idea
(plus a fork-sync bundle that couldn't be merged); credit to the
original author below.
Co-authored-by: Jeffrey Hein <jp@jphein.com>
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on #1310:
- repair.py: drawers collection name now resolves from
MempalaceConfig().collection_name via _drawers_collection_name() (closets
stays fixed by design — AAAK index references drawer IDs by string).
Lines up with the broader configured-collection work in #1312 so that
PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
_rebuild_one_collection so a Chroma "Collection already exists" failure
surfaces as RebuildPartialError with archive_path, not an unstructured
exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
with backend.close() so PersistentClient handles to dest_palace are
released on every exit path. The from-sqlite path post-dates #1285's
lifecycle hardening of the legacy rebuild, so this needed its own
cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
rebuild_from_sqlite returns {} (validation refusal sentinel), so
unattended scripts/CI distinguish "invalid inputs" from a successful
rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
now asserts every backing segment is scope='METADATA', locking in the
segment-layout assumption against future regressions that point the
JOIN at the VECTOR segment.
New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `--mode legacy` and the inline `cli.cmd_repair` rebuild path
call `Collection.count()` as their first read — the same call that
raises `chromadb.errors.InternalError: Failed to apply logs to the
hnsw segment writer` on the corruption class reported in #1308.
Repair would print "Cannot recover — palace may need to be re-mined
from source files" even though the underlying SQLite tables were
fully intact.
The new `--mode from-sqlite` reads `(id, document, metadata)` rows
directly from `chroma.sqlite3` via `segments` → `embeddings` →
`embedding_metadata` joins, never opens a chromadb client against
the corrupt palace, and re-upserts everything into a fresh palace.
- `--source PATH` extracts from a corrupt palace already moved aside
- `--archive-existing` handles the in-place case by renaming the
existing palace to `<palace>.pre-rebuild-<timestamp>` first
- Partial-rebuild failures raise `RebuildPartialError` with the
archive path so users can recover; CLI exits non-zero
- In-place mode calls `SharedSystemClient.clear_system_cache()` to
drop chromadb's process-wide System registry (cross-palace use
does not, to limit blast radius for library callers)
- Source validation runs before any destructive moves
Verified end-to-end recovering a 52,300-row real-world corrupt
palace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- B904: chain OSError/collection errors with "raise ... from e" in
normalize.py and searcher.py so the original traceback is preserved.
- B007: rename unused loop variables to _name in dedup, dialect, layers,
and room_detector_local.
- S110/S112: replace bare "try/except/pass" and "try/except/continue"
with logger.debug(..., exc_info=True) in mcp_server, searcher,
palace, palace_graph, miner, convo_miner, and fact_checker so
background failures are observable without changing behaviour.
A module-level logger ("mempalace_mcp", matching mcp_server/searcher)
is added to the five files that didn't already have one. Configured
ruff checks (E/F/W/C901) and ruff --select B, S110, S112 all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLMConfig accepted any URL scheme from LLM_ENDPOINT / --endpoint,
so a misconfigured endpoint such as file:///etc/passwd would be
passed straight to urllib.request.urlopen. Validate the scheme at
construction time and raise ValueError on anything other than
http/https, preserving the "privacy by architecture" guarantee.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without ensure_ascii=False, non-ASCII characters (e.g. Chinese) in tool
results and JSON-RPC responses are escaped as \uXXXX, which causes
downstream MCP clients to receive escaped text instead of the original
characters. This affects all platforms, not just Windows.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the MCP client sends a malformed or null top-level request, prevent the AttributeError on request.get() by explicitly validating that the request is a dictionary. Returns standard JSON-RPC Error -32600 (Invalid Request) instead of crashing the server.
``search_memories`` computes ``effective_dist = dist - boost`` where
``boost`` can be as large as ``CLOSET_RANK_BOOSTS[0] == 0.40`` for a
rank-0 closet hit. When the raw drawer distance is small — any
near-exact match — the subtraction goes negative.
Two downstream effects:
1. Line 418 returns ``round(max(0.0, 1 - effective_dist), 3)`` as
``similarity``. With ``effective_dist = -0.30`` that yields
``similarity = 1.30``, outside the documented ``[0, 1]`` range.
The ``max(0.0, ...)`` only prevents negative similarities; it does
not cap above 1.
2. Line 427 stores ``_sort_key: effective_dist`` and line 435 sorts
``scored`` ascending by that key. A negative key drops *below* the
rest, so the strongest hybrid matches end up sorting after weaker
ones — ranking inversion under the exact conditions hybrid retrieval
is supposed to serve best.
Clamp ``effective_dist`` to the valid cosine-distance range ``[0, 2]``.
The boost still wins (closet-backed hit still ranks first), it just no
longer flips the order.
Test added: mock drawer_col (base dist 0.08 / 0.35 for two sources) +
closet_col (rank-0 closet for the 0.08 source) → assert all hits have
``0 <= similarity <= 1`` and ``0 <= effective_distance <= 2``, and that
the closet-boosted source still ranks first.
Relationship to other PRs:
* **#988** clamps the output ``similarity`` alone. That does not fix
the sort-key inversion or the invalid ``effective_distance`` in the
returned dict. This PR clamps at the arithmetic source so both
downstream users of the value stay in range.
* Orthogonal to **#979** (``tool_check_duplicate`` negative similarity).
ChromaDB can return None entries in metadatas/documents lists under
partial-flush, mid-delete, upgrade-boundary, and interrupted-mine
states. Add `meta = meta or {}` and `doc = doc or ""` guards in the
three result loops (search display, closet hybrid, drawer scored) so
.get() and .strip() calls never crash on None.
Fixes#1007, #1011
The alias was placed below an explanatory comment block introduced by
#1305, which trips ruff E402 (module-level import not at top of file).
Moved next to the existing 'from mempalace.hooks_cli import (...)' line.
CI lint went red on develop after #1305 merged with the failing check;
this re-greens it so subsequent PRs do not inherit the failure.