mempalace

Author	SHA1	Message	Date
Pim Messelink	9e53228ea3	test: update test_cli assertions for mempalace-mcp entry point Three assertions in test_mcp_command_* were still checking for the old `python -m mempalace.mcp_server` output string. Update to match the new `mempalace-mcp` command printed by cmd_mcp(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 01:26:47 -03:00
Pim Messelink	982d421510	fix: update mempalace mcp command to use mempalace-mcp entry point cmd_mcp() in cli.py was still printing `python -m mempalace.mcp_server` as the setup command. Update to use the mempalace-mcp console entry point added in the previous commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 01:26:47 -03:00
Pim Messelink	67a067701e	fix: use mempalace CLI in top-level hook scripts hooks/mempal_precompact_hook.sh and hooks/mempal_save_hook.sh used python3 -m mempalace mine which fails when mempalace is installed via pipx or uv. Switch to the mempalace CLI entry point which pipx/uv put on PATH. Also removes the now-unused PYTHON variable from mempal_save_hook.sh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 01:26:47 -03:00
Pim Messelink	be89e49add	fix: use mempalace CLI in hook scripts instead of python3 -m Hook scripts used `python3 -m mempalace` which fails when mempalace is installed via pipx or uv. Using the `mempalace` CLI command directly works for all installation methods. Dev users running from source should use `pip install -e .` as documented in CONTRIBUTING.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 01:26:35 -03:00
Pim Messelink	9f5b8f5fd6	fix: add mempalace-mcp console entry point for pipx/uv compatibility The MCP server config used `python -m mempalace.mcp_server` which fails when mempalace is installed via pipx or uv, since the system python cannot find the module in the isolated venv. Adding a `mempalace-mcp` console_scripts entry point ensures the MCP server works regardless of installation method (pip, pipx, uv, conda). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 01:26:00 -03:00
Igor Lins e Silva	4fb0ee57e7	Merge pull request #942 from fatkobra/fix-hooks-resolve-claude-plugin fix(hooks): resolve Claude plugin hook runner cross-platform	2026-04-21 01:21:36 -03:00
Igor Lins e Silva	1a180cd7cb	Merge pull request #1051 from itfarrier/feat/i18n-belarusian feat(i18n): add Belarusian	2026-04-21 01:09:54 -03:00
Igor Lins e Silva	6d42f61e64	Merge pull request #1001 from mvalentsev/feat/i18n-de-es-fr-entity feat(i18n): add entity detection to German, Spanish, and French locales	2026-04-21 00:55:33 -03:00
Igor Lins e Silva	2a5914b630	Merge pull request #945 from lmanchu/feat/zh-entity-detection feat(i18n): add Traditional + Simplified Chinese entity detection	2026-04-21 00:55:04 -03:00
Dzmitry Padabed	54c314d8d9	feat(i18n): add Belarusian	2026-04-20 21:00:39 +03:00
fatkobra	0b316d4053	test: normalize wrapper script path for bash on Windows	2026-04-19 10:34:11 +02:00
Ben Sigman	32ec74d8eb	Merge pull request #1023 from jphein/pr/pid-file-guard fix(hooks): PID file guard prevents stacking mine processes	2026-04-18 23:33:47 -07:00
Ben Sigman	caf503f442	Merge pull request #1000 from jphein/fix/quarantine-stale-hnsw feat(backends): quarantine_stale_hnsw — recover from HNSW/sqlite drift (closes #823)	2026-04-18 23:28:00 -07:00
Ben Sigman	62439e1368	Merge pull request #681 from jphein/fix/unicode-checkmark fix: replace Unicode checkmark with ASCII for Windows encoding (#535)	2026-04-18 23:27:57 -07:00
jp	dfba247454	fix: cross-platform PID check — os.kill(pid, 0) TERMINATES on Windows Real bug surfaced on CI for this PR. On POSIX, os.kill(pid, 0) is the canonical no-op existence probe. On Windows, Python's os.kill maps to TerminateProcess(handle, sig), which terminates the target with exit code sig. os.kill(pid, 0) therefore kills the target with exit code 0 — silently destroying our mine child (or, as happened in test_mine_already_running_live_pid, the pytest process itself). Fix: split into _pid_alive(pid) helper with a Windows branch using ctypes.windll.kernel32.OpenProcess + GetExitCodeProcess. PROCESS_QUERY_LIMITED_INFORMATION opens a handle only if the PID exists; STILL_ACTIVE (259) distinguishes running from exited processes. No new dependencies — stdlib ctypes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:30:17 -07:00
jp	fe6b8899bc	fix: broaden _mine_already_running catch — Windows os.kill raises plain OSError On Windows, os.kill(bogus_pid, 0) raises OSError[WinError 87] "The parameter is incorrect" — NOT ProcessLookupError. The old except tuple missed it, so test_mine_already_running_dead_pid failed on Windows CI. Catching OSError covers ProcessLookupError + PermissionError + FileNotFoundError on POSIX and WinError 87 on Windows. ValueError still guards the int() parse. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:13:37 -07:00
jp	a6b6e55247	fix: PID file guard prevents stacking mine processes Every stop hook fire spawned a new background `mempalace mine` via subprocess.Popen with no dedup — 4 concurrent mines at ~770% CPU observed in production. Add `_mine_already_running()` (reads `hook_state/mine.pid`, uses `os.kill(pid, 0)` as an existence check) and `_spawn_mine()` (writes the child PID to the lock file after Popen returns). `_maybe_auto_ingest` bails early when the guard reports True. Tests: 4 new unit tests for `_mine_already_running` (no file, dead PID, live PID using `os.getpid()`, corrupt file), 1 new test covering the skip-when-running branch of `_maybe_auto_ingest`, and existing spawn tests patched to redirect `_MINE_PID_FILE` into tmp_path so they don't touch the real state dir. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-18 20:27:56 -07:00
jp	0c38deaab5	feat(backends): quarantine_stale_hnsw — recover from HNSW/sqlite drift Add a helper that renames HNSW segment directories whose `data_level0.bin` is significantly older than `chroma.sqlite3`. Drift between the on-disk HNSW graph and the live embeddings table is the root cause of a segfault class where the Rust graph-walk dereferences dangling neighbor pointers for entries in the metadata segment that no longer exist in the HNSW index, crashing in a background thread on `count()` or `query()`. Issue #823 describes the same drift as a silent-staleness symptom (semantic search returns stale results after `add_drawer` because `data_level0.bin` lags the sqlite metadata under the default `sync_threshold=1000`). Under heavier load or after an interrupted write, the same drift can escalate from "silent stale results" to "SIGSEGV on next open," which is the failure mode observed at neo-cortex-mcp#2 (chromadb 1.5.5, Python 3.12) and acknowledged at chroma-core/chroma#2594. On one 135K-drawer palace where `index_metadata.pickle` claimed 137,813 elements against 135,464 rows in sqlite (2,349-entry drift), fresh Python processes crashed in `col.count()` 17/20 times; after renaming the segment dir out of the way and letting ChromaDB rebuild lazily, the same 20-run check went to 0 crashes. The recovery path #823 suggests (export / recreate / reimport) is heavy — it re-embeds every drawer. This helper is lighter: rename the segment dir so ChromaDB reopens without it, and the indexer rebuilds lazily on the next write. The original directory is renamed (not deleted) so the operator can recover if the heuristic misfires. If `chroma.sqlite3` is more than `stale_seconds` (default 3600) newer than the segment's `data_level0.bin`, the segment is considered suspect. One hour is deliberately conservative — normal HNSW flush cadence is seconds to minutes, so an hour of drift implies a crashed mid-write, not routine lag. 
- Additive: exposes `quarantine_stale_hnsw(palace_path, stale_seconds)` as a helper. Not wired into `_client()` / startup on this PR — the goal is to land the primitive first so operators and higher layers can opt in. A follow-up could call it automatically on palace open behind an env var or config flag. - Closes #823 by giving operators a first-class recovery path without having to install `chromadb-ops` or re-mine. Four new tests in `tests/test_backends.py`: - renames drifted segment, preserves original files under `.drift-TS` suffix - leaves fresh segments alone - no-op on missing palace path / missing `chroma.sqlite3` - skips already-quarantined (`.drift-` suffixed) directories `pytest tests/test_backends.py` → 11 passed. `ruff check` / `ruff format --check` — clean.	2026-04-18 18:04:05 -07:00
Igor Lins e Silva	109d7f267c	Merge pull request #990 from MemPalace/docs/rfc-source-adapter-plugin-spec docs: RFC 002 — Source adapter plugin specification	2026-04-18 19:12:20 -03:00
Igor Lins e Silva	66090b2bcb	Merge pull request #1014 from MemPalace/refactor/rfc-002-sources-scaffolding refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext	2026-04-18 18:44:52 -03:00
Igor Lins e Silva	8a130fc509	Merge pull request #1012 from MemPalace/docs/use-real-claude-projects-path-996 docs: use real ~/.claude/projects/ path in first-run help and README (#996)	2026-04-18 18:43:53 -03:00
Igor Lins e Silva	7af3bfae8f	Merge pull request #1010 from MemPalace/fix/chromadb-1-5-4-py-3-13-compat-via-581 fix: upgrade chromadb to >=1.5.4 for Python 3.13/3.14 compatibility + fix 1.5.x queue-stall (closes #1006)	2026-04-18 18:43:34 -03:00
Ben Sigman	64266695b5	Merge pull request #1013 from MemPalace/fix/layer3-search-raw-none-guard-1011 fix: guard Layer3.search_raw against None doc/meta from ChromaDB (#1011)	2026-04-18 13:41:55 -07:00
Ben Sigman	1b89b49b78	Merge pull request #999 from jphein/fix/searcher-none-metadata fix(searcher): guard against None metadata in CLI print path	2026-04-18 13:41:52 -07:00
bensig	49e9e04a12	fix: guard Layer3.search_raw against None doc/meta from ChromaDB (#1011) Same class of bug as #1007: ChromaDB's query() can return None in the documents and metadatas arrays when a drawer's HNSW vector entry exists but its metadata/document rows haven't been materialized. The code in Layer3.search_raw (mempalace/layers.py) calls meta.get("wing", ...), meta.get("room", ...), meta.get("source_file", ...) directly without null safety, so it raises: AttributeError: 'NoneType' object has no attribute 'get' Two-line defensive coercion matching the pattern in #1009 / PR #999 for searcher.py: meta = meta or {}, doc = doc or "". The hit still appears with its real distance; source/wing/room fall back to their default values where the metadata row is missing. Frequently hit on chromadb 1.5.x (root cause #1006). Even after the chromadb floor lands (#1010), partial-state results remain possible during interrupted mines and schema upgrade boundaries, so the guard is worth having on its own. Fixes #1011.	2026-04-18 13:30:57 -07:00
bensig	a2da0d6ef4	docs: use real ~/.claude/projects/ path in first-run help and README (#996) The CLI help text and README told first-time users to mine from ~/chats/, a path that doesn't exist on any machine. The real location where Claude Code writes session JSONL is ~/.claude/projects/<escaped-project-path>/. Updates three user-visible strings: - mempalace/cli.py line 7 ("Two ways to ingest" block) - mempalace/cli.py line 25 (Examples block) - README.md line 58 (Quickstart) Website guides (website/guide/mining.md, getting-started.md) still reference ~/chats/ for ChatGPT/Slack export scenarios where that remains a valid placeholder. Those can be a separate PR if the maintainers want to tilt the website examples toward Claude Code specifically. Fixes #996.	2026-04-18 13:30:50 -07:00
Igor Lins e Silva	89904ed03f	fix(sources): address Copilot review on #1014 Five findings from the automated review, fixed with targeted tests where behavior changed: 1. Transformation Protocol (transforms.py). The registry mixed a bytes-to-str transform (utf8_replace_invalid) with str-to-str transforms under a single Callable[..., str] type, misleading static type checkers and adapter authors. Introduced a Transformation Protocol with __call__(data: bytes|str) -> str and retyped the registry + get_transformation return. 2. Drawer-id collision risk (context.py). Switched _build_drawer_id from sha1[:16]=64 bits to sha256[:24]=96 bits. 64 bits sits uncomfortably close to the birthday bound for palace-sized corpora; 96 bits keeps the collision probability negligible while preserving the existing <prefix>_<chunk> layout adapters rely on. 3. Fresh-schema KG columns (knowledge_graph.py). source_drawer_id and adapter_name now live in the canonical CREATE TABLE so new palaces don't take an ALTER round-trip on first open. _migrate_schema stays for legacy palaces (SQLite has no ADD COLUMN IF NOT EXISTS, so PRAGMA introspection is still needed there). 4. Identity-shim comment (transforms.py). Comment said the adapter-specific transforms "raise if invoked without adapter context" but they return the input unchanged. Updated the comment to match the actual identity-shim behavior Copilot suggested. 5. Test docstring (test_sources.py). Comment mentioned default_factory=list but SourceRef.options uses default_factory=dict. Corrected. Tests: 1020 passed (up from 1018), +2 new tests for the sha256 id shape and the fresh-schema column presence on new palaces.	2026-04-18 17:17:50 -03:00
jp	3f0cfd5ed4	fix(mcp): guard tool_status/list_wings/list_rooms/get_taxonomy against None metadata Four more MCP handlers iterate a metadata list and call m.get(...) unconditionally. When the cache contains a None entry (drawers with no metadata, common on older mining paths), the try block catches the AttributeError and marks the response "partial: true" with an error message — visible as {"error": "'NoneType' object has no attribute 'get'", "partial": true} returned from mempalace_status even though the palace data is otherwise fetchable. Same m = m or {} guard we applied to searcher.py (d3a2d22, a51c3c2) and miner.status() (66f08a1). None-metadata drawers now roll up under the existing "unknown" fallback bucket instead of poisoning the response with a misleading partial flag. Regression test: mock the metadata cache with a None in the middle, assert tool_status returns clean counts and no error/partial fields. Verified the test fails without the guard. 998 tests pass.	2026-04-18 12:38:23 -07:00
Legion345	fa7fe1d51f	chromadb at <2 to guard against breaking changes in future major versions	2026-04-18 12:05:46 -07:00
Legion345	d0c8ecd847	fix: upgrade chromadb to >=1.5.4 for python 3.13/3.14 compatibility	2026-04-18 12:05:46 -07:00
Igor Lins e Silva	552e9927b7	refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext Lands the read-side contract so third-party adapter authors (@Perseusxrltd, @JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a stable target matching what RFC 001 §10 landed on the write side in #995. Scope (this PR): - mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes: SourceNotFoundError, AuthRequiredError, AdapterClosedError, TransformationViolationError, SchemaConformanceError (§2.7). Class-level identity contract: name / adapter_version / capabilities / supported_modes / declared_transformations / default_privacy_class (§2.1, §1.4, §1.5, §6). - mempalace/sources/transforms.py: reference implementations of the 13 reserved transformations (§1.4) — utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop — as pure functions, plus identity shims for the six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) that the conversations adapter will override when migrated. get_transformation(name) resolves by reserved name. - mempalace/sources/registry.py: entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") + explicit register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order; crucially, no auto-detection on the read side (§3.3 is explicit about that — user intent never inferred from on-disk artifacts). 
- mempalace/sources/context.py: PalaceContext facade (§9) bundling the drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks core passes into adapter.ingest(). upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1. skip_current_item() signals laziness; emit() dispatches to hooks and swallows hook exceptions. - mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). Backwards-compatible column migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_info then ALTER TABLE ADD COLUMN), matching the pattern used for any new palace-side provenance fields. - pyproject.toml: mempalace.sources entry-point group declared. Empty on the first-party side for now (miners migrate in a follow-up); the group being present means third-party packages can begin registering today. Out of scope (explicit follow-ups): - miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into the adapter (§9). Larger refactor; lands separately. - convo_miner.py + normalize.py → mempalace/sources/conversations.py. The format-detection if-chain in normalize.py becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping). - Closet post-step wired into the conversations adapter (§1.7). - CLI --source flag + --mode deprecation alias (§3.3). - MCP mempalace_mine tool source parameter. - AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-trip and declared-transformation round-trip tests. - Privacy-class floor enforcement (§6.2); depends on #389 for secrets_possible scanning. 
Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering the ABC instantiation rules, typed records, all reserved transformations, the registry register/get/unregister surface, PalaceContext upsert + skip + emit semantics, and both the new KG provenance kwargs and backwards-compatible legacy-schema migration. Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 cleanup, the sibling PR on the write side).	2026-04-18 16:05:32 -03:00
Igor Lins e Silva	2b9f17c401	Merge pull request #995 from MemPalace/refactor/rfc-001-cleanup refactor(backends): RFC 001 §10 cleanup — typed results, PalaceRef, registry	2026-04-18 15:56:12 -03:00
jp	7690574dde	fix(searcher): guard API path + closet loop against None metadata too Per Copilot review on the CLI-only PR (#999): search_memories() has the same vulnerability in two additional spots, since ChromaDB can return None entries in the inner metadatas list for either the drawer query or the closets query. Without guards, the API path crashes with: AttributeError: 'NoneType' object has no attribute 'get' at either `cmeta.get("source_file", "")` in the closet boost lookup or `meta.get("source_file", "") or ""` in the drawer scoring loop. Applies the matching `meta = meta or {}` / `cmeta = cmeta or {}` guard at both sites and adds an API-path regression test that mocks a drawer query result with a None metadata entry and asserts both hits render, the None-metadata hit with the existing `"unknown"` sentinel values the scoring loop already writes for missing keys. Verified both the new API test and the existing CLI test fail without the guards (AttributeError) and pass with them.	2026-04-18 10:37:05 -07:00
jp	feba7e8043	fix(miner): same None-metadata guard for status() histogram loop `status()` walks `col.get(include=["metadatas"])` and buckets each drawer into a `wing_rooms[wing][room]` histogram. The same ChromaDB return shape fixed in the search print path — `None` entries in the `metadatas` list for drawers with no stored metadata — crashes the status command with: AttributeError: 'NoneType' object has no attribute 'get' Applies the matching ``m = m or {}`` guard so None-metadata drawers roll up under the existing `?/?` fallback bucket instead of killing the command mid-tally. Reproduced on a 135K-drawer palace where two drawers had `metadata=None`; both now show under `WING: ? / ROOM: ?` in the tally while the command prints the full histogram as designed. Adds a regression test that feeds `status()` a fake collection whose `get()` returns a `None` in the middle of the metadatas list and asserts both the fallback bucket and the real wing render.	2026-04-18 10:26:11 -07:00
jp	a3c778210b	fix(searcher): guard against None metadata in CLI print path `col.query(...)` can return `None` entries in the inner ``metadatas`` list for drawers whose metadata was never set (older palaces, rows written outside the normal mining path). The CLI `search()` function would render earlier results successfully and then crash mid-loop with: AttributeError: 'NoneType' object has no attribute 'get' at ``searcher.py:286`` — ``meta.get("source_file", "?")``. The user sees partial output followed by a traceback, with no indication of which drawers rendered OK and which were skipped. Guard with ``meta = meta or {}`` inside the loop so entries with missing metadata fall back to the existing ``"?"`` defaults instead of crashing, matching the hit dict assembly in ``search_memories()`` which already uses ``meta.get("wing", "unknown")`` etc. against the same data. Adds a regression test that mocks a ChromaDB result with a ``None`` metadata entry in the middle of the inner list and asserts both result blocks render to stdout.	2026-04-18 10:00:59 -07:00
mvalentsev	5189e0d652	test(i18n): add entity section smoke tests and schema invariants	2026-04-18 21:58:11 +05:00
mvalentsev	118cbe40bd	feat(i18n): add entity detection to French locale	2026-04-18 21:56:45 +05:00
mvalentsev	e17f219be8	feat(i18n): add entity detection to Spanish locale	2026-04-18 21:54:39 +05:00
Igor Lins e Silva	efaa39bea9	test(backends): dedup update-length-validation tests `24bf97b` (network-download fix) and my earlier Copilot-review commit both added tests for the same ValueError. Keep the broader one that covers both 'documents length' and 'metadatas length' mismatches; drop the narrower duplicate.	2026-04-18 13:53:46 -03:00
mvalentsev	7006a6b42d	feat(i18n): add entity detection to German locale	2026-04-18 21:53:11 +05:00
Igor Lins e Silva	61dd6e7d9c	test(backends): fix Windows file-lock in cache-invalidation test PermissionError [WinError 32] on Windows when Path.unlink() runs while chromadb.PersistentClient still holds a handle on chroma.sqlite3. Rewrite test_chroma_cache_invalidates_when_db_file_missing to prime backend._clients/_freshness with a sentinel object instead of opening a real PersistentClient, so the unlink runs against an unheld file. The assertion is also corrected: after invalidation, ChromaBackend's _client rebuilds a fresh PersistentClient which re-creates chroma.sqlite3 and re-stats it, so freshness ends up at the post-rebuild stat (not (0, 0.0) as the assertion previously expected). The meaningful invariant is "freshness advanced past the pre-unlink value AND the sentinel was replaced", which the test now checks. Ref: Windows CI failure on 995.	2026-04-18 13:52:56 -03:00
Igor Lins e Silva	74a31b70d3	Merge pull request #998 from MemPalace/fix/silent-transcript-drop Fix silent transcript drop: .jsonl ingestion + 500 MB cap + tandem sweeper	2026-04-18 13:38:02 -03:00
copilot-swe-agent[bot]	24bf97bb65	fix(tests): avoid ONNX network download in update-length validation tests test_base_collection_update_default_validates_list_lengths and test_base_collection_update_default_rejects_mismatched_lengths were spinning up a real ChromaBackend and calling add(documents=...), which triggered ChromaDB's default ONNX embedding function and attempted a network download — failing in offline/sandboxed CI. BaseCollection.update() validates list lengths before any DB access, so no items need to be pre-loaded for the length-check to fire. Switch both tests to use _FakeCollection (same as the rest of the unit tests in this file) so they are pure in-memory and network-free. Also fixes a structural bug in test 1: collection._collection.add() was accidentally placed inside the pytest.raises(ValueError) block, masking the real assertion. Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/55fc663e-b256-4b8b-88ce-4271560def8d Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>	2026-04-18 16:23:58 +00:00
Igor Lins e Silva	4a088ea8e1	Address Copilot review: cursor tie-break, honest metrics, accurate comments Six items from the automated review on PR #998: 1. Cursor tie-break bug (correctness). The skip condition was `rec.timestamp <= cursor`; if multiple messages share the max timestamp and only some were ingested before a crash, the rest would be lost forever. Changed to `< cursor`, relying on deterministic drawer IDs for safe re-attempt at the boundary. Regression test `test_sweep_recovers_untaken_message_at_cursor_timestamp`. 2. `drawers_added` counted upserts, not adds. Added a pre-flight `collection.get(ids=batch)` to distinguish new rows from already-present ones. Return value now carries `drawers_added`, `drawers_already_present`, `drawers_upserted`, and `drawers_skipped` separately. Dict-compatible access (`existing.get("ids")`) keeps it working on both the raw Chroma return and the typed `GetResult`. 3. `sweep_directory` hid failures in the summary. `files_processed` used to exclude failed files. Replaced with `files_attempted` (all discovered) + `files_succeeded` (subset that completed); CLI output shows `succeeded/attempted`. 4. Coordination claim was overstated. The primary miners don't stamp `session_id`/`timestamp` metadata, so the sweeper coordinates only with its own prior runs. Softened docstrings on module and CLI command. Uniform cross-miner metadata is flagged as a follow-up. 5. MAX_FILE_SIZE comments were misleading. Said source size "does not affect storage or embedding cost": true per-drawer, but source size still scales drawer count, embedding work, and memory usage (files are read in full, not streamed). Corrected in both `miner.py` and `convo_miner.py`. 6. Added the tie-break regression test that reproduces the correctness bug from (1). Tests: 970 passed (was 969), ruff + pre-commit clean. Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>	2026-04-18 13:22:18 -03:00
Igor Lins e Silva	42b940d263	fix(backends): address Copilot review on #995 Four defects surfaced by the automated review, fixed with targeted tests: 1. BaseCollection.update() default now validates that documents / metadatas / embeddings lengths match ids, raising ValueError instead of silently misaligning pairs or raising IndexError (base.py). 2. ChromaCollection.query() now rejects the two ambiguous input shapes up front — neither or both of query_texts / query_embeddings, and empty input lists — with clear ValueError messages rather than delegating to chromadb's less-obvious errors (chroma.py). 3. QueryResult.empty() accepts embeddings_requested=True to preserve the outer-query dimension with empty hit lists when the caller asked for embeddings, matching the spec rule that included fields carry the outer shape even when empty (base.py). ChromaCollection.query() threads this through on the empty-result path (chroma.py). 4. ChromaBackend cache-freshness check now matches the semantics from mcp_server._get_client (merged via #757) on three edge cases Copilot called out: (a) invalidate when chroma.sqlite3 disappears while a cached client is held, (b) treat a 0→nonzero stat transition as a change so a cache built when the DB did not yet exist is refreshed, (c) re-stat after PersistentClient constructs the DB lazily so freshness reflects the post-creation state (chroma.py). Tests: 978 passed (up from 970), 8 new tests covering the fixes.	2026-04-18 13:19:18 -03:00
Igor Lins e Silva	29ce7c7135	Harden sweeper for production: verbatim tool blocks, full session_id, logged failures Four changes on top of the proposal's initial sweeper draft, driven by the CLAUDE.md design principles: 1. Drop the 500-char truncation on tool_use / tool_result content in _flatten_content. The "verbatim always" principle forbids lossy compression of user-adjacent data; a long code-edit diff handed to the assistant must round-trip intact. Unknown block types now also serialize their full payload instead of just a type marker. New test test_parse_preserves_tool_blocks_verbatim covers a 5000-char input. 2. Use the full session_id in drawer IDs (not session_id[:12]). Rules out cross-session collisions if a transcript source ever uses non-UUID session identifiers or shared prefixes. 3. Replace silent `except Exception: return None` in get_palace_cursor with a logger.warning — the exact anti-pattern this PR otherwise criticizes in miner.py. The fallback behavior is still safe (deterministic IDs make a missed cursor recover on the next run), but the failure is now discoverable. 4. sweep_directory now collects per-file failures into the result dict and the CLI exits non-zero when any file failed, so a partial-sweep outcome is visible rather than swallowed. Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>	2026-04-18 13:14:32 -03:00
MSL	fed69935d3	Add tandem sweeper: message-level safety net for dropped transcripts The primary miners (miner.py, convo_miner.py) operate at file granularity and can drop data for several reasons: size caps, silent OSError on read, dedup false positives, extensions the project miner does not recognize. Even with tonight's hotfixes, any future bug in the file-level path risks silent data loss. The sweeper is a second, cooperating miner that works at MESSAGE granularity: - Parses Claude Code .jsonl line by line, yielding only user/assistant records (filters progress, file-history-snapshot, etc. noise). - For each session_id, queries the palace for max(timestamp) and treats that as the cursor. - Ingests only messages newer than the cursor, as one small drawer per exchange (never hits a size cap — each drawer is 1-5 KB). - Deterministic drawer IDs from session_id + message UUID make reruns idempotent; crash mid-sweep is safe. Tandem coordination is free: if the primary miner committed up to timestamp T, the sweeper resumes from T. If the primary miner missed everything, the sweeper catches it all. Neither duplicates the other. Smoke test on a real Claude Code transcript: 1st run: +39 drawers, 0 already present 2nd run: +0 drawers, 39 already present (perfect idempotence) Opt-in via: mempalace sweep <file.jsonl> mempalace sweep <transcript-dir> No changes to existing miners. No schema migration. Purely additive. Tests: tests/test_sweeper.py (7 tests covering parsing, tandem coordination, idempotency, resume-from-cursor, metadata correctness). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:06 -03:00
MSL	6f33d52681	Raise convo_miner MAX_FILE_SIZE cap 10 MB → 500 MB Mirrors the miner.py fix in this same branch. convo_miner.py had the exact same 10 MB cap at line 58 that silently dropped long transcripts via continue. Long Claude Code sessions, multi-year ChatGPT exports, and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern, different file. Raised to 500 MB to match miner.py for consistency; downstream chunking means source file size does not affect storage or embedding cost. Tests: tests/test_convo_miner_size_cap.py (1 test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:01 -03:00
MSL	d137d12313	Raise MAX_FILE_SIZE cap from 10 MB to 500 MB Long Claude Code sessions routinely produce transcripts larger than 10 MB. The previous cap at miner.py:65 silently dropped them at line 732 with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — same silent-failure pattern as the .jsonl extension bug. The cap exists as a safety rail against pathological binaries, not as a limit on legitimate text. Downstream chunking at 800 chars per drawer means source file size does not affect storage or embedding cost. 500 MB leaves headroom for year-long continuous transcripts while still catching accidental multi-GB binary mines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:01 -03:00
MSL	560fdbdc9f	Fix silent drop of .jsonl files in project miner mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not `.jsonl`. Every jsonl file encountered in a mined directory was silently skipped at miner.py:722: if filepath.suffix.lower() not in READABLE_EXTENSIONS: continue Claude Code transcripts, ChatGPT exports, and every other tool writing line-delimited JSON ship as `.jsonl`. Users running `mempalace mine` against a directory of transcripts saw the command complete with no error and no log line — and their conversations never reached the palace. Silent data loss. Adding `.jsonl` to the whitelist alongside `.json`. jsonl is text line-by-line; the existing chunking pipeline handles it the same way it handles any other text file. Tests: tests/test_miner_jsonl_visibility.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:01 -03:00
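Several commits in this log (a3c778210b, feba7e8043, 7690574dde, 3f0cfd5ed4, 49e9e04a12) apply the same `meta = meta or {}` / `doc = doc or ""` guard. The following is a minimal sketch of that pattern, not the project's code: `render_hits` and `fake_results` are hypothetical names, and the nested-list result shape is only assumed to mirror what the commit messages describe for a ChromaDB query.

```python
def render_hits(results):
    """Flatten a query-style result into hit dicts, tolerating None entries.

    `results` is assumed to hold one inner list per query for each field,
    as the commit messages describe.
    """
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    hits = []
    for doc, meta, dist in zip(docs, metas, dists):
        # The guard: a None entry becomes an empty dict / empty string,
        # so the .get() calls below fall back to defaults instead of raising.
        meta = meta or {}
        doc = doc or ""
        hits.append({
            "text": doc,
            "wing": meta.get("wing", "unknown"),
            "room": meta.get("room", "unknown"),
            "source_file": meta.get("source_file", "?"),
            "distance": dist,
        })
    return hits


# Hypothetical result with a None metadata/document pair in the middle.
fake_results = {
    "documents": [["first doc", None, "third doc"]],
    "metadatas": [[{"wing": "code", "room": "api"}, None, {"wing": "notes"}]],
    "distances": [[0.10, 0.42, 0.77]],
}

hits = render_hits(fake_results)
for hit in hits:
    print(hit["wing"], hit["room"], hit["distance"])
```

Without the two guard lines, the second entry would raise `AttributeError: 'NoneType' object has no attribute 'get'` after the first hit had already been rendered, which is the partial-output failure the commits describe.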

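Commit 4a088ea8e1 changes the sweeper's skip condition from `rec.timestamp <= cursor` to `< cursor` and relies on deterministic drawer IDs to make the retry at the boundary safe. A rough sketch of why that works follows; `sweep`, the dict-backed `store`, and the message fields are all hypothetical stand-ins, not the sweeper's actual interface.

```python
def sweep(messages, store, cursor):
    """Ingest messages at or after the cursor; return how many were new.

    `store` is a plain dict standing in for the palace collection.
    """
    added = 0
    for msg in messages:
        # Only strictly older messages are skipped. Messages that share the
        # cursor timestamp are retried, because some of them may not have
        # been written before a crash.
        if cursor is not None and msg["timestamp"] < cursor:
            continue
        # Deterministic ID: re-ingesting the same message overwrites itself.
        drawer_id = f"{msg['session_id']}_{msg['uuid']}"
        if drawer_id not in store:
            added += 1
        store[drawer_id] = msg["text"]
    return added


# Simulated crash: "b" was stored, "c" (same timestamp) was not.
store = {"s1_a": "first", "s1_b": "second"}
cursor = "2026-04-18T12:00:05Z"  # max timestamp already in the store
messages = [
    {"session_id": "s1", "uuid": "a", "timestamp": "2026-04-18T12:00:01Z", "text": "first"},
    {"session_id": "s1", "uuid": "b", "timestamp": "2026-04-18T12:00:05Z", "text": "second"},
    {"session_id": "s1", "uuid": "c", "timestamp": "2026-04-18T12:00:05Z", "text": "third"},
]

added = sweep(messages, store, cursor)   # picks up "c"
rerun = sweep(messages, store, cursor)   # nothing new on a second pass
print(added, rerun, sorted(store))
```

With `<=` in place of `<`, message "c" would be skipped on every run, since its timestamp never exceeds the cursor.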
500 Commits