mempalace

Author	SHA1	Message	Date
Ben Sigman	62439e1368	Merge pull request #681 from jphein/fix/unicode-checkmark fix: replace Unicode checkmark with ASCII for Windows encoding (#535)	2026-04-18 23:27:57 -07:00
Igor Lins e Silva	4a088ea8e1	Address Copilot review: cursor tie-break, honest metrics, accurate comments Six items from the automated review on PR #998: 1. Cursor tie-break bug (correctness). The skip condition was `rec.timestamp <= cursor`; if multiple messages share the max timestamp and only some were ingested before a crash, the rest would be lost forever. Changed to `< cursor`, relying on deterministic drawer IDs for safe re-attempt at the boundary. Regression test `test_sweep_recovers_untaken_message_at_cursor_timestamp`. 2. `drawers_added` counted upserts, not adds. Added a pre-flight `collection.get(ids=batch)` to distinguish new rows from already- present ones. Return value now carries `drawers_added`, `drawers_already_present`, `drawers_upserted`, and `drawers_skipped` separately. Dict-compatible access (`existing.get("ids")`) keeps it working on both the raw Chroma return and the typed `GetResult`. 3. `sweep_directory` hid failures in the summary. `files_processed` used to exclude failed files. Replaced with `files_attempted` (all discovered) + `files_succeeded` (subset that completed); CLI output shows `succeeded/attempted`. 4. Coordination claim was overstated. The primary miners don't stamp `session_id`/`timestamp` metadata, so the sweeper coordinates only with its own prior runs. Softened docstrings on module and CLI command. Uniform cross-miner metadata is flagged as a follow-up. 5. MAX_FILE_SIZE comments were misleading. Said source size "does not affect storage or embedding cost" — true per-drawer, but source size still scales drawer count, embedding work, and memory usage (files are read in full, not streamed). Corrected in both `miner.py` and `convo_miner.py`. 6. Added the tie-break regression test that reproduces the correctness bug from (1). Tests: 970 passed (was 969), ruff + pre-commit clean. Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>	2026-04-18 13:22:18 -03:00
MSL	6f33d52681	Raise convo_miner MAX_FILE_SIZE cap 10 MB → 500 MB Mirrors the miner.py fix in this same branch. convo_miner.py had the exact same 10 MB cap at line 58 that silently dropped long transcripts via continue. Long Claude Code sessions, multi-year ChatGPT exports, and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern, different file. Raised to 500 MB to match miner.py for consistency; downstream chunking means source file size does not affect storage or embedding cost. Tests: tests/test_convo_miner_size_cap.py (1 test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:01 -03:00
jp	542b53bb0f	fix: replace Unicode checkmark with ASCII + for Windows encoding (#535 ) Windows terminals using cp1251/cp1252 crash on the Unicode ✓ (U+2713) in progress output. Replace with ASCII + in convo_miner.py and split_mega_files.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 08:59:58 -07:00
Milla J	62df24599e	fix: README audit — 42 TDD tests + hall detection + 7 claim fixes (#835 ) * fix: README audit — match every claim to shipped code + add hall detection TDD audit: wrote 42 tests verifying README claims against codebase. Fixed all 7 failures: 1. Tool count: 19 → 29 (10 tools were undocumented) 2. Added tool table rows for tunnels, drawer management, system tools 3. Version badge: 3.1.0 → 3.2.0 4. dialect.py file reference: "30x lossless" → "AAAK index format for closet pointers" 5. Wake-up token cost: "~170 tokens" → "~600-900 tokens" (matches layers.py) 6. pyproject.toml version in project structure: v3.0.0 → v3.2.0 7. Hall detection: added detect_hall() to miner.py — drawers now tagged with hall metadata so palace_graph.py can build hall connections New code: - miner.py: detect_hall() — keyword scoring against config hall_keywords, writes hall field to every drawer's metadata - tests/test_hall_detection.py — 12 TDD tests (written before code) - tests/test_readme_claims.py — 42 TDD tests verifying README accuracy 859/859 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve ruff lint — unused imports and variables Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: ruff format with CI-pinned 0.4.x Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use conftest fixtures in hall tests for Windows compat Windows CI fails with NotADirectoryError when ChromaDB tries to write HNSW files in short-lived TemporaryDirectory. Use conftest palace_path and tmp_dir fixtures instead — same pattern as all other tests that touch ChromaDB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address Igor's review — convo_miner halls, cached config, markdown typo TDD: wrote tests for convo_miner hall metadata and config caching BEFORE verifying the code changes. 1. README markdown typo: extra ** in wake-up token row (line 195) 2. convo_miner.py: added _detect_hall_cached() — conversation drawers now get hall metadata (was missing, Igor caught it) 3. miner.py + convo_miner.py: cached hall_keywords at module level so config.json isn't re-read per drawer during bulk mine 4. New tests: TestConvoMinerWritesHalls, TestDetectHallCaching 861/861 tests pass. ruff clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 17:11:11 -07:00
Igor Lins e Silva	28e263748b	merge: develop (#784 file-locking, #820 version sync) Non-trivial merge in convo_miner.py: this branch's _file_convo_chunks (purge stale + upsert with normalize_version) and develop's _file_chunks_locked (mine_lock + double-checked file_already_mined) both touched the same critical section. Combined into a single _file_chunks_locked helper that does lock → double-check → purge → upsert, preserving both the multi-agent safety guarantee from #784 and the schema-rebuild contract from this PR. Also folds develop's mine_lock import into both miner.py and convo_miner.py alongside NORMALIZE_VERSION. 707/707 tests pass, ruff + format clean under CI-pinned 0.4.x.	2026-04-13 16:29:50 -03:00
Igor Lins e Silva	7e5eeda9a5	feat(normalize): auto-rebuild stale drawers via NORMALIZE_VERSION schema gate Without this, the strip_noise improvement only helps new mines. Every user who had already mined Claude Code JSONL sessions would keep their noise-polluted drawers forever, because convo_miner's file_already_mined skip short-circuits before re-processing. Adds a versioned schema gate so upgrades propagate silently: - palace.NORMALIZE_VERSION=2 — bumped when the normalization pipeline changes shape (this PR's strip_noise is the v1→v2 bump). - file_already_mined now returns False if the stored normalize_version is missing or less than current, triggering a rebuild on next mine. - Both miners stamp drawers with the current normalize_version. - convo_miner now purges stale drawers before inserting fresh chunks (mirrors miner.py's existing delete+insert), extracted into _file_convo_chunks helper to keep mine_convos under ruff's C901 limit. User experience: upgrade mempalace, run `mempalace mine` as usual, old noisy drawers get silently replaced with clean ones. No erase needed, no "you need to rebuild" changelog footgun. Tests: - test_file_already_mined_returns_false_for_stale_normalize_version — pins the version gate contract for missing/v1/current. - test_add_drawer_stamps_normalize_version — fresh project-miner drawers carry the field. - test_mine_convos_rebuilds_stale_drawers_after_schema_bump — end-to-end proof that a pre-v2 palace gets silently cleaned on next mine, with orphan drawers purged and NOT skipped. Existing test_file_already_mined_check_mtime updated to include the new field; all other tests unaffected.	2026-04-13 16:20:55 -03:00
Igor Lins e Silva	09f218cbb2	refactor: extract locked filing block to keep mine_convos under C901 Adding the per-file lock + double-checked file_already_mined() in the previous commit pushed mine_convos cyclomatic complexity from 25 to 26, just over ruff's max-complexity threshold. Hoist the locked critical section into _file_chunks_locked() so the outer loop stays within budget. No behavior change.	2026-04-13 15:48:54 -03:00
MSL	30a431924b	fix: add file-level locking to prevent multi-agent duplicate drawers Root cause: when multiple agents mine simultaneously, both pass file_already_mined() check, both delete+insert the same file's drawers, creating duplicates or losing data. Fix: mine_lock() in palace.py — cross-platform file lock (fcntl on Unix, msvcrt on Windows). Both miner.py and convo_miner.py now lock per-file during the delete+insert cycle and re-check after acquiring the lock. Tested: - Lock acquires and releases correctly - Second agent blocks until first releases (0.25s wait) - 33/33 existing tests pass - Cross-platform: fcntl (macOS/Linux), msvcrt (Windows) Based on v3.2.0 tag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 07:20:58 -03:00
Ben Sigman	5ac600b3bb	style: ruff format convo_miner.py (#741 )	2026-04-12 16:34:16 -07:00
Mikhail Valentsev	87e8bafad8	fix: prevent convo_miner from re-processing 0-chunk files on every run (#654 ) (#732 ) * fix: register 0-chunk files to prevent re-processing on every mine (#654) mine_convos() has three early-exit paths (OSError, content too short, zero chunks) that skip writing anything to ChromaDB. Since file_already_mined() checks for the presence of a document with a matching source_file, these files are re-read and re-processed on every subsequent run. Add _register_file() that upserts a lightweight sentinel document (room="_registry", ingest_mode="registry") so file_already_mined() returns True on future runs. Note: Bug 2 from the issue (drawers_added counter always 0) was already resolved upstream via the switch from collection.add() to collection.upsert(). * fix: resolve macOS path symlink in test + remove unused variable	2026-04-12 14:25:34 -07:00
Sanjay Ramadugu	9b60c6edd7	fix: remove 8-line AI response truncation in convo_miner (#692 ) (#708 ) The _chunk_by_exchange() function was silently truncating AI responses to 8 lines via ai_lines[:8]. Any content beyond line 8 was discarded, violating the project's verbatim storage principle. Now the full AI response is preserved. When a combined exchange exceeds CHUNK_SIZE (800 chars, aligned with miner.py), it is split across consecutive drawers instead of being truncated.	2026-04-12 14:23:57 -07:00
shafdev	d52d6c9622	fix: store full AI response in convo_miner exchange chunking (#695 )	2026-04-12 14:23:52 -07:00
MSL	71e8f2d054	fix: prevent HNSW index bloat from duplicate add() calls (#525 ) Root cause: convo_miner.py used collection.add() instead of upsert(), so repeated mine runs pushed duplicate entries into the HNSW graph. At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes and eventually segfault. Changes: - convo_miner.py: add() → upsert() (the one-line root cause fix) - repair.py: new module — scan for corrupt IDs, prune them, or rebuild the HNSW index from scratch. Backs up only chroma.sqlite3 (not the bloated HNSW files). Recreates collection with hnsw:space=cosine. - dedup.py: new module — detect and remove near-duplicate drawers from the same source file using cosine similarity. No API calls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 08:14:22 -07:00
bensig	1d19dfc9d5	security: harden inputs, fix shell injection, optimize DB access - Fix command injection in hook script (pass paths via sys.argv) - Add sanitize_name/sanitize_content validators in config.py - Add 10MB file size guard + symlink skip in miners - Fix SQLite connection leak in knowledge_graph.py (reuse connection) - Use `with conn:` for proper transaction handling - Consolidate shared palace operations into palace.py - Add write-ahead log for audit trail on writes/deletes - Add metadata cache with 30s TTL for status/taxonomy calls - Upgrade md5 → sha256 for drawer/triple IDs - Harden file permissions (0o700/0o600) - Pin chromadb>=0.5.0,<0.7 Based on PR #252 by @anthonyonazure with lint fixes applied. Co-Authored-By: anthonyonazure <anthonyonazure@users.noreply.github.com>	2026-04-09 08:06:30 -07:00
Ben Sigman	6af6fe3dda	Merge pull request #54 from adv3nt3/fix/narrow-exception-handling fix: narrow bare except Exception to specific types where safe	2026-04-07 13:54:05 -07:00
Ben Sigman	f1f0a4c966	Merge pull request #129 from minimexat/fix/windows-unicode-encoding fix: replace Unicode separator character for Windows compatibility	2026-04-07 13:51:58 -07:00
Ben Sigman	6aa4272b65	Merge pull request #53 from adv3nt3/fix/md5-usedforsecurity-miners fix: mark MD5 as non-security in miner drawer ID generation	2026-04-07 13:51:52 -07:00
f-hoedl	d214f6a854	fix: replace Unicode separator in convo_miner.py for Windows compatibility Replace the ─ (U+2500) separator character with - in convo_miner.py. Windows terminals using cp1252 encoding raise UnicodeEncodeError when printing this character unless PYTHONUTF8=1 is set explicitly. Fixes crash on Windows: UnicodeEncodeError: 'charmap' codec can't encode character '\u2500'	2026-04-07 21:55:34 +02:00
bensig	186bb2e3d1	fix: shell injection in hooks, Claude Code mining, chromadb pin - hooks/mempal_save_hook.sh: pass $TRANSCRIPT_PATH as sys.argv instead of interpolating into python -c string (fixes #110) - normalize.py: accept type "user" in addition to "human" for Claude Code JSONL sessions (fixes #111) - convo_miner.py: skip tool-results/, memory/ dirs and .meta.json files when scanning for conversations (fixes #111) - pyproject.toml: pin chromadb>=0.4.0,<1 to avoid crashing 1.x builds on macOS ARM64 (fixes #100)	2026-04-07 11:45:51 -07:00
adv3nt3	312d380aab	fix: narrow bare except Exception to specific types where safe Replace broad except Exception with specific exception types in 6 sites where the expected failure mode is well-defined: - normalize.py: OSError for file read, ImportError for optional import - miner.py: OSError for file read_text - entity_detector.py: OSError for file read in scan loop - convo_miner.py: (OSError, ValueError) for normalize which reads and parses files - entity_registry.py: (URLError, OSError, JSONDecodeError, KeyError) for Wikipedia lookup fallback ChromaDB except Exception sites (~30) are left broad for now. chromadb.errors defines NotFoundError, DuplicateIDError, InvalidDimensionException etc., but narrowing those sites requires importing from chromadb.errors and validating across supported versions (>=0.4.0). MCP server handlers also left broad for resilience.	2026-04-07 13:51:27 +02:00
adv3nt3	3a2817505a	fix: mark MD5 as non-security in miner drawer ID generation Add usedforsecurity=False to hashlib.md5() calls in miner.py and convo_miner.py to document that MD5 is used for deterministic ID generation, not cryptographic security. Preserves stable drawer IDs for backward compatibility with existing palaces. Swapping to SHA-256 would change the ID formula and make existing drawers unreachable on re-ingestion. PR #34 covers the MD5 sites in knowledge_graph.py and mcp_server.py. Verified: usedforsecurity kwarg is supported since Python 3.9 (project target per pyproject.toml line 10), confirmed via Context7 CPython docs.	2026-04-07 13:41:00 +02:00
Milla Jovovich	068dbd9a7b	MemPalace: palace architecture, AAAK compression, knowledge graph The memory system: - Palace structure: Wings (people/projects) → Rooms (topics) → Closets (AAAK compressed) → Drawers (verbatim transcripts) - Halls connect related rooms within a wing - Tunnels cross-reference rooms across wings - AAAK: 30x lossless compression dialect for AI agents - Knowledge graph: temporal entity-relationship triples (SQLite) - Palace graph: room-based navigation with tunnel detection - MCP server: 19 tools — search, graph traversal, agent diary, AAAK auto-teach - Onboarding: guided setup generates wing config + AAAK entity registry - Contradiction detection: catches wrong pronouns, names, ages - Auto-save hooks for Claude Code 96.6% Recall@5 on LongMemEval — highest zero-API score published. 100% with optional Haiku rerank (500/500). Local. Free. No API key required.	2026-04-04 18:16:04 -07:00

23 Commits