035fe6d658
Addresses issues found while reviewing the initial phase-2 implementation against real data:

**Bug: uncertain bucket starved from the LLM.** `discover_entities` was dropping the regex-uncertain bucket whenever real git/manifest signal existed, which is exactly when `--llm` is most useful for cleaning up prose noise. The uncertain candidates never reached the refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.** `_collect_contexts` used substring matching on lower-cased lines, so the name "Go" matched "good", "going", "forgot". Switched to a `(?<!\w)…(?!\w)` regex so short names only match at token boundaries.

**Authoritative-source detection replaces confidence threshold.** Previously the refinement step skipped entries with `confidence >= 0.95` to avoid second-guessing manifest-backed projects. That threshold was fragile: the regex detector produces 0.99 confidence for things like `code file reference (5x)` on framework names (OpenAPI, etc.), so those skipped the LLM despite being regex-only noise. New helpers `_is_authoritative_person` / `_is_authoritative_project` look at the actual signal strings (commits, package.json, etc.) to decide.

**Now also refines regex-derived people.** After #1148's high-pronoun-signal fix, the regex detector can promote non-people to the `people` bucket (e.g. a capitalized common noun that happened to appear near pronouns). The LLM now gets a chance to clean those up, while git-authored people are still skipped.

**Robust JSON extraction.** Small local models routinely wrap JSON output in prose ("Sure, here's the classification: {…}"). The previous code-fence stripper failed on that. `_extract_json_candidates` now does balanced-bracket extraction with string-aware quote handling, so it recovers JSON from:

- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays

**Prompt guidance for frameworks vs user projects.** Added an explicit instruction: frameworks, runtimes, APIs, cloud services, and third-party vendors (Angular, OpenAPI, Terraform, Bun, Google, etc.) are TOPIC unless the context clearly says it's the user's own codebase. Directly addresses a false-positive pattern observed during dev runs.

**Defensive mtime.** `convo_scanner._safe_mtime` catches OSError during `stat()` (permission changes, filesystem races, broken symlinks) and sorts the affected file to the end of the newest-first order rather than crashing the scan.

**Cosmetic:** merged two adjacent f-strings on the same line in `backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON extraction variants, authoritative-source helpers, refining high-confidence regex projects, and end-to-end LLM refinement preserving the uncertain bucket.
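To illustrate the word-boundary change described in the commit message, here is a minimal sketch. It is not the actual `_collect_contexts` code: the function name `collect_contexts`, its signature, and the choice to fill the middle of the lookaround pattern with `re.escape(name)` and match case-insensitively are all assumptions made for the example.

```python
import re


def collect_contexts(name: str, lines: list[str]) -> list[str]:
    """Return the lines that mention `name` as a whole token.

    Hypothetical stand-in for the real helper; only the lookaround
    idea is taken from the commit message.
    """
    # Lookarounds rather than \b, so a name that starts or ends with a
    # non-word character (for example "C++") is still bounded correctly.
    pattern = re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


lines = [
    "That was a good idea",
    "We rewrote the service in Go last year",
    "I forgot the flag",
]
print(collect_contexts("Go", lines))
# ['We rewrote the service in Go last year']
```

With plain substring matching on lower-cased text, all three lines would have been collected as context for "Go".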
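The balanced-bracket extraction mentioned in the commit message could look roughly like the sketch below. This is an illustration under assumptions, not the project's `_extract_json_candidates`: the names `extract_json_candidates` and `_find_balanced_end`, and the decision to return parsed values instead of raw substrings, are invented for the example.

```python
import json

_CLOSERS = {"{": "}", "[": "]"}


def _find_balanced_end(text: str, start: int):
    """Index of the bracket that closes text[start], or None if unbalanced."""
    stack = []
    in_string = False
    escaped = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            # Inside a JSON string, brackets are data; only track escapes
            # and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in _CLOSERS:
            stack.append(_CLOSERS[ch])
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return None
            if not stack:
                return pos
    return None


def extract_json_candidates(text: str) -> list:
    """Return every top-level JSON object or array found in `text`."""
    results = []
    i = 0
    while i < len(text):
        if text[i] in _CLOSERS:
            end = _find_balanced_end(text, i)
            if end is not None:
                try:
                    results.append(json.loads(text[i:end + 1]))
                    i = end + 1  # skip past the value we just consumed
                    continue
                except json.JSONDecodeError:
                    pass  # balanced but not valid JSON; keep scanning
        i += 1
    return results


reply = 'Sure, here\'s the classification: {"name": "Go", "type": "TOPIC"}'
print(extract_json_candidates(reply))
# [{'name': 'Go', 'type': 'TOPIC'}]
```

Because the scan starts at each opening bracket, surrounding prose and markdown fences are ignored without special-casing them.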
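The defensive mtime behaviour can be sketched as follows. The helper names `safe_mtime` and `newest_first` and the use of negative infinity as the fallback value are assumptions for illustration; the commit message only states that files whose `stat()` fails are sorted to the end of the newest-first order.

```python
from pathlib import Path


def safe_mtime(path: Path) -> float:
    """Modification time of `path`, or -inf if stat() fails."""
    try:
        return path.stat().st_mtime
    except OSError:
        # Covers missing files, permission errors, and broken symlinks.
        return float("-inf")


def newest_first(paths: list[Path]) -> list[Path]:
    """Sort newest first; files that cannot be stat()ed end up last."""
    return sorted(paths, key=safe_mtime, reverse=True)
```

Returning a sentinel keeps the sort key total, so one unreadable file does not abort the whole scan.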
mempalace/ — Core Package
The Python package that powers MemPalace. All modules, all logic.
Modules
| Module | What it does |
|---|---|
| `cli.py` | CLI entry point: routes to mine, search, init, compress, wake-up |
| `config.py` | Configuration loading: `~/.mempalace/config.json`, env vars, defaults |
| `normalize.py` | Converts 5 chat formats (Claude Code JSONL, Claude.ai JSON, ChatGPT JSON, Slack JSON, plain text) to standard transcript format |
| `miner.py` | Project file ingest: scans directories, chunks by paragraph, stores to ChromaDB |
| `convo_miner.py` | Conversation ingest: chunks by exchange pair (Q+A), detects rooms from content |
| `searcher.py` | Semantic search via ChromaDB vectors: filters by wing/room, returns verbatim + scores |
| `layers.py` | 4-layer memory stack: L0 (identity), L1 (critical facts), L2 (room recall), L3 (deep search) |
| `dialect.py` | AAAK compression: entity codes, emotion markers, 30x lossless ratio |
| `knowledge_graph.py` | Temporal entity-relationship graph: SQLite, time-filtered queries, fact invalidation |
| `palace_graph.py` | Room-based navigation graph: BFS traversal, tunnel detection across wings |
| `mcp_server.py` | MCP server: 19 tools, AAAK auto-teach, Palace Protocol, agent diary |
| `onboarding.py` | Guided first-run setup: asks about people/projects, generates AAAK bootstrap + wing config |
| `entity_registry.py` | Entity code registry: maps names to AAAK codes, handles ambiguous names |
| `entity_detector.py` | Auto-detect people and projects from file content |
| `general_extractor.py` | Classifies text into 5 memory types (decision, preference, milestone, problem, emotional) |
| `room_detector_local.py` | Maps folders to room names using 70+ patterns, no API |
| `spellcheck.py` | Name-aware spellcheck: won't "correct" proper nouns in your entity registry |
| `split_mega_files.py` | Splits concatenated transcript files into per-session files |
Architecture
    User → CLI → miner/convo_miner → ChromaDB (palace)
                                         ↕
                             knowledge_graph (SQLite)
                                         ↕
    User → MCP Server → searcher → results
                      → kg_query → entity facts
                      → diary → agent journal
The palace (ChromaDB) stores verbatim content. The knowledge graph (SQLite) stores structured relationships. The MCP server exposes both to any AI tool.
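As a rough illustration of what "structured relationships" with time-filtered queries and fact invalidation can mean on SQLite, the sketch below stores subject/predicate/object facts with a validity window. The table name `triples`, its columns, the helper functions, and the sample data are all hypothetical and are not taken from `knowledge_graph.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE triples (
        subject    TEXT NOT NULL,
        predicate  TEXT NOT NULL,
        object     TEXT NOT NULL,
        valid_from TEXT NOT NULL,  -- ISO date, so string comparison works
        valid_to   TEXT            -- NULL while the fact is still current
    )
    """
)


def add_fact(subject, predicate, obj, valid_from):
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?, ?, NULL)",
        (subject, predicate, obj, valid_from),
    )


def invalidate(subject, predicate, obj, ended):
    """Close the validity window instead of deleting the row."""
    conn.execute(
        "UPDATE triples SET valid_to = ? "
        "WHERE subject = ? AND predicate = ? AND object = ? "
        "AND valid_to IS NULL",
        (ended, subject, predicate, obj),
    )


def facts_as_of(subject, as_of):
    """Facts about `subject` that were valid on the date `as_of`."""
    rows = conn.execute(
        "SELECT predicate, object FROM triples "
        "WHERE subject = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY valid_from",
        (subject, as_of, as_of),
    )
    return rows.fetchall()


add_fact("Alice", "works_on", "ProjectA", "2024-01-01")
invalidate("Alice", "works_on", "ProjectA", "2024-06-01")
add_fact("Alice", "works_on", "ProjectB", "2024-06-01")

print(facts_as_of("Alice", "2024-03-01"))  # [('works_on', 'ProjectA')]
print(facts_as_of("Alice", "2024-07-01"))  # [('works_on', 'ProjectB')]
```

Keeping invalidated rows is what allows a query to be answered as of a past date.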