Files

Igor Lins e Silva 035fe6d658 fix(llm): tighter refinement — word boundaries, JSON extraction, authoritative sources

Addresses issues found while reviewing the initial phase-2 implementation
against real data:

**Bug: uncertain bucket starved from the LLM.**
`discover_entities` was dropping the regex-uncertain bucket whenever real
git/manifest signal existed — which is exactly when `--llm` is most useful
for cleaning up prose noise. The uncertain candidates never reached the
refinement step. Fixed: only drop when `llm_provider is None`.
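
A minimal sketch of the corrected gating, assuming a hypothetical helper `select_uncertain` and a boolean `has_real_signal`; neither name comes from the codebase, and the real `discover_entities` may structure this differently:

```python
def select_uncertain(uncertain, has_real_signal, llm_provider=None):
    """Decide which regex-uncertain candidates move on to refinement.

    Hypothetical helper: the names and signature are illustrative only.
    """
    # Without an LLM nobody can clean up the noise, so strong git/manifest
    # signal is a good reason to drop the uncertain bucket entirely.
    if has_real_signal and llm_provider is None:
        return []
    # With an LLM configured, or with no other signal, keep the candidates.
    return list(uncertain)
```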

**Context collection: word boundaries, not substring.**
`_collect_contexts` used substring matching on lower-cased lines, so the
name "Go" matched "good", "going", "forgot". Switched to a
`(?<!\w)…(?!\w)` regex so short names only match at token boundaries.
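
The difference is easy to reproduce in isolation. The sketch below is not the project's `_collect_contexts`; the function names, the `limit` parameter, and the case-insensitive flag (standing in for the original lower-casing) are assumptions made for illustration:

```python
import re


def make_name_pattern(name):
    # re.escape keeps names with regex metacharacters (e.g. "C++") literal.
    # Lookarounds are used instead of \b because \b needs a word character
    # on one side, which fails after a name ending in "+".
    return re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)


def collect_contexts(name, lines, limit=3):
    """Return up to `limit` stripped lines that mention `name` as a whole token."""
    pattern = make_name_pattern(name)
    matches = [line.strip() for line in lines if pattern.search(line)]
    return matches[:limit]
```

With this pattern, "Go" matches "We use Go for the backend." but not "This is good." or "I forgot it.".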

**Authoritative-source detection replaces confidence threshold.**
Previously the refinement step skipped entries with `confidence >= 0.95`
to avoid second-guessing manifest-backed projects. That threshold was
fragile — the regex detector produces 0.99 confidence for things like
`code file reference (5x)` on framework names (OpenAPI, etc.), so those
skipped the LLM despite being regex-only noise. New helpers
`_is_authoritative_person` / `_is_authoritative_project` look at the
actual signal strings (commits, package.json, etc.) to decide.
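
The idea can be sketched as a check over signal strings rather than a numeric score. The marker lists and function bodies below are assumptions for illustration; the commit does not show which substrings the real helpers look for:

```python
# Hypothetical marker substrings; the real helpers may use different ones.
PROJECT_MARKERS = ("package.json", "pyproject.toml", "cargo.toml", "go.mod")
PERSON_MARKERS = ("commit",)


def has_marker(signals, markers):
    # True if any signal string mentions any marker, ignoring case.
    return any(marker in signal.lower() for signal in signals for marker in markers)


def is_authoritative_project(signals):
    """Manifest-backed projects are trusted and skip LLM refinement."""
    return has_marker(signals, PROJECT_MARKERS)


def is_authoritative_person(signals):
    """People backed by git authorship are trusted and skip LLM refinement."""
    return has_marker(signals, PERSON_MARKERS)
```

Under these assumptions a project whose only signal is `code file reference (5x)` is not authoritative, so it still goes to the LLM regardless of its 0.99 confidence.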

**Now also refines regex-derived people.**
After #1148's high-pronoun-signal fix, the regex detector can promote
non-people to the `people` bucket (e.g. a capitalized common noun that
happened to appear near pronouns). The LLM now gets a chance to clean
those up, while git-authored people are still skipped.

**Robust JSON extraction.**
Small local models routinely wrap JSON output in prose ("Sure, here's
the classification: {…}"). The previous code-fence stripper failed on
that. `_extract_json_candidates` now does balanced-bracket extraction
with string-aware quote handling, so it recovers JSON from:
- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays
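
A minimal sketch of balanced-bracket extraction with string-aware quote handling follows. It is an independent illustration, not the project's `_extract_json_candidates`; the function names and the choice to return parsed values are assumptions:

```python
import json

PAIRS = {"{": "}", "[": "]"}


def find_balanced_end(text, start):
    """Return the index of the bracket closing text[start], or None."""
    stack = []
    in_string = False
    escaped = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            # Inside a JSON string, brackets are plain characters.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return None  # mismatched closer
            if not stack:
                return pos
    return None


def extract_json_candidates(text):
    """Parse every top-level JSON object or array embedded in `text`."""
    results = []
    i = 0
    while i < len(text):
        end = find_balanced_end(text, i) if text[i] in PAIRS else None
        if end is not None:
            try:
                results.append(json.loads(text[i:end + 1]))
                i = end + 1
                continue
            except json.JSONDecodeError:
                pass  # balanced but not JSON, e.g. "[sic]"
        i += 1
    return results
```

Because scanning only starts at an opening bracket, stray quotes or apostrophes in the surrounding prose do not disturb the string tracking.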

**Prompt guidance for frameworks vs user projects.**
Added an explicit instruction: frameworks, runtimes, APIs, cloud
services, and third-party vendors (Angular, OpenAPI, Terraform, Bun,
Google, etc.) are TOPIC unless the context clearly says it's the user's
own codebase. Directly addresses a false-positive pattern observed
during dev runs.

**Defensive mtime.**
`convo_scanner._safe_mtime` catches OSError during `stat()` — permission
changes, filesystem races, broken symlinks — and sorts the affected file
to the end of the newest-first order rather than crashing the scan.
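
One way to get this behaviour, sketched with assumed names (`safe_mtime`, `newest_first`) rather than the actual `convo_scanner` code, is to return negative infinity on failure so the file sorts last under a descending sort:

```python
import os


def safe_mtime(path):
    """Return the file's mtime, or -inf if stat() fails for any OS reason."""
    try:
        return os.stat(path).st_mtime
    except OSError:
        # Covers permission errors, races with deletion, and broken symlinks.
        return float("-inf")


def newest_first(paths):
    # Unreadable files get -inf, so they land at the end of the list.
    return sorted(paths, key=safe_mtime, reverse=True)
```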

**Cosmetic:** merged two adjacent f-strings on the same line in
`backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON
extraction variants, authoritative-source helpers, refining high-
confidence regex projects, and end-to-end LLM refinement preserving the
uncertain bucket.

2026-04-24 01:30:40 -03:00

backends

fix(llm): tighter refinement — word boundaries, JSON extraction, authoritative sources

2026-04-24 01:30:40 -03:00

i18n

fix(entity): reduce noise in regex-based detection

2026-04-24 00:20:32 -03:00

instructions

fix: add mempalace-mcp console entry point for pipx/uv compatibility

2026-04-21 01:26:00 -03:00

sources

fix(sources): address Copilot review on #1014

2026-04-18 17:17:50 -03:00

__init__.py

fix: upgrade chromadb to >=1.5.4 for python 3.13/3.14 compatibility

2026-04-18 12:05:46 -07:00

__main__.py

MemPalace: palace architecture, AAAK compression, knowledge graph

2026-04-04 18:16:04 -07:00

cli.py

feat(init): wire --llm flag and convo_scanner into discover_entities

2026-04-24 00:47:14 -03:00

closet_llm.py

release: v3.3.0 (#839)

2026-04-13 18:25:01 -07:00

config.py

refactor(entity_detector): make multi-language extensible via i18n JSON

2026-04-15 08:52:42 -03:00

convo_miner.py

Merge pull request #681 from jphein/fix/unicode-checkmark

2026-04-18 23:27:57 -07:00

convo_scanner.py

fix(llm): tighter refinement — word boundaries, JSON extraction, authoritative sources

2026-04-24 01:30:40 -03:00

dedup.py

refactor: route all chromadb access through ChromaBackend

2026-04-14 00:31:16 -03:00

dialect.py

fix: address i18n review issues from PR #718

2026-04-15 11:03:28 +05:00

diary_ingest.py

release: v3.3.0 (#839)

2026-04-13 18:25:01 -07:00

entity_detector.py

fix(entity): reduce noise in regex-based detection

2026-04-24 00:20:32 -03:00

entity_registry.py

Merge pull request #931 from mvalentsev/fix/i18n-entity-metadata

2026-04-16 15:54:01 -03:00

exporter.py

fix: restrict file permissions on sensitive palace data (#814)

2026-04-15 00:27:03 -07:00

fact_checker.py

release: v3.3.0 (#839)

2026-04-13 18:25:01 -07:00

general_extractor.py

MemPalace: palace architecture, AAAK compression, knowledge graph

2026-04-04 18:16:04 -07:00

hooks_cli.py

fix: add wing param to diary_write/diary_read, derive from transcript path (#659)

2026-04-23 15:07:25 -07:00

instructions_cli.py

fix: add explicit UTF-8 encoding to read_text() calls (#776)

2026-04-16 16:00:29 +05:00

knowledge_graph.py

fix(sources): address Copilot review on #1014

2026-04-18 17:17:50 -03:00

layers.py

fix: guard Layer3.search_raw against None doc/meta from ChromaDB (#1011)

2026-04-18 13:30:57 -07:00

llm_client.py

fix(llm): tighter refinement — word boundaries, JSON extraction, authoritative sources

2026-04-24 01:30:40 -03:00

llm_refine.py

fix(llm): tighter refinement — word boundaries, JSON extraction, authoritative sources

2026-04-24 01:30:40 -03:00

mcp_server.py

fix: treat empty string as no filter in mempalace_search wing/room (#1097)

2026-04-23 15:19:18 -07:00

migrate.py

fix: upgrade chromadb to >=1.5.4 for python 3.13/3.14 compatibility

2026-04-18 12:05:46 -07:00

miner.py

Merge remote-tracking branch 'upstream/develop' into fix/status-paginate-large-palaces

2026-04-19 02:02:28 -05:00

normalize.py

fix: add provenance header and speaker IDs to Slack transcript imports (#815)

2026-04-15 00:27:01 -07:00

onboarding.py

test: add comprehensive test coverage (35% → 58%, threshold 50%)

2026-04-08 20:54:56 +03:00

palace_graph.py

fix: add threading lock to graph cache, expand docstring

2026-04-16 09:00:36 -07:00

palace.py

remove unnecessary comment

2026-04-16 10:38:38 +05:00

project_scanner.py

fix(llm): tighter refinement — word boundaries, JSON extraction, authoritative sources

2026-04-24 01:30:40 -03:00

py.typed

chore: tighten chromadb version range and add py.typed marker

2026-04-07 18:51:42 -03:00

query_sanitizer.py

fix: address Copilot review comments on PR #739

2026-04-12 23:07:46 -03:00

README.md

MemPalace: palace architecture, AAAK compression, knowledge graph

2026-04-04 18:16:04 -07:00

repair.py

refactor: route all chromadb access through ChromaBackend

2026-04-14 00:31:16 -03:00

room_detector_local.py

fix: skip unreachable reparse points in detect_rooms_from_folders (#558)

2026-04-11 16:16:06 -07:00

searcher.py

Merge pull request #999 from jphein/fix/searcher-none-metadata

2026-04-18 13:41:52 -07:00

spellcheck.py

MemPalace: palace architecture, AAAK compression, knowledge graph

2026-04-04 18:16:04 -07:00

split_mega_files.py

Merge pull request #681 from jphein/fix/unicode-checkmark

2026-04-18 23:27:57 -07:00

sweeper.py

fix: address Copilot review on release/3.3.2

2026-04-19 18:19:28 -03:00

version.py

release: v3.3.3

2026-04-23 16:44:22 -07:00

README.md

mempalace/ — Core Package

The Python package that powers MemPalace. All modules, all logic.

Modules

| Module | What it does |
|--------|--------------|
| `cli.py` | CLI entry point: routes to mine, search, init, compress, wake-up |
| `config.py` | Configuration loading: `~/.mempalace/config.json`, env vars, defaults |
| `normalize.py` | Converts 5 chat formats (Claude Code JSONL, Claude.ai JSON, ChatGPT JSON, Slack JSON, plain text) to standard transcript format |
| `miner.py` | Project file ingest: scans directories, chunks by paragraph, stores to ChromaDB |
| `convo_miner.py` | Conversation ingest: chunks by exchange pair (Q+A), detects rooms from content |
| `searcher.py` | Semantic search via ChromaDB vectors: filters by wing/room, returns verbatim + scores |
| `layers.py` | 4-layer memory stack: L0 (identity), L1 (critical facts), L2 (room recall), L3 (deep search) |
| `dialect.py` | AAAK compression: entity codes, emotion markers, 30x lossless ratio |
| `knowledge_graph.py` | Temporal entity-relationship graph: SQLite, time-filtered queries, fact invalidation |
| `palace_graph.py` | Room-based navigation graph: BFS traversal, tunnel detection across wings |
| `mcp_server.py` | MCP server: 19 tools, AAAK auto-teach, Palace Protocol, agent diary |
| `onboarding.py` | Guided first-run setup: asks about people/projects, generates AAAK bootstrap + wing config |
| `entity_registry.py` | Entity code registry: maps names to AAAK codes, handles ambiguous names |
| `entity_detector.py` | Auto-detects people and projects from file content |
| `general_extractor.py` | Classifies text into 5 memory types (decision, preference, milestone, problem, emotional) |
| `room_detector_local.py` | Maps folders to room names using 70+ patterns, no API |
| `spellcheck.py` | Name-aware spellcheck: won't "correct" proper nouns in your entity registry |
| `split_mega_files.py` | Splits concatenated transcript files into per-session files |

Architecture

User → CLI → miner/convo_miner → ChromaDB (palace)
                                     ↕
                              knowledge_graph (SQLite)
                                     ↕
User → MCP Server → searcher → results
                  → kg_query → entity facts
                  → diary    → agent journal

The palace (ChromaDB) stores verbatim content. The knowledge graph (SQLite) stores structured relationships. The MCP server exposes both to any AI tool.