mempalace

Author	SHA1	Message	Date
MSL	b99e54546b	feat(init): context-aware corpus detection 10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5. This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests. The headline change is that `mempalace init` now understands what kind of folder you're pointing it at — AI conversations, regular writing, code, narrative — and adapts how it classifies entities accordingly. The same folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) used to dump those into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean. But the broader change is that `mempalace init` got upgraded across the board — smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports. ## What changes for users (in order, from `pip install` onwards) Install — `pip install mempalace` is unchanged. The package itself didn't shift. First run — `mempalace init <folder>`: 1. `init` examines your folder before classifying anything. A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step — it dove straight into entity detection with no corpus context. 2. LLM-assisted classification is now ON by default. v3.3.3 made `--llm` opt-in. 
The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3 — three buckets are supported by `mempalace.llm_client`: - Anthropic (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`) — the official Messages API. This is the path live-verified end-to-end in this PR with Haiku 4.5. Cost: ~\$0.01 per `init`. - Ollama (`--llm-provider ollama` — the default) — local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise. - OpenAI-compatible (`--llm-provider openai-compat` + `--llm-endpoint`) — per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here. 3. `init` never blocks on a missing LLM. No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. `--no-llm` is the new explicit opt-out. 4. `init` shows you what it detected. A one-line banner — `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)` — tells you at a glance whether mempalace understood your folder. 5. Entity classification gets smarter across the board. Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all. 6. Agent personas live in their own bucket. 
Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean. 7. Detection result persists to `<palace>/.mempalace/origin.json` with a `schema_version: 1` envelope, so downstream tools can read it. 8. Re-running `init` is now idempotent. Bug fix — running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR. Later — when your folder grows: 9. `mempalace mine --redetect-origin` is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again — `init` is now idempotent (item 8), so re-running it on the same folder is safe. ## Behind the changes - New module `mempalace/corpus_origin.py` (422 lines) with two-tier detection: regex heuristic with co-occurrence rule (suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, llama-rancher journals don't false-positive), and LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure with belt-and-suspenders user-vs-agent disambiguation. - Entity-classification consumer wiring. `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`. - LLM-refine consumer wiring. 
`llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged. - `init` overhaul. Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). `--llm` flipped to default-on. `--no-llm` added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing `llm_client` module. - `mine` flag. `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`. - `CLAUDE.md` design principle reworded — "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback. ## Cost story - Anthropic (verified path): ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`. - Ollama / local LLM runtime: zero cost. Fully offline. - OpenAI-compatible service: depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes. - No LLM at all: graceful fallback to heuristic-only. Zero cost. `init` never blocks. ## Backwards compatibility - All public function signatures gained the `corpus_origin` kwarg as optional (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged — no `agent_personas` key, no behavioral change. - The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work. - `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3. 
## Test coverage - 19 unit tests in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation. - 29 integration tests in `tests/test_corpus_origin_integration.py` covering end-to-end through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition). - 1354 total mempalace tests pass. 2 pre-existing environmental failures (`test_mcp_stdio_protection` — chromadb optional dep) unrelated to this change; they fail on plain `develop` too. - Live-smoke-tested with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures. ## Hygiene guardrail This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. RED if anything slips in. Allowlist for legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.	2026-04-26 12:37:26 -07:00
Igor Lins e Silva	fe051adc73	feat(graph): cross-wing tunnels by shared topics (#1180) When two wings have one or more confirmed TOPIC labels in common, the miner now drops a symmetric tunnel between them at mine time so the palace graph reflects shared themes (frameworks, vendors, recurring concepts). - llm_refine: TOPIC label routes to a dedicated `topics` bucket so the signal survives confirmation instead of getting collapsed into `uncertain` and dropped. - entity_detector / project_scanner: bucket plumbed through the detection pipeline; `confirm_entities` returns confirmed topics alongside people/projects. - miner.add_to_known_entities: optional `wing` parameter records the confirmed topics under `topics_by_wing` in `~/.mempalace/known_entities.json`. Wing names do NOT leak into the flat known-name set used by drawer-tagging. - palace_graph: `compute_topic_tunnels` and `topic_tunnels_for_wing` create symmetric tunnels via the existing `create_tunnel` API so they share dedup and persistence with explicit tunnels. - miner.mine: post-file-loop pass calls `topic_tunnels_for_wing` for the freshly-mined wing. Failures are logged but never abort the mine. - config: `topic_tunnel_min_count` knob (env `MEMPALACE_TOPIC_TUNNEL_MIN_COUNT` or `~/.mempalace/config.json`), default 1. Tests cover topic persistence through init->mine, tunnel creation when wings share a topic, no tunnel below threshold, cross-wing tunnel retrieval via `list_tunnels`, dedup on recompute, case-insensitive overlap, and the end-to-end mine-time wiring. Out of scope for this PR (called out in the PR body): manifest-dependency overlap, per-topic allow/deny lists, search-result surfacing.	2026-04-24 23:06:26 -03:00
Igor Lins e Silva	6aebf458ff	fix(entity): reduce noise in regex-based detection The pattern-matching detector had several systematic false positives that crowded the init review with nonsense. Concrete fixes: - CamelCase extraction: add `[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+` to candidate patterns so `MemPalace`, `ChromaDB`, `OpenAI`, `ChatGPT` are visible. Previously `MemPalace` fragmented into `Mem` + `Palace`. - Dialogue `^NAME:\s` requires >=2 matches to count. A single metadata line like `Created: 2026-04-21` was scoring as dialogue and classifying `Created` as a person. - Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b` (version-only). The previous `\b{name}[-v]\w+` matched `context-manager`, `multi-word`, etc. - every hyphenated compound. - Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files during scan. They produce pure-English-prose noise (`Contributor`, `Software`, `Covered`, `Before`). - Extra SKIP_DIRS: `.terraform`, `vendor`, `target`. - Expand stopword list with capitalized participles/descriptors that commonly appear at sentence start: `created`, `updated`, `extracted`, `processed`, `total`, `summary`, `auto`, `multi`, `hybrid`, `context`, `bridge`, `batch`, `local`, `native`, `never`, `before`, `after`, etc. - classify_entity: high-pronoun single-category signal now classifies as person. A diary's main character gets referenced with pronouns, not dialogue markers - requiring two signal categories demoted `Lu` (16 pronoun hits across 30 mentions) to uncertain. Gate on `pronoun_hits >= 5 AND pronoun_hits / frequency >= 0.2` so common sentence-start words (`Never`, `Before`) with incidental proximity stay uncertain.	2026-04-24 00:20:32 -03:00
Igor Lins e Silva	f895bc58e6	fix(entity_detector): script-aware word boundaries for combining-mark scripts Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w. This means \b splits mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b fails after the final matra of कहा. Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script whose words contain combining marks. Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary. Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru, it are unchanged. Changes: - mempalace/i18n/__init__.py: add _script_boundary, _expand_b, _wrap_candidate, _collect_entity_section; candidate_patterns are now returned fully-wrapped (boundary + capture group applied) - mempalace/entity_detector.py: extract_candidates compiles pre-wrapped candidate patterns directly instead of re-wrapping with \b - tests/test_entity_detector.py: 5 new tests for Devanagari boundaries (name extraction with/without boundary_chars, person-verb firing, English regression)	2026-04-15 22:18:52 -03:00
Igor Lins e Silva	b214aced90	refactor(entity_detector): make multi-language extensible via i18n JSON Move all entity-detection lexical patterns (person verbs, pronouns, dialogue markers, project verbs, stopwords, candidate character class) out of hardcoded module-level constants and into the entity section of each locale's JSON in mempalace/i18n/. Adds a languages parameter to every public function so callers union patterns across the desired locales. The default stays ("en",), so all existing callers and tests behave unchanged. Also adds: - get_entity_patterns(langs) helper in mempalace/i18n/ that merges patterns across requested languages, dedupes lists, unions stopwords, and falls back to English for unknown locales - MempalaceConfig.entity_languages property + setter, with env var override (MEMPALACE_ENTITY_LANGUAGES, comma-separated) - mempalace init --lang en,pt-br flag (persists to config.json) - Per-language candidate_pattern so non-Latin scripts (Cyrillic, Devanagari, CJK) can register their own character classes instead of being silently dropped by the ASCII-only [A-Z][a-z]+ default - _build_patterns LRU cache keyed by (name, languages) so multi-language callers don't poison each other's cache slots Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that needed entity_detector changes and inlined a _PTBR variant of every constant. That doesn't scale past 2-3 languages — every text gets checked against every language's patterns regardless of relevance, and candidate extraction still drops accented and non-Latin names. This PR sets the standard so future locale contributors only edit one JSON file (no Python changes), and entity detection scales linearly with how many languages a user actually enabled, not how many ship.	2026-04-15 08:52:42 -03:00
google-labs-jules[bot]	d886a62d8a	Optimize entity detection with regex caching and pre-compilation - Use functools.lru_cache to cache compiled patterns for entity names. - Pre-compile static pronoun patterns into a single regex. - Remove redundant .lower() calls in score_entity loop. Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>	2026-04-13 21:35:53 +00:00
Igor Lins e Silva	39e1651af3	fix: correct typo in entity_detector interactive classification prompt (#755) '(r)roject' had a duplicate 'r', making it read as '(r)roject' instead of the intended '(r)project'. Small UX fix; no behavior change. Co-authored-by: Arnold Wender <arnold.wender@gmail.com>	2026-04-13 01:43:57 -03:00
Ben Sigman	6af6fe3dda	Merge pull request #54 from adv3nt3/fix/narrow-exception-handling fix: narrow bare except Exception to specific types where safe	2026-04-07 13:54:05 -07:00
Renato Oliveira	cfe878204e	fix: update input prompt for entity confirmation in entity_detector.py Refine the prompt for distinguishing between person and project entities by adjusting the wording for clarity.	2026-04-07 11:41:15 -03:00
adv3nt3	312d380aab	fix: narrow bare except Exception to specific types where safe Replace broad except Exception with specific exception types in 6 sites where the expected failure mode is well-defined: - normalize.py: OSError for file read, ImportError for optional import - miner.py: OSError for file read_text - entity_detector.py: OSError for file read in scan loop - convo_miner.py: (OSError, ValueError) for normalize which reads and parses files - entity_registry.py: (URLError, OSError, JSONDecodeError, KeyError) for Wikipedia lookup fallback ChromaDB except Exception sites (~30) are left broad for now. chromadb.errors defines NotFoundError, DuplicateIDError, InvalidDimensionException etc., but narrowing those sites requires importing from chromadb.errors and validating across supported versions (>=0.4.0). MCP server handlers also left broad for resilience.	2026-04-07 13:51:27 +02:00
Milla Jovovich	068dbd9a7b	MemPalace: palace architecture, AAAK compression, knowledge graph The memory system: - Palace structure: Wings (people/projects) → Rooms (topics) → Closets (AAAK compressed) → Drawers (verbatim transcripts) - Halls connect related rooms within a wing - Tunnels cross-reference rooms across wings - AAAK: 30x lossless compression dialect for AI agents - Knowledge graph: temporal entity-relationship triples (SQLite) - Palace graph: room-based navigation with tunnel detection - MCP server: 19 tools — search, graph traversal, agent diary, AAAK auto-teach - Onboarding: guided setup generates wing config + AAAK entity registry - Contradiction detection: catches wrong pronouns, names, ages - Auto-save hooks for Claude Code 96.6% Recall@5 on LongMemEval — highest zero-API score published. 100% with optional Haiku rerank (500/500). Local. Free. No API key required.	2026-04-04 18:16:04 -07:00
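
The regex changes listed in commit 6aebf458ff can be tried in isolation. The sketch below is not the code from `mempalace/entity_detector.py`: the helper names (`versioned_pattern`, `old_versioned_pattern`, `pronoun_gate`) are hypothetical and only the pattern strings and the thresholds are taken from the commit message. It assumes the CamelCase alternation uses a plain `|`.

```python
import re

# CamelCase candidate pattern quoted in commit 6aebf458ff.
CAMEL_CASE = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+")


def versioned_pattern(name: str) -> re.Pattern:
    """Version-only pattern from the commit message (hypothetical wrapper)."""
    return re.compile(rf"\b{re.escape(name)}[-_]v?\d+(?:\.\d+)*\b")


def old_versioned_pattern(name: str) -> re.Pattern:
    """Previous, over-broad pattern, kept here only for comparison."""
    return re.compile(rf"\b{re.escape(name)}[-v]\w+")


def pronoun_gate(pronoun_hits: int, frequency: int) -> bool:
    """Gate from the commit: hits >= 5 and hits / frequency >= 0.2."""
    return frequency > 0 and pronoun_hits >= 5 and pronoun_hits / frequency >= 0.2


text = "MemPalace stores vectors in ChromaDB; OpenAI and ChatGPT are mentioned."
camel_hits = CAMEL_CASE.findall(text)
print(camel_hits)

# The old pattern fires on any hyphenated compound; the new one does not.
print(bool(old_versioned_pattern("context").search("context-manager")))
print(bool(versioned_pattern("context").search("context-manager")))
print(bool(versioned_pattern("mempalace").search("mempalace-v3.3.3")))

# `Lu` from the commit message: 16 pronoun hits across 30 mentions.
print(pronoun_gate(16, 30))
```

The gate combines an absolute floor with a ratio, so a word seen 100 times with 5 incidental pronoun hits (ratio 0.05) stays below the threshold even though it clears the floor.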

11 Commits
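
The word-boundary problem described in commit f895bc58e6 is reproducible with the standard `re` module alone. The sketch below shows one way to replace `\b` with a lookaround built from a declared character range. The function name `expand_b` and its signature are assumptions made for illustration; the commit names a private `_expand_b` helper in `mempalace/i18n/__init__.py`, but its actual implementation is not shown in this log. The `boundary_chars` value is the Hindi example quoted in the commit message.

```python
import re


def expand_b(pattern: str, boundary_chars: str) -> str:
    """Swap every \\b for a lookaround that treats boundary_chars as in-word.

    Hypothetical helper. The replacement is purely textual, so it assumes
    each "\\b" in the pattern is a word-boundary assertion.
    """
    inside = f"[{boundary_chars}]"
    boundary = f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"
    return pattern.replace(r"\b", boundary)


# Example value from the commit message for the Hindi locale.
HINDI_CHARS = r"\w\u0900-\u097F"

text = "राज ने कहा"
plain = r"\bराज\s+ने\s+कहा\b"

# Standard \b fails after the final matra, because the matra is not \w.
plain_match = re.search(plain, text)
# The script-aware version treats the Devanagari block as inside-word.
aware_match = re.search(expand_b(plain, HINDI_CHARS), text)

print(plain_match)
print(aware_match.group() if aware_match else None)

# With boundary_chars set to just \w the behavior matches plain \b,
# which is why locales without combining marks are unaffected.
english = expand_b(r"\bAnita\b", r"\w")
print(bool(re.search(english, "Anita said")), bool(re.search(english, "Anitas")))
```

A single alternation covers both the leading and trailing case, so the same replacement string works wherever `\b` appears in a pattern.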