Commit Graph

8 Commits

Author SHA1 Message Date
Igor Lins e Silva f895bc58e6 fix(entity_detector): script-aware word boundaries for combining-mark scripts
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.

Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
2026-04-15 22:18:52 -03:00
Igor Lins e Silva b214aced90 refactor(entity_detector): make multi-language extensible via i18n JSON
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
2026-04-15 08:52:42 -03:00
google-labs-jules[bot] d886a62d8a Optimize entity detection with regex caching and pre-compilation
- Use functools.lru_cache to cache compiled patterns for entity names.
- Pre-compile static pronoun patterns into a single regex.
- Remove redundant .lower() calls in score_entity loop.

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
2026-04-13 21:35:53 +00:00
Igor Lins e Silva 39e1651af3 fix: correct typo in entity_detector interactive classification prompt (#755)
'(r)roject' had a duplicate 'r', making it read as '(r)roject'
instead of the intended '(r)project'.

Small UX fix — no behavior change.

Co-authored-by: Arnold Wender <arnold.wender@gmail.com>
2026-04-13 01:43:57 -03:00
Ben Sigman 6af6fe3dda Merge pull request #54 from adv3nt3/fix/narrow-exception-handling
fix: narrow bare except Exception to specific types where safe
2026-04-07 13:54:05 -07:00
Renato Oliveira cfe878204e fix: update input prompt for entity confirmation in entity_detector.py
Refine the prompt for distinguishing between person and project entities by adjusting the wording for clarity.
2026-04-07 11:41:15 -03:00
adv3nt3 312d380aab fix: narrow bare except Exception to specific types where safe
Replace broad except Exception with specific exception types in 6
sites where the expected failure mode is well-defined:

- normalize.py: OSError for file read, ImportError for optional import
- miner.py: OSError for file read_text
- entity_detector.py: OSError for file read in scan loop
- convo_miner.py: (OSError, ValueError) for normalize which reads
  and parses files
- entity_registry.py: (URLError, OSError, JSONDecodeError, KeyError)
  for Wikipedia lookup fallback

ChromaDB except Exception sites (~30) are left broad for now.
chromadb.errors defines NotFoundError, DuplicateIDError,
InvalidDimensionException etc., but narrowing those sites requires
importing from chromadb.errors and validating across supported
versions (>=0.4.0). MCP server handlers also left broad for
resilience.
2026-04-07 13:51:27 +02:00
Milla Jovovich 068dbd9a7b MemPalace: palace architecture, AAAK compression, knowledge graph
The memory system:
- Palace structure: Wings (people/projects) → Rooms (topics) → Closets (AAAK compressed) → Drawers (verbatim transcripts)
- Halls connect related rooms within a wing
- Tunnels cross-reference rooms across wings
- AAAK: 30x lossless compression dialect for AI agents
- Knowledge graph: temporal entity-relationship triples (SQLite)
- Palace graph: room-based navigation with tunnel detection
- MCP server: 19 tools — search, graph traversal, agent diary, AAAK auto-teach
- Onboarding: guided setup generates wing config + AAAK entity registry
- Contradiction detection: catches wrong pronouns, names, ages
- Auto-save hooks for Claude Code

96.6% Recall@5 on LongMemEval — highest zero-API score published.
100% with optional Haiku rerank (500/500).
Local. Free. No API key required.
2026-04-04 18:16:04 -07:00