mempalace

Author	SHA1	Message	Date
Igor Lins e Silva	f895bc58e6	fix(entity_detector): script-aware word boundaries for combining-mark scripts Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w. This means \b splits mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b fails after the final matra of कहा. Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script whose words contain combining marks. Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary. Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru, it are unchanged. Changes: - mempalace/i18n/__init__.py: add _script_boundary, _expand_b, _wrap_candidate, _collect_entity_section; candidate_patterns are now returned fully-wrapped (boundary + capture group applied) - mempalace/entity_detector.py: extract_candidates compiles pre-wrapped candidate patterns directly instead of re-wrapping with \b - tests/test_entity_detector.py: 5 new tests for Devanagari boundaries (name extraction with/without boundary_chars, person-verb firing, English regression)	2026-04-15 22:18:52 -03:00
Igor Lins e Silva	122ce38811	Merge pull request #907 from Archetipo95/feat/italian-i18n-support feat: add Italian language support	2026-04-15 18:05:13 -03:00
mvalentsev	4221589df2	fix(i18n): address review feedback on pt-br.json - dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching) - entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives - pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)	2026-04-15 23:32:31 +05:00
mvalentsev	3d13a72ae0	feat(i18n): add Brazilian Portuguese locale with entity detection (closes #117) CLI strings, AAAK instruction, regex patterns, and entity section with person-verb, pronoun, dialogue, and candidate patterns for Latin+diacritics names (Joao, Ines, Angela). Follows the i18n entity framework from #911.	2026-04-15 23:32:31 +05:00
Martin Masevski	69453b2180	feat: add italian entity patterns	2026-04-15 19:18:23 +02:00
Martin Masevski	2e998db0b9	feat: add italian i18n support	2026-04-15 19:15:55 +02:00
Igor Lins e Silva	73a2f82d5b	Merge pull request #760 from mvalentsev/feat/i18n-russian feat: add Russian language support (ru.json)	2026-04-15 13:46:04 -03:00
Igor Lins e Silva	312b3b5f0e	Merge pull request #758 from mvalentsev/fix/i18n-review-issues fix: address i18n review issues from PR #718	2026-04-15 13:45:49 -03:00
mvalentsev	4b998de77a	feat(i18n): expand Russian entity stopwords with prepositions and conjunctions Adds 34 prepositions and conjunctions to reduce false positives in entity detection when these words appear sentence-initial. Co-Authored-By: almirus <almirus@users.noreply.github.com>	2026-04-15 21:14:51 +05:00
mvalentsev	3e49522a42	fix(i18n): apply review feedback on ru.json (#760) - mine_skip: "повторной раскопки" -> "повторной обработки" - quote_pattern: add Russian guillemet quotes «» Co-Authored-By: almirus <almirus@users.noreply.github.com>	2026-04-15 20:17:16 +05:00
mvalentsev	d6bd7de5f6	feat(i18n): add entity detection section to Russian locale Cyrillic candidate/multi-word patterns, person-verb patterns (сказал, спросил, ответил, etc.), pronoun patterns, dialogue markers, direct address, and Russian stopwords. Follows the i18n entity framework from #911.	2026-04-15 18:16:25 +05:00
mvalentsev	b87ada3c96	feat: add Russian language support to i18n module Add ru.json with full Russian translations for CLI strings, palace terminology, AAAK compression instruction, and regex patterns for topic/action extraction with Cyrillic character classes. No code changes needed -- the i18n module auto-discovers language files via *.json glob in the i18n directory.	2026-04-15 18:15:15 +05:00
Igor Lins e Silva	b214aced90	refactor(entity_detector): make multi-language extensible via i18n JSON Move all entity-detection lexical patterns (person verbs, pronouns, dialogue markers, project verbs, stopwords, candidate character class) out of hardcoded module-level constants and into the entity section of each locale's JSON in mempalace/i18n/. Adds a languages parameter to every public function so callers union patterns across the desired locales. The default stays ("en",), so all existing callers and tests behave unchanged. Also adds: - get_entity_patterns(langs) helper in mempalace/i18n/ that merges patterns across requested languages, dedupes lists, unions stopwords, and falls back to English for unknown locales - MempalaceConfig.entity_languages property + setter, with env var override (MEMPALACE_ENTITY_LANGUAGES, comma-separated) - mempalace init --lang en,pt-br flag (persists to config.json) - Per-language candidate_pattern so non-Latin scripts (Cyrillic, Devanagari, CJK) can register their own character classes instead of being silently dropped by the ASCII-only [A-Z][a-z]+ default - _build_patterns LRU cache keyed by (name, languages) so multi-language callers don't poison each other's cache slots Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that needed entity_detector changes and inlined a _PTBR variant of every constant. That doesn't scale past 2-3 languages — every text gets checked against every language's patterns regardless of relevance, and candidate extraction still drops accented and non-Latin names. This PR sets the standard so future locale contributors only edit one JSON file (no Python changes), and entity detection scales linearly with how many languages a user actually enabled, not how many ship.	2026-04-15 08:52:42 -03:00
mvalentsev	d565718922	fix: address i18n review issues from PR #718 Three issues flagged by bensig on the i18n PR before merge: 1. ko.json: status_drawers used {drawers} instead of {count}, causing the Korean UI to show the raw template string instead of the actual drawer count. All other 7 languages use {count}. 2. Test file was shipped inside the package at mempalace/i18n/test_i18n.py with a sys.path.insert hack. Moved to tests/test_i18n.py per the project convention in AGENTS.md. 3. Dialect.from_config() passed lang=config.get("lang") which defaults to None, causing __init__ to inherit whatever language was loaded earlier via module-level state. Now defaults to "en" explicitly so from_config is deterministic regardless of prior load_lang() calls. Added two regression tests for the ko.json fix and the state leak.	2026-04-15 11:03:28 +05:00
Igor Lins e Silva	c3f9b76d9a	fix(ci): resolve ruff lint + format failures - Remove unused `json` and `current_lang` imports from mempalace/i18n/test_i18n.py (F401) - Reformat Dialect.__init__ signature in mempalace/dialect.py (ruff format collapses multi-line signature, adds blank line after lazy import) Both auto-fixes from `ruff check --fix` / `ruff format`. No behavioral changes.	2026-04-12 17:14:06 -03:00
MSL	baf3c0ab64	feat: i18n support — 8 languages for MemPalace Add language dictionaries: English, French, Korean, Japanese, Spanish, German, Simplified Chinese, Traditional Chinese. Each language is a single JSON file with: - Localized terms (palace, wing, closet, drawer, etc.) - CLI output strings with {var} interpolation - AAAK compression instructions in that language - Regex patterns for offline topic/quote/action extraction Usage: Dialect(lang="ko") or set "language": "ko" in config. Contributors can add new languages by copying en.json and translating. Dialect class now accepts lang param and loads AAAK instruction + regex patterns from the i18n dictionary automatically. Tests: mempalace/i18n/test_i18n.py — all 8 languages pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:09:47 -07:00
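
Note on commit f895bc58e6: the boundary problem and the lookaround fix described in that message can be reproduced in a few lines. This is a minimal sketch, not the code from mempalace/i18n/__init__.py. The helper names `script_boundary` and `expand_b` are hypothetical stand-ins, and the naive string replacement assumes the pattern contains no escaped backslash directly followed by `b`.

```python
import re


def script_boundary(boundary_chars: str) -> str:
    """Build a zero-width boundary that treats boundary_chars as inside-word.

    Hypothetical helper: it matches where an inside-word character meets
    a non-inside-word character (or a string edge), like \\b but with a
    caller-supplied character class instead of \\w alone.
    """
    inside = f"[{boundary_chars}]"
    return f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"


def expand_b(pattern: str, boundary_chars: str) -> str:
    """Replace every \\b in pattern with the script-aware boundary.

    Assumption: a plain textual replace is good enough for this sketch.
    """
    return pattern.replace(r"\b", script_boundary(boundary_chars))


# Character class quoted in the commit message for Hindi.
HINDI_CHARS = r"\w\u0900-\u097F"

# Person-verb pattern from the commit message, with standard \b.
person_verb = r"\bराज\s+ने\s+कहा\b"
text = "राज ने कहा"

plain_match = re.search(person_verb, text)
aware_match = re.search(expand_b(person_verb, HINDI_CHARS), text)

# Name extraction: a simple Devanagari run bounded on both sides.
name_pattern = r"\b[\u0900-\u097F]+\b"
plain_name = re.search(name_pattern, "अनीता").group()
aware_name = re.search(expand_b(name_pattern, HINDI_CHARS), "अनीता").group()

print(plain_match is None, aware_match is not None)
print(plain_name, aware_name)
```

With standard `\b` the person-verb pattern does not match and the name loses its final matra; with the expanded boundary both behave as the commit message describes.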

16 Commits
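
Note on commit b214aced90: the merge behavior described there (union patterns across the requested locales, dedupe lists, union stopwords, fall back to English for unknown locales) might look roughly like the sketch below. It is illustrative only: `merge_entity_patterns` is a hypothetical name, not the real `get_entity_patterns`, and the in-memory `LOCALES` dict with its sample values stands in for the entity sections of the locale JSON files, whose actual contents are not shown in this log.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical, heavily trimmed stand-ins for per-locale "entity" sections.
LOCALES: Dict[str, Dict[str, List[str]]] = {
    "en": {
        "pronoun_patterns": [r"\bshe\b", r"\bhe\b"],
        "stopwords": ["the", "and"],
    },
    "pt-br": {
        "pronoun_patterns": [r"\bela\b", r"\bele\b"],
        "stopwords": ["de", "e", "the"],
    },
}


def merge_entity_patterns(
    langs: Iterable[str],
    locales: Dict[str, Dict[str, List[str]]] = LOCALES,
    fallback: str = "en",
) -> Dict[str, object]:
    """Merge entity sections across langs.

    Pattern lists keep first-seen order without duplicates, stopwords
    become a set union, and unknown locales fall back to `fallback`.
    """
    merged: Dict[str, List[str]] = {}
    stopwords: Set[str] = set()
    for lang in langs:
        section = locales.get(lang, locales[fallback])
        for key, values in section.items():
            if key == "stopwords":
                stopwords.update(values)
                continue
            bucket = merged.setdefault(key, [])
            for value in values:
                if value not in bucket:  # dedupe, preserving order
                    bucket.append(value)
    return {**merged, "stopwords": stopwords}


both = merge_entity_patterns(("en", "pt-br"))
unknown = merge_entity_patterns(("xx",))
print(both["pronoun_patterns"])
print(sorted(both["stopwords"]))
```

Passing a tuple of languages mirrors the `("en",)` default mentioned in the commit message, so a caller that enables only one locale pays only for that locale's patterns.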