refactor(entity_detector): make multi-language extensible via i18n JSON

Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots
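The merge and cache behaviour listed above can be sketched roughly as follows. This is an illustration only, not the code from this commit: the `_LOCALES` dict stands in for the per-locale JSON files, and the key names, `merge_entity_patterns`, and `build_patterns` are hypothetical.

```python
import re
from functools import lru_cache

# Hypothetical stand-in for the "entity" section of each locale's JSON file.
# The key names and values are assumptions made for this sketch.
_LOCALES = {
    "en": {
        "person_verbs": ["said", "asked"],
        "stopwords": ["the", "and"],
        "candidate_pattern": r"[A-Z][a-z]+",
    },
    "pt-br": {
        "person_verbs": ["disse", "asked"],
        "stopwords": ["o", "e", "and"],
        "candidate_pattern": r"[A-ZÀ-Ý][a-zà-ÿ]+",
    },
}


def merge_entity_patterns(langs=("en",)):
    """Union patterns across langs; unknown locales fall back to English."""
    merged = {"person_verbs": [], "stopwords": set(), "candidate_patterns": []}
    for lang in langs:
        section = _LOCALES.get(lang, _LOCALES["en"])
        # Lists are deduped while keeping first-seen order.
        for verb in section["person_verbs"]:
            if verb not in merged["person_verbs"]:
                merged["person_verbs"].append(verb)
        # Stopwords are a plain set union.
        merged["stopwords"] |= set(section["stopwords"])
        pattern = section["candidate_pattern"]
        if pattern not in merged["candidate_patterns"]:
            merged["candidate_patterns"].append(pattern)
    return merged


@lru_cache(maxsize=256)
def build_patterns(name, languages=("en",)):
    """Compile a 'name + person verb' regex, cached per (name, languages)."""
    merged = merge_entity_patterns(languages)
    verbs = "|".join(re.escape(v) for v in merged["person_verbs"])
    return re.compile(rf"\b{re.escape(name)}\s+(?:{verbs})\b", re.IGNORECASE)
```

Because `languages` is part of the cache key, it has to be hashable (a tuple rather than a list), and a caller using ("en",) never receives a regex compiled for ("en", "pt-br").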
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages: every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
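To make the candidate-extraction problem concrete, here is a small sketch of how an ASCII-only class behaves next to a per-script one. The Cyrillic character class and the `extract_candidates` helper are assumptions for illustration; the patterns actually shipped in the locale JSON files may differ.

```python
import re

# The ASCII-only default named in the commit message.
ASCII_DEFAULT = r"[A-Z][a-z]+"
# Hypothetical Cyrillic class a "ru" locale could register instead.
CYRILLIC = r"[А-ЯЁ][а-яё]+"


def extract_candidates(text, patterns):
    """Return capitalised candidates matched by any pattern, in first-seen order."""
    found = []
    for pattern in patterns:
        for match in re.findall(pattern, text):
            if match not in found:
                found.append(match)
    return found


text = "Иван met Alice"
ascii_only = extract_candidates(text, [ASCII_DEFAULT])          # ['Alice']
with_ru = extract_candidates(text, [ASCII_DEFAULT, CYRILLIC])   # ['Alice', 'Иван']
```

With only the default class the Cyrillic name is never proposed as a candidate, which is the silent drop described above.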
@@ -197,6 +197,42 @@ class MempalaceConfig:
         """Mapping of hall names to keyword lists."""
         return self._file_config.get("hall_keywords", DEFAULT_HALL_KEYWORDS)

+    @property
+    def entity_languages(self):
+        """Languages whose entity-detection patterns should be applied.
+
+        Reads from env var ``MEMPALACE_ENTITY_LANGUAGES`` (comma-separated)
+        first, then the ``entity_languages`` field in ``config.json``,
+        defaulting to ``["en"]``.
+        """
+        env_val = os.environ.get("MEMPALACE_ENTITY_LANGUAGES") or os.environ.get(
+            "MEMPAL_ENTITY_LANGUAGES"
+        )
+        if env_val:
+            return [s.strip() for s in env_val.split(",") if s.strip()] or ["en"]
+        cfg = self._file_config.get("entity_languages")
+        if isinstance(cfg, list) and cfg:
+            return [str(s) for s in cfg]
+        return ["en"]
+
+    def set_entity_languages(self, languages):
+        """Persist the entity-detection language list to ``config.json``."""
+        normalized = [s.strip() for s in languages if s and s.strip()]
+        if not normalized:
+            normalized = ["en"]
+        self._file_config["entity_languages"] = normalized
+        self._config_dir.mkdir(parents=True, exist_ok=True)
+        try:
+            with open(self._config_file, "w", encoding="utf-8") as f:
+                json.dump(self._file_config, f, indent=2, ensure_ascii=False)
+        except OSError:
+            pass
+        try:
+            self._config_file.chmod(0o600)
+        except (OSError, NotImplementedError):
+            pass
+        return normalized
+
     @property
     def hook_silent_save(self):
         """Whether the stop hook saves directly (True) or blocks for MCP calls (False)."""
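The lookup order used by the entity_languages property (env var, then config.json, then English) can be exercised in isolation with a sketch like the one below. `resolve_entity_languages` is a hypothetical standalone function written for illustration; it mirrors the property's logic but takes the config dict and environment as arguments so it can run without a MempalaceConfig instance.

```python
import os


def resolve_entity_languages(file_config, environ=None):
    """Resolve languages: env var first, then config field, then ["en"]."""
    environ = os.environ if environ is None else environ
    env_val = environ.get("MEMPALACE_ENTITY_LANGUAGES") or environ.get(
        "MEMPAL_ENTITY_LANGUAGES"
    )
    if env_val:
        # Comma-separated; blank entries are dropped, empty result means English.
        return [s.strip() for s in env_val.split(",") if s.strip()] or ["en"]
    cfg = file_config.get("entity_languages")
    if isinstance(cfg, list) and cfg:
        return [str(s) for s in cfg]
    return ["en"]


from_env = resolve_entity_languages(
    {"entity_languages": ["ru"]},
    {"MEMPALACE_ENTITY_LANGUAGES": "en, pt-br"},
)  # ['en', 'pt-br']
from_file = resolve_entity_languages({"entity_languages": ["ru"]}, {})  # ['ru']
```

Note that an env var holding only separators (for example " , ") still wins over config.json and resolves to English, since the value itself is non-empty.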