refactor(entity_detector): make multi-language extensible via i18n JSON
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
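
A minimal sketch of the merge behaviour described for get_entity_patterns(langs), not the actual implementation. The in-memory _LOCALES dict stands in for the per-locale JSON files, and the field names (person_verbs, stopwords, candidate_pattern) and the helper name merge_entity_patterns are assumptions made for illustration.

```python
# Hypothetical stand-in for the "entity" section of each locale's JSON file.
_LOCALES = {
    "en": {
        "person_verbs": ["said", "asked"],
        "stopwords": ["the", "and"],
        "candidate_pattern": r"[A-Z][a-z]+",
    },
    "pt-br": {
        "person_verbs": ["disse", "perguntou"],
        "stopwords": ["de", "e"],
        # Illustrative character class only; a real locale may need more.
        "candidate_pattern": r"[A-ZÁÉÍÓÚ][a-záéíóúãõç]+",
    },
}


def merge_entity_patterns(langs=("en",), locales=_LOCALES):
    """Union entity patterns across langs, falling back to English."""
    # Resolve unknown locales to "en" and drop repeats, keeping order.
    resolved = []
    for lang in langs:
        key = lang if lang in locales else "en"
        if key not in resolved:
            resolved.append(key)

    merged_lists = {}
    stopwords = set()
    candidate_patterns = []
    for key in resolved:
        for field, value in locales[key].items():
            if field == "stopwords":
                stopwords.update(value)  # set union
            elif field == "candidate_pattern":
                if value not in candidate_patterns:
                    candidate_patterns.append(value)
            else:
                bucket = merged_lists.setdefault(field, [])
                for item in value:
                    if item not in bucket:  # order-preserving dedupe
                        bucket.append(item)

    return {
        **merged_lists,
        "stopwords": stopwords,
        "candidate_patterns": candidate_patterns,
    }
```

Because each language contributes its own candidate pattern, a caller that enables only ("en",) never pays for the other locales' character classes.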
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages: every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
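
The cache keying mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the real _build_patterns: the function name build_person_pattern, the _PERSON_VERBS table, and the regex shape are all hypothetical. The point it shows is that including the languages tuple in the lru_cache key keeps an English-only caller from receiving a pattern compiled for a different language set.

```python
import re
from functools import lru_cache

# Hypothetical per-language verb lists; the real data lives in locale JSON.
_PERSON_VERBS = {
    "en": ("said", "asked"),
    "pt-br": ("disse", "perguntou"),
}


@lru_cache(maxsize=256)
def build_person_pattern(name: str, languages: tuple = ("en",)):
    """Compile a 'NAME <verb>' regex for the enabled languages only.

    lru_cache keys on every argument, so (name, languages) together
    select the cache slot. languages must be hashable (a tuple).
    """
    verbs = []
    for lang in languages:
        # Unknown locales fall back to English in this sketch.
        for verb in _PERSON_VERBS.get(lang, _PERSON_VERBS["en"]):
            if verb not in verbs:
                verbs.append(verb)
    alternation = "|".join(re.escape(v) for v in verbs)
    return re.compile(
        rf"\b{re.escape(name)}\s+(?:{alternation})\b", re.IGNORECASE
    )
```

Note that ("en", "pt-br") and ("pt-br", "en") are distinct keys here; a caller that wants them to share a slot would need to normalise the order before calling.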
@@ -583,15 +583,19 @@ class EntityRegistry:

     # ── Learn from sessions ──────────────────────────────────────────────────

-    def learn_from_text(self, text: str, min_confidence: float = 0.75) -> list:
+    def learn_from_text(self, text: str, min_confidence: float = 0.75, languages=("en",)) -> list:
         """
         Scan session text for new entity candidates.
         Returns list of newly discovered candidates for review.
+
+        ``languages`` is forwarded to entity detection — pass the user's
+        configured ``MempalaceConfig().entity_languages`` to match the
+        locales used at ``mempalace init`` time.
         """
         from mempalace.entity_detector import extract_candidates, score_entity, classify_entity

         lines = text.splitlines()
-        candidates = extract_candidates(text)
+        candidates = extract_candidates(text, languages=languages)
         new_candidates = []

         for name, frequency in candidates.items():
@@ -599,7 +603,7 @@ class EntityRegistry:
             if name in self.people or name in self.projects:
                 continue

-            scores = score_entity(name, text, lines)
+            scores = score_entity(name, text, lines, languages=languages)
             entity = classify_entity(name, frequency, scores)

             if entity["type"] == "person" and entity["confidence"] >= min_confidence:
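
The docstring above asks callers to pass the configured entity languages. A rough sketch of how that value might be resolved, assuming the precedence "env var, then config, then English default"; the helper name resolve_entity_languages is hypothetical and the real MempalaceConfig.entity_languages property may differ in detail. Only the variable name MEMPALACE_ENTITY_LANGUAGES and its comma-separated format come from the commit message.

```python
import os


def resolve_entity_languages(configured=None, environ=None):
    """Return the languages tuple to hand to learn_from_text().

    Assumed precedence: MEMPALACE_ENTITY_LANGUAGES (comma-separated),
    then the value persisted in config, then ("en",).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("MEMPALACE_ENTITY_LANGUAGES", "")
    # Split on commas and ignore empty or whitespace-only entries.
    from_env = tuple(part.strip() for part in raw.split(",") if part.strip())
    if from_env:
        return from_env
    if configured:
        return tuple(configured)  # tuples are hashable, so cache-friendly
    return ("en",)
```

Returning a tuple rather than a list matters if the result is later used as part of a cache key.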