fix: use i18n candidate patterns for entity extraction in miner and palace

entity_detector.py was refactored in #911 to load candidate patterns from i18n locale JSON files, adding support for names outside plain ASCII (Cyrillic, accented Latin, etc.). However, three other code paths still hardcoded the ASCII-only regex [A-Z][a-z]{2,}, so they silently missed non-ASCII entity names in metadata tagging, closet indexing, and registry lookups. Replace the hardcoded regex with a shared _candidate_entity_words() helper that reuses the same i18n candidate_patterns as entity_detector.
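
To illustrate the failure mode, here is a minimal sketch and not the helper from this commit. The pattern list and the function name candidate_entity_words_sketch are hypothetical stand-ins. The real helper reads candidate_patterns from the locale JSON files, whose contents are not shown here.

```python
import re

# The ASCII-only pattern that the commit message describes as hardcoded.
ASCII_PATTERN = re.compile(r"\b[A-Z][a-z]{2,}\b")

# ASSUMPTION: illustrative patterns only. The real candidate_patterns come
# from the i18n locale JSON files and may look different.
HYPOTHETICAL_CANDIDATE_PATTERNS = [
    # Plain ASCII capitalized words.
    r"\b[A-Z][a-z]{2,}\b",
    # Cyrillic capitalized words (U+0410-U+042F plus U+0401, then lowercase).
    r"\b[\u0410-\u042F\u0401][\u0430-\u044F\u0451]{2,}\b",
    # Latin-1 accented capitalized words (skipping the multiply/divide signs).
    r"\b[A-Z\u00C0-\u00D6\u00D8-\u00DE][a-z\u00DF-\u00F6\u00F8-\u00FF]{2,}\b",
]


def candidate_entity_words_sketch(text, patterns=HYPOTHETICAL_CANDIDATE_PATTERNS):
    """Return capitalized candidate words matched by any pattern.

    Hypothetical stand-in for _candidate_entity_words(). Results are
    de-duplicated and kept in first-seen order, pattern by pattern.
    """
    seen = {}
    for pattern in patterns:
        for match in re.findall(pattern, text):
            seen.setdefault(match, None)
    return list(seen)
```

On a mixed-script query, ASCII_PATTERN skips a Cyrillic name entirely. It also skips an accented Latin name, because Python's \b is Unicode-aware for str patterns and finds no word boundary before the accented letter. The pattern-list version picks up both.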
2026-04-16 05:23:33 +05:00
parent d4c942417a
commit 8bf940f861
4 changed files with 56 additions and 5 deletions
@@ -656,7 +656,9 @@ class EntityRegistry:
        Find capitalized words in query that aren't in registry or common words.
        These are candidates for Wikipedia research.
        """
-        candidates = re.findall(r"\b[A-Z][a-z]{2,15}\b", query)
+        from .palace import _candidate_entity_words
+
+        candidates = _candidate_entity_words(query)
        unknown = []
        for word in set(candidates):
            if word.lower() in COMMON_ENGLISH_WORDS: