When two wings have one or more confirmed TOPIC labels in common, the
miner now drops a symmetric tunnel between them at mine time so the
palace graph reflects shared themes (frameworks, vendors, recurring
concepts).
- llm_refine: TOPIC label routes to a dedicated `topics` bucket so the
signal survives confirmation instead of getting collapsed into
`uncertain` and dropped.
- entity_detector / project_scanner: bucket plumbed through the
detection pipeline; `confirm_entities` returns confirmed topics
alongside people/projects.
- miner.add_to_known_entities: optional `wing` parameter records the
confirmed topics under `topics_by_wing` in
`~/.mempalace/known_entities.json`. Wing names do NOT leak into the
flat known-name set used by drawer-tagging.
- palace_graph: `compute_topic_tunnels` and `topic_tunnels_for_wing`
create symmetric tunnels via the existing `create_tunnel` API so they
share dedup and persistence with explicit tunnels.
- miner.mine: post-file-loop pass calls `topic_tunnels_for_wing` for
the freshly-mined wing. Failures are logged but never abort the mine.
- config: `topic_tunnel_min_count` knob (env
`MEMPALACE_TOPIC_TUNNEL_MIN_COUNT` or `~/.mempalace/config.json`),
default 1.
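A minimal sketch of how the `topic_tunnel_min_count` knob could be resolved. The precedence (environment variable over `config.json`) and the helper name `resolve_topic_tunnel_min_count` are assumptions for illustration, not the shipped implementation:

```python
import json
import os
from pathlib import Path

DEFAULT_MIN_COUNT = 1  # default stated in this change


def resolve_topic_tunnel_min_count(config_path=None, env=None):
    """Hypothetical resolver: env var first (assumed), then config.json, then default."""
    env = os.environ if env is None else env
    if config_path is None:
        config_path = Path.home() / ".mempalace" / "config.json"

    raw = env.get("MEMPALACE_TOPIC_TUNNEL_MIN_COUNT")
    if raw is None:
        try:
            data = json.loads(Path(config_path).read_text(encoding="utf-8"))
            raw = data.get("topic_tunnel_min_count") if isinstance(data, dict) else None
        except (OSError, ValueError):
            raw = None  # missing or malformed config falls through to the default

    try:
        return int(raw)
    except (TypeError, ValueError):
        return DEFAULT_MIN_COUNT
```

With no environment variable and no config file, the sketch returns 1, matching the documented default.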
Tests cover topic persistence through init->mine, tunnel creation when
wings share a topic, no tunnel below threshold, cross-wing tunnel
retrieval via `list_tunnels`, dedup on recompute, case-insensitive
overlap, and the end-to-end mine-time wiring.
Out of scope for this PR (called out in the PR body): manifest-
dependency overlap, per-topic allow/deny lists, search-result surfacing.
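The overlap rule described above can be sketched as follows. This is an illustration only: `shared_topic_pairs` is a hypothetical helper, the input shape mirrors `topics_by_wing` as a plain dict, and the real `compute_topic_tunnels` goes through `create_tunnel` rather than returning a dict:

```python
from itertools import combinations


def shared_topic_pairs(topics_by_wing, min_count=1):
    """Return {(wing_a, wing_b): shared topics} for pairs sharing >= min_count topics.

    Topic comparison is case-insensitive. Each unordered pair appears once,
    which is one way to model a symmetric tunnel.
    """
    normalized = {
        wing: {topic.casefold() for topic in topics}
        for wing, topics in topics_by_wing.items()
    }
    pairs = {}
    for wing_a, wing_b in combinations(sorted(normalized), 2):
        shared = normalized[wing_a] & normalized[wing_b]
        if len(shared) >= min_count:
            pairs[(wing_a, wing_b)] = sorted(shared)
    return pairs


example = {
    "wing_api": ["ChromaDB", "OpenAI"],
    "wing_docs": ["chromadb"],
    "wing_misc": ["Terraform"],
}
tunnels = shared_topic_pairs(example)
```

Recomputing over the same input yields the same keys, so a caller that persists by pair key gets dedup for free in this sketch.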
The pattern-matching detector had several systematic false positives that
crowded the init review with nonsense. Concrete fixes:
- CamelCase extraction: add `[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+` to
candidate patterns so `MemPalace`, `ChromaDB`, `OpenAI`, `ChatGPT` are
visible. Previously `MemPalace` fragmented into `Mem` + `Palace`.
- Dialogue `^NAME:\s` requires >=2 matches to count. A single metadata
line like `Created: 2026-04-21` was scoring as dialogue and classifying
`Created` as a person.
- Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b`
(version-only). The previous `\b{name}[-v]\w+` matched `context-manager`,
`multi-word`, etc. - every hyphenated compound.
- Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files during scan. They
produce pure-English-prose noise (`Contributor`, `Software`, `Covered`,
`Before`).
- Extra SKIP_DIRS: `.terraform`, `vendor`, `target`.
- Expand stopword list with capitalized participles/descriptors that
commonly appear at sentence start: `created`, `updated`, `extracted`,
`processed`, `total`, `summary`, `auto`, `multi`, `hybrid`, `context`,
`bridge`, `batch`, `local`, `native`, `never`, `before`, `after`, etc.
- classify_entity: high-pronoun single-category signal now classifies as
person. A diary's main character gets referenced with pronouns, not
dialogue markers - requiring two signal categories demoted `Lu` (16
pronoun hits across 30 mentions) to uncertain. Gate on
`pronoun_hits >= 5 AND pronoun_hits / frequency >= 0.2` so common
sentence-start words (`Never`, `Before`) with incidental proximity
stay uncertain.
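The regex and gating changes in the list above can be exercised in isolation. The patterns and thresholds are taken from this message; the function names are hypothetical and the real detector wires them differently:

```python
import re

# CamelCase candidates: a capitalized word followed by more capitalized
# words or an all-caps run of at least two letters.
CAMEL_CASE = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+")


def versioned_pattern(name):
    """New, version-only form: name-2, name_v3, name-v2.1."""
    return re.compile(rf"\b{re.escape(name)}[-_]v?\d+(?:\.\d+)*\b")


def old_versioned_pattern(name):
    """Previous form, which matched any hyphenated compound."""
    return re.compile(rf"\b{re.escape(name)}[-v]\w+")


def looks_like_speaker(name, text, min_matches=2):
    """Dialogue marker `^NAME:\\s` must occur at least min_matches times."""
    hits = re.findall(rf"^{re.escape(name)}:\s", text, flags=re.MULTILINE)
    return len(hits) >= min_matches


def pronoun_gate(pronoun_hits, frequency):
    """Single-category person signal: enough pronoun hits, and dense enough."""
    return frequency > 0 and pronoun_hits >= 5 and pronoun_hits / frequency >= 0.2


camel = CAMEL_CASE.findall("MemPalace stores data in ChromaDB via OpenAI and ChatGPT")
```

For the `Lu` case above, `pronoun_gate(16, 30)` passes because 16 / 30 is well over the 0.2 ratio.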
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.
This adds entity sections for both locales:
- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
surnames covering >95% of Taiwanese / PRC names), length capped
at {1,2} trailing chars so greedy matches don't swallow the
trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
script-aware wrap (introduced in #932) fires `\b` at CJK↔non-CJK
transitions. This is the same mechanism used for Devanagari,
applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
name with no whitespace, so patterns are written as `{name}說`,
`{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `:`, Chinese quotes
「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
question words, conjunctions, UI nouns, and politeness forms.
**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.
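A rough sketch of the candidate and person-verb behaviour described above. It assumes a four-surname subset and a leading-only CJK boundary; the shipped pattern uses the full surname list and the loader's own boundary wrapping, which may differ:

```python
import re

CJK = r"\u4E00-\u9FFF"
SURNAMES = ("朱", "王", "李", "陳")  # illustrative subset, not the real ~100-entry list
surname_alt = "|".join(SURNAMES)

# Assumption: only the left edge must be a non-CJK character (or start of text).
# The {1,2} cap stops the greedy match before a trailing verb character.
CANDIDATE = re.compile(rf"(?<![{CJK}])((?:{surname_alt})[{CJK}]{{1,2}})")


def person_verb(name, verb):
    """Verbs attach directly to the name, so there is no separator in between."""
    return re.compile(re.escape(name) + re.escape(verb))


at_line_start = CANDIDATE.findall("朱宜振說這個設計可行")
after_bullet = CANDIDATE.findall("- 朱宜振 (PM)")
flanked_by_cjk = CANDIDATE.findall("今天朱宜振說")
verb_hit = person_verb("朱宜振", "說").search("今天朱宜振說")
```

Under these assumptions the name is found at a line start or after a bullet, is missed when CJK text precedes it, and the verb pattern still fires on the flanked sentence once the name is known.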
**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.
Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`
Full suite: 957 passed, 0 failed.
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode combining marks (Mc for ा and ी, Mn for ु), and
Python's \w matches neither category. As a result \b splits mid-word
on every matra: names like अनीता (Anita) truncate to अनीत, and
person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b
fails after the final matra of कहा.
The same issue affects Arabic, Hebrew, Thai, Tamil, and every other
script whose words contain combining marks.
Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.
Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.
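The boundary rewrite can be sketched as below. `script_boundary` and `expand_b` here are simplified stand-ins written for illustration, not the actual `_script_boundary` / `_expand_b` code, and the plain string replace assumes the pattern contains no escaped backslash before a `b`:

```python
import re


def script_boundary(boundary_chars):
    """Lookaround equivalent of \\b that treats boundary_chars as inside-word."""
    inside = f"[{boundary_chars}]"
    return f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"


def expand_b(pattern, boundary_chars):
    """Replace every \\b in the pattern with the script-aware boundary."""
    return pattern.replace(r"\b", script_boundary(boundary_chars))


HINDI_BOUNDARY_CHARS = r"\w\u0900-\u097F"  # decoded form of the JSON value above
WORD = r"\b[\u0900-\u097F]+\b"
VERB = r"\bराज\s+ने\s+कहा\b"

# Standard \b: the name is cut at a matra and the verb pattern never fires.
plain_words = re.findall(WORD, "अनीता")
plain_verb = re.search(VERB, "राज ने कहा")

# Script-aware boundary: whole words survive and the verb pattern matches.
aware_words = re.findall(expand_b(WORD, HINDI_BOUNDARY_CHARS), "अनीता ने कहा")
aware_verb = re.search(expand_b(VERB, HINDI_BOUNDARY_CHARS), "राज ने कहा")
```

Patterns without `\b` pass through `expand_b` unchanged, which is consistent with locales that declare no `boundary_chars` keeping their current behaviour.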
Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
_wrap_candidate, _collect_entity_section; candidate_patterns are now
returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
(name extraction with/without boundary_chars, person-verb firing,
English regression)
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
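A simplified sketch of the merge and env-override behaviour listed above. The in-memory `LOCALES` table, the key names inside it, and both helper names are assumptions for illustration; the real `get_entity_patterns` reads each locale's JSON:

```python
# Hypothetical stand-in for the per-locale JSON entity sections.
LOCALES = {
    "en": {
        "person_verbs": [r"{name}\s+said"],
        "pronouns": ["he", "she"],
        "stopwords": ["the", "created"],
    },
    "pt-br": {
        "person_verbs": [r"{name}\s+disse"],
        "pronouns": ["ele", "ela"],
        "stopwords": ["o", "a"],
    },
}


def merge_entity_patterns(langs=("en",)):
    """Union patterns across langs: ordered dedupe for lists, set union for stopwords."""
    merged = {}
    stopwords = set()
    for lang in langs:
        section = LOCALES.get(lang, LOCALES["en"])  # unknown locale falls back to English
        for key, values in section.items():
            if key == "stopwords":
                stopwords.update(values)
                continue
            bucket = merged.setdefault(key, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    merged["stopwords"] = stopwords
    return merged


def parse_entity_languages(raw):
    """Parse a comma-separated value such as MEMPALACE_ENTITY_LANGUAGES."""
    langs = tuple(part.strip() for part in raw.split(",") if part.strip())
    return langs or ("en",)


merged = merge_entity_patterns(("en", "pt-br", "xx"))
```

Returning a tuple from `parse_entity_languages` keeps the value hashable, which matters for any cache keyed on the language set.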
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
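The cache-key point from the list above, as a small sketch. `build_patterns` and `VERB_TEMPLATES` are hypothetical stand-ins for `_build_patterns` and the locale data; the only claim illustrated is that keying on `(name, languages)` keeps language sets in separate slots:

```python
import re
from functools import lru_cache

# Hypothetical per-language templates; real ones come from the locale JSON.
VERB_TEMPLATES = {
    "en": (r"\b{name}\s+said\b",),
    "pt-br": (r"\b{name}\s+disse\b",),
}


@lru_cache(maxsize=256)
def build_patterns(name, languages):
    """Compile person-verb patterns for one name; languages must be a tuple."""
    escaped = re.escape(name)
    return tuple(
        re.compile(template.format(name=escaped))
        for lang in languages
        for template in VERB_TEMPLATES.get(lang, VERB_TEMPLATES["en"])
    )


build_patterns.cache_clear()
english_only = build_patterns("Lu", ("en",))
both = build_patterns("Lu", ("en", "pt-br"))
english_again = build_patterns("Lu", ("en",))  # served from the ("Lu", ("en",)) slot
info = build_patterns.cache_info()
```

Cost per call grows with the number of languages in the tuple, not with the number of locale files that ship.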
- Add tests for config, convo_miner, spellcheck, knowledge_graph
- Fix Windows PermissionError in test cleanup (chromadb file locks)
- Add UTF-8 encoding to split_mega_files, entity_registry, hooks_cli
- Fix mcp_server parse_known_args logging for unknown args
- Set coverage threshold to 85 in pyproject.toml and CI
- Reset all version files to 3.0.11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>