mempalace/tests at 683e940f7009487fd05667e623a775e734c27c9d - mempalace - GIT

jason/mempalace

Lman Chu 683e940f70 feat(i18n): add Traditional + Simplified Chinese entity detection

zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.

This adds entity sections for both locales:

- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
  surnames covering >95% of Taiwanese / PRC names), length capped
  at {1,2} trailing chars so greedy matches don't swallow the
  trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
  script-aware wrap (introduced in #932) fires `\b` at CJK↔non-CJK
  transitions. This is the same mechanism used for Devanagari,
  applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
  name with no whitespace, so patterns are written as `{name}說`,
  `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes
  「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
  question words, conjunctions, UI nouns, and politeness forms.
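
The candidate and verb rules in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the six-surname list, the `find_candidates` and `person_verb_hits` helpers, and the use of a left-hand negative lookbehind in place of the loader's script-aware `\b` wrap are all assumptions made for the example.

```python
import re

# Assumption: a tiny stand-in for the ~100-surname list described above.
SURNAMES = "朱王李陳林張"
# CJK Unified Ideographs range, the same range named in `boundary_chars`.
CJK = r"\u4E00-\u9FFF"

# A surname followed by one or two more CJK characters. The lookbehind
# requires a non-CJK character (or start of text) on the left, which
# approximates a boundary at a CJK/non-CJK transition.
CANDIDATE = re.compile(rf"(?<![{CJK}])[{SURNAMES}][{CJK}]{{1,2}}")

# Verbs attach directly to the name, so no whitespace or \b separator.
PERSON_VERBS = ("說", "問", "決定")


def find_candidates(text: str) -> list[str]:
    """Return surname-prefixed CJK n-grams that start at a boundary."""
    return CANDIDATE.findall(text)


def person_verb_hits(name: str, text: str) -> int:
    """Count occurrences of `name` immediately followed by a person verb."""
    pattern = re.compile("|".join(re.escape(name) + verb for verb in PERSON_VERBS))
    return len(pattern.findall(text))


print(find_candidates("- 朱宜振說明天開會"))
print(person_verb_hits("朱宜振", "- 朱宜振說好。朱宜振決定延期，朱宜振問誰負責"))
```

With the `{1,2}` cap, the three-character name 朱宜振 is matched in full and the following verb 說 is left outside the match. An uncapped quantifier would keep consuming CJK characters until the next non-CJK character.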

**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.
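
The limitation can be reproduced with a small sketch. The regex, the truncated surname list, the `count_candidates` helper, and the sample note below are assumptions written for this illustration. They are not taken from the commit or from the note used for verification.

```python
import re
from collections import Counter

CJK = r"\u4E00-\u9FFF"
SURNAMES = "朱王李陳林張"  # assumption: truncated surname list

# Surname plus one or two CJK characters, anchored on the left at a
# non-CJK character or at the start of the text.
CANDIDATE = re.compile(rf"(?<![{CJK}])[{SURNAMES}][{CJK}]{{1,2}}")

# Hypothetical note mixing bullet lines, inline English, and full-width
# punctuation with one fully inline mention.
NOTE = "\n".join(
    [
        "- 朱宜振說這個專案要延期",        # bullet + space before the name
        "今天下午朱宜振決定改用新架構",    # CJK on both sides: missed
        "PM 朱宜振：「先做原型」",          # inline English before the name
        "會議記錄，朱宜振問了三個問題",    # full-width comma before the name
    ]
)


def count_candidates(text: str) -> Counter:
    """Count each candidate string found in `text`."""
    return Counter(CANDIDATE.findall(text))


counts = count_candidates(NOTE)
print(counts)
```

In this sample the name appears four times but is counted three times: the second line has CJK characters on both sides of the name, so no boundary is available and the match is skipped.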

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.

2026-04-16 17:43:09 +08:00

..

perf: optimize regex compilation in entity extraction

2026-04-14 17:43:26 +00:00

conftest.py

Fix: ruff format with CI-pinned version (0.4.x)

2026-04-13 18:29:48 -04:00

test_backends.py

Fix: set cosine distance metadata on all collection creation sites

2026-04-13 11:00:52 -04:00

test_cli.py

refactor: route all chromadb access through ChromaBackend

2026-04-14 00:31:16 -03:00

test_closet_llm.py

merge: pr/closet-llm-generic + harden LLM regen path for production

2026-04-13 18:40:36 -03:00

test_closets.py

test: verify mine_lock via disjoint critical-section intervals

2026-04-13 19:08:57 -03:00

test_config_extra.py

test: bring coverage to 85%, set threshold to 85, reset version to 3.0.11

2026-04-08 21:38:12 +03:00

test_config.py

fix: use permissive validator for KG entity values (closes #455)

2026-04-14 09:26:47 -04:00

test_convo_miner_unit.py

fix: store full AI response in convo_miner exchange chunking (#695)

2026-04-12 14:23:52 -07:00

test_convo_miner.py

feat(normalize): auto-rebuild stale drawers via NORMALIZE_VERSION schema gate

2026-04-13 16:20:55 -03:00

test_dedup.py

refactor: route all chromadb access through ChromaBackend

2026-04-14 00:31:16 -03:00

test_dialect.py

fix: align cmd_compress dict keys with compression_stats() return values (#569)

2026-04-11 16:16:31 -07:00

test_empty_chromadb_results.py

fix(searcher): guard against empty ChromaDB query results (#195) (#865)

2026-04-15 00:26:38 -07:00

test_entity_detector.py

feat(i18n): add Traditional + Simplified Chinese entity detection

2026-04-16 17:43:09 +08:00

test_entity_registry.py

fix: make entity_registry.research() local-only by default (#811)

2026-04-15 00:26:24 -07:00

test_exporter.py

feat: new MCP tools — get/list/update drawer, hook settings, export (resolves #635) (#667)

2026-04-11 21:25:04 -07:00

test_fact_checker.py

merge: full hardened stack + rewrite fact_checker around actual KG API

2026-04-13 18:20:11 -03:00

test_general_extractor.py

style: format test files with ruff

2026-04-08 21:08:49 +03:00

test_hall_detection.py

fix: README audit — 42 TDD tests + hall detection + 7 claim fixes (#835)

2026-04-13 17:11:11 -07:00

test_hooks_cli.py

fix(hooks): stop precompact hook from blocking compaction (#856, #858) (#863)

2026-04-15 00:26:54 -07:00

test_hybrid_search.py

merge: pr/closet-llm-generic + harden LLM regen path for production

2026-04-13 18:40:36 -03:00

test_i18n.py

fix: address i18n review issues from PR #718

2026-04-15 11:03:28 +05:00

test_init_gitignore_protection.py

fix(init): auto-add per-project files to .gitignore in git repos (#185) (#866)

2026-04-15 00:26:41 -07:00

test_instructions_cli.py

test: add comprehensive test coverage (35% → 58%, threshold 50%)

2026-04-08 20:54:56 +03:00

test_kg_thread_safety.py

fix: add missing self._lock to KnowledgeGraph.close()

2026-04-14 13:09:10 -07:00

test_knowledge_graph_extra.py

test: bring coverage to 85%, set threshold to 85, reset version to 3.0.11

2026-04-08 21:38:12 +03:00

test_knowledge_graph.py

fix: ruff format test_hooks_cli.py and test_knowledge_graph.py

2026-04-08 15:12:12 -03:00

test_layers.py

Mempalace backend seam (#413)

2026-04-11 16:16:49 -07:00

test_mcp_server.py

fix: return empty status instead of error on cold-start palace (#830) (#831)

2026-04-15 00:26:35 -07:00

test_mcp_stdio_protection.py

fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225) (#864)

2026-04-15 00:26:51 -07:00

test_migrate.py

chore: clarify security guardrails

2026-04-12 22:19:58 -03:00

test_miner.py

fix: allow mining directories without local mempalace.yaml

2026-04-14 13:53:07 -03:00

test_normalize.py

fix: add provenance header and speaker IDs to Slack transcript imports (#815)

2026-04-15 00:27:01 -07:00

test_onboarding.py

test: bring coverage to 85%, set threshold to 85, reset version to 3.0.11

2026-04-08 21:38:12 +03:00

test_palace_graph_tunnels.py

test: add palace_graph tunnel helper coverage

2026-04-15 11:38:18 +02:00

test_palace_graph.py

style: format test files with ruff

2026-04-08 21:08:49 +03:00

test_query_sanitizer.py

fix: make quote trimming explicit

2026-04-12 22:19:58 -03:00

test_readme_claims.py

docs+tests: fix CI after README slim (#875)

2026-04-14 21:59:55 -03:00

test_repair.py

refactor: route all chromadb access through ChromaBackend

2026-04-14 00:31:16 -03:00

test_room_detector_local.py

fix: skip unreachable reparse points in detect_rooms_from_folders (#558)

2026-04-11 16:16:06 -07:00

test_save_hook_mines.py

fix: save hook auto-mines transcript without MEMPAL_DIR (#840)

2026-04-13 18:09:59 -07:00

test_save_hook_verbose.py

feat: add MEMPAL_VERBOSE toggle — developers see diaries in chat (#871)

2026-04-14 10:55:56 -07:00

test_searcher.py

feat: include created_at timestamp in search results (#846)

2026-04-15 00:26:57 -07:00

test_spellcheck_extra.py

test: bring coverage to 85%, set threshold to 85, reset version to 3.0.11

2026-04-08 21:38:12 +03:00

test_spellcheck.py

style: format test files with ruff

2026-04-08 21:08:49 +03:00

test_split_mega_files.py

test: expand coverage to 70%, fix mcp_server CI crash (threshold 60%)

2026-04-08 21:07:03 +03:00

test_version_consistency.py

fix: unify package and MCP version reporting

2026-04-07 08:53:25 +01:00