mempalace

Author	SHA1	Message	Date
Igor Lins e Silva	74a31b70d3	Merge pull request #998 from MemPalace/fix/silent-transcript-drop Fix silent transcript drop: .jsonl ingestion + 500 MB cap + tandem sweeper	2026-04-18 13:38:02 -03:00
copilot-swe-agent[bot]	24bf97bb65	fix(tests): avoid ONNX network download in update-length validation tests test_base_collection_update_default_validates_list_lengths and test_base_collection_update_default_rejects_mismatched_lengths were spinning up a real ChromaBackend and calling add(documents=...), which triggered ChromaDB's default ONNX embedding function and attempted a network download — failing in offline/sandboxed CI. BaseCollection.update() validates list lengths before any DB access, so no items need to be pre-loaded for the length-check to fire. Switch both tests to use _FakeCollection (same as the rest of the unit tests in this file) so they are pure in-memory and network-free. Also fixes a structural bug in test 1: collection._collection.add() was accidentally placed inside the pytest.raises(ValueError) block, masking the real assertion. Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/55fc663e-b256-4b8b-88ce-4271560def8d Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>	2026-04-18 16:23:58 +00:00
Igor Lins e Silva	4a088ea8e1	Address Copilot review: cursor tie-break, honest metrics, accurate comments Six items from the automated review on PR #998: 1. Cursor tie-break bug (correctness). The skip condition was `rec.timestamp <= cursor`; if multiple messages share the max timestamp and only some were ingested before a crash, the rest would be lost forever. Changed to `< cursor`, relying on deterministic drawer IDs for safe re-attempt at the boundary. Regression test `test_sweep_recovers_untaken_message_at_cursor_timestamp`. 2. `drawers_added` counted upserts, not adds. Added a pre-flight `collection.get(ids=batch)` to distinguish new rows from already- present ones. Return value now carries `drawers_added`, `drawers_already_present`, `drawers_upserted`, and `drawers_skipped` separately. Dict-compatible access (`existing.get("ids")`) keeps it working on both the raw Chroma return and the typed `GetResult`. 3. `sweep_directory` hid failures in the summary. `files_processed` used to exclude failed files. Replaced with `files_attempted` (all discovered) + `files_succeeded` (subset that completed); CLI output shows `succeeded/attempted`. 4. Coordination claim was overstated. The primary miners don't stamp `session_id`/`timestamp` metadata, so the sweeper coordinates only with its own prior runs. Softened docstrings on module and CLI command. Uniform cross-miner metadata is flagged as a follow-up. 5. MAX_FILE_SIZE comments were misleading. Said source size "does not affect storage or embedding cost" — true per-drawer, but source size still scales drawer count, embedding work, and memory usage (files are read in full, not streamed). Corrected in both `miner.py` and `convo_miner.py`. 6. Added the tie-break regression test that reproduces the correctness bug from (1). Tests: 970 passed (was 969), ruff + pre-commit clean. Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>	2026-04-18 13:22:18 -03:00
Igor Lins e Silva	42b940d263	fix(backends): address Copilot review on #995 Four defects surfaced by the automated review, fixed with targeted tests: 1. BaseCollection.update() default now validates that documents / metadatas / embeddings lengths match ids, raising ValueError instead of silently misaligning pairs or raising IndexError (base.py). 2. ChromaCollection.query() now rejects the two ambiguous input shapes up front — neither or both of query_texts / query_embeddings, and empty input lists — with clear ValueError messages rather than delegating to chromadb's less-obvious errors (chroma.py). 3. QueryResult.empty() accepts embeddings_requested=True to preserve the outer-query dimension with empty hit lists when the caller asked for embeddings, matching the spec rule that included fields carry the outer shape even when empty (base.py). ChromaCollection.query() threads this through on the empty-result path (chroma.py). 4. ChromaBackend cache-freshness check now matches the semantics from mcp_server._get_client (merged via #757) on three edge cases Copilot called out: (a) invalidate when chroma.sqlite3 disappears while a cached client is held, (b) treat a 0→nonzero stat transition as a change so a cache built when the DB did not yet exist is refreshed, (c) re-stat after PersistentClient constructs the DB lazily so freshness reflects the post-creation state (chroma.py). Tests: 978 passed (up from 970), 8 new tests covering the fixes.	2026-04-18 13:19:18 -03:00
Igor Lins e Silva	29ce7c7135	Harden sweeper for production: verbatim tool blocks, full session_id, logged failures Four changes on top of the proposal's initial sweeper draft, driven by the CLAUDE.md design principles: 1. Drop the 500-char truncation on tool_use / tool_result content in _flatten_content. The "verbatim always" principle forbids lossy compression of user-adjacent data; a long code-edit diff handed to the assistant must round-trip intact. Unknown block types now also serialize their full payload instead of just a type marker. New test test_parse_preserves_tool_blocks_verbatim covers a 5000-char input. 2. Use the full session_id in drawer IDs (not session_id[:12]). Rules out cross-session collisions if a transcript source ever uses non-UUID session identifiers or shared prefixes. 3. Replace silent `except Exception: return None` in get_palace_cursor with a logger.warning — the exact anti-pattern this PR otherwise criticizes in miner.py. The fallback behavior is still safe (deterministic IDs make a missed cursor recover on the next run), but the failure is now discoverable. 4. sweep_directory now collects per-file failures into the result dict and the CLI exits non-zero when any file failed, so a partial-sweep outcome is visible rather than swallowed. Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>	2026-04-18 13:14:32 -03:00
MSL	fed69935d3	Add tandem sweeper: message-level safety net for dropped transcripts The primary miners (miner.py, convo_miner.py) operate at file granularity and can drop data for several reasons: size caps, silent OSError on read, dedup false positives, extensions the project miner does not recognize. Even with tonight's hotfixes, any future bug in the file-level path risks silent data loss. The sweeper is a second, cooperating miner that works at MESSAGE granularity: - Parses Claude Code .jsonl line by line, yielding only user/assistant records (filters progress, file-history-snapshot, etc. noise). - For each session_id, queries the palace for max(timestamp) and treats that as the cursor. - Ingests only messages newer than the cursor, as one small drawer per exchange (never hits a size cap — each drawer is 1-5 KB). - Deterministic drawer IDs from session_id + message UUID make reruns idempotent; crash mid-sweep is safe. Tandem coordination is free: if the primary miner committed up to timestamp T, the sweeper resumes from T. If the primary miner missed everything, the sweeper catches it all. Neither duplicates the other. Smoke test on a real Claude Code transcript: 1st run: +39 drawers, 0 already present 2nd run: +0 drawers, 39 already present (perfect idempotence) Opt-in via: mempalace sweep <file.jsonl> mempalace sweep <transcript-dir> No changes to existing miners. No schema migration. Purely additive. Tests: tests/test_sweeper.py (7 tests covering parsing, tandem coordination, idempotency, resume-from-cursor, metadata correctness). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:06 -03:00
MSL	6f33d52681	Raise convo_miner MAX_FILE_SIZE cap 10 MB → 500 MB Mirrors the miner.py fix in this same branch. convo_miner.py had the exact same 10 MB cap at line 58 that silently dropped long transcripts via continue. Long Claude Code sessions, multi-year ChatGPT exports, and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern, different file. Raised to 500 MB to match miner.py for consistency; downstream chunking means source file size does not affect storage or embedding cost. Tests: tests/test_convo_miner_size_cap.py (1 test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:01 -03:00
MSL	d137d12313	Raise MAX_FILE_SIZE cap from 10 MB to 500 MB Long Claude Code sessions routinely produce transcripts larger than 10 MB. The previous cap at miner.py:65 silently dropped them at line 732 with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — same silent-failure pattern as the .jsonl extension bug. The cap exists as a safety rail against pathological binaries, not as a limit on legitimate text. Downstream chunking at 800 chars per drawer means source file size does not affect storage or embedding cost. 500 MB leaves headroom for year-long continuous transcripts while still catching accidental multi-GB binary mines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:01 -03:00
MSL	560fdbdc9f	Fix silent drop of .jsonl files in project miner mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not `.jsonl`. Every jsonl file encountered in a mined directory was silently skipped at miner.py:722: if filepath.suffix.lower() not in READABLE_EXTENSIONS: continue Claude Code transcripts, ChatGPT exports, and every other tool writing line-delimited JSON ship as `.jsonl`. Users running `mempalace mine` against a directory of transcripts saw the command complete with no error and no log line — and their conversations never reached the palace. Silent data loss. Adding `.jsonl` to the whitelist alongside `.json`. jsonl is text line-by-line; the existing chunking pipeline handles it the same way it handles any other text file. Tests: tests/test_miner_jsonl_visibility.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:52:01 -03:00
Igor Lins e Silva	a17a8b734a	refactor(backends): typed QueryResult/GetResult, PalaceRef, BaseBackend registry (RFC 001 §10) Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres, #700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target to align against. Scope (this PR): - Typed QueryResult / GetResult dataclasses replace Chroma's dict shape at the BaseCollection boundary (§1.3). A transitional _DictCompatMixin keeps existing callers working while the attribute-access migration proceeds. - BaseCollection is now kwargs-only across add/upsert/query/get/delete/update with ABC defaults for estimated_count/close/health and a non-atomic default update() (§1.1–1.2). - PalaceRef replaces raw path strings at the backend boundary (§2.2). - BaseBackend ABC with get_collection/close_palace/close/health/detect (§2.3). - mempalace.backends entry-point group + in-tree registry with resolve_backend_for_palace priority order matching §3.2–3.3. - ChromaCollection normalizes chroma returns into typed results; unknown where-clause operators raise UnsupportedFilterError (no silent drop, §1.4). - ChromaBackend absorbs the inode/mtime client-cache freshness check previously duplicated in mcp_server._get_client() (§10 + PR #757). - searcher.py migrated to typed-attribute access as the reference call site; remaining callers land in a follow-up. - pyproject: chroma registered via [project.entry-points."mempalace.backends"]. Out of scope (explicit follow-ups): - Full caller migration off the dict-compat shim across palace.py, mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py, palace_graph.py, cli.py, closet_llm.py. - Embedder injection + three-state EmbedderIdentityMismatchError check (§1.5). - maintenance_state() / run_maintenance() benchmark hooks (§7.3). - AbstractBackendContractSuite full coverage (§7.1–7.2). - mempalace migrate / mempalace verify CLI rewrites through BaseCollection (§8). 
Tests: 970 passed (up from 967 on develop); new coverage for typed results, empty-result outer-shape preservation, $regex rejection, registry lookup, priority resolver, and PalaceRef-kwarg ChromaBackend.get_collection. Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).	2026-04-18 12:45:16 -03:00
bensig	41d45d9336	docs: RFC 002 — source adapter plugin specification Draft plugin specification for source adapters, mirroring RFC 001's role for storage backends. Formalizes the contract six community ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's metadata-only mode have been reinventing ad-hoc, so adapter authors can build to a stable surface. Key decisions: - Single ingest() method; lazy adapters yield SourceItemMetadata ahead of drawers, eager adapters interleave - Declared-transformation model (§1.4) replaces informal verbatim promise with a verifiable one; byte_preserving adapters declare the empty set, declared_lossy adapters enumerate. Existing miner.py and the convo_miner+normalize pipeline map cleanly - Palace is the incremental cursor via is_current(item, metadata); no sidecar persistence - Routing is adapter-owned; detect_room/detect_hall move into the filesystem adapter - Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as json_string field, KG triples route to SQLite knowledge graph - Closets stay core-built as a post-step; adapters may emit flat closet_hints. Closes existing gap where convo drawers get no closets - No per-drawer field renames: source_file, filed_at, source_mtime, added_by, normalize_version, entities, ingest_mode all preserved. Spec adds adapter_name, adapter_version, privacy_class §9 enumerates the cleanup PR prerequisites (mempalace/sources/ module, PalaceContext facade, KnowledgeGraph.add_triple gaining backwards-compatible source_drawer_id + adapter_name params). Tracking issue: #989	2026-04-17 23:42:46 -07:00
Igor Lins e Silva	e4a2cd48a2	Merge pull request #984 from domiscd/feat/landing-page-update feat/landing-page: Improve landing page readability	2026-04-17 19:47:39 -03:00
Dominique Deschatre	2e3e0b979c	Update landing.css	2026-04-17 19:40:25 -03:00
Dominique Deschatre	9e8281aab5	(landing) svg icons animations	2026-04-17 19:37:30 -03:00
Dominique Deschatre	e5f5009f80	(landing) added Closets section	2026-04-17 19:18:10 -03:00
Dominique Deschatre	89f0eb5cb3	refactor(website): split Landing.vue into section components Extract 2002-line monolith into landing/ subfolder: - 8 section components (FolioHeader, HeroSection, ForgettingSection, AnatomySection, DialectSection, MechanicsSection, InstallSection, CatalogFooter) - useLandingEffects.js composable for all vanilla-JS effects - landing.css for all styles - Landing.vue reduced to 28-line orchestrator Also restores upstream hero lede text ("permanent. Designed for total recall.").	2026-04-17 18:49:41 -03:00
Dominique Deschatre	8c3d1ba86c	Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update Co-authored-by: Copilot <copilot@github.com>	2026-04-17 17:00:47 -03:00
Dominique Deschatre	28d4f67ba2	landing hero container	2026-04-17 15:53:50 -03:00
dependabot[bot]	0e632df85d	chore(deps): bump actions/checkout from 4 to 6 Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v4...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-04-17 07:56:09 +00:00
dependabot[bot]	04d80eb363	chore(deps): bump actions/upload-pages-artifact from 3 to 5 Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 3 to 5. - [Release notes](https://github.com/actions/upload-pages-artifact/releases) - [Commits](https://github.com/actions/upload-pages-artifact/compare/v3...v5) --- updated-dependencies: - dependency-name: actions/upload-pages-artifact dependency-version: '5' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-04-17 07:56:05 +00:00
dependabot[bot]	c942f5866c	chore(deps): bump actions/deploy-pages from 4 to 5 Bumps [actions/deploy-pages](https://github.com/actions/deploy-pages) from 4 to 5. - [Release notes](https://github.com/actions/deploy-pages/releases) - [Commits](https://github.com/actions/deploy-pages/compare/v4...v5) --- updated-dependencies: - dependency-name: actions/deploy-pages dependency-version: '5' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>	2026-04-17 07:56:02 +00:00
Igor Lins e Silva	6889c6ff33	Merge pull request #957 from MemPalace/release/3.3.1 release: v3.3.1	2026-04-17 00:37:45 -03:00
Igor Lins e Silva	41bff266a4	Merge pull request #918 from almirus/develop feat(cli): add version display and version flag to CLI	2026-04-17 00:29:55 -03:00
Igor Lins e Silva	596f3d3a8e	Merge pull request #964 from MemPalace/fix/website-false-claims fix(website): correct false claims and stale numbers in live docs	2026-04-16 23:38:08 -03:00
Igor Lins e Silva	0cb9ee5c58	fix(website): correct false claims and stale numbers in live docs - Landing: replace nonexistent `mempalace remember` CLI demo with real `mempalace mine ./notes` - Landing: soften unverifiable absolutes ("forever available", "100% recall by design", "<50 ms", "90%+ compression", "two-thousand-year-old", "tens of thousands of entries") - MCP tool count: 19 → 29 across mcp-integration, claude-code, openclaw, and modules; expand tool overview with Drawers, Tunnels, and System categories to match mcp_server.py - Wake-up token range: ~170–900 → ~600–900 in cli/api-reference/python-api to match cli.py help text and concept docs - Gemini CLI: move `--scope user` before target name and add `--` separator so `-m mempalace.mcp_server` isn't parsed as Gemini flags	2026-04-16 23:31:35 -03:00
Igor Lins e Silva	51919fef0c	Merge pull request #963 from domiscd/feat/landing-page-update feat(website): update landing page	2026-04-16 22:37:16 -03:00
Dominique Deschatre	c8727b3a2d	chore(website): add Google Analytics	2026-04-16 22:34:37 -03:00
Dominique Deschatre	44c525ddd3	Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update # Conflicts: # website/index.md	2026-04-16 22:31:22 -03:00
Dominique Deschatre	d8ac4c3abb	new landing page pt 2	2026-04-16 22:24:15 -03:00
Dominique Deschatre	9893fa2383	new landing page	2026-04-16 21:46:03 -03:00
Lman Chu	c88b8a2e17	style: fix ruff format for test_entity_detector.py Collapse implicit string concatenation to single-line strings to satisfy ruff format --check in CI. Co-Authored-By: Claude <noreply@anthropic.com>	2026-04-17 06:40:41 +08:00
Igor Lins e Silva	b552bcf3ea	Merge pull request #958 from MemPalace/fix/release-3.3.1-plugin-manifests release: bump plugin manifests to 3.3.1	2026-04-16 16:26:35 -03:00
Igor Lins e Silva	05ad2dc194	release: bump plugin manifests to 3.3.1 version-guard workflow checks five sources must agree: mempalace/version.py, pyproject.toml, .claude-plugin/marketplace.json, .claude-plugin/plugin.json, .codex-plugin/plugin.json. Initial release commit missed the three plugin manifests.	2026-04-16 16:25:00 -03:00
Igor Lins e Silva	fd89303fe1	docs(changelog): backfill post-v3.3.0 PRs missed by initial boundary Advisor caught: initial boundary (962776c..develop) skipped PRs that landed on develop after v3.3.0 tag but before the sync-back merge. Adds entries for #871 MEMPAL_VERBOSE, #811 research() local-only default, #866 init .gitignore, #864 MCP stdout redirect, #863 precompact hook, #865 searcher empty results, #831 cold-start palace, #862 init help, #815 Slack provenance, #840 save hook auto-mine. Also drops the awkward caveat on #846 created_at — it's post-v3.3.0.	2026-04-16 16:12:37 -03:00
Igor Lins e Silva	2087869752	release: v3.3.1 Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13. Highlights: - Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/ entity_registry (#931) - Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778) - UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946) - KnowledgeGraph lock correctness (#884, #887) - Various smaller fixes and improvements	2026-04-16 16:09:02 -03:00
Igor Lins e Silva	55a004fe1e	Merge pull request #931 from mvalentsev/fix/i18n-entity-metadata fix: use i18n candidate patterns for entity extraction in miner and palace	2026-04-16 15:54:01 -03:00
Igor Lins e Silva	c5e249bba8	Merge pull request #946 from mvalentsev/fix/utf8-read-text fix: add explicit UTF-8 encoding to read_text() calls (#776)	2026-04-16 15:52:42 -03:00
Igor Lins e Silva	65f99ad7e6	Merge pull request #928 from arnoldwender/fix/i18n-lang-case-insensitive fix(i18n): resolve language codes case-insensitively (#927)	2026-04-16 15:44:36 -03:00
Igor Lins e Silva	29112fab82	Merge pull request #778 from dominosaurs/feat/id-lang feat: add Indonesian language support	2026-04-16 15:44:26 -03:00
Igor Lins e Silva	4215be3926	Merge pull request #773 from tejasashinde/feat/add-i18n-hindi feat: add Hindi language support to i18n module	2026-04-16 15:44:08 -03:00
jp	8adf35a13c	fix: add threading lock to graph cache, expand docstring Address review feedback from @bensig: 1. Wrap cache reads/writes in threading.Lock for thread safety 2. Promote the col-arg caveat from inline comment to docstring Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 09:00:36 -07:00
jp	1657a79649	fix: clarify cache docs, skip caching empty graphs Addresses Copilot review feedback on #661. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 09:00:27 -07:00
jp	84e2aa16e4	perf: graph cache with write-invalidation in build_graph() build_graph() scans every drawer's metadata in 1000-item batches on every call — O(n) per graph build with no caching. At 50K+ drawers this costs several seconds per MCP tool call (traverse, find_tunnels, graph_stats all call build_graph on every invocation). Add a module-level cache (nodes + edges + timestamp) with a 60-second TTL. Cache is invalidated via invalidate_graph_cache(), exported for write operations to call. Tests updated with setup_method cache resets and two new tests verifying cache hit and invalidation behaviour.	2026-04-16 09:00:27 -07:00
jp	15ea385554	fix: replace all non-ASCII progress markers for Windows encoding Also fix miner.py checkmark and box-drawing/arrow chars (─, →) in both miner.py and split_mega_files.py that would crash on cp1251/cp1252. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 08:59:58 -07:00
jp	542b53bb0f	fix: replace Unicode checkmark with ASCII + for Windows encoding (#535) Windows terminals using cp1251/cp1252 crash on the Unicode ✓ (U+2713) in progress output. Replace with ASCII + in convo_miner.py and split_mega_files.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 08:59:58 -07:00
mvalentsev	09fe2dda3c	fix: add explicit UTF-8 encoding to read_text() calls (#776) On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults to platform encoding, breaking onboarding tests and any source code that reads JSON/markdown with non-ASCII content. 5 files, 8 call sites fixed.	2026-04-16 16:00:29 +05:00
🍕	939d4c1e74	feat: Update Indonesian translations Refine AAAK instruction and expand entity detection patterns.	2026-04-16 17:43:51 +08:00
Lman Chu	683e940f70	feat(i18n): add Traditional + Simplified Chinese entity detection zh-TW and zh-CN previously had no `entity` section. Calling `detect_entities(..., languages=("zh-TW",))` silently fell back to English patterns (i18n/__init__.py:231-233), so no Chinese names were ever extracted — Chinese-speaking users got zero people or projects detected from their own notes. This adds entity sections for both locales: - `candidate_pattern`: common-surname-prefixed CJK n-grams (~100 surnames covering >95% of Taiwanese / PRC names), length capped at {1,2} trailing chars so greedy matches don't swallow the trailing verb character (e.g. 朱宜振說). - `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's script-aware wrap (introduced in #932) fires `\b` at CJK↔non-CJK transitions. This is the same mechanism used for Devanagari, applied to the CJK range. - `person_verb_patterns`: Chinese verbs attach directly to the name with no whitespace, so patterns are written as `{name}說`, `{name}問`, `{name}決定` — no `\b` or `\s+` separators. - `dialogue_patterns`: full-width colon `：`, Chinese quotes 「」『』, plus the standard Latin forms. - `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱. - `stopwords`: ~140 common particles, pronouns, time expressions, question words, conjunctions, UI nouns, and politeness forms. Known limitation (explicitly covered by a test): CJK scripts have no word delimiters, so a name flanked by CJK on both sides with no punctuation or whitespace break is not extracted. This is a fundamental limit of regex-based CJK entity detection — resolving it would require a dictionary tokeniser. Realistic Chinese technical writing contains enough non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines) that 3+ occurrences normally produce matches. Verified against a realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences with 0.99 person-classification confidence. 
Follow-ups (separate PRs): same pattern for `ja` and `ko`, both of which currently share the silent fallback-to-English bug. Tests: 7 new tests in `tests/test_entity_detector.py`: - `test_zh_tw_candidate_extraction_at_boundaries` - `test_zh_tw_person_classification` - `test_zh_tw_stopwords_filter_common_particles` - `test_zh_tw_falls_back_to_english_for_non_cjk_names` - `test_zh_cn_candidate_extraction` - `test_zh_cn_and_zh_tw_union_covers_both_variants` - `test_zh_tw_known_limitation_inline_name_no_boundary` Full suite: 957 passed, 0 failed.	2026-04-16 17:43:09 +08:00
fatkobra	1dc55a791d	test: make Claude plugin wrapper tests portable on Windows	2026-04-16 11:41:53 +02:00
fatkobra	be9214a190	Update mempal-precompact-hook.sh	2026-04-16 10:42:20 +02:00
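
Commit 4a088ea8e1 above describes the sweeper's cursor tie-break fix: skipping on `rec.timestamp <= cursor` loses messages that share the maximum timestamp when only some of them were stored before a crash, while `< cursor` combined with deterministic drawer IDs makes the boundary safe to re-attempt. The following is a minimal sketch of that reasoning, not the project's code. The `Record` class, the `drawer_id` format, and the dict standing in for the palace collection are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    session_id: str
    uuid: str
    timestamp: str  # ISO-8601 string; string order matches time order here
    text: str


def drawer_id(rec: Record) -> str:
    # Hypothetical ID scheme: full session_id plus the message UUID.
    return f"{rec.session_id}:{rec.uuid}"


def sweep(records, store, cursor):
    """Upsert every record at or after the cursor; return the IDs newly added."""
    added = []
    for rec in records:
        # Strict "<": records AT the cursor timestamp are re-attempted.
        if cursor is not None and rec.timestamp < cursor:
            continue
        key = drawer_id(rec)
        if key not in store:
            added.append(key)
        store[key] = rec.text  # upsert: rewriting an existing ID is harmless
    return added


records = [
    Record("s1", "a", "2026-04-18T10:00:00Z", "first"),
    Record("s1", "b", "2026-04-18T10:00:05Z", "second"),
    Record("s1", "c", "2026-04-18T10:00:05Z", "third"),
]

# Simulate a crash after "a" and "b" were stored but before "c" was.
store = {"s1:a": "first", "s1:b": "second"}
cursor = "2026-04-18T10:00:05Z"  # max timestamp already present in the store

recovered = sweep(records, store, cursor)
second_run = sweep(records, store, cursor)
print(recovered, second_run)
```

With a `<=` comparison, record "c" would be skipped on every later run because it shares the cursor timestamp with "b". The strict comparison re-visits "b" as well, but the deterministic ID turns that into a no-op upsert, so the second run adds nothing.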


624 Commits