Commit Graph

842 Commits

Author SHA1 Message Date
Igor Lins e Silva fd89303fe1 docs(changelog): backfill post-v3.3.0 PRs missed by initial boundary
Advisor caught: initial boundary (962776c..develop) skipped PRs that
landed on develop after v3.3.0 tag but before the sync-back merge.
Adds entries for #871 MEMPAL_VERBOSE, #811 research() local-only
default, #866 init .gitignore, #864 MCP stdout redirect, #863
precompact hook, #865 searcher empty results, #831 cold-start palace,
#862 init help, #815 Slack provenance, #840 save hook auto-mine.
Also drops the awkward caveat on #846 created_at — it's post-v3.3.0.
2026-04-16 16:12:37 -03:00
Igor Lins e Silva 2087869752 release: v3.3.1
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word
  boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive
  locale resolution (#928) + i18n patterns wired into miner/palace/
  entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907),
  hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
2026-04-16 16:09:02 -03:00
Igor Lins e Silva 55a004fe1e Merge pull request #931 from mvalentsev/fix/i18n-entity-metadata
fix: use i18n candidate patterns for entity extraction in miner and palace
2026-04-16 15:54:01 -03:00
Igor Lins e Silva c5e249bba8 Merge pull request #946 from mvalentsev/fix/utf8-read-text
fix: add explicit UTF-8 encoding to read_text() calls (#776)
2026-04-16 15:52:42 -03:00
Igor Lins e Silva 65f99ad7e6 Merge pull request #928 from arnoldwender/fix/i18n-lang-case-insensitive
fix(i18n): resolve language codes case-insensitively (#927)
2026-04-16 15:44:36 -03:00
Igor Lins e Silva 29112fab82 Merge pull request #778 from dominosaurs/feat/id-lang
feat: add Indonesian language support
2026-04-16 15:44:26 -03:00
Igor Lins e Silva 4215be3926 Merge pull request #773 from tejasashinde/feat/add-i18n-hindi
feat: add Hindi language support to i18n module
2026-04-16 15:44:08 -03:00
jp 8adf35a13c fix: add threading lock to graph cache, expand docstring
Address review feedback from @bensig:
1. Wrap cache reads/writes in threading.Lock for thread safety
2. Promote the col-arg caveat from inline comment to docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 09:00:36 -07:00
jp 1657a79649 fix: clarify cache docs, skip caching empty graphs
Addresses Copilot review feedback on #661.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 09:00:27 -07:00
jp 84e2aa16e4 perf: graph cache with write-invalidation in build_graph()
build_graph() scans every drawer's metadata in 1000-item batches on
every call — O(n) per graph build with no caching. At 50K+ drawers
this costs several seconds per MCP tool call (traverse, find_tunnels,
graph_stats all call build_graph on every invocation).

Add a module-level cache (nodes + edges + timestamp) with a 60-second
TTL. Cache is invalidated via invalidate_graph_cache(), exported for
write operations to call. Tests updated with setup_method cache resets
and two new tests verifying cache hit and invalidation behaviour.
2026-04-16 09:00:27 -07:00
jp 15ea385554 fix: replace all non-ASCII progress markers for Windows encoding
Also fix miner.py checkmark and box-drawing/arrow chars (─, →) in
both miner.py and split_mega_files.py that would crash on cp1251/cp1252.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 08:59:58 -07:00
jp 542b53bb0f fix: replace Unicode checkmark with ASCII + for Windows encoding (#535)
Windows terminals using cp1251/cp1252 crash on the Unicode ✓ (U+2713)
in progress output. Replace with ASCII + in convo_miner.py and
split_mega_files.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 08:59:58 -07:00
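The failure mode behind this fix can be reproduced without a Windows terminal by encoding the marker by hand. The helper name `can_encode` is made up for this illustration.

```python
CHECK = "\u2713"  # the Unicode check mark named in the commit


def can_encode(text, encoding):
    """Return True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for codec in ("cp1251", "cp1252", "utf-8"):
    print(codec, can_encode(CHECK, codec), can_encode("+", codec))
```

Neither legacy code page has a slot for U+2713, whereas the ASCII `+` encodes in every one of them.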
mvalentsev 09fe2dda3c fix: add explicit UTF-8 encoding to read_text() calls (#776)
On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to platform encoding, breaking onboarding tests and any source code that
reads JSON/markdown with non-ASCII content.

5 files, 8 call sites fixed.
2026-04-16 16:00:29 +05:00
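A minimal sketch of the pattern this fix applies, using a temporary file with a made-up name rather than any of the project's own files:

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"  # hypothetical file name
    path.write_text(json.dumps({"name": "João"}, ensure_ascii=False), encoding="utf-8")
    # Without encoding=..., read_text() falls back to the platform default
    # (for example GBK or cp1252 on Windows), which can misread or reject these bytes.
    data = json.loads(path.read_text(encoding="utf-8"))
```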
🍕 939d4c1e74 feat: Update Indonesian translations
Refine AAAK instruction and expand entity detection patterns.
2026-04-16 17:43:51 +08:00
Lman Chu 683e940f70 feat(i18n): add Traditional + Simplified Chinese entity detection
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.

This adds entity sections for both locales:

- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
  surnames covering >95% of Taiwanese / PRC names), length capped
  at {1,2} trailing chars so greedy matches don't swallow the
  trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
  script-aware wrap (introduced in #932) fires `\b` at CJK↔non-CJK
  transitions. This is the same mechanism used for Devanagari,
  applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
  name with no whitespace, so patterns are written as `{name}說`,
  `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes
  「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
  question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.
2026-04-16 17:43:09 +08:00
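The candidate pattern and its known limitation can be illustrated with a much smaller regex. This is a sketch under assumptions: the five-surname list and the explicit lookarounds are stand-ins for the roughly 100-surname pattern and the loader's script-aware wrap, whose exact form is not shown in the message.

```python
import re

CJK = r"\u4E00-\u9FFF"
SURNAMES = "朱王李陳林"  # tiny illustrative subset, not the real list

# Surname plus 1-2 trailing characters, flanked by non-CJK text or a string edge.
candidate = re.compile(rf"(?<![{CJK}])([{SURNAMES}][{CJK}]{{1,2}})(?![{CJK}])")


def extract(text):
    """Return every candidate name found at a CJK/non-CJK transition."""
    return candidate.findall(text)
```

With this sketch, `extract("- 朱宜振：負責部署")` finds the name because a space and a full-width colon flank it, while `extract("我跟朱宜振說話")` returns nothing, which mirrors the limitation described above.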
fatkobra 1dc55a791d test: make Claude plugin wrapper tests portable on Windows 2026-04-16 11:41:53 +02:00
fatkobra be9214a190 Update mempal-precompact-hook.sh 2026-04-16 10:42:20 +02:00
fatkobra 5fe0c1c2ac Update mempal-stop-hook.sh 2026-04-16 10:33:34 +02:00
fatkobra e083cd6c84 Create test_claude_plugin_hook_wrappers.py 2026-04-16 10:32:17 +02:00
🍕 88f5b5fa0e Add Indonesian language support
Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.

Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.
2026-04-16 16:15:47 +08:00
mvalentsev cde0f5b9e7 remove unnecessary comment 2026-04-16 10:38:38 +05:00
mvalentsev 973bd62a9a fix: use pre-wrapped candidate patterns after #932 refactor 2026-04-16 10:37:18 +05:00
mvalentsev 8bf940f861 fix: use i18n candidate patterns for entity extraction in miner and palace
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
2026-04-16 10:35:40 +05:00
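The gap this commit closes is easy to see on a mixed-script sentence. In the sketch below, `CANDIDATE_PATTERNS` and `candidate_entity_words` are hypothetical simplifications; the real patterns are loaded from the locale JSON files and the real helper is `_candidate_entity_words()`.

```python
import re

ASCII_ONLY = re.compile(r"[A-Z][a-z]{2,}")  # the hardcoded regex being replaced

# Hypothetical per-locale candidate patterns for illustration only.
CANDIDATE_PATTERNS = {
    "en": r"[A-Z][a-z]{2,}",
    "ru": r"[А-ЯЁ][а-яё]{2,}",
}


def candidate_entity_words(text, languages=("en",)):
    """Union of candidate words across the requested locales, in first-seen order."""
    found = []
    for lang in languages:
        for word in re.findall(CANDIDATE_PATTERNS[lang], text):
            if word not in found:
                found.append(word)
    return found


text = "Иван встретил Anna"
ascii_hits = ASCII_ONLY.findall(text)
i18n_hits = candidate_entity_words(text, ("en", "ru"))
```

Here `ascii_hits` contains only the Latin name, while `i18n_hits` also picks up the Cyrillic one.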
JunghwanNA fb1cf53919 fix: harden repair backup scope and migrate swap rollback
- repair.py: define backup_path before the conditional block so it is
  always in scope when the except handler references it
- migrate.py: restore old palace from .old if both os.rename and
  shutil.move fail during the swap step
2026-04-16 14:04:26 +09:00
tejasashinde 21da870bd0 fix(i18n/hi): add boundary_chars and update action_pattern for Devanagari-aware matching 2026-04-16 09:21:21 +05:30
JunghwanNA 5bf826046c fix: sanitize topic parameter in tool_diary_write
agent_name and entry are validated via sanitize_name/sanitize_content,
but topic is stored raw into ChromaDB metadata. Apply the same
sanitize_name guard to reject null bytes, path traversal, and
oversized payloads.
2026-04-16 12:12:17 +09:00
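The kind of guard described here might look like the following. This is a hypothetical validator written for illustration: the project's actual `sanitize_name` is not shown in the log, and the length limit and the exact rejection rules below are assumptions.

```python
MAX_NAME_LENGTH = 128  # assumed limit, not taken from the project


def sanitize_name_sketch(value, field="name"):
    """Hypothetical validator: return the stripped value or raise ValueError."""
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"{field} must be a non-empty string")
    value = value.strip()
    if len(value) > MAX_NAME_LENGTH:
        raise ValueError(f"{field} exceeds {MAX_NAME_LENGTH} characters")
    if "\x00" in value:
        raise ValueError(f"{field} contains a null byte")
    if ".." in value or "/" in value or "\\" in value:
        raise ValueError(f"{field} contains path characters")
    return value
```

Applying the same check to `topic` as to `agent_name` means every string that reaches the metadata store passes through one gate.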
JunghwanNA 5dfe853154 fix: guard against data loss in repair, migrate, and CLI rebuild
- repair.py: wrap upsert loop in try/except; restore from backup on
  failure instead of leaving a partially rebuilt collection
- migrate.py: replace non-atomic rmtree+move with rename-aside swap
  so a crash between the two calls does not destroy both copies
- cli.py: use offset += len(batch["ids"]) with empty-batch guard
  instead of fixed offset += batch_size to prevent skipping drawers
2026-04-16 12:11:18 +09:00
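The cli.py change can be illustrated with a store that returns short pages. `iter_batches` and `fake_fetch` are hypothetical names; only the `offset += len(batch["ids"])` step and the empty-batch guard come from the message.

```python
def iter_batches(fetch, batch_size=1000):
    """Yield batches, advancing by the number of ids actually returned."""
    offset = 0
    while True:
        batch = fetch(limit=batch_size, offset=offset)
        if not batch["ids"]:  # empty-batch guard: stop instead of looping forever
            break
        yield batch
        offset += len(batch["ids"])


# Hypothetical store that returns at most 3 ids per call, whatever the limit.
ALL_IDS = [f"drawer-{i}" for i in range(7)]


def fake_fetch(limit, offset):
    return {"ids": ALL_IDS[offset:offset + min(limit, 3)]}


seen = [i for batch in iter_batches(fake_fetch, batch_size=5) for i in batch["ids"]]
```

With a fixed `offset += batch_size` step, the same store would be read at offsets 0, 5, and 10, and `drawer-3` and `drawer-4` would never be visited.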
Igor Lins e Silva d4c942417a Merge pull request #932 from MemPalace/fix/entity-detector-non-latin-boundaries
fix(entity_detector): script-aware word boundaries for combining-mark scripts
2026-04-15 22:38:59 -03:00
Igor Lins e Silva f895bc58e6 fix(entity_detector): script-aware word boundaries for combining-mark scripts
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.

Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
2026-04-15 22:18:52 -03:00
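The truncation and the lookaround fix can both be reproduced in a few lines. `expand_word_boundary` is a hypothetical simplification of the idea behind `_expand_b`, not its actual code; the `boundary_chars` value is the one quoted in the message.

```python
import re


def expand_word_boundary(pattern, boundary_chars):
    r"""Replace each \b with a lookaround boundary over the given class body.

    Naive textual replace: assumes the pattern has no escaped backslash before 'b'.
    """
    inside = f"[{boundary_chars}]"
    boundary = f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"
    return pattern.replace(r"\b", boundary)


HINDI_CHARS = r"\w\u0900-\u097F"  # boundary_chars for Hindi, as quoted in the commit
WORD = r"\b[\u0900-\u097F]+\b"

name = "\u0905\u0928\u0940\u0924\u093e"  # अनीता (Anita)
plain = re.findall(WORD, name)
aware = re.findall(expand_word_boundary(WORD, HINDI_CHARS), name)
```

With plain `\b`, the match stops before the final matra, so `plain` holds the name minus its last code point; with the expanded boundary, `aware` holds the whole name.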
Arnold Wender 6caac50138 fix(i18n): use Optional[str] for Python 3.9 compatibility
PEP 604 union syntax (str | None) requires Python 3.10+. The project
supports 3.9 per CI matrix, so use typing.Optional instead.
2026-04-15 23:37:12 +02:00
Arnold Wender 0174b93d0f fix(i18n): resolve language codes case-insensitively (#927)
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.

The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.

Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.

Closes #927
2026-04-15 23:33:42 +02:00
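A minimal sketch of the lookup this fix describes. The function below takes the list of on-disk stems as an argument and returns `None` for unknown codes; both choices are assumptions made for the example and may differ from `_canonical_lang`.

```python
from typing import Iterable, Optional


def canonical_lang(lang: str, available: Iterable[str]) -> Optional[str]:
    """Map any casing of `lang` to the matching on-disk stem, or None if unknown."""
    wanted = lang.strip().lower()
    for stem in available:
        if stem.lower() == wanted:
            return stem
    return None


STEMS = ["en", "pt-br", "zh-CN"]  # mixed conventions, as in the commit message


def cache_key(languages):
    """Build the key from canonical names so 'PT-BR' and 'pt-br' share one slot."""
    return tuple(canonical_lang(lang, STEMS) or lang for lang in languages)
```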
Igor Lins e Silva 122ce38811 Merge pull request #907 from Archetipo95/feat/italian-i18n-support
feat: add Italian language support
2026-04-15 18:05:13 -03:00
Igor Lins e Silva 57b0b14192 Merge pull request #156 from mvalentsev/feat/pt-br-entity-detection
feat: add Brazilian Portuguese support to entity_detector (closes #117)
2026-04-15 17:53:30 -03:00
almirus 10cdd93cec feat(cli): add version display and version flag to CLI
Introduces a version label to the command-line interface, displaying the current MemPalace version in the help text. Adds a `--version` flag to allow users to easily check the version and exit.
2026-04-15 21:44:20 +03:00
mvalentsev 4221589df2 fix(i18n): address review feedback on pt-br.json
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
2026-04-15 23:32:31 +05:00
mvalentsev 3d13a72ae0 feat(i18n): add Brazilian Portuguese locale with entity detection (closes #117)
CLI strings, AAAK instruction, regex patterns, and entity section
with person-verb, pronoun, dialogue, and candidate patterns for
Latin+diacritics names (João, Inês, Ângela).

Follows the i18n entity framework from #911.
2026-04-15 23:32:31 +05:00
Tejas Shinde 33a98fb9d1 Updated hi.json to support infra for entity, pronoun_patterns, dialogue_patterns, direct_address_pattern, project_verb_patterns, and stopwords 2026-04-15 23:33:24 +05:30
Tejas Shinde ce3ae0a668 Merge branch 'MemPalace:develop' into feat/add-i18n-hindi 2026-04-15 23:19:57 +05:30
Martin Masevski 69453b2180 feat: add Italian entity patterns 2026-04-15 19:18:23 +02:00
Martin Masevski 2e998db0b9 feat: add Italian i18n support 2026-04-15 19:15:55 +02:00
Igor Lins e Silva 73a2f82d5b Merge pull request #760 from mvalentsev/feat/i18n-russian
feat: add Russian language support (ru.json)
2026-04-15 13:46:04 -03:00
Igor Lins e Silva 312b3b5f0e Merge pull request #758 from mvalentsev/fix/i18n-review-issues
fix: address i18n review issues from PR #718
2026-04-15 13:45:49 -03:00
mvalentsev 4b998de77a feat(i18n): expand Russian entity stopwords with prepositions and conjunctions
Adds 34 prepositions and conjunctions to reduce false positives
in entity detection when these words appear sentence-initial.

Co-Authored-By: almirus <almirus@users.noreply.github.com>
2026-04-15 21:14:51 +05:00
mvalentsev 3e49522a42 fix(i18n): apply review feedback on ru.json (#760)
- mine_skip: "повторной раскопки" -> "повторной обработки"
- quote_pattern: add Russian guillemet quotes «»

Co-Authored-By: almirus <almirus@users.noreply.github.com>
2026-04-15 20:17:16 +05:00
mvalentsev d6bd7de5f6 feat(i18n): add entity detection section to Russian locale
Cyrillic candidate/multi-word patterns, person-verb patterns
(сказал, спросил, ответил, etc.), pronoun patterns, dialogue
markers, direct address, and Russian stopwords.

Follows the i18n entity framework from #911.
2026-04-15 18:16:25 +05:00
mvalentsev b87ada3c96 feat: add Russian language support to i18n module
Add ru.json with full Russian translations for CLI strings, palace
terminology, AAAK compression instruction, and regex patterns for
topic/action extraction with Cyrillic character classes.

No code changes needed -- the i18n module auto-discovers language
files via *.json glob in the i18n directory.
2026-04-15 18:15:15 +05:00
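Glob-based discovery of this kind might be sketched as follows. `discover_languages` is a hypothetical name, and the example runs against a temporary directory rather than the real i18n package.

```python
import json
import tempfile
from pathlib import Path


def discover_languages(i18n_dir):
    """Return the sorted stems of every *.json file in the directory."""
    return sorted(p.stem for p in Path(i18n_dir).glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    for code in ("en", "ru"):
        (Path(tmp) / f"{code}.json").write_text(json.dumps({"lang": code}), encoding="utf-8")
    (Path(tmp) / "README.md").write_text("not a locale", encoding="utf-8")
    languages = discover_languages(tmp)
```

Dropping a new `xx.json` next to the others is then enough to register a language, which is why the commit needs no Python changes.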
Igor Lins e Silva 3bac3654c4 Merge pull request #911 from MemPalace/refactor/entity-detector-i18n
refactor(entity_detector): make multi-language extensible via i18n JSON
2026-04-15 09:40:36 -03:00
Igor Lins e Silva c722c91e2a test: document orphan-locale recovery for _temp_locale helper 2026-04-15 08:54:23 -03:00
Igor Lins e Silva b214aced90 refactor(entity_detector): make multi-language extensible via i18n JSON
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
2026-04-15 08:52:42 -03:00
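The merge behaviour listed for `get_entity_patterns` can be sketched as below. This is an illustration under assumptions: `ENTITY_SECTIONS` is a hypothetical in-memory stand-in for the locale JSON files, only two fields are modelled, and the function names carry a `_sketch` or helper-style name to mark them as not the project's own.

```python
import os

# Hypothetical stand-in for the per-locale "entity" JSON sections.
ENTITY_SECTIONS = {
    "en": {"person_verb_patterns": [r"{name}\s+said"], "stopwords": ["the", "and"]},
    "pt-br": {"person_verb_patterns": [r"{name}\s+disse"], "stopwords": ["de", "and"]},
}


def get_entity_patterns_sketch(languages=("en",)):
    """Merge list fields without duplicates, union stopwords, unknown -> English."""
    merged = {"person_verb_patterns": [], "stopwords": set()}
    for lang in languages:
        section = ENTITY_SECTIONS.get(lang, ENTITY_SECTIONS["en"])
        for pattern in section["person_verb_patterns"]:
            if pattern not in merged["person_verb_patterns"]:
                merged["person_verb_patterns"].append(pattern)
        merged["stopwords"] |= set(section["stopwords"])
    return merged


def languages_from_env(default=("en",)):
    """Read the comma-separated MEMPALACE_ENTITY_LANGUAGES override, if set."""
    raw = os.environ.get("MEMPALACE_ENTITY_LANGUAGES", "")
    langs = tuple(part.strip() for part in raw.split(",") if part.strip())
    return langs or default
```

Because only the requested locales are merged, the number of patterns checked per text grows with the languages a user enables rather than with the number that ship.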
Igor Lins e Silva 56b6a6360f Merge pull request #908 from fatkobra/test/palace-graph-tunnels
test: add palace_graph tunnel helper coverage
2026-04-15 08:23:18 -03:00