test_base_collection_update_default_validates_list_lengths and
test_base_collection_update_default_rejects_mismatched_lengths were
spinning up a real ChromaBackend and calling add(documents=...), which
triggered ChromaDB's default ONNX embedding function and attempted a
network download — failing in offline/sandboxed CI.
BaseCollection.update() validates list lengths before any DB access, so
no items need to be pre-loaded for the length-check to fire. Switch both
tests to use _FakeCollection (same as the rest of the unit tests in this
file) so they are pure in-memory and network-free.
Also fixes a structural bug in test 1: collection._collection.add() was
accidentally placed inside the pytest.raises(ValueError) block, masking
the real assertion.
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/55fc663e-b256-4b8b-88ce-4271560def8d
Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
Four defects surfaced by the automated review, fixed with targeted tests:
1. BaseCollection.update() default now validates that documents / metadatas /
embeddings lengths match ids, raising ValueError instead of silently
misaligning pairs or raising IndexError (base.py).
2. ChromaCollection.query() now rejects the two ambiguous input shapes up
front — neither or both of query_texts / query_embeddings, and empty input
lists — with clear ValueError messages rather than delegating to chromadb's
less-obvious errors (chroma.py).
3. QueryResult.empty() accepts embeddings_requested=True to preserve the
outer-query dimension with empty hit lists when the caller asked for
embeddings, matching the spec rule that included fields carry the outer
shape even when empty (base.py). ChromaCollection.query() threads this
through on the empty-result path (chroma.py).
4. ChromaBackend cache-freshness check now matches the semantics from
mcp_server._get_client (merged via #757) on three edge cases Copilot
called out: (a) invalidate when chroma.sqlite3 disappears while a cached
client is held, (b) treat a 0→nonzero stat transition as a change so a
cache built when the DB did not yet exist is refreshed, (c) re-stat
after PersistentClient constructs the DB lazily so freshness reflects
the post-creation state (chroma.py).
Tests: 978 passed (up from 970), 8 new tests covering the fixes.
Extract 2002-line monolith into landing/ subfolder:
- 8 section components (FolioHeader, HeroSection, ForgettingSection, AnatomySection, DialectSection, MechanicsSection, InstallSection, CatalogFooter)
- useLandingEffects.js composable for all vanilla-JS effects
- landing.css for all styles
- Landing.vue reduced to 28-line orchestrator
Also restores upstream hero lede text ("permanent. Designed for total recall.").
- Landing: replace nonexistent `mempalace remember` CLI demo with real
`mempalace mine ./notes`
- Landing: soften unverifiable absolutes ("forever available",
"100% recall by design", "<50 ms", "90%+ compression",
"two-thousand-year-old", "tens of thousands of entries")
- MCP tool count: 19 → 29 across mcp-integration, claude-code, openclaw,
and modules; expand tool overview with Drawers, Tunnels, and System
categories to match mcp_server.py
- Wake-up token range: ~170–900 → ~600–900 in cli/api-reference/python-api
to match cli.py help text and concept docs
- Gemini CLI: move `--scope user` before target name and add `--`
separator so `-m mempalace.mcp_server` isn't parsed as Gemini flags
On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to platform encoding, breaking onboarding tests and any source code that
reads JSON/markdown with non-ASCII content.
5 files, 8 call sites fixed.
Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.
Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.
Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.
Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.
Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.
Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.
Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
_wrap_candidate, _collect_entity_section; candidate_patterns are now
returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
(name extraction with/without boundary_chars, person-verb firing,
English regression)
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.
The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.
Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.
Closes#927
Introduces a version label to the command-line interface, displaying the current MemPalace version in the help text. Adds a `--version` flag to allow users to easily check the version and exit.
CLI strings, AAAK instruction, regex patterns, and entity section
with person-verb, pronoun, dialogue, and candidate patterns for
Latin+diacritics names (Joao, Ines, Angela).
Follows the i18n entity framework from #911.
Adds 34 prepositions and conjunctions to reduce false positives
in entity detection when these words appear sentence-initial.
Co-Authored-By: almirus <almirus@users.noreply.github.com>
Add ru.json with full Russian translations for CLI strings, palace
terminology, AAAK compression instruction, and regex patterns for
topic/action extraction with Cyrillic character classes.
No code changes needed -- the i18n module auto-discovers language
files via *.json glob in the i18n directory.