Commit Graph

13 Commits

Author SHA1 Message Date
Arnold Wender 2e441d17a2 fix(entity_registry): fsync parent dir after rename for ext4 durability
Without this, on ext4 (and similar) filesystems the rename ack does not
guarantee durability across power loss — a crash can revert to a state
where the temp file is present and the target is at the old version.

Suggested by @jphein on #1215.
2026-05-04 11:08:14 +02:00
Arnold Wender 4f36145c2e fix(entity_registry): atomic write to prevent partial corruption on crash
EntityRegistry.save() called Path.write_text() directly, which truncates
the target file and then writes — so a crash mid-write (power loss, OOM,
filesystem-full mid-flush) leaves an empty or half-written
entity_registry.json. The whole people/projects map is lost; the system
falls back to an empty registry on next load.

Switch to the standard atomic-write pattern: serialize to a sibling
.tmp file in the same directory (so os.replace stays on one filesystem),
fsync, chmod 0o600, then os.replace over the target. The replace is
atomic on POSIX and Windows, so any crash leaves the previous registry
intact instead of a truncated file.

Tests cover: no leftover .tmp on success, and previous content preserved
when os.replace itself raises mid-save.
2026-05-04 11:08:14 +02:00
Igor Lins e Silva 55a004fe1e Merge pull request #931 from mvalentsev/fix/i18n-entity-metadata
fix: use i18n candidate patterns for entity extraction in miner and palace
2026-04-16 15:54:01 -03:00
mvalentsev 09fe2dda3c fix: add explicit UTF-8 encoding to read_text() calls (#776)
On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to platform encoding, breaking onboarding tests and any source code that
reads JSON/markdown with non-ASCII content.

5 files, 8 call sites fixed.
2026-04-16 16:00:29 +05:00
mvalentsev 8bf940f861 fix: use i18n candidate patterns for entity extraction in miner and palace
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
2026-04-16 10:35:40 +05:00
Igor Lins e Silva b214aced90 refactor(entity_detector): make multi-language extensible via i18n JSON
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
2026-04-15 08:52:42 -03:00
Marcio E. Heiderscheidt b524b31839 fix: restrict file permissions on sensitive palace data (#814)
* fix: restrict file permissions on sensitive palace data

On Linux with default umask (022), several files and directories
containing personal data were created world-readable. This patch
applies chmod 0o700 to directories and 0o600 to files immediately
after creation, wrapped in try/except for Windows compatibility.

Files hardened:
- hooks_cli.py: hook_state/ directory and hook.log
- entity_registry.py: entity_registry.json (names, relationships)
- knowledge_graph.py: knowledge_graph.sqlite3 parent directory
- exporter.py: export output directory and wing subdirectories
- config.py: people_map.json (name mappings)
- mcp_server.py: WAL file creation uses atomic os.open (TOCTOU fix)

Refs: MemPalace/mempalace#809

* fix: avoid redundant chmod calls on hot paths

- hooks_cli.py: chmod STATE_DIR and hook.log only on first creation,
  not on every _log() call (hooks fire on every Stop event)
- exporter.py: track created wing dirs to skip redundant makedirs +
  chmod on the same directory across batches
- mcp_server.py: remove redundant _WAL_FILE.chmod after os.open
  already set mode=0o600 atomically

Refs: MemPalace/mempalace#809
2026-04-15 00:27:03 -07:00
Marcio E. Heiderscheidt f20f45a2da fix: make entity_registry.research() local-only by default (#811)
* fix: make entity_registry.research() local-only by default

research() previously called _wikipedia_lookup() unconditionally,
sending entity names to en.wikipedia.org on every uncached lookup.
This violates the project's local-first and privacy-by-architecture
principles documented in CLAUDE.md.

Changes:
- research() now returns "unknown" for uncached words by default
- New allow_network=True parameter required for Wikipedia lookups
- Wikipedia 404 now returns "unknown" instead of asserting "person"
  with 0.70 confidence, preventing entity registry poisoning
- Added privacy warning docstring to _wikipedia_lookup()
- Added tests for local-only default, opt-in network, 404 handling,
  and cache-not-persisted-on-local-only behaviour

Refs: MemPalace/mempalace#809

* fix: improve research() cache read path and deduplicate test mocks

- Use .get() instead of .setdefault() for cache reads in research()
  so the local-only path never mutates _data unnecessarily
- Move .setdefault() to the network-write path only
- Use result.setdefault() for word/confirmed keys to ensure
  consistent return shape across all _wikipedia_lookup error paths
- Extract duplicated mock_result dict into _MOCK_SAOIRSE_PERSON
  constant shared by 3 test functions
2026-04-15 00:26:24 -07:00
Tal Muskal abd52534bb test: bring coverage to 85%, set threshold to 85, reset version to 3.0.11
- Add tests for config, convo_miner, spellcheck, knowledge_graph
- Fix Windows PermissionError in test cleanup (chromadb file locks)
- Add UTF-8 encoding to split_mega_files, entity_registry, hooks_cli
- Fix mcp_server parse_known_args logging for unknown args
- Set coverage threshold to 85 in pyproject.toml and CI
- Reset all version files to 3.0.11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 21:38:12 +03:00
Ben Sigman 6af6fe3dda Merge pull request #54 from adv3nt3/fix/narrow-exception-handling
fix: narrow bare except Exception to specific types where safe
2026-04-07 13:54:05 -07:00
adv3nt3 312d380aab fix: narrow bare except Exception to specific types where safe
Replace broad except Exception with specific exception types in 6
sites where the expected failure mode is well-defined:

- normalize.py: OSError for file read, ImportError for optional import
- miner.py: OSError for file read_text
- entity_detector.py: OSError for file read in scan loop
- convo_miner.py: (OSError, ValueError) for normalize which reads
  and parses files
- entity_registry.py: (URLError, OSError, JSONDecodeError, KeyError)
  for Wikipedia lookup fallback

ChromaDB except Exception sites (~30) are left broad for now.
chromadb.errors defines NotFoundError, DuplicateIDError,
InvalidDimensionException etc., but narrowing those sites requires
importing from chromadb.errors and validating across supported
versions (>=0.4.0). MCP server handlers also left broad for
resilience.
2026-04-07 13:51:27 +02:00
adv3nt3 3c78e2fbb5 fix: remove dead code and duplicate set items in entity_registry.py
Remove discarded `query.lower()` call in `extract_people_from_query` —
strings are immutable so the result was always thrown away. The existing
`re.IGNORECASE` flag already handles case-insensitive matching.

Remove duplicate literals in COMMON_ENGLISH_WORDS set: "hunter" (consecutive
duplicate), "april" and "june" (appeared in both names and months sections).
2026-04-07 13:00:59 +02:00
Milla Jovovich 068dbd9a7b MemPalace: palace architecture, AAAK compression, knowledge graph
The memory system:
- Palace structure: Wings (people/projects) → Rooms (topics) → Closets (AAAK compressed) → Drawers (verbatim transcripts)
- Halls connect related rooms within a wing
- Tunnels cross-reference rooms across wings
- AAAK: 30x lossless compression dialect for AI agents
- Knowledge graph: temporal entity-relationship triples (SQLite)
- Palace graph: room-based navigation with tunnel detection
- MCP server: 19 tools — search, graph traversal, agent diary, AAAK auto-teach
- Onboarding: guided setup generates wing config + AAAK entity registry
- Contradiction detection: catches wrong pronouns, names, ages
- Auto-save hooks for Claude Code

96.6% Recall@5 on LongMemEval — highest zero-API score published.
100% with optional Haiku rerank (500/500).
Local. Free. No API key required.
2026-04-04 18:16:04 -07:00