Commit Graph

15 Commits

Author SHA1 Message Date
Marcio E. Heiderscheidt e61dc2adf8 fix: add provenance header and speaker IDs to Slack transcript imports (#815)
* fix: add provenance header and speaker IDs to Slack transcript imports

Slack exports are multi-party chats where no speaker is inherently
the "user" or "assistant". The parser previously assigned these roles
purely by position, allowing a crafted export to place attacker text
in the "user" role — making it appear as the memory owner's words
in all future retrieval (data poisoning via stored memory).

Changes:
- Add provenance header marking Slack transcripts as multi-party
  with positional (unverified) role assignment
- Prefix each message with the original speaker ID ([U1], [U2], etc.)
  so downstream consumers can distinguish authors
- Keep user/assistant role alternation for exchange-pair chunking
  compatibility with convo_miner.py

Tests:
- Provenance header presence and content
- Speaker ID preservation in output
- Attacker-first-message attribution verification

Refs: MemPalace/mempalace#809

* fix: move Slack provenance to footer, sanitize speaker IDs, extract constant

- Move provenance notice from header to footer to prevent it becoming
  a standalone ChromaDB drawer via paragraph chunking on exports
  with fewer than 3 exchange pairs (violates verbatim-always principle)
- Sanitize speaker user_id/username: strip brackets, newlines, and
  control characters to prevent chunk-boundary injection via crafted
  Slack exports
- Extract header string to _SLACK_PROVENANCE_FOOTER module constant,
  consistent with _TOOL_RESULT_* constants pattern; tests import it
  instead of duplicating the literal

Refs: MemPalace/mempalace#809
2026-04-15 00:27:01 -07:00
Igor Lins e Silva ca2598a9f6 fix(normalize): make strip_noise verbatim-safe and scope it to Claude Code JSONL
The initial strip_noise() regressed on three fronts when audited against
adversarial user content — each verified with executable repros against
the cherry-picked code:

  1. `<tag>.*?</tag>` with re.DOTALL span-ate across messages: one
     stray unclosed <system-reminder> anywhere in a session merged with
     the next closing tag, silently deleting everything between them
     (including full assistant replies).
  2. `.*\(ctrl\+o to expand\).*\n?` nuked entire lines of user prose
     whenever a user happened to document the TUI shortcut.
  3. `Ran \d+ (?:stop|pre|post)\s*hook.*` with IGNORECASE ate the
     second sentence from "our CI has a stop hook ... Ran 2 stop hooks
     last week" — legitimate user commentary.

These are unambiguous violations of the project's "Verbatim always"
design principle.

Fixes:

- All tag patterns are now line-anchored (`(?m)^(?:> )?<tag>`) and their
  body forbids crossing a blank line (`(?:(?!\n\s*\n)[\s\S])*?`), so a
  dangling open tag cannot eat neighboring messages.
- `_NOISE_LINE_PREFIXES` are line-anchored and case-sensitive — user
  prose mentioning "CURRENT TIME:" mid-sentence is preserved.
- Hook-run chrome requires `(?m)^`, explicit hook names (Stop,
  PreCompact, PreToolUse, etc.), and no IGNORECASE.
- "… +N lines" is line-anchored.
- "(ctrl+o to expand)" only matches Claude Code's actual collapsed-
  output chrome shape `[N tokens] (ctrl+o to expand)`; a bare
  parenthetical in user prose stays intact.

Scope:

- `strip_noise()` is no longer called on every normalization path.
  Only `_try_claude_code_jsonl` invokes it, per-extracted-message — so
  Claude.ai exports, ChatGPT exports, Slack JSON, Codex JSONL, and
  plain text with `>` markers pass through fully verbatim. Per-message
  application also makes span-eating structurally impossible.

Tests:

- 15 new tests in test_normalize.py pin the boundary: 6 guard user
  content that must survive (each of the adversarial repros), 9 assert
  real system chrome is still stripped. All pass; full suite 702 pass
  (2 failures are the unrelated pre-existing version.py bug, cleared
  by #820).

Known limitation (not fixed here): convo_miner.py does not delete
drawers on re-mine, so transcripts mined before this PR keep noise-
filled drawers until the user manually erases + re-mines. Proper fix
needs a schema-version field on drawer metadata + re-mine trigger —
out of scope for this PR.
2026-04-13 16:11:03 -03:00
MSL 9b99c136ee fix: strip system tags, hook output, and Claude UI chrome from drawers
normalize.py now strips before filing:
- <system-reminder>, <command-message>, <command-name> tags
- <task-notification>, <user-prompt-submit-hook>, <hook_output> tags
- Hook status messages (CURRENT TIME, Checking verified facts, etc.)
- Claude Code UI chrome (ctrl+o to expand, progress bars, etc.)
- Collapsed runs of blank lines

This noise was going straight into drawers, wasting storage space
and polluting search results. strip_noise() runs on all normalized
output regardless of input format (JSONL, JSON, plain text).

689/689 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 07:30:12 -03:00
Mikhail Valentsev a2432a3245 fix: parse Claude.ai privacy export with messages key and sender field (#677) (#685)
* fix: parse Claude.ai privacy export with messages key and sender field (#677)

The privacy-export branch in _try_claude_ai_json only checked for the
"chat_messages" key, missing exports that use "messages" instead.  It
also only read the "role" field while real privacy exports use "sender".
Both gaps caused the file to fall through to plain-text, producing a
single giant drawer.

Changes:
- Accept "messages" alongside "chat_messages" in the conversation-object
  guard and inner extraction.
- Accept "sender" alongside "role" as the author field.
- Fall back to a top-level "text" key when content blocks are empty.
- Produce one transcript per conversation instead of concatenating all
  conversations into a single blob.
- Extract shared logic into _collect_claude_messages helper.
- Add 6 regression tests covering each variant.

* style: apply ruff format to normalize.py

* fix: guard against null text field in Claude.ai export parsing

item.get("text", "").strip() crashes when "text" is explicitly null
in the JSON (legal and observed in some exports). Use
(item.get("text") or "").strip() and add a regression test.

---------

Co-authored-by: Igor Lins e Silva <4753812+igorls@users.noreply.github.com>
2026-04-13 02:11:03 -03:00
Ben Sigman 4621f85d7c style: ruff format all Python files (#675) 2026-04-11 22:59:34 -07:00
Ben Sigman 20c8f8e57b feat: new MCP tools — get/list/update drawer, hook settings, export (resolves #635) (#667)
* feat: MCP reliability — inode detection, WAL rotation, metadata cache, search limits

Infrastructure hardening for the MCP server:
- Detect palace DB replacement via inode tracking (repair command support)
- WAL rotation to prevent unbounded WAL growth
- _fetch_all_metadata() + _get_cached_metadata() with 60s TTL for taxonomy/status
- _MAX_RESULTS cap (100) with limit clamping [1, _MAX_RESULTS]
- max_distance parameter for similarity threshold in search
- Handle all notifications/* methods, null arguments, method=None
- Remove duplicate _client_cache = None declarations
- searcher.py max_distance parameter passthrough

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: new MCP tools (get/list/update drawer, hook settings, memories filed), export, normalize

New MCP tools:
- mempalace_get_drawer: fetch single drawer by ID with full content
- mempalace_list_drawers: paginated listing with wing/room filter
- mempalace_update_drawer: update content/wing/room on existing drawers
- mempalace_hook_settings: get/set hook behavior (silent_save, desktop_toast)
- mempalace_memories_filed_away: check latest checkpoint status

Also includes:
- exporter.py: export palace as browsable markdown files
- normalize.py: tool_use/tool_result capture for richer transcript mining
- layers.py: updated for new tool integration
- config.py: hook settings properties (hook_silent_save, hook_desktop_toast)

Depends on PR 3 (reliability) for _MAX_RESULTS, _metadata_cache, WAL logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: normalize.py handles string messages and Read offset type mismatch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: params null guard, L2→cosine docs, empty tool_use_map key guard

- Handle explicit null in MCP params (request.get("params") or {})
- Fix search tool description: L2 → cosine distance (collection uses hnsw:space=cosine)
- Guard against empty string key in tool_use_map from malformed JSONL entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rename ambiguous var 'l' to 'line' (E741 lint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address code review findings (5 issues)

1. min_similarity backwards-compat: convert similarity to distance scale
   (1.0 - similarity) instead of passing raw value as max_distance
2. Restore structured error reporting (error + partial fields) in
   tool_status, tool_list_wings, tool_list_rooms, tool_get_taxonomy
   — reverts silent except:pass that dropped #647 security hardening
3. inode cache: remove falsy-zero short-circuit so missing DB file
   triggers reconnect instead of reusing stale client
4. _fetch_all_metadata: check for empty batch before extending/advancing
   offset to prevent infinite loop on concurrent deletion
5. KG initialization: only override path when --palace is explicit;
   default runs use KnowledgeGraph's built-in default path

Co-authored-by: jphein <jphein@users.noreply.github.com>

---------

Co-authored-by: jp <jp@jphein.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: jphein <jphein@users.noreply.github.com>
2026-04-11 21:25:04 -07:00
bensig b1adc047e6 fix: address Octocode review — move size check, add tests for all 3 fixes
- Move file size check before try block so IOError propagates cleanly
  (not caught by the except OSError handler below it)
- Wrap os.path.getsize in its own try/except to preserve existing
  test_normalize_io_error behavior on missing files
- Add test_normalize_rejects_large_file (mocked getsize)
- Add test_null_arguments_does_not_hang (#394)
- Add test_cmd_repair_trailing_slash_does_not_recurse (#395)

532 tests pass locally, 0 regressions.
2026-04-09 10:40:53 -07:00
bensig 0720fb84f8 fix: MCP null args hang, repair infinite recursion, OOM on large files
Three critical bugfixes:

1. MCP server hangs on null arguments (#394) — `params.get("arguments", {})`
   returns None when JSON has `"arguments": null`. Changed to `or {}`.

2. cmd_repair infinite recursion (#395) — trailing slash on palace_path
   caused backup_path to be inside the source dir. Strip trailing sep.

3. OOM on large transcript files (#396) — split_mega_files.py and
   normalize.py load entire files into memory. Added 500MB safety limit
   with clear skip/error messages.

Closes #394, #395, #396.
2026-04-09 10:05:37 -07:00
Ben Sigman 0b0e123f42 Merge pull request #61 from adv3nt3/feat/codex-cli-normalizer
feat: add OpenAI Codex CLI JSONL normalizer
2026-04-07 13:54:08 -07:00
Ben Sigman 6af6fe3dda Merge pull request #54 from adv3nt3/fix/narrow-exception-handling
fix: narrow bare except Exception to specific types where safe
2026-04-07 13:54:05 -07:00
bensig 5e8a039e7c fix: repair command, split args, Claude export, room keywords
- Add `mempalace repair` command to rebuild vector index from SQLite
  when HNSW files are corrupted after crash/interrupt (fixes #74, #72, #96)
- Fix split command passing dir as positional instead of --source
  flag to split_mega_files (fixes #63)
- Handle Claude privacy export format (array of conversation objects
  with chat_messages inside each) in normalize.py (fixes #63)
- Persist room keywords in mempalace.yaml so mine can match files
  in docs/ to room "documentation" (fixes #108)
2026-04-07 12:02:34 -07:00
bensig 186bb2e3d1 fix: shell injection in hooks, Claude Code mining, chromadb pin
- hooks/mempal_save_hook.sh: pass $TRANSCRIPT_PATH as sys.argv
  instead of interpolating into python -c string (fixes #110)
- normalize.py: accept type "user" in addition to "human" for
  Claude Code JSONL sessions (fixes #111)
- convo_miner.py: skip tool-results/, memory/ dirs and .meta.json
  files when scanning for conversations (fixes #111)
- pyproject.toml: pin chromadb>=0.4.0,<1 to avoid crashing 1.x
  builds on macOS ARM64 (fixes #100)
2026-04-07 11:45:51 -07:00
adv3nt3 d4e1945f77 feat: add OpenAI Codex CLI JSONL normalizer
Add _try_codex_jsonl parser for Codex CLI session files stored at
~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl.

Uses only event_msg entries (user_message / agent_message) which
represent the canonical conversation turns. response_item entries
are intentionally skipped — they include synthetic context injections
(environment_context) and can duplicate real messages when both
representations are present in the same rollout.

Format based on Codex source tests (codex-rs/rollout/src/recorder_tests.rs).
Requires session_meta header to reduce false positives on other JSONL.

Refs: #59
2026-04-07 14:50:04 +02:00
adv3nt3 312d380aab fix: narrow bare except Exception to specific types where safe
Replace broad except Exception with specific exception types in 6
sites where the expected failure mode is well-defined:

- normalize.py: OSError for file read, ImportError for optional import
- miner.py: OSError for file read_text
- entity_detector.py: OSError for file read in scan loop
- convo_miner.py: (OSError, ValueError) for normalize which reads
  and parses files
- entity_registry.py: (URLError, OSError, JSONDecodeError, KeyError)
  for Wikipedia lookup fallback

ChromaDB except Exception sites (~30) are left broad for now.
chromadb.errors defines NotFoundError, DuplicateIDError,
InvalidDimensionException etc., but narrowing those sites requires
importing from chromadb.errors and validating across supported
versions (>=0.4.0). MCP server handlers also left broad for
resilience.
2026-04-07 13:51:27 +02:00
Milla Jovovich 068dbd9a7b MemPalace: palace architecture, AAAK compression, knowledge graph
The memory system:
- Palace structure: Wings (people/projects) → Rooms (topics) → Closets (AAAK compressed) → Drawers (verbatim transcripts)
- Halls connect related rooms within a wing
- Tunnels cross-reference rooms across wings
- AAAK: 30x lossless compression dialect for AI agents
- Knowledge graph: temporal entity-relationship triples (SQLite)
- Palace graph: room-based navigation with tunnel detection
- MCP server: 19 tools — search, graph traversal, agent diary, AAAK auto-teach
- Onboarding: guided setup generates wing config + AAAK entity registry
- Contradiction detection: catches wrong pronouns, names, ages
- Auto-save hooks for Claude Code

96.6% Recall@5 on LongMemEval — highest zero-API score published.
100% with optional Haiku rerank (500/500).
Local. Free. No API key required.
2026-04-04 18:16:04 -07:00