* fix: add provenance header and speaker IDs to Slack transcript imports
Slack exports are multi-party chats where no speaker is inherently
the "user" or "assistant". The parser previously assigned these roles
purely by position, allowing a crafted export to place attacker text
in the "user" role — making it appear as the memory owner's words
in all future retrieval (data poisoning via stored memory).
Changes:
- Add provenance header marking Slack transcripts as multi-party
with positional (unverified) role assignment
- Prefix each message with the original speaker ID ([U1], [U2], etc.)
so downstream consumers can distinguish authors
- Keep user/assistant role alternation for exchange-pair chunking
compatibility with convo_miner.py
Tests:
- Provenance header presence and content
- Speaker ID preservation in output
- Attacker-first-message attribution verification
Refs: MemPalace/mempalace#809
* fix: move Slack provenance to footer, sanitize speaker IDs, extract constant
- Move provenance notice from header to footer to prevent it becoming
a standalone ChromaDB drawer via paragraph chunking on exports
with fewer than 3 exchange pairs (violates verbatim-always principle)
- Sanitize speaker user_id/username: strip brackets, newlines, and
control characters to prevent chunk-boundary injection via crafted
Slack exports
- Extract header string to _SLACK_PROVENANCE_FOOTER module constant,
consistent with _TOOL_RESULT_* constants pattern; tests import it
instead of duplicating the literal
Refs: MemPalace/mempalace#809
The initial strip_noise() regressed on three fronts when audited against
adversarial user content — each verified with executable repros against
the cherry-picked code:
1. `<tag>.*?</tag>` with re.DOTALL span-ate across messages: one
stray unclosed <system-reminder> anywhere in a session merged with
the next closing tag, silently deleting everything between them
(including full assistant replies).
2. `.*\(ctrl\+o to expand\).*\n?` nuked entire lines of user prose
whenever a user happened to document the TUI shortcut.
3. `Ran \d+ (?:stop|pre|post)\s*hook.*` with IGNORECASE ate the
second sentence from "our CI has a stop hook ... Ran 2 stop hooks
last week" — legitimate user commentary.
These are unambiguous violations of the project's "Verbatim always"
design principle.
Fixes:
- All tag patterns are now line-anchored (`(?m)^(?:> )?<tag>`) and their
body forbids crossing a blank line (`(?:(?!\n\s*\n)[\s\S])*?`), so a
dangling open tag cannot eat neighboring messages.
- `_NOISE_LINE_PREFIXES` are line-anchored and case-sensitive — user
prose mentioning "CURRENT TIME:" mid-sentence is preserved.
- Hook-run chrome requires `(?m)^`, explicit hook names (Stop,
PreCompact, PreToolUse, etc.), and no IGNORECASE.
- "… +N lines" is line-anchored.
- "(ctrl+o to expand)" only matches Claude Code's actual collapsed-
output chrome shape `[N tokens] (ctrl+o to expand)`; a bare
parenthetical in user prose stays intact.
Scope:
- `strip_noise()` is no longer called on every normalization path.
Only `_try_claude_code_jsonl` invokes it, per-extracted-message — so
Claude.ai exports, ChatGPT exports, Slack JSON, Codex JSONL, and
plain text with `>` markers pass through fully verbatim. Per-message
application also makes span-eating structurally impossible.
Tests:
- 15 new tests in test_normalize.py pin the boundary: 6 guard user
content that must survive (each of the adversarial repros), 9 assert
real system chrome is still stripped. All pass; full suite 702 pass
(2 failures are the unrelated pre-existing version.py bug, cleared
by #820).
Known limitation (not fixed here): convo_miner.py does not delete
drawers on re-mine, so transcripts mined before this PR keep noise-
filled drawers until the user manually erases + re-mines. Proper fix
needs a schema-version field on drawer metadata + re-mine trigger —
out of scope for this PR.
normalize.py now strips before filing:
- <system-reminder>, <command-message>, <command-name> tags
- <task-notification>, <user-prompt-submit-hook>, <hook_output> tags
- Hook status messages (CURRENT TIME, Checking verified facts, etc.)
- Claude Code UI chrome (ctrl+o to expand, progress bars, etc.)
- Collapsed runs of blank lines
This noise was going straight into drawers, wasting storage space
and polluting search results. strip_noise() runs on all normalized
output regardless of input format (JSONL, JSON, plain text).
689/689 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: parse Claude.ai privacy export with messages key and sender field (#677)
The privacy-export branch in _try_claude_ai_json only checked for the
"chat_messages" key, missing exports that use "messages" instead. It
also only read the "role" field while real privacy exports use "sender".
Both gaps caused the file to fall through to plain-text, producing a
single giant drawer.
Changes:
- Accept "messages" alongside "chat_messages" in the conversation-object
guard and inner extraction.
- Accept "sender" alongside "role" as the author field.
- Fall back to a top-level "text" key when content blocks are empty.
- Produce one transcript per conversation instead of concatenating all
conversations into a single blob.
- Extract shared logic into _collect_claude_messages helper.
- Add 6 regression tests covering each variant.
* style: apply ruff format to normalize.py
* fix: guard against null text field in Claude.ai export parsing
item.get("text", "").strip() crashes when "text" is explicitly null
in the JSON (legal and observed in some exports). Use
(item.get("text") or "").strip() and add a regression test.
---------
Co-authored-by: Igor Lins e Silva <4753812+igorls@users.noreply.github.com>
Three critical bugfixes:
1. MCP server hangs on null arguments (#394) — `params.get("arguments", {})`
returns None when JSON has `"arguments": null`. Changed to `or {}`.
2. cmd_repair infinite recursion (#395) — trailing slash on palace_path
caused backup_path to be inside the source dir. Strip trailing sep.
3. OOM on large transcript files (#396) — split_mega_files.py and
normalize.py load entire files into memory. Added 500MB safety limit
with clear skip/error messages.
Closes#394, #395, #396.
- Add `mempalace repair` command to rebuild vector index from SQLite
when HNSW files are corrupted after crash/interrupt (fixes#74, #72, #96)
- Fix split command passing dir as positional instead of --source
flag to split_mega_files (fixes#63)
- Handle Claude privacy export format (array of conversation objects
with chat_messages inside each) in normalize.py (fixes#63)
- Persist room keywords in mempalace.yaml so mine can match files
in docs/ to room "documentation" (fixes#108)
- hooks/mempal_save_hook.sh: pass $TRANSCRIPT_PATH as sys.argv
instead of interpolating into python -c string (fixes#110)
- normalize.py: accept type "user" in addition to "human" for
Claude Code JSONL sessions (fixes#111)
- convo_miner.py: skip tool-results/, memory/ dirs and .meta.json
files when scanning for conversations (fixes#111)
- pyproject.toml: pin chromadb>=0.4.0,<1 to avoid crashing 1.x
builds on macOS ARM64 (fixes#100)
Add _try_codex_jsonl parser for Codex CLI session files stored at
~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl.
Uses only event_msg entries (user_message / agent_message) which
represent the canonical conversation turns. response_item entries
are intentionally skipped — they include synthetic context injections
(environment_context) and can duplicate real messages when both
representations are present in the same rollout.
Format based on Codex source tests (codex-rs/rollout/src/recorder_tests.rs).
Requires session_meta header to reduce false positives on other JSONL.
Refs: #59
Replace broad except Exception with specific exception types in 6
sites where the expected failure mode is well-defined:
- normalize.py: OSError for file read, ImportError for optional import
- miner.py: OSError for file read_text
- entity_detector.py: OSError for file read in scan loop
- convo_miner.py: (OSError, ValueError) for normalize which reads
and parses files
- entity_registry.py: (URLError, OSError, JSONDecodeError, KeyError)
for Wikipedia lookup fallback
ChromaDB except Exception sites (~30) are left broad for now.
chromadb.errors defines NotFoundError, DuplicateIDError,
InvalidDimensionException etc., but narrowing those sites requires
importing from chromadb.errors and validating across supported
versions (>=0.4.0). MCP server handlers also left broad for
resilience.