Harden sweeper for production: verbatim tool blocks, full session_id, logged failures

Four changes on top of the proposal's initial sweeper draft, driven by
the CLAUDE.md design principles:

1. Drop the 500-char truncation on tool_use / tool_result content in
   _flatten_content. The "verbatim always" principle forbids lossy
   compression of user-adjacent data; a long code-edit diff handed to
   the assistant must round-trip intact. Unknown block types now also
   serialize their full payload instead of just a type marker. New test
   test_parse_preserves_tool_blocks_verbatim covers a 5000-char input.

2. Use the full session_id in drawer IDs (not session_id[:12]). Rules
   out cross-session collisions if a transcript source ever uses
   non-UUID session identifiers or shared prefixes.

3. Replace silent `except Exception: return None` in get_palace_cursor
   with a logger.warning — the exact anti-pattern this PR otherwise
   criticizes in miner.py. The fallback behavior is still safe
   (deterministic IDs make a missed cursor recover on the next run),
   but the failure is now discoverable.

4. sweep_directory now collects per-file failures into the result dict
   and the CLI exits non-zero when any file failed, so a partial-sweep
   outcome is visible rather than swallowed.

Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
This commit is contained in:
Igor Lins e Silva
2026-04-18 12:58:33 -03:00
parent fed69935d3
commit 29ce7c7135
4 changed files with 174 additions and 70 deletions
+2
View File
@@ -108,9 +108,11 @@ class TestJsonlNotSilentlySkipped:
def fake_stat(self, *args, **kwargs):
result = real_stat(self, *args, **kwargs)
if self.name == "big_transcript.jsonl":
class _FakeStat:
st_size = fake_size
st_mode = result.st_mode
return _FakeStat()
return result