Harden sweeper for production: verbatim tool blocks, full session_id, logged failures
Four changes on top of the proposal's initial sweeper draft, driven by the CLAUDE.md design principles: 1. Drop the 500-char truncation on tool_use / tool_result content in _flatten_content. The "verbatim always" principle forbids lossy compression of user-adjacent data; a long code-edit diff handed to the assistant must round-trip intact. Unknown block types now also serialize their full payload instead of just a type marker. New test test_parse_preserves_tool_blocks_verbatim covers a 5000-char input. 2. Use the full session_id in drawer IDs (not session_id[:12]). Rules out cross-session collisions if a transcript source ever uses non-UUID session identifiers or shared prefixes. 3. Replace silent `except Exception: return None` in get_palace_cursor with a logger.warning — the exact anti-pattern this PR otherwise criticizes in miner.py. The fallback behavior is still safe (deterministic IDs make a missed cursor recover on the next run), but the failure is now discoverable. 4. sweep_directory now collects per-file failures into the result dict and the CLI exits non-zero when any file failed, so a partial-sweep outcome is visible rather than swallowed. Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
This commit is contained in:
@@ -108,9 +108,11 @@ class TestJsonlNotSilentlySkipped:
|
||||
def fake_stat(self, *args, **kwargs):
|
||||
result = real_stat(self, *args, **kwargs)
|
||||
if self.name == "big_transcript.jsonl":
|
||||
|
||||
class _FakeStat:
|
||||
st_size = fake_size
|
||||
st_mode = result.st_mode
|
||||
|
||||
return _FakeStat()
|
||||
return result
|
||||
|
||||
|
||||
Reference in New Issue
Block a user