fix: prevent HNSW index bloat from duplicate add() calls (#525)

Root cause: convo_miner.py used collection.add() instead of upsert(), so repeated mine runs pushed duplicate entries into the HNSW graph. At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes and eventually leads to a segfault.

Changes:
- convo_miner.py: add() → upsert() (the one-line root cause fix)
- repair.py: new module that scans for corrupt IDs, prunes them, or rebuilds the HNSW index from scratch. Backs up only chroma.sqlite3 (not the bloated HNSW files). Recreates the collection with hnsw:space=cosine.
- dedup.py: new module that detects and removes near-duplicate drawers from the same source file using cosine similarity. No API calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
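Why the one-line change is enough can be sketched as follows. The drawer ID is derived only from the source file and chunk index, so a re-run produces the same IDs, and an upsert keyed on them overwrites rather than accumulates. This is a minimal illustration, not the project's code: `make_drawer_id`, `ToyCollection`, and `mine` are hypothetical names, and `ToyCollection` is an in-memory stand-in rather than the chromadb API.

```python
import hashlib


def make_drawer_id(wing, room, source_file, chunk_index):
    """Build a deterministic ID using the same hashing scheme as the diff."""
    digest = hashlib.sha256((source_file + str(chunk_index)).encode()).hexdigest()[:24]
    return f"drawer_{wing}_{room}_{digest}"


class ToyCollection:
    """Hypothetical in-memory stand-in for a vector collection (not chromadb)."""

    def __init__(self):
        self.docs = {}

    def upsert(self, ids, documents):
        # Insert new IDs and overwrite existing ones, so repeats never add entries.
        for drawer_id, document in zip(ids, documents):
            self.docs[drawer_id] = document


def mine(collection, wing, room, source_file, chunks):
    """Store each chunk under its deterministic ID (hypothetical helper)."""
    for chunk_index, content in enumerate(chunks):
        drawer_id = make_drawer_id(wing, room, source_file, chunk_index)
        collection.upsert(ids=[drawer_id], documents=[content])


collection = ToyCollection()
chunks = ["first chunk", "second chunk", "third chunk"]

# Mining the same file twice leaves exactly one entry per chunk.
mine(collection, "wing_a", "room_b", "chat.json", chunks)
mine(collection, "wing_a", "room_b", "chat.json", chunks)
entry_count = len(collection.docs)
```

Because the ID does not depend on the chunk content, a re-run over an edited file would replace the stored text for that chunk index instead of adding a second entry.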
2026-04-10 08:14:22 -07:00
parent a036b4300d
commit 71e8f2d054
3 changed files with 527 additions and 1 deletions
@@ -334,7 +334,7 @@ def mine_convos(
                room_counts[chunk_room] += 1
            drawer_id = f"drawer_{wing}_{chunk_room}_{hashlib.sha256((source_file + str(chunk['chunk_index'])).encode()).hexdigest()[:24]}"
            try:
-                collection.add(
+                collection.upsert(
                    documents=[chunk["content"]],
                    ids=[drawer_id],
                    metadatas=[