fix: prevent HNSW index bloat from duplicate add() calls (#525)

Root cause: convo_miner.py used collection.add() instead of upsert(),
so repeated mine runs pushed duplicate entries into the HNSW graph.
At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes
and the process eventually segfaults.
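
The effect can be illustrated without ChromaDB. The sketch below is a
minimal stand-in that models the behaviour described above (append on
add(), replace on upsert()); it does not reproduce ChromaDB internals.
ToyStore, make_drawer_id, and mine are hypothetical names. Only the ID
derivation follows the f-string in convo_miner.py.

```python
import hashlib


def make_drawer_id(wing: str, room: str, source_file: str, chunk_index: int) -> str:
    """Deterministic ID, following the f-string used in convo_miner.py."""
    digest = hashlib.sha256((source_file + str(chunk_index)).encode()).hexdigest()[:24]
    return f"drawer_{wing}_{room}_{digest}"


class ToyStore:
    """Hypothetical stand-in for a collection; NOT ChromaDB's implementation."""

    def __init__(self) -> None:
        self.entries = []  # (id, document) pairs in insertion order

    def add(self, doc_id: str, document: str) -> None:
        # Models the failure mode described in this commit: every call appends.
        self.entries.append((doc_id, document))

    def upsert(self, doc_id: str, document: str) -> None:
        # Replace the entry with a matching ID, or append if the ID is new.
        for i, (existing_id, _) in enumerate(self.entries):
            if existing_id == doc_id:
                self.entries[i] = (doc_id, document)
                return
        self.entries.append((doc_id, document))


def mine(store: ToyStore, method: str, runs: int) -> int:
    """Simulate `runs` mine passes over the same two chunks; return entry count."""
    chunks = ["hello", "world"]
    for _ in range(runs):
        for index, content in enumerate(chunks):
            doc_id = make_drawer_id("w", "r", "chat.json", index)
            getattr(store, method)(doc_id, content)
    return len(store.entries)
```

Because the ID depends only on the source file and chunk index, repeated
runs produce the same IDs, so the entry count grows with every run under
add() and stays constant under upsert().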

Changes:
- convo_miner.py: add() → upsert() (the one-line root cause fix)
- repair.py: new module — scan for corrupt IDs, prune them, or rebuild
  the HNSW index from scratch. Backs up only chroma.sqlite3 (not the
  bloated HNSW files). Recreates collection with hnsw:space=cosine.
- dedup.py: new module — detect and remove near-duplicate drawers from
  the same source file using cosine similarity. No API calls.
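
The near-duplicate check in dedup.py can be pictured roughly as follows.
This is a sketch under stated assumptions, not the module's actual code:
the function names, the (id, source_file, embedding) tuple layout, the
greedy keep-first strategy, and the 0.95 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(drawers, threshold=0.95):
    """Return IDs to drop, keeping the first drawer of each near-duplicate group.

    `drawers` is an iterable of (drawer_id, source_file, embedding) tuples
    (an assumed layout). Drawers are only compared within one source file.
    """
    kept_by_source = {}  # source_file -> embeddings already kept
    to_remove = []
    for drawer_id, source_file, embedding in drawers:
        kept = kept_by_source.setdefault(source_file, [])
        if any(cosine_similarity(embedding, other) >= threshold for other in kept):
            to_remove.append(drawer_id)
        else:
            kept.append(embedding)
    return to_remove
```

Comparing stored embeddings locally is what makes a "no API calls"
design possible: nothing has to be re-embedded to find the duplicates.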

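The backup step in repair.py might look something like the sketch below.
It assumes the store directory holds chroma.sqlite3 at its top level
with the HNSW files in subdirectories; backup_sqlite_only and both path
parameters are hypothetical names, not the module's real interface.

```python
import shutil
from pathlib import Path


def backup_sqlite_only(store_dir: Path, backup_dir: Path) -> Path:
    """Copy chroma.sqlite3 into backup_dir, leaving the HNSW files behind."""
    source = store_dir / "chroma.sqlite3"
    if not source.is_file():
        raise FileNotFoundError(source)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Skipping the HNSW files keeps the backup small even when link_lists.bin
has already grown far beyond the size of the SQLite database.
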
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MSL
2026-04-10 08:14:22 -07:00
parent a036b4300d
commit 71e8f2d054
3 changed files with 527 additions and 1 deletions
+1 -1
@@ -334,7 +334,7 @@ def mine_convos(
 room_counts[chunk_room] += 1
 drawer_id = f"drawer_{wing}_{chunk_room}_{hashlib.sha256((source_file + str(chunk['chunk_index'])).encode()).hexdigest()[:24]}"
 try:
-    collection.add(
+    collection.upsert(
         documents=[chunk["content"]],
         ids=[drawer_id],
         metadatas=[