fix: call quarantine_stale_hnsw() in make_client(); lower threshold to 5min

make_client() called _fix_blob_seq_ids but skipped quarantine_stale_hnsw,
so every fresh process (stop hook, precompact hook, CLI) opened a drifted
palace and segfaulted in chromadb_rust_bindings before any write-path
protection could fire.

#1062 wires the quarantine call at MCP server startup (covers long-lived
server processes). This fix adds it to make_client() itself — the call
site that all callers (server, hooks, CLI, tests) pass through — so every
fresh PersistentClient open is protected regardless of entry point.
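
The reason for putting the call in the factory rather than at each entry
point can be sketched as follows. This is an illustration only:
guarded_factory, make_client_sketch and the recording lambdas are
hypothetical stand-ins, not code from this repository, and no real
ChromaDB client is opened.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def guarded_factory(
    guards: Iterable[Callable[[str], object]],
    opener: Callable[[str], T],
) -> Callable[[str], T]:
    """Return a factory that runs every guard on the path before opening it."""
    guard_list = list(guards)

    def make(path: str) -> T:
        for guard in guard_list:
            guard(path)  # repair or quarantine on disk before anything is opened
        return opener(path)

    return make


# Hypothetical stand-ins that only record the order of calls.
calls: list[str] = []
make_client_sketch = guarded_factory(
    guards=[
        lambda p: calls.append("fix_blob_seq_ids"),
        lambda p: calls.append("quarantine_stale_hnsw"),
    ],
    opener=lambda p: calls.append("open") or f"client:{p}",
)
```

Any caller that goes through make_client_sketch gets both guards before
the open, which is the property this fix relies on for hooks and the CLI.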

Also lowers the stale_seconds default from 3600 to 300: a segment that
had drifted 0.96h (about 58 min) caused production segfaults while still
under the 1h threshold, so it was never quarantined. ChromaDB's HNSW
flush cadence means legitimate drift is seconds to low minutes; 5min
gives headroom without admitting clearly corrupt segments.
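
A minimal sketch of the kind of mtime comparison the threshold applies
to, under stated assumptions. The body of quarantine_stale_hnsw is not
shown in this commit, so everything here is hypothetical: the function
name sketch_quarantine_stale, treating every subdirectory as a segment,
using the newest file mtime as the segment's age, and the ".stale-"
rename suffix.

```python
import os
import time
from pathlib import Path


def sketch_quarantine_stale(palace_path: str, stale_seconds: float = 300.0) -> list[str]:
    """Rename subdirectories whose newest file lags chroma.sqlite3 by more
    than stale_seconds. Illustrative only; not the project's implementation."""
    root = Path(palace_path)
    db = root / "chroma.sqlite3"
    if not db.is_file():
        return []
    db_mtime = db.stat().st_mtime
    moved: list[str] = []
    for seg in sorted(p for p in root.iterdir() if p.is_dir()):
        if ".stale-" in seg.name:
            continue  # already quarantined by an earlier run (assumed suffix)
        mtimes = [f.stat().st_mtime for f in seg.iterdir() if f.is_file()]
        if not mtimes:
            continue
        drift = db_mtime - max(mtimes)  # seconds the segment lags the database
        if drift > stale_seconds:
            target = seg.with_name(f"{seg.name}.stale-{int(time.time())}")
            seg.rename(target)  # rename, never delete, so recovery stays possible
            moved.append(str(target))
    return moved
```

With a drift of 0.96h (3456 s), the comparison is false against 3600 and
true against 300, which is why the old default let that segment through.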
jp
2026-04-24 09:07:46 -07:00
parent 6890948e09
commit db80f6e26c
+8 -5
@@ -49,7 +49,7 @@ def _validate_where(where: Optional[dict]) -> None:
                 stack.extend(x for x in v if isinstance(x, dict))
 
 
-def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> list[str]:
+def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 300.0) -> list[str]:
     """Rename HNSW segment dirs whose files are stale vs. chroma.sqlite3.
 
     When a ChromaDB 1.5.x PersistentClient opens a palace whose on-disk
@@ -73,10 +73,12 @@ def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> li
     original directory is renamed, not deleted, so recovery remains
     possible if the heuristic misfires.
 
-    The default threshold (1h) is deliberately conservative — ChromaDB's
-    HNSW flush cadence means legitimate drift is normally on the order of
-    seconds to minutes. A segment that is more than an hour out of date is
-    almost certainly in a "crashed mid-write" state.
+    The default threshold (5 min) is based on ChromaDB's HNSW flush
+    cadence: legitimate drift is normally on the order of seconds to
+    minutes. A segment more than 5 minutes out of date is almost certainly
+    in a "crashed mid-write" or "concurrent-write corrupted" state. The
+    previous 1h threshold was too conservative: 0.96h drift was observed
+    causing segfaults in production.
 
     Args:
         palace_path: path to the palace directory containing ``chroma.sqlite3``
@@ -544,6 +546,7 @@ class ChromaBackend(BaseBackend):
         :meth:`get_collection` which manages caching internally.
         """
         _fix_blob_seq_ids(palace_path)
+        quarantine_stale_hnsw(palace_path)
         return chromadb.PersistentClient(path=palace_path)
 
     @staticmethod