fix: call quarantine_stale_hnsw() in make_client(); lower threshold to 5min
make_client() called _fix_blob_seq_ids but skipped quarantine_stale_hnsw, so every fresh process (stop hook, precompact hook, CLI) opened a drifted palace and segfaulted in chromadb_rust_bindings before any write-path protection could fire.

The change in #1062 wires the quarantine call at MCP server startup, which covers long-lived server processes. This fix adds it to make_client() itself, the call site that all callers (server, hooks, CLI, tests) pass through, so every fresh PersistentClient open is protected regardless of entry point.

Also lowers the stale_seconds default from 3600 to 300: a segment with 0.96h of drift (about 58 minutes) caused production segfaults without ever crossing the 1h threshold. ChromaDB's HNSW flush cadence means legitimate drift is seconds to low minutes; 5min gives headroom without admitting clearly corrupt segments.
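For readers unfamiliar with the heuristic, here is a minimal sketch of a staleness check of this kind. It is not the code in this commit: the function name `sketch_quarantine_stale`, the use of file mtimes, the treatment of every subdirectory as a segment dir, and the `.drift-<timestamp>` rename suffix are all assumptions made for illustration.

```python
import os
import time
from pathlib import Path

# Hypothetical marker appended to quarantined directories (an assumption,
# not taken from the real implementation).
QUARANTINE_MARKER = ".drift-"


def sketch_quarantine_stale(palace_path: str, stale_seconds: float = 300.0) -> list[str]:
    """Rename subdirectories whose newest file lags chroma.sqlite3 by more
    than ``stale_seconds``. Returns the new paths of the renamed directories.

    Assumes "staleness" means a difference in file modification times.
    """
    root = Path(palace_path)
    db = root / "chroma.sqlite3"
    if not db.is_file():
        return []
    db_mtime = db.stat().st_mtime

    renamed: list[str] = []
    # sorted() materialises the listing before any rename happens.
    for entry in sorted(root.iterdir()):
        # Skip plain files and directories that were already quarantined.
        if not entry.is_dir() or QUARANTINE_MARKER in entry.name:
            continue
        files = [f for f in entry.iterdir() if f.is_file()]
        if not files:
            continue
        newest = max(f.stat().st_mtime for f in files)
        if db_mtime - newest > stale_seconds:
            target = f"{entry}{QUARANTINE_MARKER}{int(time.time())}"
            os.rename(entry, target)  # rename, never delete, so recovery stays possible
            renamed.append(target)
    return renamed
```

Under these assumptions, a segment lagging by 0.96h (3456 s) passes a 3600 s threshold untouched but is quarantined at 300 s, which is the gap this commit closes.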
@@ -49,7 +49,7 @@ def _validate_where(where: Optional[dict]) -> None:
                 stack.extend(x for x in v if isinstance(x, dict))
 
 
-def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> list[str]:
+def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 300.0) -> list[str]:
     """Rename HNSW segment dirs whose files are stale vs. chroma.sqlite3.
 
     When a ChromaDB 1.5.x PersistentClient opens a palace whose on-disk
@@ -73,10 +73,12 @@ def quarantine_stale_hnsw(palace_path: str, stale_seconds: float = 3600.0) -> li
     original directory is renamed, not deleted, so recovery remains
     possible if the heuristic misfires.
 
-    The default threshold (1h) is deliberately conservative — ChromaDB's
-    HNSW flush cadence means legitimate drift is normally on the order of
-    seconds to minutes. A segment that is more than an hour out of date is
-    almost certainly in a "crashed mid-write" state.
+    The default threshold (5 min) is based on ChromaDB's HNSW flush
+    cadence — legitimate drift is normally on the order of seconds to
+    minutes. A segment more than 5 minutes out of date is almost certainly
+    in a "crashed mid-write" or "concurrent-write corrupted" state. The
+    previous 1h threshold was too conservative: 0.96h drift was observed
+    causing segfaults in production.
 
     Args:
         palace_path: path to the palace directory containing ``chroma.sqlite3``
@@ -544,6 +546,7 @@ class ChromaBackend(BaseBackend):
         :meth:`get_collection` which manages caching internally.
         """
         _fix_blob_seq_ids(palace_path)
+        quarantine_stale_hnsw(palace_path)
         return chromadb.PersistentClient(path=palace_path)
 
     @staticmethod