fix(hnsw): address @igorls's #1173 review

Restore three `_pin_hnsw_threads` tests that the previous integrity-gate
commit deleted. The function is still live code on develop (defined at
chroma.py:207, called from chroma.py:705 + mcp_server.py), so the
deletion left the import unused (ruff F401) and dropped coverage on a
function unrelated to this PR's scope. Restored verbatim from main.

Plus three nits @igorls flagged:

- **Thread-safety doc**: `_quarantined_paths` mutation is lock-free;
  documented that idempotency of `quarantine_stale_hnsw` is the safety
  property (concurrent same-palace calls produce a benign redundant
  rename attempt that no-ops, no need for a lock).
- **Pickle protocol assumption**: `_segment_appears_healthy` requires
  PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today,
  and a future protocol-0/1 emission would conservatively quarantine
  + lazy-rebuild rather than mis-classify as healthy.
- Class-level vs module-level scope: keeping class-level — the
  conftest reset is the controlled case, and module-level wouldn't
  remove the foot-gun, just relocate it. Conftest reset documented in
  the existing comment is the right pattern for test isolation.

Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`)
deferred — that pattern lives in #1177's territory, not #1173's.

37/37 tests pass on the PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
jp
2026-04-26 13:20:14 -07:00
parent 74ff5e6b98
commit 942aae3ed5
2 changed files with 63 additions and 0 deletions
+14
View File
@@ -72,6 +72,12 @@ def _segment_appears_healthy(seg_dir: str) -> bool:
can execute arbitrary code, and the byte-sniff is sufficient to
distinguish a complete write from truncation, zero-fill, or
partial-flush corruption.
Assumes pickle protocol >= 2 (``0x80`` PROTO marker). Matches what
chromadb writes today; if a future chromadb version emits protocol
0/1 segments, this check would start returning False on healthy
files and quarantine_stale_hnsw would conservatively rename them
out of the way (lazy rebuild on next open recovers).
"""
meta_path = os.path.join(seg_dir, "index_metadata.pickle")
if not os.path.isfile(meta_path):
@@ -621,6 +627,14 @@ class ChromaBackend(BaseBackend):
# Real runtime drift is still handled — palace-daemon's ``_auto_repair``
# calls :func:`quarantine_stale_hnsw` directly on observed HNSW errors,
# which bypasses this gate.
#
# Thread-safety: this set is mutated without a lock. Two concurrent
# ``make_client()`` calls for the same palace can both pass the
# membership check and both invoke ``quarantine_stale_hnsw``. That's
# safe because the function is idempotent (mtime check + timestamped
# rename of distinct directories), so the worst-case race produces
# one redundant rename attempt that no-ops. Idempotency is the
# safety property; locking would add cost without correctness gain.
_quarantined_paths: set[str] = set()
@staticmethod