Files
mempalace/tests
Pim Messelink 4a0f330cc1 fix(repair): scale HNSW divergence floor with hnsw:sync_threshold
The capacity probe added in #1227 hardcoded a 2,000-row floor for the
"diverged" decision. The comment justifying that number explicitly tied
it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization
windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling".

#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without
updating the floor. Result: any palace created with the bloat guard
flips between OK and DIVERGED on every flush cycle. Steady-state
divergence sits at 0–50K (the natural queue depth), and the 2,000 floor
trips the guardrail the moment the queue exceeds 10% of sqlite_count.
The MCP server then routes search to BM25-only and disables duplicate
detection for ~80% of the write cycle on actively-mined ≥100K palaces,
even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from
`collection_metadata` per palace and scales the floor to 2 × that value.
The 10% relative term and the original #1222 detection capability are
unchanged — a 91%-missing-of-192K palace (the actual #1222 reproducer)
still trips, regardless of whether the collection was created with
sync_threshold=1000 or 50000.

Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace)     | 2000      | 2000 (unchanged)
| 1000 (chromadb default)     | 2000      | 2000 (unchanged)
| 50000 (#1191 bloat guard)   | 100000    | 2000 (the bug)

Tests:
- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression
  for the #1191/#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (#1222
  shape with bloat-guard collection — still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy
  palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch
  scales too — 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests
in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:31:47 +00:00
..
2026-04-11 16:16:49 -07:00