mempalace

Author	SHA1	Message	Date
Mika Cohen	0e32b9643c	fix: avoid false hnsw divergence fallback	2026-05-01 12:42:40 -06:00
Pim Messelink	4a0f330cc1	fix(repair): scale HNSW divergence floor with hnsw:sync_threshold

The capacity probe added in #1227 hardcoded a 2,000-row floor for the "diverged" decision. The comment justifying that number explicitly tied it to chromadb's default sync_threshold of 1,000: "Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling". #1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without updating the floor.

Result: any palace created with the bloat guard flips between OK and DIVERGED on every flush cycle. Steady-state divergence sits at 0–50K (the natural queue depth), and the 2,000 floor trips the guardrail the moment the queue exceeds 10% of sqlite_count. The MCP server then routes search to BM25-only and disables duplicate detection for ~80% of the write cycle on actively-mined ≥100K palaces, even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from `collection_metadata` per palace and scales the floor to 2 × that value. The 10% relative term and the original #1222 detection capability are unchanged: a 91%-missing-of-192K palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000.

Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace) | 2000 | 2000 (unchanged) |
| 1000 (chromadb default) | 2000 | 2000 (unchanged) |
| 50000 (#1191 bloat guard) | 100000 | 2000 (the bug) |

Tests:
- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression for the #1191/#1227 conflict: 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (#1222 shape with bloat-guard collection; still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch scales too; 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:31:47 +00:00
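
The floor arithmetic described in this commit can be sketched as below. This is a minimal illustration, not the repository's code: the function names `divergence_floor` and `is_diverged` are hypothetical, and combining the absolute floor with the 10% relative term via `max` is an assumption inferred from the message (it reproduces the 10%-of-sqlite_count trip point under the old floor and the outcomes of the four listed tests).

```python
from typing import Optional

DEFAULT_SYNC_THRESHOLD = 1000  # chromadb default cited in the commit message
RELATIVE_TERM = 0.10           # the 10% relative term


def divergence_floor(sync_threshold: Optional[int]) -> int:
    """Absolute floor: two sync windows. Legacy palaces with no metadata get 2 * 1000."""
    if sync_threshold is None:
        sync_threshold = DEFAULT_SYNC_THRESHOLD
    return 2 * sync_threshold


def is_diverged(sqlite_count: int, hnsw_count: int,
                sync_threshold: Optional[int] = None) -> bool:
    """Assumed rule: flag only when the gap exceeds both the floor and 10% of sqlite_count."""
    gap = sqlite_count - hnsw_count
    limit = max(divergence_floor(sync_threshold), RELATIVE_TERM * sqlite_count)
    return gap > limit


# 100K sqlite rows, 50K in HNSW, bloat-guard collection: normal flush lag.
print(is_diverged(100_000, 50_000, sync_threshold=50_000))
# Same counts judged against the old fixed 2,000 floor: the false positive.
print(is_diverged(100_000, 50_000, sync_threshold=1_000))
```

Under these assumptions the #1222 shape (192K rows with 91% missing from HNSW) leaves a gap of 174,720, which exceeds the scaled floor of 100,000 and is still flagged.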
Igor Lins e Silva	57ac669dbc	fix(repair): address Copilot review on #1227

Five Copilot review issues + the Python 3.9 CI failure rolled into one follow-up:

* Replace ``dict | None`` annotated assignment with a type-comment so module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red).
* Drop ``mempalace repair rebuild``: the CLI only ships ``mempalace repair`` (rebuild) and ``mempalace repair-status``. Updated all user-facing messages, docstrings, and test assertions.
* Replace ``_get_client()`` in ``tool_search`` with the safe ``_refresh_vector_disabled_flag`` probe so the fallback isn't defeated by the very chromadb client load it's trying to avoid.
* Short-circuit ``tool_status`` to a pure-sqlite reader (``_tool_status_via_sqlite``) when divergence is detected so wing / room counts come back without ever opening the persistent client.
* Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an ``id``-ordered fallback so legacy schemas missing ``created_at`` don't break BM25 search.

New test covers the sqlite-status short-circuit. 1409 tests pass.	2026-04-26 21:53:56 -03:00
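
The ``id``-ordered fallback in the last bullet can be illustrated with a small sketch. The table name `docs`, its columns, and the helper `recent_rows` are hypothetical and do not reflect the project's actual schema. The only assumption carried over from the commit is that legacy databases lack a `created_at` column, in which case SQLite raises `sqlite3.OperationalError` when the statement is prepared.

```python
import sqlite3


def recent_rows(conn: sqlite3.Connection, limit: int = 100) -> list:
    """Return the newest rows, preferring created_at and falling back to id order."""
    try:
        cur = conn.execute(
            "SELECT id, body FROM docs ORDER BY created_at DESC LIMIT ?", (limit,)
        )
    except sqlite3.OperationalError:
        # Legacy schema without created_at: approximate recency by rowid order.
        cur = conn.execute(
            "SELECT id, body FROM docs ORDER BY id DESC LIMIT ?", (limit,)
        )
    return cur.fetchall()


# Legacy-shaped table (no created_at column) still yields a recency window.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
legacy.executemany("INSERT INTO docs (body) VALUES (?)", [("a",), ("b",), ("c",)])
print(recent_rows(legacy, limit=2))
```

Catching the error at query time keeps the reader free of schema-introspection code, at the cost of treating any `OperationalError` as a missing column. A stricter variant could inspect `PRAGMA table_info` first.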
Igor Lins e Silva	0d349c3d86	fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222)

When chromadb's HNSW segment freezes at a stale max_elements while sqlite keeps accumulating embeddings, the next chromadb open segfaults the MCP server on every tool call. Adds a pure-filesystem capacity probe (zero chromadb interaction), a `mempalace repair-status` read-only health check, and a BM25-only sqlite fallback so the palace stays reachable even when vector search is unavailable.

* `hnsw_capacity_status` reads sqlite + index_metadata.pickle directly via a tight-allowlist unpickler (no hnswlib import, no segment load).
* MCP server runs the probe at startup and after every reconnect; sets `_vector_disabled` and routes search to the sqlite FTS5 + BM25 path.
* `tool_status` and `tool_reconnect` surface the fallback state.
* Threshold tuned for chromadb 1.5.x async-flush lag (2× sync_threshold).	2026-04-26 19:54:00 -03:00
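
The "tight-allowlist unpickler" mentioned above most likely follows the standard `pickle.Unpickler.find_class` override pattern. The sketch below shows that pattern only. The allowlist contents, the `safe_loads` helper, and the sample payload are illustrative assumptions, since the commit message does not show the real allowlist or the structure of `index_metadata.pickle`.

```python
import io
import pickle

# Hypothetical allowlist; the real probe's entries are not shown in the commit message.
_ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "set")}


class AllowlistUnpickler(pickle.Unpickler):
    """Refuse to resolve any global that is not explicitly allowlisted."""

    def find_class(self, module, name):
        if (module, name) in _ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is not allowed")


def safe_loads(data: bytes):
    """Unpickle bytes without importing or calling anything outside the allowlist."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


# Plain containers and scalars load normally (illustrative payload).
print(safe_loads(pickle.dumps({"max_elements": 1000})))

# A pickle that references a non-allowlisted global is rejected before it can run.
try:
    safe_loads(pickle.dumps(print))
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

Because `find_class` is the only hook through which a pickle stream can reach importable objects, a short allowlist lets a probe read metadata without loading hnswlib or executing arbitrary constructors.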

4 Commits