Merge pull request #1310 from potterdigital/fix/1308-rebuild-from-sqlite

fix(repair): add --mode from-sqlite to recover palaces with corrupt HNSW (#1308)
This commit is contained in:
Igor Lins e Silva
2026-05-07 07:51:42 -03:00
committed by GitHub
6 changed files with 937 additions and 4 deletions
+1
View File
@@ -36,3 +36,4 @@ venv/
# ChromaDB local data # ChromaDB local data
*.sqlite3-journal *.sqlite3-journal
.envrc
+1
View File
@@ -18,6 +18,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
- **CLI and `fact_checker --stdin` mojibaked non-ASCII content on Windows.** Python defaults `sys.stdin`/`stdout`/`stderr` to the system ANSI codepage (cp1252/cp1251/cp950), so `mempalace search > out.txt` and piped fact_checker invocations corrupted Cyrillic / CJK drawer text at the process boundary. New `mempalace/_stdio.py` helper reconfigures all three streams to UTF-8 on `sys.platform == "win32"`, with per-stream `errors` policy: `surrogateescape` on stdin (preserves bad bytes from redirected files for the consumer's parser), `replace` on stdout/stderr (substitutes U+FFFD instead of `UnicodeEncodeError`-ing mid-print). With this, all three user-facing console_scripts (`mcp_server`, `hooks_cli`, `cli`/`fact_checker`) now reconfigure identically on Windows. (#1282) - **CLI and `fact_checker --stdin` mojibaked non-ASCII content on Windows.** Python defaults `sys.stdin`/`stdout`/`stderr` to the system ANSI codepage (cp1252/cp1251/cp950), so `mempalace search > out.txt` and piped fact_checker invocations corrupted Cyrillic / CJK drawer text at the process boundary. New `mempalace/_stdio.py` helper reconfigures all three streams to UTF-8 on `sys.platform == "win32"`, with per-stream `errors` policy: `surrogateescape` on stdin (preserves bad bytes from redirected files for the consumer's parser), `replace` on stdout/stderr (substitutes U+FFFD instead of `UnicodeEncodeError`-ing mid-print). With this, all three user-facing console_scripts (`mcp_server`, `hooks_cli`, `cli`/`fact_checker`) now reconfigure identically on Windows. (#1282)
- **MCP knowledge-graph tools forwarded malformed date strings to SQLite.** `tool_kg_query` (`as_of`), `tool_kg_add` (`valid_from`), and `tool_kg_invalidate` (`ended`) accepted any string and produced empty result sets on natural-language inputs like `"March 2026"` or `"yesterday"` — callers (especially LLM agents) could not distinguish "no fact at this time" from "your date format was unrecognized." New `sanitize_iso_date()` validator in `config.py` accepts `YYYY`, `YYYY-MM`, `YYYY-MM-DD` (and passes through `None`/`""`); all three tools call it before values reach the storage layer. **Behavior change:** previously-silent date typos now raise a clear `ValueError` naming the offending field; full ISO-8601 with time (`YYYY-MM-DDTHH:MM:SS`, timezone offsets) is not yet accepted — file an issue if you have a use case. (#1164, #1167) - **MCP knowledge-graph tools forwarded malformed date strings to SQLite.** `tool_kg_query` (`as_of`), `tool_kg_add` (`valid_from`), and `tool_kg_invalidate` (`ended`) accepted any string and produced empty result sets on natural-language inputs like `"March 2026"` or `"yesterday"` — callers (especially LLM agents) could not distinguish "no fact at this time" from "your date format was unrecognized." New `sanitize_iso_date()` validator in `config.py` accepts `YYYY`, `YYYY-MM`, `YYYY-MM-DD` (and passes through `None`/`""`); all three tools call it before values reach the storage layer. **Behavior change:** previously-silent date typos now raise a clear `ValueError` naming the offending field; full ISO-8601 with time (`YYYY-MM-DDTHH:MM:SS`, timezone offsets) is not yet accepted — file an issue if you have a use case. (#1164, #1167)
- **MCP server's `_kg` was a module-level singleton.** Multi-tenant hosts that rotate `MEMPALACE_PALACE_PATH` between tool calls hit the wrong sqlite file, because the KG was constructed once at import time while the ChromaDB side was already per-call via `_get_client()`. The KG is now resolved per-call through a lazy per-path cache (`_kg_by_path` keyed by `os.path.abspath`, with a double-checked-locking init under `_kg_cache_lock`). `tool_reconnect` drains and `close()`s cached KGs alongside the existing chroma reconnect. A `_call_kg` retry guard catches `sqlite3.ProgrammingError` once after a reconnect race. (#1136, #1160) - **MCP server's `_kg` was a module-level singleton.** Multi-tenant hosts that rotate `MEMPALACE_PALACE_PATH` between tool calls hit the wrong sqlite file, because the KG was constructed once at import time while the ChromaDB side was already per-call via `_get_client()`. The KG is now resolved per-call through a lazy per-path cache (`_kg_by_path` keyed by `os.path.abspath`, with a double-checked-locking init under `_kg_cache_lock`). `tool_reconnect` drains and `close()`s cached KGs alongside the existing chroma reconnect. A `_call_kg` retry guard catches `sqlite3.ProgrammingError` once after a reconnect race. (#1136, #1160)
- **`mempalace repair` can now recover palaces whose HNSW segment writer is stuck on `apply_logs`.** Both the existing `--mode legacy` rebuild and the inline `cli.cmd_repair` path call `Collection.count()` as their first read — exactly the call that raises `chromadb.errors.InternalError: Failed to apply logs to the hnsw segment writer` on the corruption class introduced upstream and reported in #1308. Repair would print `Cannot recover — palace may need to be re-mined from source files` even though the underlying SQLite tables were fully intact (the corruption lives in the on-disk index files, not the data layer). New `--mode from-sqlite` reads `(id, document, metadata)` rows directly from `chroma.sqlite3` via a `segments``embeddings``embedding_metadata` join, never opens a chromadb client against the corrupt palace, and re-upserts everything into a fresh palace at `--palace`. `--source PATH` extracts from a corrupt palace already moved aside; `--archive-existing` handles the in-place case by renaming the existing palace to `<palace>.pre-rebuild-<timestamp>` before reading from it. Documents are re-embedded under the user's configured embedding function (the original HNSW vectors live in the corrupt `data_level0.bin` and cannot be recovered, but the embedding model is deterministic so search results remain semantically equivalent). Verified end-to-end on a 52,300-row real-world corrupt palace. (#1308)
--- ---
+73 -2
View File
@@ -673,6 +673,57 @@ def cmd_repair(args):
) )
return return
if getattr(args, "mode", "legacy") == "from-sqlite":
from .migrate import confirm_destructive_action
from .repair import RebuildPartialError, rebuild_from_sqlite
source_path = getattr(args, "source", None)
source_path = (
os.path.abspath(os.path.expanduser(source_path)) if source_path else palace_path
)
archive_existing = getattr(args, "archive_existing", False)
# Gate any path that touches the user's existing palace dir
# behind confirm_destructive_action. The legacy mode already
# gates; from-sqlite needs the same protection because:
# (a) --archive-existing renames the existing palace,
# (b) --source PATH writes into --palace dir which the user
# may not realize is also a palace.
# No prompt when source != dest AND dest does not exist (pure
# extract-into-fresh-dir case is non-destructive to existing
# palaces).
is_destructive_to_dest = source_path == palace_path or os.path.exists(palace_path)
if is_destructive_to_dest and not confirm_destructive_action(
"Rebuild from SQLite", palace_path, assume_yes=getattr(args, "yes", False)
):
return
try:
counts = rebuild_from_sqlite(
source_palace=source_path,
dest_palace=palace_path,
archive_existing_dest=archive_existing,
)
except RebuildPartialError as exc:
# The error itself was already printed by rebuild_from_sqlite
# with recovery instructions; surface a non-zero exit so
# scripts and CI gates see the failure.
print(
"\n Rebuild partial — see message above. "
f"Failed in collection: {exc.failed_collection}"
)
sys.exit(1)
# An empty counts dict is rebuild_from_sqlite's documented signal
# for a validation refusal (missing source, existing dest,
# in-place without --archive-existing). The library already
# printed an actionable message; exit non-zero so unattended
# scripts/CI distinguish "invalid inputs" from a successful
# rebuild that legitimately found zero rows (which still returns
# a populated dict with 0-valued counts).
if not counts:
sys.exit(1)
return
db_path = os.path.join(palace_path, "chroma.sqlite3") db_path = os.path.join(palace_path, "chroma.sqlite3")
if not os.path.isdir(palace_path): if not os.path.isdir(palace_path):
@@ -1213,11 +1264,31 @@ def main():
) )
p_repair.add_argument( p_repair.add_argument(
"--mode", "--mode",
choices=["legacy", "max-seq-id"], choices=["legacy", "max-seq-id", "from-sqlite"],
default="legacy", default="legacy",
help=( help=(
"legacy: full-palace rebuild (default). " "legacy: full-palace rebuild via the chromadb client (default). "
"max-seq-id: un-poison max_seq_id rows corrupted by the legacy 0.6.x shim. " "max-seq-id: un-poison max_seq_id rows corrupted by the legacy 0.6.x shim. "
"from-sqlite: rebuild by reading rows directly from chroma.sqlite3, "
"bypassing the chromadb client. Use when legacy mode bails because the "
"chromadb client cannot open the collection."
),
)
p_repair.add_argument(
"--source",
default=None,
help=(
"Source palace path for --mode from-sqlite (defaults to --palace). "
"Use when extracting from an archived corrupt palace into a new location."
),
)
p_repair.add_argument(
"--archive-existing",
action="store_true",
help=(
"For --mode from-sqlite when --source equals --palace: rename the "
"existing palace to <palace>.pre-rebuild-<timestamp> before "
"rebuilding so the corrupt copy is preserved."
), ),
) )
p_repair.add_argument( p_repair.add_argument(
+416 -1
View File
@@ -34,14 +34,56 @@ import os
import shutil import shutil
import sqlite3 import sqlite3
import time import time
from collections import defaultdict
from datetime import datetime from datetime import datetime
from typing import Optional from typing import Iterator, Optional
from .backends.chroma import ChromaBackend, hnsw_capacity_status from .backends.chroma import ChromaBackend, hnsw_capacity_status
COLLECTION_NAME = "mempalace_drawers" COLLECTION_NAME = "mempalace_drawers"
# The closets collection (AAAK index layer) is intentionally fixed —
# closets reference drawer IDs by string and live alongside drawers in the
# same palace; renaming the closets collection per-deployment would break
# cross-palace AAAK lookups. Drawer collection name comes from config
# (see ``_recoverable_collections``).
CLOSETS_COLLECTION_NAME = "mempalace_closets"
def _drawers_collection_name() -> str:
"""Resolve the drawers collection name from user config, falling back
to the module default ``COLLECTION_NAME`` if config is unreadable.
Recovery flows must honor ``MempalaceConfig().collection_name`` so a
user with a non-default drawer collection (e.g. multi-palace setups)
rebuilds the right rows. Closets remain fixed — see
``CLOSETS_COLLECTION_NAME``.
"""
try:
from .config import MempalaceConfig
return MempalaceConfig().collection_name or COLLECTION_NAME
except Exception:
return COLLECTION_NAME
def _recoverable_collections() -> tuple[str, ...]:
"""Collections rebuilt by ``rebuild_from_sqlite``, in upsert order.
Drawers first (bulk data), then closets (AAAK index layer that
references drawer IDs by string in their documents — no
foreign-key validation, so ordering is informational, not
load-bearing).
"""
return (_drawers_collection_name(), CLOSETS_COLLECTION_NAME)
# Back-compat alias for callers that imported the constant. New code
# should call ``_recoverable_collections()`` so config changes are picked
# up at call time.
RECOVERABLE_COLLECTIONS = (COLLECTION_NAME, CLOSETS_COLLECTION_NAME)
def _get_palace_path(): def _get_palace_path():
"""Resolve palace path from config.""" """Resolve palace path from config."""
@@ -436,6 +478,379 @@ def rebuild_index(palace_path=None, confirm_truncation_ok: bool = False):
print(f"\n{'=' * 55}\n") print(f"\n{'=' * 55}\n")
class RebuildPartialError(Exception):
"""Raised when ``rebuild_from_sqlite`` fails partway through upserts.
Carries enough state for the user (or CLI) to recover: the
per-collection counts that succeeded, the collection that failed,
the dest path holding the partial palace, and the archive path
(when an in-place rebuild had moved the original aside). Re-raises
the underlying chromadb error as ``__cause__``.
"""
def __init__(
self,
message: str,
*,
partial_counts: dict[str, int],
failed_collection: str,
dest_palace: str,
archive_path: Optional[str],
):
super().__init__(message)
self.message = message
self.partial_counts = partial_counts
self.failed_collection = failed_collection
self.dest_palace = dest_palace
self.archive_path = archive_path
def _rebuild_one_collection(
*,
backend: ChromaBackend,
source_palace: str,
dest_palace: str,
collection_name: str,
batch_size: int,
archive_path: Optional[str],
counts_so_far: dict[str, int],
) -> int:
"""Stream rows for one collection from SQLite and upsert into a
freshly-created collection at ``dest_palace``. Returns rows
upserted. Raises :class:`RebuildPartialError` (with the underlying
chromadb exception as ``__cause__``) on any upsert failure so the
caller can stop the loop and print recovery instructions instead of
silently shipping a partial palace.
"""
ids: list[str] = []
docs: list[str] = []
metas: list[dict] = []
upserted = 0
col = None
def _flush() -> int:
nonlocal upserted
if not ids:
return upserted
col.upsert(ids=list(ids), documents=list(docs), metadatas=list(metas))
upserted += len(ids)
print(f" upserted {upserted}")
ids.clear()
docs.clear()
metas.clear()
return upserted
try:
# ``create_collection`` lives inside the try so a Chroma-side
# "Collection already exists" failure (which can happen when the
# process-wide System cache still holds a pre-archive schema) is
# reported as a structured ``RebuildPartialError`` carrying
# ``archive_path`` — instead of an unstructured exception that
# strands the user without recovery instructions.
col = backend.create_collection(dest_palace, collection_name)
for emb_id, doc, meta in extract_via_sqlite(source_palace, collection_name):
ids.append(emb_id)
docs.append(doc or "")
# chromadb 1.5.x rejects None entries in the metadatas list
# but accepts empty dicts. Mempalace drawers always carry at
# least wing/room, so this branch is defensive — corruption
# in embedding_metadata could yield an emb_id with no rows.
metas.append(meta if meta else {})
if len(ids) >= batch_size:
_flush()
_flush()
except Exception as exc: # noqa: BLE001 — chromadb raises many shapes
partial = dict(counts_so_far)
partial[collection_name] = upserted
msg_parts = [
f"Upsert failed in collection {collection_name!r} after {upserted} rows: {exc!r}",
f"Partial palace left at: {dest_palace}",
]
if archive_path is not None:
msg_parts.append(f"Original palace archived at: {archive_path}")
msg_parts.append(
" Recover by removing the partial dest and re-running with "
f"--source {archive_path}"
)
else:
msg_parts.append(" Source palace is unchanged. Remove the partial dest and re-run.")
message = "\n ".join(msg_parts)
print(f"\n ERROR: {message}")
raise RebuildPartialError(
message,
partial_counts=partial,
failed_collection=collection_name,
dest_palace=dest_palace,
archive_path=archive_path,
) from exc
return upserted
def extract_via_sqlite(palace_path: str, collection_name: str) -> Iterator[tuple[str, str, dict]]:
"""Yield ``(embedding_id, document, metadata)`` for every row in
``collection_name``'s metadata segment by reading ``chroma.sqlite3``
directly.
Bypasses the chromadb client entirely — never opens a
``PersistentClient``, never imports hnswlib, never invokes the
HNSW segment writer. This is the recovery path for palaces where
``Collection.count()`` / ``Collection.get()`` raise ``InternalError``
because the compactor cannot apply WAL logs to the HNSW segment
(#1308). The drawer rows are still on disk in
``embeddings`` + ``embedding_metadata``; the corruption lives in the
on-disk index files, not the SQLite tables.
Resolution rule for chromadb's typed metadata columns: each
``embedding_metadata`` row stores its value in exactly one of
``string_value`` / ``int_value`` / ``float_value`` / ``bool_value``;
we pick the first non-NULL column in that order. Rows where every
typed column is NULL are dropped (chromadb never writes that shape).
The ``chroma:document`` key is removed from the metadata dict and
returned as the document; this matches how chromadb itself stores
``add(documents=...)``.
Silent on missing palace, missing ``chroma.sqlite3``, or unknown
collection name — yields nothing. Callers that need to distinguish
"empty collection" from "collection not present" should query
:func:`sqlite_drawer_count` first.
"""
sqlite_path = os.path.join(palace_path, "chroma.sqlite3")
if not os.path.isfile(sqlite_path):
return
conn = sqlite3.connect(f"file:{sqlite_path}?mode=ro", uri=True)
try:
seg_row = conn.execute(
"""
SELECT s.id FROM segments s
JOIN collections c ON s.collection = c.id
WHERE c.name = ? AND s.scope = 'METADATA'
""",
(collection_name,),
).fetchone()
if not seg_row:
return
segment_id = seg_row[0]
per_id: dict[str, dict] = defaultdict(dict)
order: list[str] = []
for emb_id, key, sv, iv, fv, bv in conn.execute(
"""
SELECT e.embedding_id, em.key, em.string_value, em.int_value,
em.float_value, em.bool_value
FROM embedding_metadata em
JOIN embeddings e ON em.id = e.id
WHERE e.segment_id = ?
ORDER BY em.id
""",
(segment_id,),
):
if emb_id not in per_id:
order.append(emb_id)
if sv is not None:
per_id[emb_id][key] = sv
elif iv is not None:
per_id[emb_id][key] = iv
elif fv is not None:
per_id[emb_id][key] = fv
elif bv is not None:
per_id[emb_id][key] = bool(bv)
for emb_id in order:
kv = per_id[emb_id]
doc = kv.pop("chroma:document", "")
yield emb_id, doc, kv
finally:
conn.close()
def rebuild_from_sqlite(
source_palace: str,
dest_palace: str,
*,
archive_existing_dest: bool = False,
batch_size: int = 1000,
) -> dict[str, int]:
"""Rebuild a palace by reading drawers from ``source_palace``'s
``chroma.sqlite3`` and upserting them into a fresh palace at
``dest_palace``.
Recovery path for the #1308 failure mode: the chromadb client raises
``InternalError: Failed to apply logs to the hnsw segment writer``
on every operation that touches the index (``count``, ``get``,
``query``), but the underlying SQLite tables are intact. Both the
legacy ``rebuild_index`` and the inline ``cli.cmd_repair`` path call
``Collection.count()`` as their first read — exactly the call that
fails — so neither can recover this class of corruption. This
function bypasses the chromadb read path entirely via
:func:`extract_via_sqlite`.
Re-embeds documents at upsert time using the configured embedding
function; the original HNSW vectors are not preserved (they live in
the corrupt ``data_level0.bin`` / ``link_lists.bin``, not in
SQLite). Acceptable for a corruption-recovery flow because the
embedding model is deterministic — same model + same document text
yields semantically equivalent search results.
``archive_existing_dest`` controls behavior when ``dest_palace``
already exists:
* ``False`` (default) — refuse with a clear message. Callers must
manually move the existing palace aside first.
* ``True`` — rename ``dest_palace`` to
``<dest_palace>.pre-rebuild-<timestamp>`` and read from there
instead. Used by the in-place CLI flow where ``--source`` defaults
to the same path as ``--palace``.
Returns a ``{collection_name: row_count}`` dict so callers (CLI,
tests) can verify the per-collection rebuild count without parsing
stdout. A successful rebuild always returns a dict with one key per
recoverable collection (values may be ``0`` when a collection is
legitimately empty in the source). The empty dict ``{}`` is reserved
for validation refusals (missing source DB, refusing to overwrite an
existing dest, in-place mode without ``archive_existing_dest``); CLI
callers should treat ``{}`` as an error and exit non-zero so CI and
scripts can distinguish "invalid inputs" from "successful recovery
that found zero rows." Raises :class:`RebuildPartialError` if a
chromadb upsert fails partway through; the dest palace is left in
place so the user can inspect what landed, and the in-place archive
(when applicable) is reported in the error so the user can re-run
against it.
.. warning::
In-place mode (``source_palace == dest_palace`` with
``archive_existing_dest=True``) calls
``chromadb.api.client.SharedSystemClient.clear_system_cache()`` to
drop chromadb's process-wide System registry — required because
an existing cached System built against the original palace will
refuse ``create_collection`` after the dir is renamed (chromadb
still thinks the collections exist). This invalidates any
PersistentClient instances held elsewhere in the same process for
*any* palace, not just this one. Do not call this function from
inside a long-running mempalace process (MCP server, daemon)
while other callers hold live ``PersistentClient`` references —
use the CLI in a separate process instead. Cross-palace use
(``source != dest``) does not touch the cache.
Note on metadata fidelity: the resolution rule
(``string_value`` → ``int_value`` → ``float_value`` → ``bool_value``)
matches the precedent in :mod:`mempalace.migrate`. ChromaDB 0.4.x
occasionally wrote booleans as ``int_value=0/1``; those will
round-trip as ``int`` rather than ``bool`` after this rebuild. This
is a known divergence and matches the existing migrate-path
behavior.
"""
source_palace = os.path.abspath(os.path.expanduser(source_palace))
dest_palace = os.path.abspath(os.path.expanduser(dest_palace))
src_db = os.path.join(source_palace, "chroma.sqlite3")
in_place = source_palace == dest_palace
print(f"\n{'=' * 55}")
print(" MemPalace Repair — Rebuild from SQLite")
print(f"{'=' * 55}\n")
print(f" Source: {source_palace}")
print(f" Dest: {dest_palace}")
# Validate source BEFORE any destructive moves. An earlier draft
# archived the dest first and surfaced the missing-chroma.sqlite3
# error after — leaving the user with a renamed dir to manually undo
# when the archive itself was empty. Validate first so a user error
# (--source pointing at a non-palace dir) bails cleanly.
if in_place:
if not archive_existing_dest:
print(
"\n Source and dest are the same path. Pass "
"archive_existing_dest=True (CLI: --archive-existing) to move "
"the existing palace aside, or pass a different source_palace= "
"(CLI: --source)."
)
return {}
if not os.path.isfile(src_db):
print(f"\n Source palace has no chroma.sqlite3 at {src_db}")
return {}
else:
if not os.path.isfile(src_db):
print(f"\n Source palace has no chroma.sqlite3 at {src_db}")
return {}
if os.path.exists(dest_palace):
print(
f"\n Refusing to rebuild into existing path: {dest_palace}\n"
" Move it aside, pass a different dest, or set "
"archive_existing_dest=True if rebuilding in place "
"(source_palace == dest_palace)."
)
return {}
archive_path: Optional[str] = None
if in_place:
ts = datetime.now().strftime("%Y%m%d-%H%M%S")
archive_path = f"{dest_palace}.pre-rebuild-{ts}"
print(f" Archiving {dest_palace}{archive_path}")
shutil.move(dest_palace, archive_path)
source_palace = archive_path
src_db = os.path.join(source_palace, "chroma.sqlite3")
# In-place only: drop chromadb's process-wide System registry so
# the new client at dest_palace builds a fresh System. Without
# this, ``create_collection`` raises "Collection already exists"
# because the cached System still holds the pre-rename schema.
# Cross-palace mode does not need this and would needlessly
# invalidate other callers' clients (see docstring warning).
try:
from chromadb.api.client import SharedSystemClient
SharedSystemClient.clear_system_cache()
except Exception as exc: # noqa: BLE001
print(
f" Warning: could not clear chromadb system cache ({exc!r}); "
"in-place rebuild may fail with 'Collection already exists'."
)
os.makedirs(dest_palace, exist_ok=True)
# Backend lifetime is wrapped in try/finally so the dest palace's
# PersistentClient handle (opened lazily inside ``create_collection``
# / ``get_collection``) is released on every exit path: success,
# ``RebuildPartialError``, or any unexpected exception. Without this,
# a long-running process that calls ``rebuild_from_sqlite`` would
# leak SQLite/HNSW file handles into Chroma's ``SharedSystemClient``
# cache, surfacing later as "Collection already exists" on the next
# in-place rebuild or as a Windows file-lock failure on cleanup
# (cf. #1285's lifecycle hardening for the legacy rebuild path).
backend = ChromaBackend()
counts: dict[str, int] = {}
try:
for cname in _recoverable_collections():
print(f"\n [{cname}]")
upserted = _rebuild_one_collection(
backend=backend,
source_palace=source_palace,
dest_palace=dest_palace,
collection_name=cname,
batch_size=batch_size,
archive_path=archive_path,
counts_so_far=counts,
)
counts[cname] = upserted
if upserted == 0:
print(f" no rows found for {cname} in source palace")
else:
print(f" done: {upserted} rows in {cname}")
print(f"\n Rebuild complete. {sum(counts.values())} total rows.")
if archive_path is not None:
print(f" Original palace archived at: {archive_path}")
print(f"{'=' * 55}\n")
return counts
finally:
backend.close()
def status(palace_path=None) -> dict: def status(palace_path=None) -> dict:
"""Read-only health check: compare sqlite vs HNSW element counts. """Read-only health check: compare sqlite vs HNSW element counts.
+61
View File
@@ -1097,3 +1097,64 @@ def test_reconfigure_stdio_is_noop_off_windows():
_reconfigure_stdio_utf8_on_windows() _reconfigure_stdio_utf8_on_windows()
assert stdin.reconfigure_calls == [] assert stdin.reconfigure_calls == []
# ── cmd_repair: from-sqlite mode exit codes ──────────────────────────
@patch("mempalace.cli.MempalaceConfig")
def test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero(mock_config_cls, tmp_path, capsys):
"""When ``rebuild_from_sqlite`` returns ``{}`` for a validation
refusal (missing source DB, in-place without --archive-existing,
refusing to overwrite an existing dest), the CLI must surface a
non-zero exit so unattended scripts and CI distinguish "invalid
inputs" from "successful recovery that found zero rows."
Catches: a regression where the CLI treats the validation-refusal
sentinel as success, leaving CI green on a no-op repair that should
have alerted an operator.
"""
palace_dir = tmp_path / "palace"
palace_dir.mkdir()
mock_config_cls.return_value.palace_path = str(palace_dir)
args = argparse.Namespace(
palace=str(palace_dir),
mode="from-sqlite",
source=None,
archive_existing=False,
yes=True,
)
with patch("mempalace.repair.rebuild_from_sqlite", return_value={}):
with pytest.raises(SystemExit) as excinfo:
cmd_repair(args)
assert excinfo.value.code == 1
@patch("mempalace.cli.MempalaceConfig")
def test_cmd_repair_from_sqlite_success_does_not_exit(mock_config_cls, tmp_path):
"""A successful from-sqlite rebuild — even one that finds zero rows
in a legitimately empty source palace — must NOT call ``sys.exit``.
A populated counts dict (with ``0`` values) is the success signal;
only the empty dict ``{}`` is reserved for validation refusal.
Catches: a regression where ``if not counts`` is replaced by
``if not sum(counts.values())`` or similar, conflating "empty source"
with "validation refused" and breaking idempotent recovery scripts.
"""
palace_dir = tmp_path / "palace"
palace_dir.mkdir()
mock_config_cls.return_value.palace_path = str(palace_dir)
args = argparse.Namespace(
palace=str(palace_dir),
mode="from-sqlite",
source=None,
archive_existing=False,
yes=True,
)
# Zero rows but per-collection keys present → success, no exit.
fake_counts = {"mempalace_drawers": 0, "mempalace_closets": 0}
with patch("mempalace.repair.rebuild_from_sqlite", return_value=fake_counts):
# Should return cleanly; no SystemExit raised.
cmd_repair(args)
+384
View File
@@ -682,3 +682,387 @@ def test_max_seq_id_rollback_on_verification_failure(tmp_path, monkeypatch):
# A backup file is still present — caller can roll back from it. # A backup file is still present — caller can roll back from it.
leftover = [fn for fn in os.listdir(palace) if "max-seq-id-backup-" in fn] leftover = [fn for fn in os.listdir(palace) if "max-seq-id-backup-" in fn]
assert leftover assert leftover
# ── extract_via_sqlite + rebuild_from_sqlite (#1308) ──────────────────
#
# These tests build real chromadb palaces in tmp_path rather than mocking
# the SQLite layer. The bug class they guard against is "extraction sees
# different rows than chromadb stored" — the only honest check is to let
# chromadb actually write rows and then read them back via the SQLite
# bypass. Mocking the SQLite cursor would defeat the test.
def _seed_palace(palace_path, collection_name, rows):
"""Build a real chromadb palace at ``palace_path`` and add ``rows``.
``rows`` is a list of ``(id, document, metadata)`` tuples.
"""
from mempalace.backends.chroma import ChromaBackend
backend = ChromaBackend()
try:
col = backend.create_collection(str(palace_path), collection_name)
col.upsert(
ids=[r[0] for r in rows],
documents=[r[1] for r in rows],
metadatas=[r[2] for r in rows],
)
finally:
# Release chromadb's rust-side SQLite/HNSW file locks before the
# caller proceeds. Without this, an in-place rebuild on Windows
# fails with WinError 32 on data_level0.bin during the archive
# rename (cf. PR #1310 test-windows job).
backend.close()
def test_extract_via_sqlite_returns_all_rows_with_metadata(tmp_path):
"""Round-trip: a chromadb palace with N upserted rows returns those
same N rows when read via the SQLite bypass.
Catches: anyone who breaks the segments/embeddings/embedding_metadata
JOIN, swaps the metadata vs vector segment, or changes how the
document is stored under the ``chroma:document`` key.
Also asserts every embedding row underlying the extraction lives in
a ``segments.scope = 'METADATA'`` segment. Document + metadata rows
are stored under METADATA in Chroma's segment layout while HNSW
files live under ``VECTOR``; locking that assumption in here means a
future refactor that accidentally points the JOIN at ``VECTOR``
fails this test instead of silently regressing the recovery path.
"""
rows = [
(f"drawer_{i:03d}", f"document body {i}", {"wing": "test_wing", "room": f"r{i % 3}"})
for i in range(25)
]
_seed_palace(tmp_path, "mempalace_drawers", rows)
extracted = list(repair.extract_via_sqlite(str(tmp_path), "mempalace_drawers"))
assert len(extracted) == 25
by_id = {emb_id: (doc, meta) for emb_id, doc, meta in extracted}
assert set(by_id) == {r[0] for r in rows}
for emb_id, doc, meta in rows:
got_doc, got_meta = by_id[emb_id]
assert got_doc == doc, f"document mangled for {emb_id}"
assert got_meta == meta, f"metadata mangled for {emb_id}: {got_meta!r}"
# Lock the segment-scope assumption directly against Chroma's on-disk
# layout so a future change that points the extraction JOIN at the
# VECTOR segment cannot pass this test. Query each extracted row's
# backing segment scope via the same SQLite tables ``extract_via_sqlite``
# reads from.
sqlite_path = os.path.join(str(tmp_path), "chroma.sqlite3")
conn = sqlite3.connect(f"file:{sqlite_path}?mode=ro", uri=True)
try:
scopes = {
scope
for (scope,) in conn.execute(
"""
SELECT DISTINCT s.scope
FROM embeddings e
JOIN segments s ON e.segment_id = s.id
JOIN collections c ON s.collection = c.id
WHERE c.name = ? AND e.embedding_id IN ({})
""".format(",".join("?" * len(extracted))),
("mempalace_drawers", *(emb_id for emb_id, _, _ in extracted)),
)
}
finally:
conn.close()
assert scopes == {"METADATA"}, (
f"extraction is reading from segments scoped {scopes!r}; only "
"'METADATA' should back the document/metadata rows. If Chroma's "
"segment layout changed, update extract_via_sqlite's WHERE clause."
)
def test_extract_via_sqlite_preserves_typed_metadata(tmp_path):
"""Chromadb stores int / float / bool / string in distinct typed
columns. Extraction must round-trip the original type, not coerce
everything to string.
Catches: a regression where the SELECT order changes and ints come
back as None, or where the column-resolution rule prefers the wrong
column.
"""
rows = [
(
"drawer_typed",
"doc",
{
"wing": "w",
"chunk_index": 7, # int
"score": 0.42, # float
"is_active": True, # bool
},
),
]
_seed_palace(tmp_path, "mempalace_drawers", rows)
extracted = list(repair.extract_via_sqlite(str(tmp_path), "mempalace_drawers"))
assert len(extracted) == 1
_, _, meta = extracted[0]
assert meta["chunk_index"] == 7 and isinstance(meta["chunk_index"], int)
assert meta["score"] == 0.42 and isinstance(meta["score"], float)
assert meta["is_active"] is True
assert meta["wing"] == "w"
def test_extract_via_sqlite_unknown_collection_yields_nothing(tmp_path):
"""Asking for a collection that isn't in the palace must return an
empty iterator, not silently fall back to another collection's
metadata segment. Seeds two real collections and queries for a third
name so a regression that drops the WHERE c.name=? filter would leak
rows from the seeded collections rather than passing.
"""
_seed_palace(tmp_path, "mempalace_drawers", [("d1", "doc", {"wing": "w"})])
_seed_palace(tmp_path, "mempalace_closets", [("c1", "abbrev", {"wing": "w"})])
assert list(repair.extract_via_sqlite(str(tmp_path), "not_a_real_collection")) == []
def test_extract_via_sqlite_missing_palace_yields_nothing(tmp_path):
"""No chroma.sqlite3 → empty iterator, no exception. Callers depend
on this when probing speculatively."""
empty = tmp_path / "no_palace_here"
empty.mkdir()
assert list(repair.extract_via_sqlite(str(empty), "mempalace_drawers")) == []
def test_rebuild_from_sqlite_roundtrips_via_real_chromadb(tmp_path):
"""End-to-end: seed source palace, rebuild into a fresh dest, then
open dest with a fresh ChromaBackend and verify ``count()`` and
metadata filters return the original rows. Also asserts a closet
document round-trips so a future regression that re-embeds with the
wrong EF or swaps drawer/closet content would fail here.
This is the single most important regression guard. If
``rebuild_from_sqlite`` silently drops rows or mangles metadata, no
other test in this file would catch it because they all stop at the
extraction layer.
"""
from mempalace.backends.chroma import ChromaBackend
source = tmp_path / "source"
dest = tmp_path / "dest"
rows = [
(f"drawer_{i:03d}", f"body {i}", {"wing": "alpha" if i % 2 else "beta", "room": "r0"})
for i in range(40)
]
_seed_palace(source, "mempalace_drawers", rows)
_seed_palace(
source,
"mempalace_closets",
[("closet_x", "abbrev pointer →drawer_001", {"wing": "alpha"})],
)
counts = repair.rebuild_from_sqlite(str(source), str(dest))
assert counts == {"mempalace_drawers": 40, "mempalace_closets": 1}
backend = ChromaBackend()
drawers = backend.get_collection(str(dest), "mempalace_drawers")
assert drawers.count() == 40
alpha = drawers.get(where={"wing": "alpha"})
assert len(alpha["ids"]) == 20
# Spot-check that document text round-trips for one specific drawer
# — protects against a regression where extraction or upsert order
# silently swaps document bodies between IDs.
one = drawers.get(ids=["drawer_007"], include=["documents", "metadatas"])
assert one["documents"] == ["body 7"]
assert one["metadatas"][0]["wing"] == "alpha"
# Closets: the AAAK index layer. Re-embedded with the same EF so a
# known closet ID and its document body must come back intact.
closets = backend.get_collection(str(dest), "mempalace_closets")
assert closets.count() == 1
closet_row = closets.get(ids=["closet_x"], include=["documents", "metadatas"])
assert closet_row["documents"] == ["abbrev pointer →drawer_001"]
assert closet_row["metadatas"][0] == {"wing": "alpha"}
def test_rebuild_from_sqlite_refuses_existing_dest(tmp_path):
"""Refuse to write into a directory that already exists when source
and dest differ. Without this, an unattended re-run would silently
interleave a partial rebuild with whatever's already at dest.
"""
source = tmp_path / "source"
dest = tmp_path / "dest"
_seed_palace(source, "mempalace_drawers", [("d1", "doc", {"wing": "w"})])
dest.mkdir()
# Drop a marker file so we can prove the dir wasn't touched.
(dest / "marker.txt").write_text("preexisting")
counts = repair.rebuild_from_sqlite(str(source), str(dest))
assert counts == {}
assert (dest / "marker.txt").read_text() == "preexisting"
assert not (dest / "chroma.sqlite3").exists()
def test_rebuild_from_sqlite_in_place_archives_when_opted_in(tmp_path):
"""In-place rebuild (source == dest) with ``archive_existing_dest=True``
must move the original aside to ``<dest>.pre-rebuild-<ts>`` and read
from the archive — the original drawer rows must survive in the new
palace, AND the archive itself must still contain the original rows.
Catches: a refactor that moves the original out but then reads from
the now-empty original location, producing an empty rebuild; also
catches a swap that empties the archive after reading.
"""
palace = tmp_path / "palace"
rows = [(f"d{i}", f"body {i}", {"wing": "w", "room": "r"}) for i in range(15)]
_seed_palace(palace, "mempalace_drawers", rows)
counts = repair.rebuild_from_sqlite(str(palace), str(palace), archive_existing_dest=True)
assert counts["mempalace_drawers"] == 15
archives = [p for p in tmp_path.iterdir() if p.name.startswith("palace.pre-rebuild-")]
assert len(archives) == 1
assert (archives[0] / "chroma.sqlite3").exists()
# Archive must still hold the same row count via the SQLite bypass —
# proves the archive wasn't silently truncated as a side effect.
archived_rows = list(repair.extract_via_sqlite(str(archives[0]), "mempalace_drawers"))
assert len(archived_rows) == 15
from mempalace.backends.chroma import ChromaBackend
rebuilt = ChromaBackend().get_collection(str(palace), "mempalace_drawers")
assert rebuilt.count() == 15
def test_rebuild_from_sqlite_in_place_refuses_without_archive_flag(tmp_path):
"""Source == dest without archive flag must abort untouched. The
most catastrophic possible regression of this code path is silently
deleting the only copy of the user's data."""
palace = tmp_path / "palace"
_seed_palace(palace, "mempalace_drawers", [("d1", "doc", {"wing": "w"})])
sqlite_before = (palace / "chroma.sqlite3").stat().st_size
counts = repair.rebuild_from_sqlite(str(palace), str(palace))
assert counts == {}
# Same file, untouched.
assert (palace / "chroma.sqlite3").stat().st_size == sqlite_before
archives = [p for p in tmp_path.iterdir() if "pre-rebuild" in p.name]
assert archives == []
def test_rebuild_from_sqlite_source_missing_chroma_db(tmp_path):
"""Source dir exists but has no chroma.sqlite3 → returns empty,
leaves dest untouched."""
source = tmp_path / "source"
source.mkdir()
(source / "stray_file").write_text("not a palace")
dest = tmp_path / "dest"
counts = repair.rebuild_from_sqlite(str(source), str(dest))
assert counts == {}
assert not dest.exists()
def test_rebuild_from_sqlite_in_place_validates_source_before_archiving(tmp_path):
"""In-place + archive_existing_dest=True with a dir that lacks
chroma.sqlite3 must NOT rename the dir before bailing. An earlier
revision archived first and validated second, leaving the user with
a renamed empty dir to manually undo. Catches that ordering bug.
"""
palace = tmp_path / "palace"
palace.mkdir()
(palace / "marker.txt").write_text("not a real palace")
counts = repair.rebuild_from_sqlite(str(palace), str(palace), archive_existing_dest=True)
assert counts == {}
# No archive created — original dir still in place with its marker.
assert palace.exists()
assert (palace / "marker.txt").read_text() == "not a real palace"
archives = [p for p in tmp_path.iterdir() if "pre-rebuild" in p.name]
assert archives == []
def test_rebuild_from_sqlite_raises_on_upsert_failure(tmp_path, monkeypatch):
"""Mid-batch upsert failure must raise ``RebuildPartialError`` and
surface the failed collection + archive path so the user can recover.
Without this, an unattended script gets exit-code-zero on a partial
rebuild and the user discovers the data loss only when search starts
returning fewer hits.
"""
palace = tmp_path / "palace"
rows = [(f"d{i}", f"body {i}", {"wing": "w", "room": "r"}) for i in range(5)]
_seed_palace(palace, "mempalace_drawers", rows)
# Make the very first upsert raise so we don't depend on batch
# boundary behavior. Patching ChromaCollection.upsert (the wrapper
# mempalace's backend returns) keeps the failure path realistic.
# ``monkeypatch`` is pytest's built-in fixture that auto-restores
# the original attribute when the test exits, so we don't need to
# undo this manually.
from mempalace.backends.chroma import ChromaCollection
def boom(self, **kwargs):
raise RuntimeError("simulated chromadb upsert failure")
monkeypatch.setattr(ChromaCollection, "upsert", boom)
with pytest.raises(repair.RebuildPartialError) as excinfo:
repair.rebuild_from_sqlite(str(palace), str(palace), archive_existing_dest=True)
err = excinfo.value
assert err.failed_collection == "mempalace_drawers"
assert err.partial_counts.get("mempalace_drawers") == 0
assert err.archive_path is not None
assert os.path.isfile(os.path.join(err.archive_path, "chroma.sqlite3"))
assert err.dest_palace == os.path.abspath(str(palace))
def test_rebuild_from_sqlite_honors_configured_drawer_collection_name(tmp_path, monkeypatch):
"""A user with a non-default drawers collection name (set via
``MempalaceConfig().collection_name``) must have THAT collection
rebuilt — not the hardcoded ``mempalace_drawers``.
Catches: a regression where the recovery path silently rebuilds the
default-name collection on a custom-named palace, leaving the user's
actual data unrebuilt while reporting "rebuild complete." This is
the failure mode reviewer mjc flagged on PR #1310 as needing to line
up with the configured-collection-name work in #1312. Closets stay
fixed (``mempalace_closets``) by design — the AAAK index references
drawer IDs by string and is not per-deployment configurable.
Strategy: monkeypatch the lazy resolver so the test is hermetic and
does not depend on the global config file or env state.
"""
from mempalace.backends.chroma import ChromaBackend
custom_drawers = "custom_drawers_xyz"
monkeypatch.setattr(repair, "_drawers_collection_name", lambda: custom_drawers)
source = tmp_path / "source"
dest = tmp_path / "dest"
drawer_rows = [(f"d{i}", f"body {i}", {"wing": "alpha"}) for i in range(3)]
closet_rows = [("closet_a", "abbrev →d0", {"wing": "alpha"})]
_seed_palace(source, custom_drawers, drawer_rows)
_seed_palace(source, "mempalace_closets", closet_rows)
counts = repair.rebuild_from_sqlite(str(source), str(dest))
# Rebuilt under the custom name, not under the default "mempalace_drawers".
assert counts == {custom_drawers: 3, "mempalace_closets": 1}
backend = ChromaBackend()
rebuilt_drawers = backend.get_collection(str(dest), custom_drawers)
assert rebuilt_drawers.count() == 3
# Default-name collection must NOT exist in dest — proves we did not
# silently fall back to the hardcoded name during rebuild.
try:
rebuilt_default = backend.get_collection(str(dest), "mempalace_drawers")
# If get_collection returns without raising, count() should be 0
# (chromadb may auto-create on get with some EFs); a non-zero
# count would mean we wrote rows to the wrong collection.
assert rebuilt_default.count() == 0, (
"rebuild leaked rows into the default-name collection on a "
"custom-name palace — recovery wrote to the wrong collection."
)
except Exception:
pass # Expected: collection wasn't created.