fix(repair): run SQLite integrity preflight before chromadb open

#1364 added the SQLite quick_check preflight to rebuild_index, but
placed it AFTER backend.get_collection(...). On a SQLite-corrupt
palace, chromadb's Rust binding raises pyo3_runtime.PanicException,
which derives from BaseException rather than Exception, so it
propagates past the existing `except Exception` handlers and the user
sees a 30-line stack trace instead of the friendly abort message #1364
was designed to deliver. Reproduced with `mempalace repair --yes`
against a palace whose chroma.sqlite3 has 4 mangled pages: pre-fix,
panic; post-fix, the clean abort message and exit code 1.
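
A minimal sketch of why the handler was bypassed. `FakePanic` is a
hypothetical stand-in for pyo3_runtime.PanicException, and
`open_collection` / `guarded_open` are illustrative names, not
mempalace functions. The only assumption carried over is that the
panic type derives from BaseException rather than Exception:

```python
class FakePanic(BaseException):
    """Hypothetical stand-in for a panic raised by a native binding."""


def open_collection():
    # Simulates the backend hitting a malformed SQLite page.
    raise FakePanic("malformed page")


def guarded_open():
    try:
        open_collection()
    except Exception:
        # Never reached: FakePanic is not an Exception subclass.
        return "friendly abort"
    return "opened"


def run():
    try:
        return guarded_open()
    except BaseException as exc:
        # The panic escaped guarded_open() and lands here instead.
        return f"escaped: {type(exc).__name__}"
```

Calling `run()` shows the panic skipping the inner handler, which is
why the fix moves the integrity check ahead of the open instead of
widening the except clause.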

Two changes:

- mempalace/cli.py cmd_repair: run sqlite_integrity_errors() right
  after the basic palace-existence check, BEFORE the max_seq_id
  preflight (which itself opens sqlite3) and BEFORE backend =
  ChromaBackend(). Exit non-zero so unattended scripts and CI gates
  see the failure.

- mempalace/repair.py rebuild_index: same move at the function level
  for direct callers (tests, MCP) that bypass cmd_repair.
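
The ordering described above can be sketched roughly as follows.
`cmd_repair_sketch`, `check_integrity`, and `open_backend` are
hypothetical names used only for illustration; the real cmd_repair
does more (palace-existence check, max_seq_id preflight):

```python
import sys


def cmd_repair_sketch(check_integrity, open_backend):
    """Run the integrity check first; open the backend only if it passes.

    check_integrity: callable returning a list of error strings.
    open_backend: callable that may panic on a corrupt database.
    Returns a process-style exit code.
    """
    errors = check_integrity()
    if errors:
        for line in errors:
            print(f"SQLite integrity error: {line}", file=sys.stderr)
        # Non-zero so unattended scripts and CI gates see the failure.
        return 1
    open_backend()
    return 0
```

Because the backend is passed in as a callable, a test can assert that
it is never invoked when the check reports errors.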

The new test test_rebuild_index_runs_sqlite_preflight_before_chromadb_open
uses a real chromadb-built palace (no ChromaBackend mock) plus a
real corrupt SQLite (16 KB of mangled pages) so the ordering is
exercised end-to-end. The previously-shipping test for the abort path
mocked both the backend and sqlite_integrity_errors, which is why the
ordering bug shipped CI-green.
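
One way such a fixture could be built, as a sketch only: `build_db`
and `is_healthy` are hypothetical helpers, not the ones in the test
suite, and the 4096-byte page size and 0xFF fill are assumptions
chosen to match "4 mangled pages = 16 KB":

```python
import sqlite3
from contextlib import closing

PAGE_SIZE = 4096  # assumed page size; set explicitly below


def build_db(path, mangled_pages=0):
    """Create a real SQLite file, then overwrite pages 2..N+1 with junk."""
    with closing(sqlite3.connect(path)) as conn:
        conn.execute(f"PRAGMA page_size = {PAGE_SIZE}")
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, body TEXT)")
        conn.executemany(
            "INSERT INTO t (body) VALUES (?)",
            [("x" * 500,) for _ in range(200)],
        )
        conn.commit()
    if mangled_pages:
        with open(path, "r+b") as fh:
            fh.seek(PAGE_SIZE)  # leave page 1 (header and schema) intact
            fh.write(b"\xff" * PAGE_SIZE * mangled_pages)


def is_healthy(path):
    """True only if PRAGMA quick_check returns the single row 'ok'."""
    try:
        with closing(sqlite3.connect(path)) as conn:
            return conn.execute("PRAGMA quick_check").fetchall() == [("ok",)]
    except sqlite3.DatabaseError:
        # Severe corruption can surface as an exception instead of rows.
        return False
```

Leaving page 1 intact keeps the file recognisable as a database, so
the failure comes from the mangled table pages, not from a bad header.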

Six existing test_cli.py cmd_repair tests used `(palace_dir /
"chroma.sqlite3").write_text("db")` to fake the SQLite file. The new
preflight correctly fails quick_check on those 2-byte stubs, so the
tests now create empty real SQLite DBs the same way the test_repair.py
fixtures already do.
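
A sketch of the difference between the two kinds of fixture.
`make_stub`, `make_empty_db`, and `passes_quick_check` are hypothetical
names; how the test_repair.py fixtures actually create their databases
is not shown here, so the connect-and-close approach is an assumption:

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def make_stub(path):
    # Old fixture style: two bytes of text, not a SQLite file.
    Path(path).write_text("db")


def make_empty_db(path):
    # Let SQLite create the file itself; an empty database is valid.
    with closing(sqlite3.connect(path)) as conn:
        conn.commit()


def passes_quick_check(path):
    try:
        with closing(sqlite3.connect(path)) as conn:
            rows = conn.execute("PRAGMA quick_check").fetchall()
    except sqlite3.DatabaseError:
        # A file without a valid SQLite header is rejected here.
        return False
    return rows == [("ok",)]
```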
Igor Lins e Silva
2026-05-07 11:52:58 -03:00
parent f38d9eb109
commit 5134a635ed
4 changed files with 83 additions and 11 deletions
+11 -5
@@ -633,6 +633,17 @@ def rebuild_index(
     print(f"{'=' * 55}\n")
     print(f" Palace: {palace_path}")
 
+    # Run the SQLite integrity preflight before any chromadb client open.
+    # ChromaDB's rust binding raises pyo3_runtime.PanicException (which is
+    # not a regular Exception subclass) on a malformed page, propagating
+    # past the try/except around get_collection below. Catching the
+    # corruption here lets us surface the clear recovery instructions and
+    # exit cleanly before chromadb's compactor touches the disk.
+    sqlite_errors = sqlite_integrity_errors(palace_path)
+    if sqlite_errors:
+        print_sqlite_integrity_abort(palace_path, sqlite_errors)
+        return
+
     preflight = maybe_repair_poisoned_max_seq_id_before_rebuild(
         palace_path,
         assume_yes=True,
@@ -676,11 +687,6 @@ def rebuild_index(
         print(e.message)
         return
 
-    sqlite_errors = sqlite_integrity_errors(palace_path)
-    if sqlite_errors:
-        print_sqlite_integrity_abort(palace_path, sqlite_errors)
-        return
-
     # Back up ONLY the SQLite database, not the bloated HNSW files
     sqlite_path = os.path.join(palace_path, "chroma.sqlite3")
     backup_path = sqlite_path + ".backup"