Files
mempalace/tests
jp 74ff5e6b98 fix(hnsw): integrity gate in quarantine_stale_hnsw — corruption vs flush-lag
Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in #1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.

Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.

Fix: two-stage gate.

  1. mtime gate (existing) — gap > stale_seconds is necessary.
  2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
     expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
     tail) and a non-trivial size, WITHOUT deserializing the file.
     Healthy segment with mtime drift → keep in place; truncated /
     zero-filled / partial-flush → quarantine.

Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.

The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.

4 new tests in test_backends.py:

  - test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
  - test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
    (drift + valid meta — the production case at 06:24)
  - test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
    (fresh / never-flushed, no meta file)
  - test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
    size, partial-flush shape)

Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.

Suite 1366/1366 pass.

Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.

CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 09:40:25 -07:00
..
2026-04-11 16:16:49 -07:00