Commit Graph

676 Commits

Author SHA1 Message Date
jp e5e7a57930 fix(hnsw): gate quarantine_stale_hnsw to cold-start, not every reconnect
Symptom on the canonical disks daemon: drift quarantines firing every
10–30 minutes throughout the day under steady write load. Logs show
.drift-* directories accumulating despite the daemon being the only
writer (no Syncthing replication of palace data).

Root cause is a false-positive thrash in the quarantine heuristic:

- chroma.sqlite3 mtime bumps on every write (millisecond cadence).
- HNSW segment files (data_level0.bin) only flush to disk on
  chromadb's internal cadence, which can lag minutes behind sqlite
  under continuous write load.

Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames
a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and
the cycle repeats as soon as the next batch of writes lands. The 300s
threshold (lowered from 3600s in PR #1173 after a 0.96h-drift production
segfault) is correct for the cross-machine-replication failure mode it
was designed for, but wrong for a daemon-strict deployment whose only
"drift" source is its own benign flush lag.

Fix: gate the proactive quarantine check to the first ``make_client()``
invocation per palace per process (``ChromaBackend._quarantined_paths``
set). Real cold-start drift (replication, partial restore, crashed-mid-
write) still gets caught — that's exactly when a fresh daemon process
opens the palace. Real runtime drift on observed HNSW errors still gets
caught via palace-daemon's ``_auto_repair`` which calls
``quarantine_stale_hnsw`` directly, bypassing this gate.

Two new tests in test_backends.py verify single-fire-per-palace and
per-palace independence. Conftest clears the gate between tests.

Suite 1362/1362 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 09:40:25 -07:00
jp db80f6e26c fix: call quarantine_stale_hnsw() in make_client(); lower threshold to 5min
make_client() called _fix_blob_seq_ids but skipped quarantine_stale_hnsw,
so every fresh process (stop hook, precompact hook, CLI) opened a drifted
palace and segfaulted in chromadb_rust_bindings before any write-path
protection could fire.

#1062 wires the quarantine call at MCP server startup (covers long-lived
server processes). This fix adds it to make_client() itself — the call
site that all callers (server, hooks, CLI, tests) pass through — so every
fresh PersistentClient open is protected regardless of entry point.

Also lowers stale_seconds default from 3600 to 300: a 0.96h-drifted
segment caused production segfaults before the 1h threshold fired.
ChromaDB's HNSW flush cadence means legitimate drift is seconds to low
minutes; 5min gives headroom without admitting clearly corrupt segments.
2026-04-26 09:39:53 -07:00
Igor Lins e Silva 6890948e09 Merge pull request #1210 from MemPalace/fix/repair-extraction-cap-detection
fix(repair): refuse to overwrite when extraction looks truncated (#1208)
2026-04-26 04:49:01 -03:00
bensig 452630e927 fix(repair): refuse to overwrite when extraction looks truncated (#1208)
The user-reported case in #1208: a palace with 67,580 drawers had its
HNSW files manually quarantined to recover from corruption. ``mempalace
repair`` then ran cleanly and reported "Drawers found: 10000 ... Repair
complete. 10000 drawers rebuilt." Backup was the v3.3.3 chroma.sqlite3
that did contain the full 67,580 — but the rebuilt collection only had
the first 10K. 85% data loss, no warning.

Root cause: ChromaDB's collection-layer get() silently caps at
``CHROMADB_DEFAULT_GET_LIMIT = 10_000`` rows when reading from a
collection whose segment metadata is stale (typical post-quarantine
state). col.count() returns the same capped value, so neither the
loop bound nor the extraction count flagged the truncation.

Fix is defense-in-depth, not a recovery mechanism. Repair now:

1. After extraction, queries chroma.sqlite3 directly via a read-only
   sqlite3 connection: COUNT(*) FROM embeddings JOIN segments JOIN
   collections WHERE name='mempalace_drawers'. If that count exceeds
   the extracted count, abort with a clear message before any
   destructive operation.
2. Falls back to a weaker check when the SQLite query can't run
   (chromadb schema drift, locked file): if extracted exactly equals
   CHROMADB_DEFAULT_GET_LIMIT, that's a strong-enough cap signal to
   refuse without explicit acknowledgement.
3. Adds ``--confirm-truncation-ok`` (CLI) and ``confirm_truncation_ok``
   (rebuild_index kwarg) to override after independent verification.
   Useful for the rare case of a palace genuinely sized at exactly
   10,000 drawers.

The guard logic lives in ``repair.check_extraction_safety()`` so the
two extraction paths (CLI ``cmd_repair`` and the lower-level
``rebuild_index``) share a single implementation. Raises
``TruncationDetected`` carrying the printable message.

Tests: 9 new cases covering the safe path (counts match, SQLite
unreadable but well under cap), both abort paths (SQLite higher than
extracted, unreadable + at cap), the override flag, and end-to-end
behavior of ``rebuild_index`` with the guard wired in. Plus two
``sqlite_drawer_count`` tests for the missing-file and bad-schema
cases.

What's NOT in this PR: actually recovering the missing 57,580
drawers from the user's case. The on-disk SQLite still holds them;
recovery is a separate flow (direct-extract from chroma.sqlite3,
bypass the chromadb collection layer entirely). This PR's job is
to stop repair from making it worse.

Refs #1208.
2026-04-25 23:34:05 -07:00
Igor Lins e Silva 5e57404502 Merge pull request #935 from shaun0927/fix/repair-crash-safety
fix: guard against data loss in repair, migrate, and CLI rebuild
2026-04-25 20:49:23 -03:00
Igor Lins e Silva 34bc08fb30 Merge pull request #1205 from MemPalace/ci/python-313-windows-macos
ci: bump Windows and macOS jobs to Python 3.13
2026-04-25 20:03:12 -03:00
Igor Lins e Silva 57854c7e2e ci: bump Windows and macOS jobs to Python 3.13
3.11 is mid-life; 3.13 is already on the Linux matrix and gives ~3.5
years of upstream support. Aligns the single-version platform jobs with
the top of the Linux matrix. requires-python and lint job left alone.

Refs #1192 (Option A).
2026-04-25 19:59:05 -03:00
jp 5b07b869b0 fix(palace_graph): skip None metadata in build_graph
ChromaDB can return None for drawers without metadata (legacy data,
partial writes — same root cause as upstream #1020 / our PR #1094).
build_graph at line 95 called meta.get("room", "") unconditionally,
which AttributeErrors on None and takes out every consumer of
build_graph for the whole call path: graph_stats, find_tunnels,
traverse, and (most visibly) the daemon's /stats endpoint.

Caught 2026-04-25 by palace-daemon's verify-routes.sh smoke test
against the canonical 151K-drawer palace — /stats was 500-ing on a
single None drawer.

Adds `if meta is None: continue` guard. Closes the same gap upstream's
#999 None-metadata audit closed in searcher.py / mcp_server.py /
miner.status, just in a different file the audit didn't reach. The
graph-build is recoverable: skipping a single None drawer doesn't
distort the graph since build_graph already filters
`room and room != "general" and wing` — a missing-metadata drawer was
never going to participate anyway.

Test: TestBuildGraph::test_none_metadata_does_not_crash mixes a None
entry into a 3-drawer fixture and asserts the two real drawers are
processed normally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:06:32 -07:00
jp bc24aa14e2 fix: skip _fix_blob_seq_ids sqlite open on already-migrated palaces (#1090)
Opening chroma.sqlite3 via Python's sqlite3.connect() against a live
ChromaDB 1.5.x WAL-mode database leaves state that segfaults the next
PersistentClient call — the same failure mode tracked at #1090.

_fix_blob_seq_ids runs unconditionally on every make_client() call, so
every fresh process (MCP server, stop hook, CLI) re-triggers the sqlite
open → corrupt → segfault cycle on palaces that have already completed
the 0.6.x → 1.5.x seq_id migration.

Guard with a .blob_seq_ids_migrated marker file in the palace directory:
- If marker exists, return immediately — skip sqlite entirely
- After successful migration (or confirmation that no BLOBs remain),
  write the marker so subsequent opens take the fast path
- Palaces that never had BLOB seq_ids also get the marker on first open,
  so they too avoid the redundant sqlite open after that
- Already-migrated palaces can touch the marker manually to opt in

Test plan: Direct test — run _fix_blob_seq_ids twice against a fresh
palace; second call returns immediately because marker exists. 1094
existing tests pass.
2026-04-25 07:42:05 -07:00
jp 67248330c5 chore: ruff format tests/test_searcher.py
CI lint job runs `ruff format --check`; the new tests in TestBM25NoneSafety
needed the standard "blank line after import-inside-function" + line-length
wrap. No logic change — formatter pass only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 07:22:53 -07:00
jp ee12c07c54 fix(searcher): tolerate None documents in BM25 reranker
`_tokenize` calls `text.lower()` unconditionally; when ChromaDB returns a
drawer with `documents` containing `None`, the hybrid-rerank path raises
`AttributeError: 'NoneType' object has no attribute 'lower'`.

Observed in production daemon log (2026-04-24 21:07:05) during a search
that triggered `_hybrid_rank → _bm25_scores → _tokenize`:

    File "mempalace/searcher.py", line 81, in _bm25_scores
        tokenized = [_tokenize(d) for d in documents]
    File "mempalace/searcher.py", line 52, in _tokenize
        return _TOKEN_RE.findall(text.lower())
    AttributeError: 'NoneType' object has no attribute 'lower'

Closes the gap left by the upstream None-metadata audit (#999), which
covered metadata loops but not BM25 helpers. Returns `[]` for falsy input
so a None doc gets score 0.0 while the rest of the corpus reranks normally.

Three regression tests in TestBM25NoneSafety lock the behavior and reference
the production trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 06:27:14 -07:00
Igor Lins e Silva 0d9929c0dd Merge pull request #976 from felipetruman/fix/hnsw-race-and-fanout
fix: HNSW graph corruption, PreCompact deadlock, mine fan-out (closes #974, #965, #955)
2026-04-25 05:00:14 -03:00
Igor Lins e Silva 7773432bca chore(rebase): reconcile with develop and apply ruff format
After rebasing onto current develop:
- chroma.py: keep develop's quarantine_stale_hnsw + UnsupportedFilterError
  validation alongside this PR's _pin_hnsw_threads retrofit.
- tests/test_backends.py: combine quarantine_stale_hnsw and
  _pin_hnsw_threads test sections; ruff format.
- miner.py: propagate the new `files=` kwarg (added on develop in #1183
  for the init -> mine flow) through _mine_impl so the caller can pass
  a pre-scanned file list under the global lock.
2026-04-25 04:39:31 -03:00
Felipe Truman 8df944a54d fix: best-effort HNSW thread-pin retrofit + drop dead attempt-cap constant
Addresses remaining PR #976 review items after rebase on develop.

`get_collection(create=False)` previously returned existing collections without
re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the
unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls
`collection.modify(configuration=UpdateCollectionConfiguration(
hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every
`get_collection` call (including the MCP server's `_get_collection`).

In chromadb 1.5.x the runtime config does not persist to disk across
`PersistentClient` reopens, so the retrofit is re-applied each process start
rather than being a one-shot migration. Fresh palaces keep the metadata-based
pin as primary defense; legacy palaces now also get per-session protection
without requiring `mempalace nuke` + re-mine.

After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and
no longer emits `decision: block`, so the attempt-cap constant was orphaned.
Grep confirms 0 usages in the repo — remove it.

- `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1)
- `_pin_hnsw_threads` swallows all errors (never raises)
- `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace
- 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)
2026-04-25 04:36:29 -03:00
Felipe Truman 40d7958ca1 test: remove attempt-cap tests obsoleted by develop's pass-through approach
PR #863 on develop eliminated precompact blocking entirely. After rebasing,
the attempt-cap tests (test_precompact_first_two_attempts_block,
test_precompact_passes_through_after_cap, test_precompact_counter_is_per_session)
would always fail because hook_precompact now mines synchronously and
passes through unconditionally. Remove them to keep the suite green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 04:34:30 -03:00
Felipe Truman 1998aede66 fix: Windows CI compat for palace lock tests and path normalization
Addresses the two actionable Copilot comments from the 2nd review pass.

tests/test_palace_locks.py (#7, #8)
  multiprocessing.get_context("fork") is unavailable on Windows, so the
  cross-process tests would crash the Windows CI runner. Added
  `_get_mp_context()` that picks "spawn" on Windows and "fork" elsewhere.
  Spawn re-imports the module in the child; it inherits os.environ
  (including the monkeypatched HOME), which is all these tests need.

mempalace/palace.py (#10)
  The per-palace lock key was computed from os.path.abspath(palace_path).
  On Windows the filesystem is case-insensitive, so `C:\\Palace` and
  `c:\\palace` would hash to different keys and two concurrent mines
  could touch the same on-disk palace. Switched to
  `os.path.normcase(os.path.realpath(...))` so:
    * realpath resolves symlinks and `..` segments
    * normcase folds case on Windows (no-op on POSIX)

Testing
  pytest tests/test_palace_locks.py tests/test_hooks_cli.py
         tests/test_backends.py tests/test_cli.py
  → 98 passed, 0 failed.
2026-04-25 04:34:30 -03:00
Felipe Truman 99b820cb42 fix: address PR review — per-palace lock, MCP server path, hook timeout, tests
Addresses the six Copilot review comments on the initial commit.

1) #6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend

   The MCP server creates its palace collection directly via
   `chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
   not through `ChromaBackend.get_collection`. That path was missing the
   `hnsw:num_threads=1` metadata, so the primary crash surface for #974
   and #965 was untouched by the original patch. Fixed by passing
   `hnsw:num_threads=1` at the mcp_server create site too. Documented
   in a code comment that the setting is only honored at creation
   time — existing palaces created before this fix still need a
   `mempalace nuke` + re-mine to gain the protection.

2) #3 — mine_global_lock over-serialized mines across unrelated palaces

   Replaced the single global lock file `mine_global.lock` with a
   per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
   (`mine_palace_<hash>.lock`). Mines against the same palace still
   collapse to a single runner (the correctness boundary), but mines
   against *different* palaces are now free to run in parallel.
   `mine_global_lock` is kept as a backward-compatible alias for
   `mine_palace_lock` so any external callers that imported the
   previous name keep working.

3) #1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired

   `subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
   palaces. The previous `except OSError` clause didn't catch it, so
   the hook could raise and fail to emit any JSON decision — leaving
   the harness without a block/passthrough signal. Fixed by catching
   `(OSError, subprocess.TimeoutExpired)` together and always falling
   through to the block decision so the hook reliably emits a response.

4) #2 + #4 — tests

   - tests/test_hooks_cli.py: added
     `test_precompact_first_two_attempts_block`,
     `test_precompact_passes_through_after_cap`, and
     `test_precompact_counter_is_per_session` to lock in the #955
     deadlock fix.
   - tests/test_palace_locks.py (new): covers `mine_palace_lock`
     single-acquire, reuse-after-release, cross-process serialization
     on the same palace, non-interference across different palaces,
     path normalization, and the `mine_global_lock` back-compat alias.

5) #5 — known limitation, documented but not auto-fixed

   Copilot suggested detecting collections missing `hnsw:num_threads=1`
   and calling `collection.modify(metadata=...)` to retrofit existing
   palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
   replaces metadata rather than merging, and re-passing
   `hnsw:space="cosine"` then raises `ValueError: Changing the
   distance function of a collection once it is created is not
   supported currently.` The HNSW runtime configuration
   (`configuration_json`) also does not expose `num_threads` in
   chromadb 1.5.x, so the flag appears to be read only at creation
   time. Rather than paper over the limitation with a best-effort
   `modify` that silently drops `hnsw:space`, documented in the
   mcp_server comment that pre-existing palaces need a
   `mempalace nuke` + re-mine to gain the protection. Fresh palaces
   are always protected.

Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
  tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
  - Different palaces → both complete in parallel ✓
  - Same palace     → one completes, the other exits with
    "another `mine` is already running against <palace> — exiting
    cleanly." ✓
2026-04-25 04:34:30 -03:00
Felipe Truman 7e18a70796 fix: resolve hooks_cli.py merge conflict + add mine_global_lock tests
- Resolve UU conflict in hooks_cli.py: take develop/HEAD approach
  (mine synchronously via _mine_sync, then pass through unconditionally).
  _mine_sync already catches subprocess.TimeoutExpired — fixes Copilot #1.
- Add tests/test_palace_locks.py: 4 tests covering mine_global_lock
  non-blocking semantics (acquire, second-acquire raises MineAlreadyRunning,
  reusable after release, release on exception) — fixes Copilot #4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 04:34:30 -03:00
Igor Lins e Silva 91a60263e3 Merge pull request #1168 from arnoldwender/fix/security-tunnels-permissions
fix(security): restrict tunnels.json file permissions
2026-04-25 04:21:44 -03:00
Igor Lins e Silva 320aab31e3 Merge pull request #939 from mvalentsev/ci/pip-cache-and-python-bump
ci: add pip caching and bump Python on macOS/Windows
2026-04-25 04:12:57 -03:00
Igor Lins e Silva 5ed24ad061 Merge pull request #969 from MemPalace/dependabot/github_actions/actions/checkout-6
chore(deps): bump actions/checkout from 4 to 6
2026-04-25 04:04:37 -03:00
Igor Lins e Silva 036742e888 Merge pull request #968 from MemPalace/dependabot/github_actions/actions/upload-pages-artifact-5
chore(deps): bump actions/upload-pages-artifact from 3 to 5
2026-04-25 04:04:18 -03:00
Igor Lins e Silva 000acc1e33 Merge pull request #967 from MemPalace/dependabot/github_actions/actions/deploy-pages-5
chore(deps): bump actions/deploy-pages from 4 to 5
2026-04-25 04:03:58 -03:00
Igor Lins e Silva 9b7536a1f7 Merge pull request #1101 from wahajahmed010/fix/hooks-tutorial-1037
docs: fix HOOKS_TUTORIAL.md paths, matcher, and missing timeout (#1037)
2026-04-25 03:48:36 -03:00
Igor Lins e Silva 374fb5656b Merge pull request #1183 from MemPalace/feat/init-mine-ux
feat(cli): init prompts to mine, mine handles Ctrl-C gracefully (#1181, #1182)
2026-04-25 01:24:23 -03:00
Igor Lins e Silva c4eeec8642 test: use shlex.quote in resume-hint assertions for Windows
The pre-existing test_maybe_run_mine_prompt_declined_prints_hint
asserted the bare unquoted form `mempalace mine {tmp_path}`. After
the production code switched to shlex.quote on the resume hint, this
passed on Linux/macOS (POSIX paths have no characters that trigger
quoting) but failed on Windows where backslashes always get wrapped
in single quotes.

Mirror the production code in the assertion via shlex.quote so it's
portable across platforms; do the same for the two new
spaces-in-path tests for consistency.
2026-04-25 01:18:31 -03:00
Igor Lins e Silva 8faf0042b5 fix(cli,mine): shell-quote project_dir in resume hints
The "Skipped. Run mempalace mine <dir>" hint after declining the init
prompt and the "Re-run mempalace mine <dir> to resume" hint after a
Ctrl-C interruption both interpolated project_dir without shell-quoting.
A path containing spaces or metacharacters produced a copy-paste-broken
command.

Both spots now use shlex.quote(project_dir). Adds regression tests
covering each hint with a path that contains a space.
2026-04-25 01:10:17 -03:00
Igor Lins e Silva 23d534f8f3 fix(init): split --auto-mine from --yes; show file-count estimate before mine prompt
Reviewer feedback on the previous commit flagged two real problems:

1. Overloading --yes to also auto-mine was a silent behaviour change for
   scripted callers. Today --yes only auto-accepts entities — making it
   ALSO trigger a multi-minute ChromaDB write breaks every script that
   currently runs `mempalace init --yes <dir>` for the fast non-interactive
   entity path. Add a separate `--auto-mine` flag instead. Combinations:

     mempalace init --yes <dir>              # entities auto, STILL prompt mine
     mempalace init --auto-mine <dir>        # prompt entities, skip mine prompt
     mempalace init --yes --auto-mine <dir>  # fully non-interactive

   --yes behaviour is now identical to pre-PR.

2. The mine prompt was firing without telling the user how big the job
   was. On a real corpus mine takes minutes-to-tens-of-minutes; hitting
   Enter on default-Y with no size cue is a footgun. Show a one-line
   estimate computed from scan_project (the same walk we hand into mine)
   BEFORE the prompt:

     ~423 files (~12 MB) would be mined into this palace.
     Mine this directory now? [Y/n]

   The estimate uses a single corpus walk: scan_project's output is
   passed into mine() via a new optional files= kwarg, so we never walk
   the tree twice.

Tests: replaced the old "--yes auto-mines" assertion with a regression
guard that --yes alone STILL prompts; added coverage for --auto-mine
alone, --yes --auto-mine together, and the pre-prompt estimate line.
2026-04-25 01:02:09 -03:00
Igor Lins e Silva f13b9a46a2 feat(cli): init prompts to mine, mine handles Ctrl-C gracefully
`mempalace init` now ends with a `Mine this directory now? [Y/n]`
prompt and runs `mine()` in-process when accepted; `--yes` skips the
prompt and auto-mines for non-interactive callers. Declining prints
the resume command. Removes the "remember to type the next command"
friction since rooms + entities just got set up.

`mempalace mine` now wraps its main loop in `try / except
KeyboardInterrupt` and prints `files_processed`, `drawers_filed`, and
`last_file` before exiting with code 130 on Ctrl-C. Re-mining is safe
because deterministic drawer IDs make the upsert idempotent. The
hooks PID lock at `~/.mempalace/hook_state/mine.pid` is now actively
removed in a `finally` when its entry points at us, on clean exit,
error, or interrupt — preventing the next hook fire from briefly
waiting on a stale PID.

Closes #1181, #1182.
2026-04-25 01:01:24 -03:00
Igor Lins e Silva 91c1d159af Merge pull request #1179 from MemPalace/fix/search-metric-quality
fix(search): CLI hybrid rerank, legacy-metric warning, invariant tests (3.3.4)
2026-04-25 00:57:25 -03:00
Igor Lins e Silva ec5f4eba9d fix(test): use tmp_path for full-stack invariant test (Windows CI)
`test_fresh_palace_via_full_stack_gets_cosine` used `tempfile.Temporary-
Directory()` as a context manager, which tries to delete the temp path
on exit. On Windows, ChromaDB still holds SQLite file handles to
`chroma.sqlite3` when the context closes, producing:

    PermissionError: [WinError 32] The process cannot access the file
    because it is being used by another process: '...\\chroma.sqlite3'
    NotADirectoryError: [WinError 267] The directory name is invalid

Other tests in the same file use pytest's `tmp_path` fixture, which
defers cleanup to session end (when the process is exiting and the
file-lock contention is moot). Align this one with the rest of the
file.

CLAUDE.md already documents the 80% Windows coverage allowance due to
"ChromaDB file lock cleanup" — the fix is to stop fighting the lock.
2026-04-25 00:39:37 -03:00
Igor Lins e Silva 133dfbfb41 fix(search): BM25 hybrid rerank, legacy-metric warning, invariant tests
Three tightly-coupled search-quality fixes for v3.3.3:

1. CLI `mempalace search` now routes through the same `_hybrid_rank`
   the MCP path already used. Drawers whose text contains every query
   term but embed as file-tree noise (directory listings, diffs, log
   fragments) were scoring cosine distance >= 1.0 — the display formula
   `max(0, 1 - dist)` then floored every result to `Match: 0.0`, with
   no way for the user to tell a lexical match from a total miss. BM25
   catches these cleanly; the display surfaces both `cosine=` and
   `bm25=` so users see which component is firing.

2. Legacy-palace distance-metric warning. Palaces created before
   `hnsw:space=cosine` was consistently set silently use ChromaDB's
   default L2 metric, which breaks the cosine-similarity formula (L2
   distances routinely exceed 1.0 on normalized 384-dim vectors). The
   search path now detects this at query time and prints a one-line
   notice pointing at `mempalace repair`. Only fires for legacy
   palaces; new palaces already set cosine correctly.

3. Invariant tests pinning `hnsw:space=cosine` on every collection-
   creation path — legacy `get_or_create_collection`, legacy
   `create_collection`, RFC 001 `get_collection(create=True)`, the
   public `palace.get_collection`, and a round-trip through reopen.
   Locks down the correctness that new-user palaces already have so a
   future refactor can't silently regress it.

Also adds a `metadata` property to `ChromaCollection` so callers can
read the underlying hnsw:space without reaching into `_collection`.

Tests:
- New regression: simulate three candidates at distance 1.5 (cosine=0),
  one containing query terms — must rank first with non-zero bm25.
- New: legacy metric (empty or non-cosine) produces stderr warning.
- New: correctly-configured palace produces no warning.
- New: all five creation paths pin cosine metadata.

All existing tests still pass.
2026-04-25 00:39:37 -03:00
Igor Lins e Silva b9e41286fa Merge pull request #1189 from MemPalace/openarena-claim
chore: add OpenArena owner claim verification file
2026-04-24 23:27:49 -03:00
Igor Lins e Silva 8d49b009e0 Merge pull request #1184 from MemPalace/feat/cross-wing-topic-tunnels
feat(graph): cross-wing tunnels by shared topics (#1180)
2026-04-24 23:25:55 -03:00
Igor Lins e Silva 0197b2eea9 chore: add OpenArena owner claim verification file 2026-04-24 23:19:29 -03:00
Igor Lins e Silva 865a36bc5c feat(graph): namespace topic-tunnel rooms with "topic:" prefix + kind field
Previously a cross-wing topic tunnel for "Angular" stored the room as
"Angular" — colliding with a wing's literal folder-derived "Angular" room
at follow_tunnels/list_tunnels read time, and exposing raw topic strings
(which may contain characters rejected by sanitize_name) to the MCP
surface.

Topic tunnels now store their room as "topic:<original-casing>" and carry
kind="topic" on the stored dict. Explicit tunnels get kind="explicit"
(default). follow_tunnels("wing", "Angular") on a literal Angular room
no longer surfaces topic connections for the same name, and any LLM
scanning list_tunnels has a visible discriminator.
2026-04-24 23:06:26 -03:00
Igor Lins e Silva fe051adc73 feat(graph): cross-wing tunnels by shared topics (#1180)
When two wings have one or more confirmed TOPIC labels in common, the
miner now drops a symmetric tunnel between them at mine time so the
palace graph reflects shared themes (frameworks, vendors, recurring
concepts).

- llm_refine: TOPIC label routes to a dedicated `topics` bucket so the
  signal survives confirmation instead of getting collapsed into
  `uncertain` and dropped.
- entity_detector / project_scanner: bucket plumbed through the
  detection pipeline; `confirm_entities` returns confirmed topics
  alongside people/projects.
- miner.add_to_known_entities: optional `wing` parameter records the
  confirmed topics under `topics_by_wing` in
  `~/.mempalace/known_entities.json`. Wing names do NOT leak into the
  flat known-name set used by drawer-tagging.
- palace_graph: `compute_topic_tunnels` and `topic_tunnels_for_wing`
  create symmetric tunnels via the existing `create_tunnel` API so they
  share dedup and persistence with explicit tunnels.
- miner.mine: post-file-loop pass calls `topic_tunnels_for_wing` for
  the freshly-mined wing. Failures are logged but never abort the mine.
- config: `topic_tunnel_min_count` knob (env
  `MEMPALACE_TOPIC_TUNNEL_MIN_COUNT` or `~/.mempalace/config.json`),
  default 1.

Tests cover topic persistence through init->mine, tunnel creation when
wings share a topic, no tunnel below threshold, cross-wing tunnel
retrieval via `list_tunnels`, dedup on recompute, case-insensitive
overlap, and the end-to-end mine-time wiring.

Out of scope for this PR (called out in the PR body): manifest-
dependency overlap, per-topic allow/deny lists, search-result surfacing.
2026-04-24 23:06:26 -03:00
Igor Lins e Silva ed2ba726c9 Merge pull request #1185 from MemPalace/perf/batched-upsert-gpu
perf(mining): batch per-chunk upserts + optional GPU acceleration
2026-04-24 20:34:28 -03:00
copilot-swe-agent[bot] 031512438e test: isolate embedding module state with monkeypatch
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3213a67a-6871-4bb2-9ae0-23fa11001a22

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
2026-04-24 23:11:29 +00:00
copilot-swe-agent[bot] 3d529e7028 test: tidy embedding follow-up imports
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3213a67a-6871-4bb2-9ae0-23fa11001a22

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
2026-04-24 23:10:20 +00:00
copilot-swe-agent[bot] 9fbdba17ca test: isolate embedding device env override tests
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3213a67a-6871-4bb2-9ae0-23fa11001a22

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
2026-04-24 23:09:23 +00:00
copilot-swe-agent[bot] 25c885ae0b test: use tmp_path for embedding device config tests
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3213a67a-6871-4bb2-9ae0-23fa11001a22

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
2026-04-24 23:08:26 +00:00
copilot-swe-agent[bot] fbd0904799 test: cover embedding device fallback and bounded upserts
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/3213a67a-6871-4bb2-9ae0-23fa11001a22

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
2026-04-24 23:06:50 +00:00
Igor Lins e Silva a4868a3589 perf(mining): batch per-chunk upserts and add optional GPU acceleration
The miner upserted one drawer per ChromaDB call, paying tokenizer +
ONNX session setup per chunk. The embedding device was CPU-only because
no EmbeddingFunction was ever wired through the backend.

Two changes, each a speedup in its own right; stacked they give ~10x
end-to-end on a medium corpus (20 files, 568 drawers):

1. Batched upsert. `process_file` and `_file_chunks_locked` now collect
   all chunks of a file into a single `collection.upsert(...)` so the
   embedding model runs one forward pass per file instead of N.

2. Hardware-accelerated embedding function. New `mempalace/embedding.py`
   wraps `ONNXMiniLM_L6_V2` with configurable `preferred_providers`.
   `MEMPALACE_EMBEDDING_DEVICE` (or `embedding_device` in config.json)
   selects auto / cpu / cuda / coreml / dml. Unavailable accelerators
   log a warning and fall back to CPU.

   The factory subclasses `ONNXMiniLM_L6_V2` and spoofs its `name()` to
   `"default"` so the persisted EF identity matches existing palaces
   created with ChromaDB's bare `DefaultEmbeddingFunction` -- same
   model, same 384-dim vectors, no rebuild needed when turning GPU on.

   `ChromaBackend.get_collection` / `create_collection` now pass the
   resolved EF on every call so miner writes and searcher reads agree.

Benchmarks (i9-12900KF + RTX 3090, medium scenario, 568 drawers):

  per-chunk + CPU   19.77s ·  29 drw/s   (baseline)
  batched   + CPU    8.07s ·  70 drw/s   (2.4x)
  batched   + CUDA   2.15s · 264 drw/s   (9.2x)

Reproducible via `benchmarks/mine_bench.py`.

Install paths:
  pip install mempalace[gpu]       # NVIDIA CUDA
  pip install mempalace[dml]       # DirectML (Windows)
  pip install mempalace[coreml]    # macOS Neural Engine

Mine header now prints `Device: cpu|cuda|...` so users can confirm the
accelerator engaged.
2026-04-24 19:42:35 -03:00
Arnold Wender 5fd09d3693 fix(security): restrict tunnels.json file permissions
~/.mempalace/tunnels.json (introduced in #790) was created via plain
open(..., "w") with no chmod, and its parent dir via os.makedirs()
without mode=0o700. On Linux with default umask 022 both end up
world-readable (0o644 / 0o755).

Tunnels reveal cross-wing connections — which projects, people, and
rooms the user has explicitly linked — so they are sensitive metadata
that should not be readable by other local users on shared systems.

Apply the same 0o700 / 0o600 pattern that #814 established for the
other sensitive palace files. Chmod calls are wrapped in try/except
(OSError, NotImplementedError) for Windows / unsupported-filesystem
compatibility.

Closes #1165
2026-04-24 22:57:34 +02:00
Igor Lins e Silva 7a757916b3 Merge pull request #1176 from MemPalace/docs/changelog-3.3.3-init-overhaul
docs(changelog): document init entity-detection overhaul in 3.3.3
2026-04-24 14:34:09 -03:00
Igor Lins e Silva 174ecaf42c Update CHANGELOG.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-24 14:33:51 -03:00
Igor Lins e Silva 431e42a720 docs(changelog): document init entity-detection overhaul in 3.3.3
Adds entries to the 3.3.3 section for the work that landed via #1148,
#1150, #1157, and #1175 (rescued from stacked feature branches into
develop via #1175). Without these entries the 3.3.3 release notes on
main would advertise only the hook/diary/search fixes that made it to
develop through the first direct merge.

Covers:
- Manifest + git-author entity detection (#1148)
- Regex detector accuracy improvements (#1148)
- Optional --llm classification with Ollama / openai-compat / Anthropic
  provider abstraction and interactive UX (#1150)
- Claude Code conversation scanner (#1150)
- Init → miner registry wire-up so confirmed entities actually reach
  drawer metadata tagging (#1157)
- Case-insensitive project dedup across all sources (#1175)
- `mempalace mine` skips the generated entities.json artifact
2026-04-24 14:25:13 -03:00
Igor Lins e Silva f246d25b7f Merge pull request #1166 from arnoldwender/fix/security-palace-path-env-normalize
fix(security): normalize MEMPALACE_PALACE_PATH env var with abspath+expanduser
2026-04-24 14:16:58 -03:00
Igor Lins e Silva 8a6ebbe363 Merge pull request #1175 from MemPalace/chore/rescue-stacked-prs-into-develop
chore: rescue merged stacked PRs #1150 and #1157 into develop
2026-04-24 14:14:50 -03:00