Commit Graph

16 Commits

Author SHA1 Message Date
jp feba7e8043 fix(miner): same None-metadata guard for status() histogram loop
`status()` walks `col.get(include=["metadatas"])` and buckets each drawer
into a `wing_rooms[wing][room]` histogram. The same ChromaDB return shape
fixed in the search print path — `None` entries in the `metadatas` list
for drawers with no stored metadata — crashes the status command with:

    AttributeError: 'NoneType' object has no attribute 'get'

Applies the matching ``m = m or {}`` guard so None-metadata drawers roll
up under the existing `?/?` fallback bucket instead of killing the
command mid-tally. Reproduced on a 135K-drawer palace where two drawers
had `metadata=None`; both now show under `WING: ? / ROOM: ?` in the
tally while the command prints the full histogram as designed.

Adds a regression test that feeds `status()` a fake collection whose
`get()` returns a `None` in the middle of the metadatas list and asserts
both the fallback bucket and the real wing render.
2026-04-18 10:26:11 -07:00
mvalentsev 8bf940f861 fix: use i18n candidate patterns for entity extraction in miner and palace
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
2026-04-16 10:35:40 +05:00
Matt Van Horn e8e93b53c0 fix: allow mining directories without local mempalace.yaml
When no mempalace.yaml or mempal.yaml exists in the source directory,
return a default config (wing = directory name, room = general) instead
of calling sys.exit(1). This lets users mine any directory into their
palace without requiring init first.

Closes #14.
2026-04-14 13:53:07 -03:00
Igor Lins e Silva 5320246297 Merge pull request #807 from sha2fiddy/fix/218-cosine-distance-metadata
Fix: set cosine distance metadata on all collection creation sites
2026-04-13 21:18:40 -03:00
eblander 8dc5970ca9 Fix: ruff format with CI-pinned version (0.4.x) 2026-04-13 18:29:48 -04:00
Igor Lins e Silva 7e5eeda9a5 feat(normalize): auto-rebuild stale drawers via NORMALIZE_VERSION schema gate
Without this, the strip_noise improvement only helps new mines. Every
user who had already mined Claude Code JSONL sessions would keep their
noise-polluted drawers forever, because convo_miner's file_already_mined
skip short-circuits before re-processing.

Adds a versioned schema gate so upgrades propagate silently:

- palace.NORMALIZE_VERSION=2 — bumped when the normalization pipeline
  changes shape (this PR's strip_noise is the v1→v2 bump).
- file_already_mined now returns False if the stored normalize_version
  is missing or less than current, triggering a rebuild on next mine.
- Both miners stamp drawers with the current normalize_version.
- convo_miner now purges stale drawers before inserting fresh chunks
  (mirrors miner.py's existing delete+insert), extracted into
  _file_convo_chunks helper to keep mine_convos under ruff's C901 limit.

User experience: upgrade mempalace, run `mempalace mine` as usual, old
noisy drawers get silently replaced with clean ones. No erase needed,
no "you need to rebuild" changelog footgun.

Tests:
- test_file_already_mined_returns_false_for_stale_normalize_version —
  pins the version gate contract for missing/v1/current.
- test_add_drawer_stamps_normalize_version — fresh project-miner drawers
  carry the field.
- test_mine_convos_rebuilds_stale_drawers_after_schema_bump — end-to-end
  proof that a pre-v2 palace gets silently cleaned on next mine, with
  orphan drawers purged and NOT skipped.

Existing test_file_already_mined_check_mtime updated to include the
new field; all other tests unaffected.
2026-04-13 16:20:55 -03:00
eblander 1e86892e62 Fix: set cosine distance metadata on all collection creation sites
ChromaDB defaults HNSW index to L2 (Euclidean) distance, but
MemPalace scoring uses 1-distance which requires cosine (range 0-2).
Add metadata={"hnsw:space": "cosine"} to the 4 production and 3 test
call sites that were missing it.

Closes #218
2026-04-13 11:00:52 -04:00
Mikhail Valentsev 091c2fe1c6 fix: mine --dry-run TypeError on files with room=None (#586) (#687)
* fix: return "general" room from process_file error paths (#586)

process_file() returned (0, None) for already-mined, unreadable, and
too-short files.  In --dry-run mode the caller always enters the
room_counts branch, so None ended up as a dict key and crashed the
summary printer with "unsupported format string passed to
NoneType.__format__".

Returning "general" instead of None makes the function contract
explicit: it always yields (int, str).  This matches the consensus
fix discussed in the issue thread.

* style: apply ruff format to test_miner.py
2026-04-12 14:23:44 -07:00
Sergey Kuznetsov ae5196bc8d Мempalace backend seam (#413)
* refactor: add stage-1 backend abstraction seam

Introduce the first upstreamable storage seam for MemPalace without
bringing in the PostgreSQL spike or any benchmark artifacts.

This change adds a small backend package with:
- BaseCollection as the minimal collection contract
- ChromaBackend/ChromaCollection as the default implementation

It then routes the main runtime collection consumers through that seam:
- palace.py
- searcher.py
- layers.py
- palace_graph.py
- mcp_server.py
- miner.status()

Behavioral constraints kept for stage 1:
- ChromaDB remains the only backend and the default path
- no config/env backend selection yet
- no PostgreSQL code
- no benchmark or research files
- existing tests stay unchanged

Important compatibility details:
- read paths now call the seam with create=False so they still surface
  the existing 'no palace found' behavior instead of silently creating
  empty collections
- write paths keep create=True semantics through palace.get_collection()
- layers/searcher retain a chromadb module attribute so the existing
  mock-based tests can keep patching PersistentClient unchanged
- ChromaBackend only creates palace directories on create=True, which
  preserves mocked read-path tests that use fake read-only paths

Verification:
- python3 -m py_compile mempalace/backends/__init__.py mempalace/backends/base.py mempalace/backends/chroma.py mempalace/palace.py mempalace/searcher.py mempalace/layers.py mempalace/palace_graph.py mempalace/mcp_server.py mempalace/miner.py
- pytest -q  # 529 passed, 106 deselected

* refactor: clean up stage-1 seam compatibility shims

Tighten the stage-1 backend abstraction branch after review.

This follow-up does three small things:
- keep the chromadb compatibility hook in searcher.py and layers.py,
  but express it through the backends.chroma module so it no longer
  reads like an accidental unused import
- fix the palace_graph.py helper alias to avoid the local name collision
  flagged by ruff (imported helper vs local _get_collection wrapper)
- preserve the existing mock-based test patch points unchanged while
  keeping the new backend seam intact

Why this matters:
- the direct  form looked like a
  dead import in review, even though it was intentionally preserving the
  existing test seam ( and
  )
- palace_graph.py had a real lint issue ( redefinition) that was
  small but worth fixing before a public PR

Verification:
- /opt/homebrew/bin/ruff check mempalace/backends/__init__.py mempalace/backends/base.py mempalace/backends/chroma.py mempalace/palace.py mempalace/searcher.py mempalace/layers.py mempalace/palace_graph.py mempalace/mcp_server.py mempalace/miner.py
- pytest -q tests/test_layers.py tests/test_searcher.py
- pytest -q  # 529 passed, 106 deselected

* docs: explain backend shim imports in search paths

Add short code comments in searcher.py and layers.py explaining why the
module-level `chromadb` alias remains after the stage-1 backend seam
refactor.

The alias is intentional: it preserves the existing mock patch points used
by the current test suite (`mempalace.searcher.chromadb.PersistentClient`
and `mempalace.layers.chromadb.PersistentClient`) while the runtime logic
now flows through the backend abstraction.

This keeps the public PR easier to review because the apparent "unused
import" now has an explicit reason next to it.

Verification:
- /opt/homebrew/bin/ruff check mempalace/searcher.py mempalace/layers.py
- pytest -q tests/test_layers.py tests/test_searcher.py

* refactor: reuse a default backend instance in palace helper

Tighten the stage-1 backend seam by promoting the default Chroma backend
adapter to a module-level singleton in `mempalace/palace.py`.

This keeps the stage-1 scope unchanged — Chroma is still the only backend
wired in this branch — but avoids constructing a fresh `ChromaBackend()`
object on every `get_collection()` call. The backend is stateless today,
so this is a readability/cleanup change rather than a behavioral one.

Why this helps:
- makes `palace.get_collection()` read like a real default factory instead
  of an inline constructor call
- keeps the stage-1 branch a little cleaner before opening the public PR
- does not widen the backend surface or change any config/runtime behavior

Verification:
- python3 -m py_compile mempalace/palace.py
- pytest -q tests/test_miner.py tests/test_layers.py tests/test_searcher.py
- pytest -q  # 529 passed, 106 deselected

* fix: harden read-only seam behavior and update seam tests

Preserve the stage-1 backend abstraction while closing the real read-path
regression surfaced in PR review.

What changed:
- make ChromaBackend.get_collection(create=False) fail fast when the palace
  directory does not exist instead of letting PersistentClient create it as a
  side effect
- update miner.status() to call get_collection(..., create=False) so status
  keeps the historical 'No palace found' behavior
- remove the temporary chromadb shim aliases from layers.py and searcher.py
  now that the tests patch the seam directly
- add focused tests for the new backends package, including ChromaCollection
  delegation and ChromaBackend create=True/create=False behavior
- retarget layer/searcher tests to patch the backend seam instead of patching
  chromadb.PersistentClient inside production modules
- add a regression test that status() does not create an empty palace when the
  target path is missing

Verification:
- ruff check .
- uv run pytest -q
- uv run pytest -q tests/test_backends.py tests/test_cli.py tests/test_mcp_server.py tests/test_layers.py tests/test_searcher.py tests/test_miner.py

Notes:
- the separate benchmark/slow/stress layer was started as a soak but not used
  as the merge gate for this PR branch

* refactor: drop duplicate mcp collection cache declaration

Remove a redundant `_collection_cache = None` assignment in
`mempalace/mcp_server.py` left over after the stage-1 backend seam refactor.

This does not change behavior; it only trims review noise in the MCP server
module after the read-path hardening pass.

Verification:
- ruff check mempalace/mcp_server.py
- uv run pytest -q tests/test_mcp_server.py

---------

Co-authored-by: Sergey Kuznetsov <sergey@iterudit.com>
2026-04-11 16:16:49 -07:00
bensig 58b8d5b198 fix: release ChromaDB handles before rmtree on Windows 2026-04-09 09:31:55 -07:00
bensig 1c48f4d2c3 fix: use os.utime in mtime test for Windows compatibility 2026-04-09 09:23:08 -07:00
bensig 2448ac0026 test: add coverage for file_already_mined mtime check
Covers the check_mtime=True path in palace.py to meet 85% coverage threshold.
2026-04-09 08:56:28 -07:00
Tal Muskal abd52534bb test: bring coverage to 85%, set threshold to 85, reset version to 3.0.11
- Add tests for config, convo_miner, spellcheck, knowledge_graph
- Fix Windows PermissionError in test cleanup (chromadb file locks)
- Add UTF-8 encoding to split_mega_files, entity_registry, hooks_cli
- Fix mcp_server parse_known_args logging for unknown args
- Set coverage threshold to 85 in pyproject.toml and CI
- Reset all version files to 3.0.11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 21:38:12 +03:00
ac-opensource c8c220d789 fix: support nested .gitignore rules during mining 2026-04-08 00:02:21 +08:00
ac-opensource 9b9daa9b4b fix: respect .gitignore during project mining 2026-04-07 22:26:06 +08:00
bensig 0f8fa8c7d5 bench: add benchmark runners, results docs, and test suite
Benchmarks: LongMemEval, LoCoMo, ConvoMem, MemBench runners with
methodology docs and hybrid retrieval analysis.

Tests: config, miner, convo_miner, normalize — 9 tests, all passing.
2026-04-04 18:33:42 -07:00