Merge pull request #1004 from coogie/coogie/fix/miner-routing
Tests / test-linux (3.11) (push) Successful in 13m29s
Tests / test-linux (3.13) (push) Successful in 13m6s
Tests / test-linux (3.9) (push) Successful in 13m14s
Tests / lint (push) Successful in 9m24s
Tests / test-windows (push) Has been cancelled
Tests / test-macos (push) Has been cancelled
Tests / test-linux (3.11) (push) Successful in 13m29s
Tests / test-linux (3.13) (push) Successful in 13m6s
Tests / test-linux (3.9) (push) Successful in 13m14s
Tests / lint (push) Successful in 9m24s
Tests / test-windows (push) Has been cancelled
Tests / test-macos (push) Has been cancelled
fix(miner): use token-boundary matching in detect_room
This commit is contained in:
@@ -14,6 +14,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
|
|||||||
- **Knowledge-graph triples with `valid_to < valid_from` were silently invisible.** `KnowledgeGraph.query_entity()` filters with `valid_from <= as_of AND valid_to >= as_of`, so an inverted interval matches no `as_of` and the row is durably stored but unreachable — a P0 data-integrity foot-gun any caller that mixes up the two date params can hit. `add_triple()` now rejects inverted intervals at write time with a clear `ValueError` naming both bounds. Open intervals (one bound only) and point-in-time facts (`valid_from == valid_to`) remain accepted unchanged. (#1214)
|
- **Knowledge-graph triples with `valid_to < valid_from` were silently invisible.** `KnowledgeGraph.query_entity()` filters with `valid_from <= as_of AND valid_to >= as_of`, so an inverted interval matches no `as_of` and the row is durably stored but unreachable — a P0 data-integrity foot-gun any caller that mixes up the two date params can hit. `add_triple()` now rejects inverted intervals at write time with a clear `ValueError` naming both bounds. Open intervals (one bound only) and point-in-time facts (`valid_from == valid_to`) remain accepted unchanged. (#1214)
|
||||||
- **`ChromaBackend.close_palace()` / `close()` did not release the SQLite file lock.** Evicted clients sat in `_clients` without `close()`, and chromadb 1.5.x retains the rust-side SQLite lock until GC. Reopening the same palace path after `shutil.rmtree` + recreate within one process failed with `SQLITE_READONLY_DBMOVED` (code 1032). New `_close_client()` helper now calls `PersistentClient.close()` (with a try/except fallback for older chromadb) on `close_palace()`, on whole-backend `close()`, and on the `_client()` invalidation path that detects a missing `chroma.sqlite3`. The mtime/inode auto-invalidation branch is intentionally left alone — callers there may still hold a live `ChromaCollection`. (#1067, #1105)
|
- **`ChromaBackend.close_palace()` / `close()` did not release the SQLite file lock.** Evicted clients sat in `_clients` without `close()`, and chromadb 1.5.x retains the rust-side SQLite lock until GC. Reopening the same palace path after `shutil.rmtree` + recreate within one process failed with `SQLITE_READONLY_DBMOVED` (code 1032). New `_close_client()` helper now calls `PersistentClient.close()` (with a try/except fallback for older chromadb) on `close_palace()`, on whole-backend `close()`, and on the `_client()` invalidation path that detects a missing `chroma.sqlite3`. The mtime/inode auto-invalidation branch is intentionally left alone — callers there may still hold a live `ChromaCollection`. (#1067, #1105)
|
||||||
- **`EntityRegistry.save()` could leave a corrupt or empty `entity_registry.json` on crash.** `Path.write_text()` is not atomic — kernel sees `open('w')` (truncate), `write`, `close`, and any failure between truncate and full-flush (power loss, OOM, FS-full, kill -9) wipes the months-of-mining people/projects map silently (the registry's `load()` swallows `JSONDecodeError`). Save now writes to a sibling `.tmp` in the same directory, `fsync`s, `chmod 0o600`s, then `os.replace()`s into place — atomic on POSIX and Windows. The previous registry stays intact on any crash before the rename returns. (#1215)
|
- **`EntityRegistry.save()` could leave a corrupt or empty `entity_registry.json` on crash.** `Path.write_text()` is not atomic — kernel sees `open('w')` (truncate), `write`, `close`, and any failure between truncate and full-flush (power loss, OOM, FS-full, kill -9) wipes the months-of-mining people/projects map silently (the registry's `load()` swallows `JSONDecodeError`). Save now writes to a sibling `.tmp` in the same directory, `fsync`s, `chmod 0o600`s, then `os.replace()`s into place — atomic on POSIX and Windows. The previous registry stays intact on any crash before the rename returns. (#1215)
|
||||||
|
- **`miner.detect_room` bidirectional substring matching caused systemic misrouting.** The priority-1 (path parts) and priority-2 (filename) checks used `c in part or part in c` against room names + keywords, so any token that was an unbounded substring of a room name (or vice versa) matched. Priority-1 iterates left-to-right and returns on first match, so `views/billing-page/src/Foo.test.tsx` routed to an `interviews` room because `"views" in "interviews"` matched before reaching `billing-page`. Both call sites now use a `_name_matches` helper that compares names as equal or as separator-bounded tokens of each other (split on `-`, `_`, `.`, `/`). (#1004, closes #1002)
|
||||||
- **`mempalace compress` crashed on large palaces.** `regenerate_closets` fetched all closet_llm drawers in a single `col.get()`, which trips `SQLITE_MAX_VARIABLE_NUMBER` on palaces above ~32k drawers. Mirrors the #851 fix in `miner.py`: drawer fetch is now paginated at `batch_size=5000`. Per-source aggregation works across batches, so the LLM regeneration call still groups chunks correctly. (#1073, #1107)
|
- **`mempalace compress` crashed on large palaces.** `regenerate_closets` fetched all closet_llm drawers in a single `col.get()`, which trips `SQLITE_MAX_VARIABLE_NUMBER` on palaces above ~32k drawers. Mirrors the #851 fix in `miner.py`: drawer fetch is now paginated at `batch_size=5000`. Per-source aggregation works across batches, so the LLM regeneration call still groups chunks correctly. (#1073, #1107)
|
||||||
- **CLI and `fact_checker --stdin` mojibaked non-ASCII content on Windows.** Python defaults `sys.stdin`/`stdout`/`stderr` to the system ANSI codepage (cp1252/cp1251/cp950), so `mempalace search > out.txt` and piped fact_checker invocations corrupted Cyrillic / CJK drawer text at the process boundary. New `mempalace/_stdio.py` helper reconfigures all three streams to UTF-8 on `sys.platform == "win32"`, with per-stream `errors` policy: `surrogateescape` on stdin (preserves bad bytes from redirected files for the consumer's parser), `replace` on stdout/stderr (substitutes U+FFFD instead of `UnicodeEncodeError`-ing mid-print). With this, all three user-facing console_scripts (`mcp_server`, `hooks_cli`, `cli`/`fact_checker`) now reconfigure identically on Windows. (#1282)
|
- **CLI and `fact_checker --stdin` mojibaked non-ASCII content on Windows.** Python defaults `sys.stdin`/`stdout`/`stderr` to the system ANSI codepage (cp1252/cp1251/cp950), so `mempalace search > out.txt` and piped fact_checker invocations corrupted Cyrillic / CJK drawer text at the process boundary. New `mempalace/_stdio.py` helper reconfigures all three streams to UTF-8 on `sys.platform == "win32"`, with per-stream `errors` policy: `surrogateescape` on stdin (preserves bad bytes from redirected files for the consumer's parser), `replace` on stdout/stderr (substitutes U+FFFD instead of `UnicodeEncodeError`-ing mid-print). With this, all three user-facing console_scripts (`mcp_server`, `hooks_cli`, `cli`/`fact_checker`) now reconfigure identically on Windows. (#1282)
|
||||||
- **MCP knowledge-graph tools forwarded malformed date strings to SQLite.** `tool_kg_query` (`as_of`), `tool_kg_add` (`valid_from`), and `tool_kg_invalidate` (`ended`) accepted any string and produced empty result sets on natural-language inputs like `"March 2026"` or `"yesterday"` — callers (especially LLM agents) could not distinguish "no fact at this time" from "your date format was unrecognized." New `sanitize_iso_date()` validator in `config.py` accepts `YYYY`, `YYYY-MM`, `YYYY-MM-DD` (and passes through `None`/`""`); all three tools call it before values reach the storage layer. **Behavior change:** previously-silent date typos now raise a clear `ValueError` naming the offending field; full ISO-8601 with time (`YYYY-MM-DDTHH:MM:SS`, timezone offsets) is not yet accepted — file an issue if you have a use case. (#1164, #1167)
|
- **MCP knowledge-graph tools forwarded malformed date strings to SQLite.** `tool_kg_query` (`as_of`), `tool_kg_add` (`valid_from`), and `tool_kg_invalidate` (`ended`) accepted any string and produced empty result sets on natural-language inputs like `"March 2026"` or `"yesterday"` — callers (especially LLM agents) could not distinguish "no fact at this time" from "your date format was unrecognized." New `sanitize_iso_date()` validator in `config.py` accepts `YYYY`, `YYYY-MM`, `YYYY-MM-DD` (and passes through `None`/`""`); all three tools call it before values reach the storage layer. **Behavior change:** previously-silent date typos now raise a clear `ValueError` naming the offending field; full ISO-8601 with time (`YYYY-MM-DDTHH:MM:SS`, timezone offsets) is not yet accepted — file an issue if you have a use case. (#1164, #1167)
|
||||||
|
|||||||
+25
-2
@@ -8,6 +8,7 @@ Stores verbatim chunks as drawers. No summaries. Ever.
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
import os
|
import os
|
||||||
|
import re
|
||||||
import sys
|
import sys
|
||||||
import shlex
|
import shlex
|
||||||
import hashlib
|
import hashlib
|
||||||
@@ -331,6 +332,28 @@ def load_config(project_dir: str) -> dict:
|
|||||||
# FILE ROUTING — which room does this file belong to?
|
# FILE ROUTING — which room does this file belong to?
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
|
|
||||||
|
_TOKEN_SPLIT = re.compile(r"[-_./]+")
|
||||||
|
|
||||||
|
|
||||||
|
def _tokens(value: str) -> set:
|
||||||
|
"""Split ``value`` into lowercased tokens bounded by ``-``, ``_``, ``.`` or ``/``."""
|
||||||
|
return {t for t in _TOKEN_SPLIT.split(value.lower()) if t}
|
||||||
|
|
||||||
|
|
||||||
|
def _name_matches(a: str, b: str) -> bool:
|
||||||
|
"""Return True when ``a`` and ``b`` match as equal strings or as
|
||||||
|
separator-bounded tokens of each other.
|
||||||
|
|
||||||
|
Prevents incidental substring collisions (e.g., ``"views" in "interviews"``)
|
||||||
|
that a raw ``in`` check would produce, while preserving the intended
|
||||||
|
match for real tokens (e.g., ``"frontend"`` in ``"frontend-app"``).
|
||||||
|
"""
|
||||||
|
a = a.lower()
|
||||||
|
b = b.lower()
|
||||||
|
if a == b:
|
||||||
|
return True
|
||||||
|
return b in _tokens(a) or a in _tokens(b)
|
||||||
|
|
||||||
|
|
||||||
def detect_room(filepath: Path, content: str, rooms: list, project_path: Path) -> str:
|
def detect_room(filepath: Path, content: str, rooms: list, project_path: Path) -> str:
|
||||||
"""
|
"""
|
||||||
@@ -350,12 +373,12 @@ def detect_room(filepath: Path, content: str, rooms: list, project_path: Path) -
|
|||||||
for part in path_parts[:-1]: # skip filename itself
|
for part in path_parts[:-1]: # skip filename itself
|
||||||
for room in rooms:
|
for room in rooms:
|
||||||
candidates = [room["name"].lower()] + [k.lower() for k in room.get("keywords", [])]
|
candidates = [room["name"].lower()] + [k.lower() for k in room.get("keywords", [])]
|
||||||
if any(part == c or c in part or part in c for c in candidates):
|
if any(_name_matches(part, c) for c in candidates):
|
||||||
return room["name"]
|
return room["name"]
|
||||||
|
|
||||||
# Priority 2: filename matches room name
|
# Priority 2: filename matches room name
|
||||||
for room in rooms:
|
for room in rooms:
|
||||||
if room["name"].lower() in filename or filename in room["name"].lower():
|
if _name_matches(filename, room["name"]):
|
||||||
return room["name"]
|
return room["name"]
|
||||||
|
|
||||||
# Priority 3: keyword scoring from room keywords + name
|
# Priority 3: keyword scoring from room keywords + name
|
||||||
|
|||||||
+98
-1
@@ -7,7 +7,7 @@ from pathlib import Path
|
|||||||
import chromadb
|
import chromadb
|
||||||
import yaml
|
import yaml
|
||||||
|
|
||||||
from mempalace.miner import load_config, mine, scan_project, status
|
from mempalace.miner import detect_room, load_config, mine, scan_project, status
|
||||||
from mempalace.palace import NORMALIZE_VERSION, file_already_mined
|
from mempalace.palace import NORMALIZE_VERSION, file_already_mined
|
||||||
|
|
||||||
|
|
||||||
@@ -491,6 +491,103 @@ def test_file_already_mined_returns_false_for_stale_normalize_version():
|
|||||||
shutil.rmtree(tmpdir, ignore_errors=True)
|
shutil.rmtree(tmpdir, ignore_errors=True)
|
||||||
|
|
||||||
|
|
||||||
|
def test_detect_room_uses_token_boundary_matching(tmp_path):
|
||||||
|
"""Path-part routing must not fire on incidental substrings.
|
||||||
|
|
||||||
|
Regression: "views" is a substring of "interviews", so the old
|
||||||
|
substring check routed every file under views/ into a room keyed
|
||||||
|
by "interviews". Token-boundary matching prevents this while still
|
||||||
|
matching real tokens like "frontend" in "frontend-app".
|
||||||
|
"""
|
||||||
|
project = tmp_path
|
||||||
|
rooms = [
|
||||||
|
{"name": "billing-page", "keywords": ["billing-page"]},
|
||||||
|
{"name": "interviews", "keywords": ["interviews"]},
|
||||||
|
{"name": "general", "keywords": []},
|
||||||
|
]
|
||||||
|
|
||||||
|
# views/<X>/... must NOT route to "interviews" on the "views"⊂"interviews" accident
|
||||||
|
view_file = project / "views" / "billing-page" / "Foo.test.tsx"
|
||||||
|
view_file.parent.mkdir(parents=True)
|
||||||
|
view_file.write_text("content")
|
||||||
|
assert detect_room(view_file, "content", rooms, project) == "billing-page"
|
||||||
|
|
||||||
|
# data/interviews/... must route to "interviews" via the real token
|
||||||
|
data_file = project / "data" / "interviews" / "index.ts"
|
||||||
|
data_file.parent.mkdir(parents=True)
|
||||||
|
data_file.write_text("content")
|
||||||
|
assert detect_room(data_file, "content", rooms, project) == "interviews"
|
||||||
|
|
||||||
|
|
||||||
|
def test_detect_room_preserves_token_matches(tmp_path):
|
||||||
|
"""Real separator-bounded tokens still match in both directions."""
|
||||||
|
project = tmp_path
|
||||||
|
rooms = [
|
||||||
|
{"name": "frontend", "keywords": ["frontend"]},
|
||||||
|
{"name": "general", "keywords": []},
|
||||||
|
]
|
||||||
|
|
||||||
|
# path part contains keyword as a token
|
||||||
|
f1 = project / "frontend-app" / "main.ts"
|
||||||
|
f1.parent.mkdir(parents=True)
|
||||||
|
f1.write_text("x")
|
||||||
|
assert detect_room(f1, "x", rooms, project) == "frontend"
|
||||||
|
|
||||||
|
# keyword contains path part as a token (reverse direction)
|
||||||
|
rooms2 = [
|
||||||
|
{"name": "data-retention", "keywords": ["data-retention"]},
|
||||||
|
{"name": "general", "keywords": []},
|
||||||
|
]
|
||||||
|
f2 = project / "data" / "data-retention" / "policy.ts"
|
||||||
|
f2.parent.mkdir(parents=True)
|
||||||
|
f2.write_text("x")
|
||||||
|
assert detect_room(f2, "x", rooms2, project) == "data-retention"
|
||||||
|
|
||||||
|
|
||||||
|
def test_detect_room_matches_keyword_distinct_from_name(tmp_path):
|
||||||
|
"""Regression: PR #145 — path part must match a keyword even when the
|
||||||
|
room name itself doesn't contain the path part as a token.
|
||||||
|
|
||||||
|
Scenario: a folder named ``docs/`` should route to a room named
|
||||||
|
``documentation`` that declares ``"docs"`` as a keyword.
|
||||||
|
"""
|
||||||
|
project = tmp_path
|
||||||
|
rooms = [
|
||||||
|
{"name": "documentation", "keywords": ["docs"]},
|
||||||
|
{"name": "general", "keywords": []},
|
||||||
|
]
|
||||||
|
|
||||||
|
f = project / "docs" / "readme.md"
|
||||||
|
f.parent.mkdir(parents=True)
|
||||||
|
f.write_text("x")
|
||||||
|
assert detect_room(f, "x", rooms, project) == "documentation"
|
||||||
|
|
||||||
|
|
||||||
|
def test_detect_room_filename_match_uses_token_boundary(tmp_path):
|
||||||
|
"""Priority 2 (filename match) must also use token-boundary rules."""
|
||||||
|
project = tmp_path
|
||||||
|
rooms = [
|
||||||
|
{"name": "review", "keywords": []},
|
||||||
|
{"name": "general", "keywords": []},
|
||||||
|
]
|
||||||
|
|
||||||
|
# "review" is a substring of "reviewmodule" but not a token — should NOT match
|
||||||
|
f1 = project / "reviewmodule.ts"
|
||||||
|
f1.write_text("x")
|
||||||
|
assert detect_room(f1, "x", rooms, project) != "review"
|
||||||
|
|
||||||
|
# "review" IS a token of "review-page" — should match
|
||||||
|
f2 = project / "review-page.ts"
|
||||||
|
f2.write_text("x")
|
||||||
|
assert detect_room(f2, "x", rooms, project) == "review"
|
||||||
|
|
||||||
|
# Dotted filename stems like "Foo.test" split on "." too
|
||||||
|
rooms3 = [{"name": "foo", "keywords": []}, {"name": "general", "keywords": []}]
|
||||||
|
f3 = project / "foo.test.ts"
|
||||||
|
f3.write_text("x")
|
||||||
|
assert detect_room(f3, "x", rooms3, project) == "foo"
|
||||||
|
|
||||||
|
|
||||||
def test_add_drawer_stamps_normalize_version(tmp_path):
|
def test_add_drawer_stamps_normalize_version(tmp_path):
|
||||||
"""Fresh drawers carry the current schema version so future upgrades work."""
|
"""Fresh drawers carry the current schema version so future upgrades work."""
|
||||||
from mempalace.miner import add_drawer
|
from mempalace.miner import add_drawer
|
||||||
|
|||||||
Reference in New Issue
Block a user