feat(init): wire confirmed entities into the miner's known-entities registry

The init step's output was a dead file. miner.py has always read
`~/.mempalace/known_entities.json` to tag drawer metadata with
recognized names, but nothing ever wrote it — so init's careful
manifest + git + LLM detection work stopped at `<project>/entities.json`
and never reached the path that actually uses it.

Measured delta on a representative prose snippet (eight sentences
mentioning six real people and four real projects):
- Empty registry: 0 entities recognized (multi-word names fail the
  frequency threshold; lowercase/hyphenated project names don't match
  the CamelCase regex).
- Registry populated by init: 12 entities recognized (all correct, zero
  false positives).

Every recognized name becomes a semicolon-separated metadata tag on the
drawer, which ChromaDB uses for entity-filtered search.
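As a rough illustration of the tagging step, the sketch below joins every registered name found in a snippet into one semicolon-separated string. The helper name `entities_to_tag`, the whole-word regex match, and the project name `my-tool` are assumptions made for this example; the miner's real matcher (`_extract_entities_for_metadata`) is not shown in this commit and may work differently.

```python
import re


def entities_to_tag(text: str, known_names: list[str]) -> str:
    """Hypothetical helper: return the known names found in text, joined by ';'."""
    found = []
    for name in known_names:
        # Whole-word, case-insensitive match. re.escape keeps hyphens and
        # spaces literal, so multi-word and hyphenated names match as-is.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return ";".join(found)


snippet = "Gergő Móricz reviewed the my-tool patch."
tag = entities_to_tag(snippet, ["Gergő Móricz", "my-tool", "Arturo Domínguez"])
print(tag)
```

A registry lookup like this does not depend on name frequency or CamelCase, which is why multi-word and lowercase/hyphenated names can be recognized once they are registered.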

Implementation:

- `miner.add_to_known_entities({category: [names]})` reads the existing
  registry, unions each category, and writes back. List categories are
  unioned case-insensitively, preserving first-seen casing. The function
  tolerates the two on-disk shapes miner already supports: list of
  names, or dict mapping name → code (dialect-style). In the dict case
  new names are added as keys with `None` values (matched by exact key)
  so existing codes aren't overwritten.
- Invalidates the in-process mtime cache so same-process callers
  (`cmd_init` → `cmd_mine` in one run) see the write immediately.
- Writes with `ensure_ascii=False` so non-ASCII names (Gergő Móricz,
  Arturo Domínguez, etc.) stay readable on disk.
- Chmods 0o600 — the registry mirrors confirm-step PII from the user's
  git authors and local paths.
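The per-category merge described above can be sketched in isolation. `union_category` is a hypothetical name used only for this example; the logic mirrors the three branches in the diff (list, dict, missing or unrecognized), but it is a sketch under that assumption and not a drop-in replacement.

```python
def union_category(current, new_names):
    """Merge new_names into one registry category and return the result."""
    if isinstance(current, list):
        # List shape: case-insensitive dedup, first-seen casing wins.
        seen = {str(n).lower() for n in current}
        for n in new_names:
            if n and str(n).lower() not in seen:
                current.append(n)
                seen.add(str(n).lower())
        return current
    if isinstance(current, dict):
        # Dict shape (name -> code): add unknown names with a None code,
        # never touching codes that already exist.
        for n in new_names:
            if n and n not in current:
                current[n] = None
        return current
    # Missing or unrecognized shape: start from an empty list.
    return union_category([], new_names)


print(union_category(["Alice"], ["alice", "Bob", "BOB"]))
print(union_category({"Alice": "ALC"}, ["Alice", "Bob"]))
```

Existing entries keep their position and casing, so re-running init against an already populated registry only appends names that are new.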

cmd_init now calls this at the end of the confirm-entities step, after
the per-project `entities.json` is written (which is kept as an audit
trail the user can inspect or hand-edit). The per-project file is still
excluded from mining via `SKIP_FILENAMES` from the earlier fix.
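To show why the explicit cache invalidation matters when init and mine run back to back, here is a minimal sketch of an mtime-keyed cache. `_CACHE`, `load_registry`, and `save_registry` are hypothetical stand-ins, not the miner's actual names. Without the reset in `save_registry`, two writes landing within the same timestamp tick could leave a stale cached copy.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the miner's in-process cache.
_CACHE = {"mtime": None, "raw": {}}


def load_registry(path: Path) -> dict:
    """Return the registry, re-reading the file only when its mtime changed."""
    try:
        mtime = path.stat().st_mtime
    except OSError:
        return {}
    if _CACHE["mtime"] != mtime:
        _CACHE["raw"] = json.loads(path.read_text(encoding="utf-8"))
        _CACHE["mtime"] = mtime
    return dict(_CACHE["raw"])


def save_registry(path: Path, registry: dict) -> None:
    """Write the registry and drop the cache so the next load re-reads it."""
    path.write_text(json.dumps(registry, ensure_ascii=False), encoding="utf-8")
    _CACHE["mtime"] = None
    _CACHE["raw"] = {}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "known_entities.json"
    save_registry(path, {"people": ["Alice"]})
    first = load_registry(path)
    save_registry(path, {"people": ["Alice", "Bob"]})
    second = load_registry(path)  # sees the second write immediately

print(first, second)
```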

17 new tests cover: fresh-file creation, list-category union, case-
insensitive dedup, preservation of untouched categories, dict-format
registries, malformed/non-dict file recovery, cache invalidation,
unicode round-trip, and an end-to-end verification that the miner's
`_extract_entities_for_metadata` picks up every registered name.
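A unicode round-trip check in the spirit of those tests might look like the sketch below. `write_registry` is a hypothetical helper that reproduces only the write and chmod steps from the diff; the real tests exercise `add_to_known_entities` itself and are not shown here.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_registry(path: Path, registry: dict) -> None:
    """Hypothetical helper: write readable UTF-8 JSON and restrict permissions."""
    path.write_text(
        json.dumps(registry, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    try:
        path.chmod(0o600)
    except (OSError, NotImplementedError):
        pass  # best effort; some platforms ignore POSIX modes


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "known_entities.json"
    write_registry(path, {"people": ["Gergő Móricz"]})
    raw = path.read_text(encoding="utf-8")
    round_tripped = json.loads(raw)
    mode = stat.S_IMODE(path.stat().st_mode)

# The name is stored literally, not as \u escapes.
print("Gergő" in raw, "\\u0151" in raw)
```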
Igor Lins e Silva
2026-04-24 02:09:32 -03:00
parent b150d33398
commit 4631d6a7db
3 changed files with 289 additions and 2 deletions
@@ -120,12 +120,19 @@ def cmd_init(args):
     total = len(detected["people"]) + len(detected["projects"]) + len(detected["uncertain"])
     if total > 0:
         confirmed = confirm_entities(detected, yes=getattr(args, "yes", False))
-        # Save confirmed entities to <project>/entities.json for the miner
+        # Save confirmed entities to <project>/entities.json (per-project
+        # audit trail — user can inspect or hand-edit) AND merge into the
+        # global registry the miner reads at mine time.
         if confirmed["people"] or confirmed["projects"]:
             entities_path = Path(args.dir).expanduser().resolve() / "entities.json"
             with open(entities_path, "w") as f:
-                json.dump(confirmed, f, indent=2)
+                json.dump(confirmed, f, indent=2, ensure_ascii=False)
             print(f" Entities saved: {entities_path}")
+            from .miner import add_to_known_entities
+            registry_path = add_to_known_entities(confirmed)
+            print(f" Registry updated: {registry_path}")
     else:
         print(" No entities detected — proceeding with directory-based rooms.")
@@ -472,6 +472,85 @@ def _load_known_entities_raw() -> dict:
     return dict(_ENTITY_REGISTRY_CACHE["raw"])


+def add_to_known_entities(entities_by_category: dict) -> str:
+    """Union ``entities_by_category`` into ``~/.mempalace/known_entities.json``.
+
+    Accepts ``{category: [names]}`` shape as produced by ``mempalace init``
+    and merges into the registry the miner reads at mine time. Existing
+    categories are preserved untouched unless also present in the input;
+    for categories present in both, entries are unioned case-insensitively
+    without changing the on-disk ordering of pre-existing names.
+
+    If a category is stored on-disk as ``{name: code}`` (the alternate
+    miner-supported shape, used by dialect-style configs), new names are
+    added as keys with ``None`` values so existing code mappings aren't
+    overwritten. A later compress pass can assign codes.
+
+    The in-process cache is invalidated on write so same-process callers
+    (notably ``cmd_init`` → ``cmd_mine`` in sequence) see the update
+    immediately instead of waiting for a mtime re-check.
+
+    Returns the registry path as a string for logging.
+    """
+    import json as _json
+    from pathlib import Path as _Path
+
+    registry_path = _Path(_ENTITY_REGISTRY_PATH)
+    registry_path.parent.mkdir(parents=True, exist_ok=True)
+
+    existing: dict = {}
+    if registry_path.exists():
+        try:
+            loaded = _json.loads(registry_path.read_text(encoding="utf-8"))
+            if isinstance(loaded, dict):
+                existing = loaded
+        except (_json.JSONDecodeError, OSError):
+            existing = {}
+
+    for category, names in entities_by_category.items():
+        if not isinstance(names, list) or not names:
+            continue
+        current = existing.get(category)
+        if isinstance(current, list):
+            seen_lower = {str(n).lower() for n in current}
+            for n in names:
+                if not n:
+                    continue
+                if str(n).lower() not in seen_lower:
+                    current.append(n)
+                    seen_lower.add(str(n).lower())
+        elif isinstance(current, dict):
+            for n in names:
+                if n and n not in current:
+                    current[n] = None
+        else:
+            # Missing or unrecognized shape — seed as a fresh list, deduped
+            seen: set = set()
+            ordered: list = []
+            for n in names:
+                if not n:
+                    continue
+                key = str(n).lower()
+                if key in seen:
+                    continue
+                seen.add(key)
+                ordered.append(n)
+            existing[category] = ordered
+
+    registry_path.write_text(_json.dumps(existing, indent=2, ensure_ascii=False), encoding="utf-8")
+    try:
+        registry_path.chmod(0o600)
+    except (OSError, NotImplementedError):
+        pass
+
+    # Invalidate in-process cache so later calls in the same run see the write.
+    _ENTITY_REGISTRY_CACHE["mtime"] = None
+    _ENTITY_REGISTRY_CACHE["names"] = frozenset()
+    _ENTITY_REGISTRY_CACHE["raw"] = {}
+    return str(registry_path)
+
+
 _HALL_KEYWORDS_CACHE = None