feat(init): context-aware corpus detection
10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.

This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path so that `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.

The headline change is that `mempalace init` now understands what kind of folder you're pointing it at (AI conversations, regular writing, code, narrative) and adapts how it classifies entities accordingly. A folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) used to dump those names into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.

The broader change is that `mempalace init` got upgraded across the board: smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.

## What changes for users (in order, from `pip install` onwards)

**Install:** `pip install mempalace` is unchanged. The package itself didn't shift.

**First run: `mempalace init <folder>`**

1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step; it dove straight into entity detection with no corpus context.
2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3; three buckets are supported by `mempalace.llm_client`:
   - **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`): the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~\$0.01 per `init`.
   - **Ollama** (`--llm-provider ollama`, the default): local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
   - **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`): per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.
3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. The new default-on behavior needed a graceful fallback to support it. `--no-llm` is the new explicit opt-out.
4. **`init` shows you what it detected.** A one-line banner, such as `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)`, tells you at a glance whether mempalace understood your folder.
5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.
6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.
7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it.
8. **Re-running `init` is now idempotent.** Bug fix: running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.

**Later: when your folder grows**

9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again. `init` is now idempotent (item 8), so re-running it on the same folder is safe.

## Behind the changes

- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection:
  - a regex heuristic with a co-occurrence rule, which suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, and llama-rancher journals don't false-positive;
  - an LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure, with belt-and-suspenders user-vs-agent disambiguation.
- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When it is present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`.
- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt, giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.
- **`init` overhaul.** Pass 0 (corpus-origin detection) is inserted before the existing Pass 1 (entity discovery). `--llm` is flipped to default-on. `--no-llm` is added. A graceful-fallback path replaces the previous hard error on a missing LLM. Provider precedence is unchanged from the existing `llm_client` module.
- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.
- **`CLAUDE.md` design principle reworded:** "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.

## Cost story

- **Anthropic (verified path):** ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.

## Backwards compatibility

- Every public function that gained the `corpus_origin` kwarg takes it as optional (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged: no `agent_personas` key, no behavioral change.
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.

## Test coverage

- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end runs through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection`; chromadb optional dep) are unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.

## Hygiene guardrail

This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts that no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. The test goes red if anything slips in. An allowlist covers legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
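
The co-occurrence rule described under "Behind the changes" can be illustrated with a minimal sketch. The term lists, function names, and return shape below are hypothetical stand-ins and are not taken from `mempalace/corpus_origin.py`. The sketch only shows the idea that ambiguous brand words count as evidence solely when an unambiguous AI signal appears in the same text.

```python
import re

# Hypothetical term lists, chosen for illustration only.
UNAMBIGUOUS_TERMS = ("ChatGPT", "Anthropic", "OpenAI", "LLM")
AMBIGUOUS_TERMS = ("Claude", "Gemini", "Haiku", "Llama")


def _count_terms(text: str, terms: tuple) -> int:
    """Count whole-word, case-insensitive occurrences of any term."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def looks_like_ai_dialogue(text: str) -> dict:
    """Toy heuristic: ambiguous hits only count next to a strong signal."""
    strong = _count_terms(text, UNAMBIGUOUS_TERMS)
    weak = _count_terms(text, AMBIGUOUS_TERMS)
    # Co-occurrence rule: suppress ambiguous hits when no strong signal exists.
    counted_weak = weak if strong > 0 else 0
    return {
        "is_ai_dialogue": strong + counted_weak > 0,
        "strong_hits": strong,
        "ambiguous_hits_counted": counted_weak,
    }
```

Under these assumptions, a novel whose protagonist is named Claude yields no counted hits, while a transcript that also mentions an unambiguous term lets the `Claude` mentions contribute. The word-boundary anchors keep a term like `LLM` from matching inside `llama`.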
@@ -2,6 +2,9 @@
 """
 entity_detector.py — Auto-detect people and projects from file content.
 
+Uses ``from __future__ import annotations`` so PEP 604 union syntax
+(``dict | None``) works on the Python 3.9 baseline.
+
 Two-pass approach:
   Pass 1: scan files, extract entity candidates with signal counts
   Pass 2: score and classify each candidate as person, project, or uncertain
@@ -27,6 +30,8 @@ Usage:
     confirmed = confirm_entities(candidates) # interactive review
 """
 
+from __future__ import annotations
+
 import re
 import os
 import functools
@@ -396,7 +401,12 @@ def classify_entity(name: str, frequency: int, scores: dict) -> dict:
 # ==================== MAIN DETECT ====================
 
 
-def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) -> dict:
+def detect_entities(
+    file_paths: list,
+    max_files: int = 10,
+    languages=("en",),
+    corpus_origin: dict | None = None,
+) -> dict:
     """
     Scan files and detect entity candidates.
 
@@ -405,12 +415,23 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
         max_files: Max files to read (for speed)
         languages: Tuple of language codes whose entity patterns should be
             applied (union). Defaults to ``("en",)``.
+        corpus_origin: Optional corpus-origin context (the dict produced
+            by ``mempalace.corpus_origin`` and persisted to
+            ``<palace>/.mempalace/origin.json`` by ``mempalace init``).
+            When supplied and the corpus is identified as AI-dialogue with
+            known agent persona names, candidates whose name matches an
+            agent persona are moved out of ``people``/``uncertain`` and
+            into a new ``agent_personas`` bucket. Shape:
+            ``{"schema_version": 1, "result": {"agent_persona_names": [...], ...}}``.
 
     Returns:
         {
             "people": [...entity dicts...],
             "projects": [...entity dicts...],
             "uncertain":[...entity dicts...],
+            # Only present when corpus_origin reclassifies at least one
+            # candidate as an agent persona:
+            "agent_personas": [...entity dicts...],
         }
     """
     langs = _normalize_langs(languages)
@@ -440,7 +461,10 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
     candidates = extract_candidates(combined_text, languages=langs)
 
     if not candidates:
-        return {"people": [], "projects": [], "topics": [], "uncertain": []}
+        return _apply_corpus_origin(
+            {"people": [], "projects": [], "topics": [], "uncertain": []},
+            corpus_origin,
+        )
 
     # Score and classify each candidate
     people = []
@@ -463,14 +487,76 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
     projects.sort(key=lambda x: x["confidence"], reverse=True)
     uncertain.sort(key=lambda x: x["frequency"], reverse=True)
 
     # Cap results to most relevant
-    return {
+    detected = {
         "people": people[:15],
         "projects": projects[:10],
         "topics": [],
         "uncertain": uncertain[:8],
     }
+
+    return _apply_corpus_origin(detected, corpus_origin)
+
+
+def _apply_corpus_origin(detected: dict, corpus_origin: dict | None) -> dict:
+    """Reclassify per-candidate buckets using corpus-origin context.
+
+    When the corpus is identified as AI-dialogue with known agent persona
+    names, a candidate whose name case-insensitively matches one of those
+    personas is moved from ``people``/``uncertain`` into an
+    ``agent_personas`` bucket. The candidate's per-entity ``type`` is also
+    rewritten to ``"agent_persona"``.
+
+    No-op when ``corpus_origin`` is ``None`` or contains no usable persona
+    names. Pure: returns a new dict, does not mutate the input.
+    """
+    if not corpus_origin:
+        return detected
+
+    origin_result = corpus_origin.get("result") or {}
+    raw_personas = origin_result.get("agent_persona_names") or []
+    persona_lower = {n.lower() for n in raw_personas if isinstance(n, str)}
+    if not persona_lower:
+        return detected
+
+    agent_personas: list = []
+    new_people: list = []
+    new_uncertain: list = []
+
+    for entity in detected.get("people", []):
+        if entity["name"].lower() in persona_lower:
+            agent_personas.append(_tag_as_persona(entity))
+        else:
+            new_people.append(entity)
+
+    for entity in detected.get("uncertain", []):
+        if entity["name"].lower() in persona_lower:
+            agent_personas.append(_tag_as_persona(entity))
+        else:
+            new_uncertain.append(entity)
+
+    if not agent_personas:
+        return detected
+
+    agent_personas.sort(key=lambda x: x.get("confidence", 0), reverse=True)
+
+    return {
+        **detected,
+        "people": new_people,
+        "uncertain": new_uncertain,
+        "agent_personas": agent_personas,
+    }
+
+
+def _tag_as_persona(entity: dict) -> dict:
+    """Return a new entity dict tagged as agent_persona with provenance signal."""
+    existing_signals = entity.get("signals", [])
+    return {
+        **entity,
+        "type": "agent_persona",
+        "confidence": max(0.95, entity.get("confidence", 0.0)),
+        "signals": ["matched corpus_origin agent_persona_names"] + existing_signals[:2],
+    }
 
 
 # ==================== INTERACTIVE CONFIRM ====================
 
 
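
A downstream tool that wants the persona names from the persisted envelope might read it along the following lines. This is a sketch only: `load_persona_names` is a hypothetical helper that is not part of mempalace, and it assumes the file follows the `{"schema_version": 1, "result": {"agent_persona_names": [...], ...}}` shape quoted in the `detect_entities` docstring. Treating any other `schema_version` as unreadable is also an assumption made here, not documented behavior.

```python
import json
from pathlib import Path


def load_persona_names(palace_dir: str) -> set:
    """Return lowercased agent persona names, or an empty set if unavailable.

    Hypothetical reader for <palace>/.mempalace/origin.json.
    """
    origin_path = Path(palace_dir) / ".mempalace" / "origin.json"
    try:
        envelope = json.loads(origin_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON: no personas known.
        return set()

    # Assumption: only schema_version 1 is understood by this reader.
    if not isinstance(envelope, dict) or envelope.get("schema_version") != 1:
        return set()

    result = envelope.get("result")
    if not isinstance(result, dict):
        return set()

    names = result.get("agent_persona_names") or []
    # Lowercase so callers can match case-insensitively, as the diff does.
    return {n.lower() for n in names if isinstance(n, str)}
```

Returning an empty set on every failure path mirrors the no-op behavior of `_apply_corpus_origin` when no usable persona names are present, so a caller can use the result without special-casing a missing file.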