mempalace/mempalace/llm_refine.py
MSL b99e54546b feat(init): context-aware corpus detection
10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.

This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.

The headline change is that `mempalace init` now understands what kind of folder you're pointing it at (AI conversations, regular writing, code, or narrative) and adapts how it classifies entities accordingly. Previously, a folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) had those names dumped into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.

But the broader change is that `mempalace init` got upgraded across the board — smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.

## What changes for users (in order, from `pip install` onwards)

**Install** — `pip install mempalace` is unchanged. The package itself didn't shift.

**First run — `mempalace init <folder>`:**

1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step — it dove straight into entity detection with no corpus context.

2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3; `mempalace.llm_client` supports three provider types:
   - **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`) — the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~\$0.01 per `init`.
   - **Ollama** (`--llm-provider ollama` — the default) — local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
   - **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`) — per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.

3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. `--no-llm` is the new explicit opt-out.

4. **`init` shows you what it detected.** A one-line banner — `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)` — tells you at a glance whether mempalace understood your folder.

5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.

6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.

7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it.

8. **Re-running `init` is now idempotent.** Bug fix — running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.
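
The fallback described in item 3 can be sketched as follows. This is a minimal illustration, not the code in this PR: `ProviderUnavailable`, `pick_provider`, and `offline_connect` are hypothetical names, and the wording of the notice is assumed.

```python
from __future__ import annotations

from typing import Callable


class ProviderUnavailable(Exception):
    """Hypothetical error raised when no LLM endpoint or API key is usable."""


def pick_provider(connect: Callable[[], object], use_llm: bool = True):
    """Return (provider, notice); provider is None on the heuristic-only path."""
    if not use_llm:
        # Explicit --no-llm opt-out: skip the LLM without printing anything.
        return None, None
    try:
        return connect(), None
    except ProviderUnavailable as exc:
        # A missing LLM is not fatal: report it once and keep going.
        notice = (
            f"LLM unavailable ({exc}); continuing heuristic-only. "
            "Pass --no-llm to skip this check."
        )
        return None, notice


def offline_connect():
    """Stand-in for a provider factory when nothing is listening."""
    raise ProviderUnavailable("nothing listening on localhost:11434")


provider, notice = pick_provider(offline_connect)
print(notice)
```

The point of the shape is that the caller always gets a usable answer: either a provider, or `None` plus a one-line message, never an exception.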

**Later — when your folder grows:**

9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again — `init` is now idempotent (item 8), so re-running it on the same folder is safe.
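
A rough sketch of how a downstream tool might write and read the stored detection. The `schema_version: 1` envelope and the `<palace>/.mempalace/origin.json` path come from this PR; the exact on-disk key layout (a `result` object next to `schema_version`) and the helper names are assumptions for illustration.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

SCHEMA_VERSION = 1


def origin_path(palace_dir: Path) -> Path:
    return palace_dir / ".mempalace" / "origin.json"


def write_origin(palace_dir: Path, result: dict) -> Path:
    """Overwrite the stored detection, wrapped in a versioned envelope."""
    target = origin_path(palace_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    envelope = {"schema_version": SCHEMA_VERSION, "result": result}
    target.write_text(json.dumps(envelope, indent=2), encoding="utf-8")
    return target


def read_origin(palace_dir: Path) -> dict | None:
    """Return the envelope, or None if it is missing, corrupt, or another version."""
    target = origin_path(palace_dir)
    if not target.is_file():
        return None
    try:
        envelope = json.loads(target.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None
    if not isinstance(envelope, dict) or envelope.get("schema_version") != SCHEMA_VERSION:
        return None
    return envelope


with tempfile.TemporaryDirectory() as tmp:
    palace = Path(tmp)
    missing = read_origin(palace)  # nothing stored yet
    write_origin(palace, {"likely_ai_dialogue": True, "user_name": "Jordan"})
    write_origin(palace, {"likely_ai_dialogue": False})  # a refresh overwrites
    stored = read_origin(palace)
```

Checking `schema_version` on read lets a consumer ignore a file written by a future, incompatible layout instead of misreading it.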

## Behind the changes

- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection. The first tier is a regex heuristic with a co-occurrence rule: ambiguous terms like `Claude` / `Gemini` / `Haiku` are suppressed when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, and llama-rancher journals don't false-positive. The second tier is an LLM pass that extracts `user_name` and `agent_persona_names` from dialogue structure, with belt-and-suspenders user-vs-agent disambiguation.

- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`.

- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.

- **`init` overhaul.** Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). `--llm` flipped to default-on. `--no-llm` added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing `llm_client` module.

- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.

- **`CLAUDE.md` design principle reworded** — "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.
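
The co-occurrence rule from the first bullet can be illustrated with a small sketch. The term lists and the scoring function are assumptions made up for this example; the real lists and thresholds in `corpus_origin.py` are not shown in this description.

```python
import re

# Hypothetical term lists, for illustration only.
UNAMBIGUOUS_TERMS = ("ChatGPT", "Anthropic", "OpenAI")
AMBIGUOUS_TERMS = ("Claude", "Gemini", "Haiku", "Llama")


def count_terms(terms, text):
    """Count whole-word, case-insensitive hits for any of `terms`."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return len(pattern.findall(text))


def ai_signal_score(text):
    """Ambiguous terms only count when an unambiguous term co-occurs."""
    strong = count_terms(UNAMBIGUOUS_TERMS, text)
    weak = count_terms(AMBIGUOUS_TERMS, text)
    return strong + weak if strong else 0


novel = "Claude walked along the Seine, composing a haiku."
chat = "I asked Claude, the Anthropic model, for a haiku."
print(ai_signal_score(novel), ai_signal_score(chat))
```

In the novel excerpt both hits are ambiguous, so they are suppressed; in the chat excerpt the unambiguous term unlocks them. The `\b` anchors also keep `Gemini` from matching inside words like `Geminids`.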

## Cost story

- **Anthropic (verified path):** ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.

## Backwards compatibility

- The affected public functions (`entity_detector.detect_entities`, `project_scanner.discover_entities`, `llm_refine.refine_entities`) gained `corpus_origin` as an optional kwarg (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged: no `agent_personas` key, no behavioral change.
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.
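
To make the compatibility contract concrete, here is a simplified sketch of persona routing behind an optional kwarg. `route_personas` is a hypothetical helper, not the function in this PR, and it only inspects the `people` bucket; the `result` keys mirror those read by `_build_corpus_origin_preamble`.

```python
from __future__ import annotations


def route_personas(detected: dict, corpus_origin: dict | None = None) -> dict:
    """Move persona names out of `people`; leave everything alone without origin data."""
    if not corpus_origin:
        return detected  # v3.3.3 shape: no agent_personas key is added
    result = corpus_origin.get("result") or {}
    personas = {n.lower() for n in result.get("agent_persona_names") or []}
    if not result.get("likely_ai_dialogue") or not personas:
        return detected

    routed = {bucket: list(items) for bucket, items in detected.items()}
    routed["people"] = []
    routed["agent_personas"] = []
    for entry in detected.get("people", []):
        if entry["name"].lower() in personas:  # case-insensitive match
            routed["agent_personas"].append({**entry, "type": "agent_persona"})
        else:
            routed["people"].append(entry)
    return routed


detected = {
    "people": [{"name": "Echo", "type": "person"}, {"name": "Jordan", "type": "person"}],
    "projects": [],
}
origin = {"result": {"likely_ai_dialogue": True, "agent_persona_names": ["echo"]}}

unchanged = route_personas(detected)
routed = route_personas(detected, origin)
```

Without `corpus_origin` the input dict comes back untouched, which is the property existing callers rely on.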

## Test coverage

- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection` — chromadb optional dep) unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.

## Hygiene guardrail

This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts that no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. The test fails if anything slips in. An allowlist covers legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
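
A guardrail of this kind might look roughly like the sketch below. The blocked patterns are invented placeholders (the real phrases are not listed here), and `find_jargon` is a hypothetical helper; only the allowlisted locations are taken from the description above.

```python
from __future__ import annotations

import re
import tempfile
from pathlib import Path

# Hypothetical patterns; the real test's blocked phrases are not shown in this PR text.
JARGON = re.compile(r"\b(?:review section \d+|phase \d+ handoff)\b", re.IGNORECASE)
ALLOWED_DIRS = {"sources", "backends", "i18n"}
ALLOWED_FILES = {"knowledge_graph.py"}


def find_jargon(root: Path) -> list[str]:
    """Return 'relative/path.py:lineno' for every non-allowlisted hit."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        if ALLOWED_DIRS.intersection(rel.parts[:-1]) or rel.name in ALLOWED_FILES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if JARGON.search(line):
                hits.append(f"{rel.as_posix()}:{lineno}")
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sources").mkdir()
    (root / "miner.py").write_text("x = 1\n# per review section 4\n", encoding="utf-8")
    (root / "sources" / "rfc.py").write_text("# see review section 2\n", encoding="utf-8")
    hits = find_jargon(root)
```

In a test suite the final step would be a single `assert not hits, hits`, so a failure names the offending file and line.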
2026-04-26 12:37:26 -07:00


"""
llm_refine.py — Optional LLM refinement of regex-detected entities.
Takes the candidate set produced by phase-1 detection (manifests, git
authors, regex on prose) and asks an LLM to reclassify each candidate as
PERSON / PROJECT / TOPIC / COMMON_WORD / AMBIGUOUS.
Design constraints:
- Opt-in. Default init path never imports this module.
- Local-first by default (Ollama).
- Interactive UX: visible progress, clean cancellation (Ctrl-C returns
whatever was classified before the interrupt).
- Don't feed the raw corpus to the LLM — feed candidates + a few sampled
context lines each. Keeps total input to ~50-100K tokens even for huge
prose corpora.
Public:
    refine_entities(detected, corpus_text, provider, ...) -> RefineResult
"""
from __future__ import annotations
import json
import re
import sys
from dataclasses import dataclass
from mempalace.llm_client import LLMError, LLMProvider
BATCH_SIZE = 25 # candidates per LLM call; tuned for 4B local models
CONTEXT_LINES_PER_CANDIDATE = 3
CONTEXT_WINDOW_CHARS = 240 # max chars per context line to keep tokens bounded
# Valid labels the LLM is allowed to return. Anything else is treated as
# AMBIGUOUS so the user reviews it.
VALID_LABELS = {"PERSON", "PROJECT", "TOPIC", "COMMON_WORD", "AMBIGUOUS"}
SYSTEM_PROMPT = """You are helping organize a user's memory palace by classifying capitalized tokens found in their files.
For each candidate, pick exactly ONE label:
- PERSON: a specific real person the user knows (colleague, family, character they write about)
- PROJECT: a named product, codebase, or effort the user works on
- TOPIC: a recurring theme or subject (not a person, not a project) — cities, technologies, concepts
- COMMON_WORD: an English word, verb, or fragment that isn't a named entity at all (e.g. "Created", "Before", "Never")
- AMBIGUOUS: context is insufficient to decide between two of the above
Frameworks, runtimes, APIs, cloud services, vendors, and third-party products
(e.g. Angular, OpenAPI, Terraform, Bun, Google) are TOPIC unless the context
clearly says this is the user's own named codebase, product, or active effort.
Use the provided context lines to disambiguate. A capitalized word that only appears in metadata ("Created: 2026-04-24") is COMMON_WORD. A name that appears with pronouns and dialogue is PERSON.
Respond with JSON only. Schema:
{"classifications": [{"name": "<exact candidate name>", "label": "<LABEL>", "reason": "<one short sentence>"}]}
One entry per candidate, same order as the input."""


@dataclass
class RefineResult:
    merged: dict  # updated detected dict
    reclassified: int  # entries whose type changed
    dropped: int  # entries removed from the merged result (COMMON_WORD only)
    errors: list[str]  # per-batch error messages (transport/parse failures)
    batches_completed: int
    batches_total: int
    cancelled: bool


def _collect_contexts(
    corpus_lines: list[str], name: str, max_lines: int = CONTEXT_LINES_PER_CANDIDATE
) -> list[str]:
    """Return up to `max_lines` distinct lines from the corpus that mention `name`.

    Case-insensitive token-boundary match. Lines are truncated to
    CONTEXT_WINDOW_CHARS chars to keep token usage bounded.
    """
    needle = re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)
    seen: set[str] = set()
    out: list[str] = []
    for line in corpus_lines:
        if not needle.search(line):
            continue
        trimmed = line.strip()[:CONTEXT_WINDOW_CHARS]
        if not trimmed or trimmed in seen:
            continue
        seen.add(trimmed)
        out.append(trimmed)
        if len(out) >= max_lines:
            break
    return out


def _build_user_prompt(candidates_with_contexts: list[tuple[str, str, list[str]]]) -> str:
    """Shape: for each candidate, list its current type guess + sampled contexts."""
    parts: list[str] = ["CANDIDATES:"]
    for i, (name, current_type, contexts) in enumerate(candidates_with_contexts, 1):
        parts.append(f"\n{i}. {name} (currently: {current_type})")
        if contexts:
            for c in contexts:
                parts.append(f" > {c}")
        else:
            parts.append(" > (no context available)")
    return "\n".join(parts)


def _extract_json_candidates(text: str) -> list[str]:
    """Return plausible JSON payloads extracted from an LLM response."""
    text = text.strip()
    if not text:
        return []

    # 1) The whole response, in case it is already bare JSON.
    candidates: list[str] = [text]

    # 2) Anything inside a Markdown code fence.
    for match in re.finditer(r"```(?:json)?\s*([\s\S]*?)\s*```", text, re.IGNORECASE):
        candidate = match.group(1).strip()
        if candidate and candidate not in candidates:
            candidates.append(candidate)

    # 3) Every balanced {...} or [...] span, tracking string literals so
    #    brackets inside quoted text do not affect the depth count.
    for start, opener in ((i, ch) for i, ch in enumerate(text) if ch in "{["):
        closer = "}" if opener == "{" else "]"
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
                continue
            if ch == '"':
                in_string = True
            elif ch == opener:
                depth += 1
            elif ch == closer:
                depth -= 1
                if depth == 0:
                    candidate = text[start : i + 1].strip()
                    if candidate and candidate not in candidates:
                        candidates.append(candidate)
                    break
    return candidates


def _parse_response(text: str, expected_names: list[str]) -> dict[str, tuple[str, str]]:
    """Parse the LLM's JSON response into {name: (label, reason)}.

    Robust to the model occasionally wrapping JSON in text or returning
    slight schema variations. Falls back to matching by candidate name.
    """
    data = None
    for candidate in _extract_json_candidates(text):
        try:
            data = json.loads(candidate)
            break
        except json.JSONDecodeError:
            continue
    if data is None:
        return {}

    entries = data.get("classifications") if isinstance(data, dict) else data
    if not isinstance(entries, list):
        return {}

    name_to_label: dict[str, tuple[str, str]] = {}
    expected_set = {n.lower(): n for n in expected_names}
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name") or entry.get("candidate")
        label = entry.get("label") or entry.get("type") or entry.get("classification")
        reason = entry.get("reason") or ""
        if not isinstance(name, str) or not isinstance(label, str):
            continue
        # Restore canonical casing from expected_names
        canonical = expected_set.get(name.lower(), name)
        lbl = label.strip().upper()
        if lbl not in VALID_LABELS:
            lbl = "AMBIGUOUS"
        name_to_label[canonical] = (lbl, reason.strip()[:120])
    return name_to_label


def _apply_classifications(
    detected: dict,
    decisions: dict[str, tuple[str, str]],
    allow_project_promotions: bool = True,
) -> tuple[dict, int, int]:
    """Merge LLM decisions back into the detected dict.

    Returns (new_detected, reclassified_count, dropped_count).

    Topics get their own bucket so the caller can persist them as
    cross-wing tunnel signal. ``AMBIGUOUS`` still falls back to
    ``uncertain`` for human review.
    """
    label_to_bucket = {
        "PERSON": "people",
        "PROJECT": "projects",
        "TOPIC": "topics",
        "AMBIGUOUS": "uncertain",
    }
    bucket_to_type = {
        "people": "person",
        "projects": "project",
        "topics": "topic",
        "uncertain": "uncertain",
    }

    # Flatten every entity with its source bucket so each can be re-routed
    all_entries: list[tuple[str, dict]] = []
    for bucket, items in detected.items():
        for e in items:
            all_entries.append((bucket, e))

    reclassified = 0
    dropped = 0
    new_detected: dict[str, list[dict]] = {
        "people": [],
        "projects": [],
        "topics": [],
        "uncertain": [],
    }

    for old_bucket, entry in all_entries:
        decision = decisions.get(entry["name"])
        if decision is None:
            # No LLM opinion: keep as-is
            new_detected.setdefault(old_bucket, []).append(entry)
            continue
        label, reason = decision
        if label == "COMMON_WORD":
            dropped += 1
            continue
        target_bucket = label_to_bucket[label]
        if (
            label == "PROJECT"
            and not allow_project_promotions
            and not _is_authoritative_project(entry)
        ):
            target_bucket = "uncertain"
        updated = dict(entry)
        # Append the LLM's reason as a new signal so the user sees why it moved
        signals = list(updated.get("signals", []))
        signals.append(f"LLM: {label.lower()} - {reason}" if reason else f"LLM: {label.lower()}")
        updated["signals"] = signals
        if target_bucket != old_bucket:
            reclassified += 1
            updated["type"] = bucket_to_type.get(target_bucket, "uncertain")
        new_detected[target_bucket].append(updated)
    return new_detected, reclassified, dropped


def _build_corpus_origin_preamble(corpus_origin: dict | None) -> str:
    """Build a system-prompt preamble carrying corpus-origin context.

    When the corpus has been identified as AI-dialogue with known persona
    names, this preamble lets the LLM disambiguate ambiguous candidates
    with knowledge that this is AI-dialogue. It does NOT add a new label
    or change the classification schema; the post-refine sweep in
    project_scanner.discover_entities still moves persona names into
    ``agent_personas``. The preamble is purely classification context for
    the OTHER candidates (ambiguous, common-word) that benefit from
    knowing the corpus shape.

    Returns ``""`` when no usable origin context is available, so callers
    can concatenate unconditionally without changing the v3.3.3 prompt
    shape for opt-out paths.
    """
    if not corpus_origin:
        return ""
    result = corpus_origin.get("result") or {}
    if not result.get("likely_ai_dialogue"):
        return ""

    lines = ["\n\nCORPUS CONTEXT (corpus-origin detection):"]
    platform = result.get("primary_platform")
    if platform:
        lines.append(f"- This corpus is AI-dialogue from {platform}.")
    user_name = result.get("user_name")
    if user_name:
        lines.append(
            f"- The corpus author (the human user) is named '{user_name}'. "
            f"Treat this name as PERSON."
        )
    personas = result.get("agent_persona_names") or []
    if personas:
        lines.append(
            "- The user has assigned these persona names to AI agents in "
            f"this corpus: {', '.join(personas)}."
        )
        lines.append(
            "- Persona names refer to AI agents, not biological people. "
            "Classify them as PERSON (a downstream step tags them as "
            "agent personas)."
        )
    return "\n".join(lines)


def _is_authoritative_person(entry: dict) -> bool:
    """Return True for git-author people that should not be second-guessed."""
    signals = " ".join(entry.get("signals", [])).lower()
    return "commit" in signals and "repo" in signals


def _is_authoritative_project(entry: dict) -> bool:
    """Return True for manifest/git-backed projects that are already source-backed."""
    signals = " ".join(entry.get("signals", [])).lower()
    manifest_markers = ("package.json", "pyproject.toml", "cargo.toml", "go.mod")
    return any(marker in signals for marker in manifest_markers) or "commit" in signals


def _print_progress(batch_idx: int, total: int, current_name: str) -> None:
    """Overwrite-line progress indicator."""
    width = 40
    filled = int(width * batch_idx / total) if total else 0
    bar = "█" * filled + "░" * (width - filled)
    msg = f"\r LLM refine: [{bar}] batch {batch_idx}/{total} current: {current_name[:30]:<30}"
    sys.stderr.write(msg)
    sys.stderr.flush()


def refine_entities(
    detected: dict,
    corpus_text: str,
    provider: LLMProvider,
    batch_size: int = BATCH_SIZE,
    show_progress: bool = True,
    allow_project_promotions: bool = True,
    corpus_origin: dict | None = None,
) -> RefineResult:
    """Reclassify detected entities using the LLM provider.

    Only regex-derived candidates are sent for refinement. Git authors and
    manifest/git-backed projects are already source-backed and don't benefit
    from LLM second-guessing.

    Ctrl-C during refinement: cancels the remaining batches, returns a
    RefineResult with ``cancelled=True`` and whatever was classified before
    the interrupt. The partial result is safe to pass straight to
    ``confirm_entities``.

    Transport or parse failures in individual batches are recorded in
    ``errors`` and do not abort the run.

    ``allow_project_promotions=False`` keeps LLM-only project guesses in the
    uncertain bucket. This is useful when manifest/git signal already supplied
    canonical projects and regex/LLM hits are likely tools, vendors, or topics.
    """
    candidates: list[tuple[str, str]] = []
    current_type = {"people": "person", "projects": "project", "uncertain": "uncertain"}
    for bucket in ("people", "projects", "uncertain"):
        for e in detected.get(bucket, []):
            if bucket == "people" and _is_authoritative_person(e):
                continue
            if bucket == "projects" and _is_authoritative_project(e):
                continue
            candidates.append((e["name"], current_type[bucket]))

    corpus_lines = corpus_text.splitlines() if corpus_text else []

    # Deduplicate candidate names while preserving order
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for name, kind in candidates:
        if name not in seen:
            seen.add(name)
            unique.append((name, kind))

    if not unique:
        return RefineResult(
            merged=detected,
            reclassified=0,
            dropped=0,
            errors=[],
            batches_completed=0,
            batches_total=0,
            cancelled=False,
        )

    # Build batches
    batches: list[list[tuple[str, str, list[str]]]] = []
    for i in range(0, len(unique), batch_size):
        chunk = unique[i : i + batch_size]
        enriched = [(name, kind, _collect_contexts(corpus_lines, name)) for name, kind in chunk]
        batches.append(enriched)

    all_decisions: dict[str, tuple[str, str]] = {}
    errors: list[str] = []
    completed = 0
    cancelled = False
    system_prompt = SYSTEM_PROMPT + _build_corpus_origin_preamble(corpus_origin)

    for idx, batch in enumerate(batches, 1):
        if show_progress and batch:
            _print_progress(idx - 1, len(batches), batch[0][0])
        user_prompt = _build_user_prompt(batch)
        try:
            resp = provider.classify(system_prompt, user_prompt, json_mode=True)
        except KeyboardInterrupt:
            cancelled = True
            break
        except LLMError as e:
            errors.append(f"batch {idx}: {e}")
            continue
        names_in_batch = [name for name, _, _ in batch]
        decisions = _parse_response(resp.text, names_in_batch)
        if not decisions:
            errors.append(f"batch {idx}: could not parse response")
        all_decisions.update(decisions)
        completed += 1
        if show_progress:
            _print_progress(idx, len(batches), batch[-1][0])

    if show_progress:
        sys.stderr.write("\n")
        sys.stderr.flush()

    merged, reclassified, dropped = _apply_classifications(
        detected,
        all_decisions,
        allow_project_promotions=allow_project_promotions,
    )
    return RefineResult(
        merged=merged,
        reclassified=reclassified,
        dropped=dropped,
        errors=errors,
        batches_completed=completed,
        batches_total=len(batches),
        cancelled=cancelled,
    )


def collect_corpus_text(
    project_dir: str,
    max_files: int = 30,
    max_bytes_per_file: int = 20_000,
) -> str:
    """Gather prose text from ``project_dir`` for use as LLM context source.

    Stratified: reads up to ``max_files`` prose files (``.md``, ``.txt``,
    ``.rst``), preferring recently-modified. Each file capped at
    ``max_bytes_per_file`` to bound total input.
    """
    from pathlib import Path

    from mempalace.entity_detector import PROSE_EXTENSIONS, SKIP_DIRS

    root = Path(project_dir).expanduser().resolve()
    if not root.is_dir():
        return ""

    candidates: list[tuple[float, Path]] = []
    for dirpath, dirs, files in _walk_prose(root, SKIP_DIRS):
        for fname in files:
            p = dirpath / fname
            if p.suffix.lower() not in PROSE_EXTENSIONS:
                continue
            try:
                mtime = p.stat().st_mtime
            except OSError:
                continue
            candidates.append((mtime, p))

    # Newest files first
    candidates.sort(reverse=True)
    selected = [p for _, p in candidates[:max_files]]

    chunks: list[str] = []
    for p in selected:
        try:
            with open(p, encoding="utf-8", errors="replace") as f:
                chunks.append(f.read(max_bytes_per_file))
        except OSError:
            continue
    return "\n".join(chunks)


def _walk_prose(root, skip_dirs):
    """Walk a directory yielding (Path, dirs, files), pruning skip_dirs.

    Inlined from ``project_scanner._walk`` to avoid a private-name import
    coupling. Functionality is intentionally narrow: prose collection only.
    """
    import os
    from pathlib import Path

    for dirpath, dirs, files in os.walk(root):
        # Prune in place so os.walk does not descend into skipped or hidden dirs
        dirs[:] = [d for d in dirs if d not in skip_dirs and not d.startswith(".")]
        yield Path(dirpath), dirs, files