mempalace/mempalace/corpus_origin.py
MSL b99e54546b feat(init): context-aware corpus detection
10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.

This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.

The headline change is that `mempalace init` now understands what kind of folder you're pointing it at (AI conversations, regular writing, code, or narrative) and adapts how it classifies entities accordingly. Previously, a folder mentioning `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) had those names dumped into your `people` list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.

But the broader change is that `mempalace init` got upgraded across the board — smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.

## What changes for users (in order, from `pip install` onwards)

**Install** — `pip install mempalace` is unchanged. The package itself didn't shift.

**First run — `mempalace init <folder>`:**

1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step — it dove straight into entity detection with no corpus context.

2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3 — three buckets are supported by `mempalace.llm_client`:
   - **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`) — the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~\$0.01 per `init`.
   - **Ollama** (`--llm-provider ollama` — the default) — local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
   - **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`) — per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.

3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. `--no-llm` is the new explicit opt-out.

4. **`init` shows you what it detected.** A one-line banner — `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)` — tells you at a glance whether mempalace understood your folder.

5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.

6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.

7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it.

8. **Re-running `init` is now idempotent.** Bug fix — running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.
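
Items 1 to 3 describe a cheap pass that always runs, an optional LLM pass, and a fallback when the LLM is unreachable. The control flow can be sketched roughly as below. This is an illustration only, not the actual `init` code: `detect_with_fallback`, `fake_heuristic`, and `broken_llm` are hypothetical names, and the printed message is a stand-in for whatever `init` really prints.

```python
from typing import Callable, Optional


def detect_with_fallback(
    samples: list[str],
    run_heuristic: Callable[[list[str]], dict],
    run_llm: Optional[Callable[[list[str]], dict]] = None,
) -> dict:
    """Always run the cheap pass; use the LLM pass only if it is available and works."""
    result = run_heuristic(samples)
    if run_llm is None:
        # No provider configured (the --no-llm case): heuristic-only path.
        return {**result, "source": "heuristic"}
    try:
        refined = run_llm(samples)
    except Exception as exc:  # unreachable endpoint, missing key, timeout, ...
        print(f"LLM unavailable ({exc}); continuing heuristic-only. See --no-llm.")
        return {**result, "source": "heuristic"}
    return {**refined, "source": "llm"}


def fake_heuristic(samples: list[str]) -> dict:
    # Hypothetical stand-in for the regex tier.
    return {"likely_ai_dialogue": any("assistant:" in s.lower() for s in samples)}


def broken_llm(samples: list[str]) -> dict:
    # Simulates a local runtime that is not running.
    raise ConnectionError("connection refused")


outcome = detect_with_fallback(["user: hi\nassistant: hello"], fake_heuristic, broken_llm)
```

The point of the design is that the LLM pass can only improve on the heuristic result; a failure in it never propagates to the caller.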

**Later — when your folder grows:**

9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again — `init` is now idempotent (item 8), so re-running it on the same folder is safe.
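
A downstream tool that wants to consume the persisted detection (item 7) might read it along these lines. Only the file location and the `schema_version: 1` field come from this PR description; the `"result"` key and the `load_origin` helper are assumptions made for the sketch, so check the real file before relying on them.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

SUPPORTED_SCHEMA = 1


def load_origin(palace_dir: Path) -> Optional[dict]:
    """Return the stored detection payload, or None if absent, unreadable, or a newer schema."""
    path = palace_dir / ".mempalace" / "origin.json"
    if not path.is_file():
        return None
    try:
        envelope = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(envelope, dict) or envelope.get("schema_version") != SUPPORTED_SCHEMA:
        return None
    return envelope.get("result")  # "result" is an assumed key name


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    palace = Path(tmp)
    (palace / ".mempalace").mkdir()
    (palace / ".mempalace" / "origin.json").write_text(
        json.dumps({"schema_version": 1, "result": {"likely_ai_dialogue": True}}),
        encoding="utf-8",
    )
    stored = load_origin(palace)
    missing = load_origin(palace / "nowhere")
```

Checking the schema version before touching the payload lets a reader ignore files written by a future, incompatible version instead of misreading them.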

## Behind the changes

- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection: regex heuristic with co-occurrence rule (suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, llama-rancher journals don't false-positive), and LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure with belt-and-suspenders user-vs-agent disambiguation.

- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`.

- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.

- **`init` overhaul.** Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). `--llm` flipped to default-on. `--no-llm` added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing `llm_client` module.

- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.

- **`CLAUDE.md` design principle reworded** — "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.
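
The persona routing described in the entity-classification bullet can be illustrated with a small standalone function. This is a sketch under assumptions, not the code in `entity_detector`: `route_personas` is a hypothetical name, and the entity shape (`{"name": ..., "type": ...}`) and the plain-dict form of `corpus_origin` are guesses made for the example.

```python
from typing import Optional


def route_personas(detected: dict, corpus_origin: Optional[dict] = None) -> dict:
    """Move people whose names match a known agent persona into their own bucket."""
    if not corpus_origin or not corpus_origin.get("likely_ai_dialogue"):
        return detected  # no origin info: return shape is left unchanged
    personas = {name.casefold() for name in corpus_origin.get("agent_persona_names", [])}
    people, agents = [], []
    for entity in detected.get("people", []):
        if entity["name"].casefold() in personas:
            # Case-insensitive match: reroute and rewrite the per-entity type.
            agents.append({**entity, "type": "agent_persona"})
        else:
            people.append(entity)
    return {**detected, "people": people, "agent_personas": agents}


detected = {
    "people": [
        {"name": "Jordan", "type": "person"},
        {"name": "echo", "type": "person"},
    ],
    "projects": [],
}
origin = {"likely_ai_dialogue": True, "agent_persona_names": ["Echo", "Sparrow"]}
routed = route_personas(detected, origin)
```

Here `routed["people"]` keeps only Jordan, while the lowercase `echo` lands in `routed["agent_personas"]` with its type rewritten.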

## Cost story

- **Anthropic (verified path):** ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.

## Backwards compatibility

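- The three affected public functions (`entity_detector.detect_entities`, `project_scanner.discover_entities`, and `llm_refine.refine_entities`) gained `corpus_origin` as an optional kwarg (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged: no `agent_personas` key, no behavioral change.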
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.
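
One way to get the "byte-identical when `corpus_origin=None`" guarantee is to return the base prompt object untouched on the `None` path and only build a new string when context exists. The sketch below assumes a preamble layout and a placeholder base prompt; `build_system_prompt` is a hypothetical helper, and the real `CORPUS CONTEXT` wording in `llm_refine` may differ.

```python
from typing import Optional

# Placeholder text; the real llm_refine.SYSTEM_PROMPT is not reproduced here.
SYSTEM_PROMPT = "Label each candidate as TOPIC, PERSON, PROJECT, COMMON_WORD, or AMBIGUOUS."


def build_system_prompt(corpus_origin: Optional[dict] = None) -> str:
    """Prepend corpus context when available; otherwise return the base prompt untouched."""
    if not corpus_origin:
        return SYSTEM_PROMPT
    lines = ["CORPUS CONTEXT"]
    if corpus_origin.get("primary_platform"):
        lines.append(f"Platform: {corpus_origin['primary_platform']}")
    if corpus_origin.get("user_name"):
        lines.append(f"User: {corpus_origin['user_name']}")
    if corpus_origin.get("agent_persona_names"):
        lines.append("Agent personas: " + ", ".join(corpus_origin["agent_persona_names"]))
    return "\n".join(lines) + "\n\n" + SYSTEM_PROMPT


with_context = build_system_prompt(
    {"primary_platform": "Claude (Anthropic)", "user_name": "Jordan", "agent_persona_names": ["Echo", "Sparrow"]}
)
without_context = build_system_prompt()
```

Because the preamble is strictly prepended, the existing label instructions stay at the end of the prompt in both cases.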

## Test coverage

- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection` — chromadb optional dep) unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.
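
The co-occurrence rule that the unit tests exercise can be shown in miniature. The sketch uses deliberately tiny term lists and ignores turn markers (the real module also treats a turn marker as AI context), so treat `counted_brand_hits` here as a simplified stand-in rather than the module's function.

```python
import re

UNAMBIGUOUS = ["ChatGPT", "Anthropic"]
AMBIGUOUS = ["Claude", "Gemini"]


def count_hits(terms: list[str], text: str) -> int:
    # \b on both sides is fine here because every sample term starts and
    # ends with a word character.
    return sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE))
        for term in terms
    )


def counted_brand_hits(text: str) -> int:
    strong = count_hits(UNAMBIGUOUS, text)
    weak = count_hits(AMBIGUOUS, text)
    # Ambiguous terms only count when an unambiguous signal is also present.
    return strong + (weak if strong > 0 else 0)


novel = "Claude poured the wine while Claudette read her Gemini horoscope."
chat = "I asked ChatGPT first, then Claude. Claude gave the better answer."
```

For `novel`, two ambiguous matches are seen (`Claude`, `Gemini`; `Claudette` is excluded by the word boundary) but none are counted. For `chat`, the single `ChatGPT` hit unlocks both `Claude` hits, for a total of three.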

## Hygiene guardrail

This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts that no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. The test fails if anything slips in. An allowlist exempts legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
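
A guardrail of this kind can be sketched as a tree walk plus a pattern scan. The forbidden pattern below is invented for illustration (the real test's pattern list is not shown in this description), and `find_jargon` is a hypothetical helper; only the allowlisted path names are taken from the text above.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern standing in for the real jargon list.
FORBIDDEN = [re.compile(r"\bphase\s+\d+\b", re.IGNORECASE)]
# Top-level names exempt from the scan, as listed in the PR description.
ALLOWLIST = {"sources", "backends", "i18n", "knowledge_graph.py"}


def find_jargon(root: Path) -> list[str]:
    """Return 'relative/path.py:lineno' for every line matching a forbidden pattern."""
    offenders = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        if rel.parts[0] in ALLOWLIST:
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if any(pattern.search(line) for pattern in FORBIDDEN):
                offenders.append(f"{rel.as_posix()}:{lineno}")
    return offenders


# Demo on a throwaway tree: one clean file, one leaky file, one allowlisted file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sources").mkdir()
    (root / "sources" / "rfc.py").write_text("# see phase 2 of the spec\n", encoding="utf-8")
    (root / "clean.py").write_text("x = 1\n", encoding="utf-8")
    (root / "leaky.py").write_text("x = 1\n# TODO from phase 3 planning\n", encoding="utf-8")
    offenders = find_jargon(root)
```

In a test suite the final step would be an assertion that the offender list is empty, with the list itself used as the failure message so the leaking lines are easy to find.
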
2026-04-26 12:37:26 -07:00

423 lines
17 KiB
Python

"""
corpus_origin.py — Detect whether a corpus is an AI-dialogue record and,
if so, what platform and what persona names the user has assigned to the
agent.

This is the first question any downstream Pass 2 classification needs
answered. Without it, a drawer like "my three sons" in a Claude Code
dialogue corpus can't be correctly resolved to "three AI instances"
rather than "three biological children."

Two-tier detection:

  Tier 1 — detect_origin_heuristic(samples)
      Cheap, no API. Grep for well-known AI brand terms + turn
      markers. Always runs. Outputs a hypothesis.

  Tier 2 — detect_origin_llm(samples, provider)
      Uses an LLMProvider (typically Haiku via mempalace.llm_client)
      with the model's pre-trained knowledge of Claude/ChatGPT/Gemini
      etc. Confirms platform, extracts agent persona-names the user
      has assigned. One call, ~$0.01 cost.

Design principle:
  Don't make the classifier re-discover what Claude, ChatGPT, Gemini, MCP,
  or other well-known entities ARE — the LLM already knows them from its
  training. Only corpus-specific entities (e.g. the user's persona-name
  for their Claude instance) need discovery.

Default stance (when evidence is thin):
  "This IS an AI-dialogue corpus" — false-negative is catastrophic for
  downstream classification; false-positive is recoverable via per-drawer
  voice-profile detection in later passes.
"""

from __future__ import annotations

import json
import re
from dataclasses import dataclass, field, asdict
from typing import Optional


# ── Well-known AI brand terms (expand as new platforms emerge) ────────────
# Detection is by PATTERN + CONTEXT, not by capitalization or English-language
# rules. Two categories:
#
#   UNAMBIGUOUS — terms that have essentially no meaning outside of AI context.
#     Always counted toward AI-dialogue evidence.
#
#   AMBIGUOUS — terms that share a string with common English words, names,
#     poetry forms, zodiac signs, animals, etc. Counted toward AI-dialogue
#     evidence ONLY when at least one unambiguous AI signal also appears in
#     the corpus (turn marker, unambiguous brand term, or AI infrastructure
#     term). This avoids false-positives on French novels with characters
#     named "Claude", astrology corpora discussing "Gemini", poetry corpora
#     full of "haiku" / "sonnet", etc.
#
# All matching is CASE-INSENSITIVE — users type lowercase constantly.
_AI_UNAMBIGUOUS_TERMS = [
    # Anthropic-specific
    "Anthropic",
    "Claude Code",
    "Claude 3",
    "Claude 4",
    "claude mcp",
    "CLAUDE.md",
    ".claude/",
    # OpenAI-specific
    "ChatGPT",
    "GPT-4",
    "GPT-3",
    "GPT-5",
    "OpenAI",
    "gpt-4o",
    "gpt-4-turbo",
    "o1-preview",
    "o3",
    # Google-specific
    "gemini-pro",
    "gemini-1.5",
    "Google AI",
    # Meta / others (specific model identifiers, not bare common words)
    "Mixtral",
    "Cohere",
    # AI-infrastructure terms with no common-English collision
    "MCP",
    "LLM",
    "RAG",
    "fine-tune",
    "context window",
    "embedding",
]

_AI_AMBIGUOUS_TERMS = [
    # Anthropic — bare brand/model names that collide with names + poetry
    "Claude",  # also a common French masculine name
    "Opus",  # also a musical work, comic strip, magazine
    "Sonnet",  # also a 14-line poem form
    "Haiku",  # also a 17-syllable poem form
    # Google — bare brand that collides with zodiac sign
    "Gemini",  # also the zodiac sign
    "Bard",  # also a poet / Shakespeare
    # Meta / others
    "Llama",  # also the South American animal
    "Mistral",  # also a Mediterranean wind
    # Note: 'prompt', 'completion', 'tokens' previously lived here but were
    # removed: they're suppressed without an unambiguous co-signal anyway,
    # and by the time a co-signal is present the corpus is already flagged.
    # Keeping them just produced noisier evidence strings.
]
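
# Turn-marker patterns commonly seen in AI-dialogue transcripts
_TURN_MARKERS = [
    r"\buser\s*:\s*",
    r"\bassistant\s*:\s*",
    r"\bhuman\s*:\s*",
    r"\bai\s*:\s*",
    # No leading \b on the ">>>" forms: ">" is a non-word character, so a
    # \b in front of it only matches when a word character precedes it and
    # would miss ">>> User" at the start of a line.
    r">>>\s*User\b",
    r">>>\s*Assistant\b",
]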


def _brand_pattern(term: str) -> str:
    """Build a regex for a brand term that uses word boundaries
    only on edges where the term itself starts/ends with a word
    character. Without this nuance:
      - 'Claude' would falsely match inside 'Claudette' (no \\b)
      - '.claude/' would fail to match at start of string (\\b
        before non-word char requires preceding word char)
    So we only attach \\b where it actually makes sense."""
    escaped = re.escape(term)
    prefix = r"\b" if term[0].isalnum() or term[0] == "_" else ""
    suffix = r"\b" if term[-1].isalnum() or term[-1] == "_" else ""
    return prefix + escaped + suffix


@dataclass
class CorpusOriginResult:
    """Structured output from corpus-origin detection.

    Fields:
      likely_ai_dialogue — best hypothesis about whether this is AI-dialogue
      confidence — 0.0 to 1.0
      primary_platform — e.g. "Claude Code (Anthropic CLI)" or None
      user_name — the corpus author's name if identifiable from context, else None
      agent_persona_names — names the user has assigned to the AI agent(s)
        (e.g. ["Echo", "Sparrow"]). Does NOT include the user's own name.
      evidence — human-readable reasons for the classification
    """

    likely_ai_dialogue: bool
    confidence: float
    primary_platform: Optional[str]
    user_name: Optional[str] = None
    agent_persona_names: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)


# ── Tier 1: cheap heuristic ───────────────────────────────────────────────


def detect_origin_heuristic(samples: list[str]) -> CorpusOriginResult:
    """Fast grep-based detection. No API calls.

    Scores AI-dialogue likelihood by counting:
      - occurrences of well-known AI brand terms
      - turn-marker patterns (user:, assistant:, etc.)

    Returns a CorpusOriginResult with confidence derived from signal density.
    """
    combined = "\n\n".join(samples)
    total_chars = max(1, len(combined))

    # Count UNAMBIGUOUS brand-term hits (case-insensitive — users type
    # lowercase constantly, so 'chatgpt' must trip the same as 'ChatGPT').
    # Word boundaries prevent false in-word matches (see _brand_pattern).
    unambiguous_hits: dict[str, int] = {}
    total_unambiguous = 0
    for term in _AI_UNAMBIGUOUS_TERMS:
        matches = re.findall(_brand_pattern(term), combined, re.IGNORECASE)
        if matches:
            unambiguous_hits[term] = len(matches)
            total_unambiguous += len(matches)

    # Count AMBIGUOUS brand-term hits separately. These will only be
    # counted toward AI-dialogue evidence if the corpus also contains
    # at least one unambiguous AI signal — see co-occurrence rule below.
    ambiguous_hits: dict[str, int] = {}
    total_ambiguous = 0
    for term in _AI_AMBIGUOUS_TERMS:
        matches = re.findall(_brand_pattern(term), combined, re.IGNORECASE)
        if matches:
            ambiguous_hits[term] = len(matches)
            total_ambiguous += len(matches)

    # Count turn-marker hits (case-insensitive — transcripts vary).
    turn_hits = 0
    turn_types_found = set()
    for pattern in _TURN_MARKERS:
        matches = re.findall(pattern, combined, re.IGNORECASE)
        if matches:
            turn_hits += len(matches)
            turn_types_found.add(pattern)

    # Co-occurrence rule for ambiguous terms.
    # Ambiguous terms (e.g. 'Claude' as a French name, 'Gemini' as a zodiac
    # sign, 'Haiku' as a poem form) only count toward brand evidence if
    # the corpus also contains at least one unambiguous AI signal. Otherwise
    # we'd false-positive on French novels, astrology forums, poetry corpora,
    # llama-rancher journals, etc.
    has_ai_context = total_unambiguous > 0 or turn_hits > 0
    counted_brand_hits = total_unambiguous + (total_ambiguous if has_ai_context else 0)

    # Brand-term density per 1000 chars; turn-marker density likewise.
    # Tuned on a small set of examples; these aren't magic numbers and
    # can be revisited as we see more corpora.
    brand_density = counted_brand_hits / (total_chars / 1000)
    turn_density = turn_hits / (total_chars / 1000)

    # Build evidence list
    evidence: list[str] = []
    shown_hits = dict(unambiguous_hits)
    if has_ai_context:
        shown_hits.update(ambiguous_hits)
    if shown_hits:
        top_terms = sorted(shown_hits.items(), key=lambda x: -x[1])[:5]
        evidence.append("AI brand terms: " + ", ".join(f"'{k}' ({v}x)" for k, v in top_terms))
    elif ambiguous_hits and not has_ai_context:
        # Be transparent that we saw ambiguous matches but suppressed them
        # for lack of co-occurring AI context.
        suppressed = sorted(ambiguous_hits.items(), key=lambda x: -x[1])[:3]
        evidence.append(
            "Ambiguous terms present but suppressed (no co-occurring AI signal): "
            + ", ".join(f"'{k}' ({v}x)" for k, v in suppressed)
        )
    if turn_hits:
        evidence.append(
            f"Turn markers detected: {turn_hits} occurrences across {len(turn_types_found)} pattern types"
        )

    # Decision logic:
    #   strong signal (brand density OR turn density at or above its
    #     threshold) → confident AI-dialogue
    #   MEANINGFUL absence (enough text, zero brand, zero turn) → confident narrative
    #   ambiguous or insufficient text → default stance: AI-dialogue with low confidence
    #
    # Threshold for "meaningful absence": the samples collectively have to
    # be long enough that the absence of AI signals would be expected to
    # surface if the corpus really is narrative. 150 chars is the working
    # floor — below that, we cannot confidently say "this is narrative."
    MEANINGFUL_TEXT_FLOOR = 150

    if brand_density >= 0.5 or turn_density >= 2.0:
        return CorpusOriginResult(
            likely_ai_dialogue=True,
            confidence=min(0.95, 0.6 + 0.1 * (brand_density + turn_density)),
            primary_platform=None,  # tier 2 will refine
            evidence=evidence,
        )

    if counted_brand_hits == 0 and turn_hits == 0 and total_chars >= MEANINGFUL_TEXT_FLOOR:
        # Note: ambiguous-only matches (e.g. a French novel with 'Claude' as
        # a character name) flow through here because counted_brand_hits == 0
        # when no unambiguous AI signal co-occurs. The 'evidence' list still
        # records that the ambiguous matches were seen and suppressed.
        narrative_evidence = list(evidence) + [
            f"no unambiguous AI signal across {total_chars} chars of text — pure narrative"
        ]
        return CorpusOriginResult(
            likely_ai_dialogue=False,
            confidence=0.9,
            primary_platform=None,
            evidence=narrative_evidence,
        )

    # Ambiguous or too-short-to-tell case: default stance is AI-dialogue
    # with explicit low confidence. Tier 2 (LLM) should be called to confirm.
    reason = "weak signal" if (counted_brand_hits or turn_hits) else "insufficient text"
    return CorpusOriginResult(
        likely_ai_dialogue=True,
        confidence=0.4,
        primary_platform=None,
        evidence=evidence
        + [
            f"{reason} — applying default-stance (ai_dialogue=True, low confidence). "
            "Tier 2 LLM check recommended to confirm or override."
        ],
    )


# ── Tier 2: LLM-assisted confirmation + persona extraction ────────────────

_SYSTEM_PROMPT = """You are analyzing a corpus of text to determine whether it is a \
record of conversations with an AI agent (e.g. Claude, ChatGPT, Gemini, custom LLM \
apps), or some other kind of text (personal narrative, story, research notes, \
journal, code, etc.).

Use your pre-existing knowledge of well-known AI platforms. You don't need the \
corpus to explain what Claude or ChatGPT is — you already know. Your job is to \
detect evidence of their presence and identify what persona-names the user has \
assigned to the agent(s) they converse with.

CRITICAL distinction:
- agent_persona_names are names the USER has assigned to the AI AGENT(S)
  they converse with. Example: "Echo", "Sparrow", "Henry" might be names
  the user calls a Claude instance they're building a relationship with.
- Do NOT include the USER's own name in agent_persona_names. The user
  is the human author of the corpus, not a persona of the agent. Even
  if the user's name appears frequently in the text (writing about
  themselves), that is NOT an agent persona.
- If you can identify the user's name from context, put it in user_name
  (separate field). If unclear, leave user_name null.

Respond with JSON only (no prose before or after):
{
  "is_ai_dialogue_corpus": <true|false>,
  "confidence": <0.0 to 1.0>,
  "primary_platform": <"Claude (Anthropic)" | "ChatGPT (OpenAI)" | "Gemini (Google)" | other platform name | null>,
  "user_name": <user's name if clearly identifiable from context, else null>,
  "agent_persona_names": [<names the user has assigned to the AI AGENT(S), NOT the user's own name>],
  "evidence": [<short bullet strings explaining the decision>]
}

Default stance: if evidence is thin or mixed, return is_ai_dialogue_corpus=true \
with low confidence. False-negatives on AI-dialogue detection break downstream \
classification; false-positives are recoverable later.
"""


def _extract_json(text: str) -> Optional[dict]:
    """Pull the first JSON object out of a possibly-messy LLM response."""
    text = text.strip()
    if not text:
        return None
    # Try straight parse first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Try to find a {...} block
    start = text.find("{")
    if start < 0:
        return None
    depth = 0
    in_string = False
    escape = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string, braces are literal; only track escapes
            # and the closing quote.
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                candidate = text[start : i + 1]
                try:
                    return json.loads(candidate)
                except json.JSONDecodeError:
                    return None
    return None


def detect_origin_llm(samples: list[str], provider) -> CorpusOriginResult:
    """LLM-assisted detection. Takes samples (list of drawer-text excerpts)
    and an LLMProvider (mempalace.llm_client.LLMProvider). Returns the
    same CorpusOriginResult shape as the heuristic.

    Falls back conservatively (default-stance ai=True, low confidence)
    on any LLM error or malformed response — never raises.
    """
    # Build the user prompt: concise excerpts, capped so we stay cheap
    max_excerpt_chars = 800
    excerpts = "\n\n---\n\n".join(
        f"[sample {i + 1}]\n{s[:max_excerpt_chars]}" for i, s in enumerate(samples[:20])
    )
    user_prompt = f"CORPUS EXCERPTS:\n\n{excerpts}\n\nAnalyze and respond with JSON."

    try:
        resp = provider.classify(system=_SYSTEM_PROMPT, user=user_prompt, json_mode=True)
        raw = getattr(resp, "text", "") or ""
    except Exception as e:
        return CorpusOriginResult(
            likely_ai_dialogue=True,
            confidence=0.3,
            primary_platform=None,
            evidence=[f"LLM provider error (fallback to default stance): {e}"],
        )

    parsed = _extract_json(raw)
    if not parsed or not isinstance(parsed, dict):
        return CorpusOriginResult(
            likely_ai_dialogue=True,
            confidence=0.3,
            primary_platform=None,
            evidence=["LLM response was not valid JSON (fallback to default stance)"],
        )

    # Pull fields defensively: wrong-typed values are dropped rather than
    # allowed to raise, which keeps the "never raises" contract above.
    user_name = parsed.get("user_name")
    if not isinstance(user_name, str) or not user_name:
        user_name = None
    raw_personas = parsed.get("agent_persona_names")
    personas = [p for p in raw_personas if isinstance(p, str)] if isinstance(raw_personas, list) else []
    # If the LLM leaked the user_name into agent_persona_names despite the
    # prompt telling it not to, filter it out.
    if user_name:
        personas = [p for p in personas if p.lower() != user_name.lower()]
    try:
        confidence = min(1.0, max(0.0, float(parsed.get("confidence", 0.5))))
    except (TypeError, ValueError):
        confidence = 0.5  # e.g. the model answered "high" or null
    raw_evidence = parsed.get("evidence")
    evidence = list(raw_evidence) if isinstance(raw_evidence, list) else []

    return CorpusOriginResult(
        likely_ai_dialogue=bool(parsed.get("is_ai_dialogue_corpus", True)),
        confidence=confidence,
        primary_platform=parsed.get("primary_platform") or None,
        user_name=user_name,
        agent_persona_names=personas,
        evidence=evidence,
    )
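
The edge-aware word boundary in `_brand_pattern` is easiest to see next to the naive alternative. The block below restates the same logic under the name `brand_pattern` so it can run on its own; the sample strings are made up for the demonstration.

```python
import re


def brand_pattern(term: str) -> str:
    # Same idea as _brand_pattern: attach \b only on an edge where the term
    # itself has a word character.
    escaped = re.escape(term)
    prefix = r"\b" if term[0].isalnum() or term[0] == "_" else ""
    suffix = r"\b" if term[-1].isalnum() or term[-1] == "_" else ""
    return prefix + escaped + suffix


def naive_pattern(term: str) -> str:
    # Always wraps the term in \b, which breaks for terms that start or end
    # with punctuation.
    return r"\b" + re.escape(term) + r"\b"


path_text = ".claude/settings.json"
edge_aware_match = re.search(brand_pattern(".claude/"), path_text)
naive_match = re.search(naive_pattern(".claude/"), path_text)

in_word_match = re.search(brand_pattern("Claude"), "Claudette", re.IGNORECASE)
whole_word_match = re.search(brand_pattern("Claude"), "ask claude now", re.IGNORECASE)
```

`edge_aware_match` finds `.claude/` at the start of the string while `naive_match` is `None`, because `\b` cannot match between the start of the string and a `.`. The `Claude` pattern still rejects `Claudette` and accepts the lowercase whole word.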