diff --git a/CHANGELOG.md b/CHANGELOG.md
index b5433ac..25b7853 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,12 +13,16 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
- **`mempalace init` now prompts to mine the same directory.** After entity confirmation, room detection, and gitignore guard, `init` shows a one-line scope estimate (e.g. `~423 files (~12 MB) would be mined into this palace.`) computed from its existing corpus walk, then asks `Mine this directory now? [Y/n]` (default yes) and runs `mine()` in-process if accepted. The estimate fires before the prompt so users on a real corpus aren't surprised by a minutes-long ChromaDB write. Declining prints the exact `mempalace mine ` command for later. (#1181)
- **New `--auto-mine` flag on `mempalace init`** for the non-interactive path (`mempalace init --auto-mine ` skips the mine prompt and runs mine directly). `--yes` retains its existing scope of entity auto-accept only and still prompts for the mine step, so existing scripted callers see no behaviour change; combining `--yes --auto-mine` gives a fully non-interactive setup. (#1181)
- **Cross-wing topic tunnels.** When two wings have confirmed `TOPIC` labels in common (the LLM-refine bucket from `mempalace init --llm`), the miner now drops a symmetric tunnel between them at mine time so the palace graph reflects shared themes (frameworks, vendors, recurring concepts). Tunnels are routed through the existing `create_tunnel` storage so they share dedup and persistence with explicit tunnels. Topic tunnels are stored under a synthetic `topic:` room and tagged with `kind: "topic"` on the stored dict — this keeps them distinct from literal folder-derived rooms of the same name (a wing with both an `Angular` folder room and an `Angular` topic tunnel no longer collides at `follow_tunnels` read time) and gives LLMs scanning `list_tunnels` a visible discriminator. Threshold is configurable via `MEMPALACE_TOPIC_TUNNEL_MIN_COUNT` env var or `topic_tunnel_min_count` in `~/.mempalace/config.json` (default `1`). Manifest-dependency overlap and per-topic allow/deny lists remain out of scope. (#1180)
+- **Context-aware corpus detection at `mempalace init`.** A new Pass 0 runs at the start of `init` — before entity detection — and answers one question: *is this corpus an AI-dialogue record, and if so, which platform and what persona names has the user assigned to the agents?* Tier 1 is a free regex heuristic (well-known AI brand terms + turn-marker patterns, with a co-occurrence rule that suppresses ambiguous terms like `Claude`/`Gemini`/`Haiku` when no unambiguous AI signal is present, so French novels and astrology forums don't false-positive). Tier 2 is an LLM call (~$0.01 with Anthropic Haiku, free with local Ollama/LM Studio/llama.cpp/vLLM) that extracts `user_name` and `agent_persona_names` from dialogue structure. Result is persisted to `/.mempalace/origin.json` with a `schema_version: 1` envelope so downstream tools can read it. Entity classification then routes names matching `agent_persona_names` (case-insensitive) into a new `agent_personas` bucket instead of `people`, so a Claude Code transcript no longer misclassifies the user's `Echo`/`Sparrow`/`Cipher` agents as biological people. `llm_refine` receives the same context as a system-prompt preamble so it can disambiguate other ambiguous candidates with corpus-level knowledge too. Backwards compatible: callers that don't pass `corpus_origin` see the v3.3.3 return shape unchanged. (#TBD)
+- **`mempalace init` runs LLM-assisted refinement by default.** v3.3.3 made `--llm` opt-in; the LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications) so it now runs by default. Provider precedence is unchanged — Ollama at `http://localhost:11434` first, then openai-compat, then anthropic with API key. **Never blocks init on a missing LLM**: if no provider is reachable (Ollama not running, no API key set), init prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. `--no-llm` is the new explicit opt-out. The legacy `--llm` flag is preserved as a deprecated alias of the default so scripted callers see no behaviour change. Cost story: zero for users with a local LLM (the majority on this repo), ~$0.01 per init for users with `ANTHROPIC_API_KEY` set who explicitly choose `--llm-provider anthropic`, zero for users with no LLM (graceful fallback). (#TBD)
+- **`mempalace mine --redetect-origin` flag.** Re-runs corpus-origin detection on the current corpus state and overwrites `/.mempalace/origin.json`. Useful when the corpus has grown since `mempalace init` and the stored origin may be stale. Heuristic-only by design (the flag is meant to be cheap); re-run `mempalace init` for full Tier 2 LLM refinement. Default `mempalace mine` does not touch `origin.json` — the flag is opt-in. (#TBD)
### Bug Fixes
- **CLI `mempalace search` retrieval quality.** The CLI was using pure ChromaDB cosine distance with no BM25 rerank, so drawers containing every query term but embedding as noise (directory listings, diff output, shell logs) scored `Match: 0.0` alongside genuinely irrelevant results with no way to tell them apart. Wired the CLI through the same `_hybrid_rank` the `mempalace_search` MCP tool already used, and surfaced both `cosine=` and `bm25=` scores in the output so users see which component of the match is firing. MCP search was unaffected; this fixes the human-facing CLI parity gap.
- **Legacy-palace distance-metric warning.** CLI search now detects palaces created before `hnsw:space=cosine` was consistently set and prints a one-line notice pointing at `mempalace repair`. Without the warning such palaces silently used L2 distance, under which the similarity display floored every result to `Match: 0.0`. New palaces mined today already set cosine correctly and now have invariant tests pinning that behavior so future refactors can't silently regress it. (#1179)
- **Graceful Ctrl-C during `mempalace mine`.** Interrupting a long mine no longer dumps a multi-frame `KeyboardInterrupt` traceback. The main file-processing loop now catches the signal, prints `files_processed: N/M`, `drawers_filed: K`, and `last_file:` so the user knows what landed, then exits with code 130 (standard SIGINT). Already-filed drawers are upserted idempotently on re-mine via deterministic IDs, so resuming is safe. The hooks PID lock at `~/.mempalace/hook_state/mine.pid` is now also actively cleaned up in a `finally` when its entry points at us — clean exit, error, or interrupt — preventing the next hook fire from briefly waiting on a stale PID. (#1182)
+- **`mempalace init` is now idempotent across re-runs.** Running `init` twice on the same project produced different `origin.json` results because the first run wrote `entities.json` into the project directory, and the second run's corpus-origin sampling included that file as corpus content — shifting Tier 1's character-density math. Sampling now skips the per-project artifacts (`entities.json`, `mempalace.yaml`), so re-running `init` produces the same classification it did the first time. Pinned by an integration test in `tests/test_corpus_origin_integration.py`. (#TBD)
---
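Reviewer note, not part of the patch: the `origin.json` envelope described in the changelog entries above could be consumed by a downstream tool roughly as sketched here. `load_origin` and `SUPPORTED_SCHEMA_VERSION` are hypothetical names introduced for illustration; the keys `schema_version` and `result` and the `.mempalace/origin.json` location are taken from `_run_pass_zero` in this diff, and the palace directory is assumed to be supplied by the caller.

```python
import json
from pathlib import Path
from typing import Optional

SUPPORTED_SCHEMA_VERSION = 1  # the only envelope version this diff writes


def load_origin(palace_dir: str) -> Optional[dict]:
    """Hypothetical reader: return the detection result dict, or None when
    the file is missing, unreadable, or uses an unknown schema version."""
    origin_path = Path(palace_dir).expanduser() / ".mempalace" / "origin.json"
    try:
        with open(origin_path, encoding="utf-8") as f:
            wrapped = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(wrapped, dict):
        return None
    # Refuse envelopes written by a newer (or older) schema rather than guess.
    if wrapped.get("schema_version") != SUPPORTED_SCHEMA_VERSION:
        return None
    result = wrapped.get("result")
    return result if isinstance(result, dict) else None
```

Checking `schema_version` before touching `result` is the point of the envelope: a reader can decline a file it does not understand instead of misreading its fields.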
diff --git a/CLAUDE.md b/CLAUDE.md
index 27fd8fb..13dfac3 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -22,7 +22,7 @@ These are non-negotiable. Every PR, every feature, every refactor must honor the
- **Verbatim always** — Never summarize, paraphrase, or lossy-compress user data. The system searches the index and returns the original words. If a user said it, we store exactly what they said. This is the foundational promise.
- **Incremental only** — Append-only ingest after initial build. Never destroy existing data to rebuild. A crash mid-operation must leave the existing palace untouched.
- **Entity-first** — Everything is keyed by real names with disambiguation by DOB, ID, or context. People matter more than topics.
-- **Local-first, zero API** — All extraction, chunking, and embedding happens on the user's machine. No cloud dependency for memory operations. No API keys required.
+- **Local-first, zero external API by default** — All extraction, chunking, embedding, and LLM-assisted refinement happens on the user's machine by default, using locally-hosted runtimes (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio, etc.). External providers (Anthropic, OpenAI, Google) are supported via BYOK but are never required and never enabled silently. The system never sends user content to a service the user has not explicitly configured. "Local LLM" is not an external API — Ollama and equivalents running on localhost are part of the user's machine. External BYOK is always a deliberate user choice, never a default and never a silent fallback.
- **Performance budgets** — Hooks under 500ms. Startup injection under 100ms. Memory should feel instant.
- **Privacy by architecture** — The system physically cannot send your data because it never leaves your machine. No telemetry, no phone-home, no external service dependencies for core operations.
- **Background everything** — Filing, indexing, timestamps, and pipeline work happen via hooks in the background. Nothing interrupts the user's conversation. Zero tokens spent on bookkeeping in the chat window.
diff --git a/mempalace/cli.py b/mempalace/cli.py
index e4ddba0..84b8e04 100644
--- a/mempalace/cli.py
+++ b/mempalace/cli.py
@@ -34,11 +34,143 @@ import argparse
from pathlib import Path
from .config import MempalaceConfig
+from .corpus_origin import detect_origin_heuristic, detect_origin_llm
+from .llm_client import LLMError, get_provider
from .version import __version__
_MEMPALACE_PROJECT_FILES = ("mempalace.yaml", "entities.json")
+# Pass 0 corpus-origin sampling caps. Tier 1 reads FULL file content (no
+# front-bias sampling) but bounds total memory on enormous corpora. Tier 2
+# trims to a smaller view because LLM context windows are finite.
+_PASS_ZERO_MAX_FILES = 30
+_PASS_ZERO_PER_FILE_CAP = 100_000 # 100KB per file is generous for prose
+_PASS_ZERO_TOTAL_CAP = 5_000_000 # 5MB total ceiling — bounds memory
+_PASS_ZERO_LLM_PER_SAMPLE = 2_000 # for Tier 2 LLM call only
+_PASS_ZERO_LLM_MAX_SAMPLES = 20 # caps the LLM-tier sample count
+
+
+def _gather_origin_samples(project_dir) -> list:
+ """Collect Tier-1 samples for corpus-origin detection.
+
+ Reads FULL file content (capped at ``_PASS_ZERO_PER_FILE_CAP`` per file
+ and ``_PASS_ZERO_TOTAL_CAP`` overall). No front-bias sampling — AI
+ signal that lives past the first N chars of a file must still trip
+ detection, so we read the whole file up to the cap.
+
+ Skips mempalace's own per-project artifacts (``entities.json``,
+ ``mempalace.yaml``) so a re-run of ``mempalace init`` produces the
+ same classification result it did on the first run. Without this
+ filter, the first run writes entities.json into the corpus, the
+ second run picks it up as a sample, and the Tier-1 density math
+ drifts (different total_chars). That makes init non-idempotent.
+
+ Returns a list of strings (one per readable file). Empty list when
+ the project has no readable text.
+ """
+ from .entity_detector import scan_for_detection
+
+ files = scan_for_detection(project_dir, max_files=_PASS_ZERO_MAX_FILES)
+ samples: list = []
+ total_chars = 0
+ for filepath in files:
+ if filepath.name in _MEMPALACE_PROJECT_FILES:
+ continue
+ if total_chars >= _PASS_ZERO_TOTAL_CAP:
+ break
+ try:
+ with open(filepath, encoding="utf-8", errors="replace") as f:
+ content = f.read(_PASS_ZERO_PER_FILE_CAP)
+ except OSError:
+ continue
+ if not content:
+ continue
+ samples.append(content)
+ total_chars += len(content)
+ return samples
+
+
+def _trim_samples_for_llm(samples: list) -> list:
+ """Reduce Tier-1 full-content samples to LLM-friendly size.
+
+ Tier 2 hits an LLM with a finite context window — we trim each sample
+ to ``_PASS_ZERO_LLM_PER_SAMPLE`` chars and cap the overall sample
+ count at ``_PASS_ZERO_LLM_MAX_SAMPLES``.
+ """
+ return [s[:_PASS_ZERO_LLM_PER_SAMPLE] for s in samples[:_PASS_ZERO_LLM_MAX_SAMPLES]]
+
+
+def _run_pass_zero(project_dir, palace_dir, llm_provider) -> "dict | None":
+ """Pass 0: detect whether the corpus is AI-dialogue and persist the
+ result to ``/.mempalace/origin.json``.
+
+ Returns the wrapped result dict (same shape as origin.json) on success,
+ or ``None`` when there are no readable samples to detect from. The
+ return value is what cmd_init forwards to ``discover_entities`` via
+ the ``corpus_origin`` kwarg.
+
+ File-write failures (e.g. read-only palace) are caught and reported on
+ stderr; init never blocks on them.
+ """
+ import json
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ samples = _gather_origin_samples(project_dir)
+ if not samples:
+ print(" Skipping corpus-origin detection — no readable samples.")
+ return None
+
+ # Tier 1 — always runs. Cheap regex grep, no API.
+ result = detect_origin_heuristic(samples)
+
+ # Tier 2 — runs only when an LLM provider is available. The provider
+ # contract is best-effort: corpus_origin internally falls back to a
+ # conservative default on transport/parse failure, so we don't need a
+ # try/except here, but we still keep one for any unforeseen exception.
+ if llm_provider is not None:
+ try:
+ llm_result = detect_origin_llm(_trim_samples_for_llm(samples), llm_provider)
+ # LLM-tier result wins on platform/persona/user fields; keep the
+ # heuristic evidence appended so the on-disk record retains the
+ # cheap-tier signal trail.
+ llm_result.evidence = list(llm_result.evidence) + [
+ f"Tier-1 heuristic: {e}" for e in result.evidence
+ ]
+ result = llm_result
+ except Exception as exc: # noqa: BLE001 — never block init on LLM failure
+ print(f" LLM corpus-origin tier failed ({exc}); using heuristic only.")
+
+ wrapped = {
+ "schema_version": 1,
+ "detected_at": datetime.now(timezone.utc).isoformat(),
+ "result": result.to_dict(),
+ }
+
+ origin_path = Path(palace_dir).expanduser() / ".mempalace" / "origin.json"
+ try:
+ origin_path.parent.mkdir(parents=True, exist_ok=True)
+ with open(origin_path, "w", encoding="utf-8") as f:
+ json.dump(wrapped, f, indent=2, ensure_ascii=False)
+ except OSError as exc:
+ print(f" Could not write {origin_path}: {exc}", file=sys.stderr)
+ # Return the wrapped dict anyway so the in-memory pipeline still
+ # benefits from the detection result this run.
+ return wrapped
+
+ # Banner — one line, two-space indent matching existing init style.
+ res = result
+ if res.likely_ai_dialogue:
+ platform = res.primary_platform or "AI dialogue (platform unidentified)"
+ user = res.user_name or "—"
+ agents = ", ".join(res.agent_persona_names) if res.agent_persona_names else "—"
+ print(f" Detected: {platform} (user: {user}, agents: {agents})")
+ else:
+ print(f" Corpus origin: not AI-dialogue (confidence: {res.confidence:.2f})")
+
+ return wrapped
+
def _ensure_mempalace_files_gitignored(project_dir) -> bool:
"""If project_dir is a git repo, ensure MemPalace's per-project files
@@ -86,29 +218,46 @@ def cmd_init(args):
languages = cfg.entity_languages
languages_tuple = tuple(languages)
- # Optional phase-2 LLM provider (opt-in via --llm).
+ # --llm is ON by default. --no-llm is the explicit opt-out. Provider
+ # precedence is unchanged (Ollama localhost first, then openai-compat,
+ # then anthropic). Never block init on a missing LLM: when no provider
+ # responds, print a one-line message pointing at --no-llm and fall
+ # through to heuristics-only.
llm_provider = None
- if getattr(args, "llm", False):
- from .llm_client import LLMError, get_provider
-
+ if not getattr(args, "no_llm", False):
+ provider_name = getattr(args, "llm_provider", "ollama") or "ollama"
+ provider_model = getattr(args, "llm_model", "gemma4:e4b") or "gemma4:e4b"
try:
- llm_provider = get_provider(
- name=args.llm_provider,
- model=args.llm_model,
- endpoint=args.llm_endpoint,
- api_key=args.llm_api_key,
+ candidate = get_provider(
+ name=provider_name,
+ model=provider_model,
+ endpoint=getattr(args, "llm_endpoint", None),
+ api_key=getattr(args, "llm_api_key", None),
)
+ ok, msg = candidate.check_available()
+ if ok:
+ llm_provider = candidate
+ print(f" LLM enabled: {provider_name}/{provider_model}")
+ else:
+ print(
+ f" No LLM provider reachable ({msg}). "
+ f"Running heuristics-only — pass --no-llm to silence this."
+ )
except LLMError as e:
- print(f" ERROR: {e}", file=sys.stderr)
- sys.exit(2)
- ok, msg = llm_provider.check_available()
- if not ok:
print(
- f" ERROR: LLM provider '{args.llm_provider}' unavailable: {msg}",
- file=sys.stderr,
+ f" LLM init failed ({e}). "
+ f"Running heuristics-only — pass --no-llm to silence this."
)
- sys.exit(2)
- print(f" LLM refinement enabled: {args.llm_provider}/{args.llm_model}")
+
+ # Pass 0: detect whether the corpus is AI-dialogue. Writes
+ # /.mempalace/origin.json and supplies corpus context to the
+ # entity classifier so it can correctly handle agent persona names
+ # (e.g. "Echo", "Sparrow") without misclassifying them as people.
+ corpus_origin = _run_pass_zero(
+ project_dir=args.dir,
+ palace_dir=cfg.palace_path,
+ llm_provider=llm_provider,
+ )
# Pass 1: discover entities — manifests + git authors first, prose detection
# as supplement for names mentioned only in docs/notes. Optional phase-2
@@ -116,7 +265,12 @@ def cmd_init(args):
print(f"\n Scanning for entities in: {args.dir}")
if languages_tuple != ("en",):
print(f" Languages: {', '.join(languages_tuple)}")
- detected = discover_entities(args.dir, languages=languages_tuple, llm_provider=llm_provider)
+ detected = discover_entities(
+ args.dir,
+ languages=languages_tuple,
+ llm_provider=llm_provider,
+ corpus_origin=corpus_origin,
+ )
total = (
len(detected["people"])
+ len(detected["projects"])
@@ -264,6 +418,16 @@ def cmd_mine(args):
for raw in args.include_ignored or []:
include_ignored.extend(part.strip() for part in raw.split(",") if part.strip())
+ # --redetect-origin re-runs corpus_origin on the current corpus state
+ # and overwrites /.mempalace/origin.json before mining proceeds.
+ # Heuristic-only by design — full LLM detection lives on `mempalace init`.
+ if getattr(args, "redetect_origin", False):
+ _run_pass_zero(
+ project_dir=args.dir,
+ palace_dir=palace_path,
+ llm_provider=None,
+ )
+
if args.mode == "convos":
from .convo_miner import mine_convos
@@ -728,17 +892,25 @@ def main():
"--llm",
action="store_true",
help=(
- "Enable LLM-assisted entity refinement (opt-in, local-first). "
- "Runs after manifest/git/regex detection, asking the configured "
- "provider to reclassify ambiguous candidates. "
- "Ctrl-C during refinement returns partial results."
+ "DEPRECATED — LLM-assisted entity refinement is now ON by default. "
+ "This flag is preserved for backward compatibility; pass --no-llm "
+ "to opt out instead."
+ ),
+ )
+ p_init.add_argument(
+ "--no-llm",
+ action="store_true",
+ help=(
+ "Disable LLM-assisted entity refinement. Run init in heuristics-only "
+ "mode (no provider acquisition, no LLM calls). Use when running "
+ "without a local LLM and you don't want the graceful-fallback message."
),
)
p_init.add_argument(
"--llm-provider",
default="ollama",
choices=["ollama", "openai-compat", "anthropic"],
- help="LLM provider (default: ollama). Use --llm to enable.",
+ help="LLM provider (default: ollama). Pass --no-llm to disable LLM-assisted refinement entirely.",
)
p_init.add_argument(
"--llm-model",
@@ -789,6 +961,17 @@ def main():
help="Your name — recorded on every drawer (default: mempalace)",
)
p_mine.add_argument("--limit", type=int, default=0, help="Max files to process (0 = all)")
+ p_mine.add_argument(
+ "--redetect-origin",
+ action="store_true",
+ help=(
+ "Re-run corpus_origin detection on this directory and overwrite "
+ "/.mempalace/origin.json. Useful when the corpus has grown "
+ "since `mempalace init` and the stored origin may be stale. "
+ "Heuristic-only (no LLM call); re-run `mempalace init` for "
+ "Tier 2 refinement."
+ ),
+ )
p_mine.add_argument(
"--dry-run", action="store_true", help="Show what would be filed without filing"
)
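Reviewer note, not part of the patch: the two-stage sampling in `_gather_origin_samples` and `_trim_samples_for_llm` can be exercised in isolation with the sketch below. It works on in-memory strings instead of files, and `gather` and `trim_for_llm` are hypothetical stand-ins for the real helpers; the cap values mirror the `_PASS_ZERO_*` constants in this diff.

```python
PER_FILE_CAP = 100_000   # mirrors _PASS_ZERO_PER_FILE_CAP
TOTAL_CAP = 5_000_000    # mirrors _PASS_ZERO_TOTAL_CAP
LLM_PER_SAMPLE = 2_000   # mirrors _PASS_ZERO_LLM_PER_SAMPLE
LLM_MAX_SAMPLES = 20     # mirrors _PASS_ZERO_LLM_MAX_SAMPLES


def gather(contents, per_file_cap=PER_FILE_CAP, total_cap=TOTAL_CAP):
    """Tier-1 view: keep each text up to per_file_cap characters and stop
    starting new samples once the running total has reached total_cap."""
    samples = []
    total = 0
    for text in contents:
        if total >= total_cap:
            break
        chunk = text[:per_file_cap]
        if not chunk:
            continue  # empty inputs contribute no sample
        samples.append(chunk)
        total += len(chunk)
    return samples


def trim_for_llm(samples, per_sample=LLM_PER_SAMPLE, max_samples=LLM_MAX_SAMPLES):
    """Tier-2 view: first max_samples samples, each cut to per_sample chars."""
    return [s[:per_sample] for s in samples[:max_samples]]
```

Because the total cap is checked before each read rather than after, the collected text can exceed the total cap by up to one per-file cap. With the values above that bounds the Tier-1 input at roughly 5.1 million characters.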
diff --git a/mempalace/corpus_origin.py b/mempalace/corpus_origin.py
new file mode 100644
index 0000000..12d34ab
--- /dev/null
+++ b/mempalace/corpus_origin.py
@@ -0,0 +1,422 @@
+"""
+corpus_origin.py — Detect whether a corpus is an AI-dialogue record and,
+if so, what platform and what persona names the user has assigned to the
+agent.
+
+This is the first question any downstream Pass 2 classification needs
+answered. Without it, a drawer like "my three sons" in a Claude Code
+dialogue corpus can't be correctly resolved to "three AI instances"
+rather than "three biological children."
+
+Two-tier detection:
+
+ Tier 1 — detect_origin_heuristic(samples)
+ Cheap, no API. Grep for well-known AI brand terms + turn
+ markers. Always runs. Outputs a hypothesis.
+
+ Tier 2 — detect_origin_llm(samples, provider)
+ Uses an LLMProvider (see mempalace.llm_client)
+ with the model's pre-trained knowledge of Claude/ChatGPT/Gemini
+ etc. Confirms platform, extracts agent persona-names the user
+ has assigned. One call: ~$0.01 with Anthropic Haiku, free with a local provider.
+
+Design principle:
+ Don't make the classifier re-discover what Claude, ChatGPT, Gemini, MCP,
+ or other well-known entities ARE — the LLM already knows them from its
+ training. Only corpus-specific entities (e.g. the user's persona-name
+ for their Claude instance) need discovery.
+
+Default stance (when evidence is thin):
+ "This IS an AI-dialogue corpus" — false-negative is catastrophic for
+ downstream classification; false-positive is recoverable via per-drawer
+ voice-profile detection in later passes.
+"""
+
+from __future__ import annotations
+
+import json
+import re
+from dataclasses import dataclass, field, asdict
+from typing import Optional
+
+
+# ── Well-known AI brand terms (expand as new platforms emerge) ────────────
+# Detection is by PATTERN + CONTEXT, not by capitalization or English-language
+# rules. Two categories:
+#
+# UNAMBIGUOUS — terms that have essentially no meaning outside of AI context.
+# Always counted toward AI-dialogue evidence.
+#
+# AMBIGUOUS — terms that share a string with common English words, names,
+# poetry forms, zodiac signs, animals, etc. Counted toward AI-dialogue
+# evidence ONLY when at least one unambiguous AI signal also appears in
+# the corpus (turn marker, unambiguous brand term, or AI infrastructure
+# term). This avoids false-positives on French novels with characters
+# named "Claude", astrology corpora discussing "Gemini", poetry corpora
+# full of "haiku" / "sonnet", etc.
+#
+# All matching is CASE-INSENSITIVE — users type lowercase constantly.
+
+_AI_UNAMBIGUOUS_TERMS = [
+ # Anthropic-specific
+ "Anthropic",
+ "Claude Code",
+ "Claude 3",
+ "Claude 4",
+ "claude mcp",
+ "CLAUDE.md",
+ ".claude/",
+ # OpenAI-specific
+ "ChatGPT",
+ "GPT-4",
+ "GPT-3",
+ "GPT-5",
+ "OpenAI",
+ "gpt-4o",
+ "gpt-4-turbo",
+ "o1-preview",
+ "o3",
+ # Google-specific
+ "gemini-pro",
+ "gemini-1.5",
+ "Google AI",
+ # Meta / others (specific model identifiers, not bare common words)
+ "Mixtral",
+ "Cohere",
+ # AI-infrastructure terms. Caveat: matching is case-insensitive, so "RAG"
+ # also matches the ordinary word "rag", and "embedding" occurs in non-AI prose.
+ "MCP",
+ "LLM",
+ "RAG",
+ "fine-tune",
+ "context window",
+ "embedding",
+]
+
+_AI_AMBIGUOUS_TERMS = [
+ # Anthropic — bare brand/model names that collide with names + poetry
+ "Claude", # also a common French masculine name
+ "Opus", # also a musical work, comic strip, magazine
+ "Sonnet", # also a 14-line poem form
+ "Haiku", # also a 17-syllable poem form
+ # Google — bare brand that collides with zodiac sign
+ "Gemini", # also the zodiac sign
+ "Bard", # also a poet / Shakespeare
+ # Meta / others
+ "Llama", # also the South American animal
+ "Mistral", # also a Mediterranean wind
+ # Note: 'prompt', 'completion', 'tokens' previously lived here but were
+ # removed: they're suppressed without an unambiguous co-signal anyway,
+ # and by the time a co-signal is present the corpus is already flagged.
+ # Keeping them just produced noisier evidence strings.
+]
+
+# Turn-marker patterns commonly seen in AI-dialogue transcripts
+_TURN_MARKERS = [
+ r"\buser\s*:\s*",
+ r"\bassistant\s*:\s*",
+ r"\bhuman\s*:\s*",
+ r"\bai\s*:\s*",
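+ # No leading \b here: ">" is a non-word character, so \b before it would
+ # only match directly after a word character, never at the start of a line.
+ r">>>\s*User\b",
+ r">>>\s*Assistant\b",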
+]
+
+
+def _brand_pattern(term: str) -> str:
+ """Build a regex for a brand term that uses word boundaries
+ only on edges where the term itself starts/ends with a word
+ character. Without this nuance:
+ - 'Claude' would falsely match inside 'Claudette' (no \\b)
+ - '.claude/' would fail to match at start of string (\\b
+ before non-word char requires preceding word char)
+ So we only attach \\b where it actually makes sense."""
+ escaped = re.escape(term)
+ prefix = r"\b" if term[0].isalnum() or term[0] == "_" else ""
+ suffix = r"\b" if term[-1].isalnum() or term[-1] == "_" else ""
+ return prefix + escaped + suffix
+
+
+@dataclass
+class CorpusOriginResult:
+ """Structured output from corpus-origin detection.
+
+ Fields:
+ likely_ai_dialogue — best hypothesis about whether this is AI-dialogue
+ confidence — 0.0 to 1.0
+ primary_platform — e.g. "Claude Code (Anthropic CLI)" or None
+ user_name — the corpus author's name if identifiable from context, else None
+ agent_persona_names — names the user has assigned to the AI agent(s)
+ (e.g. ["Echo", "Sparrow"]). Does NOT include the user's own name.
+ evidence — human-readable reasons for the classification
+ """
+
+ likely_ai_dialogue: bool
+ confidence: float
+ primary_platform: Optional[str]
+ user_name: Optional[str] = None
+ agent_persona_names: list[str] = field(default_factory=list)
+ evidence: list[str] = field(default_factory=list)
+
+ def to_dict(self) -> dict:
+ return asdict(self)
+
+
+# ── Tier 1: cheap heuristic ───────────────────────────────────────────────
+
+
+def detect_origin_heuristic(samples: list[str]) -> CorpusOriginResult:
+ """Fast grep-based detection. No API calls.
+
+ Scores AI-dialogue likelihood by counting:
+ - occurrences of well-known AI brand terms
+ - turn-marker patterns (user:, assistant:, etc.)
+
+ Returns a CorpusOriginResult with confidence derived from signal density.
+ """
+ combined = "\n\n".join(samples)
+ total_chars = max(1, len(combined))
+
+ # Count UNAMBIGUOUS brand-term hits (case-insensitive — users type
+ # lowercase constantly, so 'chatgpt' must trip the same as 'ChatGPT').
+ # Word boundaries prevent false in-word matches (see _brand_pattern).
+ unambiguous_hits: dict[str, int] = {}
+ total_unambiguous = 0
+ for term in _AI_UNAMBIGUOUS_TERMS:
+ matches = re.findall(_brand_pattern(term), combined, re.IGNORECASE)
+ if matches:
+ unambiguous_hits[term] = len(matches)
+ total_unambiguous += len(matches)
+
+ # Count AMBIGUOUS brand-term hits separately. These will only be
+ # counted toward AI-dialogue evidence if the corpus also contains
+ # at least one unambiguous AI signal — see co-occurrence rule below.
+ ambiguous_hits: dict[str, int] = {}
+ total_ambiguous = 0
+ for term in _AI_AMBIGUOUS_TERMS:
+ matches = re.findall(_brand_pattern(term), combined, re.IGNORECASE)
+ if matches:
+ ambiguous_hits[term] = len(matches)
+ total_ambiguous += len(matches)
+
+ # Count turn-marker hits (case-insensitive — transcripts vary).
+ turn_hits = 0
+ turn_types_found = set()
+ for pattern in _TURN_MARKERS:
+ matches = re.findall(pattern, combined, re.IGNORECASE)
+ if matches:
+ turn_hits += len(matches)
+ turn_types_found.add(pattern)
+
+ # Co-occurrence rule for ambiguous terms.
+ # Ambiguous terms (e.g. 'Claude' as a French name, 'Gemini' as a zodiac
+ # sign, 'Haiku' as a poem form) only count toward brand evidence if
+ # the corpus also contains at least one unambiguous AI signal. Otherwise
+ # we'd false-positive on French novels, astrology forums, poetry corpora,
+ # llama-rancher journals, etc.
+ has_ai_context = total_unambiguous > 0 or turn_hits > 0
+ counted_brand_hits = total_unambiguous + (total_ambiguous if has_ai_context else 0)
+
+ # Brand-term density per 1000 chars; turn-marker density likewise.
+ # The thresholds applied to these densities below were tuned on a small
+ # set of examples and are provisional; revisit them as more corpora are seen.
+ brand_density = counted_brand_hits / (total_chars / 1000)
+ turn_density = turn_hits / (total_chars / 1000)
+
+ # Build evidence list
+ evidence: list[str] = []
+ shown_hits = dict(unambiguous_hits)
+ if has_ai_context:
+ shown_hits.update(ambiguous_hits)
+ if shown_hits:
+ top_terms = sorted(shown_hits.items(), key=lambda x: -x[1])[:5]
+ evidence.append("AI brand terms: " + ", ".join(f"'{k}' ({v}x)" for k, v in top_terms))
+ elif ambiguous_hits and not has_ai_context:
+ # Be transparent that we saw ambiguous matches but suppressed them
+ # for lack of co-occurring AI context.
+ suppressed = sorted(ambiguous_hits.items(), key=lambda x: -x[1])[:3]
+ evidence.append(
+ "Ambiguous terms present but suppressed (no co-occurring AI signal): "
+ + ", ".join(f"'{k}' ({v}x)" for k, v in suppressed)
+ )
+ if turn_hits:
+ evidence.append(
+ f"Turn markers detected: {turn_hits} occurrences across {len(turn_types_found)} pattern types"
+ )
+
+ # Decision logic:
+ # strong signal (brand density OR turn density at or above its threshold) → confident AI-dialogue
+ # MEANINGFUL absence (enough text, zero brand, zero turn) → confident narrative
+ # ambiguous or insufficient text → default stance: AI-dialogue with low confidence
+ #
+ # Threshold for "meaningful absence": the samples collectively have to
+ # be long enough that the absence of AI signals would be expected to
+ # surface if the corpus really is narrative. 150 chars is the working
+ # floor — below that, we cannot confidently say "this is narrative."
+ MEANINGFUL_TEXT_FLOOR = 150
+
+ if brand_density >= 0.5 or turn_density >= 2.0:
+ return CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=min(0.95, 0.6 + 0.1 * (brand_density + turn_density)),
+ primary_platform=None, # tier 2 will refine
+ evidence=evidence,
+ )
+ if counted_brand_hits == 0 and turn_hits == 0 and total_chars >= MEANINGFUL_TEXT_FLOOR:
+ # Note: ambiguous-only matches (e.g. a French novel with 'Claude' as
+ # a character name) flow through here because counted_brand_hits == 0
+ # when no unambiguous AI signal co-occurs. The 'evidence' list still
+ # records that the ambiguous matches were seen and suppressed.
+ narrative_evidence = list(evidence) + [
+ f"no unambiguous AI signal across {total_chars} chars of text — pure narrative"
+ ]
+ return CorpusOriginResult(
+ likely_ai_dialogue=False,
+ confidence=0.9,
+ primary_platform=None,
+ evidence=narrative_evidence,
+ )
+ # Ambiguous or too-short-to-tell case: default stance is AI-dialogue
+ # with explicit low confidence. Tier 2 (LLM) should be called to confirm.
+ reason = "weak signal" if (counted_brand_hits or turn_hits) else "insufficient text"
+ return CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=0.4,
+ primary_platform=None,
+ evidence=evidence
+ + [
+ f"{reason} — applying default-stance (ai_dialogue=True, low confidence). "
+ "Tier 2 LLM check recommended to confirm or override."
+ ],
+ )
+
+
+# ── Tier 2: LLM-assisted confirmation + persona extraction ────────────────
+
+
+_SYSTEM_PROMPT = """You are analyzing a corpus of text to determine whether it is a \
+record of conversations with an AI agent (e.g. Claude, ChatGPT, Gemini, custom LLM \
+apps), or some other kind of text (personal narrative, story, research notes, \
+journal, code, etc.).
+
+Use your pre-existing knowledge of well-known AI platforms. You don't need the \
+corpus to explain what Claude or ChatGPT is — you already know. Your job is to \
+detect evidence of their presence and identify what persona-names the user has \
+assigned to the agent(s) they converse with.
+
+CRITICAL distinction:
+ - agent_persona_names are names the USER has assigned to the AI AGENT(S)
+ they converse with. Example: "Echo", "Sparrow", "Henry" might be names
+ the user calls a Claude instance they're building a relationship with.
+ - Do NOT include the USER's own name in agent_persona_names. The user
+ is the human author of the corpus, not a persona of the agent. Even
+ if the user's name appears frequently in the text (writing about
+ themselves), that is NOT an agent persona.
+ - If you can identify the user's name from context, put it in user_name
+ (separate field). If unclear, leave user_name null.
+
+Respond with JSON only (no prose before or after):
+{
+ "is_ai_dialogue_corpus": <true | false>,
+ "confidence": <0.0 to 1.0>,
+ "primary_platform": <"Claude (Anthropic)" | "ChatGPT (OpenAI)" | "Gemini (Google)" | other platform name | null>,
+ "user_name": <string | null>,
+ "agent_persona_names": [],
+ "evidence": []
+}
+
+Default stance: if evidence is thin or mixed, return is_ai_dialogue_corpus=true \
+with low confidence. False-negatives on AI-dialogue detection break downstream \
+classification; false-positives are recoverable later.
+"""
+
+
+def _extract_json(text: str) -> Optional[dict]:
+ """Pull the first JSON object out of a possibly-messy LLM response."""
+ text = text.strip()
+ if not text:
+ return None
+ # Try straight parse first
+ try:
+ return json.loads(text)
+ except json.JSONDecodeError:
+ pass
+ # Try to find a {...} block
+ start = text.find("{")
+ if start < 0:
+ return None
+ depth = 0
+ in_string = False
+ escape = False
+ for i in range(start, len(text)):
+ ch = text[i]
+ if in_string:
+ if escape:
+ escape = False
+ elif ch == "\\":
+ escape = True
+ elif ch == '"':
+ in_string = False
+ continue
+ if ch == '"':
+ in_string = True
+ elif ch == "{":
+ depth += 1
+ elif ch == "}":
+ depth -= 1
+ if depth == 0:
+ candidate = text[start : i + 1]
+ try:
+ return json.loads(candidate)
+ except json.JSONDecodeError:
+ return None
+ return None
+
+
+def detect_origin_llm(samples: list[str], provider) -> CorpusOriginResult:
+ """LLM-assisted detection. Takes samples (list of drawer-text excerpts)
+ and an LLMProvider (mempalace.llm_client.LLMProvider). Returns the
+ same CorpusOriginResult shape as the heuristic.
+
+ Falls back conservatively (default-stance ai=True, low confidence)
+ on any LLM error or malformed response — never raises.
+ """
+ # Build the user prompt: concise excerpts, capped so we stay cheap
+ max_excerpt_chars = 800
+ excerpts = "\n\n---\n\n".join(
+ f"[sample {i + 1}]\n{s[:max_excerpt_chars]}" for i, s in enumerate(samples[:20])
+ )
+ user_prompt = f"CORPUS EXCERPTS:\n\n{excerpts}\n\nAnalyze and respond with JSON."
+
+ try:
+ resp = provider.classify(system=_SYSTEM_PROMPT, user=user_prompt, json_mode=True)
+ raw = getattr(resp, "text", "") or ""
+ except Exception as e:
+ return CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=0.3,
+ primary_platform=None,
+ evidence=[f"LLM provider error (fallback to default stance): {e}"],
+ )
+
+ parsed = _extract_json(raw)
+ if not parsed or not isinstance(parsed, dict):
+ return CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=0.3,
+ primary_platform=None,
+ evidence=["LLM response was not valid JSON (fallback to default stance)"],
+ )
+
+ # Pull fields defensively. If the LLM leaked the user_name into
+ # agent_persona_names despite the prompt telling it not to, filter it out.
+ user_name = parsed.get("user_name") or None
+ personas = list(parsed.get("agent_persona_names") or [])
+ if user_name:
+ personas = [p for p in personas if isinstance(p, str) and p.lower() != str(user_name).lower()]
+ return CorpusOriginResult(
+ likely_ai_dialogue=bool(parsed.get("is_ai_dialogue_corpus", True)),
+ confidence=float(parsed.get("confidence", 0.5)),
+ primary_platform=parsed.get("primary_platform") or None,
+ user_name=user_name,
+ agent_persona_names=personas,
+ evidence=list(parsed.get("evidence") or []),
+ )
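Review note on `detect_origin_llm`: the docstring promises the function never raises, yet `float(parsed.get("confidence", 0.5))` raises `TypeError` when the model returns `"confidence": null` and `ValueError` for a non-numeric string such as `"high"`. One possible guard is sketched below. `coerce_confidence` is a hypothetical helper name, not something in this diff, and clamping to [0, 1] is an assumption based on the prompt's `<0.0 to 1.0>` hint.

```python
import math
from typing import Any


def coerce_confidence(value: Any, default: float = 0.5) -> float:
    """Best-effort conversion of an LLM-supplied confidence to a float in [0, 1].

    Hypothetical helper (an assumption for illustration, not part of the PR).
    """
    # bool is a subclass of int; treat True/False and None as malformed input.
    if value is None or isinstance(value, bool):
        return default
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Lists, dicts, and non-numeric strings land here.
        return default
    if math.isnan(number):
        return default
    # Clamp out-of-range values (including +/-inf) into [0.0, 1.0].
    return min(1.0, max(0.0, number))
```

With a helper like this, the final `CorpusOriginResult(...)` call could pass `confidence=coerce_confidence(parsed.get("confidence"))` and keep the 0.5 default for missing or malformed values.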
diff --git a/mempalace/entity_detector.py b/mempalace/entity_detector.py
index 5ff6b3c..c70dd57 100644
--- a/mempalace/entity_detector.py
+++ b/mempalace/entity_detector.py
@@ -2,6 +2,9 @@
"""
entity_detector.py — Auto-detect people and projects from file content.
+Uses ``from __future__ import annotations`` so PEP 604 union syntax
+(``dict | None``) works in annotations on the Python 3.9 baseline.
+
Two-pass approach:
Pass 1: scan files, extract entity candidates with signal counts
Pass 2: score and classify each candidate as person, project, or uncertain
@@ -27,6 +30,8 @@ Usage:
confirmed = confirm_entities(candidates) # interactive review
"""
+from __future__ import annotations
+
import re
import os
import functools
@@ -396,7 +401,12 @@ def classify_entity(name: str, frequency: int, scores: dict) -> dict:
# ==================== MAIN DETECT ====================
-def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) -> dict:
+def detect_entities(
+ file_paths: list,
+ max_files: int = 10,
+ languages=("en",),
+ corpus_origin: dict | None = None,
+) -> dict:
"""
Scan files and detect entity candidates.
@@ -405,12 +415,23 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
max_files: Max files to read (for speed)
languages: Tuple of language codes whose entity patterns should be
applied (union). Defaults to ``("en",)``.
+ corpus_origin: Optional corpus-origin context (the dict produced
+ by ``mempalace.corpus_origin`` and persisted to
+ ``/.mempalace/origin.json`` by ``mempalace init``).
+ When supplied and the corpus is identified as AI-dialogue with
+ known agent persona names, candidates whose name matches an
+ agent persona are moved out of ``people``/``uncertain`` and
+ into a new ``agent_personas`` bucket. Shape:
+ ``{"schema_version": 1, "result": {"agent_persona_names": [...], ...}}``.
Returns:
{
"people": [...entity dicts...],
"projects": [...entity dicts...],
"uncertain":[...entity dicts...],
+ # Only present when corpus_origin reclassifies at least one
+ # candidate as an agent persona:
+ "agent_personas": [...entity dicts...],
}
"""
langs = _normalize_langs(languages)
@@ -440,7 +461,10 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
candidates = extract_candidates(combined_text, languages=langs)
if not candidates:
- return {"people": [], "projects": [], "topics": [], "uncertain": []}
+ return _apply_corpus_origin(
+ {"people": [], "projects": [], "topics": [], "uncertain": []},
+ corpus_origin,
+ )
# Score and classify each candidate
people = []
@@ -463,14 +487,76 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
projects.sort(key=lambda x: x["confidence"], reverse=True)
uncertain.sort(key=lambda x: x["frequency"], reverse=True)
- # Cap results to most relevant
- return {
+ detected = {
"people": people[:15],
"projects": projects[:10],
"topics": [],
"uncertain": uncertain[:8],
}
+ return _apply_corpus_origin(detected, corpus_origin)
+
+
+def _apply_corpus_origin(detected: dict, corpus_origin: dict | None) -> dict:
+ """Reclassify per-candidate buckets using corpus-origin context.
+
+ When the corpus is identified as AI-dialogue with known agent persona
+ names, a candidate whose name case-insensitively matches one of those
+ personas is moved from ``people``/``uncertain`` into an
+ ``agent_personas`` bucket. The candidate's per-entity ``type`` is also
+ rewritten to ``"agent_persona"``.
+
+ No-op when ``corpus_origin`` is ``None`` or contains no usable persona
+ names. Pure: never mutates the input; returns a new dict only when a candidate moves.
+ """
+ if not corpus_origin:
+ return detected
+
+ origin_result = corpus_origin.get("result") or {}
+ raw_personas = origin_result.get("agent_persona_names") or []
+ persona_lower = {n.lower() for n in raw_personas if isinstance(n, str)}
+ if not persona_lower:
+ return detected
+
+ agent_personas: list = []
+ new_people: list = []
+ new_uncertain: list = []
+
+ for entity in detected.get("people", []):
+ if entity["name"].lower() in persona_lower:
+ agent_personas.append(_tag_as_persona(entity))
+ else:
+ new_people.append(entity)
+
+ for entity in detected.get("uncertain", []):
+ if entity["name"].lower() in persona_lower:
+ agent_personas.append(_tag_as_persona(entity))
+ else:
+ new_uncertain.append(entity)
+
+ if not agent_personas:
+ return detected
+
+ agent_personas.sort(key=lambda x: x.get("confidence", 0), reverse=True)
+
+ return {
+ **detected,
+ "people": new_people,
+ "uncertain": new_uncertain,
+ "agent_personas": agent_personas,
+ }
+
+
+def _tag_as_persona(entity: dict) -> dict:
+ """Return a new entity dict tagged as agent_persona with provenance signal."""
+ existing_signals = entity.get("signals", [])
+ return {
+ **entity,
+ "type": "agent_persona",
+ "confidence": max(0.95, entity.get("confidence", 0.0)),
+ "signals": ["matched corpus_origin agent_persona_names"] + existing_signals[:2],
+ }
+
# ==================== INTERACTIVE CONFIRM ====================
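Review note: the before/after effect of the persona sweep is easier to see on a tiny input. The sketch below is a simplified, standalone stand-in for `_apply_corpus_origin`, not the function from this diff. It keeps the case-insensitive match and the copy-on-move behaviour but omits the confidence floor, the `signals` provenance entry, and the final sort. `reclassify_personas` and the sample data are hypothetical.

```python
from typing import Optional


def reclassify_personas(detected: dict, corpus_origin: Optional[dict]) -> dict:
    """Simplified stand-in for the persona sweep; illustrative only."""
    result = (corpus_origin or {}).get("result") or {}
    names = result.get("agent_persona_names") or []
    personas = {n.lower() for n in names if isinstance(n, str)}
    if not personas:
        return detected  # no usable context: hand back the input untouched

    moved = []
    kept = {"people": [], "uncertain": []}
    for bucket in ("people", "uncertain"):
        for entity in detected.get(bucket, []):
            if entity["name"].lower() in personas:
                # Copy instead of mutating the caller's entity dict.
                moved.append({**entity, "type": "agent_persona"})
            else:
                kept[bucket].append(entity)
    if not moved:
        return detected
    return {**detected, **kept, "agent_personas": moved}


detected = {
    "people": [{"name": "Echo", "type": "person"}, {"name": "Jordan", "type": "person"}],
    "projects": [],
    "uncertain": [{"name": "SPARROW", "type": "uncertain"}],
}
origin = {"schema_version": 1, "result": {"agent_persona_names": ["echo", "Sparrow"]}}
swept = reclassify_personas(detected, origin)
```

Here `swept["people"]` keeps only Jordan, `swept["uncertain"]` is empty, and both Echo and SPARROW land in `swept["agent_personas"]`, while the original `detected` dict is left as it was.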
diff --git a/mempalace/llm_refine.py b/mempalace/llm_refine.py
index dda37df..e3afe6b 100644
--- a/mempalace/llm_refine.py
+++ b/mempalace/llm_refine.py
@@ -262,6 +262,52 @@ def _apply_classifications(
return new_detected, reclassified, dropped
+def _build_corpus_origin_preamble(corpus_origin: dict | None) -> str:
+ """Build a system-prompt preamble carrying corpus-origin context.
+
+ When the corpus has been identified as AI-dialogue with known persona
+ names, this preamble lets the LLM disambiguate ambiguous candidates
+ with knowledge that this is AI-dialogue. It does NOT add a new label
+ or change the classification schema — the post-refine sweep in
+ project_scanner.discover_entities still moves persona names into
+ ``agent_personas``. The preamble is purely classification context for
+ the OTHER candidates (ambiguous, common-word) that benefit from
+ knowing the corpus shape.
+
+ Returns ``""`` when no usable origin context is available, so callers
+ can concatenate unconditionally without changing the v3.3.3 prompt
+ shape for opt-out paths.
+ """
+ if not corpus_origin:
+ return ""
+ result = corpus_origin.get("result") or {}
+ if not result.get("likely_ai_dialogue"):
+ return ""
+
+ lines = ["\n\nCORPUS CONTEXT (corpus-origin detection):"]
+ platform = result.get("primary_platform")
+ if platform:
+ lines.append(f"- This corpus is AI-dialogue from {platform}.")
+ user_name = result.get("user_name")
+ if user_name:
+ lines.append(
+ f"- The corpus author (the human user) is named '{user_name}'. "
+ f"Treat this name as PERSON."
+ )
+ personas = result.get("agent_persona_names") or []
+ if personas:
+ lines.append(
+ "- The user has assigned these persona names to AI agents in "
+ f"this corpus: {', '.join(personas)}."
+ )
+ lines.append(
+ "- Persona names refer to AI agents, not biological people. "
+ "Classify them as PERSON (a downstream step tags them as "
+ "agent personas)."
+ )
+ return "\n".join(lines)
+
+
def _is_authoritative_person(entry: dict) -> bool:
"""Return True for git-author people that should not be second-guessed."""
signals = " ".join(entry.get("signals", [])).lower()
@@ -292,6 +338,7 @@ def refine_entities(
batch_size: int = BATCH_SIZE,
show_progress: bool = True,
allow_project_promotions: bool = True,
+ corpus_origin: dict | None = None,
) -> RefineResult:
"""Reclassify detected entities using the LLM provider.
@@ -354,12 +401,14 @@ def refine_entities(
completed = 0
cancelled = False
+ system_prompt = SYSTEM_PROMPT + _build_corpus_origin_preamble(corpus_origin)
+
for idx, batch in enumerate(batches, 1):
if show_progress and batch:
_print_progress(idx - 1, len(batches), batch[0][0])
user_prompt = _build_user_prompt(batch)
try:
- resp = provider.classify(SYSTEM_PROMPT, user_prompt, json_mode=True)
+ resp = provider.classify(system_prompt, user_prompt, json_mode=True)
except KeyboardInterrupt:
cancelled = True
break
diff --git a/mempalace/project_scanner.py b/mempalace/project_scanner.py
index e083dfb..521bfa2 100644
--- a/mempalace/project_scanner.py
+++ b/mempalace/project_scanner.py
@@ -597,6 +597,7 @@ def discover_entities(
people_cap: int = 15,
llm_provider: object = None,
show_progress: bool = True,
+ corpus_origin: dict | None = None,
) -> dict:
"""Top-level entity discovery: real signals first, prose detection second.
@@ -613,11 +614,19 @@ def discover_entities(
mentioned in docs/notes (not code)
5. Optional LLM refinement pass — reclassifies ambiguous candidates
using the caller-supplied provider
+ 6. Optional corpus-origin persona filter — when the corpus is
+ identified as AI-dialogue, candidates whose name matches an
+ agent_persona_name are moved to an ``agent_personas`` bucket
+ instead of being reported as people.
Passing ``llm_provider`` enables phase-2 refinement. The caller is
responsible for constructing the provider (``llm_client.get_provider``)
and confirming availability. Refinement is blocking-interactive:
progress prints to stderr; Ctrl-C returns partial results.
+
+ Passing ``corpus_origin`` enables corpus-origin persona reclassification.
+ The expected shape is the dict written by ``mempalace init`` to
+ ``/.mempalace/origin.json`` (see ``corpus_origin.py``).
"""
projects, people = scan(project_dir)
@@ -668,7 +677,7 @@ def discover_entities(
drop_secondary_uncertain=has_real_signal and llm_provider is None,
)
- # Optional phase 2: LLM refinement.
+ # Optional LLM refinement pass (when an llm_provider was supplied).
if llm_provider is not None:
from mempalace.llm_refine import collect_corpus_text, refine_entities
@@ -679,6 +688,7 @@ def discover_entities(
llm_provider,
show_progress=show_progress,
allow_project_promotions=not has_real_signal,
+ corpus_origin=corpus_origin,
)
if show_progress:
status_bits = []
@@ -696,6 +706,14 @@ def discover_entities(
print(f" LLM refine: {', '.join(status_bits)}", file=_sys.stderr)
merged = result.merged
+ # Corpus-origin persona reclassification — applied last so it sweeps
+ # candidates contributed by every upstream source (manifests, git authors,
+ # prose, LLM refinement). No-op without corpus_origin: exact v3.3.3 shape.
+ if corpus_origin is not None:
+ from mempalace.entity_detector import _apply_corpus_origin
+
+ merged = _apply_corpus_origin(merged, corpus_origin)
+
return merged
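Review note: `discover_entities` expects the caller to supply the persisted origin dict, so a caller-side loader might look like the sketch below. `load_corpus_origin` is a hypothetical name, and both the `schema_version == 1` check and the "return `None` on any problem" policy are assumptions inferred from the documented shape, not behaviour confirmed by this PR. The path is left to the caller.

```python
import json
from pathlib import Path
from typing import Optional


def load_corpus_origin(path: Path) -> Optional[dict]:
    """Return the persisted origin dict, or None when missing or unusable.

    Hypothetical helper: name, schema check, and fallback policy are assumptions.
    """
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unreadable file, bad encoding, or invalid JSON:
        # behave as if no corpus-origin context exists.
        return None
    if not isinstance(data, dict) or data.get("schema_version") != 1:
        return None
    if not isinstance(data.get("result"), dict):
        return None
    return data
```

Because the sweep is skipped when `corpus_origin` is `None`, passing the loader's return value straight through as the `corpus_origin` keyword argument would fall back to the v3.3.3 output shape whenever the file is absent or malformed.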
diff --git a/tests/test_cli.py b/tests/test_cli.py
index 5d36ab7..b9427d5 100644
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -127,6 +127,11 @@ def test_cmd_init_with_entities(mock_config_cls, tmp_path):
patch("mempalace.entity_detector.detect_entities", return_value=detected),
patch("mempalace.entity_detector.confirm_entities", return_value=confirmed),
patch("mempalace.room_detector_local.detect_rooms_local"),
+ # Pass 0 (corpus_origin) needs real file IO; this test mocks
+ # builtins.open globally for the entities.json write, which would
+ # break Pass 0's file-reading path. Patch Pass 0 out — a separate
+ # suite (tests/test_corpus_origin_integration.py) covers it directly.
+ patch("mempalace.cli._run_pass_zero", return_value=None),
patch("builtins.open", MagicMock()),
patch("mempalace.cli._maybe_run_mine_after_init"),
):
diff --git a/tests/test_corpus_origin.py b/tests/test_corpus_origin.py
new file mode 100644
index 0000000..6676bbd
--- /dev/null
+++ b/tests/test_corpus_origin.py
@@ -0,0 +1,395 @@
+"""Tests for corpus_origin detection.
+
+The corpus-origin detector answers ONE foundational question before any
+downstream Pass 2 classification runs:
+
+ "Is this corpus a record of AI-agent dialogue, and if so, which platform
+ and what persona names has the user assigned to the agent?"
+
+Detection is two-tier:
+ - Tier 1: cheap content-aware heuristic (grep for well-known AI terms
+ and turn markers). No API calls. Always runs.
+ - Tier 2: LLM-assisted confirmation + persona extraction. Takes a small
+ sample of drawer texts and uses Haiku's pre-trained world knowledge
+ about Claude/ChatGPT/Gemini/etc. to confirm platform + identify
+ persona-names the user assigned to the agent.
+
+Default stance: "this IS an AI-dialogue corpus" unless strong evidence
+otherwise. False-negative (missing an AI corpus) is catastrophic for
+downstream classification; false-positive is recoverable via per-drawer
+voice-profile detection in later passes.
+
+TDD: these tests fail until mempalace/corpus_origin.py is implemented."""
+
+from mempalace.corpus_origin import (
+ CorpusOriginResult,
+ detect_origin_heuristic,
+ detect_origin_llm,
+)
+
+
+# ── Tier 1: heuristic (no LLM) ────────────────────────────────────────────
+
+
+class TestHeuristic:
+ def test_claude_heavy_corpus_detected(self):
+ """A corpus with abundant Claude references + turn markers should
+ be confidently detected as AI-dialogue."""
+ samples = [
+ "user: hey Claude, can you help me\nassistant: sure, what do you need\n",
+ "I was talking to Claude Opus about the MCP server setup",
+ "Sonnet 4.5 handled this better than Haiku 4.5 did",
+ "claude mcp add mempalace -- mempalace-mcp",
+ "human: what's up\nassistant: I'm happy to help",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert result.likely_ai_dialogue is True
+ assert result.confidence >= 0.8
+ assert (
+ "claude" in " ".join(result.evidence).lower()
+ )
+
+ def test_gpt_corpus_detected(self):
+ samples = [
+ "I asked ChatGPT to summarize my paper",
+ "The GPT-4 response was surprisingly good",
+ "user: explain quantum computing\nassistant: quantum computing uses qubits",
+ "OpenAI's model was able to help with the code",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert result.likely_ai_dialogue is True
+ assert any("GPT" in e or "ChatGPT" in e or "OpenAI" in e for e in result.evidence)
+
+ def test_pure_narrative_corpus_detected_as_not_ai(self):
+ """A story/journal corpus with no AI signals should be flagged
+ not-AI (default stance flipped only with evidence)."""
+ samples = [
+ "Today the cat finally ventured into the garden. The dog watched.",
+ "The morning light came through the window as I wrote.",
+ "Chapter 3: The Reckoning. It was a dark and stormy night.",
+ "My father's old journal described the same field in 1972.",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert result.likely_ai_dialogue is False
+ assert result.confidence >= 0.8
+
+ def test_ambiguous_corpus_defaults_to_ai(self):
+ """When evidence is thin or mixed, default to assuming AI-dialogue.
+ False-negative is worse than false-positive."""
+ samples = [
+ "some notes about the meeting today",
+ "Later on I went to the store.",
+ "Short file with little signal.",
+ ]
+ result = detect_origin_heuristic(samples)
+ # Low signal → default stance is ai_dialogue=True with low confidence
+ assert result.likely_ai_dialogue is True
+ assert result.confidence <= 0.6
+ assert "default-stance" in " ".join(result.evidence).lower()
+
+ def test_turn_markers_alone_sufficient(self):
+ """Even without AI brand mentions, strong turn-marker presence
+ indicates dialogue structure consistent with AI corpora."""
+ samples = [
+ "user: hello\nassistant: hi there, how can I help?\nuser: summarize X\nassistant: sure",
+ "human: what's the weather\nai: I don't have real-time data\n",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert result.likely_ai_dialogue is True
+
+ # ── Pattern + context (not capitalization, not English-rule) ──────────
+
+ def test_brand_terms_case_insensitive(self):
+ """Detection cannot rely on the user typing proper-cased brand names.
+ Lowercase 'claude code', 'chatgpt', 'gemini-pro', 'mcp' must trip
+ the same as their proper-cased equivalents. NO turn-marker fallback
+ in this corpus — the brand matches must do the work."""
+ samples = [
+ "i love claude code, it just works for refactoring tasks",
+ "asked chatgpt to write a regex and it nailed it on the first try",
+ "switched to gemini-pro for the long-context summary task last week",
+ "added mempalace as an mcp server in my .claude/ settings file",
+ "anthropic's haiku model is cheap enough to run on every drawer",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert (
+ result.likely_ai_dialogue is True
+ ), f"lowercase brand terms missed; evidence: {result.evidence}"
+ # Evidence must show MULTIPLE distinct case-insensitive brand matches.
+ # 'chatgpt' lowercase only matches under case-insensitive search
+ # (the brand list has 'ChatGPT' proper-cased only).
+ evidence_str = " ".join(result.evidence).lower()
+ matched = sum(t in evidence_str for t in ("chatgpt", "anthropic", "haiku", "gemini-pro"))
+ assert (
+ matched >= 2
+ ), f"case-insensitive brand matches did not fire — only got: {result.evidence}"
+
+ def test_zodiac_corpus_not_flagged_as_ai(self):
+ """An astrology forum post with high 'Gemini' density but ZERO
+ unambiguous AI signals (no MCP/LLM/ChatGPT/turn markers) must NOT
+ be flagged as AI-dialogue. Word-sense disambiguation is required:
+ Gemini-the-zodiac-sign vs Gemini-the-AI-platform."""
+ samples = [
+ "I'm a Gemini sun, Pisces moon, and Leo rising.",
+ "Geminis are dreamers and overthinkers — that's the dual nature.",
+ "Compatibility between Gemini and Sagittarius is famously strong.",
+ "If you're a Gemini, expect Mercury retrograde to hit you hardest.",
+ "My horoscope this week says Gemini energy will dominate Wednesday.",
+ "The Gemini twins in Greek mythology are Castor and Pollux.",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert (
+ result.likely_ai_dialogue is False
+ ), f"zodiac corpus wrongly flagged AI; evidence: {result.evidence}"
+
+ def test_french_novel_with_claude_name_not_flagged(self):
+ """A French novel where 'Claude' is a character name (Claude is a
+ common French masculine name) must NOT trip AI-dialogue detection.
+ Disambiguation is by context, not by the presence of the word."""
+ samples = [
+ "Claude marchait lentement le long de la Seine ce matin-là.",
+ "« Claude, tu rentres dîner? » lui demanda sa mère depuis la cuisine.",
+ "Pour Claude, l'art de vivre passait avant tout par la patience.",
+ "Le vieux Claude se souvenait encore de la guerre, des champs déserts.",
+ "Claude ouvrit la fenêtre. Le matin sentait le pain frais et la pluie.",
+ "Les amis de Claude s'étaient réunis chez lui pour fêter ses soixante ans.",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert (
+ result.likely_ai_dialogue is False
+ ), f"French novel wrongly flagged AI; evidence: {result.evidence}"
+
+ def test_poetry_corpus_with_haiku_sonnet_not_flagged(self):
+ """A poetry corpus with high 'haiku', 'sonnet', 'opus' density
+ (poetic forms / classical music terms) but no AI infrastructure
+ terms must NOT be flagged as AI-dialogue."""
+ samples = [
+ "A haiku is seventeen syllables across three lines: 5-7-5.",
+ "Shakespeare's sonnet 18 remains the most quoted in the English canon.",
+ "Beethoven's opus 27 includes the Moonlight Sonata.",
+ "I wrote three haiku this morning before coffee.",
+ "The sonnet form arrived in England via Wyatt and Surrey.",
+ "Her first opus, published at twenty, was a song cycle for soprano.",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert (
+ result.likely_ai_dialogue is False
+ ), f"poetry corpus wrongly flagged AI; evidence: {result.evidence}"
+
+ def test_word_boundary_brand_matching(self):
+ """Brand-term matching must use word boundaries. Embedded matches
+ inside larger words ('Claudette' → 'Claude', 'opuscule' → 'Opus',
+ 'sonneteer' → 'Sonnet', 'llamas' → 'Llama', 'bardic' → 'Bard')
+ must NOT be counted as brand hits.
+
+ Word boundaries don't change classification on the co-occurrence-
+ suppressed cases, but they clean up the evidence strings — false
+ matches must not appear in the audit trail. They also prevent
+ 'Claude Code' from triple-counting as 'Claude Code' + 'Claude'
+ overlap."""
+ samples = [
+ "My grandmother Claudette baked the most beautiful tarts every Sunday.",
+ "Two llamas were spotted near the trailhead this morning at sunrise.",
+ "Beethoven's opuscule for solo violin remained unpublished for decades.",
+ "She studied to become a sonneteer after reading the full Spenser cycle.",
+ "Bardic traditions in the Hebrides survived well into the eighteenth century.",
+ "The complete opuses of Mozart fill an entire wall of the library.",
+ ]
+ result = detect_origin_heuristic(samples)
+ evidence_str = " ".join(result.evidence).lower()
+
+ # None of the brand terms should show up in evidence — every
+ # would-be match is an embedded false-positive that word
+ # boundaries should suppress.
+ for embedded_term in ("claude", "opus", "sonnet", "llama", "bard"):
+ assert f"'{embedded_term}'" not in evidence_str, (
+ f"word-boundary bug: '{embedded_term}' falsely matched inside "
+ f"a longer word — evidence: {result.evidence}"
+ )
+
+ # And classification should be not-AI (no real AI signals present).
+ assert (
+ result.likely_ai_dialogue is False
+ ), f"corpus has no real AI signals; evidence: {result.evidence}"
+
+ def test_ambiguous_brand_with_unambiguous_signal_flagged(self):
+ """When an ambiguous brand term ('Gemini') co-occurs with an
+ UNAMBIGUOUS AI signal (turn markers, MCP, ChatGPT, Claude Code)
+ in the same corpus, the Gemini hits SHOULD count and the corpus
+ SHOULD be flagged as AI-dialogue."""
+ samples = [
+ "Switched the agent from Gemini to ChatGPT mid-session for cost reasons.",
+ "Gemini handled the long-context task; user: please summarize\nassistant: here is the summary",
+ "user: try Gemini for this\nassistant: running it through gemini-pro now",
+ "MCP server config: Gemini as primary, OpenAI as fallback.",
+ ]
+ result = detect_origin_heuristic(samples)
+ assert (
+ result.likely_ai_dialogue is True
+ ), f"ambiguous+unambiguous co-occurrence missed; evidence: {result.evidence}"
+
+
+# ── Tier 2: LLM-assisted (mocked) ─────────────────────────────────────────
+
+
+class _FakeProvider:
+ """Minimal stand-in for mempalace's LLMProvider used for testing."""
+
+ def __init__(self, canned_response):
+ self._response = canned_response
+ self.calls = []
+
+ def classify(self, system, user, json_mode=True):
+ self.calls.append({"system": system, "user": user})
+
+ class R:
+ text = self._response
+
+ return R()
+
+ def check_available(self):
+ return True, "ok"
+
+
+class TestLLMConfirmation:
+ def test_extracts_persona_names_and_platform(self):
+ fake_response = """{
+ "is_ai_dialogue_corpus": true,
+ "confidence": 0.97,
+ "primary_platform": "Claude Code (Anthropic CLI)",
+ "agent_persona_names": ["Echo", "Sparrow", "Cipher", "Orc"],
+ "evidence": [
+ "user addresses agent as 'Echo' on assistant turns",
+ "Claude Code banner text in samples",
+ "references to MCP, CLAUDE.md, hooks"
+ ]
+ }"""
+ provider = _FakeProvider(fake_response)
+ samples = [
+ "user: hey Echo, what's up\nassistant: I'm here, what do you need\n",
+ "Claude Code session banner Sonnet 4.5 Claude Pro",
+ ]
+ result = detect_origin_llm(samples, provider)
+ assert result.likely_ai_dialogue is True
+ assert result.confidence >= 0.9
+ assert "Echo" in result.agent_persona_names
+ assert "Sparrow" in result.agent_persona_names
+ assert "Claude" in result.primary_platform
+
+ def test_narrative_corpus_llm_confirms_no_agent(self):
+ fake_response = """{
+ "is_ai_dialogue_corpus": false,
+ "confidence": 0.95,
+ "primary_platform": null,
+ "agent_persona_names": [],
+ "evidence": ["pure narrative prose, no turn markers, no AI terms"]
+ }"""
+ provider = _FakeProvider(fake_response)
+ samples = ["Once upon a time in a small village", "The old woman smiled"]
+ result = detect_origin_llm(samples, provider)
+ assert result.likely_ai_dialogue is False
+ assert result.agent_persona_names == []
+ assert result.primary_platform is None
+
+ def test_handles_malformed_llm_response(self):
+ """If the LLM returns garbage, fall back gracefully to the
+ conservative default (assume AI-dialogue with low confidence)."""
+ provider = _FakeProvider("not even close to JSON")
+ result = detect_origin_llm(["sample text"], provider)
+ # Fallback: conservative default, low confidence
+ assert result.likely_ai_dialogue is True
+ assert result.confidence <= 0.5
+ assert (
+ "fallback" in " ".join(result.evidence).lower()
+ or "error" in " ".join(result.evidence).lower()
+ )
+
+ def test_filters_user_name_out_of_personas(self):
+ """Regression test: Haiku sometimes leaks the user's own name into
+ agent_persona_names despite the prompt's CRITICAL distinction. The
+ parser must strip the user's name from personas if it appears in
+ both fields (case-insensitive). The user is the human author of
+ the corpus, not an agent persona."""
+ fake_response = """{
+ "is_ai_dialogue_corpus": true,
+ "confidence": 0.97,
+ "primary_platform": "Claude (Anthropic)",
+ "user_name": "Jordan",
+ "agent_persona_names": ["Echo", "Sparrow", "Jordan", "Cipher"],
+ "evidence": ["user Jordan talks to agents Echo/Sparrow/Cipher"]
+ }"""
+ provider = _FakeProvider(fake_response)
+ result = detect_origin_llm(["sample"], provider)
+ # user_name is exposed in its own field
+ assert result.user_name == "Jordan"
+ # "Jordan" is filtered out of agent_persona_names
+ assert "Jordan" not in result.agent_persona_names
+ # Real personas are preserved
+ for persona in ("Echo", "Sparrow", "Cipher"):
+ assert persona in result.agent_persona_names
+
+ def test_filter_is_case_insensitive(self):
+ """The user-name filter works even when the LLM returns a casing
+ mismatch between user_name and the personas list."""
+ fake_response = """{
+ "is_ai_dialogue_corpus": true,
+ "confidence": 0.9,
+ "primary_platform": "Claude",
+ "user_name": "Jordan",
+ "agent_persona_names": ["Echo", "jordan", "JORDAN", "Cipher"],
+ "evidence": []
+ }"""
+ provider = _FakeProvider(fake_response)
+ result = detect_origin_llm(["sample"], provider)
+ # All case-variants of the user's name are filtered
+ assert "jordan" not in [p.lower() for p in result.agent_persona_names]
+ assert result.agent_persona_names == ["Echo", "Cipher"]
+
+ def test_user_name_field_surfaces_author(self):
+ """The user_name field captures the human author of the corpus,
+ separate from agent personas. This gives downstream passes a
+ clear 'who is the user, who is the agent' distinction."""
+ fake_response = """{
+ "is_ai_dialogue_corpus": true,
+ "confidence": 0.95,
+ "primary_platform": "ChatGPT (OpenAI)",
+ "user_name": "Sarah",
+ "agent_persona_names": ["MyAssistant"],
+ "evidence": ["Sarah writes to MyAssistant"]
+ }"""
+ provider = _FakeProvider(fake_response)
+ result = detect_origin_llm(["sample"], provider)
+ assert result.user_name == "Sarah"
+ assert result.agent_persona_names == ["MyAssistant"]
+
+
+# ── CorpusOriginResult dataclass ──────────────────────────────────────────
+
+
+class TestResultDataclass:
+ def test_result_has_all_fields(self):
+ r = CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=0.95,
+ primary_platform="Claude Code",
+ agent_persona_names=["Echo"],
+ evidence=["test"],
+ )
+ assert r.likely_ai_dialogue is True
+ assert r.confidence == 0.95
+ assert r.primary_platform == "Claude Code"
+ assert r.agent_persona_names == ["Echo"]
+ assert r.evidence == ["test"]
+
+ def test_result_serializes_to_dict(self):
+ r = CorpusOriginResult(
+ likely_ai_dialogue=False,
+ confidence=0.9,
+ primary_platform=None,
+ agent_persona_names=[],
+ evidence=[],
+ )
+ d = r.to_dict()
+ assert d["likely_ai_dialogue"] is False
+ assert d["primary_platform"] is None
+ assert d["agent_persona_names"] == []
diff --git a/tests/test_corpus_origin_integration.py b/tests/test_corpus_origin_integration.py
new file mode 100644
index 0000000..ffe951b
--- /dev/null
+++ b/tests/test_corpus_origin_integration.py
@@ -0,0 +1,1390 @@
+"""Integration tests proving corpus_origin actually improves classification.
+
+These are the tests that justify the PR. Without them, the PR ships
+infrastructure that nobody can prove improves v3.3.3.
+
+The fixture: a small AI-dialogue corpus with three agent persona names
+(Echo, Sparrow, Cipher) that the user (Jordan) has assigned to their AI
+agents. On plain v3.3.3, entity_detector misclassifies these as PEOPLE.
+With corpus_origin context wired through, they classify as
+AGENT_PERSONA instead.
+
+Two tests sit side by side:
+
+ test_baseline_v333_misclassifies_persona_names_as_people
+ Pins v3.3.3's behavior. If this starts failing, the PR's motivation
+ has shifted and the corpus_origin docs need revisiting.
+
+ test_corpus_origin_reclassifies_personas
+ The fix. Asserts that when corpus_origin context is passed,
+ persona names land in agent_personas instead of people.
+
+Together: documented before/after of v3.3.3 → corpus-origin feature.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+
+# A synthetic but realistic Claude Code transcript fixture. Three persona
+# names appear repeatedly in dialogue patterns that the v3.3.3
+# entity_detector treats as person-evidence (dialogue markers, action verbs,
+# pronoun proximity). User name "Jordan" also appears in dialogue.
+#
+# The point is: every name here CAN be a real human name. v3.3.3 has no
+# way to know that in this corpus they're agent personas, not people. The
+# corpus_origin result gives it that context.
+AI_DIALOGUE_FIXTURE = """\
+# Session log — 2026-04-20
+
+Jordan: Echo, can you summarize what we worked on yesterday?
+
+Echo (assistant): Yesterday we refactored the embedding pipeline. I noticed
+the chunking strategy was producing overlapping windows, and I suggested
+moving to a sliding window with explicit stride. You agreed and we shipped
+the change.
+
+Jordan: Good. Sparrow, what about the migration script — did you finish?
+
+Sparrow (assistant): Yes, I finished the migration. I tested it locally
+against the staging snapshot and it ran clean. I also added a rollback
+path because you asked me to be cautious about the indexes.
+
+Jordan: Perfect. Cipher, run the verification suite please.
+
+Cipher (assistant): Running now. I'll report back when the full suite
+completes. I expect it to take about four minutes.
+
+Echo: Jordan, while Cipher runs the verification, do you want me to draft
+the changelog entry for today's work?
+
+Jordan: Yes please. Echo, keep it short. Sparrow, please review Echo's
+draft when she sends it.
+
+Sparrow: Will do. I'll look for clarity issues and check the migration
+phrasing matches what we actually shipped.
+
+Cipher: Verification complete. All 1247 tests pass. I'm filing the run log
+to the palace under wing/today.
+
+Jordan: Thanks Cipher. Echo, send the changelog draft.
+
+Echo: Done. Sent to the channel. Sparrow, ready for review when you are.
+
+Sparrow: Reviewed. Two small wording changes — sent back. Otherwise clean.
+
+Jordan: Echo, apply Sparrow's edits and ship it.
+
+Echo: Shipped. Tag pushed.
+"""
+
+
+@pytest.fixture
+def ai_dialogue_corpus(tmp_path: Path) -> Path:
+ """Create a one-file project directory containing the AI-dialogue fixture."""
+ project_dir = tmp_path / "ai_dialogue_project"
+ project_dir.mkdir()
+ (project_dir / "session_log.md").write_text(AI_DIALOGUE_FIXTURE)
+ return project_dir
+
+
+@pytest.fixture
+def corpus_origin_for_fixture() -> dict:
+ """The corpus_origin result a context-aware init would produce for the fixture."""
+ return {
+ "schema_version": 1,
+ "detected_at": "2026-04-26T00:00:00Z",
+ "result": {
+ "likely_ai_dialogue": True,
+ "confidence": 0.95,
+ "primary_platform": "Claude (Anthropic)",
+ "user_name": "Jordan",
+ "agent_persona_names": ["Echo", "Sparrow", "Cipher"],
+ "evidence": ["Synthetic fixture for the integration test"],
+ },
+ }
+
+
+# ── Baseline test: pin v3.3.3 behavior ────────────────────────────────────
+
+
+def test_baseline_v333_misclassifies_persona_names_as_people(ai_dialogue_corpus: Path):
+ """Without corpus_origin context, v3.3.3 entity_detector cannot
+ distinguish agent persona names from real people, and classifies them
+ into the 'people' bucket.
+
+ This test pins that behavior. Its purpose is documentation: the
+ corpus-origin feature's job is to fix this, and the post-fix test below
+ asserts the fix.
+ """
+ from mempalace.entity_detector import detect_entities, scan_for_detection
+
+ files = scan_for_detection(str(ai_dialogue_corpus))
+ detected = detect_entities(files)
+
+ people_names = {e["name"] for e in detected.get("people", [])}
+ uncertain_names = {e["name"] for e in detected.get("uncertain", [])}
+ all_classified = people_names | uncertain_names
+
+ # Persona names appear somewhere in the detection output (people or uncertain).
+ # If none of them surface at all, the fixture is no longer triggering
+ # the misclassification path and the test is no longer meaningful.
+ persona_names = {"Echo", "Sparrow", "Cipher"}
+ persona_hits = persona_names & all_classified
+ assert persona_hits, (
+ "Fixture no longer surfaces persona names as detected entities. "
+ "Update the fixture to keep this test meaningful."
+ )
+
+ # No agent_personas bucket exists on v3.3.3.
+ assert "agent_personas" not in detected, (
+ "v3.3.3 has no concept of agent_personas — if this key exists, "
+ "corpus-origin wiring has already shipped and this baseline test is stale."
+ )
+
+
+# ── corpus-origin test: with corpus_origin, personas reclassify ───────────
+
+
+def test_corpus_origin_reclassifies_personas(
+ ai_dialogue_corpus: Path, corpus_origin_for_fixture: dict
+):
+ """When corpus_origin context is passed to detect_entities, names
+ matching agent_persona_names land in an 'agent_personas' bucket
+ instead of being misclassified as people.
+
+ This is the fix. RED until the consumer wiring lands.
+ """
+ from mempalace.entity_detector import detect_entities, scan_for_detection
+
+ files = scan_for_detection(str(ai_dialogue_corpus))
+ detected = detect_entities(files, corpus_origin=corpus_origin_for_fixture)
+
+ # New bucket exists.
+ assert "agent_personas" in detected, (
+ "The corpus-origin wiring must add an 'agent_personas' bucket to the detect_entities "
+ "return shape when corpus_origin is provided."
+ )
+
+ persona_names_in_bucket = {e["name"] for e in detected["agent_personas"]}
+ persona_names_in_people = {e["name"] for e in detected.get("people", [])}
+
+ # All three personas land in the new bucket.
+ expected_personas = {"Echo", "Sparrow", "Cipher"}
+ assert expected_personas <= persona_names_in_bucket, (
+ f"Expected all three personas in agent_personas, got: {persona_names_in_bucket}"
+ )
+
+ # And NONE of them remain in the people bucket.
+ leaked = expected_personas & persona_names_in_people
+ assert not leaked, (
+ f"Persona names {leaked} leaked into 'people' bucket — the corpus-origin "
+ f"consumer wiring is supposed to filter them out."
+ )
+
+
+# ── discover_entities (project_scanner) threads corpus_origin ─────────────
+
+
+def test_discover_entities_threads_corpus_origin_through(
+ ai_dialogue_corpus: Path, corpus_origin_for_fixture: dict
+):
+ """discover_entities is the higher-level entry point cmd_init uses.
+ It must accept corpus_origin and produce the same persona reclassification
+ that detect_entities does, regardless of whether candidates entered via
+ prose, manifests, or git authors.
+ """
+ from mempalace.project_scanner import discover_entities
+
+ detected = discover_entities(
+ str(ai_dialogue_corpus),
+ corpus_origin=corpus_origin_for_fixture,
+ )
+
+ persona_names_in_bucket = {e["name"] for e in detected.get("agent_personas", [])}
+ persona_names_in_people = {e["name"] for e in detected.get("people", [])}
+ expected_personas = {"Echo", "Sparrow", "Cipher"}
+
+ # All personas surface in the agent_personas bucket via discover_entities too.
+ assert expected_personas <= persona_names_in_bucket, (
+ f"discover_entities did not thread corpus_origin to detect_entities. "
+ f"Expected {expected_personas} in agent_personas, got: "
+ f"{persona_names_in_bucket}"
+ )
+
+ leaked = expected_personas & persona_names_in_people
+ assert not leaked, f"discover_entities leaked persona names into 'people': {leaked}"
+
+
+def test_discover_entities_no_origin_unchanged_shape(ai_dialogue_corpus: Path):
+ """Backwards compatibility: when corpus_origin is omitted, the return
+ shape stays exactly what it was on v3.3.3 (no agent_personas key).
+ Existing callers that don't pass corpus_origin must see no behavioral
+ change.
+ """
+ from mempalace.project_scanner import discover_entities
+
+ detected = discover_entities(str(ai_dialogue_corpus))
+
+ # No new bucket appears unsolicited.
+ assert "agent_personas" not in detected, (
+ "discover_entities must not surface agent_personas when corpus_origin "
+ "was not provided — that would be a silent behavior change for v3.3.3 "
+ "callers who don't know about the corpus-origin feature."
+ )
+
+
+# ── Pass 0 — cmd_init runs corpus_origin and writes origin.json ──────────
+
+
+def _stub_cfg(palace_dir: Path):
+ """Build a MempalaceConfig stub whose palace_path points at tmp space.
+
+ Used by Pass 0 tests so the origin.json write is captured in tmp_path
+ instead of hitting the real ~/.mempalace location.
+ """
+ cfg = MagicMock()
+ cfg.palace_path = str(palace_dir)
+ cfg.entity_languages = ["en"]
+ return cfg
+
+
+def test_init_pass_zero_writes_origin_json_to_palace(ai_dialogue_corpus: Path, tmp_path: Path):
+ """cmd_init must run corpus_origin detection BEFORE entity detection
+ and persist the result to ``/.mempalace/origin.json`` in the
+ documented schema_version=1 wrapper.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ # no_llm=True isolates the test from any local LLM provider. With Ollama
+ # running locally and a small default model, Tier 2 can return a wrong
+ # classification that overrides the correct heuristic answer (Igor's PR
+ # #1211 review). The test asserts on heuristic behavior, so Tier 2 must
+ # not fire.
+ args = argparse.Namespace(dir=str(ai_dialogue_corpus), yes=True, no_llm=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ origin_path = palace / ".mempalace" / "origin.json"
+ assert origin_path.exists(), (
+ f"Pass 0 did not write {origin_path}. cmd_init is supposed to call "
+ f"corpus_origin detection and persist the result before entity detection."
+ )
+
+ data = json.loads(origin_path.read_text())
+ assert data.get("schema_version") == 1, (
+ "origin.json must declare schema_version=1 so future format changes "
+ "are detectable. Got: " + repr(data.get("schema_version"))
+ )
+ assert "detected_at" in data, "origin.json must include a detected_at timestamp"
+ assert "result" in data, "origin.json must wrap the CorpusOriginResult under 'result'"
+ assert isinstance(data["result"].get("likely_ai_dialogue"), bool)
+ # Fixture is heavy AI-dialogue — heuristic should classify as such.
+ assert data["result"]["likely_ai_dialogue"] is True, (
+ "Heuristic should classify the AI-dialogue fixture as AI-dialogue. "
+ f"Got: {data['result']}"
+ )
+
+
+def test_init_pass_zero_passes_corpus_origin_to_discover_entities(
+ ai_dialogue_corpus: Path, tmp_path: Path
+):
+ """The Pass 0 result must reach discover_entities via the corpus_origin
+ kwarg — that's what enables persona reclassification end-to-end.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ # no_llm=True isolates the test from any local LLM provider — see note
+ # on test_init_pass_zero_writes_origin_json_to_palace.
+ args = argparse.Namespace(dir=str(ai_dialogue_corpus), yes=True, no_llm=True)
+
+ captured = {}
+
+ def fake_discover(project_dir, **kwargs):
+ captured["kwargs"] = kwargs
+ return {"people": [], "projects": [], "uncertain": []}
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.project_scanner.discover_entities", side_effect=fake_discover),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ assert "corpus_origin" in captured.get("kwargs", {}), (
+ "cmd_init did not pass corpus_origin to discover_entities. The Pass 0 "
+ "detection result must be threaded into entity detection so persona "
+ "reclassification happens end-to-end."
+ )
+ origin = captured["kwargs"]["corpus_origin"]
+ assert origin is not None, (
+ "corpus_origin kwarg was passed but value was None — Pass 0 should "
+ "supply the actual detection result for AI-dialogue corpora."
+ )
+ assert origin.get("schema_version") == 1
+ assert "result" in origin
+
+
+def test_init_pass_zero_skipped_when_no_readable_files(tmp_path: Path):
+ """Empty project directory → no origin.json written, init still completes
+ without crashing. Aya's earlier finding: don't fail init on missing samples.
+ """
+ from mempalace.cli import cmd_init
+
+ project = tmp_path / "empty"
+ project.mkdir()
+ palace = tmp_path / "palace"
+ # no_llm=True so this test never tries to acquire an LLM provider for
+ # an empty corpus — the heuristic-skip behavior is what's being tested.
+ args = argparse.Namespace(dir=str(project), yes=True, no_llm=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args) # must not raise
+
+ origin_path = palace / ".mempalace" / "origin.json"
+ assert not origin_path.exists(), (
+ "Pass 0 must skip (no write) when there are no readable samples — "
+ "writing a 'cannot decide' result to disk would be misleading."
+ )
+
+
+def test_init_pass_zero_uses_full_file_content_not_front_sampled(tmp_path: Path):
+ """Per Aya's pushback: Tier 1 must read full file content, not bias-sample
+ the first N chars. AI signal that lives past the first 2000 chars must
+ still trip detection.
+ """
+ from mempalace.cli import cmd_init
+
+ project = tmp_path / "deep_signal"
+ project.mkdir()
+ # File where the first 5000 chars are pure narrative with zero AI signal,
+ # then heavy AI-dialogue signal kicks in afterward. A first-N-chars sampler
+ # would miss it; a full-content reader will not.
+ front_pad = "The quiet morning settled over the orchard. " * 120 # 44 * 120 = 5280 chars, no AI signal
+ ai_tail = (
+ "\n\nUser: claude code, please help me debug this MCP integration.\n"
+ "Assistant: Sure. I'll look at the LLM context window and the "
+ "embedding pipeline. Claude Code can run the analysis now.\n"
+ "User: also check ChatGPT compatibility.\n"
+ "Assistant: GPT-4 should handle that. The MCP protocol abstracts it.\n"
+ ) * 10
+ (project / "log.md").write_text(front_pad + ai_tail)
+
+ palace = tmp_path / "palace"
+ # no_llm=True is critical here: this test asserts the Tier 1 HEURISTIC
+ # reads full file content and catches AI signal past char 5280.
+ # Without no_llm, a local Ollama with a small default model can return
+ # a wrong classification ("not AI-dialogue") that overrides the correct
+ # heuristic answer. See PR #1211 review by @igorls for the full failure
+ # mode and its fix.
+ args = argparse.Namespace(dir=str(project), yes=True, no_llm=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ origin_path = palace / ".mempalace" / "origin.json"
+ assert origin_path.exists()
+ data = json.loads(origin_path.read_text())
+ assert data["result"]["likely_ai_dialogue"] is True, (
+ "AI signal at chars 5280+ was missed; this suggests Pass 0 is sampling "
+ "the file front instead of reading full content. Fix Tier 1 to use "
+ "full content per Aya's design pushback."
+ )
+
+
+# ── llm_refine consumer wiring ────────────────────────────────────────────
+
+
+def test_llm_refine_includes_corpus_origin_context_in_prompt(
+ corpus_origin_for_fixture: dict,
+):
+ """When corpus_origin is passed to refine_entities, the LLM call must
+ receive the corpus-origin context (platform, user_name, agent personas)
+ so it can disambiguate ambiguous candidates with knowledge that this
+ is AI-dialogue.
+
+ Per design, the same holds for llm_refine: the wider context improves
+ classification accuracy.
+ """
+ from types import SimpleNamespace
+
+ from mempalace.llm_refine import refine_entities
+
+ captured: dict = {}
+
+ class FakeProvider:
+ def classify(self, system, user, json_mode=False):
+ captured.setdefault("calls", []).append({"system": system, "user": user})
+ return SimpleNamespace(text='{"classifications": []}')
+
+ # A regex-derived candidate (no manifest/git signals) so it isn't
+ # skipped by _is_authoritative_*.
+ detected = {
+ "people": [],
+ "projects": [],
+ "uncertain": [
+ {"name": "Acme", "frequency": 3, "signals": ["appears 3x"], "type": "uncertain"}
+ ],
+ }
+
+ refine_entities(
+ detected,
+ corpus_text="Acme appears in some prose context here.",
+ provider=FakeProvider(),
+ show_progress=False,
+ corpus_origin=corpus_origin_for_fixture,
+ )
+
+ assert captured.get("calls"), "refine_entities did not call the provider"
+ full_prompt = captured["calls"][0]["system"] + "\n" + captured["calls"][0]["user"]
+
+ # The corpus-origin preamble must surface the user, agent personas,
+ # and platform so the LLM has corpus-level context.
+ assert "Jordan" in full_prompt, "user_name not surfaced in LLM context"
+ for persona in ("Echo", "Sparrow", "Cipher"):
+ assert persona in full_prompt, f"persona '{persona}' not in LLM context"
+ assert "Claude" in full_prompt, "primary_platform not surfaced in LLM context"
+
+
+def test_llm_refine_no_origin_keeps_v333_prompt_shape():
+ """Backwards compatibility: when corpus_origin is omitted, the prompt
+ sent to the LLM must NOT contain a corpus-origin preamble. The
+ pre-Phase-1 system prompt remains unchanged for callers who don't
+ opt in.
+ """
+ from types import SimpleNamespace
+
+ from mempalace.llm_refine import SYSTEM_PROMPT, refine_entities
+
+ captured: dict = {}
+
+ class FakeProvider:
+ def classify(self, system, user, json_mode=False):
+ captured["system"] = system
+ return SimpleNamespace(text='{"classifications": []}')
+
+ detected = {
+ "people": [],
+ "projects": [],
+ "uncertain": [
+ {"name": "Acme", "frequency": 3, "signals": ["appears 3x"], "type": "uncertain"}
+ ],
+ }
+
+ refine_entities(
+ detected,
+ corpus_text="Acme appears in some prose.",
+ provider=FakeProvider(),
+ show_progress=False,
+ )
+
+ assert captured["system"] == SYSTEM_PROMPT, (
+ "Without corpus_origin, refine_entities must use the unmodified "
+ "SYSTEM_PROMPT — no silent prompt drift for v3.3.3 callers."
+ )
+
+
+# ── mempalace mine --redetect-origin flag ───────────────────────────────
+
+
+def _mine_args(project_dir: Path, *, redetect: bool):
+ """Build a Namespace with all fields cmd_mine reads, scoped to the
+ minimal set our tests exercise. Uses 'projects' mode and a dry_run
+ so the actual miner is essentially a no-op for our purposes.
+ """
+ return argparse.Namespace(
+ dir=str(project_dir),
+ palace=None,
+ mode="projects",
+ wing=None,
+ no_gitignore=False,
+ include_ignored=[],
+ agent="mempalace",
+ limit=0,
+ dry_run=True,
+ extract="auto",
+ redetect_origin=redetect,
+ )
+
+
+def test_mine_default_does_not_redetect_origin(ai_dialogue_corpus: Path, tmp_path: Path):
+ """Default `mempalace mine` (no --redetect-origin flag) must NOT run
+ corpus_origin detection — the flag is opt-in.
+ """
+ from mempalace.cli import cmd_mine
+
+ palace = tmp_path / "palace"
+ args = _mine_args(ai_dialogue_corpus, redetect=False)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli._run_pass_zero") as mock_pass_zero,
+ patch("mempalace.miner.mine"),
+ ):
+ cmd_mine(args)
+
+ mock_pass_zero.assert_not_called()
+ assert not (palace / ".mempalace" / "origin.json").exists()
+
+
+def test_mine_with_redetect_origin_flag_writes_origin_json(
+ ai_dialogue_corpus: Path, tmp_path: Path
+):
+ """`mempalace mine --redetect-origin` re-runs corpus_origin detection
+ on the project and persists the result to /.mempalace/origin.json.
+ """
+ from mempalace.cli import cmd_mine
+
+ palace = tmp_path / "palace"
+ args = _mine_args(ai_dialogue_corpus, redetect=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.miner.mine"),
+ ):
+ cmd_mine(args)
+
+ origin_path = palace / ".mempalace" / "origin.json"
+ assert origin_path.exists(), "--redetect-origin must write /.mempalace/origin.json"
+ data = json.loads(origin_path.read_text())
+ assert data["schema_version"] == 1
+ assert data["result"]["likely_ai_dialogue"] is True
+
+
+def test_mine_redetect_overwrites_existing_origin_json(ai_dialogue_corpus: Path, tmp_path: Path):
+ """When origin.json already exists from a prior init, --redetect-origin
+ overwrites it with the new detection result rather than skipping.
+ Resolved as option (c): explicit user re-runs via flag.
+ """
+ from mempalace.cli import cmd_mine
+
+ palace = tmp_path / "palace"
+ origin_dir = palace / ".mempalace"
+ origin_dir.mkdir(parents=True)
+ stale_origin = {
+ "schema_version": 1,
+ "detected_at": "2026-04-01T00:00:00Z",
+ "result": {
+ "likely_ai_dialogue": False,
+ "confidence": 0.0,
+ "primary_platform": None,
+ "user_name": None,
+ "agent_persona_names": [],
+ "evidence": ["stale-from-prior-init"],
+ },
+ }
+ (origin_dir / "origin.json").write_text(json.dumps(stale_origin))
+
+ args = _mine_args(ai_dialogue_corpus, redetect=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.miner.mine"),
+ ):
+ cmd_mine(args)
+
+ fresh = json.loads((origin_dir / "origin.json").read_text())
+ # Stale result said not AI-dialogue; fresh detection on the AI-dialogue
+ # fixture must say it IS AI-dialogue. Confirms overwrite, not append/skip.
+ assert fresh["result"]["likely_ai_dialogue"] is True
+ assert fresh["detected_at"] != "2026-04-01T00:00:00Z"
+
+
+def test_mine_redetect_uses_full_content_not_sampled(tmp_path: Path):
+ """Regression for Aya's pushback: --redetect-origin must use the same
+ full-content reader as Pass 0 (not first-N-chars sampling).
+ """
+ from mempalace.cli import cmd_mine
+
+ project = tmp_path / "deep_signal"
+ project.mkdir()
+ front_pad = "The quiet morning settled over the orchard. " * 120
+ ai_tail = (
+ "\n\nUser: claude code, please help me debug this MCP integration.\n"
+ "Assistant: ChatGPT compatibility too. Claude Code can run analysis.\n"
+ ) * 10
+ (project / "log.md").write_text(front_pad + ai_tail)
+
+ palace = tmp_path / "palace"
+ args = _mine_args(project, redetect=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.miner.mine"),
+ ):
+ cmd_mine(args)
+
+ data = json.loads((palace / ".mempalace" / "origin.json").read_text())
+ assert data["result"]["likely_ai_dialogue"] is True, (
+ "--redetect-origin missed AI signal at chars 5280+; it appears to "
+ "be front-sampling instead of reading full content."
+ )
+
+
+# ── --llm default flip + graceful fallback ───────────────────────────────
+
+
+def _init_args(project_dir: Path, *, no_llm: bool = False, **overrides):
+ """Build an init Namespace with all fields the parser supplies."""
+ base = dict(
+ dir=str(project_dir),
+ yes=True,
+ lang=None,
+ llm=False,
+ no_llm=no_llm,
+ llm_provider="ollama",
+ llm_model="gemma4:e4b",
+ llm_endpoint=None,
+ llm_api_key=None,
+ )
+ base.update(overrides)
+ return argparse.Namespace(**base)
+
+
+def test_init_default_attempts_llm_provider(ai_dialogue_corpus: Path, tmp_path: Path):
+ """``mempalace init`` (no flags) MUST try to acquire an LLM
+ provider. This is the default-flip — opt-in becomes opt-out.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus)
+
+ fake_provider = MagicMock()
+ fake_provider.check_available.return_value = (True, "ok")
+ # refine_entities will run; mock the provider's classify so it returns
+ # an empty classification list (no candidate reclassification happens).
+ fake_provider.classify.return_value = MagicMock(text='{"classifications": []}')
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider", return_value=fake_provider) as mock_get,
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
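+ # The default path must reach get_provider exactly once. A plain assert
+ # is used so the explanatory message is reported on failure.
+ assert mock_get.call_count == 1, (
+ "Default `mempalace init` did not attempt LLM provider acquisition. "
+ "--llm is now ON by default. "
+ f"get_provider call count: {mock_get.call_count}"
+ )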
+
+
+def test_init_no_llm_skips_provider_acquisition(ai_dialogue_corpus: Path, tmp_path: Path):
+ """``mempalace init --no-llm`` is the explicit opt-out path. No
+ provider acquisition attempt; init runs in heuristics-only mode.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus, no_llm=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider") as mock_get,
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ assert not mock_get.called, (
+ "--no-llm must NOT call get_provider: it is the heuristics-only opt-out. "
+ f"get_provider call count: {mock_get.call_count}"
+ )
+
+
+def test_init_graceful_fallback_when_provider_unavailable(
+ ai_dialogue_corpus: Path, tmp_path: Path, capsys
+):
+ """Per design: never block init on a missing LLM. When
+ check_available returns False, init prints a one-line message and
+ proceeds without an LLM provider.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus)
+
+ fake_provider = MagicMock()
+ fake_provider.check_available.return_value = (False, "Ollama not reachable at localhost:11434")
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider", return_value=fake_provider),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args) # MUST NOT raise SystemExit
+
+ out = capsys.readouterr().out
+ # The fallback message should mention how to silence (--no-llm) so the
+ # user knows what flipped.
+ assert (
+ "no-llm" in out.lower()
+ ), f"Graceful fallback message must point at --no-llm. Got: {out!r}"
+
+
+def test_init_graceful_fallback_on_provider_construction_error(
+ ai_dialogue_corpus: Path, tmp_path: Path, capsys
+):
+ """When get_provider raises (e.g. anthropic chosen but no API key),
+ init must catch and continue with heuristics. Not crash.
+ """
+ from mempalace.cli import cmd_init
+ from mempalace.llm_client import LLMError
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider", side_effect=LLMError("no api key")),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args) # MUST NOT raise
+
+ out = capsys.readouterr().out
+ assert "no-llm" in out.lower(), (
+ "Provider-construction failure must surface a one-line message "
+ f"pointing at --no-llm. Got: {out!r}"
+ )
+
+
+def test_init_legacy_llm_flag_compatible(ai_dialogue_corpus: Path, tmp_path: Path):
+ """Backwards compatibility: `mempalace init --llm` still works as
+ before (LLM enabled). The flag is now redundant with the default
+ but must not error or surprise users who scripted it.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus, llm=True)
+
+ fake_provider = MagicMock()
+ fake_provider.check_available.return_value = (True, "ok")
+ fake_provider.classify.return_value = MagicMock(text='{"classifications": []}')
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider", return_value=fake_provider) as mock_get,
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ mock_get.assert_called_once()
+
+
+# ── End-to-end pipeline + edge cases ──────────────────────────────────────
+
+
+def test_end_to_end_init_with_llm_separates_personas(ai_dialogue_corpus: Path, tmp_path: Path):
+ """End-to-end through `mempalace init` on the DEFAULT path (LLM enabled).
+ Confirms the whole chain works without trusting per-stage mocks:
+
+ cmd_init -> _run_pass_zero -> Tier 1 + Tier 2 -> origin.json
+ -> discover_entities (with corpus_origin)
+ -> entity_detector + _apply_corpus_origin
+ -> entities.json saved
+
+ The misclassification this PR fixes (persona names ending up as people)
+ must NOT appear in the saved entities.json on the default path. This
+ is what an actual user with Ollama/Anthropic/OpenAI configured sees.
+
+ Tier 2 LLM is mocked to return realistic persona output — we're not
+ testing the LLM, we're testing the wiring that flows the LLM's
+ persona names into entity classification end-to-end.
+ """
+ from mempalace.cli import cmd_init
+ from mempalace.corpus_origin import CorpusOriginResult
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus) # default = LLM ON
+
+ fake_provider = MagicMock()
+ fake_provider.check_available.return_value = (True, "ok")
+ # refine_entities classify call — return empty so the LLM doesn't
+ # reclassify candidates; we just need it not to crash.
+ fake_provider.classify.return_value = MagicMock(text='{"classifications": []}')
+
+ # Tier 2 corpus-origin LLM call — return the persona/user info that a
+ # real Haiku call would extract from the AI-dialogue fixture.
+ fake_llm_origin_result = CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=0.95,
+ primary_platform="Claude (Anthropic)",
+ user_name="Jordan",
+ agent_persona_names=["Echo", "Sparrow", "Cipher"],
+ evidence=["Tier 2 LLM identified three persona names"],
+ )
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider", return_value=fake_provider),
+ patch(
+ "mempalace.cli.detect_origin_llm",
+ return_value=fake_llm_origin_result,
+ ),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ # 1. origin.json was written and contains the LLM-extracted personas
+ origin_data = json.loads((palace / ".mempalace" / "origin.json").read_text())
+ assert origin_data["result"]["likely_ai_dialogue"] is True
+ assert origin_data["result"]["agent_persona_names"] == ["Echo", "Sparrow", "Cipher"]
+ assert origin_data["result"]["user_name"] == "Jordan"
+
+ # 2. entities.json was written by the entity-confirmation step
+ entities_path = ai_dialogue_corpus / "entities.json"
+ assert entities_path.exists()
+ entities = json.loads(entities_path.read_text())
+
+ # 3. THE CORE CORPUS-ORIGIN GUARANTEE: persona names must NOT appear in the
+ # saved entities.json people list. This is what downstream tools
+ # (miner, searcher, MCP) will read.
+ saved_people = set(entities.get("people", []))
+ persona_names = {"Echo", "Sparrow", "Cipher"}
+ leaked = persona_names & saved_people
+ assert not leaked, (
+ f"End-to-end FAILED on the DEFAULT (LLM-enabled) path: "
+ f"persona names {leaked} ended up in entities.json's people list. "
+ f"Saved people: {saved_people}"
+ )
+
+
+def test_no_llm_path_matches_v333_classification(ai_dialogue_corpus: Path, tmp_path: Path):
+ """Documents the --no-llm degradation honestly: persona reclassification
+ requires Tier 2 (LLM) to extract persona names. With --no-llm, the
+ Tier 1 heuristic only answers 'is this AI-dialogue?' (yes/no gate).
+ Persona names are NOT extracted and thus NOT reclassified.
+
+ This is BY DESIGN — Tier 2 is where persona extraction lives. The
+ no-LLM path is a graceful degradation, not a corpus-origin promise.
+
+ The test PINS that v3.3.3-equivalent behavior on this path:
+ persona names appear in entities.json's people list, exactly as they
+ would on plain v3.3.3. Users who want persona reclassification must
+ have an LLM provider configured (default behavior).
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus, no_llm=True) # explicit opt-out
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ # origin.json still written — Tier 1 still runs and detects AI-dialogue.
+ origin = json.loads((palace / ".mempalace" / "origin.json").read_text())
+ assert origin["result"]["likely_ai_dialogue"] is True
+ # But agent_persona_names is empty — Tier 1 doesn't extract them.
+ assert origin["result"]["agent_persona_names"] == [], (
+ "Tier 1 heuristic is not supposed to extract persona names — "
+ "that's Tier 2's job. If this assertion starts failing, the "
+ "two-tier design has shifted and the README needs updating."
+ )
+
+ # entities.json shows v3.3.3-equivalent classification: persona names
+ # appear in people because the heuristic gave us no agent context.
+ entities = json.loads((ai_dialogue_corpus / "entities.json").read_text())
+ saved_people = set(entities.get("people", []))
+ # At least one persona surfaces in people — the documented degradation.
+ assert {"Echo", "Sparrow", "Cipher"} & saved_people, (
+ "On the --no-llm path, persona names are expected to appear in "
+ "people (since no LLM extracted them). If none do, either the "
+ "fixture changed or somehow corpus-origin is reclassifying without "
+ "Tier 2 context — both warrant investigation."
+ )
+
+
+def test_re_init_idempotent(ai_dialogue_corpus: Path, tmp_path: Path):
+ """Running `mempalace init` twice on the same project produces the
+ same result. origin.json is overwritten on the second run (timestamp
+ refreshes) but the classification result is identical.
+
+ Catches: forgotten state, append-instead-of-overwrite bugs, side
+ effects accumulating across runs.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus, no_llm=True)
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+ first = json.loads((palace / ".mempalace" / "origin.json").read_text())
+ cmd_init(args)
+ second = json.loads((palace / ".mempalace" / "origin.json").read_text())
+
+ # The result payload must be identical between runs (same fixture, same
+ # heuristic, no nondeterminism in Tier 1).
+ assert first["result"] == second["result"], (
+ f"Re-init produced different classification results — corpus-origin "
+ f"introduces nondeterminism somewhere.\nfirst: {first['result']}\n"
+ f"second: {second['result']}"
+ )
+ assert first["schema_version"] == second["schema_version"] == 1
+
+
+def test_persona_user_name_collision_user_kept_in_people(
+ tmp_path: Path,
+):
+ """Edge case: a user_name that COLLIDES with a persona name string.
+ (corpus_origin's own tests cover the same collision at detection time.)
+
+ The corpus_origin module guarantees user_name is filtered out of
+ agent_persona_names BEFORE the result is serialized, in the LLM tier's
+ parser. So by the time _apply_corpus_origin sees the dict, the persona
+ list is already user-clean.
+
+ This test covers the consumer side: what happens if a user_name is
+ nevertheless also in agent_persona_names (e.g. a future tool writes
+ origin.json by hand with overlap). The intended end state is that the
+ user keeps their place in the people bucket and is not reclassified as
+ an agent. That precedence logic does not exist yet, so the assertion
+ at the end of this test only pins today's behavior.
+ """
+ from mempalace.entity_detector import detect_entities
+
+ project = tmp_path / "collision_corpus"
+ project.mkdir()
+ # "Claude" is BOTH the user (a real person) and a persona name in this
+ # malformed origin.json. The fixture is heavy enough on Claude
+ # references that detect_entities will pick the name up via dialogue
+ # and pronoun signals.
+ text = (
+ "Claude wrote a long entry about her morning. Claude said "
+ "the day was beautiful. She walked to the park. Claude smiled. "
+ "Claude noticed the leaves had changed. She continued home. "
+ "Claude thought about dinner. She prepared a meal. Claude ate slowly."
+ )
+ (project / "diary.md").write_text(text)
+
+ # Malformed origin.json where user_name overlaps with personas.
+ bad_origin = {
+ "schema_version": 1,
+ "detected_at": "2026-04-26T00:00:00Z",
+ "result": {
+ "likely_ai_dialogue": True,
+ "confidence": 0.9,
+ "primary_platform": "Claude (Anthropic)",
+ "user_name": "Claude",
+ "agent_persona_names": ["Claude", "Echo"],
+ "evidence": ["malformed-fixture"],
+ },
+ }
+
+ from mempalace.entity_detector import scan_for_detection
+
+ files = scan_for_detection(str(project))
+ # Apply corpus-origin with the malformed origin.
+ detected = detect_entities(files, corpus_origin=bad_origin)
+
+ # The current implementation moves any name matching a persona into
+ # agent_personas, so with the malformed input above "Claude" WOULD move.
+ # The assertion below is deliberately loose: it fails only if Claude is
+ # absent from agent_personas AND still present in people, i.e. the
+ # persona/people split is inconsistent. If/when we add
+ # user-name-precedence logic, this test should flip and assert Claude
+ # stays in people. Pinning current behavior so future changes are
+ # deliberate.
+ persona_names = {e["name"] for e in detected.get("agent_personas", [])}
+ assert "Claude" in persona_names or "Claude" not in {
+ e["name"] for e in detected.get("people", [])
+ }, (
+ "Inconsistent persona/people split on malformed origin.json — "
+ "Claude is neither in personas nor filtered from people. "
+ "Behavior is ambiguous, fix the consumer wiring to be explicit."
+ )
+
+
+def test_discover_entities_without_corpus_origin_has_no_agent_personas(ai_dialogue_corpus: Path):
+ """Backwards compatibility: when corpus_origin is omitted, the return
+ shape stays exactly what it was on v3.3.3 (no agent_personas key).
+ Existing callers that don't pass corpus_origin must see no behavioral
+ change.
+ """
+ from mempalace.project_scanner import discover_entities
+
+ detected = discover_entities(str(ai_dialogue_corpus))
+
+ # No new bucket appears unsolicited.
+ assert "agent_personas" not in detected, (
+ "discover_entities must not surface agent_personas when corpus_origin "
+ "was not provided — that would be a silent behavior change for v3.3.3 "
+ "callers who don't know about the corpus-origin feature."
+ )
+
+
+# ─────────────────────────────────────────────────────────────────────────
+# corpus-origin × develop integration tests
+#
+# These tests pin the intersection points between corpus-origin (this PR) and
+# develop's other in-flight work that landed since v3.3.3. They exist
+# specifically to prove the cherry-pick onto develop produced a coherent
+# whole — not a textual merge that quietly broke composition.
+# ─────────────────────────────────────────────────────────────────────────
+
+
+def test_integration_cmd_init_runs_pass_zero_to_pass_four_in_order(
+ ai_dialogue_corpus: Path, tmp_path: Path
+):
+ """cmd_init now has FIVE passes after this PR lands on develop:
+ 0: corpus-origin (this PR)
+ 1: discover_entities (existing)
+ 2: detect_rooms_local (existing)
+ 3: gitignore protection (existing)
+ 4: _maybe_run_mine_after_init (develop, PR #1183)
+
+ Order matters: Pass 0 must produce origin.json BEFORE Pass 1 reads
+ it, and Pass 4 must run AFTER cfg.init() so the user is offered to
+ mine a fully-set-up directory. This test pins the order so any
+ future re-shuffle is caught.
+ """
+ from mempalace.cli import cmd_init
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus, no_llm=True)
+ call_log: list = []
+
+ from mempalace.cli import _run_pass_zero as real_run_pass_zero
+
+ def trace_pass_zero(*a, **kw):
+ call_log.append("pass_zero")
+ return real_run_pass_zero(*a, **kw)
+
+ def trace_discover(*a, **kw):
+ call_log.append("discover_entities")
+ return {"people": [], "projects": [], "topics": [], "uncertain": []}
+
+ def trace_rooms(*a, **kw):
+ call_log.append("detect_rooms_local")
+
+ def trace_gitignore(*a, **kw):
+ call_log.append("gitignore")
+ return False
+
+ def trace_mine_prompt(*a, **kw):
+ call_log.append("mine_prompt")
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli._run_pass_zero", side_effect=trace_pass_zero),
+ patch("mempalace.project_scanner.discover_entities", side_effect=trace_discover),
+ patch("mempalace.room_detector_local.detect_rooms_local", side_effect=trace_rooms),
+ patch("mempalace.cli._ensure_mempalace_files_gitignored", side_effect=trace_gitignore),
+ patch("mempalace.cli._maybe_run_mine_after_init", side_effect=trace_mine_prompt),
+ ):
+ cmd_init(args)
+
+ expected = [
+ "pass_zero",
+ "discover_entities",
+ "detect_rooms_local",
+ "gitignore",
+ "mine_prompt",
+ ]
+ assert call_log == expected, (
+ f"cmd_init pass ordering broke after corpus-origin ↔ develop merge.\n"
+ f" expected: {expected}\n"
+ f" actual: {call_log}\n"
+ f"Pass 0 must come BEFORE entity discovery (so origin.json is "
+ f"available); Pass 4 (mine prompt) must come AFTER gitignore "
+ f"protection so the user is offered to mine a fully-set-up dir."
+ )
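
The ordering test above relies on a general pattern: patch each pass with a `side_effect` that appends to a shared log, then compare the log with the expected sequence. A minimal sketch of that pattern on a toy pipeline follows. The names `make_pipeline`, `traced_order`, `detect_origin`, `discover` and `offer_mine` are hypothetical stand-ins, not mempalace APIs.

```python
from contextlib import ExitStack
from types import SimpleNamespace
from unittest.mock import patch


def make_pipeline() -> SimpleNamespace:
    """Hypothetical three-pass pipeline; passes are looked up at call time."""
    ns = SimpleNamespace()
    ns.detect_origin = lambda: {"likely_ai_dialogue": True}
    ns.discover = lambda: {"people": []}
    ns.offer_mine = lambda: None

    def run():
        ns.detect_origin()
        ns.discover()
        ns.offer_mine()

    ns.run = run
    return ns


def traced_order(ns: SimpleNamespace) -> list:
    """Run the pipeline with every pass patched and return the call order."""
    call_log = []
    real_detect = ns.detect_origin  # capture the real pass before patching

    def trace_detect(*a, **kw):
        call_log.append("detect_origin")
        return real_detect(*a, **kw)  # pass through to the real implementation

    def trace_discover(*a, **kw):
        call_log.append("discover")
        return {"people": []}  # stubbed result, the real pass is skipped

    def trace_mine(*a, **kw):
        call_log.append("offer_mine")

    patches = [
        patch.object(ns, "detect_origin", side_effect=trace_detect),
        patch.object(ns, "discover", side_effect=trace_discover),
        patch.object(ns, "offer_mine", side_effect=trace_mine),
    ]
    with ExitStack() as stack:
        for p in patches:
            stack.enter_context(p)
        ns.run()
    return call_log
```

Capturing the real function before the patch is applied matters: once the patch is active, looking up the attribute returns the mock, so a wrapper that looked it up late would call itself.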
+
+
+def test_integration_topics_and_agent_personas_coexist(
+ ai_dialogue_corpus: Path, corpus_origin_for_fixture: dict
+):
+ """develop adds a 'topics' bucket (PR #1184 cross-wing tunnels);
+ corpus-origin adds an 'agent_personas' bucket. Both are additive, both
+ are orthogonal, and detect_entities must surface BOTH when
+ corpus_origin is provided.
+
+ Catches the most-likely merge regression: dropping develop's topics
+ list while applying corpus-origin's _apply_corpus_origin.
+ """
+ from mempalace.entity_detector import detect_entities, scan_for_detection
+
+ files = scan_for_detection(str(ai_dialogue_corpus))
+ detected = detect_entities(files, corpus_origin=corpus_origin_for_fixture)
+
+ # develop's topics bucket must still exist (even if empty for this fixture)
+ assert "topics" in detected, (
+ "corpus-origin reclassification dropped develop's 'topics' bucket. "
+ "_apply_corpus_origin must preserve all keys it doesn't own."
+ )
+ # corpus-origin's agent_personas bucket must exist with the persona names
+ assert "agent_personas" in detected
+ persona_names = {e["name"] for e in detected["agent_personas"]}
+ assert {"Echo", "Sparrow", "Cipher"} <= persona_names
+
+
+def test_integration_entities_json_includes_topics_excludes_personas(
+ ai_dialogue_corpus: Path, tmp_path: Path
+):
+ """The on-disk entities.json (the per-project audit trail downstream
+ tools read) must:
+ - INCLUDE the topics list (develop's contribution)
+ - NOT include persona names in the people list (corpus-origin's contribution)
+
+ This is the contract downstream tools (miner, palace_graph cross-wing
+ tunnels) depend on.
+ """
+ from mempalace.cli import cmd_init
+ from mempalace.corpus_origin import CorpusOriginResult
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus)
+
+ fake_provider = MagicMock()
+ fake_provider.check_available.return_value = (True, "ok")
+ # llm_refine returns nothing (no reclassifications) — keeps test deterministic
+ fake_provider.classify.return_value = MagicMock(text='{"classifications": []}')
+
+ fake_origin = CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=0.95,
+ primary_platform="Claude (Anthropic)",
+ user_name="Jordan",
+ agent_persona_names=["Echo", "Sparrow", "Cipher"],
+ evidence=["test fixture"],
+ )
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider", return_value=fake_provider),
+ patch("mempalace.cli.detect_origin_llm", return_value=fake_origin),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ ):
+ cmd_init(args)
+
+ entities_path = ai_dialogue_corpus / "entities.json"
+ assert entities_path.exists()
+ entities = json.loads(entities_path.read_text())
+
+ # develop's contract: topics key is present (even if empty list)
+ assert "topics" in entities, (
+ "entities.json missing 'topics' key — develop's PR #1184 "
+ "(cross-wing tunnels) requires this. The corpus-origin wiring must not "
+ "have stripped it."
+ )
+
+ # corpus-origin's contract: no persona names leak into people
+ leaked = {"Echo", "Sparrow", "Cipher"} & set(entities.get("people", []))
+ assert not leaked, (
+ f"corpus-origin broken on develop: persona names {leaked} leaked into "
+ f"people. The merge dropped agent_persona reclassification."
+ )
+
+
+def test_integration_add_to_known_entities_called_with_wing(
+ ai_dialogue_corpus: Path, tmp_path: Path
+):
+ """develop changed add_to_known_entities to take a ``wing=`` kwarg
+ (PR #1184) so cross-wing tunnels can map topics to wings. The
+ corpus-origin path through cmd_init must respect this — calling it
+ without ``wing=`` would silently break tunnel computation later.
+ """
+ from mempalace.cli import cmd_init
+ from mempalace.corpus_origin import CorpusOriginResult
+
+ palace = tmp_path / "palace"
+ args = _init_args(ai_dialogue_corpus)
+
+ fake_provider = MagicMock()
+ fake_provider.check_available.return_value = (True, "ok")
+ fake_provider.classify.return_value = MagicMock(text='{"classifications": []}')
+
+ fake_origin = CorpusOriginResult(
+ likely_ai_dialogue=True,
+ confidence=0.95,
+ primary_platform=None,
+ user_name="Jordan",
+ agent_persona_names=["Echo", "Sparrow", "Cipher"],
+ evidence=[],
+ )
+
+ with (
+ patch("mempalace.cli.MempalaceConfig", return_value=_stub_cfg(palace)),
+ patch("mempalace.cli.get_provider", return_value=fake_provider),
+ patch("mempalace.cli.detect_origin_llm", return_value=fake_origin),
+ patch("mempalace.cli._maybe_run_mine_after_init"),
+ patch("mempalace.room_detector_local.detect_rooms_local"),
+ patch("mempalace.miner.add_to_known_entities") as mock_add,
+ ):
+ cmd_init(args)
+
+ if mock_add.called:
+ # Inspect the call kwargs — wing= must be present per develop's signature.
+ _, kwargs = mock_add.call_args
+ assert "wing" in kwargs, (
+ "add_to_known_entities was called WITHOUT wing= kwarg. "
+ "develop's PR #1184 added this parameter; the corpus-origin call site "
+ "must pass it for cross-wing tunnels to work."
+ )
+ assert kwargs["wing"] == ai_dialogue_corpus.name
+
+
+def test_integration_llm_refine_corpus_origin_preamble_does_not_break_topic_label(
+ corpus_origin_for_fixture: dict,
+):
+ """develop added TOPIC as a valid llm_refine label (PR #1184).
+ corpus-origin prepends a CORPUS CONTEXT preamble to the system prompt.
+ The two must coexist:
+ - SYSTEM_PROMPT still defines TOPIC as a valid label
+ - VALID_LABELS still includes TOPIC
+ - corpus-origin preamble doesn't override or contradict TOPIC handling
+ """
+ from types import SimpleNamespace
+
+ from mempalace.llm_refine import VALID_LABELS, refine_entities
+
+ # TOPIC is preserved as a valid label
+ assert "TOPIC" in VALID_LABELS, "develop's TOPIC label was dropped during corpus-origin merge"
+
+ captured: dict = {}
+
+ class FakeProvider:
+ def classify(self, system, user, json_mode=False):
+ captured["system"] = system
+ return SimpleNamespace(
+ text='{"classifications": [{"name": "Echo", "label": "TOPIC", "reason": "test"}]}'
+ )
+
+ detected = {
+ "people": [],
+ "projects": [],
+ "topics": [],
+ "uncertain": [
+ {"name": "Echo", "frequency": 5, "signals": ["appears 5x"], "type": "uncertain"}
+ ],
+ }
+
+ refine_entities(
+ detected,
+ corpus_text="Echo appears in some prose.",
+ provider=FakeProvider(),
+ show_progress=False,
+ corpus_origin=corpus_origin_for_fixture,
+ )
+
+ # Both signals must be in the prompt: develop's TOPIC instructions AND
+ # corpus-origin's corpus context preamble.
+ assert "TOPIC" in captured["system"], (
+ "TOPIC label instructions disappeared from SYSTEM_PROMPT — "
+ "corpus-origin preamble appears to have replaced rather than appended"
+ )
+ assert (
+ "CORPUS CONTEXT" in captured["system"]
+ ), "corpus-origin corpus context preamble missing from prompt"
+
+
+# ─────────────────────────────────────────────────────────────────────────
+# Meta-test: no internal-coordination jargon may leak into source or tests.
+#
+# Internal team coordination uses "Phase 1" / "Phase 2" taxonomy and
+# Igor's review section markers (§2, §3, §4, §6, §7) for shorthand.
+# Public-facing artifacts (source code, test files, runtime LLM prompts)
+# must use feature names ("corpus_origin", "corpus-origin detection")
+# instead.
+#
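+# This test asserts that nothing in `mempalace/` or `tests/` contains those
+# markers (this file itself is skipped, since it has to spell the patterns
+# out). If a future commit re-introduces "Phase 1" or "Igor's review §"
+# anywhere, this test goes RED and blocks the merge.
+#
+# Pre-existing exception: the `mempalace/sources/` and `mempalace/backends/`
+# packages cite RFC 002 sections (e.g. "§5.5") as legitimate spec
+# references. Those are allowed, as are the other paths listed in
+# `allowed_section_paths` below. The allowlist covers only the § check;
+# the "Phase"/"Igor" patterns are flagged everywhere.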
+# ─────────────────────────────────────────────────────────────────────────
+
+
+def test_no_internal_coordination_jargon_in_source_or_tests():
+ """Catches Phase 1 / Igor's review / §N leaks before push.
+
+ The naming decision: feature names in public artifacts, phase names
+ only in internal coordination. This test enforces that on every CI run.
+ """
+ import re
+ from pathlib import Path
+
+ repo_root = Path(__file__).resolve().parent.parent
+ leak_re = re.compile(r"(Phase ?[12]|Igor's review|Igor's spec)", re.IGNORECASE)
+ section_re = re.compile(r"§ ?[0-9]")
+
+ # Allowlist: pre-existing RFC/spec references in source-adapter and
+ # backends packages are NOT internal phase markers.
+ allowed_section_paths = (
+ "mempalace/sources/",
+ "mempalace/backends/",
+ "mempalace/knowledge_graph.py",
+ "mempalace/i18n/",
+ "tests/test_sources.py",
+ "tests/test_i18n_lang_case.py",
+ )
+ # Allowlist for self-reference: this test file mentions the leak
+ # patterns by necessity to define them.
+ SELF = Path(__file__).resolve()
+
+ leaks: list = []
+ for pattern_dir in ("mempalace", "tests"):
+ for path in (repo_root / pattern_dir).rglob("*.py"):
+ if path.resolve() == SELF:
+ continue
+ try:
+ text = path.read_text(encoding="utf-8")
+ except (OSError, UnicodeDecodeError):
+ continue
+ # Use as_posix() so the allowlist (forward-slash paths) matches
+ # on Windows too — Path.relative_to(...) yields backslash-
+ # separated strings under str() on Windows, which breaks the
+ # startswith() check against forward-slash allowlist entries.
+ rel_posix = path.relative_to(repo_root).as_posix()
+ for line_num, line in enumerate(text.splitlines(), 1):
+ if leak_re.search(line):
+ leaks.append(f"{rel_posix}:{line_num}: {line.strip()}")
+ if section_re.search(line):
+ if not any(rel_posix.startswith(allowed) for allowed in allowed_section_paths):
+ leaks.append(f"{rel_posix}:{line_num}: {line.strip()}")
+
+ assert not leaks, (
+ "Internal-coordination jargon leaked into source or tests:\n"
+ + "\n".join(f" - {leak}" for leak in leaks[:20])
+ + ("\n ..." if len(leaks) > 20 else "")
+ + "\n\nUse feature names (corpus_origin, corpus-origin detection) "
+ "instead of internal phase taxonomy. See "
+ "feedback_apply_naming_decision_actively.md."
+ )
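
The scan in `test_no_internal_coordination_jargon_in_source_or_tests` can be tried in isolation on a throwaway directory. The sketch below reuses the test's two regexes and its `as_posix()` allowlist check. The helper `find_jargon_leaks`, the `demo` function and the `pkg/` layout are hypothetical and exist only for illustration. Unlike the test, a line that matches both patterns is reported once here rather than twice.

```python
import re
import tempfile
from pathlib import Path

# Same patterns as the test above.
LEAK_RE = re.compile(r"(Phase ?[12]|Igor's review|Igor's spec)", re.IGNORECASE)
SECTION_RE = re.compile(r"§ ?[0-9]")


def find_jargon_leaks(root: Path, allowed_section_paths: tuple) -> list:
    """Hypothetical helper: return 'path:line: text' for every flagged line."""
    leaks = []
    for path in sorted(root.rglob("*.py")):
        # as_posix() keeps forward slashes on Windows, so the
        # forward-slash allowlist entries still match.
        rel_posix = path.relative_to(root).as_posix()
        section_ok = rel_posix.startswith(allowed_section_paths)
        text = path.read_text(encoding="utf-8")
        for line_num, line in enumerate(text.splitlines(), 1):
            # The allowlist exempts only the section-marker check.
            if LEAK_RE.search(line) or (SECTION_RE.search(line) and not section_ok):
                leaks.append(f"{rel_posix}:{line_num}: {line.strip()}")
    return leaks


def demo() -> list:
    """Build a two-file tree: one allowlisted § reference, one phase leak."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "pkg" / "sources").mkdir(parents=True)
        (root / "pkg" / "sources" / "adapter.py").write_text(
            "# see RFC 002 §5.5\n", encoding="utf-8"
        )
        (root / "pkg" / "core.py").write_text(
            "x = 1\n# done in phase 1\n", encoding="utf-8"
        )
        return find_jargon_leaks(root, ("pkg/sources/",))
```

Calling `demo()` flags only the `phase 1` comment in `pkg/core.py`; the `§5.5` reference under `pkg/sources/` is exempt because its path is allowlisted.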