feat(init): context-aware corpus detection
10 files changed, 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.

This PR overhauls the first-run experience of `mempalace init`, ships a new corpus-origin detection module written from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path so that `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.

The headline change is that `mempalace init` now understands what kind of folder you're pointing it at (AI conversations, regular writing, code, narrative) and adapts how it classifies entities accordingly. A folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) used to dump those into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.

The broader change is that `mempalace init` got upgraded across the board: smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.

## What changes for users (in order, from `pip install` onwards)

**Install:** `pip install mempalace` is unchanged. The package itself didn't shift.

**First run: `mempalace init <folder>`**

1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds whether the folder holds AI conversations, regular writing, narrative, or code. If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step: it dove straight into entity detection with no corpus context.
2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (it extracts persona names, refines ambiguous classifications, and gives the model corpus context), so it now runs by default. The provider abstraction is unchanged from v3.3.3; `mempalace.llm_client` supports three provider buckets:
   - **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`): the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~\$0.01 per `init`.
   - **Ollama** (`--llm-provider ollama`, the default): local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
   - **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`): per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.
3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. The graceful fallback is new and exists to support the new default. `--no-llm` is the new explicit opt-out.
4. **`init` shows you what it detected.** A one-line banner, such as `Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)`, tells you at a glance whether mempalace understood your folder.
5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.
6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.
7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it.
8. **Re-running `init` is now idempotent.** Bug fix: running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.

**Later, when your folder grows**

9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again; `init` is now idempotent (item 8), so re-running it on the same folder is safe.

## Behind the changes

- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection: a regex heuristic with a co-occurrence rule (it suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, and llama-rancher journals don't false-positive), and an LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure with belt-and-suspenders user-vs-agent disambiguation.
- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When it is present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. The per-entity `type` is rewritten to `"agent_persona"`.
- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt, giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.
- **`init` overhaul.** Pass 0 (corpus-origin detection) is inserted before the existing Pass 1 (entity discovery). `--llm` is flipped to default-on. `--no-llm` is added. A graceful-fallback path replaces the previous hard error on a missing LLM. Provider precedence is unchanged from the existing `llm_client` module.
- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.
- **`CLAUDE.md` design principle reworded** to "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.

## Cost story

- **Anthropic (verified path):** ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.

## Backwards compatibility

- All public function signatures gained the `corpus_origin` kwarg as optional (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged: no `agent_personas` key, no behavioral change.
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.

## Test coverage

- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end runs through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection`, caused by the optional chromadb dependency) are unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.

## Hygiene guardrail

This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts that no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. The test goes red if anything slips in. An allowlist covers legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
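The persona routing described under "Behind the changes" can be sketched roughly as follows. This is a minimal illustration rather than the code in `entity_detector`: the helper name `route_personas`, the flat candidate-dict shape, and the key names read from `corpus_origin` (`result`, `likely_ai_dialogue`, `agent_persona_names`) are assumptions made for the example.

```python
def route_personas(candidates, corpus_origin=None):
    """Split candidates into people / agent_personas (hypothetical helper)."""
    # No corpus context: keep the legacy shape, with no agent_personas key.
    if not corpus_origin:
        return {"people": list(candidates)}

    result = corpus_origin.get("result", {})
    # Assumption: a non-AI-dialogue corpus also keeps the legacy shape.
    if not result.get("likely_ai_dialogue"):
        return {"people": list(candidates)}

    # Case-insensitive comparison against the detected persona names.
    persona_names = {name.casefold() for name in result.get("agent_persona_names", [])}

    people, personas = [], []
    for entity in candidates:
        if entity["name"].casefold() in persona_names:
            # Rewrite the per-entity type when the name is an agent persona.
            personas.append({**entity, "type": "agent_persona"})
        else:
            people.append(entity)
    return {"people": people, "agent_personas": personas}
```

Calling the helper without `corpus_origin` returns only the `people` key, which mirrors the backwards-compatibility promise that callers who omit the kwarg see no `agent_personas` key.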
@@ -13,12 +13,16 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 - **`mempalace init` now prompts to mine the same directory.** After entity confirmation, room detection, and gitignore guard, `init` shows a one-line scope estimate (e.g. `~423 files (~12 MB) would be mined into this palace.`) computed from its existing corpus walk, then asks `Mine this directory now? [Y/n]` (default yes) and runs `mine()` in-process if accepted. The estimate fires before the prompt so users on a real corpus aren't surprised by a minutes-long ChromaDB write. Declining prints the exact `mempalace mine <dir>` command for later. (#1181)
 - **New `--auto-mine` flag on `mempalace init`** for the non-interactive path (`mempalace init --auto-mine <dir>` skips the mine prompt and runs mine directly). `--yes` retains its existing scope of entity auto-accept only and still prompts for the mine step, so existing scripted callers see no behaviour change; combining `--yes --auto-mine` gives a fully non-interactive setup. (#1181)
 - **Cross-wing topic tunnels.** When two wings have confirmed `TOPIC` labels in common (the LLM-refine bucket from `mempalace init --llm`), the miner now drops a symmetric tunnel between them at mine time so the palace graph reflects shared themes (frameworks, vendors, recurring concepts). Tunnels are routed through the existing `create_tunnel` storage so they share dedup and persistence with explicit tunnels. Topic tunnels are stored under a synthetic `topic:<name>` room and tagged with `kind: "topic"` on the stored dict — this keeps them distinct from literal folder-derived rooms of the same name (a wing with both an `Angular` folder room and an `Angular` topic tunnel no longer collides at `follow_tunnels` read time) and gives LLMs scanning `list_tunnels` a visible discriminator. Threshold is configurable via `MEMPALACE_TOPIC_TUNNEL_MIN_COUNT` env var or `topic_tunnel_min_count` in `~/.mempalace/config.json` (default `1`). Manifest-dependency overlap and per-topic allow/deny lists remain out of scope. (#1180)
+- **Context-aware corpus detection at `mempalace init`.** A new Pass 0 runs at the start of `init` — before entity detection — and answers one question: *is this corpus an AI-dialogue record, and if so, which platform and what persona names has the user assigned to the agents?* Tier 1 is a free regex heuristic (well-known AI brand terms + turn-marker patterns, with a co-occurrence rule that suppresses ambiguous terms like `Claude`/`Gemini`/`Haiku` when no unambiguous AI signal is present, so French novels and astrology forums don't false-positive). Tier 2 is an LLM call (~$0.01 with Anthropic Haiku, free with local Ollama/LM Studio/llama.cpp/vLLM) that extracts `user_name` and `agent_persona_names` from dialogue structure. Result is persisted to `<palace>/.mempalace/origin.json` with a `schema_version: 1` envelope so downstream tools can read it. Entity classification then routes names matching `agent_persona_names` (case-insensitive) into a new `agent_personas` bucket instead of `people`, so a Claude Code transcript no longer misclassifies the user's `Echo`/`Sparrow`/`Cipher` agents as biological people. `llm_refine` receives the same context as a system-prompt preamble so it can disambiguate other ambiguous candidates with corpus-level knowledge too. Backwards compatible: callers that don't pass `corpus_origin` see the v3.3.3 return shape unchanged. (#TBD)
+- **`mempalace init` runs LLM-assisted refinement by default.** v3.3.3 made `--llm` opt-in; the LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications) so it now runs by default. Provider precedence is unchanged — Ollama at `http://localhost:11434` first, then openai-compat, then anthropic with API key. **Never blocks init on a missing LLM**: if no provider is reachable (Ollama not running, no API key set), init prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. `--no-llm` is the new explicit opt-out. The legacy `--llm` flag is preserved as a deprecated alias of the default so scripted callers see no behaviour change. Cost story: zero for users with a local LLM (the majority on this repo), ~$0.01 per init for users with `ANTHROPIC_API_KEY` set who explicitly choose `--llm-provider anthropic`, zero for users with no LLM (graceful fallback). (#TBD)
+- **`mempalace mine --redetect-origin` flag.** Re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`. Useful when the corpus has grown since `mempalace init` and the stored origin may be stale. Heuristic-only by design (the flag is meant to be cheap); re-run `mempalace init` for full Tier 2 LLM refinement. Default `mempalace mine` does not touch `origin.json` — the flag is opt-in. (#TBD)
 
 ### Bug Fixes
 
 - **CLI `mempalace search` retrieval quality.** The CLI was using pure ChromaDB cosine distance with no BM25 rerank, so drawers containing every query term but embedding as noise (directory listings, diff output, shell logs) scored `Match: 0.0` alongside genuinely irrelevant results with no way to tell them apart. Wired the CLI through the same `_hybrid_rank` the `mempalace_search` MCP tool already used, and surfaced both `cosine=` and `bm25=` scores in the output so users see which component of the match is firing. MCP search was unaffected; this fixes the human-facing CLI parity gap.
 - **Legacy-palace distance-metric warning.** CLI search now detects palaces created before `hnsw:space=cosine` was consistently set and prints a one-line notice pointing at `mempalace repair`. Without the warning such palaces silently used L2 distance, under which the similarity display floored every result to `Match: 0.0`. New palaces mined today already set cosine correctly and now have invariant tests pinning that behavior so future refactors can't silently regress it. (#1179)
 - **Graceful Ctrl-C during `mempalace mine`.** Interrupting a long mine no longer dumps a multi-frame `KeyboardInterrupt` traceback. The main file-processing loop now catches the signal, prints `files_processed: N/M`, `drawers_filed: K`, and `last_file:` so the user knows what landed, then exits with code 130 (standard SIGINT). Already-filed drawers are upserted idempotently on re-mine via deterministic IDs, so resuming is safe. The hooks PID lock at `~/.mempalace/hook_state/mine.pid` is now also actively cleaned up in a `finally` when its entry points at us — clean exit, error, or interrupt — preventing the next hook fire from briefly waiting on a stale PID. (#1182)
+- **`mempalace init` is now idempotent across re-runs.** Running `init` twice on the same project produced different `origin.json` results because the first run wrote `entities.json` into the project directory, and the second run's corpus-origin sampling included that file as corpus content — shifting Tier 1's character-density math. Sampling now skips the per-project artifacts (`entities.json`, `mempalace.yaml`), so re-running `init` produces the same classification it did the first time. Pinned by an integration test in `tests/test_corpus_origin_integration.py`. (#TBD)
 
 ---
 
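The co-occurrence rule named in the changelog entry above can be illustrated with a small sketch. The term lists, the turn-marker pattern, and the function name `count_ai_signals` are assumptions chosen for the example; the actual patterns in `mempalace/corpus_origin.py` are not shown in this diff and may differ.

```python
import re

# Illustrative patterns only (assumptions, not the module's real term lists).
UNAMBIGUOUS = re.compile(r"\b(?:ChatGPT|Anthropic|OpenAI)\b", re.IGNORECASE)
AMBIGUOUS = re.compile(r"\b(?:Claude|Gemini|Haiku)\b")
TURN_MARKER = re.compile(r"^(?:Human|Assistant):", re.MULTILINE)


def count_ai_signals(samples):
    """Count AI-dialogue signals, suppressing ambiguous terms on their own."""
    text = "\n".join(samples)
    strong = len(UNAMBIGUOUS.findall(text)) + len(TURN_MARKER.findall(text))
    ambiguous = len(AMBIGUOUS.findall(text))
    # Co-occurrence rule: ambiguous brand terms count only when at least one
    # unambiguous signal appears in the same corpus.
    return (strong + ambiguous) if strong else 0
```

Under these assumptions a French novel that mentions a character named Claude scores zero, while a transcript with `Human:` / `Assistant:` turn markers also gets credit for its ambiguous brand mentions.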
@@ -22,7 +22,7 @@ These are non-negotiable. Every PR, every feature, every refactor must honor the
 - **Verbatim always** — Never summarize, paraphrase, or lossy-compress user data. The system searches the index and returns the original words. If a user said it, we store exactly what they said. This is the foundational promise.
 - **Incremental only** — Append-only ingest after initial build. Never destroy existing data to rebuild. A crash mid-operation must leave the existing palace untouched.
 - **Entity-first** — Everything is keyed by real names with disambiguation by DOB, ID, or context. People matter more than topics.
-- **Local-first, zero API** — All extraction, chunking, and embedding happens on the user's machine. No cloud dependency for memory operations. No API keys required.
+- **Local-first, zero external API by default** — All extraction, chunking, embedding, and LLM-assisted refinement happens on the user's machine by default, using locally-hosted runtimes (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio, etc.). External providers (Anthropic, OpenAI, Google) are supported via BYOK but are never required and never enabled silently. The system never sends user content to a service the user has not explicitly configured. "Local LLM" is not an external API — Ollama and equivalents running on localhost are part of the user's machine. External BYOK is always a deliberate user choice, never a default and never a silent fallback.
 - **Performance budgets** — Hooks under 500ms. Startup injection under 100ms. Memory should feel instant.
 - **Privacy by architecture** — The system physically cannot send your data because it never leaves your machine. No telemetry, no phone-home, no external service dependencies for core operations.
 - **Background everything** — Filing, indexing, timestamps, and pipeline work happen via hooks in the background. Nothing interrupts the user's conversation. Zero tokens spent on bookkeeping in the chat window.
+206
-23
@@ -34,11 +34,143 @@ import argparse
 from pathlib import Path
 
 from .config import MempalaceConfig
+from .corpus_origin import detect_origin_heuristic, detect_origin_llm
+from .llm_client import LLMError, get_provider
 from .version import __version__
 
 
 _MEMPALACE_PROJECT_FILES = ("mempalace.yaml", "entities.json")
 
+# Pass 0 corpus-origin sampling caps. Tier 1 reads FULL file content (no
+# front-bias sampling) but bounds total memory on enormous corpora. Tier 2
+# trims to a smaller view because LLM context windows are finite.
+_PASS_ZERO_MAX_FILES = 30
+_PASS_ZERO_PER_FILE_CAP = 100_000  # 100KB per file is generous for prose
+_PASS_ZERO_TOTAL_CAP = 5_000_000  # 5MB total ceiling — bounds memory
+_PASS_ZERO_LLM_PER_SAMPLE = 2_000  # for Tier 2 LLM call only
+_PASS_ZERO_LLM_MAX_SAMPLES = 20  # caps the LLM-tier sample count
+
+
+def _gather_origin_samples(project_dir) -> list:
+    """Collect Tier-1 samples for corpus-origin detection.
+
+    Reads FULL file content (capped at ``_PASS_ZERO_PER_FILE_CAP`` per file
+    and ``_PASS_ZERO_TOTAL_CAP`` overall). No front-bias sampling — AI
+    signal that lives past the first N chars of a file must still trip
+    detection, so we read the whole file up to the cap.
+
+    Skips mempalace's own per-project artifacts (``entities.json``,
+    ``mempalace.yaml``) so a re-run of ``mempalace init`` produces the
+    same classification result it did on the first run. Without this
+    filter, the first run writes entities.json into the corpus, the
+    second run picks it up as a sample, and the Tier-1 density math
+    drifts (different total_chars). That makes init non-idempotent.
+
+    Returns a list of strings (one per readable file). Empty list when
+    the project has no readable text.
+    """
+    from .entity_detector import scan_for_detection
+
+    files = scan_for_detection(project_dir, max_files=_PASS_ZERO_MAX_FILES)
+    samples: list = []
+    total_chars = 0
+    for filepath in files:
+        if filepath.name in _MEMPALACE_PROJECT_FILES:
+            continue
+        if total_chars >= _PASS_ZERO_TOTAL_CAP:
+            break
+        try:
+            with open(filepath, encoding="utf-8", errors="replace") as f:
+                content = f.read(_PASS_ZERO_PER_FILE_CAP)
+        except OSError:
+            continue
+        if not content:
+            continue
+        samples.append(content)
+        total_chars += len(content)
+    return samples
+
+
+def _trim_samples_for_llm(samples: list) -> list:
+    """Reduce Tier-1 full-content samples to LLM-friendly size.
+
+    Tier 2 hits an LLM with a finite context window — we trim each sample
+    to ``_PASS_ZERO_LLM_PER_SAMPLE`` chars and cap the overall sample
+    count at ``_PASS_ZERO_LLM_MAX_SAMPLES``.
+    """
+    return [s[:_PASS_ZERO_LLM_PER_SAMPLE] for s in samples[:_PASS_ZERO_LLM_MAX_SAMPLES]]
+
+
+def _run_pass_zero(project_dir, palace_dir, llm_provider) -> dict:
+    """Pass 0: detect whether the corpus is AI-dialogue and persist the
+    result to ``<palace>/.mempalace/origin.json``.
+
+    Returns the wrapped result dict (same shape as origin.json) on success,
+    or ``None`` when there are no readable samples to detect from. The
+    return value is what cmd_init forwards to ``discover_entities`` via
+    the ``corpus_origin`` kwarg.
+
+    File-write failures (e.g. read-only palace) are caught and reported on
+    stderr; init never blocks on them.
+    """
+    import json
+    from datetime import datetime, timezone
+    from pathlib import Path
+
+    samples = _gather_origin_samples(project_dir)
+    if not samples:
+        print(" Skipping corpus-origin detection — no readable samples.")
+        return None
+
+    # Tier 1 — always runs. Cheap regex grep, no API.
+    result = detect_origin_heuristic(samples)
+
+    # Tier 2 — runs only when an LLM provider is available. The provider
+    # contract is best-effort: corpus_origin internally falls back to a
+    # conservative default on transport/parse failure, so we don't need a
+    # try/except here, but we still keep one for any unforeseen exception.
+    if llm_provider is not None:
+        try:
+            llm_result = detect_origin_llm(_trim_samples_for_llm(samples), llm_provider)
+            # LLM-tier result wins on platform/persona/user fields; keep the
+            # heuristic evidence appended so the on-disk record retains the
+            # cheap-tier signal trail.
+            llm_result.evidence = list(llm_result.evidence) + [
+                f"Tier-1 heuristic: {e}" for e in result.evidence
+            ]
+            result = llm_result
+        except Exception as exc:  # noqa: BLE001 — never block init on LLM failure
+            print(f" LLM corpus-origin tier failed ({exc}); using heuristic only.")
+
+    wrapped = {
+        "schema_version": 1,
+        "detected_at": datetime.now(timezone.utc).isoformat(),
+        "result": result.to_dict(),
+    }
+
+    origin_path = Path(palace_dir).expanduser() / ".mempalace" / "origin.json"
+    try:
+        origin_path.parent.mkdir(parents=True, exist_ok=True)
+        with open(origin_path, "w", encoding="utf-8") as f:
+            json.dump(wrapped, f, indent=2, ensure_ascii=False)
+    except OSError as exc:
+        print(f" Could not write {origin_path}: {exc}", file=sys.stderr)
+        # Return the wrapped dict anyway so the in-memory pipeline still
+        # benefits from the detection result this run.
+        return wrapped
+
+    # Banner — one line, two-space indent matching existing init style.
+    res = result
+    if res.likely_ai_dialogue:
+        platform = res.primary_platform or "AI dialogue (platform unidentified)"
+        user = res.user_name or "—"
+        agents = ", ".join(res.agent_persona_names) if res.agent_persona_names else "—"
+        print(f" Detected: {platform} (user: {user}, agents: {agents})")
+    else:
+        print(f" Corpus origin: not AI-dialogue (confidence: {res.confidence:.2f})")
+
+    return wrapped
+
 
 def _ensure_mempalace_files_gitignored(project_dir) -> bool:
     """If project_dir is a git repo, ensure MemPalace's per-project files
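For downstream tools that want to consume the file written by `_run_pass_zero`, a reader might look like the sketch below. The envelope keys (`schema_version`, `detected_at`, `result`) come from the hunk above; the helper name `load_origin` is hypothetical, and the contents of `result` depend on `to_dict()`, which this diff does not show.

```python
import json
from pathlib import Path


def load_origin(palace_dir):
    """Return the stored detection result, or None if absent or unsupported."""
    origin_path = Path(palace_dir).expanduser() / ".mempalace" / "origin.json"
    try:
        with open(origin_path, encoding="utf-8") as f:
            wrapped = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Missing, unreadable, or corrupt file: treat as "no stored origin".
        return None
    # Only trust envelopes whose schema version this reader understands.
    if not isinstance(wrapped, dict) or wrapped.get("schema_version") != 1:
        return None
    return wrapped.get("result")
```

Checking `schema_version` before touching `result` is the point of the envelope: a reader written against version 1 can decline a future layout instead of misreading it.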
@@ -86,29 +218,46 @@ def cmd_init(args):
     languages = cfg.entity_languages
     languages_tuple = tuple(languages)
 
-    # Optional phase-2 LLM provider (opt-in via --llm).
+    # --llm is ON by default. --no-llm is the explicit opt-out. Provider
+    # precedence is unchanged (Ollama localhost first, then openai-compat,
+    # then anthropic). Never block init on a missing LLM: when no provider
+    # responds, print a one-line message pointing at --no-llm and fall
+    # through to heuristics-only.
     llm_provider = None
-    if getattr(args, "llm", False):
-        from .llm_client import LLMError, get_provider
+    if not getattr(args, "no_llm", False):
+        provider_name = getattr(args, "llm_provider", "ollama") or "ollama"
+        provider_model = getattr(args, "llm_model", "gemma4:e4b") or "gemma4:e4b"
         try:
-            llm_provider = get_provider(
-                name=args.llm_provider,
-                model=args.llm_model,
-                endpoint=args.llm_endpoint,
-                api_key=args.llm_api_key,
+            candidate = get_provider(
+                name=provider_name,
+                model=provider_model,
+                endpoint=getattr(args, "llm_endpoint", None),
+                api_key=getattr(args, "llm_api_key", None),
             )
+            ok, msg = candidate.check_available()
+            if ok:
+                llm_provider = candidate
+                print(f"  LLM enabled: {provider_name}/{provider_model}")
+            else:
+                print(
+                    f"  No LLM provider reachable ({msg}). "
+                    f"Running heuristics-only — pass --no-llm to silence this."
+                )
         except LLMError as e:
-            print(f"  ERROR: {e}", file=sys.stderr)
-            sys.exit(2)
-        ok, msg = llm_provider.check_available()
-        if not ok:
             print(
-                f"  ERROR: LLM provider '{args.llm_provider}' unavailable: {msg}",
-                file=sys.stderr,
+                f"  LLM init failed ({e}). "
+                f"Running heuristics-only — pass --no-llm to silence this."
             )
-            sys.exit(2)
-        print(f"  LLM refinement enabled: {args.llm_provider}/{args.llm_model}")
+    # Pass 0: detect whether the corpus is AI-dialogue. Writes
+    # <palace>/.mempalace/origin.json and supplies corpus context to the
+    # entity classifier so it can correctly handle agent persona names
+    # (e.g. "Echo", "Sparrow") without misclassifying them as people.
+    corpus_origin = _run_pass_zero(
+        project_dir=args.dir,
+        palace_dir=cfg.palace_path,
+        llm_provider=llm_provider,
+    )
 
     # Pass 1: discover entities — manifests + git authors first, prose detection
     # as supplement for names mentioned only in docs/notes. Optional phase-2
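The fallback flow in the hunk above can be exercised in isolation. What follows is a minimal sketch under stated assumptions, not the PR's code: `acquire_provider`, `FakeProvider`, and `ProviderError` are hypothetical stand-ins for the `get_provider` call, a real provider object, and `LLMError`. The only contract taken from the diff is that `check_available()` returns an `(ok, msg)` pair and that neither failure mode exits the process.

```python
from typing import Callable, Optional, Tuple


class ProviderError(Exception):
    """Hypothetical stand-in for mempalace.llm_client.LLMError."""


class FakeProvider:
    """Hypothetical provider exposing the check_available() contract."""

    def __init__(self, reachable: bool, reason: str = "") -> None:
        self.reachable = reachable
        self.reason = reason

    def check_available(self) -> Tuple[bool, str]:
        return self.reachable, self.reason


def acquire_provider(
    factory: Callable[[], FakeProvider], no_llm: bool = False
) -> Tuple[Optional[FakeProvider], str]:
    """Return (provider_or_None, status message); never raise, never exit."""
    if no_llm:
        # Explicit opt-out: do not even construct a provider.
        return None, "LLM disabled by --no-llm"
    try:
        candidate = factory()
        ok, msg = candidate.check_available()
    except ProviderError as exc:
        return None, f"LLM init failed ({exc}). Running heuristics-only."
    if ok:
        return candidate, "LLM enabled"
    return None, f"No LLM provider reachable ({msg}). Running heuristics-only."
```

In this sketch the caller treats a `None` provider as "heuristics only", which mirrors how `llm_provider` stays `None` in `cmd_init` when nothing is reachable.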
@@ -116,7 +265,12 @@ def cmd_init(args):
     print(f"\n  Scanning for entities in: {args.dir}")
     if languages_tuple != ("en",):
         print(f"  Languages: {', '.join(languages_tuple)}")
-    detected = discover_entities(args.dir, languages=languages_tuple, llm_provider=llm_provider)
+    detected = discover_entities(
+        args.dir,
+        languages=languages_tuple,
+        llm_provider=llm_provider,
+        corpus_origin=corpus_origin,
+    )
     total = (
         len(detected["people"])
         + len(detected["projects"])
@@ -264,6 +418,16 @@ def cmd_mine(args):
     for raw in args.include_ignored or []:
         include_ignored.extend(part.strip() for part in raw.split(",") if part.strip())
 
+    # --redetect-origin re-runs corpus_origin on the current corpus state
+    # and overwrites <palace>/.mempalace/origin.json before mining proceeds.
+    # Heuristic-only by design — full LLM detection lives on `mempalace init`.
+    if getattr(args, "redetect_origin", False):
+        _run_pass_zero(
+            project_dir=args.dir,
+            palace_dir=palace_path,
+            llm_provider=None,
+        )
+
     if args.mode == "convos":
         from .convo_miner import mine_convos
 
@@ -728,17 +892,25 @@ def main():
         "--llm",
         action="store_true",
         help=(
-            "Enable LLM-assisted entity refinement (opt-in, local-first). "
-            "Runs after manifest/git/regex detection, asking the configured "
-            "provider to reclassify ambiguous candidates. "
-            "Ctrl-C during refinement returns partial results."
+            "DEPRECATED — LLM-assisted entity refinement is now ON by default. "
+            "This flag is preserved for backward compatibility; pass --no-llm "
+            "to opt out instead."
+        ),
+    )
+    p_init.add_argument(
+        "--no-llm",
+        action="store_true",
+        help=(
+            "Disable LLM-assisted entity refinement. Run init in heuristics-only "
+            "mode (no provider acquisition, no LLM calls). Use when running "
+            "without a local LLM and you don't want the graceful-fallback message."
         ),
     )
     p_init.add_argument(
         "--llm-provider",
         default="ollama",
         choices=["ollama", "openai-compat", "anthropic"],
-        help="LLM provider (default: ollama). Use --llm to enable.",
+        help="LLM provider (default: ollama). Pass --no-llm to disable LLM-assisted refinement entirely.",
     )
     p_init.add_argument(
         "--llm-model",
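The flag semantics in this hunk (a deprecated `--llm` that still parses, plus an explicit `--no-llm` opt-out) can be checked with a reduced parser. This is a sketch only: `build_init_parser` and `llm_requested` are hypothetical names, and the real `mempalace init` parser carries many more arguments.

```python
import argparse


def build_init_parser() -> argparse.ArgumentParser:
    """Hypothetical reduced parser holding only the two LLM toggles."""
    parser = argparse.ArgumentParser(prog="mempalace init")
    # Deprecated no-op, kept so existing scripts that pass --llm still parse.
    parser.add_argument("--llm", action="store_true", help="DEPRECATED: on by default.")
    parser.add_argument("--no-llm", action="store_true", help="Heuristics-only mode.")
    return parser


def llm_requested(argv: list) -> bool:
    """Mirror the `not getattr(args, "no_llm", False)` check from cmd_init."""
    args = build_init_parser().parse_args(argv)
    return not getattr(args, "no_llm", False)
```

Under this reading, `--llm` no longer influences the decision at all; only the presence of `--no-llm` turns the LLM path off, even when both flags are passed.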
@@ -789,6 +961,17 @@ def main():
         help="Your name — recorded on every drawer (default: mempalace)",
     )
     p_mine.add_argument("--limit", type=int, default=0, help="Max files to process (0 = all)")
+    p_mine.add_argument(
+        "--redetect-origin",
+        action="store_true",
+        help=(
+            "Re-run corpus_origin detection on this directory and overwrite "
+            "<palace>/.mempalace/origin.json. Useful when the corpus has grown "
+            "since `mempalace init` and the stored origin may be stale. "
+            "Heuristic-only (no LLM call) — re-run `mempalace init --llm` for "
+            "Tier 2 refinement."
+        ),
+    )
     p_mine.add_argument(
         "--dry-run", action="store_true", help="Show what would be filed without filing"
     )
@@ -0,0 +1,422 @@
+"""
+corpus_origin.py — Detect whether a corpus is an AI-dialogue record and,
+if so, what platform and what persona names the user has assigned to the
+agent.
+
+This is the first question any downstream Pass 2 classification needs
+answered. Without it, a drawer like "my three sons" in a Claude Code
+dialogue corpus can't be correctly resolved to "three AI instances"
+rather than "three biological children."
+
+Two-tier detection:
+
+  Tier 1 — detect_origin_heuristic(samples)
+    Cheap, no API. Grep for well-known AI brand terms + turn
+    markers. Always runs. Outputs a hypothesis.
+
+  Tier 2 — detect_origin_llm(samples, provider)
+    Uses an LLMProvider (typically Haiku via mempalace.llm_client)
+    with the model's pre-trained knowledge of Claude/ChatGPT/Gemini
+    etc. Confirms platform, extracts agent persona-names the user
+    has assigned. One call, ~$0.01 cost.
+
+Design principle:
+  Don't make the classifier re-discover what Claude, ChatGPT, Gemini, MCP,
+  or other well-known entities ARE — the LLM already knows them from its
+  training. Only corpus-specific entities (e.g. the user's persona-name
+  for their Claude instance) need discovery.
+
+Default stance (when evidence is thin):
+  "This IS an AI-dialogue corpus" — false-negative is catastrophic for
+  downstream classification; false-positive is recoverable via per-drawer
+  voice-profile detection in later passes.
+"""
+
+from __future__ import annotations
+
+import json
+import re
+from dataclasses import dataclass, field, asdict
+from typing import Optional
+
+
+# ── Well-known AI brand terms (expand as new platforms emerge) ────────────
+# Detection is by PATTERN + CONTEXT, not by capitalization or English-language
+# rules. Two categories:
+#
+#   UNAMBIGUOUS — terms that have essentially no meaning outside of AI context.
+#     Always counted toward AI-dialogue evidence.
+#
+#   AMBIGUOUS — terms that share a string with common English words, names,
+#     poetry forms, zodiac signs, animals, etc. Counted toward AI-dialogue
+#     evidence ONLY when at least one unambiguous AI signal also appears in
+#     the corpus (turn marker, unambiguous brand term, or AI infrastructure
+#     term). This avoids false-positives on French novels with characters
+#     named "Claude", astrology corpora discussing "Gemini", poetry corpora
+#     full of "haiku" / "sonnet", etc.
+#
+# All matching is CASE-INSENSITIVE — users type lowercase constantly.
+
+_AI_UNAMBIGUOUS_TERMS = [
+    # Anthropic-specific
+    "Anthropic",
+    "Claude Code",
+    "Claude 3",
+    "Claude 4",
+    "claude mcp",
+    "CLAUDE.md",
+    ".claude/",
+    # OpenAI-specific
+    "ChatGPT",
+    "GPT-4",
+    "GPT-3",
+    "GPT-5",
+    "OpenAI",
+    "gpt-4o",
+    "gpt-4-turbo",
+    "o1-preview",
+    "o3",
+    # Google-specific
+    "gemini-pro",
+    "gemini-1.5",
+    "Google AI",
+    # Meta / others (specific model identifiers, not bare common words)
+    "Mixtral",
+    "Cohere",
+    # AI-infrastructure terms with no common-English collision
+    "MCP",
+    "LLM",
+    "RAG",
+    "fine-tune",
+    "context window",
+    "embedding",
+]
+
+_AI_AMBIGUOUS_TERMS = [
+    # Anthropic — bare brand/model names that collide with names + poetry
+    "Claude",  # also a common French masculine name
+    "Opus",  # also a musical work, comic strip, magazine
+    "Sonnet",  # also a 14-line poem form
+    "Haiku",  # also a 17-syllable poem form
+    # Google — bare brand that collides with zodiac sign
+    "Gemini",  # also the zodiac sign
+    "Bard",  # also a poet / Shakespeare
+    # Meta / others
+    "Llama",  # also the South American animal
+    "Mistral",  # also a Mediterranean wind
+    # Note: 'prompt', 'completion', 'tokens' previously lived here but were
+    # removed: they're suppressed without an unambiguous co-signal anyway,
+    # and by the time a co-signal is present the corpus is already flagged.
+    # Keeping them just produced noisier evidence strings.
+]
+
+# Turn-marker patterns commonly seen in AI-dialogue transcripts
+_TURN_MARKERS = [
+    r"\buser\s*:\s*",
+    r"\bassistant\s*:\s*",
+    r"\bhuman\s*:\s*",
+    r"\bai\s*:\s*",
+    # No leading \b here: '>' is not a word character, so \b>>> would only
+    # match directly after a word character and never at the start of a line.
+    r">>>\s*User\b",
+    r">>>\s*Assistant\b",
+]
+
+
+def _brand_pattern(term: str) -> str:
+    """Build a regex for a brand term that uses word boundaries
+    only on edges where the term itself starts/ends with a word
+    character. Without this nuance:
+      - 'Claude' would falsely match inside 'Claudette' (no \\b)
+      - '.claude/' would fail to match at start of string (\\b
+        before non-word char requires preceding word char)
+    So we only attach \\b where it actually makes sense."""
+    escaped = re.escape(term)
+    prefix = r"\b" if term[0].isalnum() or term[0] == "_" else ""
+    suffix = r"\b" if term[-1].isalnum() or term[-1] == "_" else ""
+    return prefix + escaped + suffix
+
+
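The two cases named in the `_brand_pattern` docstring can be checked directly. The snippet below copies the boundary logic into a standalone `brand_pattern` function and adds a small `hits` helper; both names are illustrative and exist only for this example.

```python
import re


def brand_pattern(term: str) -> str:
    """Standalone copy of the edge-aware boundary logic shown above."""
    escaped = re.escape(term)
    prefix = r"\b" if term[0].isalnum() or term[0] == "_" else ""
    suffix = r"\b" if term[-1].isalnum() or term[-1] == "_" else ""
    return prefix + escaped + suffix


def hits(term: str, text: str) -> int:
    """Count case-insensitive matches of a brand term in text."""
    return len(re.findall(brand_pattern(term), text, re.IGNORECASE))


# 'Claudette' is skipped because no boundary follows the embedded 'Claude'.
print(hits("Claude", "Claudette met claude."))      # 1
# A dot-led term matches at the very start of the string.
print(hits(".claude/", ".claude/settings.json"))    # 1
# Wrapping the same term in \b on both sides finds nothing at position 0.
naive = r"\b" + re.escape(".claude/") + r"\b"
print(re.search(naive, ".claude/settings.json"))    # None
```

One consequence worth noting: `GPT-4` does not match inside `gpt-4o`, because the trailing `\b` fails between `4` and `o`. That appears to be why `gpt-4o` is listed as its own term.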
+@dataclass
+class CorpusOriginResult:
+    """Structured output from corpus-origin detection.
+
+    Fields:
+      likely_ai_dialogue — best hypothesis about whether this is AI-dialogue
+      confidence — 0.0 to 1.0
+      primary_platform — e.g. "Claude Code (Anthropic CLI)" or None
+      user_name — the corpus author's name if identifiable from context, else None
+      agent_persona_names — names the user has assigned to the AI agent(s)
+        (e.g. ["Echo", "Sparrow"]). Does NOT include the user's own name.
+      evidence — human-readable reasons for the classification
+    """
+
+    likely_ai_dialogue: bool
+    confidence: float
+    primary_platform: Optional[str]
+    user_name: Optional[str] = None
+    agent_persona_names: list[str] = field(default_factory=list)
+    evidence: list[str] = field(default_factory=list)
+
+    def to_dict(self) -> dict:
+        return asdict(self)
+
+
+# ── Tier 1: cheap heuristic ───────────────────────────────────────────────
+
+
+def detect_origin_heuristic(samples: list[str]) -> CorpusOriginResult:
+    """Fast grep-based detection. No API calls.
+
+    Scores AI-dialogue likelihood by counting:
+      - occurrences of well-known AI brand terms
+      - turn-marker patterns (user:, assistant:, etc.)
+
+    Returns a CorpusOriginResult with confidence derived from signal density.
+    """
+    combined = "\n\n".join(samples)
+    total_chars = max(1, len(combined))
+
+    # Count UNAMBIGUOUS brand-term hits (case-insensitive — users type
+    # lowercase constantly, so 'chatgpt' must trip the same as 'ChatGPT').
+    # Word boundaries prevent false in-word matches (see _brand_pattern).
+    unambiguous_hits: dict[str, int] = {}
+    total_unambiguous = 0
+    for term in _AI_UNAMBIGUOUS_TERMS:
+        matches = re.findall(_brand_pattern(term), combined, re.IGNORECASE)
+        if matches:
+            unambiguous_hits[term] = len(matches)
+            total_unambiguous += len(matches)
+
+    # Count AMBIGUOUS brand-term hits separately. These will only be
+    # counted toward AI-dialogue evidence if the corpus also contains
+    # at least one unambiguous AI signal — see co-occurrence rule below.
+    ambiguous_hits: dict[str, int] = {}
+    total_ambiguous = 0
+    for term in _AI_AMBIGUOUS_TERMS:
+        matches = re.findall(_brand_pattern(term), combined, re.IGNORECASE)
+        if matches:
+            ambiguous_hits[term] = len(matches)
+            total_ambiguous += len(matches)
+
+    # Count turn-marker hits (case-insensitive — transcripts vary).
+    turn_hits = 0
+    turn_types_found = set()
+    for pattern in _TURN_MARKERS:
+        matches = re.findall(pattern, combined, re.IGNORECASE)
+        if matches:
+            turn_hits += len(matches)
+            turn_types_found.add(pattern)
+
+    # Co-occurrence rule for ambiguous terms.
+    # Ambiguous terms (e.g. 'Claude' as a French name, 'Gemini' as a zodiac
+    # sign, 'Haiku' as a poem form) only count toward brand evidence if
+    # the corpus also contains at least one unambiguous AI signal. Otherwise
+    # we'd false-positive on French novels, astrology forums, poetry corpora,
+    # llama-rancher journals, etc.
+    has_ai_context = total_unambiguous > 0 or turn_hits > 0
+    counted_brand_hits = total_unambiguous + (total_ambiguous if has_ai_context else 0)
+
+    # Brand-term density per 1000 chars; turn-marker density likewise.
+    # Tuned on a small set of examples; these aren't magic numbers and
+    # can be revisited as we see more corpora.
+    brand_density = counted_brand_hits / (total_chars / 1000)
+    turn_density = turn_hits / (total_chars / 1000)
+
+    # Build evidence list
+    evidence: list[str] = []
+    shown_hits = dict(unambiguous_hits)
+    if has_ai_context:
+        shown_hits.update(ambiguous_hits)
+    if shown_hits:
+        top_terms = sorted(shown_hits.items(), key=lambda x: -x[1])[:5]
+        evidence.append("AI brand terms: " + ", ".join(f"'{k}' ({v}x)" for k, v in top_terms))
+    elif ambiguous_hits and not has_ai_context:
+        # Be transparent that we saw ambiguous matches but suppressed them
+        # for lack of co-occurring AI context.
+        suppressed = sorted(ambiguous_hits.items(), key=lambda x: -x[1])[:3]
+        evidence.append(
+            "Ambiguous terms present but suppressed (no co-occurring AI signal): "
+            + ", ".join(f"'{k}' ({v}x)" for k, v in suppressed)
+        )
+    if turn_hits:
+        evidence.append(
+            f"Turn markers detected: {turn_hits} occurrences across {len(turn_types_found)} pattern types"
+        )
+
+    # Decision logic:
+    #   strong signal (brand density OR turn density >= its threshold) → confident AI-dialogue
+    #   MEANINGFUL absence (enough text, zero brand, zero turn) → confident narrative
+    #   ambiguous or insufficient text → default stance: AI-dialogue with low confidence
+    #
+    # Threshold for "meaningful absence": the samples collectively have to
+    # be long enough that the absence of AI signals would be expected to
+    # surface if the corpus really is narrative. 150 chars is the working
+    # floor — below that, we cannot confidently say "this is narrative."
+    MEANINGFUL_TEXT_FLOOR = 150
+
+    if brand_density >= 0.5 or turn_density >= 2.0:
+        return CorpusOriginResult(
+            likely_ai_dialogue=True,
+            confidence=min(0.95, 0.6 + 0.1 * (brand_density + turn_density)),
+            primary_platform=None,  # tier 2 will refine
+            evidence=evidence,
+        )
+    if counted_brand_hits == 0 and turn_hits == 0 and total_chars >= MEANINGFUL_TEXT_FLOOR:
+        # Note: ambiguous-only matches (e.g. a French novel with 'Claude' as
+        # a character name) flow through here because counted_brand_hits == 0
+        # when no unambiguous AI signal co-occurs. The 'evidence' list still
+        # records that the ambiguous matches were seen and suppressed.
+        narrative_evidence = list(evidence) + [
+            f"no unambiguous AI signal across {total_chars} chars of text — pure narrative"
+        ]
+        return CorpusOriginResult(
+            likely_ai_dialogue=False,
+            confidence=0.9,
+            primary_platform=None,
+            evidence=narrative_evidence,
+        )
+    # Ambiguous or too-short-to-tell case: default stance is AI-dialogue
+    # with explicit low confidence. Tier 2 (LLM) should be called to confirm.
+    reason = "weak signal" if (counted_brand_hits or turn_hits) else "insufficient text"
+    return CorpusOriginResult(
+        likely_ai_dialogue=True,
+        confidence=0.4,
+        primary_platform=None,
+        evidence=evidence
+        + [
+            f"{reason} — applying default-stance (ai_dialogue=True, low confidence). "
+            "Tier 2 LLM check recommended to confirm or override."
+        ],
+    )
+
+
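The arithmetic behind the three outcomes is easier to follow with the counting stripped away. The sketch below reuses the thresholds from `detect_origin_heuristic` (0.5 brand hits and 2.0 turn markers per 1000 characters, a 150-character floor) but takes pre-computed counts as input; `counted_hits` and `decide` are hypothetical helper names, not functions in the module.

```python
from typing import Tuple


def counted_hits(unambiguous: int, ambiguous: int, turn_hits: int) -> int:
    """Co-occurrence rule: ambiguous hits count only alongside another AI signal."""
    has_ai_context = unambiguous > 0 or turn_hits > 0
    return unambiguous + (ambiguous if has_ai_context else 0)


def decide(brand_hits: int, turn_hits: int, total_chars: int) -> Tuple[bool, float]:
    """Return (likely_ai_dialogue, confidence) using the thresholds shown above."""
    total_chars = max(1, total_chars)
    brand_density = brand_hits / (total_chars / 1000)
    turn_density = turn_hits / (total_chars / 1000)
    if brand_density >= 0.5 or turn_density >= 2.0:
        return True, min(0.95, 0.6 + 0.1 * (brand_density + turn_density))
    if brand_hits == 0 and turn_hits == 0 and total_chars >= 150:
        return False, 0.9
    return True, 0.4  # default stance: weak signal or too little text


# A French novel: seven 'Claude' mentions, nothing else -> suppressed entirely.
print(counted_hits(0, 7, 0))   # 0
print(decide(0, 0, 5000))      # (False, 0.9)
# One 'ChatGPT' mention unlocks the seven ambiguous hits as well.
print(counted_hits(1, 7, 0))   # 8
```

Two brand hits in 4000 characters sit exactly on the 0.5 threshold and give a confidence of about 0.65, while a single hit in the same text falls through to the 0.4 default stance.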
+# ── Tier 2: LLM-assisted confirmation + persona extraction ────────────────
+
+
+_SYSTEM_PROMPT = """You are analyzing a corpus of text to determine whether it is a \
+record of conversations with an AI agent (e.g. Claude, ChatGPT, Gemini, custom LLM \
+apps), or some other kind of text (personal narrative, story, research notes, \
+journal, code, etc.).
+
+Use your pre-existing knowledge of well-known AI platforms. You don't need the \
+corpus to explain what Claude or ChatGPT is — you already know. Your job is to \
+detect evidence of their presence and identify what persona-names the user has \
+assigned to the agent(s) they converse with.
+
+CRITICAL distinction:
+- agent_persona_names are names the USER has assigned to the AI AGENT(S)
+  they converse with. Example: "Echo", "Sparrow", "Henry" might be names
+  the user calls a Claude instance they're building a relationship with.
+- Do NOT include the USER's own name in agent_persona_names. The user
+  is the human author of the corpus, not a persona of the agent. Even
+  if the user's name appears frequently in the text (writing about
+  themselves), that is NOT an agent persona.
+- If you can identify the user's name from context, put it in user_name
+  (separate field). If unclear, leave user_name null.
+
+Respond with JSON only (no prose before or after):
+{
+  "is_ai_dialogue_corpus": <true|false>,
+  "confidence": <0.0 to 1.0>,
+  "primary_platform": <"Claude (Anthropic)" | "ChatGPT (OpenAI)" | "Gemini (Google)" | other platform name | null>,
+  "user_name": <user's name if clearly identifiable from context, else null>,
+  "agent_persona_names": [<names the user has assigned to the AI AGENT(S), NOT the user's own name>],
+  "evidence": [<short bullet strings explaining the decision>]
+}
+
+Default stance: if evidence is thin or mixed, return is_ai_dialogue_corpus=true \
+with low confidence. False-negatives on AI-dialogue detection break downstream \
+classification; false-positives are recoverable later.
+"""
+
+
+def _extract_json(text: str) -> Optional[dict]:
+    """Pull the first JSON object out of a possibly-messy LLM response."""
+    text = text.strip()
+    if not text:
+        return None
+    # Try straight parse first
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        pass
+    # Try to find a {...} block
+    start = text.find("{")
+    if start < 0:
+        return None
+    depth = 0
+    in_string = False
+    escape = False
+    for i in range(start, len(text)):
+        ch = text[i]
+        if in_string:
+            if escape:
+                escape = False
+            elif ch == "\\":
+                escape = True
+            elif ch == '"':
+                in_string = False
+            continue
+        if ch == '"':
+            in_string = True
+        elif ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                candidate = text[start : i + 1]
+                try:
+                    return json.loads(candidate)
+                except json.JSONDecodeError:
+                    return None
+    return None
+
+
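For comparison, the same "first JSON object in a noisy reply" task can be handed to the standard library's `json.JSONDecoder.raw_decode`, which parses one value starting at a given index and ignores whatever follows. This is an alternative technique, not what `_extract_json` does: the hand-rolled scanner above gives up after the first balanced block fails to parse, whereas this sketch (with the hypothetical name `first_json_object`) moves on to later `{` positions.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode stops at the end of the first complete value.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None


reply = 'Sure! Here is the result: {"a": {"b": "}"}} Hope that helps.'
print(first_json_object(reply))  # {'a': {'b': '}'}}
```

Braces inside string values are handled by the decoder itself, so no separate `in_string` / `escape` bookkeeping is needed in this variant.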
+def detect_origin_llm(samples: list[str], provider) -> CorpusOriginResult:
+    """LLM-assisted detection. Takes samples (list of drawer-text excerpts)
+    and an LLMProvider (mempalace.llm_client.LLMProvider). Returns the
+    same CorpusOriginResult shape as the heuristic.
+
+    Falls back conservatively (default-stance ai=True, low confidence)
+    on any LLM error or malformed response — never raises.
+    """
+    # Build the user prompt: concise excerpts, capped so we stay cheap
+    max_excerpt_chars = 800
+    excerpts = "\n\n---\n\n".join(
+        f"[sample {i + 1}]\n{s[:max_excerpt_chars]}" for i, s in enumerate(samples[:20])
+    )
+    user_prompt = f"CORPUS EXCERPTS:\n\n{excerpts}\n\nAnalyze and respond with JSON."
+
+    try:
+        resp = provider.classify(system=_SYSTEM_PROMPT, user=user_prompt, json_mode=True)
+        raw = getattr(resp, "text", "") or ""
+    except Exception as e:
+        return CorpusOriginResult(
+            likely_ai_dialogue=True,
+            confidence=0.3,
+            primary_platform=None,
+            evidence=[f"LLM provider error (fallback to default stance): {e}"],
+        )
+
+    parsed = _extract_json(raw)
+    if not parsed or not isinstance(parsed, dict):
+        return CorpusOriginResult(
+            likely_ai_dialogue=True,
+            confidence=0.3,
+            primary_platform=None,
+            evidence=["LLM response was not valid JSON (fallback to default stance)"],
+        )
+
+    # Pull fields defensively. If the LLM leaked the user_name into
+    # agent_persona_names despite the prompt telling it not to, filter it out.
+    user_name = parsed.get("user_name") or None
+    personas = list(parsed.get("agent_persona_names") or [])
+    if user_name:
+        personas = [p for p in personas if p.lower() != user_name.lower()]
+    return CorpusOriginResult(
+        likely_ai_dialogue=bool(parsed.get("is_ai_dialogue_corpus", True)),
+        confidence=float(parsed.get("confidence", 0.5)),
+        primary_platform=parsed.get("primary_platform") or None,
+        user_name=user_name,
+        agent_persona_names=personas,
+        evidence=list(parsed.get("evidence") or []),
+    )
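The docstring of `detect_origin_llm` promises that it never raises, yet the final field extraction trusts the model's types: `float(...)` fails on a value such as `"high"` or an explicit `null`, and `p.lower()` fails on a non-string persona entry. One possible hardening is sketched below. It is offered as an assumption about how the gap could be closed, not as part of the PR, and `coerce_confidence` and `clean_personas` are hypothetical helper names.

```python
from typing import Any, List, Optional


def coerce_confidence(value: Any, default: float = 0.5) -> float:
    """Turn an LLM-supplied confidence into a float clamped to [0.0, 1.0]."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    if number != number:  # NaN compares unequal to itself
        return default
    return max(0.0, min(1.0, number))


def clean_personas(raw: Any, user_name: Optional[str]) -> List[str]:
    """Keep only string persona names that are not the user's own name."""
    if not isinstance(raw, list):
        return []
    blocked = user_name.lower() if isinstance(user_name, str) else None
    return [p for p in raw if isinstance(p, str) and p.lower() != blocked]


print(coerce_confidence("high"))                          # 0.5
print(clean_personas(["Echo", 3, "jordan"], "Jordan"))    # ['Echo']
```

Clamping is a design choice here: a model that answers `1.7` is treated as fully confident rather than rejected.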
@@ -2,6 +2,9 @@
 """
 entity_detector.py — Auto-detect people and projects from file content.
 
+Uses ``from __future__ import annotations`` so PEP 604 union syntax
+(``dict | None``) works on the Python 3.9 baseline.
+
 Two-pass approach:
   Pass 1: scan files, extract entity candidates with signal counts
   Pass 2: score and classify each candidate as person, project, or uncertain
@@ -27,6 +30,8 @@ Usage:
     confirmed = confirm_entities(candidates)  # interactive review
 """
 
+from __future__ import annotations
+
 import re
 import os
 import functools
@@ -396,7 +401,12 @@ def classify_entity(name: str, frequency: int, scores: dict) -> dict:
 # ==================== MAIN DETECT ====================
 
 
-def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) -> dict:
+def detect_entities(
+    file_paths: list,
+    max_files: int = 10,
+    languages=("en",),
+    corpus_origin: dict | None = None,
+) -> dict:
     """
     Scan files and detect entity candidates.
 
@@ -405,12 +415,23 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
         max_files: Max files to read (for speed)
         languages: Tuple of language codes whose entity patterns should be
             applied (union). Defaults to ``("en",)``.
+        corpus_origin: Optional corpus-origin context (the dict produced
+            by ``mempalace.corpus_origin`` and persisted to
+            ``<palace>/.mempalace/origin.json`` by ``mempalace init``).
+            When supplied and the corpus is identified as AI-dialogue with
+            known agent persona names, candidates whose name matches an
+            agent persona are moved out of ``people``/``uncertain`` and
+            into a new ``agent_personas`` bucket. Shape:
+            ``{"schema_version": 1, "result": {"agent_persona_names": [...], ...}}``.
 
     Returns:
         {
             "people": [...entity dicts...],
             "projects": [...entity dicts...],
             "uncertain":[...entity dicts...],
+            # Only present when corpus_origin reclassifies at least one
+            # candidate as an agent persona:
+            "agent_personas": [...entity dicts...],
         }
     """
     langs = _normalize_langs(languages)
@@ -440,7 +461,10 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
     candidates = extract_candidates(combined_text, languages=langs)
 
     if not candidates:
-        return {"people": [], "projects": [], "topics": [], "uncertain": []}
+        return _apply_corpus_origin(
+            {"people": [], "projects": [], "topics": [], "uncertain": []},
+            corpus_origin,
+        )
 
     # Score and classify each candidate
     people = []
@@ -463,14 +487,76 @@ def detect_entities(file_paths: list, max_files: int = 10, languages=("en",)) ->
     projects.sort(key=lambda x: x["confidence"], reverse=True)
     uncertain.sort(key=lambda x: x["frequency"], reverse=True)
 
-    # Cap results to most relevant
-    return {
+    detected = {
         "people": people[:15],
         "projects": projects[:10],
         "topics": [],
         "uncertain": uncertain[:8],
     }
 
+    return _apply_corpus_origin(detected, corpus_origin)
+
+
+def _apply_corpus_origin(detected: dict, corpus_origin: dict | None) -> dict:
+    """Reclassify per-candidate buckets using corpus-origin context.
+
+    When the corpus is identified as AI-dialogue with known agent persona
+    names, a candidate whose name case-insensitively matches one of those
+    personas is moved from ``people``/``uncertain`` into an
+    ``agent_personas`` bucket. The candidate's per-entity ``type`` is also
+    rewritten to ``"agent_persona"``.
+
+    No-op when ``corpus_origin`` is ``None`` or contains no usable persona
+    names. Pure: returns a new dict, does not mutate the input.
+    """
+    if not corpus_origin:
+        return detected
+
+    origin_result = corpus_origin.get("result") or {}
+    raw_personas = origin_result.get("agent_persona_names") or []
+    persona_lower = {n.lower() for n in raw_personas if isinstance(n, str)}
+    if not persona_lower:
+        return detected
+
+    agent_personas: list = []
+    new_people: list = []
+    new_uncertain: list = []
+
+    for entity in detected.get("people", []):
+        if entity["name"].lower() in persona_lower:
+            agent_personas.append(_tag_as_persona(entity))
+        else:
+            new_people.append(entity)
+
+    for entity in detected.get("uncertain", []):
+        if entity["name"].lower() in persona_lower:
+            agent_personas.append(_tag_as_persona(entity))
+        else:
+            new_uncertain.append(entity)
+
+    if not agent_personas:
+        return detected
+
+    agent_personas.sort(key=lambda x: x.get("confidence", 0), reverse=True)
+
+    return {
+        **detected,
+        "people": new_people,
+        "uncertain": new_uncertain,
+        "agent_personas": agent_personas,
+    }
+
+
|
def _tag_as_persona(entity: dict) -> dict:
|
||||||
|
"""Return a new entity dict tagged as agent_persona with provenance signal."""
|
||||||
|
existing_signals = entity.get("signals", [])
|
||||||
|
return {
|
||||||
|
**entity,
|
||||||
|
"type": "agent_persona",
|
||||||
|
"confidence": max(0.95, entity.get("confidence", 0.0)),
|
||||||
|
"signals": ["matched corpus_origin agent_persona_names"] + existing_signals[:2],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
# ==================== INTERACTIVE CONFIRM ====================
|
# ==================== INTERACTIVE CONFIRM ====================
|
||||||
|
|
||||||
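To make the persona sweep above concrete, here is a minimal standalone sketch of the same idea. It is not the PR's code: `sweep_personas` is a hypothetical condensed helper, the sample entities are invented, and it omits the `signals` provenance entry and the final confidence sort.

```python
def sweep_personas(detected: dict, persona_names: list) -> dict:
    """Move entities whose name matches a persona (case-insensitively)
    out of people/uncertain and into a new agent_personas bucket."""
    wanted = {n.lower() for n in persona_names if isinstance(n, str)}
    moved = []
    kept = {"people": [], "uncertain": []}
    for bucket in ("people", "uncertain"):
        for entity in detected.get(bucket, []):
            if entity["name"].lower() in wanted:
                # Build a new dict so the caller's entity is not mutated.
                moved.append({
                    **entity,
                    "type": "agent_persona",
                    "confidence": max(0.95, entity.get("confidence", 0.0)),
                })
            else:
                kept[bucket].append(entity)
    if not moved:
        return detected  # nothing matched: hand back the input unchanged
    return {**detected, **kept, "agent_personas": moved}


# Invented sample data for illustration only.
detected = {
    "people": [
        {"name": "Jordan", "type": "person", "confidence": 0.9},
        {"name": "echo", "type": "person", "confidence": 0.6},
    ],
    "projects": [],
    "uncertain": [{"name": "Sparrow", "type": "uncertain", "confidence": 0.4}],
}
result = sweep_personas(detected, ["Echo", "Sparrow", "Cipher"])
print([e["name"] for e in result["people"]])
print([e["name"] for e in result["agent_personas"]])
```

Returning the original dict when nothing matches keeps the no-persona path identical to the pre-change output shape, which is the property the docstring above describes as a no-op.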
|
|||||||
+50
-1
@@ -262,6 +262,52 @@ def _apply_classifications(
     return new_detected, reclassified, dropped


+def _build_corpus_origin_preamble(corpus_origin: dict | None) -> str:
+    """Build a system-prompt preamble carrying corpus-origin context.
+
+    When the corpus has been identified as AI-dialogue with known persona
+    names, this preamble lets the LLM disambiguate ambiguous candidates
+    with knowledge that this is AI-dialogue. It does NOT add a new label
+    or change the classification schema — the post-refine sweep in
+    project_scanner.discover_entities still moves persona names into
+    ``agent_personas``. The preamble is purely classification context for
+    the OTHER candidates (ambiguous, common-word) that benefit from
+    knowing the corpus shape.
+
+    Returns ``""`` when no usable origin context is available, so callers
+    can concatenate unconditionally without changing the v3.3.3 prompt
+    shape for opt-out paths.
+    """
+    if not corpus_origin:
+        return ""
+    result = corpus_origin.get("result") or {}
+    if not result.get("likely_ai_dialogue"):
+        return ""
+
+    lines = ["\n\nCORPUS CONTEXT (corpus-origin detection):"]
+    platform = result.get("primary_platform")
+    if platform:
+        lines.append(f"- This corpus is AI-dialogue from {platform}.")
+    user_name = result.get("user_name")
+    if user_name:
+        lines.append(
+            f"- The corpus author (the human user) is named '{user_name}'. "
+            f"Treat this name as PERSON."
+        )
+    personas = result.get("agent_persona_names") or []
+    if personas:
+        lines.append(
+            "- The user has assigned these persona names to AI agents in "
+            f"this corpus: {', '.join(personas)}."
+        )
+        lines.append(
+            "- Persona names refer to AI agents, not biological people. "
+            "Classify them as PERSON (a downstream step tags them as "
+            "agent personas)."
+        )
+    return "\n".join(lines)
+
+
 def _is_authoritative_person(entry: dict) -> bool:
     """Return True for git-author people that should not be second-guessed."""
     signals = " ".join(entry.get("signals", [])).lower()
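The "return an empty string so callers can concatenate unconditionally" contract described in the docstring above can be illustrated with a small sketch. `BASE_PROMPT` and `build_preamble` are hypothetical stand-ins (the real `SYSTEM_PROMPT` text is not shown in this diff), and the preamble body is abbreviated.

```python
# Hypothetical stand-in for the real system prompt, which this diff does not show.
BASE_PROMPT = "Classify each candidate as PERSON, PROJECT, or OTHER."


def build_preamble(corpus_origin):
    """Abbreviated sketch: emit context lines only for AI-dialogue corpora."""
    result = (corpus_origin or {}).get("result") or {}
    if not result.get("likely_ai_dialogue"):
        return ""  # empty string keeps the base prompt byte-for-byte unchanged
    lines = ["\n\nCORPUS CONTEXT (corpus-origin detection):"]
    platform = result.get("primary_platform")
    if platform:
        lines.append(f"- This corpus is AI-dialogue from {platform}.")
    personas = result.get("agent_persona_names") or []
    if personas:
        lines.append(f"- Agent persona names: {', '.join(personas)}.")
    return "\n".join(lines)


origin = {
    "result": {
        "likely_ai_dialogue": True,
        "primary_platform": "Claude (Anthropic)",
        "agent_persona_names": ["Echo", "Sparrow"],
    }
}

# Callers never need an if/else around the concatenation.
print(BASE_PROMPT + build_preamble(None))
print(BASE_PROMPT + build_preamble(origin))
```

Because the opt-out path contributes nothing, a run without corpus-origin data sends the same prompt it would have sent before this change.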
@@ -292,6 +338,7 @@ def refine_entities(
     batch_size: int = BATCH_SIZE,
     show_progress: bool = True,
     allow_project_promotions: bool = True,
+    corpus_origin: dict | None = None,
 ) -> RefineResult:
     """Reclassify detected entities using the LLM provider.

@@ -354,12 +401,14 @@ def refine_entities(
     completed = 0
     cancelled = False

+    system_prompt = SYSTEM_PROMPT + _build_corpus_origin_preamble(corpus_origin)
+
     for idx, batch in enumerate(batches, 1):
         if show_progress and batch:
             _print_progress(idx - 1, len(batches), batch[0][0])
         user_prompt = _build_user_prompt(batch)
         try:
-            resp = provider.classify(SYSTEM_PROMPT, user_prompt, json_mode=True)
+            resp = provider.classify(system_prompt, user_prompt, json_mode=True)
         except KeyboardInterrupt:
             cancelled = True
             break
@@ -597,6 +597,7 @@ def discover_entities(
     people_cap: int = 15,
     llm_provider: object = None,
     show_progress: bool = True,
+    corpus_origin: dict | None = None,
 ) -> dict:
     """Top-level entity discovery: real signals first, prose detection second.

@@ -613,11 +614,19 @@ def discover_entities(
          mentioned in docs/notes (not code)
       5. Optional LLM refinement pass — reclassifies ambiguous candidates
          using the caller-supplied provider
+      6. Optional corpus-origin persona filter — when the corpus is
+         identified as AI-dialogue, candidates whose name matches an
+         agent_persona_name are moved to an ``agent_personas`` bucket
+         instead of being reported as people.

     Passing ``llm_provider`` enables phase-2 refinement. The caller is
     responsible for constructing the provider (``llm_client.get_provider``)
     and confirming availability. Refinement is blocking-interactive:
     progress prints to stderr; Ctrl-C returns partial results.
+
+    Passing ``corpus_origin`` enables corpus-origin persona reclassification.
+    The expected shape is the dict written by ``mempalace init`` to
+    ``<palace>/.mempalace/origin.json`` (see ``corpus_origin.py``).
     """
     projects, people = scan(project_dir)

@@ -668,7 +677,7 @@ def discover_entities(
         drop_secondary_uncertain=has_real_signal and llm_provider is None,
     )

-    # Optional phase 2: LLM refinement.
+    # Optional LLM refinement pass (when an llm_provider was supplied).
     if llm_provider is not None:
         from mempalace.llm_refine import collect_corpus_text, refine_entities

@@ -679,6 +688,7 @@ def discover_entities(
             llm_provider,
             show_progress=show_progress,
             allow_project_promotions=not has_real_signal,
+            corpus_origin=corpus_origin,
         )
         if show_progress:
             status_bits = []
@@ -696,6 +706,14 @@ def discover_entities(
             print(f" LLM refine: {', '.join(status_bits)}", file=_sys.stderr)
         merged = result.merged

+    # Corpus-origin persona reclassification — applied last so it sweeps
+    # candidates contributed by every upstream source (manifests, git authors,
+    # prose, LLM refinement). Idempotent: no corpus_origin → exact v3.3.3 shape.
+    if corpus_origin is not None:
+        from mempalace.entity_detector import _apply_corpus_origin
+
+        merged = _apply_corpus_origin(merged, corpus_origin)
+
     return merged


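The `discover_entities` docstring above says `corpus_origin` is the dict persisted at `<palace>/.mempalace/origin.json`. One way a caller might obtain it is sketched below. `load_corpus_origin` is a hypothetical helper, not a function from this PR, and the dict in the usage comment only reflects the keys read by the code in this diff (`result`, `agent_persona_names`, and so on).

```python
import json
from pathlib import Path


def load_corpus_origin(palace_dir):
    """Hypothetical helper: return the parsed origin.json as a dict,
    or None when the file is missing, unreadable, or not a JSON object."""
    path = Path(palace_dir) / ".mempalace" / "origin.json"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers a missing or unreadable file;
        # ValueError covers invalid JSON and decoding errors.
        return None
    return data if isinstance(data, dict) else None


# Usage sketch (assumed call shape, matching the signature shown in the diff):
#   origin = load_corpus_origin(palace_dir)
#   entities = discover_entities(project_dir, corpus_origin=origin)
# Passing None leaves the persona sweep disabled.
```

Returning `None` on any failure lines up with the guard in the hunk above, where the sweep only runs when `corpus_origin is not None`.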
@@ -127,6 +127,11 @@ def test_cmd_init_with_entities(mock_config_cls, tmp_path):
         patch("mempalace.entity_detector.detect_entities", return_value=detected),
         patch("mempalace.entity_detector.confirm_entities", return_value=confirmed),
         patch("mempalace.room_detector_local.detect_rooms_local"),
+        # Pass 0 (corpus_origin) needs real file IO; this test mocks
+        # builtins.open globally for the entities.json write, which would
+        # break Pass 0's file-reading path. Patch Pass 0 out — a separate
+        # suite (tests/test_corpus_origin_integration.py) covers it directly.
+        patch("mempalace.cli._run_pass_zero", return_value=None),
         patch("builtins.open", MagicMock()),
         patch("mempalace.cli._maybe_run_mine_after_init"),
     ):
@@ -0,0 +1,395 @@
+"""Tests for corpus_origin detection.
+
+The corpus-origin detector answers ONE foundational question before any
+downstream Pass 2 classification runs:
+
+    "Is this corpus a record of AI-agent dialogue, and if so, which platform
+    and what persona names has the user assigned to the agent?"
+
+Detection is two-tier:
+  - Tier 1: cheap content-aware heuristic (grep for well-known AI terms
+    and turn markers). No API calls. Always runs.
+  - Tier 2: LLM-assisted confirmation + persona extraction. Takes a small
+    sample of drawer texts and uses Haiku's pre-trained world knowledge
+    about Claude/ChatGPT/Gemini/etc. to confirm platform + identify
+    persona-names the user assigned to the agent.
+
+Default stance: "this IS an AI-dialogue corpus" unless strong evidence
+otherwise. False-negative (missing an AI corpus) is catastrophic for
+downstream classification; false-positive is recoverable via per-drawer
+voice-profile detection in later passes.
+
+TDD: these tests fail until mempalace/corpus_origin.py is implemented."""
+
+from mempalace.corpus_origin import (
+    CorpusOriginResult,
+    detect_origin_heuristic,
+    detect_origin_llm,
+)
+
+
+# ── Tier 1: heuristic (no LLM) ────────────────────────────────────────────
+
+
+class TestHeuristic:
+    def test_claude_heavy_corpus_detected(self):
+        """A corpus with abundant Claude references + turn markers should
+        be confidently detected as AI-dialogue."""
+        samples = [
+            "user: hey Claude, can you help me\nassistant: sure, what do you need\n",
+            "I was talking to Claude Opus about the MCP server setup",
+            "Sonnet 4.5 handled this better than Haiku 4.5 did",
+            "claude mcp add mempalace -- mempalace-mcp",
+            "human: what's up\nassistant: I'm happy to help",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert result.likely_ai_dialogue is True
+        assert result.confidence >= 0.8
+        assert (
+            "Claude" in " ".join(result.evidence) or "claude" in " ".join(result.evidence).lower()
+        )
+
+    def test_gpt_corpus_detected(self):
+        samples = [
+            "I asked ChatGPT to summarize my paper",
+            "The GPT-4 response was surprisingly good",
+            "user: explain quantum computing\nassistant: quantum computing uses qubits",
+            "OpenAI's model was able to help with the code",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert result.likely_ai_dialogue is True
+        assert any("GPT" in e or "ChatGPT" in e or "OpenAI" in e for e in result.evidence)
+
+    def test_pure_narrative_corpus_detected_as_not_ai(self):
+        """A story/journal corpus with no AI signals should be flagged
+        not-AI (default stance flipped only with evidence)."""
+        samples = [
+            "Today the cat finally ventured into the garden. The dog watched.",
+            "The morning light came through the window as I wrote.",
+            "Chapter 3: The Reckoning. It was a dark and stormy night.",
+            "My father's old journal described the same field in 1972.",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert result.likely_ai_dialogue is False
+        assert result.confidence >= 0.8
+
+    def test_ambiguous_corpus_defaults_to_ai(self):
+        """When evidence is thin or mixed, default to assuming AI-dialogue.
+        False-negative is worse than false-positive."""
+        samples = [
+            "some notes about the meeting today",
+            "Later on I went to the store.",
+            "Short file with little signal.",
+        ]
+        result = detect_origin_heuristic(samples)
+        # Low signal → default stance is ai_dialogue=True with low confidence
+        assert result.likely_ai_dialogue is True
+        assert result.confidence <= 0.6
+        assert "default-stance" in " ".join(result.evidence).lower()
+
+    def test_turn_markers_alone_sufficient(self):
+        """Even without AI brand mentions, strong turn-marker presence
+        indicates dialogue structure consistent with AI corpora."""
+        samples = [
+            "user: hello\nassistant: hi there, how can I help?\nuser: summarize X\nassistant: sure",
+            "human: what's the weather\nai: I don't have real-time data\n",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert result.likely_ai_dialogue is True
+
+    # ── Pattern + context (not capitalization, not English-rule) ──────────
+
+    def test_brand_terms_case_insensitive(self):
+        """Detection cannot rely on the user typing proper-cased brand names.
+        Lowercase 'claude code', 'chatgpt', 'gemini-pro', 'mcp' must trip
+        the same as their proper-cased equivalents. NO turn-marker fallback
+        in this corpus — the brand matches must do the work."""
+        samples = [
+            "i love claude code, it just works for refactoring tasks",
+            "asked chatgpt to write a regex and it nailed it on the first try",
+            "switched to gemini-pro for the long-context summary task last week",
+            "added mempalace as an mcp server in my .claude/ settings file",
+            "anthropic's haiku model is cheap enough to run on every drawer",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert (
+            result.likely_ai_dialogue is True
+        ), f"lowercase brand terms missed; evidence: {result.evidence}"
+        # Evidence must show MULTIPLE distinct case-insensitive brand matches.
+        # 'chatgpt' lowercase only matches under case-insensitive search
+        # (the brand list has 'ChatGPT' proper-cased only).
+        evidence_str = " ".join(result.evidence).lower()
+        matched = sum(t in evidence_str for t in ("chatgpt", "anthropic", "haiku", "gemini-pro"))
+        assert (
+            matched >= 2
+        ), f"case-insensitive brand matches did not fire — only got: {result.evidence}"
+
+    def test_zodiac_corpus_not_flagged_as_ai(self):
+        """An astrology forum post with high 'Gemini' density but ZERO
+        unambiguous AI signals (no MCP/LLM/ChatGPT/turn markers) must NOT
+        be flagged as AI-dialogue. Word-sense disambiguation is required:
+        Gemini-the-zodiac-sign vs Gemini-the-AI-platform."""
+        samples = [
+            "I'm a Gemini sun, Pisces moon, and Leo rising.",
+            "Geminis are dreamers and overthinkers — that's the dual nature.",
+            "Compatibility between Gemini and Sagittarius is famously strong.",
+            "If you're a Gemini, expect Mercury retrograde to hit you hardest.",
+            "My horoscope this week says Gemini energy will dominate Wednesday.",
+            "The Gemini twins in Greek mythology are Castor and Pollux.",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert (
+            result.likely_ai_dialogue is False
+        ), f"zodiac corpus wrongly flagged AI; evidence: {result.evidence}"
+
+    def test_french_novel_with_claude_name_not_flagged(self):
+        """A French novel where 'Claude' is a character name (Claude is a
+        common French masculine name) must NOT trip AI-dialogue detection.
+        Disambiguation is by context, not by the presence of the word."""
+        samples = [
+            "Claude marchait lentement le long de la Seine ce matin-là.",
+            "« Claude, tu rentres dîner? » lui demanda sa mère depuis la cuisine.",
+            "Pour Claude, l'art de vivre passait avant tout par la patience.",
+            "Le vieux Claude se souvenait encore de la guerre, des champs déserts.",
+            "Claude ouvrit la fenêtre. Le matin sentait le pain frais et la pluie.",
+            "Les amis de Claude s'étaient réunis chez lui pour fêter ses soixante ans.",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert (
+            result.likely_ai_dialogue is False
+        ), f"French novel wrongly flagged AI; evidence: {result.evidence}"
+
+    def test_poetry_corpus_with_haiku_sonnet_not_flagged(self):
+        """A poetry corpus with high 'haiku', 'sonnet', 'opus' density
+        (poetic forms / classical music terms) but no AI infrastructure
+        terms must NOT be flagged as AI-dialogue."""
+        samples = [
+            "A haiku is seventeen syllables across three lines: 5-7-5.",
+            "Shakespeare's sonnet 18 remains the most quoted in the English canon.",
+            "Beethoven's opus 27 includes the Moonlight Sonata.",
+            "I wrote three haiku this morning before coffee.",
+            "The sonnet form arrived in England via Wyatt and Surrey.",
+            "Her first opus, published at twenty, was a song cycle for soprano.",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert (
+            result.likely_ai_dialogue is False
+        ), f"poetry corpus wrongly flagged AI; evidence: {result.evidence}"
+
+    def test_word_boundary_brand_matching(self):
+        """Brand-term matching must use word boundaries. Embedded matches
+        inside larger words ('Claudette' → 'Claude', 'opuscule' → 'Opus',
+        'sonneteer' → 'Sonnet', 'llamas' → 'Llama', 'bardic' → 'Bard')
+        must NOT be counted as brand hits.
+
+        Word boundaries don't change classification on the co-occurrence-
+        suppressed cases, but they clean up the evidence strings — false
+        matches must not appear in the audit trail. They also prevent
+        'Claude Code' from triple-counting as 'Claude Code' + 'Claude'
+        overlap."""
+        samples = [
+            "My grandmother Claudette baked the most beautiful tarts every Sunday.",
+            "Two llamas were spotted near the trailhead this morning at sunrise.",
+            "Beethoven's opuscule for solo violin remained unpublished for decades.",
+            "She studied to become a sonneteer after reading the full Spenser cycle.",
+            "Bardic traditions in the Hebrides survived well into the eighteenth century.",
+            "The complete opuses of Mozart fill an entire wall of the library.",
+        ]
+        result = detect_origin_heuristic(samples)
+        evidence_str = " ".join(result.evidence).lower()
+
+        # None of the brand terms should show up in evidence — every
+        # would-be match is an embedded false-positive that word
+        # boundaries should suppress.
+        for embedded_term in ("claude", "opus", "sonnet", "llama", "bard"):
+            assert f"'{embedded_term}'" not in evidence_str, (
+                f"word-boundary bug: '{embedded_term}' falsely matched inside "
+                f"a longer word — evidence: {result.evidence}"
+            )
+
+        # And classification should be not-AI (no real AI signals present).
+        assert (
+            result.likely_ai_dialogue is False
+        ), f"corpus has no real AI signals; evidence: {result.evidence}"
+
+    def test_ambiguous_brand_with_unambiguous_signal_flagged(self):
+        """When an ambiguous brand term ('Gemini') co-occurs with an
+        UNAMBIGUOUS AI signal (turn markers, MCP, ChatGPT, Claude Code)
+        in the same corpus, the Gemini hits SHOULD count and the corpus
+        SHOULD be flagged as AI-dialogue."""
+        samples = [
+            "Switched the agent from Gemini to ChatGPT mid-session for cost reasons.",
+            "Gemini handled the long-context task; user: please summarize\nassistant: here is the summary",
+            "user: try Gemini for this\nassistant: running it through gemini-pro now",
+            "MCP server config: Gemini as primary, OpenAI as fallback.",
+        ]
+        result = detect_origin_heuristic(samples)
+        assert (
+            result.likely_ai_dialogue is True
+        ), f"ambiguous+unambiguous co-occurrence missed; evidence: {result.evidence}"
+
+
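The heuristic tests above pin down three behaviours: case-insensitive matching, word-boundary matching, and counting ambiguous terms only when an unambiguous signal is also present. The sketch below shows one way those rules can combine. It is an assumption-laden illustration, not the contents of `mempalace/corpus_origin.py`: the term lists, `TURN_MARKER`, and `brand_hits` are all hypothetical.

```python
import re

# Hypothetical term lists; the real module's lists are not shown in this diff.
UNAMBIGUOUS = ("ChatGPT", "OpenAI", "Anthropic", "MCP", "Claude Code")
AMBIGUOUS = ("Claude", "Gemini", "Haiku", "Sonnet", "Opus", "Llama", "Bard")

# A line that starts with a speaker label such as "user:" or "assistant:".
TURN_MARKER = re.compile(r"^(user|assistant|human|ai):", re.IGNORECASE | re.MULTILINE)


def _count(term, text):
    # \b on both sides rejects embedded matches such as "Claudette" or "llamas".
    return len(re.findall(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE))


def brand_hits(samples):
    """Return {term: count}; ambiguous terms count only alongside a strong signal."""
    text = "\n".join(samples)
    hits = {term: _count(term, text) for term in UNAMBIGUOUS}
    strong = any(hits.values()) or bool(TURN_MARKER.search(text))
    if strong:
        hits.update({term: _count(term, text) for term in AMBIGUOUS})
    return {term: n for term, n in hits.items() if n}


print(brand_hits(["My grandmother Claudette baked tarts.", "Two llamas near the trailhead."]))
print(brand_hits(["I'm a Gemini sun, Pisces moon."]))
print(brand_hits(["asked chatgpt about gemini"]))
```

Under these assumptions the zodiac sample yields no hits because nothing unambiguous unlocks `Gemini`, while the last sample counts both terms.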
+# ── Tier 2: LLM-assisted (mocked) ─────────────────────────────────────────
+
+
+class _FakeProvider:
+    """Minimal stand-in for mempalace's LLMProvider used for testing."""
+
+    def __init__(self, canned_response):
+        self._response = canned_response
+        self.calls = []
+
+    def classify(self, system, user, json_mode=True):
+        self.calls.append({"system": system, "user": user})
+
+        class R:
+            text = self._response
+
+        return R()
+
+    def check_available(self):
+        return True, "ok"
+
+
+class TestLLMConfirmation:
+    def test_extracts_persona_names_and_platform(self):
+        fake_response = """{
+            "is_ai_dialogue_corpus": true,
+            "confidence": 0.97,
+            "primary_platform": "Claude Code (Anthropic CLI)",
+            "agent_persona_names": ["Echo", "Sparrow", "Cipher", "Orc"],
+            "evidence": [
+                "user addresses agent as 'Echo' on assistant turns",
+                "Claude Code banner text in samples",
+                "references to MCP, CLAUDE.md, hooks"
+            ]
+        }"""
+        provider = _FakeProvider(fake_response)
+        samples = [
+            "user: hey Echo, what's up\nassistant: I'm here, what do you need\n",
+            "Claude Code session banner Sonnet 4.5 Claude Pro",
+        ]
+        result = detect_origin_llm(samples, provider)
+        assert result.likely_ai_dialogue is True
+        assert result.confidence >= 0.9
+        assert "Echo" in result.agent_persona_names
+        assert "Sparrow" in result.agent_persona_names
+        assert "Claude" in result.primary_platform
+
+    def test_narrative_corpus_llm_confirms_no_agent(self):
+        fake_response = """{
+            "is_ai_dialogue_corpus": false,
+            "confidence": 0.95,
+            "primary_platform": null,
+            "agent_persona_names": [],
+            "evidence": ["pure narrative prose, no turn markers, no AI terms"]
+        }"""
+        provider = _FakeProvider(fake_response)
+        samples = ["Once upon a time in a small village", "The old woman smiled"]
+        result = detect_origin_llm(samples, provider)
+        assert result.likely_ai_dialogue is False
+        assert result.agent_persona_names == []
+        assert result.primary_platform is None
+
+    def test_handles_malformed_llm_response(self):
+        """If the LLM returns garbage, fall back gracefully to the
+        conservative default (assume AI-dialogue with low confidence)."""
+        provider = _FakeProvider("not even close to JSON")
+        result = detect_origin_llm(["sample text"], provider)
+        # Fallback: conservative default, low confidence
+        assert result.likely_ai_dialogue is True
+        assert result.confidence <= 0.5
+        assert (
+            "fallback" in " ".join(result.evidence).lower()
+            or "error" in " ".join(result.evidence).lower()
+        )
+
+    def test_filters_user_name_out_of_personas(self):
+        """Regression test: Haiku sometimes leaks the user's own name into
+        agent_persona_names despite the prompt's CRITICAL distinction. The
+        parser must strip the user's name from personas if it appears in
+        both fields (case-insensitive). The user is the human author of
+        the corpus, not an agent persona."""
+        fake_response = """{
+            "is_ai_dialogue_corpus": true,
+            "confidence": 0.97,
+            "primary_platform": "Claude (Anthropic)",
+            "user_name": "Jordan",
+            "agent_persona_names": ["Echo", "Sparrow", "Jordan", "Cipher"],
+            "evidence": ["user Jordan talks to agents Echo/Sparrow/Cipher"]
+        }"""
+        provider = _FakeProvider(fake_response)
+        result = detect_origin_llm(["sample"], provider)
+        # user_name is exposed in its own field
+        assert result.user_name == "Jordan"
+        # "Jordan" is filtered out of agent_persona_names
+        assert "Jordan" not in result.agent_persona_names
+        # Real personas are preserved
+        for persona in ("Echo", "Sparrow", "Cipher"):
+            assert persona in result.agent_persona_names
+
+    def test_filter_is_case_insensitive(self):
+        """The user-name filter works even when the LLM returns a casing
+        mismatch between user_name and the personas list."""
+        fake_response = """{
+            "is_ai_dialogue_corpus": true,
+            "confidence": 0.9,
+            "primary_platform": "Claude",
+            "user_name": "Jordan",
+            "agent_persona_names": ["Echo", "jordan", "JORDAN", "Cipher"],
+            "evidence": []
+        }"""
+        provider = _FakeProvider(fake_response)
+        result = detect_origin_llm(["sample"], provider)
+        # All case-variants of the user's name are filtered
+        assert "jordan" not in [p.lower() for p in result.agent_persona_names]
+        assert result.agent_persona_names == ["Echo", "Cipher"]
+
+    def test_user_name_field_surfaces_author(self):
+        """The user_name field captures the human author of the corpus,
+        separate from agent personas. This gives downstream passes a
+        clear 'who is the user, who is the agent' distinction."""
+        fake_response = """{
+            "is_ai_dialogue_corpus": true,
+            "confidence": 0.95,
+            "primary_platform": "ChatGPT (OpenAI)",
+            "user_name": "Sarah",
+            "agent_persona_names": ["MyAssistant"],
+            "evidence": ["Sarah writes to MyAssistant"]
+        }"""
+        provider = _FakeProvider(fake_response)
+        result = detect_origin_llm(["sample"], provider)
+        assert result.user_name == "Sarah"
+        assert result.agent_persona_names == ["MyAssistant"]
+
+
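The Tier 2 tests above describe two parser behaviours: a conservative fallback when the LLM reply is not valid JSON, and a case-insensitive filter that removes the user's own name from the persona list. A minimal sketch of both follows. `parse_origin_response` is a hypothetical function that returns a plain dict, and the fallback confidence of 0.3 is an assumed value chosen only to satisfy the "low confidence" expectation.

```python
import json


def parse_origin_response(text):
    """Hypothetical parser: tolerate malformed output, strip the user from personas."""
    try:
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return {
            "likely_ai_dialogue": True,  # conservative default stance
            "confidence": 0.3,           # assumed low-confidence value
            "primary_platform": None,
            "user_name": None,
            "agent_persona_names": [],
            "evidence": [f"fallback: could not parse LLM response ({exc})"],
        }

    user_name = data.get("user_name")
    blocked = user_name.lower() if isinstance(user_name, str) else None
    personas = [
        p for p in (data.get("agent_persona_names") or [])
        if isinstance(p, str) and p.lower() != blocked
    ]
    return {
        "likely_ai_dialogue": bool(data.get("is_ai_dialogue_corpus")),
        "confidence": float(data.get("confidence", 0.0)),
        "primary_platform": data.get("primary_platform"),
        "user_name": user_name,
        "agent_persona_names": personas,
        "evidence": list(data.get("evidence") or []),
    }


reply = '{"is_ai_dialogue_corpus": true, "confidence": 0.9, "user_name": "Jordan", "agent_persona_names": ["Echo", "jordan", "JORDAN", "Cipher"]}'
print(parse_origin_response(reply)["agent_persona_names"])
print(parse_origin_response("not even close to JSON")["evidence"])
```

Filtering in the parser rather than relying on the prompt means a model that occasionally mixes up the two roles still cannot push the corpus author into the persona bucket.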
+# ── CorpusOriginResult dataclass ──────────────────────────────────────────
+
+
+class TestResultDataclass:
+    def test_result_has_all_fields(self):
+        r = CorpusOriginResult(
+            likely_ai_dialogue=True,
+            confidence=0.95,
+            primary_platform="Claude Code",
+            agent_persona_names=["Echo"],
+            evidence=["test"],
+        )
+        assert r.likely_ai_dialogue is True
+        assert r.confidence == 0.95
+        assert r.primary_platform == "Claude Code"
+        assert r.agent_persona_names == ["Echo"]
+        assert r.evidence == ["test"]
+
+    def test_result_serializes_to_dict(self):
+        r = CorpusOriginResult(
+            likely_ai_dialogue=False,
+            confidence=0.9,
+            primary_platform=None,
+            agent_persona_names=[],
+            evidence=[],
+        )
+        d = r.to_dict()
+        assert d["likely_ai_dialogue"] is False
+        assert d["primary_platform"] is None
+        assert d["agent_persona_names"] == []
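For readers who want a picture of a result type that would satisfy the dataclass tests above, here is a minimal sketch. `OriginResult` is a hypothetical name; the field names are the ones the tests use, and giving `user_name` a default is an inference from the tests constructing the object without it. The real `CorpusOriginResult` may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class OriginResult:
    """Hypothetical stand-in for CorpusOriginResult."""

    likely_ai_dialogue: bool
    confidence: float
    primary_platform: Optional[str] = None
    # Mutable defaults must go through default_factory in a dataclass.
    agent_persona_names: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    user_name: Optional[str] = None  # assumed optional

    def to_dict(self) -> dict:
        # asdict copies nested lists, so the returned dict is JSON-ready
        # and independent of the instance.
        return asdict(self)


r = OriginResult(likely_ai_dialogue=False, confidence=0.9)
print(r.to_dict())
```

A plain-dict serialization like this is what makes it straightforward to persist the result as JSON and later pass it back in as `corpus_origin`.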
File diff suppressed because it is too large