b99e54546b
10 files changed. 2,563 insertions, 30 deletions. 48 new tests, including end-to-end coverage live-tested with Anthropic Haiku 4.5.

This PR overhauls the first-run experience of `mempalace init` end-to-end, ships a new corpus-origin detection module from scratch, wires it into entity classification and LLM refinement, adds a graceful-fallback path that means `init` never crashes on a missing LLM, and ships a meta-test that prevents internal-coordination jargon from leaking into source or tests.

The headline change is that `mempalace init` now understands what kind of folder you're pointing it at (AI conversations, regular writing, code, narrative) and adapts how it classifies entities accordingly. The same folder containing `Echo`, `Sparrow`, and `Cipher` (names you've assigned to AI agents) used to dump those into your "people" list alongside biological humans. Now they go into a separate `agent_personas` bucket, and your `people` list stays clean.

But the broader change is that `mempalace init` got upgraded across the board: smarter defaults, smarter degradation, smarter classification, smarter persistence, and a new way to refresh as your folder grows. Built and live-verified with Anthropic Haiku 4.5; runs unmodified on the local LLM runtimes mempalace already supports.

## What changes for users (in order, from `pip install` onwards)

**Install:** `pip install mempalace` is unchanged. The package itself didn't shift.

**First run (`mempalace init <folder>`):**

1. **`init` examines your folder before classifying anything.** A free regex heuristic decides in milliseconds: AI conversations, regular writing, narrative, or code? If an LLM is reachable, a second pass extracts the corpus author's name and any agent persona names from the dialogue. v3.3.3 had no such step; it dove straight into entity detection with no corpus context.
2. **LLM-assisted classification is now ON by default.** v3.3.3 made `--llm` opt-in. The LLM-assisted path is qualitatively better (extracts persona names, refines ambiguous classifications, gives the model corpus context) so it now runs by default. The provider abstraction is unchanged from v3.3.3; three buckets are supported by `mempalace.llm_client`:
   - **Anthropic** (`--llm-provider anthropic` + `ANTHROPIC_API_KEY`): the official Messages API. **This is the path live-verified end-to-end in this PR with Haiku 4.5.** Cost: ~\$0.01 per `init`.
   - **Ollama** (`--llm-provider ollama`, the default): local models via `http://localhost:11434`. Fully offline. Honors the "zero-API required" promise.
   - **OpenAI-compatible** (`--llm-provider openai-compat` + `--llm-endpoint`): per the v3.3.3 `mempalace/llm_client.py` docstring, this covers "OpenRouter, LM Studio, llama.cpp server, vLLM, Groq, Fireworks, Together, and most self-hosted setups." We did not test each of those individually as part of this PR; the abstraction has been stable since v3.3.3. If you try this PR with a specific provider and hit a quirk, please file an issue or comment here.
3. **`init` never blocks on a missing LLM.** No Ollama running, no API key set? `init` prints a one-line message pointing at `--no-llm` and falls through to the heuristic-only path. New default behavior, new graceful fallback to support it. `--no-llm` is the new explicit opt-out.
4. **`init` shows you what it detected.** A one-line banner (`Detected: Claude (Anthropic) (user: Jordan, agents: Echo, Sparrow, Cipher)` or `Corpus origin: not AI-dialogue (confidence: 0.98)`) tells you at a glance whether mempalace understood your folder.
5. **Entity classification gets smarter across the board.** Even non-persona candidates benefit: the LLM has corpus context (this is AI-dialogue, this is the user's name, these are agent names) and uses it to disambiguate ambiguous candidates that aren't personas at all.
6. **Agent personas live in their own bucket.** Names you've assigned to AI agents (Echo, Sparrow, Cipher) go into a new `agent_personas` bucket instead of your `people` list. Your real-person entity list stays clean.
7. **Detection result persists to `<palace>/.mempalace/origin.json`** with a `schema_version: 1` envelope, so downstream tools can read it.
8. **Re-running `init` is now idempotent.** Bug fix: running `init` twice on the same folder used to give different classification results because the detection step was sampling its own `entities.json` output. Caught by integration testing during this PR.

**Later, when your folder grows:**

9. **`mempalace mine --redetect-origin`** is a new flag for refreshing the stored detection without redoing the whole `init`. Heuristic-only by design (the flag is meant to be cheap). If you want the full LLM-extracted detection refreshed (persona names, user name, etc.), run `mempalace init <yourfolder>` again; `init` is now idempotent (item 8), so re-running it on the same folder is safe.

## Behind the changes

- **New module** `mempalace/corpus_origin.py` (422 lines) with two-tier detection: regex heuristic with co-occurrence rule (suppresses ambiguous terms like `Claude` / `Gemini` / `Haiku` when no unambiguous AI signal is present, so French novels, astrology forums, poetry corpora, llama-rancher journals don't false-positive), and LLM tier that extracts `user_name` and `agent_persona_names` from dialogue structure with belt-and-suspenders user-vs-agent disambiguation.
- **Entity-classification consumer wiring.** `entity_detector.detect_entities` and `project_scanner.discover_entities` accept an optional `corpus_origin` kwarg. When present and the corpus is identified as AI-dialogue, candidates whose name case-insensitively matches an `agent_persona_name` are routed into the `agent_personas` bucket instead of `people`. Per-entity `type` is rewritten to `"agent_persona"`.
- **LLM-refine consumer wiring.** `llm_refine.refine_entities` accepts the same `corpus_origin` kwarg and prepends a `CORPUS CONTEXT` preamble to its system prompt giving the LLM the platform / user / persona context. Existing `TOPIC` / `PERSON` / `PROJECT` / `COMMON_WORD` / `AMBIGUOUS` labels are unchanged.
- **`init` overhaul.** Pass 0 (corpus-origin detection) inserted before existing Pass 1 (entity discovery). `--llm` flipped to default-on. `--no-llm` added. Graceful-fallback path replaces the previous hard-error on missing LLM. Provider precedence unchanged from the existing `llm_client` module.
- **`mine` flag.** `mempalace mine --redetect-origin` re-runs corpus-origin detection on the current corpus state and overwrites `<palace>/.mempalace/origin.json`.
- **`CLAUDE.md` design principle reworded:** "Local-first, zero external API by default." Local LLMs running on `localhost` (Ollama, LM Studio, llama.cpp, vLLM, unsloth studio) are part of the user's machine, not external APIs. External BYOK providers (Anthropic, OpenAI, Google) are supported but always opt-in, never default, never silent fallback.

## Cost story

- **Anthropic (verified path):** ~\$0.01 per `init` via Haiku 4.5 with `ANTHROPIC_API_KEY`.
- **Ollama / local LLM runtime:** zero cost. Fully offline.
- **OpenAI-compatible service:** depends entirely on the service. The abstraction supports any service speaking the standard `/v1/chat/completions` API; specific quirks vary per provider. Try it and tell us how it goes.
- **No LLM at all:** graceful fallback to heuristic-only. Zero cost. `init` never blocks.

## Backwards compatibility

- All public function signatures gained the `corpus_origin` kwarg as optional (default `None`). Callers that don't pass it see the v3.3.3 return shape unchanged: no `agent_personas` key, no behavioral change.
- The `--llm` CLI flag is preserved as a deprecated alias of the default. Existing scripts that pass it continue to work.
- `corpus_origin=None` keeps `llm_refine.SYSTEM_PROMPT` byte-identical to v3.3.3.

## Test coverage

- **19 unit tests** in `tests/test_corpus_origin.py` covering both tiers, the co-occurrence rule, ambiguous-term suppression, word-boundary brand matching, and user/persona disambiguation.
- **29 integration tests** in `tests/test_corpus_origin_integration.py` covering end-to-end through `mempalace init`, persona reclassification, the `--redetect-origin` flag, the `--llm` default flip, graceful fallback paths, and re-init idempotency. Of those 29, five specifically cover the intersection with develop's other in-flight work (Pass 0 ↔ auto-mine ordering, topics + agent_personas bucket coexistence, entities.json shape, the `wing=` kwarg threading, llm_refine TOPIC label + corpus_origin preamble composition).
- **1354 total mempalace tests pass.** 2 pre-existing environmental failures (`test_mcp_stdio_protection`, chromadb optional dep) unrelated to this change; they fail on plain `develop` too.
- **Live-smoke-tested** with real Anthropic Haiku 4.5 on AI-dialogue and narrative fixtures.

## Hygiene guardrail

This PR also adds a meta-test (`test_no_internal_coordination_jargon_in_source_or_tests`) that walks the source tree and asserts no internal-coordination jargon (e.g. development-phase markers, internal review-section references) leaks into runtime code, comments, docstrings, or LLM prompts. RED if anything slips in. Allowlist for legitimate RFC/spec section citations in `sources/`, `backends/`, `knowledge_graph.py`, and `i18n/`.
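
As an illustration of the persona routing described under "Behind the changes", here is a minimal sketch. It is not the code in `entity_detector` or `project_scanner`: the helper name `route_personas` is hypothetical, the `{"result": {...}}` shape of `corpus_origin` is borrowed from what `llm_refine._build_corpus_origin_preamble` reads, and limiting the sweep to the `people` bucket is an assumption made for brevity.

```python
from __future__ import annotations


def route_personas(detected: dict, corpus_origin: dict | None) -> dict:
    """Hypothetical helper: move persona-named people into `agent_personas`."""
    result = (corpus_origin or {}).get("result") or {}
    personas = {n.lower() for n in result.get("agent_persona_names") or []}
    if not result.get("likely_ai_dialogue") or not personas:
        # No usable origin context: return the input untouched, so the
        # result has no `agent_personas` key (the pre-change shape).
        return detected

    routed = {bucket: list(items) for bucket, items in detected.items()}
    routed["people"], routed["agent_personas"] = [], []
    for entry in detected.get("people", []):
        if entry["name"].lower() in personas:  # case-insensitive match
            # Copy the entry and rewrite its per-entity type.
            routed["agent_personas"].append({**entry, "type": "agent_persona"})
        else:
            routed["people"].append(entry)
    return routed


detected = {
    "people": [
        {"name": "Jordan", "type": "person"},
        {"name": "echo", "type": "person"},
    ],
    "projects": [],
    "uncertain": [],
}
origin = {
    "result": {
        "likely_ai_dialogue": True,
        "user_name": "Jordan",
        "agent_persona_names": ["Echo", "Sparrow"],
    }
}
routed = route_personas(detected, origin)
print([e["name"] for e in routed["people"]])  # ['Jordan']
print(routed["agent_personas"])  # [{'name': 'echo', 'type': 'agent_persona'}]
print("agent_personas" in route_personas(detected, None))  # False
```

The early return is what keeps the sketch backwards compatible in the sense the PR describes: callers that pass no origin get back exactly what they put in.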
501 lines
18 KiB
Python
"""
llm_refine.py — Optional LLM refinement of regex-detected entities.

Takes the candidate set produced by phase-1 detection (manifests, git
authors, regex on prose) and asks an LLM to reclassify each candidate as
PERSON / PROJECT / TOPIC / COMMON_WORD / AMBIGUOUS.

Design constraints:
- Opt-in. Default init path never imports this module.
- Local-first by default (Ollama).
- Interactive UX: visible progress, clean cancellation (Ctrl-C returns
  whatever was classified before the interrupt).
- Don't feed the raw corpus to the LLM — feed candidates + a few sampled
  context lines each. Keeps total input to ~50-100K tokens even for huge
  prose corpora.

Public:
    refine_entities(detected, corpus_text, provider, ...) -> dict
"""

from __future__ import annotations

import json
import re
import sys
from dataclasses import dataclass

from mempalace.llm_client import LLMError, LLMProvider


BATCH_SIZE = 25  # candidates per LLM call; tuned for 4B local models
CONTEXT_LINES_PER_CANDIDATE = 3
CONTEXT_WINDOW_CHARS = 240  # max chars per context line to keep tokens bounded

# Valid labels the LLM is allowed to return. Anything else is treated as
# AMBIGUOUS so the user reviews it.
VALID_LABELS = {"PERSON", "PROJECT", "TOPIC", "COMMON_WORD", "AMBIGUOUS"}


SYSTEM_PROMPT = """You are helping organize a user's memory palace by classifying capitalized tokens found in their files.

For each candidate, pick exactly ONE label:
- PERSON: a specific real person the user knows (colleague, family, character they write about)
- PROJECT: a named product, codebase, or effort the user works on
- TOPIC: a recurring theme or subject (not a person, not a project) — cities, technologies, concepts
- COMMON_WORD: an English word, verb, or fragment that isn't a named entity at all (e.g. "Created", "Before", "Never")
- AMBIGUOUS: context is insufficient to decide between two of the above

Frameworks, runtimes, APIs, cloud services, vendors, and third-party products
(e.g. Angular, OpenAPI, Terraform, Bun, Google) are TOPIC unless the context
clearly says this is the user's own named codebase, product, or active effort.

Use the provided context lines to disambiguate. A capitalized word that only appears in metadata ("Created: 2026-04-24") is COMMON_WORD. A name that appears with pronouns and dialogue is PERSON.

Respond with JSON only. Schema:
{"classifications": [{"name": "<exact candidate name>", "label": "<LABEL>", "reason": "<one short sentence>"}]}

One entry per candidate, same order as the input."""


@dataclass
class RefineResult:
    merged: dict  # updated detected dict
    reclassified: int  # entries whose type changed
    dropped: int  # entries removed from the merged result (COMMON_WORD only)
    errors: list[str]  # per-batch error messages (transport/parse failures)
    batches_completed: int
    batches_total: int
    cancelled: bool


def _collect_contexts(
    corpus_lines: list[str], name: str, max_lines: int = CONTEXT_LINES_PER_CANDIDATE
) -> list[str]:
    """Return up to `max_lines` distinct lines from the corpus that mention `name`.

    Case-insensitive token-boundary match. Lines are truncated to
    CONTEXT_WINDOW_CHARS chars to keep token usage bounded.
    """
    needle = re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)
    seen: set[str] = set()
    out: list[str] = []
    for line in corpus_lines:
        if not needle.search(line):
            continue
        trimmed = line.strip()[:CONTEXT_WINDOW_CHARS]
        if not trimmed or trimmed in seen:
            continue
        seen.add(trimmed)
        out.append(trimmed)
        if len(out) >= max_lines:
            break
    return out


def _build_user_prompt(candidates_with_contexts: list[tuple[str, str, list[str]]]) -> str:
    """Shape: for each candidate, list its current type guess + sampled contexts."""
    parts: list[str] = ["CANDIDATES:"]
    for i, (name, current_type, contexts) in enumerate(candidates_with_contexts, 1):
        parts.append(f"\n{i}. {name} (currently: {current_type})")
        if contexts:
            for c in contexts:
                parts.append(f" > {c}")
        else:
            parts.append(" > (no context available)")
    return "\n".join(parts)


def _extract_json_candidates(text: str) -> list[str]:
    """Return plausible JSON payloads extracted from an LLM response."""
    text = text.strip()
    if not text:
        return []

    candidates: list[str] = [text]

    for match in re.finditer(r"```(?:json)?\s*([\s\S]*?)\s*```", text, re.IGNORECASE):
        candidate = match.group(1).strip()
        if candidate and candidate not in candidates:
            candidates.append(candidate)

    for start, opener in ((i, ch) for i, ch in enumerate(text) if ch in "{["):
        closer = "}" if opener == "{" else "]"
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
                continue

            if ch == '"':
                in_string = True
            elif ch == opener:
                depth += 1
            elif ch == closer:
                depth -= 1
                if depth == 0:
                    candidate = text[start : i + 1].strip()
                    if candidate and candidate not in candidates:
                        candidates.append(candidate)
                    break

    return candidates


def _parse_response(text: str, expected_names: list[str]) -> dict[str, tuple[str, str]]:
    """Parse the LLM's JSON response into {name: (label, reason)}.

    Robust to the model occasionally wrapping JSON in text or returning
    slight schema variations. Falls back to matching by candidate name.
    """
    data = None
    for candidate in _extract_json_candidates(text):
        try:
            data = json.loads(candidate)
            break
        except json.JSONDecodeError:
            continue
    if data is None:
        return {}

    entries = data.get("classifications") if isinstance(data, dict) else data
    if not isinstance(entries, list):
        return {}

    name_to_label: dict[str, tuple[str, str]] = {}
    expected_set = {n.lower(): n for n in expected_names}
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name") or entry.get("candidate")
        label = entry.get("label") or entry.get("type") or entry.get("classification")
        reason = entry.get("reason") or ""
        if not isinstance(name, str) or not isinstance(label, str):
            continue
        # Restore canonical casing from expected_names
        canonical = expected_set.get(name.lower(), name)
        lbl = label.strip().upper()
        if lbl not in VALID_LABELS:
            lbl = "AMBIGUOUS"
        name_to_label[canonical] = (lbl, reason.strip()[:120])
    return name_to_label


def _apply_classifications(
    detected: dict,
    decisions: dict[str, tuple[str, str]],
    allow_project_promotions: bool = True,
) -> tuple[dict, int, int]:
    """Merge LLM decisions back into the detected dict.

    Returns (new_detected, reclassified_count, dropped_count).

    Topics get their own bucket so the caller can persist them as
    cross-wing tunnel signal. ``AMBIGUOUS`` still falls back to
    ``uncertain`` for human review.
    """
    label_to_bucket = {
        "PERSON": "people",
        "PROJECT": "projects",
        "TOPIC": "topics",
        "AMBIGUOUS": "uncertain",
    }
    bucket_to_type = {
        "people": "person",
        "projects": "project",
        "topics": "topic",
        "uncertain": "uncertain",
    }

    # Index every entity by name for in-place update
    all_entries: list[tuple[str, dict]] = []
    for bucket, items in detected.items():
        for e in items:
            all_entries.append((bucket, e))

    reclassified = 0
    dropped = 0
    new_detected: dict[str, list[dict]] = {
        "people": [],
        "projects": [],
        "topics": [],
        "uncertain": [],
    }

    for old_bucket, entry in all_entries:
        decision = decisions.get(entry["name"])
        if decision is None:
            # No LLM opinion — keep as-is
            new_detected.setdefault(old_bucket, []).append(entry)
            continue

        label, reason = decision
        if label == "COMMON_WORD":
            dropped += 1
            continue

        target_bucket = label_to_bucket[label]
        if (
            label == "PROJECT"
            and not allow_project_promotions
            and not _is_authoritative_project(entry)
        ):
            target_bucket = "uncertain"
        updated = dict(entry)
        # Append the LLM's reason as a new signal so the user sees why it moved
        signals = list(updated.get("signals", []))
        signals.append(f"LLM: {label.lower()} — {reason}" if reason else f"LLM: {label.lower()}")
        updated["signals"] = signals
        if target_bucket != old_bucket:
            reclassified += 1
            updated["type"] = bucket_to_type.get(target_bucket, "uncertain")
        new_detected[target_bucket].append(updated)

    return new_detected, reclassified, dropped


def _build_corpus_origin_preamble(corpus_origin: dict | None) -> str:
    """Build a system-prompt preamble carrying corpus-origin context.

    When the corpus has been identified as AI-dialogue with known persona
    names, this preamble lets the LLM disambiguate ambiguous candidates
    with knowledge that this is AI-dialogue. It does NOT add a new label
    or change the classification schema — the post-refine sweep in
    project_scanner.discover_entities still moves persona names into
    ``agent_personas``. The preamble is purely classification context for
    the OTHER candidates (ambiguous, common-word) that benefit from
    knowing the corpus shape.

    Returns ``""`` when no usable origin context is available, so callers
    can concatenate unconditionally without changing the v3.3.3 prompt
    shape for opt-out paths.
    """
    if not corpus_origin:
        return ""
    result = corpus_origin.get("result") or {}
    if not result.get("likely_ai_dialogue"):
        return ""

    lines = ["\n\nCORPUS CONTEXT (corpus-origin detection):"]
    platform = result.get("primary_platform")
    if platform:
        lines.append(f"- This corpus is AI-dialogue from {platform}.")
    user_name = result.get("user_name")
    if user_name:
        lines.append(
            f"- The corpus author (the human user) is named '{user_name}'. "
            f"Treat this name as PERSON."
        )
    personas = result.get("agent_persona_names") or []
    if personas:
        lines.append(
            "- The user has assigned these persona names to AI agents in "
            f"this corpus: {', '.join(personas)}."
        )
        lines.append(
            "- Persona names refer to AI agents, not biological people. "
            "Classify them as PERSON (a downstream step tags them as "
            "agent personas)."
        )
    return "\n".join(lines)


def _is_authoritative_person(entry: dict) -> bool:
    """Return True for git-author people that should not be second-guessed."""
    signals = " ".join(entry.get("signals", [])).lower()
    return "commit" in signals and "repo" in signals


def _is_authoritative_project(entry: dict) -> bool:
    """Return True for manifest/git-backed projects that are already source-backed."""
    signals = " ".join(entry.get("signals", [])).lower()
    manifest_markers = ("package.json", "pyproject.toml", "cargo.toml", "go.mod")
    return any(marker in signals for marker in manifest_markers) or "commit" in signals


def _print_progress(batch_idx: int, total: int, current_name: str) -> None:
    """Overwrite-line progress indicator."""
    width = 40
    filled = int(width * batch_idx / total) if total else 0
    bar = "█" * filled + "░" * (width - filled)
    msg = f"\r LLM refine: [{bar}] batch {batch_idx}/{total} current: {current_name[:30]:<30}"
    sys.stderr.write(msg)
    sys.stderr.flush()


def refine_entities(
    detected: dict,
    corpus_text: str,
    provider: LLMProvider,
    batch_size: int = BATCH_SIZE,
    show_progress: bool = True,
    allow_project_promotions: bool = True,
    corpus_origin: dict | None = None,
) -> RefineResult:
    """Reclassify detected entities using the LLM provider.

    Only regex-derived candidates are sent for refinement. Git authors and
    manifest/git-backed projects are already source-backed and don't benefit
    from LLM second-guessing.

    Ctrl-C during refinement: cancels the remaining batches, returns a
    RefineResult with ``cancelled=True`` and whatever was classified before
    the interrupt. The partial result is safe to pass straight to
    ``confirm_entities``.

    Transport or parse failures in individual batches are recorded in
    ``errors`` and do not abort the run.

    ``allow_project_promotions=False`` keeps LLM-only project guesses in the
    uncertain bucket. This is useful when manifest/git signal already supplied
    canonical projects and regex/LLM hits are likely tools, vendors, or topics.
    """
    candidates: list[tuple[str, str]] = []
    current_type = {"people": "person", "projects": "project", "uncertain": "uncertain"}
    for bucket in ("people", "projects", "uncertain"):
        for e in detected.get(bucket, []):
            if bucket == "people" and _is_authoritative_person(e):
                continue
            if bucket == "projects" and _is_authoritative_project(e):
                continue
            candidates.append((e["name"], current_type[bucket]))

    corpus_lines = corpus_text.splitlines() if corpus_text else []

    # Deduplicate candidate names while preserving order
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for name, kind in candidates:
        if name not in seen:
            seen.add(name)
            unique.append((name, kind))

    if not unique:
        return RefineResult(
            merged=detected,
            reclassified=0,
            dropped=0,
            errors=[],
            batches_completed=0,
            batches_total=0,
            cancelled=False,
        )

    # Build batches
    batches: list[list[tuple[str, str, list[str]]]] = []
    for i in range(0, len(unique), batch_size):
        chunk = unique[i : i + batch_size]
        enriched = [(name, kind, _collect_contexts(corpus_lines, name)) for name, kind in chunk]
        batches.append(enriched)

    all_decisions: dict[str, tuple[str, str]] = {}
    errors: list[str] = []
    completed = 0
    cancelled = False

    system_prompt = SYSTEM_PROMPT + _build_corpus_origin_preamble(corpus_origin)

    for idx, batch in enumerate(batches, 1):
        if show_progress and batch:
            _print_progress(idx - 1, len(batches), batch[0][0])
        user_prompt = _build_user_prompt(batch)
        try:
            resp = provider.classify(system_prompt, user_prompt, json_mode=True)
        except KeyboardInterrupt:
            cancelled = True
            break
        except LLMError as e:
            errors.append(f"batch {idx}: {e}")
            continue
        names_in_batch = [name for name, _, _ in batch]
        decisions = _parse_response(resp.text, names_in_batch)
        if not decisions:
            errors.append(f"batch {idx}: could not parse response")
        all_decisions.update(decisions)
        completed += 1
        if show_progress:
            _print_progress(idx, len(batches), batch[-1][0])

    if show_progress:
        sys.stderr.write("\n")
        sys.stderr.flush()

    merged, reclassified, dropped = _apply_classifications(
        detected,
        all_decisions,
        allow_project_promotions=allow_project_promotions,
    )

    return RefineResult(
        merged=merged,
        reclassified=reclassified,
        dropped=dropped,
        errors=errors,
        batches_completed=completed,
        batches_total=len(batches),
        cancelled=cancelled,
    )


def collect_corpus_text(
    project_dir: str,
    max_files: int = 30,
    max_bytes_per_file: int = 20_000,
) -> str:
    """Gather prose text from ``project_dir`` for use as LLM context source.

    Stratified: reads up to ``max_files`` prose files (``.md``, ``.txt``,
    ``.rst``), preferring recently-modified. Each file capped at
    ``max_bytes_per_file`` to bound total input.
    """
    from pathlib import Path

    from mempalace.entity_detector import PROSE_EXTENSIONS, SKIP_DIRS

    root = Path(project_dir).expanduser().resolve()
    if not root.is_dir():
        return ""
    candidates: list[tuple[float, Path]] = []
    for dirpath, dirs, files in _walk_prose(root, SKIP_DIRS):
        for fname in files:
            p = dirpath / fname
            if p.suffix.lower() not in PROSE_EXTENSIONS:
                continue
            try:
                mtime = p.stat().st_mtime
            except OSError:
                continue
            candidates.append((mtime, p))
    candidates.sort(reverse=True)
    selected = [p for _, p in candidates[:max_files]]
    chunks: list[str] = []
    for p in selected:
        try:
            with open(p, encoding="utf-8", errors="replace") as f:
                chunks.append(f.read(max_bytes_per_file))
        except OSError:
            continue
    return "\n".join(chunks)


def _walk_prose(root, skip_dirs):
    """Walk a directory yielding (Path, dirs, files), pruning skip_dirs.

    Inlined from ``project_scanner._walk`` to avoid a private-name import
    coupling. Functionality is intentionally narrow: prose collection only.
    """
    import os
    from pathlib import Path

    for dirpath, dirs, files in os.walk(root):
        dirs[:] = [d for d in dirs if d not in skip_dirs and not d.startswith(".")]
        yield Path(dirpath), dirs, files
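
The tolerant parsing that `_parse_response` performs (accept JSON wrapped in prose or a code fence, and downgrade unknown labels to `AMBIGUOUS`) can be tried in isolation with the simplified sketch below. It is a stand-in, not the module's implementation: it swaps the hand-rolled bracket matcher for the standard library's `json.JSONDecoder.raw_decode`, and the helper names `first_json_payload` and `labels_from_reply` are hypothetical.

```python
import json

VALID_LABELS = {"PERSON", "PROJECT", "TOPIC", "COMMON_WORD", "AMBIGUOUS"}


def first_json_payload(text: str):
    """Return the first decodable JSON object or array embedded in `text`, else None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at index i and
            # ignores whatever trails it (closing fence, extra prose).
            obj, _end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            continue
        return obj
    return None


def labels_from_reply(text: str) -> dict:
    """Map candidate name to a validated label; unknown labels become AMBIGUOUS."""
    data = first_json_payload(text)
    entries = data.get("classifications") if isinstance(data, dict) else data
    if not isinstance(entries, list):
        return {}
    out = {}
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        name, label = entry.get("name"), entry.get("label")
        if not isinstance(name, str) or not isinstance(label, str):
            continue
        label = label.strip().upper()
        out[name] = label if label in VALID_LABELS else "AMBIGUOUS"
    return out


reply = (
    "Sure! Here is the result:\n"
    "```json\n"
    '{"classifications": [{"name": "Echo", "label": "person"}, '
    '{"name": "Created", "label": "noise"}]}\n'
    "```"
)
print(labels_from_reply(reply))  # {'Echo': 'PERSON', 'Created': 'AMBIGUOUS'}
print(labels_from_reply("no json here"))  # {}
```

Unlike the module, this sketch stops at the first decodable payload and does not restore canonical casing from the expected candidate names, so it is only suitable for experimenting with reply shapes.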