fix(llm): tighter refinement — word boundaries, JSON extraction, authoritative sources
Addresses issues found while reviewing the initial phase-2 implementation against real data:

**Bug: uncertain bucket starved from the LLM.** `discover_entities` was dropping the regex-uncertain bucket whenever real git/manifest signal existed, which is exactly when `--llm` is most useful for cleaning up prose noise. The uncertain candidates never reached the refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.** `_collect_contexts` used substring matching on lower-cased lines, so the name "Go" matched "good", "going", "forgot". Switched to a `(?<!\w)…(?!\w)` regex so short names only match at token boundaries.

**Authoritative-source detection replaces confidence threshold.** Previously the refinement step skipped entries with `confidence >= 0.95` to avoid second-guessing manifest-backed projects. That threshold was fragile: the regex detector produces 0.99 confidence for things like `code file reference (5x)` on framework names (OpenAPI, etc.), so those skipped the LLM despite being regex-only noise. New helpers `_is_authoritative_person` / `_is_authoritative_project` look at the actual signal strings (commits, package.json, etc.) to decide.

**Now also refines regex-derived people.** After #1148's high-pronoun-signal fix, the regex detector can promote non-people to the `people` bucket (e.g. a capitalized common noun that happened to appear near pronouns). The LLM now gets a chance to clean those up, while git-authored people are still skipped.

**Robust JSON extraction.** Small local models routinely wrap JSON output in prose ("Sure, here's the classification: {…}"). The previous code-fence stripper failed on that. `_extract_json_candidates` now does balanced-bracket extraction with string-aware quote handling, so it recovers JSON from:

- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays

**Prompt guidance for frameworks vs user projects.** Added an explicit instruction: frameworks, runtimes, APIs, cloud services, and third-party vendors (Angular, OpenAPI, Terraform, Bun, Google, etc.) are TOPIC unless the context clearly says it's the user's own codebase. Directly addresses a false-positive pattern observed during dev runs.

**Defensive mtime.** `convo_scanner._safe_mtime` catches OSError during `stat()` (permission changes, filesystem races, broken symlinks) and sorts the affected file to the end of the newest-first order rather than crashing the scan.

**Cosmetic:** merged two adjacent f-strings on the same line in `backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON extraction variants, authoritative-source helpers, refining high-confidence regex projects, and end-to-end LLM refinement preserving the uncertain bucket.
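To illustrate two of the changes described above, here is a minimal sketch of token-boundary name matching and balanced-bracket JSON recovery. It is not the code from this commit: the helper names `matches_name`, `find_balanced_end`, and `extract_json_values` are hypothetical, and using `re.IGNORECASE` rather than lower-casing the line is an assumption made for brevity.

```python
import json
import re


def matches_name(name, line):
    """Return True if `name` occurs in `line` as a whole token (hypothetical helper)."""
    # The lookarounds reject a match that touches a word character on either
    # side, so "Go" is not found inside "good" or "forgot".
    pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
    return re.search(pattern, line, re.IGNORECASE) is not None


def find_balanced_end(text, start):
    """Return the index closing the bracket opened at `start`, or None."""
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string, brackets are plain characters.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                return i
    return None


def extract_json_values(text):
    """Parse every balanced object or array in `text` that is valid JSON."""
    values = []
    i = 0
    while i < len(text):
        if text[i] in "{[":
            end = find_balanced_end(text, i)
            if end is not None:
                try:
                    values.append(json.loads(text[i : end + 1]))
                    i = end + 1  # resume after the parsed value
                    continue
                except json.JSONDecodeError:
                    pass  # not JSON after all; keep scanning
        i += 1
    return values


# A reply shaped like the one quoted in the commit message.
reply = (
    "Sure, here's the classification: "
    '{"classifications": [{"name": "Noise", "label": "COMMON_WORD"}]}'
)
values = extract_json_values(reply)
```

The scanner only tracks nesting depth and leaves validation to `json.loads`, so mismatched bracket pairs are rejected at parse time instead of by the scanner.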
@@ -3,6 +3,7 @@
 import json
 import subprocess
 from pathlib import Path
+from types import SimpleNamespace

 from mempalace.project_scanner import (
     PersonInfo,
@@ -390,6 +391,49 @@ def test_discover_entities_prefers_real_signal_over_prose(tmp_path):
     assert "realproj" in proj_names


+def test_discover_entities_keeps_uncertain_for_llm_when_real_signal(tmp_path):
+    """With --llm, regex-uncertain prose candidates should reach refinement."""
+    (tmp_path / "package.json").write_text(json.dumps({"name": "realproj"}))
+    _init_git_repo(tmp_path)
+    (tmp_path / "doc.md").write_text("Noise appeared. Noise repeated. Noise again.")
+
+    class FakeProvider:
+        def __init__(self):
+            self.prompts = []
+
+        def classify(self, _system, user, json_mode=True):
+            self.prompts.append(user)
+            return SimpleNamespace(
+                text='{"classifications": [{"name": "Noise", "label": "COMMON_WORD"}]}'
+            )
+
+    provider = FakeProvider()
+    d = discover_entities(str(tmp_path), llm_provider=provider, show_progress=False)
+
+    assert len(provider.prompts) == 1
+    assert "Noise" in provider.prompts[0]
+    assert "Noise" not in [e["name"] for cat in d.values() for e in cat]
+
+
+def test_discover_entities_keeps_llm_only_project_uncertain_when_real_signal(tmp_path):
+    """Repo roots should not auto-promote LLM-only tools/topics into projects."""
+    (tmp_path / "package.json").write_text(json.dumps({"name": "realproj"}))
+    _init_git_repo(tmp_path)
+    (tmp_path / "doc.md").write_text("Terraform shipped. Terraform changed. Terraform runs.")
+
+    class FakeProvider:
+        def classify(self, _system, _user, json_mode=True):
+            return SimpleNamespace(
+                text='{"classifications": [{"name": "Terraform", "label": "PROJECT"}]}'
+            )
+
+    d = discover_entities(str(tmp_path), llm_provider=FakeProvider(), show_progress=False)
+
+    assert "realproj" in [e["name"] for e in d["projects"]]
+    assert "Terraform" not in [e["name"] for e in d["projects"]]
+    assert "Terraform" in [e["name"] for e in d["uncertain"]]
+
+
 # ── _UnionFind basics ──────────────────────────────────────────────────