035fe6d658
Addresses issues found while reviewing the initial phase-2 implementation against real data:

**Bug: uncertain bucket starved from the LLM.** `discover_entities` was dropping the regex-uncertain bucket whenever real git/manifest signal existed, which is exactly when `--llm` is most useful for cleaning up prose noise. The uncertain candidates never reached the refinement step. Fixed: only drop when `llm_provider is None`.

**Context collection: word boundaries, not substring.** `_collect_contexts` used substring matching on lower-cased lines, so the name "Go" matched "good", "going", "forgot". Switched to a `(?<!\w)…(?!\w)` regex so short names only match at token boundaries.

**Authoritative-source detection replaces confidence threshold.** Previously the refinement step skipped entries with `confidence >= 0.95` to avoid second-guessing manifest-backed projects. That threshold was fragile: the regex detector produces 0.99 confidence for things like `code file reference (5x)` on framework names (OpenAPI, etc.), so those skipped the LLM despite being regex-only noise. New helpers `_is_authoritative_person` / `_is_authoritative_project` look at the actual signal strings (commits, package.json, etc.) to decide.

**Now also refines regex-derived people.** After #1148's high-pronoun-signal fix, the regex detector can promote non-people to the `people` bucket (e.g. a capitalized common noun that happened to appear near pronouns). The LLM now gets a chance to clean those up, while git-authored people are still skipped.

**Robust JSON extraction.** Small local models routinely wrap JSON output in prose ("Sure, here's the classification: {…}"). The previous code-fence stripper failed on that. `_extract_json_candidates` now does balanced-bracket extraction with string-aware quote handling, so it recovers JSON from:

- raw responses
- markdown fenced blocks
- JSON embedded inside surrounding text
- multiple candidate objects/arrays

**Prompt guidance for frameworks vs user projects.** Added an explicit instruction: frameworks, runtimes, APIs, cloud services, and third-party vendors (Angular, OpenAPI, Terraform, Bun, Google, etc.) are TOPIC unless the context clearly says it's the user's own codebase. Directly addresses a false-positive pattern observed during dev runs.

**Defensive mtime.** `convo_scanner._safe_mtime` catches OSError during `stat()` (permission changes, filesystem races, broken symlinks) and sorts the affected file to the end of the newest-first order rather than crashing the scan.

**Cosmetic:** merged two adjacent f-strings on the same line in `backends/chroma.py` and `llm_client.py` (no behaviour change).

15 new tests cover the OSError fallback, word-boundary matching, JSON extraction variants, authoritative-source helpers, refining high-confidence regex projects, and end-to-end LLM refinement preserving the uncertain bucket.
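
The word-boundary matching described above can be illustrated with a minimal sketch. The helper names `name_pattern` and `lines_mentioning` and the `sample` lines are hypothetical and are not taken from `_collect_contexts`; using `re.IGNORECASE` in place of lower-casing the lines is also an assumption.

```python
from __future__ import annotations

import re


def name_pattern(name: str) -> re.Pattern:
    """Compile a pattern matching `name` only at token boundaries.

    Hypothetical helper for illustration; not the project's actual code.
    """
    # re.escape keeps names with regex metacharacters (e.g. "C++") literal.
    # The lookarounds require that no word character touches either side.
    return re.compile(rf"(?<!\w){re.escape(name)}(?!\w)", re.IGNORECASE)


def lines_mentioning(name: str, lines: list[str]) -> list[str]:
    """Return the lines in which `name` appears as a standalone token."""
    pattern = name_pattern(name)
    return [line for line in lines if pattern.search(line)]


sample = [
    "Go is a good fit for this service.",
    "We forgot to update the docs.",
    "Still going through the backlog.",
    "Rewrote the CLI in go last week.",
]

# Plain substring matching on lower-cased lines hits all four lines.
substring_hits = [line for line in sample if "go" in line.lower()]

# Boundary-aware matching keeps only the lines that mention "Go" as a token.
matches = lines_mentioning("Go", sample)
```

Unlike `\b`, the lookaround form also behaves sensibly for names that begin or end with a non-word character: in this sketch `name_pattern("C++")` matches "Uses C++ heavily" but not "C++11".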
161 lines
5.1 KiB
Python
"""
convo_scanner.py: Parse Claude Code conversation directories into ProjectInfo.

Claude Code stores sessions under ``~/.claude/projects/<slug>/<id>.jsonl``,
where the ``<slug>`` is the original CWD with ``/`` replaced by ``-``. That
encoding is lossy: we can't tell whether ``foo-bar`` in a slug is the
literal project name ``foo-bar`` or two path segments ``foo/bar``.

Fortunately, every message record in the JSONL carries a ``cwd`` field with
the true path. This scanner reads one record per session to recover the
accurate project name, falling back to slug-decoding only if the JSONL
is malformed or empty.

Output is the same ``ProjectInfo`` shape used by ``project_scanner``, so the
``discover_entities`` orchestrator can mix-and-match sources.

Public:
    is_claude_projects_root(path) -> bool
    scan_claude_projects(path) -> list[ProjectInfo]
"""

from __future__ import annotations

import json
from pathlib import Path
from typing import Optional

from mempalace.project_scanner import ProjectInfo


MAX_HEADER_LINES = 20  # lines to read per session looking for `cwd`


def is_claude_projects_root(path: Path) -> bool:
    """Return True if path looks like `.claude/projects/`.

    Heuristic: at least one child dir whose name starts with ``-`` and which
    contains at least one ``.jsonl`` file.
    """
    if not path.is_dir():
        return False
    try:
        children = list(path.iterdir())
    except OSError:
        return False
    for child in children:
        if not (child.is_dir() and child.name.startswith("-")):
            continue
        try:
            if any(p.suffix == ".jsonl" for p in child.iterdir() if p.is_file()):
                return True
        except OSError:
            continue
    return False


def _extract_cwd_from_session(session_file: Path) -> Optional[str]:
    """Return the ``cwd`` from the first message record that carries one.

    Only the first ``MAX_HEADER_LINES`` lines are examined. Returns None if
    the file can't be read, has no JSON, or none of those records has cwd.
    """
    try:
        with open(session_file, encoding="utf-8", errors="replace") as f:
            for i, line in enumerate(f):
                if i >= MAX_HEADER_LINES:
                    break
                line = line.strip()
                if not line:
                    continue
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue
                # A valid JSON line may be a list, string, or number; only
                # objects can carry a ``cwd`` key.
                if not isinstance(obj, dict):
                    continue
                cwd = obj.get("cwd")
                if isinstance(cwd, str) and cwd:
                    return cwd
    except OSError:
        return None
    return None


def _decode_slug_fallback(slug: str) -> str:
    """Best-effort project name from slug when cwd is unavailable.

    The slug is lossy (`/` and `-` both become `-`). The last non-empty
    segment is the closest guess at the project name; preserving kebab-case
    is impossible without cwd.
    """
    stripped = slug.lstrip("-")
    parts = [p for p in stripped.split("-") if p]
    return parts[-1] if parts else slug


def _safe_mtime(path: Path) -> float:
    """Return file mtime, or 0.0 (treated as oldest) on permission or filesystem errors."""
    try:
        return path.stat().st_mtime
    except OSError:
        return 0.0


def _resolve_project_name(project_dir: Path) -> str:
    """Read one session's cwd to recover the original project name.

    Falls back to slug-decoding if no session has a readable cwd.
    """
    sessions = sorted(
        (p for p in project_dir.iterdir() if p.is_file() and p.suffix == ".jsonl"),
        key=_safe_mtime,
        reverse=True,  # newest first: most likely to be well-formed
    )
    for session in sessions:
        cwd = _extract_cwd_from_session(session)
        if cwd:
            return Path(cwd).name or cwd
    return _decode_slug_fallback(project_dir.name)


def scan_claude_projects(path: str | Path) -> list[ProjectInfo]:
    """Scan a ``.claude/projects/`` directory for Claude Code conversations.

    One ProjectInfo per subdir. ``has_git`` is False (the directory isn't a
    repo itself) but ``total_commits`` is repurposed here as session count so
    the UX surfaces a density signal for ranking.
    """
    root = Path(path).expanduser().resolve()
    if not is_claude_projects_root(root):
        return []

    projects: dict[str, ProjectInfo] = {}
    for sub in sorted(root.iterdir()):
        if not (sub.is_dir() and sub.name.startswith("-")):
            continue
        try:
            sessions = [p for p in sub.iterdir() if p.is_file() and p.suffix == ".jsonl"]
        except OSError:
            continue
        if not sessions:
            continue

        name = _resolve_project_name(sub)
        session_count = len(sessions)

        proj = ProjectInfo(
            name=name,
            repo_root=sub,
            manifest=None,
            has_git=False,
            total_commits=session_count,
            user_commits=session_count,
            is_mine=True,  # Claude Code sessions are authored by the user
        )
        existing = projects.get(name)
        if existing is None or session_count > existing.user_commits:
            projects[name] = proj

    return sorted(
        projects.values(),
        key=lambda p: (-p.user_commits, p.name),
    )