`mempalace init` previously leaned entirely on regex-based entity
extraction from prose. That path works for text-only folders but wastes
signal in any codebase: the project's own name is already in
`package.json` / `pyproject.toml` / `Cargo.toml` / `go.mod`, and the
people who worked on it are in `git log`.
This adds `project_scanner.py`, which becomes the primary signal source
when real signal is available, with the regex detector preserved as the
fallback for prose-only folders (diaries, research notes, writing).
What it does:
- Walks the target directory, parses manifests for canonical project
names, and detects git repos by the presence of a `.git` directory.
- For each repo, reads `git log` for authors and filters obvious bots
(`[bot]`, `dependabot`, `renovate`, `github-actions`, names ending in
`bot`, `-autoroll`). Importantly does NOT filter
`@users.noreply.github.com` - that's GitHub's privacy-protected human
email, used by real contributors.
- Resolves author aliases with a union-find: commits that share a name
OR an email collapse into one person. Picks the most-frequent
real-name variant as display, ignoring handles and single-token
usernames.
- Flags "mine" projects: user is top-5 committer OR has >=10% of
commits OR >=20 commits. Ordered by user_commits in the UX.
- `discover_entities()` merges scanner results with the regex detector
case-insensitively (so `mempalace` from pyproject absorbs `MemPalace`
from docs), and suppresses the regex `uncertain` bucket when real
signal is already found - the user doesn't need to adjudicate prose
noise when the answer is already in git.
Integration: `cmd_init` now calls `discover_entities` instead of
running the regex detector directly. Same output shape, so
`confirm_entities` works unchanged.
Ships with 39 new tests covering manifest parsing, bot filtering,
union-find dedup, git repo discovery, scan integration, and
merge/fallback behavior. Existing 56 regex-detector tests all pass.