Merges develop (closet hardening #826, strip_noise #785, lock #784) and
replaces every sub-feature in this PR with a correct, tested
implementation. Shippable now.
## 1. Real Okapi-BM25 (searcher.py)
The prior `_bm25_score()` hardcoded `idf = log(2.0)` for every term, so
it was really a scaled TF, not BM25, and couldn't tell a discriminative
term from a generic one. Replaced with `_bm25_scores(query, documents)`,
which computes proper IDF over the provided candidate corpus using the
Lucene smoothed formula `log((N - df + 0.5) / (df + 0.5) + 1)`. This is
well-defined for re-ranking vector-retrieval candidates: IDF there
measures how discriminative each term is *within the candidate set*,
exactly the signal we want.
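
For reviewers, a minimal sketch of candidate-set BM25 with the smoothed
IDF quoted above. It is illustrative only and not the code in
searcher.py: the `tokenize` helper, the function name, and the `k1` /
`b` defaults are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase word tokenizer, for illustration only.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query.

    IDF is computed over `documents` only, i.e. the candidate set.
    k1 and b are conventional defaults, assumed here.
    """
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    if n == 0:
        return []
    avgdl = (sum(len(d) for d in docs) / n) or 1.0
    terms = set(tokenize(query))
    # Document frequency of each query term within the candidate set.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            # Lucene smoothed IDF: always positive, small when df is near N.
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With three candidates that all contain "the" and only one that contains
"cat", the query "the cat" scores the "cat" document highest, and "the"
contributes only a small, equal amount to each.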
`_hybrid_rank` also fixed:
- Vector normalization is now absolute `max(0, 1 - dist)`, not
`1 - dist/max_dist` — adding/removing a candidate no longer reshuffles
the others.
- BM25 is min-max normalized within candidates (bounded [0, 1]).
- Closet path now re-ranks too (was previously returning closet-order
hits without hybrid scoring).
- `_hybrid_score` internal field stripped from output; `bm25_score`
exposed for debugging.
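
The two normalizations above can be sketched as follows. This is a
hedged illustration, not the `_hybrid_rank` implementation: the
function names, the candidate dict shape, and the 0.6 / 0.4 weights are
assumptions.

```python
def vector_similarity(distance):
    # Absolute normalization: depends only on this candidate's distance,
    # so adding or removing other candidates cannot change it.
    return max(0.0, 1.0 - distance)


def hybrid_rank(candidates, vector_weight=0.6, bm25_weight=0.4):
    """Rank candidates, each a dict with 'distance' and 'bm25_score'.

    The weights are illustrative assumptions.
    """
    if not candidates:
        return []
    raw = [c["bm25_score"] for c in candidates]
    lo, hi = min(raw), max(raw)
    span = hi - lo
    ranked = []
    for c in candidates:
        vec = vector_similarity(c["distance"])
        # Min-max within the candidate set keeps BM25 bounded to [0, 1].
        bm25 = (c["bm25_score"] - lo) / span if span > 0 else 0.0
        ranked.append((vector_weight * vec + bm25_weight * bm25, c))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked]
```

Under the old `1 - dist/max_dist` scheme, one far-away outlier would
raise `max_dist` and shift every other candidate's vector score; here
the vector term for a given hit is the same in any candidate set.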
## 2. Entity metadata (miner.py)
- Reuses `_ENTITY_STOPLIST` from palace.py so sentence-starters like
"When", "After", "The" no longer land as entities (regression test
covers this).
- Known-entity registry is cached at module level, keyed by the
registry file's mtime — no more disk read per drawer.
- File handle now uses a context manager.
- Truncates the entity LIST (to 25) before joining — never splits a
name in the middle.
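
A rough sketch of the mtime-keyed cache and the cap-before-join
behaviour described above. It is not the miner.py code: the function
names, the JSON-list registry format, and the `;` separator are
assumptions.

```python
import json
import os

# Module-level cache: one registry, keyed by path and mtime.
_registry_cache = {"path": None, "mtime": None, "names": frozenset()}


def load_known_entities(path):
    """Return known entity names, re-reading only when the file changes.

    Assumption: the registry is a JSON list of names.
    """
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return frozenset()
    if _registry_cache["path"] == path and _registry_cache["mtime"] == mtime:
        return _registry_cache["names"]
    # Context manager ensures the handle is closed even on parse errors.
    with open(path, encoding="utf-8") as fh:
        names = frozenset(json.load(fh))
    _registry_cache.update(path=path, mtime=mtime, names=names)
    return names


def join_entities(entities, limit=25, sep=";"):
    # Cap the list first, then join: a name is either kept whole or dropped.
    return sep.join(list(entities)[:limit])
```

Truncating the joined string instead would cut at a character offset
and could leave a partial name as the last token.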
## 3. Diary ingest (diary_ingest.py)
- State file now lives at `~/.mempalace/state/diary_ingest_<hash>.json`,
keyed by (palace_path, diary_dir). No more pollution of the user's
content directory.
- Drawer IDs now hash `(wing, date_str)`, so a user with personal and
work diaries on the same day no longer has one entry silently clobber
the other.
- Each day's upsert runs inside `mine_lock(source_file)` so concurrent
ingest from two terminals can't race.
- `force=True` now calls `purge_file_closets` before rebuild so
leftover numbered closets from a longer prior day are not left orphaned.
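
To illustrate the state-file location and the wing-prefixed IDs, a
small sketch under stated assumptions: the function names, SHA-256,
the digest lengths, and the `diary_` ID prefix are all hypothetical and
may differ from diary_ingest.py.

```python
import hashlib
from pathlib import Path


def state_file_path(palace_path, diary_dir, state_root=None):
    """State file keyed by (palace_path, diary_dir), outside the diary dir.

    Assumption: SHA-256 truncated to 16 hex chars; the actual hash may differ.
    """
    root = Path(state_root) if state_root else Path.home() / ".mempalace" / "state"
    key = f"{Path(palace_path).resolve()}|{Path(diary_dir).resolve()}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return root / f"diary_ingest_{digest}.json"


def diary_drawer_id(wing, date_str):
    # Hashing wing together with the date keeps same-day entries from
    # different wings on distinct IDs.
    digest = hashlib.sha256(f"{wing}|{date_str}".encode("utf-8")).hexdigest()[:24]
    return f"diary_{digest}"
```

Two diaries for the same date in wings `personal` and `work` get
different drawer IDs, while re-ingesting the same wing and date maps to
the same ID and therefore upserts in place.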
## 4. Tests (tests/test_closets.py)
Merged this PR's MineLock/Entity/BM25/Diary tests with develop's
hardened Build/Upsert/Purge/Rebuild/SearchClosetFirst tests. Added
specific regression tests for every fix above:
- entity stoplist applies (no "When/After/The")
- entity list capped before join (no partial tokens)
- registry cached by mtime (mock-verified zero re-reads)
- BM25 IDF downweights terms present in every doc (real BM25 evidence)
- hybrid rank absolute normalization stable against outliers
- diary state file outside user's diary dir
- diary wing-prefixed IDs prevent cross-wing date collisions
35/35 closet tests pass; full suite 743/743. ruff + format clean under
CI-pinned 0.4.x.
Merges develop (#820 version sync, #785 strip_noise + NORMALIZE_VERSION,
#784 file locking) and addresses six concerns surfaced during PR review
of the closet feature:
1. Closet append-on-rebuild bug — upsert_closet_lines used to APPEND to
existing closets (mismatched the doc's "fully replaced" promise). With
NORMALIZE_VERSION rebuilds on develop, this would have stacked stale
v1 topics on top of fresh v2 content forever. Fix:
- Drop the read-and-append branch from upsert_closet_lines (now a pure
numbered-id overwrite).
- Add purge_file_closets(closets_col, source_file) helper that wipes
every closet for a source file by where-filter.
- process_file calls purge_file_closets before upsert on every mine,
mirroring the existing drawer purge.
2. Searcher returned whole-file blobs from the closet path while the
direct path returned chunk-level drawers. Refactored:
- _extract_drawer_ids_from_closet parses the `→drawer_a,drawer_b`
pointers out of closet documents.
- _closet_first_hits hydrates exactly those drawer IDs (chunk-level),
not collection.get(where=source_file) (which returned everything).
- Same hit shape as direct-search path; both now carry matched_via.
3. max_distance was bypassed on the closet path. Now applied per-hit;
when every closet candidate gets filtered, _closet_first_hits returns
None and the caller falls through to direct drawer search.
4. Entity extraction caught sentence-starters like "When", "The",
"After" as proper nouns. Added _ENTITY_STOPLIST (~40 common false
positives + day/month names + role words). Real names like Igor /
Milla still survive — covered by tests.
5. CLOSETS.md drifted from the code (claimed "replaced via upsert" but
code appended; claimed BM25 hybrid that doesn't exist; claimed a
10K char hydration cap that wasn't enforced). Rewritten to describe
what actually ships, with explicit notes on the BM25 / convo-closet
follow-ups.
6. Zero tests for ~250 lines of closet code. Added tests/test_closets.py
with 17 cases:
- build_closet_lines: pointer shape, header extraction, stoplist
filtering (with regression case for "When/After/The"), real-name
survival, fallback-line guarantee, drawer-ref slicing.
- upsert_closet_lines: pure overwrite semantics (regression for the
append bug), char-limit packing without splitting lines.
- purge_file_closets: scoped to source_file, doesn't touch others.
- End-to-end miner rebuild: re-mining a file with fewer topics fully
purges leftover numbered closets from the larger first run.
- _extract_drawer_ids_from_closet: parsing + dedup edge cases.
- search_memories closet-first: fallback when empty, chunk-level
hits with matched_via, no whole-file glue, max_distance enforced.
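
For items 2 and 3 above, a hedged sketch of pointer parsing and the
distance filter. It is not the searcher.py code: the function names,
the hit dict shape, and everything about the closet line format beyond
the `→drawer_a,drawer_b` pointer are assumptions.

```python
def extract_drawer_ids(closet_doc):
    """Collect drawer IDs from `→id_a,id_b` pointers, deduped, in order."""
    seen = {}
    for line in closet_doc.splitlines():
        if "→" not in line:
            continue
        # Everything after the last arrow is the comma-separated pointer list.
        pointers = line.rsplit("→", 1)[1]
        for drawer_id in pointers.split(","):
            drawer_id = drawer_id.strip()
            if drawer_id:
                seen.setdefault(drawer_id, None)
    return list(seen)


def filter_by_distance(hits, max_distance):
    # Returning None when nothing survives lets the caller fall through
    # to direct drawer search.
    kept = [h for h in hits if h["distance"] <= max_distance]
    return kept or None
```

Hydrating only the extracted IDs is what keeps closet-path hits at
chunk level instead of returning every drawer for the source file.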
Merge resolutions: miner.py imports combined NORMALIZE_VERSION/mine_lock
from develop with the closet helpers from this branch. process_file
auto-merged cleanly (closet block sits inside develop's lock body).
724/724 tests pass. ruff + format clean under CI-pinned 0.4.x.
Trimmed version of Milla's omnibus test_closets.py to only cover
features present in this PR stack (#784 lock, #788 closets, this
PR's entity/BM25/diary). Strip-noise tests will land with #785;
tunnel tests will land with the tunnels PR.
16/16 pass.
Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>