Merges develop (#820 version sync, #785 strip_noise + NORMALIZE_VERSION,
#784 file locking) and addresses six concerns surfaced during PR review
of the closet feature:
1. Closet append-on-rebuild bug — upsert_closet_lines used to APPEND to
existing closets (mismatched the doc's "fully replaced" promise). With
NORMALIZE_VERSION rebuilds on develop, this would have stacked stale
v1 topics on top of fresh v2 content forever. Fix:
- Drop the read-and-append branch from upsert_closet_lines (now a pure
numbered-id overwrite).
- Add purge_file_closets(closets_col, source_file) helper that wipes
every closet for a source file by where-filter.
- process_file calls purge_file_closets before upsert on every mine,
mirroring the existing drawer purge.
2. Searcher returned whole-file blobs from the closet path while the
direct path returned chunk-level drawers. Refactored:
- _extract_drawer_ids_from_closet parses the `→drawer_a,drawer_b`
pointers out of closet documents.
- _closet_first_hits hydrates exactly those drawer IDs (chunk-level),
not collection.get(where=source_file) (which returned everything).
- Same hit shape as direct-search path; both now carry matched_via.
3. max_distance was bypassed on the closet path. Now applied per-hit;
when every closet candidate gets filtered, _closet_first_hits returns
None and the caller falls through to direct drawer search.
4. Entity extraction caught sentence-starters like "When", "The",
"After" as proper nouns. Added _ENTITY_STOPLIST (~40 common false
positives + day/month names + role words). Real names like Igor /
Milla still survive — covered by tests.
5. CLOSETS.md drifted from the code (claimed "replaced via upsert" but
code appended; claimed BM25 hybrid that doesn't exist; claimed a
10K char hydration cap that wasn't enforced). Rewritten to describe
what actually ships, with explicit notes on the BM25 / convo-closet
follow-ups.
6. Zero tests for ~250 lines. Added tests/test_closets.py with 17 cases:
- build_closet_lines: pointer shape, header extraction, stoplist
filtering (with regression case for "When/After/The"), real-name
survival, fallback-line guarantee, drawer-ref slicing.
- upsert_closet_lines: pure overwrite semantics (regression for the
append bug), char-limit packing without splitting lines.
- purge_file_closets: scoped to source_file, doesn't touch others.
- End-to-end miner rebuild: re-mining a file with fewer topics fully
purges leftover numbered closets from the larger first run.
- _extract_drawer_ids_from_closet: parsing + dedup edge cases.
- search_memories closet-first: fallback when empty, chunk-level
hits with matched_via, no whole-file glue, max_distance enforced.
Merge resolutions: miner.py imports combined NORMALIZE_VERSION/mine_lock
from develop with the closet helpers from this branch. process_file
auto-merged cleanly (closet block sits inside develop's lock body).
724/724 tests pass. ruff + format clean under CI-pinned 0.4.x.
Cherry-picked the docs portion of 67e4ac6 to accompany the closet
feature. Test coverage for closets is omnibus with tests for entity
metadata and BM25 (see PR targeting those features) and will land
together in a follow-up.
Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>