mempalace

jason/mempalace

Commit Graph


Author

SHA1

Message

Date

Igor Lins e Silva

32d7f4376b

merge: develop + harden entity metadata, BM25, and diary ingest for production

Merges develop (closet hardening #826, strip_noise #785, lock #784) and
replaces every sub-feature in this PR with a correct, tested
implementation. Shippable now.

## 1. Real Okapi-BM25 (searcher.py)

The prior `_bm25_score()` hardcoded `idf = log(2.0)` for every term — it
was really a scaled TF, not BM25, and couldn't tell a discriminative
term from a generic one. Replaced with `_bm25_scores(query, documents)`
that computes proper IDF over the provided candidate corpus using the
Lucene smoothed formula `log((N - df + 0.5) / (df + 0.5) + 1)`. This is
well-defined for re-ranking vector-retrieval candidates: IDF there measures
how discriminative each term is *within the candidate set*, exactly the
signal we want.
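
A minimal sketch of candidate-set BM25 with the Lucene smoothed IDF quoted above. This is not the code from searcher.py: the tokenizer, the `k1`/`b` defaults, and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the real tokenizer is not shown)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    IDF is computed over `documents` only, i.e. over the candidate set.
    """
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    if n == 0:
        return []
    # Average document length; fall back to 1.0 if every candidate is empty.
    avgdl = sum(len(d) for d in docs) / n or 1.0
    terms = set(tokenize(query))
    # Document frequency: how many candidates contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            # Lucene smoothed IDF: always positive, small for ubiquitous terms.
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With the candidates `["the cat sat", "the dog ran", "the bird flew"]` and the query `"the cat"`, the term "the" appears in every candidate and contributes little, so the first document outranks the other two.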

`_hybrid_rank` is also fixed:
- Vector normalization is now absolute `max(0, 1 - dist)`, not
  `1 - dist/max_dist` — adding/removing a candidate no longer reshuffles
  the others.
- BM25 is min-max normalized within candidates (bounded [0, 1]).
- Closet path now re-ranks too (was previously returning closet-order
  hits without hybrid scoring).
- `_hybrid_score` internal field stripped from output; `bm25_score`
  exposed for debugging.
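
The normalization rules in the list above can be sketched as follows. The 0.6/0.4 weights and the hit dictionary shape (`distance`, `bm25_score`) are assumptions; only the two normalization formulas and the stripping of `_hybrid_score` come from the commit message.

```python
def vector_similarity(distance):
    """Absolute normalization: depends only on this hit's own distance."""
    return max(0.0, 1.0 - distance)


def hybrid_rank(hits, vector_weight=0.6, bm25_weight=0.4):
    """Re-rank hits that carry `distance` and `bm25_score` (weights are assumed)."""
    if not hits:
        return []
    raw = [h["bm25_score"] for h in hits]
    lo, hi = min(raw), max(raw)
    span = hi - lo
    ranked = []
    for h in hits:
        # Min-max normalize BM25 within the candidates, bounded to [0, 1].
        bm25 = (h["bm25_score"] - lo) / span if span > 0 else 0.0
        scored = dict(h)  # copy so the caller's hit is not mutated
        scored["_hybrid_score"] = (
            vector_weight * vector_similarity(h["distance"]) + bm25_weight * bm25
        )
        ranked.append(scored)
    ranked.sort(key=lambda h: h["_hybrid_score"], reverse=True)
    for h in ranked:
        # Internal field is stripped; bm25_score stays visible for debugging.
        del h["_hybrid_score"]
    return ranked
```

Because `vector_similarity` never looks at the other candidates, adding an outlier with a large distance leaves every other hit's vector component unchanged, which is the stability property the first bullet describes.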

## 2. Entity metadata (miner.py)

- Reuses `_ENTITY_STOPLIST` from palace.py so sentence-starters like
  "When", "After", "The" no longer land as entities (regression test
  covers this).
- Known-entity registry is cached at module level, keyed by the
  registry file's mtime — no more disk read per drawer.
- File handle now uses a context manager.
- Truncates the entity LIST (to 25) before joining — never splits a
  name in the middle.
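
A hedged sketch of the two mechanical fixes above: an mtime-keyed module-level cache and capping the list before joining. The registry format (a JSON array of names), the `;` separator, and both function names are assumptions, not the actual miner.py code.

```python
import json
import os

# Module-level cache: one entry, invalidated when the path or mtime changes.
_REGISTRY_CACHE = {"path": None, "mtime": None, "names": frozenset()}


def load_known_entities(registry_path):
    """Return the known-entity set, re-reading the file only when its mtime changes."""
    try:
        mtime = os.path.getmtime(registry_path)
    except OSError:
        return frozenset()  # missing registry: no known entities
    cache = _REGISTRY_CACHE
    if cache["path"] != registry_path or cache["mtime"] != mtime:
        # Context manager guarantees the handle is closed.
        with open(registry_path, encoding="utf-8") as fh:
            names = frozenset(json.load(fh))
        cache.update(path=registry_path, mtime=mtime, names=names)
    return cache["names"]


def join_entities(entities, limit=25, sep=";"):
    """Cap the list first, then join, so no name is cut mid-token."""
    return sep.join(entities[:limit])
```

Truncating the joined string instead (for example `joined[:200]`) could end on half a name; slicing the list avoids that by construction.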

## 3. Diary ingest (diary_ingest.py)

- State file now lives at `~/.mempalace/state/diary_ingest_<hash>.json`,
  keyed by (palace_path, diary_dir). No more pollution of the user's
  content directory.
- Drawer IDs now hash `(wing, date_str)`, so a user with personal + work
  diaries on the same day no longer silently overwrites one with the other.
- Each day's upsert runs inside `mine_lock(source_file)` so concurrent
  ingest from two terminals can't race.
- `force=True` now calls `purge_file_closets` before rebuild so
  leftover numbered closets from a longer prior day don't orphan.
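
One way the state-file location and the wing-aware drawer IDs could be derived is sketched below. The hash function (SHA-256), the 16-character digest length, the `diary_` ID prefix, and the function names are all assumptions; the commit message only states which inputs are hashed and where the state file lives.

```python
import hashlib
from pathlib import Path


def state_file_path(palace_path, diary_dir, home=None):
    """Build a per-(palace, diary dir) state path under ~/.mempalace/state/."""
    home = Path(home) if home is not None else Path.home()
    # NUL separator keeps ("a", "bc") and ("ab", "c") from hashing alike.
    key = f"{Path(palace_path)}\0{Path(diary_dir)}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return home / ".mempalace" / "state" / f"diary_ingest_{digest}.json"


def diary_drawer_id(wing, date_str):
    """Hash (wing, date_str) so two wings never share an ID for the same day."""
    digest = hashlib.sha256(f"{wing}|{date_str}".encode("utf-8")).hexdigest()[:16]
    return f"diary_{digest}"
```

Under this scheme `diary_drawer_id("personal", "2026-04-13")` and `diary_drawer_id("work", "2026-04-13")` differ, while repeated ingests of the same wing and date map to the same drawer and therefore upsert in place.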

## 4. Tests (tests/test_closets.py)

Merged this PR's MineLock/Entity/BM25/Diary tests with develop's
hardened Build/Upsert/Purge/Rebuild/SearchClosetFirst tests. Added
specific regression tests for every fix above:
- entity stoplist applies (no "When/After/The")
- entity list capped before join (no partial tokens)
- registry cached by mtime (mock-verified zero re-reads)
- BM25 IDF downweights terms present in every doc (real BM25 evidence)
- hybrid rank absolute normalization stable against outliers
- diary state file outside user's diary dir
- diary wing-prefixed IDs prevent cross-wing date collisions

35/35 closet tests pass; full suite 743/743. ruff + format clean under
CI-pinned 0.4.x.

2026-04-13 17:37:45 -03:00

Igor Lins e Silva

21d4a23430

merge: develop + harden closet layer for production

Merges develop (#820 version sync, #785 strip_noise + NORMALIZE_VERSION,
#784 file locking) and addresses six concerns surfaced during PR review
of the closet feature:

1. Closet append-on-rebuild bug — upsert_closet_lines used to APPEND to
   existing closets (mismatched the doc's "fully replaced" promise). With
   NORMALIZE_VERSION rebuilds on develop, this would have stacked stale
   v1 topics on top of fresh v2 content forever. Fix:
   - Drop the read-and-append branch from upsert_closet_lines (now a pure
     numbered-id overwrite).
   - Add purge_file_closets(closets_col, source_file) helper that wipes
     every closet for a source file by where-filter.
   - process_file calls purge_file_closets before upsert on every mine,
     mirroring the existing drawer purge.

2. Searcher returned whole-file blobs from the closet path while the
   direct path returned chunk-level drawers. Refactored:
   - _extract_drawer_ids_from_closet parses the `→drawer_a,drawer_b`
     pointers out of closet documents.
   - _closet_first_hits hydrates exactly those drawer IDs (chunk-level),
     not collection.get(where=source_file) (which returned everything).
   - Same hit shape as direct-search path; both now carry matched_via.

3. max_distance was bypassed on the closet path. Now applied per-hit;
   when every closet candidate gets filtered, _closet_first_hits returns
   None and the caller falls through to direct drawer search.

4. Entity extraction caught sentence-starters like "When", "The",
   "After" as proper nouns. Added _ENTITY_STOPLIST (~40 common false
   positives + day/month names + role words). Real names like Igor /
   Milla still survive — covered by tests.

5. CLOSETS.md drifted from the code (claimed "replaced via upsert" but
   code appended; claimed BM25 hybrid that doesn't exist; claimed a
   10K char hydration cap that wasn't enforced). Rewritten to describe
   what actually ships, with explicit notes on the BM25 / convo-closet
   follow-ups.

6. Zero tests for ~250 lines. Added tests/test_closets.py with 17 cases:
   - build_closet_lines: pointer shape, header extraction, stoplist
     filtering (with regression case for "When/After/The"), real-name
     survival, fallback-line guarantee, drawer-ref slicing.
   - upsert_closet_lines: pure overwrite semantics (regression for the
     append bug), char-limit packing without splitting lines.
   - purge_file_closets: scoped to source_file, doesn't touch others.
   - End-to-end miner rebuild: re-mining a file with fewer topics fully
     purges leftover numbered closets from the larger first run.
   - _extract_drawer_ids_from_closet: parsing + dedup edge cases.
   - search_memories closet-first: fallback when empty, chunk-level
     hits with matched_via, no whole-file glue, max_distance enforced.
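
Items 2 and 3 above can be illustrated with a small sketch. It assumes each closet line has the form `topic text→drawer_a,drawer_b` with a single arrow per line, and that a hit is a dict with a `distance` key; the function names and the inclusive `<=` comparison are hypothetical.

```python
def extract_drawer_ids(closet_doc):
    """Collect drawer IDs after each '→' pointer, de-duplicated in first-seen order."""
    seen = []
    for line in closet_doc.splitlines():
        _, arrow, refs = line.partition("→")
        if not arrow:
            continue  # line carries no pointer
        for ref in refs.split(","):
            ref = ref.strip()
            if ref and ref not in seen:
                seen.append(ref)
    return seen


def filter_by_distance(hits, max_distance):
    """Keep hits within max_distance; return None when nothing survives."""
    kept = [h for h in hits if h["distance"] <= max_distance]
    return kept or None
```

Returning `None` rather than an empty list lets the caller tell "closet path produced nothing usable" apart from a normal result and fall through to direct drawer search.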

Merge resolutions: miner.py imports combined NORMALIZE_VERSION/mine_lock
from develop with the closet helpers from this branch. process_file
auto-merged cleanly (closet block sits inside develop's lock body).

724/724 tests pass. ruff + format clean under CI-pinned 0.4.x.
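
The purge-then-overwrite rebuild from item 1 can be sketched against an in-memory stand-in. `FakeCloset`, the closet ID scheme, and the 1500-character default are hypothetical and exist only for illustration; the real code operates on a closets collection via `purge_file_closets` and `upsert_closet_lines`.

```python
class FakeCloset:
    """In-memory stand-in for the closets collection (hypothetical)."""

    def __init__(self):
        self.docs = {}  # closet id -> (document, source_file)

    def upsert(self, closet_id, document, source_file):
        self.docs[closet_id] = (document, source_file)

    def delete_where_source(self, source_file):
        self.docs = {k: v for k, v in self.docs.items() if v[1] != source_file}


def pack_lines(lines, char_limit):
    """Group whole lines into chunks up to char_limit; never split a line.

    A single line longer than char_limit becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for line in lines:
        candidate = line if not current else current + "\n" + line
        if current and len(candidate) > char_limit:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def rebuild_closets(col, source_file, lines, char_limit=1500):
    """Purge every closet for the file, then write numbered closets afresh."""
    col.delete_where_source(source_file)
    for n, chunk in enumerate(pack_lines(lines, char_limit), start=1):
        col.upsert(f"closet_{source_file}_{n:02d}", chunk, source_file)
```

Without the purge step, re-mining a file that now yields one closet would leave the old `_02` and `_03` closets in place, which is the leftover-closet case the end-to-end rebuild test covers.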

2026-04-13 17:00:55 -03:00

Igor Lins e Silva

f72ffbbcb2

test: add tests for mine_lock, closets, entity metadata, BM25, diary

Trimmed version of Milla's omnibus test_closets.py to only cover
features present in this PR stack (#784 lock, #788 closets, this
PR's entity/BM25/diary). Strip-noise tests will land with #785;
tunnel tests will land with the tunnels PR.

16/16 pass.

Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>

2026-04-13 07:42:25 -03:00

3 Commits