21d4a23430
Merges develop (#820 version sync, #785 strip_noise + NORMALIZE_VERSION, #784 file locking) and addresses six concerns surfaced during PR review of the closet feature:

1. Closet append-on-rebuild bug: upsert_closet_lines used to APPEND to existing closets (mismatched the doc's "fully replaced" promise). With NORMALIZE_VERSION rebuilds on develop, this would have stacked stale v1 topics on top of fresh v2 content forever. Fix:
   - Drop the read-and-append branch from upsert_closet_lines (now a pure numbered-id overwrite).
   - Add purge_file_closets(closets_col, source_file) helper that wipes every closet for a source file by where-filter.
   - process_file calls purge_file_closets before upsert on every mine, mirroring the existing drawer purge.
2. Searcher returned whole-file blobs from the closet path while the direct path returned chunk-level drawers. Refactored:
   - _extract_drawer_ids_from_closet parses the `→drawer_a,drawer_b` pointers out of closet documents.
   - _closet_first_hits hydrates exactly those drawer IDs (chunk-level), not collection.get(where=source_file) (which returned everything).
   - Same hit shape as direct-search path; both now carry matched_via.
3. max_distance was bypassed on the closet path. Now applied per-hit; when every closet candidate gets filtered, _closet_first_hits returns None and the caller falls through to direct drawer search.
4. Entity extraction caught sentence-starters like "When", "The", "After" as proper nouns. Added _ENTITY_STOPLIST (~40 common false positives + day/month names + role words). Real names like Igor / Milla still survive (covered by tests).
5. CLOSETS.md drifted from the code (claimed "replaced via upsert" but code appended; claimed BM25 hybrid that doesn't exist; claimed a 10K char hydration cap that wasn't enforced). Rewritten to describe what actually ships, with explicit notes on the BM25 / convo-closet follow-ups.
6. Zero tests for ~250 lines. Added tests/test_closets.py with 17 cases:
   - build_closet_lines: pointer shape, header extraction, stoplist filtering (with regression case for "When/After/The"), real-name survival, fallback-line guarantee, drawer-ref slicing.
   - upsert_closet_lines: pure overwrite semantics (regression for the append bug), char-limit packing without splitting lines.
   - purge_file_closets: scoped to source_file, doesn't touch others.
   - End-to-end miner rebuild: re-mining a file with fewer topics fully purges leftover numbered closets from the larger first run.
   - _extract_drawer_ids_from_closet: parsing + dedup edge cases.
   - search_memories closet-first: fallback when empty, chunk-level hits with matched_via, no whole-file glue, max_distance enforced.

Merge resolutions: miner.py imports combined NORMALIZE_VERSION/mine_lock from develop with the closet helpers from this branch. process_file auto-merged cleanly (closet block sits inside develop's lock body).

724/724 tests pass. ruff + format clean under CI-pinned 0.4.x.
89 lines
4.5 KiB
Markdown
# Closets: The Searchable Index Layer

## What closets are

Drawers hold your verbatim content. Closets are the index: compact pointers that tell the searcher which drawers to open.

```
CLOSET: "built auth system|Ben;Igor|→drawer_api_auth_a1b2c3"
         ↑ topic           ↑ entities ↑ points to this drawer
```

An agent searching "who built the auth?" hits the closet first (fast scan of short text), then opens the referenced drawer to get the full verbatim content.
## Lifecycle

### When are closets created?

Closets are created during `mempalace mine`. For each file mined:

1. Content is chunked into drawers (verbatim, ~800 chars each).
2. Topics, entities, and quotes are extracted from the content.
3. One or more closets are created with pointer lines to those drawers.
### What's inside a closet?

Each line is one atomic topic pointer:

```
topic description|entity1;entity2|→drawer_id_1,drawer_id_2
"verbatim quote from the content"|entity1|→drawer_id_3
```
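The line format above can be taken apart with a few string splits. The following is a minimal sketch, not the shipped `_extract_drawer_ids_from_closet`; the `ClosetPointer` type and `parse_closet_line` helper are hypothetical names introduced only for illustration, and the sketch assumes every line carries exactly the three `|`-separated fields shown.

```python
from typing import NamedTuple


class ClosetPointer(NamedTuple):
    """Hypothetical container for the three fields of one closet line."""
    topic: str
    entities: list[str]
    drawer_ids: list[str]


def parse_closet_line(line: str) -> ClosetPointer:
    """Split 'topic|entity1;entity2|→id_a,id_b' into its three fields."""
    # Split from the right so a '|' inside the topic or quote text
    # does not shift the entity and pointer fields.
    topic, entities, pointer = line.rsplit("|", 2)
    ids = [d.strip() for d in pointer.strip().lstrip("→").split(",") if d.strip()]
    names = [e.strip() for e in entities.split(";") if e.strip()]
    return ClosetPointer(topic.strip(), names, ids)


p = parse_closet_line("built auth system|Ben;Igor|→drawer_api_auth_a1b2c3")
print(p.drawer_ids)  # ['drawer_api_auth_a1b2c3']
```

A line with fewer than two `|` separators raises `ValueError` in this sketch; real code would probably want to skip such lines instead.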
A topic line is never split across closets. If adding a line would push the current closet past 1,500 characters (`CLOSET_CHAR_LIMIT`), a new closet is started.
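The packing rule can be illustrated with a greedy loop. This is a sketch under assumptions, not the body of `upsert_closet_lines`: the helper name `pack_closet_lines` is hypothetical, and it assumes lines within a closet are joined by a single newline that counts toward the limit.

```python
def pack_closet_lines(lines: list[str], char_limit: int = 1500) -> list[str]:
    """Greedily pack whole lines into closet documents of at most char_limit chars."""
    closets: list[str] = []
    current: list[str] = []
    current_len = 0
    for line in lines:
        # +1 accounts for the newline joining this line to the previous one.
        added = len(line) + (1 if current else 0)
        if current and current_len + added > char_limit:
            # The line does not fit: close this closet and start a new one.
            closets.append("\n".join(current))
            current, current_len = [], 0
            added = len(line)
        current.append(line)
        current_len += added
    if current:
        closets.append("\n".join(current))
    return closets


closets = pack_closet_lines(["a" * 700, "b" * 700, "c" * 700])
print([len(c) for c in closets])  # [1401, 700]
```

In this sketch a single line longer than the limit still gets a closet of its own rather than being cut, which keeps the "never split" guarantee at the cost of one oversized document.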
### When do closets update?

When a file is re-mined (content changed, or `NORMALIZE_VERSION` was bumped), the miner first deletes every closet for that source file (`purge_file_closets`) and then writes a fresh set. Stale topics from the prior mine are gone: closets are always a snapshot of the current content, never an accumulation across runs.
### What about stale topics?

There are no stale topics: each re-mine is a clean rebuild for that source file. If a file changes and produces fewer closets than last time, the leftover numbered closets from the larger earlier run are still purged, because the delete is done by `source_file`, not by ID.
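The purge-then-write order can be demonstrated without a database. The sketch below uses an in-memory stand-in rather than the ChromaDB API; `FakeClosetCollection`, `rebuild_closets`, and the closet ID format are all hypothetical and exist only to show why deleting by source file removes leftovers that an ID-based overwrite would miss.

```python
class FakeClosetCollection:
    """In-memory stand-in for a closets collection (not the ChromaDB API)."""

    def __init__(self):
        self.docs = {}  # closet_id -> (document, source_file)

    def upsert(self, closet_id, document, source_file):
        self.docs[closet_id] = (document, source_file)

    def delete_where_source(self, source_file):
        self.docs = {k: v for k, v in self.docs.items() if v[1] != source_file}


def rebuild_closets(col, source_file, closet_docs):
    """Purge every closet for source_file, then write numbered replacements."""
    col.delete_where_source(source_file)
    for n, doc in enumerate(closet_docs, start=1):
        # Hypothetical ID scheme: one numbered closet per packed document.
        col.upsert(f"closet_{source_file}_{n:02d}", doc, source_file)


col = FakeClosetCollection()
rebuild_closets(col, "auth.md", ["topics A", "topics B", "topics C"])
rebuild_closets(col, "auth.md", ["topics A2"])  # re-mine yields fewer closets
print(sorted(col.docs))  # ['closet_auth.md_01']
```

Without the purge step, the second run would overwrite only `_01` and leave `_02` and `_03` behind with content from the first run.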
### Do closets survive palace rebuilds?

Closets are stored in the `mempalace_closets` ChromaDB collection alongside `mempalace_drawers`, so deleting the palace deletes them too. After a rebuild, closets are recreated during the next `mempalace mine`.
## How search uses closets

```
Query → search mempalace_closets (fast, small documents)
    ↓
top closet hits → parse `→drawer_id_a,drawer_id_b` pointers
    ↓
fetch exactly those drawers from mempalace_drawers (verbatim content)
    ↓
apply max_distance filter
    ↓
return chunk-level results (same shape as direct search)
```

Hits carry `matched_via: "closet"` (or `"drawer"` for the fallback path) plus a `closet_preview` field showing the line that surfaced them.

If no closets exist (for example, the palace was created before this feature), or if every closet hit is filtered out by `max_distance`, search falls back to direct drawer search. Missing closets are created on the next mine.
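The control flow above, including the fallback, can be sketched with plain data structures. This is not the code in `mempalace/searcher.py`: the function names, argument shapes, and hit fields other than `matched_via` and `closet_preview` are assumptions, and the sketch assumes the distance compared against `max_distance` is the closet match distance.

```python
def closet_first_hits(closet_hits, drawers, max_distance):
    """closet_hits: list of (distance, closet_line); drawers: id -> text.

    Returns chunk-level hits, or None when nothing survives the filter.
    """
    hits, seen = [], set()
    for distance, line in sorted(closet_hits):
        if distance > max_distance:
            continue  # filter applied per hit, not per query
        pointer = line.rsplit("→", 1)[-1] if "→" in line else ""
        for drawer_id in (d.strip() for d in pointer.split(",")):
            if drawer_id in drawers and drawer_id not in seen:
                seen.add(drawer_id)
                hits.append({
                    "id": drawer_id,
                    "text": drawers[drawer_id],
                    "distance": distance,
                    "matched_via": "closet",
                    "closet_preview": line,
                })
    return hits or None


def search(closet_hits, drawer_hits, drawers, max_distance=1.0):
    """Try the closet path first; fall back to direct drawer hits."""
    hits = closet_first_hits(closet_hits, drawers, max_distance)
    if hits is not None:
        return hits
    return [
        {"id": d, "text": drawers[d], "distance": dist, "matched_via": "drawer"}
        for dist, d in sorted(drawer_hits)
        if dist <= max_distance
    ]
```

Returning `None` rather than an empty list lets the caller distinguish "closet path produced nothing, try drawers" from a successful closet lookup.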
> **BM25 hybrid re-rank** is on the roadmap (deferred to a follow-up PR alongside generic `LLM_*` env-var support); the current closet search ranks purely by ChromaDB cosine distance against the closet text.
## Limits

| Setting | Value | Reason |
|---------|-------|--------|
| Max closet size | 1,500 chars (`CLOSET_CHAR_LIMIT`) | Leaves buffer under ChromaDB's working limit |
| Source content scanned | 5,000 chars (`CLOSET_EXTRACT_WINDOW`) | Caps regex extraction cost on long files; back-of-file content is currently invisible to closet extraction (tracked for follow-up) |
| Max topics per file | 12 | Keeps closets focused |
| Max quotes per file | 3 | Most relevant only |
| Max entities per pointer | 5 | Top names by frequency, after stoplist filtering |
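The entity limit in the last row combines a frequency count with a stoplist. The sketch below shows the idea only: the regex, the `top_entities` helper, and the five-word `ENTITY_STOPLIST` are illustrative assumptions, not the extraction code or the stoplist that ships in the project.

```python
import re
from collections import Counter

# Hypothetical subset for illustration; the shipped stoplist is larger.
ENTITY_STOPLIST = {"When", "The", "After", "Monday", "January"}


def top_entities(text: str, limit: int = 5) -> list[str]:
    """Return up to `limit` capitalized words by frequency, minus stoplisted ones."""
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    counts = Counter(c for c in candidates if c not in ENTITY_STOPLIST)
    return [name for name, _ in counts.most_common(limit)]


print(top_entities("When Igor met Milla, Igor fixed auth. The fix shipped."))
# ['Igor', 'Milla']
```

Filtering before counting matters here: sentence starters such as "The" are frequent enough that they would otherwise crowd real names out of the top five.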
## For developers

Closet functions live in `mempalace/palace.py`:

- `get_closets_collection()`: get the closets ChromaDB collection
- `build_closet_lines()`: extract topics/entities/quotes into pointer lines
- `upsert_closet_lines()`: write lines to closets respecting the char limit (overwrites existing IDs and does not append, so call `purge_file_closets` first when re-mining)
- `purge_file_closets()`: delete every closet for a given source file before rebuild
- `CLOSET_CHAR_LIMIT` / `CLOSET_EXTRACT_WINDOW`: size constants

The closet-first search path lives in `mempalace/searcher.py`:

- `_extract_drawer_ids_from_closet()`: parse `→drawer_a,drawer_b` pointers out of a closet document
- `_closet_first_hits()`: query closets, parse pointers, hydrate matching drawers, return chunk-level hits or `None` to fall back

Note: only the project miner (`miner.py::process_file`) builds closets today. Conversation-mined wings (Claude Code JSONL, ChatGPT export, etc.) will keep using direct drawer search via the searcher fallback until the convo-closet PR lands.