
Igor Lins e Silva 21d4a23430 merge: develop + harden closet layer for production

Merges develop (#820 version sync, #785 strip_noise + NORMALIZE_VERSION,
#784 file locking) and addresses six concerns surfaced during PR review
of the closet feature:

1. Closet append-on-rebuild bug — upsert_closet_lines used to APPEND to
   existing closets (mismatched the doc's "fully replaced" promise). With
   NORMALIZE_VERSION rebuilds on develop, this would have stacked stale
   v1 topics on top of fresh v2 content forever. Fix:
   - Drop the read-and-append branch from upsert_closet_lines (now a pure
     numbered-id overwrite).
   - Add purge_file_closets(closets_col, source_file) helper that wipes
     every closet for a source file by where-filter.
   - process_file calls purge_file_closets before upsert on every mine,
     mirroring the existing drawer purge.

2. Searcher returned whole-file blobs from the closet path while the
   direct path returned chunk-level drawers. Refactored:
   - _extract_drawer_ids_from_closet parses the `→drawer_a,drawer_b`
     pointers out of closet documents.
   - _closet_first_hits hydrates exactly those drawer IDs (chunk-level),
     not collection.get(where=source_file) (which returned everything).
   - Same hit shape as direct-search path; both now carry matched_via.

3. max_distance was bypassed on the closet path. Now applied per-hit;
   when every closet candidate gets filtered, _closet_first_hits returns
   None and the caller falls through to direct drawer search.

4. Entity extraction caught sentence-starters like "When", "The",
   "After" as proper nouns. Added _ENTITY_STOPLIST (~40 common false
   positives + day/month names + role words). Real names like Igor /
   Milla still survive — covered by tests.

5. CLOSETS.md drifted from the code (claimed "replaced via upsert" but
   code appended; claimed BM25 hybrid that doesn't exist; claimed a
   10K char hydration cap that wasn't enforced). Rewritten to describe
   what actually ships, with explicit notes on the BM25 / convo-closet
   follow-ups.

6. Zero tests for ~250 lines. Added tests/test_closets.py with 17 cases:
   - build_closet_lines: pointer shape, header extraction, stoplist
     filtering (with regression case for "When/After/The"), real-name
     survival, fallback-line guarantee, drawer-ref slicing.
   - upsert_closet_lines: pure overwrite semantics (regression for the
     append bug), char-limit packing without splitting lines.
   - purge_file_closets: scoped to source_file, doesn't touch others.
   - End-to-end miner rebuild: re-mining a file with fewer topics fully
     purges leftover numbered closets from the larger first run.
   - _extract_drawer_ids_from_closet: parsing + dedup edge cases.
   - search_memories closet-first: fallback when empty, chunk-level
     hits with matched_via, no whole-file glue, max_distance enforced.

Merge resolutions: miner.py imports combined NORMALIZE_VERSION/mine_lock
from develop with the closet helpers from this branch. process_file
auto-merged cleanly (closet block sits inside develop's lock body).

724/724 tests pass. ruff + format clean under CI-pinned 0.4.x.

2026-04-13 17:00:55 -03:00

4.5 KiB


Closets — The Searchable Index Layer

What closets are

Drawers hold your verbatim content. Closets are the index — compact pointers that tell the searcher which drawers to open.

CLOSET: "built auth system|Ben;Igor|→drawer_api_auth_a1b2c3"
         ↑ topic           ↑ entities  ↑ points to this drawer

An agent searching "who built the auth?" hits the closet first (fast scan of short text), then opens the referenced drawer to get the full verbatim content.
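As a rough illustration of the pointer format above, a single closet line can be split into its three fields. This is a minimal sketch, not the code in mempalace/palace.py: `ClosetPointer` and `parse_closet_line` are hypothetical names introduced here, and the only assumption about the format is the `topic|entities|→drawer_ids` layout shown in the example.

```python
from typing import NamedTuple


class ClosetPointer(NamedTuple):
    """Hypothetical container for the three fields of one closet line."""
    topic: str
    entities: list
    drawer_ids: list


def parse_closet_line(line: str) -> ClosetPointer:
    """Split 'topic|entity1;entity2|→id_a,id_b' into its three fields."""
    # Split from the right so a '|' inside the topic text does not shift fields.
    topic, entities, pointer = line.rsplit("|", 2)
    if not pointer.startswith("→"):
        raise ValueError(f"missing drawer pointer: {line!r}")
    return ClosetPointer(
        topic=topic.strip(),
        entities=[e.strip() for e in entities.split(";") if e.strip()],
        drawer_ids=[d.strip() for d in pointer[1:].split(",") if d.strip()],
    )


hit = parse_closet_line("built auth system|Ben;Igor|→drawer_api_auth_a1b2c3")
print(hit.topic, hit.entities, hit.drawer_ids)
```

The drawer IDs recovered this way are what the searcher would use to open the referenced drawers.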

Lifecycle

When are closets created?

Closets are created during mempalace mine. For each file mined:

1. Content is chunked into drawers (verbatim, ~800 chars each).
2. Topics, entities, and quotes are extracted from the content.
3. A closet is created with pointer lines to those drawers.

What's inside a closet?

Each line is one atomic topic pointer:

topic description|entity1;entity2|→drawer_id_1,drawer_id_2
"verbatim quote from the content"|entity1|→drawer_id_3

Topics are never split across closets. If adding a topic line would push a closet past 1,500 characters (CLOSET_CHAR_LIMIT), a new closet is started and the line goes there whole.
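The packing rule can be sketched as a simple greedy loop. This is an illustrative sketch only: `pack_closet_lines` is a hypothetical helper, and joining lines with a newline (and counting that newline toward the limit) is an assumption, not something the document specifies.

```python
CLOSET_CHAR_LIMIT = 1500  # value taken from the Limits table


def pack_closet_lines(lines, limit=CLOSET_CHAR_LIMIT):
    """Group pointer lines into closet documents without splitting any line."""
    closets, current, size = [], [], 0
    for line in lines:
        # +1 accounts for the newline joining this line to the previous one.
        added = len(line) + (1 if current else 0)
        if current and size + added > limit:
            # Adding this line would overflow: close the closet, start a new one.
            closets.append("\n".join(current))
            current, size = [], 0
            added = len(line)
        # A single line longer than the limit still gets its own closet, unsplit.
        current.append(line)
        size += added
    if current:
        closets.append("\n".join(current))
    return closets


docs = pack_closet_lines(["x" * 700] * 3)
print([len(d) for d in docs])
```

With three 700-character lines, the first two fit in one closet and the third starts a second closet rather than being cut in half.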

When do closets update?

When a file is re-mined (content changed, or NORMALIZE_VERSION was bumped), the miner first deletes every closet for that source file (purge_file_closets) and then writes a fresh set. Stale topics from the prior mine are gone — closets are always a snapshot of the current content, never an accumulation across runs.

What about stale topics?

There are no stale topics: each re-mine is a clean rebuild for that source file. If a file shrinks and produces fewer closets than last time, the leftover numbered closets from the earlier, larger run are still purged, because the delete is done by source_file, not by ID.
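To see why the purge has to go by source_file rather than by ID, here is a small sketch that uses a plain dict as a stand-in for the closets collection. Everything in it is hypothetical: the `closet::file::NN` ID scheme, `overwrite_by_id`, and `purge_by_source` are illustrative names, not the signatures of upsert_closet_lines or purge_file_closets.

```python
def closet_id(source_file, n):
    # Hypothetical ID scheme: numbered closets per source file.
    return f"closet::{source_file}::{n:02d}"


def overwrite_by_id(store, source_file, docs):
    """Write docs under numbered IDs; IDs beyond len(docs) are left untouched."""
    for n, doc in enumerate(docs, start=1):
        store[closet_id(source_file, n)] = {"source_file": source_file, "document": doc}


def purge_by_source(store, source_file):
    """Delete every entry whose metadata matches source_file."""
    for key in [k for k, v in store.items() if v["source_file"] == source_file]:
        del store[key]


def remine(store, source_file, docs):
    purge_by_source(store, source_file)  # wipe first, so nothing stale survives
    overwrite_by_id(store, source_file, docs)


store = {}
remine(store, "auth.py", ["topics 1", "topics 2", "topics 3"])  # larger first run
remine(store, "notes.md", ["other file"])
remine(store, "auth.py", ["fresh topics"])  # file shrank: only one closet now

remaining = sorted(k for k, v in store.items() if v["source_file"] == "auth.py")
print(remaining)
```

Without the purge step, closets 02 and 03 from the first run would stay in the store, because an ID-based overwrite only touches the IDs it writes.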

Do closets survive palace rebuilds?

Not on their own. Closets are stored in the mempalace_closets ChromaDB collection alongside mempalace_drawers, so they go when the palace is deleted. After you rebuild the palace, closets are recreated during the next mempalace mine.

How search uses closets

Query → search mempalace_closets (fast, small documents)
         ↓
    top closet hits → parse `→drawer_id_a,drawer_id_b` pointers
         ↓
    fetch exactly those drawers from mempalace_drawers (verbatim content)
         ↓
    apply max_distance filter
         ↓
    return chunk-level results (same shape as direct search)

Hits carry matched_via: "closet" (or "drawer" for the fallback path) plus a closet_preview field showing the line that surfaced them.

If no closets exist (for example, the palace was created before this feature), or if every closet hit is filtered out by max_distance, search falls back to direct drawer search. For a palace without closets, they are created on the next mempalace mine.
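The flow above, including the fallback, can be sketched with in-memory data. This is a simplified illustration under stated assumptions, not the code in mempalace/searcher.py: `extract_drawer_ids`, `closet_first_hits`, and `search` are hypothetical stand-ins, closet hits are passed in as `(document, distance)` pairs instead of coming from a ChromaDB query, and the max_distance filter is assumed to use the closet's distance.

```python
import re

POINTER_RE = re.compile(r"→([^\n|]+)")


def extract_drawer_ids(closet_doc):
    """Collect drawer IDs from every '→a,b' pointer, deduplicated in order."""
    seen = {}
    for match in POINTER_RE.finditer(closet_doc):
        for drawer_id in match.group(1).split(","):
            drawer_id = drawer_id.strip()
            if drawer_id:
                seen.setdefault(drawer_id, None)
    return list(seen)


def closet_first_hits(closet_hits, drawers, max_distance):
    """closet_hits: list of (closet_doc, distance); drawers: id -> verbatim text."""
    hits, used = [], set()
    for doc, distance in closet_hits:
        if distance > max_distance:  # assumption: filter on the closet's distance
            continue
        for drawer_id in extract_drawer_ids(doc):
            if drawer_id in drawers and drawer_id not in used:
                used.add(drawer_id)
                hits.append({"id": drawer_id, "text": drawers[drawer_id],
                             "distance": distance, "matched_via": "closet"})
    return hits or None  # None tells the caller to fall back


def search(closet_hits, drawers, max_distance, direct_search):
    hits = closet_first_hits(closet_hits, drawers, max_distance)
    return hits if hits is not None else direct_search()


drawers = {"d1": "one", "d2": "two"}
direct = lambda: [{"id": "d2", "matched_via": "drawer"}]
found = search([("a|x|→d1,d2\nb|y|→d1", 0.2)], drawers, 0.5, direct)
fallback = search([("a|x|→d1", 0.9)], drawers, 0.5, direct)
print([h["id"] for h in found], fallback)
```

Returning None rather than an empty list keeps "no usable closet hits" distinct from a successful search, which is what lets the caller fall through to direct drawer search.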

BM25 hybrid re-rank is on the roadmap (deferred to a follow-up PR alongside generic LLM_* env-var support); the current closet search ranks purely by ChromaDB cosine distance against the closet text.

Limits

| Setting | Value | Reason |
|---|---|---|
| Max closet size | 1,500 chars (`CLOSET_CHAR_LIMIT`) | Leaves buffer under ChromaDB's working limit |
| Source content scanned | 5,000 chars (`CLOSET_EXTRACT_WINDOW`) | Caps regex extraction cost on long files; back-of-file content is currently invisible to closet extraction (tracked for follow-up) |
| Max topics per file | 12 | Keeps closets focused |
| Max quotes per file | 3 | Most relevant only |
| Max entities per pointer | 5 | Top names by frequency, after stoplist filtering |
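The "top names by frequency, after stoplist filtering" rule might look roughly like the sketch below. The capitalized-word regex, the tiny `ENTITY_STOPLIST` subset, and the `top_entities` helper are all assumptions made for illustration; the real extraction patterns and the full stoplist are not shown in this document.

```python
import re
from collections import Counter

# Tiny illustrative subset; the real stoplist is described as ~40 entries
# plus day/month names and role words.
ENTITY_STOPLIST = {"When", "The", "After", "Monday", "January"}
MAX_ENTITIES = 5  # "Max entities per pointer" from the Limits table


def top_entities(text, limit=MAX_ENTITIES):
    """Return the most frequent capitalized words that are not on the stoplist."""
    # Assumed candidate pattern: a capital letter followed by lowercase letters.
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    counts = Counter(c for c in candidates if c not in ENTITY_STOPLIST)
    return [name for name, _ in counts.most_common(limit)]


sample = ("When Igor met Milla, Igor fixed the auth bug. "
          "The fix helped Milla. After that Igor shipped.")
names = top_entities(sample)
print(names)
```

Sentence starters such as "When", "The", and "After" match the capitalized-word pattern, so they are dropped by the stoplist before counting, while real names survive.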

For developers

Closet functions live in mempalace/palace.py:

get_closets_collection() — get the closets ChromaDB collection
build_closet_lines() — extract topics/entities/quotes into pointer lines
upsert_closet_lines() — write lines to closets respecting the char limit (overwrites existing IDs; does not append — call purge_file_closets first when re-mining)
purge_file_closets() — delete every closet for a given source file before rebuild
CLOSET_CHAR_LIMIT / CLOSET_EXTRACT_WINDOW — size constants

The closet-first search path lives in mempalace/searcher.py:

_extract_drawer_ids_from_closet() — parse →drawer_a,drawer_b pointers out of a closet document
_closet_first_hits() — query closets, parse pointers, hydrate matching drawers, return chunk-level hits or None to fall back

Note: only the project miner (miner.py::process_file) builds closets today. Conversation-mined wings (Claude Code JSONL, ChatGPT export, etc.) will keep using direct drawer search via the searcher fallback until the convo-closet PR lands.
