merge: develop + harden closet layer for production

Merges develop (#820 version sync, #785 strip_noise + NORMALIZE_VERSION,
#784 file locking) and addresses six concerns surfaced during PR review
of the closet feature:

1. Closet append-on-rebuild bug — upsert_closet_lines used to APPEND to
   existing closets, contradicting the docs' "fully replaced" promise. With
   NORMALIZE_VERSION rebuilds on develop, this would have stacked stale
   v1 topics on top of fresh v2 content forever. Fix:
   - Drop the read-and-append branch from upsert_closet_lines (now a pure
     numbered-id overwrite).
   - Add purge_file_closets(closets_col, source_file) helper that wipes
     every closet for a source file by where-filter.
   - process_file calls purge_file_closets before upsert on every mine,
     mirroring the existing drawer purge.

2. Searcher returned whole-file blobs from the closet path while the
   direct path returned chunk-level drawers. Refactored:
   - _extract_drawer_ids_from_closet parses the `→drawer_a,drawer_b`
     pointers out of closet documents.
   - _closet_first_hits hydrates exactly those drawer IDs (chunk-level),
     not collection.get(where=source_file) (which returned everything).
   - Same hit shape as direct-search path; both now carry matched_via.

3. max_distance was bypassed on the closet path. Now applied per-hit;
   when every closet candidate gets filtered, _closet_first_hits returns
   None and the caller falls through to direct drawer search.

4. Entity extraction caught sentence-starters like "When", "The",
   "After" as proper nouns. Added _ENTITY_STOPLIST (~40 common false
   positives + day/month names + role words). Real names like Igor /
   Milla still survive — covered by tests.

5. CLOSETS.md drifted from the code (claimed "replaced via upsert" but
   code appended; claimed BM25 hybrid that doesn't exist; claimed a
   10K char hydration cap that wasn't enforced). Rewritten to describe
   what actually ships, with explicit notes on the BM25 / convo-closet
   follow-ups.

6. The ~250 lines of closet code had zero tests. Added
   tests/test_closets.py with 17 cases:
   - build_closet_lines: pointer shape, header extraction, stoplist
     filtering (with regression case for "When/After/The"), real-name
     survival, fallback-line guarantee, drawer-ref slicing.
   - upsert_closet_lines: pure overwrite semantics (regression for the
     append bug), char-limit packing without splitting lines.
   - purge_file_closets: scoped to source_file, doesn't touch others.
   - End-to-end miner rebuild: re-mining a file with fewer topics fully
     purges leftover numbered closets from the larger first run.
   - _extract_drawer_ids_from_closet: parsing + dedup edge cases.
   - search_memories closet-first: fallback when empty, chunk-level
     hits with matched_via, no whole-file glue, max_distance enforced.
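
The stoplist filtering from item 4 can be sketched roughly as follows. This is a minimal illustration, not the shipped code: `STOPLIST` is a hypothetical, abbreviated stand-in for `_ENTITY_STOPLIST`, and `extract_entities` is an assumed helper name. Only the regex and the "2+ occurrences" rule are taken from the diff.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stoplist; the real _ENTITY_STOPLIST is longer.
STOPLIST = frozenset({"The", "This", "When", "After", "Monday", "User"})


def extract_entities(text: str, limit: int = 5) -> list:
    """Return capitalized words seen 2+ times, skipping stoplisted words."""
    words = re.findall(r"\b[A-Z][a-z]{2,}\b", text)
    counts = Counter(w for w in words if w not in STOPLIST)
    # most_common() sorts by count, highest first; ties keep first-seen order.
    return [w for w, c in counts.most_common() if c >= 2][:limit]


sample = (
    "When Igor met Milla, the plan changed. After lunch Igor wrote tests. "
    "When Milla reviewed them, The build passed. After that, nothing broke."
)
print(extract_entities(sample))
```

"When" and "After" each appear twice in the sample, so without the stoplist they would pass the frequency threshold alongside the real names.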

Merge resolutions: the miner.py imports combine NORMALIZE_VERSION/mine_lock
from develop with the closet helpers from this branch. process_file
auto-merged cleanly (closet block sits inside develop's lock body).

724/724 tests pass. ruff + format clean under CI-pinned 0.4.x.
Igor Lins e Silva
2026-04-13 17:00:55 -03:00
14 changed files with 1123 additions and 168 deletions
+23 -14
@@ -32,13 +32,11 @@ Topics are never split across closets. If adding a topic would exceed 1,500 char
### When do closets update?
When a file is re-mined (content changed), its drawers are replaced and new closets are built from the fresh content. The old closet content is replaced via upsert.
When a file is re-mined (content changed, or `NORMALIZE_VERSION` was bumped), the miner first deletes every closet for that source file (`purge_file_closets`) and then writes a fresh set. Stale topics from the prior mine are gone — closets are always a snapshot of the current content, never an accumulation across runs.
### What about stale topics?
If a file's content changes and a topic no longer exists, the closet is rebuilt entirely from the new content — stale topics are gone. Closets are tied to source files, not to individual topics.
If you add content to an existing file (e.g., a daily diary growing throughout the day), new topics are appended to the existing closet until the 1,500-char limit, then a new closet is created.
There are no stale topics: each re-mine is a clean rebuild for that source file. If a re-mine produces fewer closets than the previous run, the leftover numbered closets from the larger run are still purged, because the delete is done by `source_file`, not by ID.
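
A minimal sketch of why deleting by `source_file` matters, assuming a tiny in-memory stand-in for the collection. `FakeCollection` and `rebuild_closets` are hypothetical names for illustration only; the real code goes through ChromaDB via `purge_file_closets` and `upsert_closet_lines`.

```python
class FakeCollection:
    """In-memory stand-in for a ChromaDB collection (illustration only)."""

    def __init__(self):
        self.rows = {}  # id -> (document, metadata)

    def upsert(self, ids, documents, metadatas):
        for i, doc, meta in zip(ids, documents, metadatas):
            self.rows[i] = (doc, meta)

    def delete(self, where):
        # Supports a single equality filter, e.g. {"source_file": "x.md"}.
        key, value = next(iter(where.items()))
        self.rows = {i: r for i, r in self.rows.items() if r[1].get(key) != value}


def rebuild_closets(col, base_id, source_file, closet_texts):
    # Purge by metadata first, then write the fresh numbered set.
    col.delete(where={"source_file": source_file})
    for n, text in enumerate(closet_texts, start=1):
        col.upsert(
            ids=[f"{base_id}_{n:02d}"],
            documents=[text],
            metadatas=[{"source_file": source_file}],
        )


col = FakeCollection()
rebuild_closets(col, "closet_notes", "notes.md", ["a", "b", "c"])  # first mine
rebuild_closets(col, "closet_other", "other.md", ["z"])            # unrelated file
rebuild_closets(col, "closet_notes", "notes.md", ["a2"])           # smaller re-mine
print(sorted(col.rows))
```

With an overwrite-only upsert and no purge, `closet_notes_02` and `closet_notes_03` would survive the second mine of `notes.md`, since nothing ever writes to those IDs again.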
### Do closets survive palace rebuilds?
@@ -49,31 +47,42 @@ Closets are stored in the `mempalace_closets` ChromaDB collection alongside `mem
```
Query → search mempalace_closets (fast, small documents)
top closet hits → extract drawer IDs from pointer lines
top closet hits → parse `→drawer_id_a,drawer_id_b` pointers
fetch drawers from mempalace_drawers (full verbatim content)
fetch exactly those drawers from mempalace_drawers (verbatim content)
BM25 hybrid re-rank (keyword match + vector similarity)
apply max_distance filter
return results to user
return chunk-level results (same shape as direct search)
```
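
The pointer-parsing step in the flow above can be sketched as below. The regex has the same shape as `_CLOSET_DRAWER_REF_RE` in this commit; the function name and the sample closet text are assumptions made for illustration.

```python
import re

# Same pattern shape as the commit's _CLOSET_DRAWER_REF_RE. Note that \w does
# not match "-", so an ID containing a hyphen would be cut short at that point.
DRAWER_REF_RE = re.compile(r"→([\w,]+)")


def extract_drawer_ids(closet_doc: str) -> list:
    """Collect drawer IDs from every pointer line, deduped, in first-seen order."""
    seen = {}
    for match in DRAWER_REF_RE.findall(closet_doc):
        for did in match.split(","):
            if did and did not in seen:
                seen[did] = None
    return list(seen)


doc = "auth refactor|Igor|→drawer_a,drawer_b\nlock fix|Milla|→drawer_b,drawer_c"
print(extract_drawer_ids(doc))
```

A dict is used instead of a set so that closet-rank order is preserved when the IDs are later hydrated.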
If no closets exist (palace created before this feature), search falls back to direct drawer search. Closets are created on next mine.
Hits carry `matched_via: "closet"` (or `"drawer"` for the fallback path) plus a `closet_preview` field showing the line that surfaced them.
If no closets exist (palace created before this feature) — or all closet hits get filtered out by `max_distance` — search falls back to direct drawer search. Closets are created on next mine.
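
The fallback rule can be sketched as follows, under the assumption that candidates arrive as `(drawer_id, closet_distance)` pairs in closet-rank order. `filter_closet_hits` and `search` are hypothetical simplifications of `_closet_first_hits` and its caller, and the direct-search branch is a placeholder.

```python
from typing import Optional


def filter_closet_hits(candidates, max_distance: float, n_results: int) -> Optional[list]:
    """Apply max_distance per hit; return None when nothing survives."""
    hits = []
    for drawer_id, dist in candidates:
        # As in the commit, a max_distance of 0.0 disables the filter.
        if max_distance > 0.0 and dist > max_distance:
            continue
        hits.append({"id": drawer_id, "distance": dist, "matched_via": "closet"})
        if len(hits) >= n_results:
            break
    return hits or None  # None tells the caller to fall back


def search(candidates, max_distance: float, n_results: int = 5) -> list:
    hits = filter_closet_hits(candidates, max_distance, n_results)
    if hits is not None:
        return hits
    # Placeholder for the direct drawer query used as the fallback path.
    return [{"id": "drawer_direct", "distance": None, "matched_via": "drawer"}]


candidates = [("d1", 0.9), ("d2", 1.2)]
print(search(candidates, max_distance=1.0))
print(search(candidates, max_distance=0.5))
```

Returning `None` rather than an empty list lets the caller distinguish "closet path had nothing usable" from a legitimate result set.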
> **BM25 hybrid re-rank** is on the roadmap (deferred to a follow-up PR alongside generic `LLM_*` env-var support); the current closet search ranks purely by ChromaDB cosine distance against the closet text.
## Limits
| Setting | Value | Reason |
|---------|-------|--------|
| Max closet size | 1,500 chars | Leaves buffer under ChromaDB's working limit |
| Max closet size | 1,500 chars (`CLOSET_CHAR_LIMIT`) | Leaves buffer under ChromaDB's working limit |
| Source content scanned | 5,000 chars (`CLOSET_EXTRACT_WINDOW`) | Caps regex extraction cost on long files; back-of-file content is currently invisible to closet extraction (tracked for follow-up) |
| Max topics per file | 12 | Keeps closets focused |
| Max quotes per file | 3 | Most relevant only |
| Max entities per pointer | 5 | Top names by frequency |
| Max response chars | 10,000 | Prevents hydration blowup on large files |
| Max entities per pointer | 5 | Top names by frequency, after stoplist filtering |
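
A rough sketch of how lines might be packed under the closet size limit without splitting any line. The exact character accounting in `upsert_closet_lines` is only partly visible in this diff, so the bookkeeping below is an assumption, and `pack_lines` is a hypothetical name.

```python
CHAR_LIMIT = 1500  # mirrors CLOSET_CHAR_LIMIT


def pack_lines(lines, limit: int = CHAR_LIMIT) -> list:
    """Greedily group whole lines into closets of at most ~limit chars."""
    closets, current, chars = [], [], 0
    for line in lines:
        # +1 accounts for the newline that joins lines inside a closet.
        if current and chars + len(line) + 1 > limit:
            closets.append("\n".join(current))
            current, chars = [], 0
        current.append(line)
        chars += len(line) + 1
    if current:
        closets.append("\n".join(current))
    return closets


packed = pack_lines(["a" * 10, "b" * 10, "c" * 10], limit=25)
print([len(c) for c in packed])
```

A single line longer than the limit still gets its own closet in this sketch, because a line is never split.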
## For developers
Closet functions live in `mempalace/palace.py`:
- `get_closets_collection()` — get the closets ChromaDB collection
- `build_closet_lines()` — extract topics/entities/quotes into pointer lines
- `upsert_closet_lines()` — write lines to closets respecting the char limit
- `CLOSET_CHAR_LIMIT` — the 1,500 char limit constant
- `upsert_closet_lines()` — write lines to closets respecting the char limit (overwrites existing IDs; does not append — call `purge_file_closets` first when re-mining)
- `purge_file_closets()` — delete every closet for a given source file before rebuild
- `CLOSET_CHAR_LIMIT` / `CLOSET_EXTRACT_WINDOW` — size constants
The closet-first search path lives in `mempalace/searcher.py`:
- `_extract_drawer_ids_from_closet()` — parse `→drawer_a,drawer_b` pointers out of a closet document
- `_closet_first_hits()` — query closets, parse pointers, hydrate matching drawers, return chunk-level hits or `None` to fall back
Note: only the project miner (`miner.py::process_file`) builds closets today. Conversation-mined wings (Claude Code JSONL, ChatGPT export, etc.) will keep using direct drawer search via the searcher fallback until the convo-closet PR lands.
+5 -1
@@ -133,6 +133,10 @@ Example output:
[14:40:01] Session abc123: 18 exchanges, 3 since last save
```
## Known Limitations
**Hooks require session restart after install.** Claude Code loads hooks from `settings.json` at session start only. If you run `mempalace init` or manually edit hook config mid-session, the hooks won't fire until you restart Claude Code. This is a Claude Code limitation.
## Cost
**Zero extra tokens.** The hooks are bash scripts that run locally. They don't call any API. The only "cost" is the AI spending a few seconds organizing memories at each checkpoint — and it's doing that with context it already has loaded.
**Zero extra tokens.** The hooks notify the AI that saves happened in the background — the AI doesn't need to write anything in the chat. All filing is handled automatically. Previous versions asked the AI to write diary entries and drawer content in the chat window, which cost ~$1/session in retransmitted tokens.
+3 -3
@@ -68,10 +68,10 @@ if [ -n "$MEMPAL_DIR" ] && [ -d "$MEMPAL_DIR" ]; then
python3 -m mempalace mine "$MEMPAL_DIR" >> "$STATE_DIR/hook.log" 2>&1
fi
# Always block — compaction = save everything
# Notify — compaction is about to happen but filing is handled in background
cat << 'HOOKJSON'
{
"decision": "block",
"reason": "COMPACTION IMMINENT. Save ALL topics, decisions, quotes, code, and important context from this session to your memory system. Be thorough — after compaction, detailed context will be lost. Organize into appropriate categories. Use verbatim quotes where possible. Save everything, then allow compaction to proceed."
"decision": "allow",
"reason": "MemPalace pre-compaction save. Your full conversation has been saved verbatim in the background — no action needed. Compaction can proceed safely."
}
HOOKJSON
+7 -4
@@ -140,12 +140,15 @@ if [ "$SINCE_LAST" -ge "$SAVE_INTERVAL" ] && [ "$EXCHANGE_COUNT" -gt 0 ]; then
python3 -m mempalace mine "$MEMPAL_DIR" >> "$STATE_DIR/hook.log" 2>&1 &
fi
# Block the AI and tell it to save
# The "reason" becomes a system message the AI sees and acts on
# Notify the AI that a checkpoint happened — but do NOT ask it to write
# anything in chat. All filing happens in the background via the pipeline.
# The old version asked the agent to write diary entries, add drawers, and
# add KG triples in the chat window — that cost ~$1/session in retransmitted
# tokens and cluttered the conversation.
cat << 'HOOKJSON'
{
"decision": "block",
"reason": "AUTO-SAVE checkpoint. Save key topics, decisions, quotes, and code from this session to your memory system. Organize into appropriate categories. Use verbatim quotes where possible. Continue conversation after saving."
"decision": "allow",
"reason": "MemPalace auto-save checkpoint. Your conversation is being saved verbatim in the background — no action needed from you. Continue working."
}
HOOKJSON
else
+74 -35
@@ -16,7 +16,13 @@ from datetime import datetime
from collections import defaultdict
from .normalize import normalize
from .palace import SKIP_DIRS, get_collection, file_already_mined, mine_lock
from .palace import (
NORMALIZE_VERSION,
SKIP_DIRS,
file_already_mined,
get_collection,
mine_lock,
)
# File types that might contain conversations
@@ -51,6 +57,7 @@ def _register_file(collection, source_file: str, wing: str, agent: str):
"added_by": agent,
"filed_at": datetime.now().isoformat(),
"ingest_mode": "registry",
"normalize_version": NORMALIZE_VERSION,
}
],
)
@@ -272,6 +279,62 @@ def scan_convos(convo_dir: str) -> list:
# =============================================================================
def _file_chunks_locked(collection, source_file, chunks, wing, room, agent, extract_mode):
"""Lock the source file, purge stale drawers, and upsert fresh chunks.
Combines the per-file serialization that prevents concurrent agents from
duplicating work (via mine_lock) with the normalize-version rebuild
contract (purge-before-insert so pre-v2 drawers don't survive).
Returns (drawers_added, room_counts_delta, skipped).
"""
room_counts_delta: dict = defaultdict(int)
drawers_added = 0
with mine_lock(source_file):
# Re-check after lock — another agent may have just finished this file
# at the current schema. A stale-version hit here returns False, so we
# still fall through to the purge+rebuild path below.
if file_already_mined(collection, source_file):
return 0, room_counts_delta, True
# Purge stale drawers first. When the normalize schema bumps,
# file_already_mined() returned False for pre-v2 drawers — clean
# them out so the source doesn't end up with mixed old/new drawers.
try:
collection.delete(where={"source_file": source_file})
except Exception:
pass
for chunk in chunks:
chunk_room = chunk.get("memory_type", room) if extract_mode == "general" else room
if extract_mode == "general":
room_counts_delta[chunk_room] += 1
drawer_id = f"drawer_{wing}_{chunk_room}_{hashlib.sha256((source_file + str(chunk['chunk_index'])).encode()).hexdigest()[:24]}"
try:
collection.upsert(
documents=[chunk["content"]],
ids=[drawer_id],
metadatas=[
{
"wing": wing,
"room": chunk_room,
"source_file": source_file,
"chunk_index": chunk["chunk_index"],
"added_by": agent,
"filed_at": datetime.now().isoformat(),
"ingest_mode": "convos",
"extract_mode": extract_mode,
"normalize_version": NORMALIZE_VERSION,
}
],
)
drawers_added += 1
except Exception as e:
if "already exists" not in str(e).lower():
raise
return drawers_added, room_counts_delta, False
def mine_convos(
convo_dir: str,
palace_path: str,
@@ -375,40 +438,16 @@ def mine_convos(
if extract_mode != "general":
room_counts[room] += 1
# File each chunk — lock to prevent concurrent agents duplicating
drawers_added = 0
with mine_lock(source_file):
# Re-check after lock — another agent may have just finished this file
if file_already_mined(collection, source_file):
files_skipped += 1
continue
for chunk in chunks:
chunk_room = chunk.get("memory_type", room) if extract_mode == "general" else room
if extract_mode == "general":
room_counts[chunk_room] += 1
drawer_id = f"drawer_{wing}_{chunk_room}_{hashlib.sha256((source_file + str(chunk['chunk_index'])).encode()).hexdigest()[:24]}"
try:
collection.upsert(
documents=[chunk["content"]],
ids=[drawer_id],
metadatas=[
{
"wing": wing,
"room": chunk_room,
"source_file": source_file,
"chunk_index": chunk["chunk_index"],
"added_by": agent,
"filed_at": datetime.now().isoformat(),
"ingest_mode": "convos",
"extract_mode": extract_mode,
}
],
)
drawers_added += 1
except Exception as e:
if "already exists" not in str(e).lower():
raise
# Lock + purge stale + file fresh chunks. Lock serializes concurrent
# agents; purge removes pre-v2 drawers so the schema bump applies.
drawers_added, room_delta, skipped = _file_chunks_locked(
collection, source_file, chunks, wing, room, agent, extract_mode
)
if skipped:
files_skipped += 1
continue
for r, n in room_delta.items():
room_counts[r] += n
total_drawers += drawers_added
print(f" ✓ [{i:4}/{len(files)}] {filepath.name[:50]:50} +{drawers_added}")
+30 -12
@@ -16,8 +16,15 @@ from datetime import datetime
from collections import defaultdict
from .palace import (
SKIP_DIRS, get_collection, get_closets_collection,
file_already_mined, mine_lock, build_closet_lines, upsert_closet_lines,
NORMALIZE_VERSION,
SKIP_DIRS,
build_closet_lines,
file_already_mined,
get_closets_collection,
get_collection,
mine_lock,
purge_file_closets,
upsert_closet_lines,
)
READABLE_EXTENSIONS = {
@@ -384,6 +391,7 @@ def add_drawer(
"chunk_index": chunk_index,
"added_by": agent,
"filed_at": datetime.now().isoformat(),
"normalize_version": NORMALIZE_VERSION,
}
# Store file mtime so we can detect modifications later.
try:
@@ -470,22 +478,32 @@ def process_file(
if added:
drawers_added += 1
# Build closet — the searchable index pointing to these drawers
# Each topic line is atomic — never split across closets
# Build closet — the searchable index pointing to these drawers.
# Purge first: a re-mine (mtime change or normalize_version bump) must
# fully replace the prior closets, not append to them.
if closets_col and drawers_added > 0:
drawer_ids = [
f"drawer_{wing}_{room}_{hashlib.sha256((source_file + str(c['chunk_index'])).encode()).hexdigest()[:24]}"
for c in chunks
]
closet_lines = build_closet_lines(source_file, drawer_ids, content, wing, room)
closet_id_base = f"closet_{wing}_{room}_{hashlib.sha256(source_file.encode()).hexdigest()[:24]}"
upsert_closet_lines(closets_col, closet_id_base, closet_lines, {
"wing": wing,
"room": room,
"source_file": source_file,
"drawer_count": drawers_added,
"filed_at": datetime.now().isoformat(),
})
closet_id_base = (
f"closet_{wing}_{room}_{hashlib.sha256(source_file.encode()).hexdigest()[:24]}"
)
purge_file_closets(closets_col, source_file)
upsert_closet_lines(
closets_col,
closet_id_base,
closet_lines,
{
"wing": wing,
"room": room,
"source_file": source_file,
"drawer_count": drawers_added,
"filed_at": datetime.now().isoformat(),
"normalize_version": NORMALIZE_VERSION,
},
)
return drawers_added, room
+93 -2
@@ -16,10 +16,93 @@ No API key. No internet. Everything local.
import json
import os
import re
from pathlib import Path
from typing import Optional
# ─── Noise stripping ─────────────────────────────────────────────────────
# Claude Code and other tools inject system tags, hook output, and UI chrome
# into transcripts. These waste drawer space and pollute search results.
#
# Verbatim is sacred — every pattern here is anchored to line boundaries and
# refuses to cross blank lines, so a stray unclosed tag in one message can
# never eat content from neighboring messages. When in doubt, leave text
# alone.
_NOISE_TAGS = (
"system-reminder",
"command-message",
"command-name",
"task-notification",
"user-prompt-submit-hook",
"hook_output",
)
def _tag_pattern(name: str) -> "re.Pattern[str]":
# Opening tag must begin a line (optionally after a `> ` blockquote marker,
# since _messages_to_transcript prefixes lines with `> `). Body is lazy but
# forbidden from crossing a blank line, so a dangling open tag can't span
# multiple messages. Closing tag eats optional trailing whitespace + newline.
return re.compile(
rf"(?m)^(?:> )?<{name}(?:\s[^>]*)?>" rf"(?:(?!\n\s*\n)[\s\S])*?" rf"</{name}>[ \t]*\n?"
)
_NOISE_TAG_PATTERNS = [_tag_pattern(t) for t in _NOISE_TAGS]
# Strings that identify an entire noise line when found at its start.
# Matched case-sensitively and anchored to line-start so user prose mentioning
# e.g. "current time:" in a sentence is untouched.
_NOISE_LINE_PREFIXES = (
"CURRENT TIME:",
"VERIFIED FACTS (do not contradict)",
"AGENT SPECIALIZATION:",
"Checking verified facts...",
"Injecting timestamp...",
"Starting background pipeline...",
"Checking emotional weights...",
"Auto-save reminder...",
"Checking pipeline...",
"MemPalace auto-save checkpoint.",
)
_NOISE_LINE_PATTERNS = [
re.compile(rf"(?m)^(?:> )?{re.escape(p)}.*\n?") for p in _NOISE_LINE_PREFIXES
]
# Claude Code TUI hook-run chrome, e.g. "Ran 2 Stop hook", "Ran 1 PreCompact hook".
# Line-anchored, case-sensitive, explicit hook names — prose like
# "our CI has a stop hook" stays intact.
_HOOK_LINE_RE = re.compile(
r"(?m)^(?:> )?Ran \d+ (?:Stop|PreCompact|PreToolUse|PostToolUse|UserPromptSubmit|Notification|SessionStart|SessionEnd) hook[s]?.*\n?"
)
# "… +N lines" collapsed-output marker, line-anchored.
_COLLAPSED_LINES_RE = re.compile(r"(?m)^(?:> )?…\s*\+\d+ lines.*\n?")
def strip_noise(text: str) -> str:
"""Remove system tags, hook output, and Claude Code UI chrome from text.
All patterns are line-anchored. User prose that happens to mention these
strings inline (e.g., documenting them) is preserved verbatim.
"""
for pat in _NOISE_TAG_PATTERNS:
text = pat.sub("", text)
for pat in _NOISE_LINE_PATTERNS:
text = pat.sub("", text)
text = _HOOK_LINE_RE.sub("", text)
text = _COLLAPSED_LINES_RE.sub("", text)
# Strip the Claude Code collapsed-output chrome "[N tokens] (ctrl+o to expand)".
# Narrow shape — a bare "(ctrl+o to expand)" in user prose stays intact.
text = re.sub(r"\s*\[\d+\s+tokens?\]\s*\(ctrl\+o to expand\)", "", text)
# Collapse runs of blank lines created by the removals
text = re.sub(r"\n{4,}", "\n\n\n", text)
return text.strip()
def normalize(filepath: str) -> str:
"""
Load a file and normalize to transcript format if it's a chat export.
@@ -40,12 +123,14 @@ def normalize(filepath: str) -> str:
if not content.strip():
return content
# Already has > markers — pass through
# Already has > markers — pass through unchanged.
lines = content.split("\n")
if sum(1 for line in lines if line.strip().startswith(">")) >= 3:
return content
# Try JSON normalization
# Try JSON normalization. strip_noise is applied inside the Claude Code
# JSONL parser (the only format that injects system tags/hook chrome);
# other formats pass through verbatim.
ext = Path(filepath).suffix.lower()
if ext in (".json", ".jsonl") or content.strip()[:1] in ("{", "["):
normalized = _try_normalize_json(content)
@@ -112,6 +197,10 @@ def _try_claude_code_jsonl(content: str) -> Optional[str]:
isinstance(b, dict) and b.get("type") == "tool_result" for b in msg_content
)
text = _extract_content(msg_content, tool_use_map=tool_use_map)
# Strip Claude Code system-injected noise per message, never across
# message boundaries — prevents span-eating.
if text:
text = strip_noise(text)
if text:
if is_tool_only and messages and messages[-1][0] == "assistant":
# Append tool results to the previous assistant message
@@ -121,6 +210,8 @@ def _try_claude_code_jsonl(content: str) -> Optional[str]:
messages.append(("user", text))
elif msg_type == "assistant":
text = _extract_content(msg_content, tool_use_map=tool_use_map)
if text:
text = strip_noise(text)
if text:
# If previous message is also assistant (multi-turn tool loop),
# merge into the same assistant turn
+117 -27
@@ -38,6 +38,16 @@ SKIP_DIRS = {
_DEFAULT_BACKEND = ChromaBackend()
# Schema version for drawer normalization. Bump when the normalization
# pipeline changes in a way that existing drawers should be rebuilt to pick up
# (e.g., new noise-stripping rules). `file_already_mined` treats drawers with
# a missing or stale `normalize_version` as "not mined", so the next mine pass
# silently rebuilds them — users don't need to manually erase + re-mine.
#
# v2 (2026-04): introduced strip_noise() for Claude Code JSONL; previous
# drawers stored system tags / hook chrome verbatim.
NORMALIZE_VERSION = 2
def get_collection(
palace_path: str,
@@ -58,6 +68,66 @@ def get_closets_collection(palace_path: str, create: bool = True):
CLOSET_CHAR_LIMIT = 1500 # fill closet until ~1500 chars, then start a new one
CLOSET_EXTRACT_WINDOW = 5000 # how many chars of source content to scan for entities/topics
# Common capitalized words that look like proper nouns but are usually
# sentence-starters or filler. Filtered out of entity extraction.
_ENTITY_STOPLIST = frozenset(
{
"The",
"This",
"That",
"These",
"Those",
"When",
"Where",
"What",
"Why",
"Who",
"Which",
"How",
"After",
"Before",
"Then",
"Now",
"Here",
"There",
"And",
"But",
"Or",
"Yet",
"So",
"If",
"Else",
"Yes",
"No",
"Maybe",
"Okay",
"User",
"Assistant",
"System",
"Tool",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December",
}
)
def build_closet_lines(source_file, drawer_ids, content, wing, room):
@@ -72,11 +142,15 @@ def build_closet_lines(source_file, drawer_ids, content, wing, room):
from pathlib import Path
drawer_ref = ",".join(drawer_ids[:3])
window = content[:CLOSET_EXTRACT_WINDOW]
# Extract proper nouns (capitalized words, 2+ occurrences)
words = re.findall(r"\b[A-Z][a-z]{2,}\b", content[:5000])
# Extract proper nouns (capitalized words, 2+ occurrences). Filter out
# common sentence-starters that aren't real entities.
words = re.findall(r"\b[A-Z][a-z]{2,}\b", window)
word_freq = {}
for w in words:
if w in _ENTITY_STOPLIST:
continue
word_freq[w] = word_freq.get(w, 0) + 1
entities = sorted(
[w for w, c in word_freq.items() if c >= 2],
@@ -89,15 +163,15 @@ def build_closet_lines(source_file, drawer_ids, content, wing, room):
for pattern in [
r"(?:built|fixed|wrote|added|pushed|tested|created|decided|migrated|reviewed|deployed|configured|removed|updated)\s+[\w\s]{3,40}",
]:
topics.extend(re.findall(pattern, content[:5000], re.IGNORECASE))
topics.extend(re.findall(pattern, window, re.IGNORECASE))
# Also grab section headers if present
for header in re.findall(r"^#{1,3}\s+(.{5,60})$", content[:5000], re.MULTILINE):
for header in re.findall(r"^#{1,3}\s+(.{5,60})$", window, re.MULTILINE):
topics.append(header.strip())
# Dedupe preserving order
topics = list(dict.fromkeys(t.strip().lower() for t in topics))[:12]
# Extract quotes
quotes = re.findall(r'"([^"]{15,150})"', content[:5000])
quotes = re.findall(r'"([^"]{15,150})"', window)
# Build pointer lines — each one is atomic, never split
lines = []
@@ -114,17 +188,31 @@ def build_closet_lines(source_file, drawer_ids, content, wing, room):
return lines
def upsert_closet_lines(closets_col, closet_id_base, lines, metadata):
"""Add topic lines to closets. Never splits a topic mid-line.
def purge_file_closets(closets_col, source_file: str) -> None:
"""Delete every closet associated with ``source_file``.
If adding a line WHOLE would exceed CLOSET_CHAR_LIMIT, a new closet
is created. Some closets may have less than 1500 chars — that's fine.
Every topic is complete and readable.
Call this before ``upsert_closet_lines`` on a re-mine so stale topics
from a prior schema/version don't survive in the closet collection.
Mirrors the drawer-purge step in process_file().
"""
try:
closets_col.delete(where={"source_file": source_file})
except Exception:
pass
def upsert_closet_lines(closets_col, closet_id_base, lines, metadata):
"""Write topic lines to closets, packed greedily without splitting a line.
Closets are deterministically numbered (``..._01``, ``..._02``, …) and
each ``upsert`` fully overwrites the prior content at that ID. Callers
are expected to ``purge_file_closets`` first when re-mining a source
file so stale-numbered closets from larger prior runs don't leak.
Returns the number of closets written.
"""
closet_num = 1
current_lines = []
current_lines: list = []
current_chars = 0
closets_written = 0
@@ -134,17 +222,6 @@ def upsert_closet_lines(closets_col, closet_id_base, lines, metadata):
return
closet_id = f"{closet_id_base}_{closet_num:02d}"
text = "\n".join(current_lines)
# Check if closet already has content — append if room
try:
existing = closets_col.get(ids=[closet_id])
if existing.get("ids") and existing["documents"][0]:
old = existing["documents"][0]
if len(old) + len(text) + 1 <= CLOSET_CHAR_LIMIT:
text = old + "\n" + text
except Exception:
pass
closets_col.upsert(documents=[text], ids=[closet_id], metadatas=[metadata])
closets_written += 1
@@ -152,7 +229,6 @@ def upsert_closet_lines(closets_col, closet_id_base, lines, metadata):
line_len = len(line)
# Would this line fit whole in the current closet?
if current_chars > 0 and current_chars + line_len + 1 > CLOSET_CHAR_LIMIT:
# Doesn't fit — flush current closet, start new one
_flush()
closet_num += 1
current_lines = []
@@ -182,18 +258,22 @@ def mine_lock(source_file: str):
try:
if os.name == "nt":
import msvcrt
msvcrt.locking(lf.fileno(), msvcrt.LK_LOCK, 1)
else:
import fcntl
fcntl.flock(lf, fcntl.LOCK_EX)
yield
finally:
try:
if os.name == "nt":
import msvcrt
msvcrt.locking(lf.fileno(), msvcrt.LK_UNLCK, 1)
else:
import fcntl
fcntl.flock(lf, fcntl.LOCK_UN)
except Exception:
pass
@@ -203,16 +283,26 @@ def mine_lock(source_file: str):
def file_already_mined(collection, source_file: str, check_mtime: bool = False) -> bool:
"""Check if a file has already been filed in the palace.
When check_mtime=True (used by project miner), returns False if the file
has been modified since it was last mined, so it gets re-mined.
When check_mtime=False (used by convo miner), just checks existence.
Returns False (so the file gets re-mined) when:
- no drawers exist for this source_file
- the stored `normalize_version` is missing or older than the current
schema (triggers silent rebuild after a normalization upgrade)
- `check_mtime=True` and the file's mtime differs from the stored one
When check_mtime=True (used by project miner), also re-mines on content
change. When check_mtime=False (used by convo miner), transcripts are
assumed immutable, so only the version gate triggers a rebuild.
"""
try:
results = collection.get(where={"source_file": source_file}, limit=1)
if not results.get("ids"):
return False
stored_meta = results.get("metadatas", [{}])[0] or {}
# Pre-v2 drawers have no version field — treat them as stale.
stored_version = stored_meta.get("normalize_version", 1)
if stored_version < NORMALIZE_VERSION:
return False
if check_mtime:
stored_meta = results.get("metadatas", [{}])[0]
stored_mtime = stored_meta.get("source_mtime")
if stored_mtime is None:
return False
+135 -65
@@ -7,9 +7,14 @@ Returns verbatim text — the actual words, never summaries.
"""
import logging
import re
from pathlib import Path
from .palace import get_collection, get_closets_collection
from .palace import get_closets_collection, get_collection
# Closet pointer line format: "topic|entities|→drawer_id_a,drawer_id_b"
# Multiple lines may join with newlines inside one closet document.
_CLOSET_DRAWER_REF_RE = re.compile(r"→([\w,]+)")
logger = logging.getLogger("mempalace_mcp")
@@ -29,6 +34,116 @@ def build_where_filter(wing: str = None, room: str = None) -> dict:
return {}
def _extract_drawer_ids_from_closet(closet_doc: str) -> list:
"""Parse all `→drawer_id_a,drawer_id_b` pointers out of a closet document.
Preserves order and dedupes.
"""
seen: dict = {}
for match in _CLOSET_DRAWER_REF_RE.findall(closet_doc):
for did in match.split(","):
did = did.strip()
if did and did not in seen:
seen[did] = None
return list(seen.keys())
def _closet_first_hits(
palace_path: str,
query: str,
where: dict,
drawers_col,
n_results: int,
max_distance: float,
):
"""Run a closet-first search and return chunk-level drawer hits.
Returns:
non-empty list of hits when the closet path produced usable matches.
``None`` when the closet collection is empty/missing OR when every
candidate drawer was filtered out (e.g. by max_distance); the
caller should fall back to direct drawer search.
"""
try:
closets_col = get_closets_collection(palace_path, create=False)
except Exception:
return None
try:
ckwargs = {
"query_texts": [query],
"n_results": max(n_results * 2, 5),
"include": ["documents", "metadatas", "distances"],
}
if where:
ckwargs["where"] = where
closet_results = closets_col.query(**ckwargs)
except Exception:
return None
closet_docs = closet_results["documents"][0] if closet_results["documents"] else []
if not closet_docs:
return None
closet_metas = closet_results["metadatas"][0]
closet_dists = closet_results["distances"][0]
# Collect candidate drawer IDs in closet-rank order, dedupe, remember
# which closet (and its distance/preview) introduced each one.
drawer_id_order: list = []
drawer_provenance: dict = {}
for cdoc, cmeta, cdist in zip(closet_docs, closet_metas, closet_dists):
for did in _extract_drawer_ids_from_closet(cdoc):
if did in drawer_provenance:
continue
drawer_provenance[did] = (cdist, cdoc, cmeta)
drawer_id_order.append(did)
if not drawer_id_order:
return None
# Hydrate exactly those drawers — chunk-level, not whole-file.
try:
fetched = drawers_col.get(
ids=drawer_id_order,
include=["documents", "metadatas"],
)
except Exception:
return None
fetched_ids = fetched.get("ids") or []
fetched_docs = fetched.get("documents") or []
fetched_metas = fetched.get("metadatas") or []
fetched_map = {
did: (doc, meta) for did, doc, meta in zip(fetched_ids, fetched_docs, fetched_metas)
}
hits: list = []
for did in drawer_id_order:
if did not in fetched_map:
continue # closet pointed to a drawer that no longer exists
doc, meta = fetched_map[did]
cdist, cdoc, _ = drawer_provenance[did]
if max_distance > 0.0 and cdist > max_distance:
continue
hits.append(
{
"text": doc,
"wing": meta.get("wing", "unknown"),
"room": meta.get("room", "unknown"),
"source_file": Path(meta.get("source_file", "?")).name,
"similarity": round(max(0.0, 1 - cdist), 3),
"distance": round(cdist, 4),
"matched_via": "closet",
"closet_preview": cdoc[:200],
}
)
if len(hits) >= n_results:
break
return hits if hits else None
def search(query: str, palace_path: str, wing: str = None, room: str = None, n_results: int = 5):
"""
Search the palace. Returns verbatim drawer content.
@@ -127,71 +242,25 @@ def search_memories(
where = build_where_filter(wing, room)
-# Try closet-first search: search the compact index, then hydrate drawers
-closet_hits = []
-try:
-closets_col = get_closets_collection(palace_path, create=False)
-ckwargs = {
-"query_texts": [query],
-"n_results": n_results * 2, # over-fetch closets to find best drawers
-"include": ["documents", "metadatas", "distances"],
+# Closet-first search: scan the compact index, parse drawer pointers
+# from each matching line, then hydrate exactly those drawers. This
+# keeps the result shape chunk-level (consistent with direct search)
+# and applies the same max_distance filter.
+closet_hits = _closet_first_hits(
+palace_path=palace_path,
+query=query,
+where=where,
+drawers_col=drawers_col,
+n_results=n_results,
+max_distance=max_distance,
+)
+if closet_hits is not None:
+return {
+"query": query,
+"filters": {"wing": wing, "room": room},
+"total_before_filter": len(closet_hits),
+"results": closet_hits,
}
-if where:
-ckwargs["where"] = where
-closet_results = closets_col.query(**ckwargs)
-if closet_results["documents"][0]:
-closet_hits = list(zip(
-closet_results["documents"][0],
-closet_results["metadatas"][0],
-closet_results["distances"][0],
-))
-except Exception:
-pass # no closets yet — fall through to direct drawer search
-# If closets found results, hydrate the referenced drawers
-if closet_hits:
-import re
-seen_sources = set()
-hits = []
-for closet_doc, closet_meta, closet_dist in closet_hits:
-source = closet_meta.get("source_file", "")
-if source in seen_sources:
-continue
-seen_sources.add(source)
-# Find drawers for this source file
-try:
-drawer_results = drawers_col.get(
-where={"source_file": source},
-include=["documents", "metadatas"],
-)
-if drawer_results.get("ids"):
-# Combine all drawer content for this file
-full_text = "\n\n".join(drawer_results["documents"])
-meta = drawer_results["metadatas"][0]
-hits.append({
-"text": full_text,
-"wing": meta.get("wing", "unknown"),
-"room": meta.get("room", "unknown"),
-"source_file": Path(source).name,
-"similarity": round(max(0.0, 1 - closet_dist), 3),
-"distance": round(closet_dist, 4),
-"matched_via": "closet",
-"closet_preview": closet_doc[:200],
-})
-except Exception:
-pass
-if len(hits) >= n_results:
-break
-if hits:
-return {
-"query": query,
-"filters": {"wing": wing, "room": room},
-"total_before_filter": len(closet_hits),
-"results": hits,
-}
# Fallback: direct drawer search (no closets yet, or closets empty)
try:
@@ -224,6 +293,7 @@ def search_memories(
"source_file": Path(meta.get("source_file", "?")).name,
"similarity": round(max(0.0, 1 - dist), 3),
"distance": round(dist, 4),
+"matched_via": "drawer",
}
)
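The closet-first contract in this file (return a list of hits when the closet path produces matches, return `None` when the caller should fall back to direct drawer search) can be illustrated with a small self-contained sketch. The names `filter_by_distance` and `search_with_fallback`, and the `(distance, text)` tuple format, are assumptions made for illustration only; they are not part of the mempalace API, and the sketch does not query ChromaDB.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[float, str]  # hypothetical (distance, text) pair


def filter_by_distance(candidates: List[Candidate], max_distance: float) -> Optional[list]:
    """Keep candidates within max_distance; 0.0 disables the filter.

    Returns None (not an empty list) when nothing survives, which is the
    signal for the caller to try the next search path.
    """
    kept = []
    for dist, text in candidates:
        if max_distance > 0.0 and dist > max_distance:
            continue
        kept.append(
            {
                "text": text,
                "distance": round(dist, 4),
                "similarity": round(max(0.0, 1 - dist), 3),
            }
        )
    return kept or None


def search_with_fallback(
    closet_candidates: List[Candidate],
    drawer_candidates: List[Candidate],
    max_distance: float = 0.0,
) -> list:
    """Try the closet path first; fall back to the drawer path on None."""
    hits = filter_by_distance(closet_candidates, max_distance)
    if hits is not None:
        return [dict(h, matched_via="closet") for h in hits]
    hits = filter_by_distance(drawer_candidates, max_distance) or []
    return [dict(h, matched_via="drawer") for h in hits]
```

Using `None` rather than an empty list as the "skip this path" signal keeps the two outcomes distinct: the closet path either answers the query or hands it on unchanged.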
+1 -1
@@ -1,3 +1,3 @@
"""Single source of truth for the MemPalace package version."""
-__version__ = "3.1.0"
+__version__ = "3.2.0"
+316
@@ -0,0 +1,316 @@
"""
test_closets.py — Tests for the closet (searchable index) layer.
Covers:
* build_closet_lines — pointer-line shape, entity extraction, stoplist,
quote/header pickup, and the "always emit one line" guarantee.
* upsert_closet_lines — pure overwrite (no append), char-limit packing,
atomic-line guarantee.
* purge_file_closets — wipes prior closets so a re-mine starts clean.
* The end-to-end rebuild: re-mining a file fully replaces its closets,
including when the prior run produced more numbered closets.
* search_memories closet-first path — returns chunk-level hits parsed
from `→drawer_ids` pointers, falls back when closets are empty,
respects max_distance.
"""
from mempalace.miner import mine
from mempalace.palace import (
CLOSET_CHAR_LIMIT,
build_closet_lines,
get_closets_collection,
purge_file_closets,
upsert_closet_lines,
)
from mempalace.searcher import _extract_drawer_ids_from_closet, search_memories
# ── build_closet_lines ─────────────────────────────────────────────────
class TestBuildClosetLines:
def test_emits_pointer_line_shape(self, tmp_path):
content = (
"# Auth rewrite\n\n"
"Decided we need to migrate to passkeys. "
"Built the prototype with WebAuthn. "
"Reviewed the API surface."
)
lines = build_closet_lines(
"/proj/auth.md",
["drawer_proj_backend_aaa", "drawer_proj_backend_bbb"],
content,
wing="proj",
room="backend",
)
assert lines, "should always emit at least one line"
for line in lines:
assert "→" in line, f"line missing pointer arrow: {line!r}"
parts = line.split("|")
assert len(parts) == 3, f"expected topic|entities|→refs, got {line!r}"
assert parts[2].startswith("→")
def test_extracts_section_headers_as_topics(self):
content = "# First Header\nbody\n## Second Header\nmore body"
lines = build_closet_lines("/x.md", ["d1"], content, "w", "r")
joined = "\n".join(lines).lower()
assert "first header" in joined
assert "second header" in joined
def test_entity_stoplist_filters_sentence_starters(self):
# "When", "After", "The" repeat 3+ times — old code would index them
# as entities. New code's stoplist drops them.
content = (
"When the pipeline ran, the result was good. "
"When the user logged in, the token was issued. "
"After the migration, the latency dropped. "
"After the rollback, the latency rose. "
"The new flow is stable. The audit cleared."
)
lines = build_closet_lines("/x.md", ["d1"], content, "w", "r")
# Entities sit between the two pipes
entity_segments = [line.split("|")[1] for line in lines]
for seg in entity_segments:
tokens = set(seg.split(";")) if seg else set()
assert "When" not in tokens
assert "After" not in tokens
assert "The" not in tokens
def test_real_proper_nouns_survive_stoplist(self):
content = (
"Igor reviewed the diff. Milla wrote the spec. "
"Igor pushed the fix. Milla approved the PR. "
"Igor and Milla shipped together."
)
lines = build_closet_lines("/x.md", ["d1"], content, "w", "r")
entity_segments = [line.split("|")[1] for line in lines]
joined_entities = ";".join(entity_segments)
assert "Igor" in joined_entities
assert "Milla" in joined_entities
def test_emits_fallback_line_when_nothing_extractable(self):
# No headers, no action verbs, no quotes, no repeated capitalized words
content = "lorem ipsum dolor sit amet consectetur adipiscing elit"
lines = build_closet_lines("/x/notes.txt", ["d1"], content, "wing", "room")
assert len(lines) == 1
assert "wing/room/notes" in lines[0]
assert "→d1" in lines[0]
def test_pointer_references_first_three_drawers(self):
ids = [f"drawer_{i}" for i in range(10)]
lines = build_closet_lines("/x.md", ids, "# A\n# B", "w", "r")
assert all("→drawer_0,drawer_1,drawer_2" in line for line in lines)
# ── upsert_closet_lines ───────────────────────────────────────────────
class TestUpsertClosetLines:
def test_overwrites_existing_closet_does_not_append(self, palace_path):
col = get_closets_collection(palace_path)
base = "closet_test_room_abc"
meta = {"wing": "test", "room": "room", "source_file": "/x.md"}
# First mine — three short lines.
upsert_closet_lines(col, base, ["alpha|;|→d1", "beta|;|→d2", "gamma|;|→d3"], meta)
first = col.get(ids=[f"{base}_01"])
assert "alpha" in first["documents"][0]
assert "beta" in first["documents"][0]
# Second mine — entirely different lines. Must replace, not append.
upsert_closet_lines(col, base, ["delta|;|→d4", "epsilon|;|→d5"], meta)
second = col.get(ids=[f"{base}_01"])
doc = second["documents"][0]
assert "delta" in doc
assert "epsilon" in doc
assert "alpha" not in doc, "old closet line leaked into rebuild"
assert "beta" not in doc
def test_packs_into_multiple_closets_without_splitting_lines(self, palace_path):
col = get_closets_collection(palace_path)
base = "closet_pack_room_def"
meta = {"wing": "test", "room": "room", "source_file": "/y.md"}
# Build lines that approach but never exceed the limit.
line = "x" * 600 # well under CLOSET_CHAR_LIMIT
n_written = upsert_closet_lines(col, base, [line, line, line, line], meta)
# 4 lines @ 600+1 chars = 2404 — should pack into 2 closets (≤1500 each)
assert n_written == 2
for i in range(1, n_written + 1):
doc = col.get(ids=[f"{base}_{i:02d}"])["documents"][0]
# Every line is intact (never split mid-line)
for chunk in doc.split("\n"):
assert len(chunk) == 600, f"line was truncated in closet {i}"
# Closet stays under the cap
assert len(doc) <= CLOSET_CHAR_LIMIT
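The packing behavior this test pins down (whole lines only, each closet at or under the cap) can be sketched as a greedy packer. This is a minimal illustration, not the shipped `upsert_closet_lines`: the function name `pack_lines` is hypothetical, and the 1500-character cap used in the usage note is taken from the test comment above rather than from the definition of `CLOSET_CHAR_LIMIT`.

```python
from typing import List


def pack_lines(lines: List[str], char_limit: int) -> List[str]:
    """Greedily pack whole lines into newline-joined documents.

    A line is never split. A single line longer than char_limit still gets
    its own document, so the cap is only guaranteed when every individual
    line fits within it.
    """
    docs: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in lines:
        # +1 accounts for the newline joining this line to the previous one.
        added = len(line) + (1 if current else 0)
        if current and current_len + added > char_limit:
            docs.append("\n".join(current))
            current, current_len = [], 0
            added = len(line)
        current.append(line)
        current_len += added
    if current:
        docs.append("\n".join(current))
    return docs
```

With four 600-character lines and a cap of 1500, this yields two documents of two lines each, matching the count the test expects.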
# ── purge_file_closets ────────────────────────────────────────────────
class TestPurgeFileClosets:
def test_deletes_only_the_targeted_source(self, palace_path):
col = get_closets_collection(palace_path)
col.upsert(
ids=["closet_a_01", "closet_b_01"],
documents=["a|;|→d1", "b|;|→d2"],
metadatas=[
{"source_file": "/keep.md", "wing": "w", "room": "r"},
{"source_file": "/drop.md", "wing": "w", "room": "r"},
],
)
purge_file_closets(col, "/drop.md")
remaining_ids = set(col.get()["ids"])
assert "closet_a_01" in remaining_ids
assert "closet_b_01" not in remaining_ids
# ── End-to-end rebuild via the project miner ──────────────────────────
class TestMinerClosetRebuild:
def test_remine_replaces_closets_completely(self, tmp_path):
import yaml
project = tmp_path / "proj"
project.mkdir()
(project / "mempalace.yaml").write_text(
yaml.dump({"wing": "proj", "rooms": [{"name": "general", "description": "x"}]})
)
target = project / "doc.md"
# First mine — long content produces multiple numbered closets.
first_topics = "\n\n".join(f"# Topic {i}\n" + ("filler text " * 30) for i in range(15))
target.write_text(first_topics)
palace = tmp_path / "palace"
mine(str(project), str(palace), wing_override="proj", agent="test")
col = get_closets_collection(str(palace))
first_pass = col.get(where={"source_file": str(target)})
assert first_pass["ids"], "first mine should have written closets"
first_ids = set(first_pass["ids"])
assert any("topic 0" in (d or "").lower() for d in first_pass["documents"])
# Touch mtime so file_already_mined doesn't short-circuit, and
# rewrite with fewer topics (so the rebuild produces fewer closets
# than the first run).
import os
import time
target.write_text("# Only Topic Now\n" + ("short body " * 5))
new_mtime = os.path.getmtime(target) + 60
os.utime(target, (new_mtime, new_mtime))
time.sleep(0.01) # ensure mtime delta is visible
mine(str(project), str(palace), wing_override="proj", agent="test")
col = get_closets_collection(str(palace))
second_pass = col.get(where={"source_file": str(target)})
second_docs = "\n".join(second_pass["documents"]).lower()
assert "only topic now" in second_docs
for i in range(15):
assert (
f"topic {i}\n" not in second_docs
), f"stale 'Topic {i}' from first mine survived the rebuild"
# Numbered closets that existed only in the larger first run must be gone.
leftover = first_ids - set(second_pass["ids"])
for stale_id in leftover:
assert not col.get(ids=[stale_id])[
"ids"
], f"orphan closet {stale_id} from larger first run survived purge"
# ── _extract_drawer_ids_from_closet ───────────────────────────────────
class TestExtractDrawerIds:
def test_parses_single_pointer(self):
assert _extract_drawer_ids_from_closet("topic|;|→drawer_x") == ["drawer_x"]
def test_parses_multiple_pointers_per_line(self):
line = "topic|ent|→drawer_a,drawer_b,drawer_c"
assert _extract_drawer_ids_from_closet(line) == [
"drawer_a",
"drawer_b",
"drawer_c",
]
def test_dedupes_across_lines(self):
doc = "one|;|→drawer_a,drawer_b\ntwo|;|→drawer_b,drawer_c"
assert _extract_drawer_ids_from_closet(doc) == [
"drawer_a",
"drawer_b",
"drawer_c",
]
def test_empty_doc_returns_empty(self):
assert _extract_drawer_ids_from_closet("") == []
assert _extract_drawer_ids_from_closet("no arrows here") == []
# ── search_memories closet-first path ────────────────────────────────
class TestSearchMemoriesClosetFirst:
def test_falls_back_to_direct_when_no_closets(self, palace_path, seeded_collection):
# seeded_collection populates only mempalace_drawers, not closets.
result = search_memories("JWT authentication", palace_path)
assert result["results"], "should still find drawer hits via fallback"
for hit in result["results"]:
assert hit.get("matched_via") == "drawer"
def test_closet_first_returns_chunk_level_hits(self, palace_path, seeded_collection):
# Build a closet that points at the JWT drawer specifically.
closets = get_closets_collection(palace_path)
closets.upsert(
ids=["closet_proj_backend_aaa_01"],
documents=["JWT auth tokens|;|→drawer_proj_backend_aaa"],
metadatas=[
{
"wing": "project",
"room": "backend",
"source_file": "auth.py",
}
],
)
result = search_memories("JWT authentication", palace_path)
assert result["results"], "closet-first search should hydrate the drawer"
top = result["results"][0]
assert top["matched_via"] == "closet"
# Must be the chunk-level drawer text, not a concatenation of every
# drawer in the file.
assert "JWT" in top["text"]
assert (
"Database migrations" not in top["text"]
), "closet path should not glue unrelated drawers together"
assert "closet_preview" in top
assert "→drawer_proj_backend_aaa" in top["closet_preview"]
def test_max_distance_filters_closet_hits(self, palace_path, seeded_collection):
closets = get_closets_collection(palace_path)
closets.upsert(
ids=["closet_proj_backend_aaa_01"],
documents=["JWT auth tokens|;|→drawer_proj_backend_aaa"],
metadatas=[
{
"wing": "project",
"room": "backend",
"source_file": "auth.py",
}
],
)
# max_distance=0.001 is essentially "must match exactly". The closet
# path should reject everything and the caller falls back to direct
# search (which also filters with the same threshold).
result = search_memories(
"completely unrelated query about quantum gardening",
palace_path,
max_distance=0.001,
)
# Either no results, or every result respected the threshold.
for hit in result["results"]:
assert hit["distance"] <= 0.001
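The pointer format exercised by `TestExtractDrawerIds` (`topic|entities|→id_a,id_b`, one line per topic) can be parsed with a short regular expression. The sketch below is a hypothetical re-implementation written to satisfy the same four test cases; the name `extract_drawer_ids` and the pattern are assumptions, and the real `_extract_drawer_ids_from_closet` may differ.

```python
import re
from typing import List

# Assumed pattern: an arrow followed by comma-separated word-character IDs.
_POINTER_RE = re.compile(r"→([\w,\-]+)")


def extract_drawer_ids(closet_doc: str) -> List[str]:
    """Return drawer IDs in first-seen order, deduplicated across lines."""
    seen = set()
    ordered: List[str] = []
    for match in _POINTER_RE.finditer(closet_doc):
        for drawer_id in match.group(1).split(","):
            if drawer_id and drawer_id not in seen:
                seen.add(drawer_id)
                ordered.append(drawer_id)
    return ordered
```

Preserving first-seen order matters because the search path ranks drawers by the order in which closets introduce them.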
+83
@@ -75,3 +75,86 @@ def test_mine_convos_does_not_reprocess_empty_chunk_files(capsys):
assert "Files skipped (already filed): 1" in out2
finally:
shutil.rmtree(tmpdir, ignore_errors=True)
def test_mine_convos_rebuilds_stale_drawers_after_schema_bump(capsys):
"""When stored drawers have an older normalize_version, the next mine
silently purges them and refiles — no manual erase required.
This is what makes the strip_noise upgrade apply to existing corpora:
users just run `mempalace mine` again and old noise-filled drawers get
replaced with clean ones."""
from mempalace.palace import NORMALIZE_VERSION
tmpdir = tempfile.mkdtemp()
try:
convo_path = Path(tmpdir) / "chat.txt"
convo_path.write_text(
"> What is memory?\nMemory is persistence.\n\n"
"> Why does it matter?\nIt enables continuity.\n\n"
"> How do we build it?\nWith structured storage.\n"
)
palace_path = os.path.join(tmpdir, "palace")
# First mine — stamps drawers with NORMALIZE_VERSION
mine_convos(tmpdir, palace_path, wing="test")
capsys.readouterr()
client = chromadb.PersistentClient(path=palace_path)
col = client.get_collection("mempalace_drawers")
resolved = str(Path(tmpdir).resolve() / "chat.txt")
first_pass = col.get(where={"source_file": resolved})
first_ids = set(first_pass["ids"])
assert first_ids, "first mine should produce drawers"
for meta in first_pass["metadatas"]:
assert meta.get("normalize_version") == NORMALIZE_VERSION
# Simulate pre-v2 drawers: rewrite metadata to an older version,
# and replace content with "noise" so we can see it get cleaned up.
stale_metas = []
for meta in first_pass["metadatas"]:
stale = dict(meta)
stale["normalize_version"] = 1
stale_metas.append(stale)
col.update(
ids=list(first_pass["ids"]),
documents=["STALE NOISE"] * len(first_pass["ids"]),
metadatas=stale_metas,
)
# Add an extra orphan drawer that should also be purged.
col.add(
ids=["orphan_drawer"],
documents=["OLD ORPHAN"],
metadatas=[
{
"wing": "test",
"room": "default",
"source_file": resolved,
"chunk_index": 999,
"normalize_version": 1,
}
],
)
del col, client
# Second mine — version gate should trigger rebuild
mine_convos(tmpdir, palace_path, wing="test")
out = capsys.readouterr().out
assert (
"Files skipped (already filed): 0" in out
), "stale drawers should force a rebuild, not a skip"
client = chromadb.PersistentClient(path=palace_path)
col = client.get_collection("mempalace_drawers")
rebuilt = col.get(where={"source_file": resolved})
# Orphan is gone
assert "orphan_drawer" not in rebuilt["ids"]
# No stale content survived
assert all("STALE NOISE" not in d for d in rebuilt["documents"])
assert all("OLD ORPHAN" not in d for d in rebuilt["documents"])
# All rebuilt drawers carry the current version
for meta in rebuilt["metadatas"]:
assert meta.get("normalize_version") == NORMALIZE_VERSION
del col, client
finally:
shutil.rmtree(tmpdir, ignore_errors=True)
+90 -4
@@ -7,7 +7,7 @@ import chromadb
import yaml
from mempalace.miner import mine, scan_project, status
-from mempalace.palace import file_already_mined
+from mempalace.palace import NORMALIZE_VERSION, file_already_mined
def write_file(path: Path, content: str):
@@ -227,11 +227,17 @@ def test_file_already_mined_check_mtime():
assert file_already_mined(col, test_file) is False
assert file_already_mined(col, test_file, check_mtime=True) is False
-# Add it with mtime
+# Add it with mtime + current normalize_version
col.add(
ids=["d1"],
documents=["hello world"],
-metadatas=[{"source_file": test_file, "source_mtime": str(mtime)}],
+metadatas=[
+{
+"source_file": test_file,
+"source_mtime": str(mtime),
+"normalize_version": NORMALIZE_VERSION,
+}
+],
)
# Already mined (no mtime check)
@@ -253,7 +259,12 @@ def test_file_already_mined_check_mtime():
col.add(
ids=["d2"],
documents=["other"],
-metadatas=[{"source_file": "/fake/no_mtime.txt"}],
+metadatas=[
+{
+"source_file": "/fake/no_mtime.txt",
+"normalize_version": NORMALIZE_VERSION,
+}
+],
)
assert file_already_mined(col, "/fake/no_mtime.txt", check_mtime=True) is False
finally:
@@ -296,3 +307,78 @@ def test_status_missing_palace_does_not_create_empty_collection(tmp_path, capsys
out = capsys.readouterr().out
assert "No palace found" in out
assert not palace_path.exists()
# ── normalize_version schema gate ───────────────────────────────────────
#
# When the normalization pipeline changes shape (e.g., strip_noise lands),
# `NORMALIZE_VERSION` is bumped so pre-existing drawers can be silently
# rebuilt on the next mine. These tests pin that contract.
def test_file_already_mined_returns_false_for_stale_normalize_version():
"""Pre-v2 drawers (no field, or older integer) must not short-circuit."""
tmpdir = tempfile.mkdtemp()
try:
palace_path = os.path.join(tmpdir, "palace")
os.makedirs(palace_path)
client = chromadb.PersistentClient(path=palace_path)
col = client.get_or_create_collection("mempalace_drawers")
# Pre-v2 drawer: no normalize_version field at all
col.add(
ids=["d_old"],
documents=["old"],
metadatas=[{"source_file": "/fake/old.jsonl"}],
)
assert file_already_mined(col, "/fake/old.jsonl") is False
# Explicitly older version
col.add(
ids=["d_v1"],
documents=["v1"],
metadatas=[{"source_file": "/fake/v1.jsonl", "normalize_version": 1}],
)
assert file_already_mined(col, "/fake/v1.jsonl") is False
# Current version — short-circuits
col.add(
ids=["d_current"],
documents=["cur"],
metadatas=[
{
"source_file": "/fake/current.jsonl",
"normalize_version": NORMALIZE_VERSION,
}
],
)
assert file_already_mined(col, "/fake/current.jsonl") is True
finally:
del col, client
shutil.rmtree(tmpdir, ignore_errors=True)
def test_add_drawer_stamps_normalize_version(tmp_path):
"""Fresh drawers carry the current schema version so future upgrades work."""
from mempalace.miner import add_drawer
palace_path = tmp_path / "palace"
palace_path.mkdir()
client = chromadb.PersistentClient(path=str(palace_path))
col = client.get_or_create_collection("mempalace_drawers")
try:
added = add_drawer(
collection=col,
wing="test",
room="notes",
content="hello",
source_file=str(tmp_path / "src.md"),
chunk_index=0,
agent="unit",
)
assert added is True
stored = col.get(limit=1)
meta = stored["metadatas"][0]
assert meta["normalize_version"] == NORMALIZE_VERSION
finally:
del col, client
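The schema gate these tests describe (a missing or older `normalize_version` must not short-circuit mining) reduces to a simple predicate over stored metadata. The sketch below is an assumption-laden illustration: `is_up_to_date` is a hypothetical name, `CURRENT_VERSION = 2` is inferred from the "pre-v2" wording and is not confirmed by this diff, and the real `file_already_mined` may inspect only one record rather than all of them.

```python
from typing import Iterable, Mapping

CURRENT_VERSION = 2  # hypothetical stand-in for NORMALIZE_VERSION


def is_up_to_date(
    metadatas: Iterable[Mapping],
    current_version: int = CURRENT_VERSION,
) -> bool:
    """True only if records exist and every one carries current_version.

    Records with no normalize_version field compare as None and therefore
    count as stale, which forces a rebuild on the next mine.
    """
    records = list(metadatas)
    if not records:
        return False
    return all(m.get("normalize_version") == current_version for m in records)
```

Treating a missing field as stale means drawers written before the field existed are rebuilt without any manual migration step.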
+146
@@ -13,6 +13,7 @@ from mempalace.normalize import (
_try_normalize_json,
_try_slack_json,
normalize,
+strip_noise,
)
@@ -1048,3 +1049,148 @@ def test_normalize_rejects_large_file():
assert False, "Should have raised IOError"
except IOError as e:
assert "too large" in str(e).lower()
# ── strip_noise() — verbatim-safety boundary tests ─────────────────────
#
# The "Verbatim always" design principle requires that we never delete
# user-authored text. These tests pin down the boundary between system
# noise (which we strip) and user prose that happens to mention the same
# strings (which must survive untouched).
class TestStripNoisePreservesUserContent:
"""User prose that mentions noise strings inline must be preserved."""
def test_user_discusses_stop_hook_in_prose(self):
# Regression: original regex with IGNORECASE + `.*\n?` ate the second
# sentence from real user commentary.
text = (
"> User:\n"
"> Our CI has a stop hook that rejects merges after 5pm. "
"Ran 2 stop hooks last week.\n"
"> Assistant:\n"
"> Got it."
)
assert strip_noise(text) == text.strip()
def test_user_mentions_system_reminder_inline(self):
# Inline <system-reminder> tags inside user prose (e.g. documenting
# Claude Code behavior) must not be stripped.
text = (
"> User:\n"
"> Here is what Claude Code emits: "
"<system-reminder>Auto-save reminder...</system-reminder>"
" — I want to ignore it."
)
assert strip_noise(text) == text.strip()
def test_ctrl_o_hint_in_prose_preserved(self):
# Regression: original `.*\(ctrl\+o to expand\).*\n?` nuked the whole
# line whenever a user documented the TUI shortcut.
text = (
"> User:\n"
"> In the TUI you hit (ctrl+o to expand) to see more. "
"That is the shortcut I want to document."
)
assert strip_noise(text) == text.strip()
def test_current_time_inline_in_prose(self):
text = "> User:\n> At CURRENT TIME: the meeting starts, not before."
assert strip_noise(text) == text.strip()
def test_plus_n_lines_marker_inline(self):
text = "> User:\n> The log showed … +50 lines of stack trace, useful."
assert strip_noise(text) == text.strip()
def test_dangling_open_tag_does_not_span_messages(self):
# THE span-eating bug: a stray unclosed <system-reminder> in one
# message must NOT merge with a closing tag in another message and
# silently delete everything in between.
text = (
"> User 1: normal content <system-reminder>A\n"
"> Assistant: reply\n"
"> User 2: more content</system-reminder> tail"
)
out = strip_noise(text)
assert "Assistant: reply" in out
assert "User 2: more content" in out
assert "User 1: normal content" in out
class TestStripNoiseRemovesSystemChrome:
"""System-injected noise with standalone/line-anchored shape must be stripped."""
def test_strips_line_anchored_system_reminder_block(self):
text = (
"> User:\n"
"<system-reminder>\n"
"Auto-save reminder...\n"
"</system-reminder>\n"
"> Real message."
)
out = strip_noise(text)
assert "system-reminder" not in out
assert "Auto-save reminder" not in out
assert "Real message." in out
def test_strips_system_reminder_with_blockquote_prefix(self):
# _messages_to_transcript prefixes lines with "> ", so the line
# anchor must also accept that shape.
text = "> User:\n" "> <system-reminder>Injected noise</system-reminder>\n" "> Real message."
out = strip_noise(text)
assert "Injected noise" not in out
assert "Real message." in out
def test_strips_standalone_ran_hook_line(self):
text = "Ran 2 Stop hook\n> User: real content"
out = strip_noise(text)
assert "Ran 2 Stop hook" not in out
assert "real content" in out
def test_strips_known_hook_names(self):
for hook in ("Stop", "PreCompact", "PreToolUse", "PostToolUse", "UserPromptSubmit"):
text = f"Ran 1 {hook} hook\n> User: content"
assert hook not in strip_noise(text)
def test_strips_current_time_standalone(self):
text = "CURRENT TIME: 2026-04-13 10:00 UTC\n> User: Hello"
out = strip_noise(text)
assert "CURRENT TIME" not in out
assert "Hello" in out
def test_strips_collapsed_lines_marker(self):
text = "… +42 lines\n> User: Hello"
out = strip_noise(text)
assert "+42 lines" not in out
assert "Hello" in out
def test_strips_token_count_ctrl_o_chrome(self):
# Claude Code's actual collapsed-output chrome: "[N tokens] (ctrl+o to expand)"
text = "> Assistant: some output [5 tokens] (ctrl+o to expand)\n> User: ok"
out = strip_noise(text)
assert "(ctrl+o to expand)" not in out
assert "[5 tokens]" not in out
assert "some output" in out
def test_strips_each_known_noise_tag(self):
for tag in (
"system-reminder",
"command-message",
"command-name",
"task-notification",
"user-prompt-submit-hook",
"hook_output",
):
text = f"> User:\n<{tag}>junk</{tag}>\n> Real."
out = strip_noise(text)
assert tag not in out, f"{tag} leaked into output"
assert "Real." in out
def test_collapses_excessive_blank_lines(self):
text = "line one\n\n\n\n\n\nline two"
out = strip_noise(text)
assert "line one" in out
assert "line two" in out
# Should collapse to no more than 3 newlines
assert "\n\n\n\n" not in out
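The boundary these tests draw (strip noise only when it occupies a whole line, leave the same words alone inside user prose) comes down to anchoring the pattern at both ends of the line. The sketch below shows the idea for one noise shape plus blank-line collapsing. The function name and both patterns are hypothetical; the actual `strip_noise` rules are not shown in this diff and cover more cases, including blockquote-prefixed tags.

```python
import re

# Assumed pattern: the whole line must be "Ran <N> <Name> hook(s)".
# MULTILINE makes ^ and $ match at line boundaries, so the same words
# embedded in a longer sentence are left alone.
_HOOK_LINE = re.compile(r"^Ran \d+ \w+ hooks?[ \t]*$\n?", re.MULTILINE)
_BLANK_RUN = re.compile(r"\n{4,}")


def strip_standalone_hook_lines(text: str) -> str:
    """Remove standalone hook-report lines and cap runs of newlines at three."""
    text = _HOOK_LINE.sub("", text)
    text = _BLANK_RUN.sub("\n\n\n", text)
    return text.strip()
```

Anchoring with `$` as well as `^` is the important part: a pattern that only anchors the start would still cut the beginning off a user sentence that happens to open with the same words.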