merge: develop + harden entity metadata, BM25, and diary ingest for production
Merges develop (closet hardening #826, strip_noise #785, lock #784) and replaces every sub-feature in this PR with a correct, tested implementation. Shippable now.

## 1. Real Okapi-BM25 (searcher.py)

The prior `_bm25_score()` hardcoded `idf = log(2.0)` for every term: it was really a scaled TF, not BM25, and couldn't tell a discriminative term from a generic one. Replaced with `_bm25_scores(query, documents)` that computes proper IDF over the provided candidate corpus using the Lucene smoothed formula `log((N - df + 0.5) / (df + 0.5) + 1)`. Well-defined for re-ranking vector-retrieval candidates: IDF there measures how discriminative each term is *within the candidate set*, exactly the signal we want.

`_hybrid_rank` also fixed:
- Vector normalization is now absolute `max(0, 1 - dist)`, not `1 - dist/max_dist`, so adding/removing a candidate no longer reshuffles the others.
- BM25 is min-max normalized within candidates (bounded [0, 1]).
- Closet path now re-ranks too (was previously returning closet-order hits without hybrid scoring).
- `_hybrid_score` internal field stripped from output; `bm25_score` exposed for debugging.

## 2. Entity metadata (miner.py)

- Reuses `_ENTITY_STOPLIST` from palace.py so sentence-starters like "When", "After", "The" no longer land as entities (regression test covers this).
- Known-entity registry is cached at module level, keyed by the registry file's mtime, so there is no more disk read per drawer.
- File handle now uses a context manager.
- Truncates the entity LIST (to 25) before joining, so a name is never split in the middle.

## 3. Diary ingest (diary_ingest.py)

- State file now lives at `~/.mempalace/state/diary_ingest_<hash>.json`, keyed by (palace_path, diary_dir). No more pollution of the user's content directory.
- Drawer IDs now hash `(wing, date_str)`, so a user with personal + work diaries on the same day no longer silently clobbers.
- Each day's upsert runs inside `mine_lock(source_file)` so concurrent ingest from two terminals can't race.
- `force=True` now calls `purge_file_closets` before rebuild so leftover numbered closets from a longer prior day don't orphan.

## 4. Tests (tests/test_closets.py)

Merged this PR's MineLock/Entity/BM25/Diary tests with develop's hardened Build/Upsert/Purge/Rebuild/SearchClosetFirst tests. Added specific regression tests for every fix above:
- entity stoplist applies (no "When/After/The")
- entity list capped before join (no partial tokens)
- registry cached by mtime (mock-verified zero re-reads)
- BM25 IDF downweights terms present in every doc (real BM25 evidence)
- hybrid rank absolute normalization stable against outliers
- diary state file outside user's diary dir
- diary wing-prefixed IDs prevent cross-wing date collisions

35/35 closet tests pass; full suite 743/743. ruff + format clean under CI-pinned 0.4.x.
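The BM25 and hybrid-ranking behaviour described in section 1 can be sketched roughly as below. This is a minimal illustration, not the code in searcher.py: the tokenizer, the `k1`/`b` defaults, the 0.6/0.4 weights, and the names `bm25_scores` and `hybrid_rank` are all assumptions standing in for `_bm25_scores` and `_hybrid_rank`.

```python
import math
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the real tokenizer may differ)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Okapi BM25 per document, with IDF computed over `documents` only."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n or 1.0
    terms = set(tokenize(query))
    # Document frequency of each query term within the candidate set.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    # Lucene smoothed IDF from the commit message; never negative.
    idf = {t: math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in terms}
    scores = []
    for d in docs:
        score = 0.0
        for t in terms:
            tf = d.count(t)
            if tf:
                length_norm = k1 * (1 - b + b * len(d) / avgdl)
                score += idf[t] * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


def hybrid_rank(query, candidates, vector_weight=0.6, bm25_weight=0.4):
    """Rank (text, distance) pairs; the weights are illustrative assumptions."""
    raw = bm25_scores(query, [text for text, _ in candidates])
    lo, hi = (min(raw), max(raw)) if raw else (0.0, 0.0)
    span = hi - lo
    scored = []
    for (text, dist), r in zip(candidates, raw):
        vec_sim = max(0.0, 1.0 - dist)  # absolute: independent of other candidates
        bm25_norm = (r - lo) / span if span > 0 else 0.0  # min-max within candidates
        scored.append((vector_weight * vec_sim + bm25_weight * bm25_norm, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]
```

With candidates such as `"the cat sat"`, `"the lock file"`, and `"the dog ran"`, the term "the" appears in every document and gets a small IDF, while "lock" appears in one and dominates the score, which is the discrimination a constant `log(2.0)` could not provide.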
+110
-74
@@ -2,10 +2,14 @@
 diary_ingest.py — Ingest daily summary files into the palace.
 
 Architecture:
-- ONE drawer per day — full verbatim content, upserted as the day grows
-- Closets pack topics up to 1500 chars, never split mid-topic
-- Only new entries are processed (tracks entry count in state file)
-- Entities extracted and stamped on metadata for filterable search
+- ONE drawer per (wing, day) — full verbatim content, upserted as the day grows.
+- Closets pack topics up to CLOSET_CHAR_LIMIT, never split mid-topic.
+- A re-ingest fully purges the prior day's closets before rebuilding so a
+  shorter day never leaves orphans behind.
+- Only new entries are processed by default (tracks entry count in a state
+  file under ``~/.mempalace/state/`` — never inside the user's diary dir).
+- Per-file ``mine_lock`` so concurrent ingest from two terminals can't race.
+- Entities extracted and stamped on metadata for filterable search.
 
 Usage:
     python -m mempalace.diary_ingest --dir ~/daily_summaries --palace ~/.mempalace/palace
@@ -19,19 +23,32 @@ import re
 from datetime import datetime, timezone
 from pathlib import Path
 
-from .palace import (
-    get_collection,
-    get_closets_collection,
-    build_closet_lines,
-    upsert_closet_lines,
-    CLOSET_CHAR_LIMIT,
-)
 from .miner import _extract_entities_for_metadata
-
+from .palace import (
+    build_closet_lines,
+    get_closets_collection,
+    get_collection,
+    mine_lock,
+    purge_file_closets,
+    upsert_closet_lines,
+)
 
 DIARY_ENTRY_RE = re.compile(r"^## .+", re.MULTILINE)
 
 
+def _state_file_for(palace_path: str, diary_dir: Path) -> Path:
+    """Return the per-(palace, diary-dir) state-file path under ~/.mempalace/state.
+
+    Keyed by sha256 of (palace_path, diary_dir) so multiple diary folders
+    pointing at the same palace each get an independent state file. The
+    state file is *never* written inside the user's diary directory.
+    """
+    state_root = Path(os.path.expanduser("~")) / ".mempalace" / "state"
+    state_root.mkdir(parents=True, exist_ok=True)
+    key = hashlib.sha256(f"{palace_path}|{diary_dir}".encode()).hexdigest()[:24]
+    return state_root / f"diary_ingest_{key}.json"
+
+
 def _split_entries(text):
     """Split diary text into (header, body) pairs per ## entry."""
     parts = DIARY_ENTRY_RE.split(text)
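To see why keying the state file on both the palace path and the diary directory keeps diary folders independent, here is a small sketch of the same derivation. It assumes a hypothetical `state_root` parameter (the real `_state_file_for` always uses `~/.mempalace/state` and creates it), so the example never touches the home directory.

```python
import hashlib
from pathlib import Path


def state_file_for(palace_path, diary_dir, state_root):
    """Derive a state-file path from (palace_path, diary_dir).

    `state_root` is a hypothetical parameter added for illustration only.
    """
    # Same keying as the diff: sha256 of "palace|diary_dir", first 24 hex chars.
    key = hashlib.sha256(f"{palace_path}|{diary_dir}".encode()).hexdigest()[:24]
    return Path(state_root) / f"diary_ingest_{key}.json"


work = state_file_for("/palace", "/home/u/work_diary", "/tmp/state")
personal = state_file_for("/palace", "/home/u/personal_diary", "/tmp/state")
print(work.name != personal.name)  # two diary folders, two independent state files
```

Because the key is a pure function of its inputs, re-running ingest on the same folder finds the same state file, while a second diary folder pointed at the same palace gets its own.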
@@ -43,6 +60,18 @@ def _split_entries(text):
     return entries
 
 
+def _diary_drawer_id(wing: str, date_str: str) -> str:
+    """Stable, wing-scoped drawer ID. Two diaries (e.g. 'work' vs 'personal')
+    sharing the same date never collide."""
+    suffix = hashlib.sha256(f"{wing}|{date_str}".encode()).hexdigest()[:24]
+    return f"drawer_diary_{suffix}"
+
+
+def _diary_closet_id_base(wing: str, date_str: str) -> str:
+    suffix = hashlib.sha256(f"{wing}|{date_str}".encode()).hexdigest()[:24]
+    return f"closet_diary_{suffix}"
+
+
 def ingest_diaries(
     diary_dir,
     palace_path,
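The collision that wing-scoped IDs remove can be shown in a few lines. The sketch below contrasts the previous date-only ID with the hashed `(wing, date_str)` ID; `legacy_drawer_id` and `drawer_id` are illustrative names, not functions from the module.

```python
import hashlib


def legacy_drawer_id(date_str):
    """Previous scheme: the ID depends on the date alone."""
    return f"drawer_diary_{date_str}"


def drawer_id(wing, date_str):
    """Wing-scoped scheme, mirroring the hashing used in the diff."""
    suffix = hashlib.sha256(f"{wing}|{date_str}".encode()).hexdigest()[:24]
    return f"drawer_diary_{suffix}"


date = "2024-05-01"
# Old IDs ignore the wing, so a work and a personal diary share one drawer.
print(legacy_drawer_id(date) == legacy_drawer_id(date))
# New IDs differ per wing, so both drawers survive an upsert.
print(drawer_id("work", date) != drawer_id("personal", date))
```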
@@ -51,24 +80,29 @@ def ingest_diaries(
 ):
     """Ingest daily summary files into the palace.
 
-    Each date file gets ONE drawer (upserted as day grows) and
-    closets that pack topics atomically up to 1500 chars.
+    Each date file gets ONE drawer keyed by ``(wing, date)`` and closets that
+    pack topics atomically up to ``CLOSET_CHAR_LIMIT``. ``force=True`` rebuilds
+    every entry's closets from scratch (purging stale ones); the default
+    incremental mode only processes entries appended since the last run.
     """
     diary_dir = Path(diary_dir).expanduser().resolve()
     if not diary_dir.exists():
         print(f"Diary directory not found: {diary_dir}")
-        return
+        return {"days_updated": 0, "closets_created": 0}
 
     diary_files = sorted(diary_dir.glob("*.md"))
     if not diary_files:
         print(f"No .md files in {diary_dir}")
-        return
+        return {"days_updated": 0, "closets_created": 0}
 
-    # State tracks which entries have been closeted per file
-    state_file = diary_dir / ".diary_ingest_state.json"
-    state = {} if force else (
-        json.loads(state_file.read_text()) if state_file.exists() else {}
-    )
+    state_file = _state_file_for(str(palace_path), diary_dir)
+    if force or not state_file.exists():
+        state: dict = {}
+    else:
+        try:
+            state = json.loads(state_file.read_text())
+        except Exception:
+            state = {}
 
     drawers_col = get_collection(palace_path)
     closets_col = get_closets_collection(palace_path)
@@ -87,70 +121,72 @@ def ingest_diaries(
         date_str = date_match.group(1)
 
         # Skip if content hasn't changed
-        prev_size = state.get(diary_path.name, {}).get("size", 0)
+        state_key = f"{wing}|{diary_path.name}"
+        prev_size = state.get(state_key, {}).get("size", 0)
         curr_size = len(text)
         if curr_size == prev_size and not force:
             continue
 
         now_iso = datetime.now(timezone.utc).isoformat()
-        drawer_id = f"drawer_diary_{date_str}"
-
-        # Extract entities from full day text
+        drawer_id = _diary_drawer_id(wing, date_str)
         entities = _extract_entities_for_metadata(text)
+        source_file = str(diary_path)
 
-        # UPSERT the day's drawer (full verbatim, replaces as day grows)
-        drawer_meta = {
-            "date": date_str,
-            "wing": wing,
-            "room": "daily",
-            "source_file": str(diary_path),
-            "source_session": "daily_diary",
-            "filed_at": now_iso,
-        }
-        if entities:
-            drawer_meta["entities"] = entities
-        drawers_col.upsert(
-            documents=[text],
-            ids=[drawer_id],
-            metadatas=[drawer_meta],
-        )
+        # Serialize per source — two terminals running ingest at once must
+        # not interleave the upsert + closet-rebuild.
+        with mine_lock(source_file):
+            drawer_meta = {
+                "date": date_str,
+                "wing": wing,
+                "room": "daily",
+                "source_file": source_file,
+                "source_session": "daily_diary",
+                "filed_at": now_iso,
+            }
+            if entities:
+                drawer_meta["entities"] = entities
+            drawers_col.upsert(
+                documents=[text],
+                ids=[drawer_id],
+                metadatas=[drawer_meta],
+            )
 
-        # Split into entries and find new ones
-        entries = _split_entries(text)
-        prev_entry_count = state.get(diary_path.name, {}).get("entry_count", 0)
-        new_entries = entries[prev_entry_count:] if not force else entries
+            entries = _split_entries(text)
+            prev_entry_count = state.get(state_key, {}).get("entry_count", 0)
+            new_entries = entries if force else entries[prev_entry_count:]
 
-        if new_entries:
-            # Build closet lines from new entries
-            all_lines = []
-            for header, body in new_entries:
-                entry_text = f"{header}\n{body}"
-                entry_lines = build_closet_lines(
-                    str(diary_path), [drawer_id], entry_text, wing, "daily"
-                )
-                all_lines.extend(entry_lines)
+            if new_entries:
+                all_lines = []
+                for header, body in new_entries:
+                    entry_text = f"{header}\n{body}"
+                    entry_lines = build_closet_lines(
+                        source_file, [drawer_id], entry_text, wing, "daily"
+                    )
+                    all_lines.extend(entry_lines)
 
-            if all_lines:
-                closet_id_base = f"closet_diary_{date_str}"
-                closet_meta = {
-                    "date": date_str,
-                    "wing": wing,
-                    "room": "daily",
-                    "source_file": str(diary_path),
-                    "filed_at": now_iso,
-                }
-                if entities:
-                    closet_meta["entities"] = entities
-                n = upsert_closet_lines(
-                    closets_col, closet_id_base, all_lines, closet_meta
-                )
-                closets_created += n
+                if all_lines:
+                    closet_id_base = _diary_closet_id_base(wing, date_str)
+                    closet_meta = {
+                        "date": date_str,
+                        "wing": wing,
+                        "room": "daily",
+                        "source_file": source_file,
+                        "filed_at": now_iso,
+                    }
+                    if entities:
+                        closet_meta["entities"] = entities
+                    # On a force rebuild, wipe any leftover numbered closets
+                    # from a longer prior run before re-writing.
+                    if force:
+                        purge_file_closets(closets_col, source_file)
+                    n = upsert_closet_lines(closets_col, closet_id_base, all_lines, closet_meta)
+                    closets_created += n
 
-        state[diary_path.name] = {
-            "size": curr_size,
-            "entry_count": len(entries),
-            "ingested_at": now_iso,
-        }
+        state[state_key] = {
+            "size": curr_size,
+            "entry_count": len(entries),
+            "ingested_at": now_iso,
+        }
         days_updated += 1
 
     state_file.write_text(json.dumps(state, indent=2))
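The incremental behaviour in this hunk (process only entries appended since the last run unless `force` is set) can be illustrated in isolation. The sketch below is an assumption-laden stand-in: the full body of `_split_entries` is not shown in the diff, so `split_entries` here is a guess at its behaviour, and `select_new_entries` is a hypothetical helper that mirrors the `entry_count` bookkeeping without touching any collection.

```python
import re

ENTRY_RE = re.compile(r"^## .+", re.MULTILINE)


def split_entries(text):
    """Split diary text into (header, body) pairs per '## ' heading.

    A guess at what _split_entries does; the real implementation may differ.
    """
    headers = ENTRY_RE.findall(text)
    bodies = ENTRY_RE.split(text)[1:]  # drop any preamble before the first entry
    return [(h, b.strip()) for h, b in zip(headers, bodies)]


def select_new_entries(text, state, state_key, force=False):
    """Return the entries to closet on this run and record the new count."""
    entries = split_entries(text)
    prev_count = state.get(state_key, {}).get("entry_count", 0)
    new_entries = entries if force else entries[prev_count:]
    state[state_key] = {"size": len(text), "entry_count": len(entries)}
    return new_entries


state = {}
day = "# 2024-05-01\n\n## Morning\nstandup\n\n## Noon\nlunch\n"
first = select_new_entries(day, state, "work|2024-05-01.md")
day += "\n## Night\nretro\n"
second = select_new_entries(day, state, "work|2024-05-01.md")
print(len(first), len(second))  # entries handled on each run
```

One consequence of slicing by count is that an edit to an earlier entry is not re-closeted on an incremental run; a `force=True` run is what rebuilds everything.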