The primary miners (miner.py, convo_miner.py) operate at file
granularity and can drop data for several reasons: size caps, silent
OSError on read, dedup false positives, extensions the project miner
does not recognize. Even with tonight's hotfixes, any future bug in
the file-level path risks silent data loss.
The sweeper is a second, cooperating miner that works at MESSAGE
granularity:
- Parses Claude Code .jsonl line by line, yielding only
user/assistant records (filters progress, file-history-snapshot,
etc. noise).
- For each session_id, queries the palace for max(timestamp) and
treats that as the cursor.
- Ingests only messages newer than the cursor, as one small drawer
per exchange (never hits a size cap — each drawer is 1-5 KB).
- Deterministic drawer IDs from session_id + message UUID make
reruns idempotent; crash mid-sweep is safe.
Tandem coordination is free: if the primary miner committed up to
timestamp T, the sweeper resumes from T. If the primary miner missed
everything, the sweeper catches it all. Neither duplicates the other.
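The sweep described above can be sketched roughly as follows. This is a minimal illustration, not the sweeper's actual code: the record field names (`type`, `uuid`, `sessionId`, `timestamp`), the `sweep_` ID prefix, and the helper names are assumptions made for the example, and the cursor is passed in rather than queried from the palace.

```python
import hashlib
import json
from typing import Iterable, Iterator, Optional

KEEP_TYPES = {"user", "assistant"}  # assumed record types worth ingesting


def iter_messages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield user/assistant records from .jsonl lines, skipping noise records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a torn line should not abort the whole sweep
        if isinstance(record, dict) and record.get("type") in KEEP_TYPES:
            yield record


def drawer_id(session_id: str, message_uuid: str) -> str:
    """Same inputs always give the same ID, so a rerun cannot add duplicates."""
    digest = hashlib.sha256(f"{session_id}:{message_uuid}".encode("utf-8")).hexdigest()
    return f"sweep_{digest[:24]}"  # hypothetical ID format


def newer_than_cursor(records: Iterable[dict], cursor: Optional[str]) -> Iterator[dict]:
    """Keep records strictly newer than the cursor.

    Assumes ISO-8601 timestamps in one consistent format, which sort lexically.
    """
    for record in records:
        if cursor is None or record["timestamp"] > cursor:
            yield record


lines = [
    '{"type": "progress", "timestamp": "2026-01-01T00:00:00Z"}',
    '{"type": "user", "uuid": "u1", "sessionId": "s1", "timestamp": "2026-01-01T00:00:01Z"}',
    '{"type": "assistant", "uuid": "u2", "sessionId": "s1", "timestamp": "2026-01-01T00:00:02Z"}',
]
pending = list(newer_than_cursor(iter_messages(lines), cursor="2026-01-01T00:00:01Z"))
```

With the cursor at the first user message, only the later assistant record remains in `pending`. A cursor of `None` (nothing committed yet) lets every user/assistant record through.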
Smoke test on a real Claude Code transcript:
    1st run: +39 drawers, 0 already present
    2nd run: +0 drawers, 39 already present (perfect idempotence)
Opt-in via:
    mempalace sweep <file.jsonl>
    mempalace sweep <transcript-dir>
No changes to existing miners. No schema migration. Purely additive.
Tests: tests/test_sweeper.py (7 tests covering parsing, tandem
coordination, idempotency, resume-from-cursor, metadata correctness).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the miner.py fix in this same branch. convo_miner.py had the
exact same 10 MB cap at line 58 that silently dropped long transcripts
via continue. Long Claude Code sessions, multi-year ChatGPT exports,
and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern,
different file.
Raised to 500 MB to match miner.py for consistency; downstream chunking
means source file size does not affect storage or embedding cost.
Tests: tests/test_convo_miner_size_cap.py (1 test)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long Claude Code sessions routinely produce transcripts larger than 10
MB. The previous cap at miner.py:65 silently dropped them at line 732
with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — same
silent-failure pattern as the .jsonl extension bug.
The cap exists as a safety rail against pathological binaries, not as
a limit on legitimate text. Downstream chunking at 800 chars per drawer
means source file size does not affect storage or embedding cost.
500 MB leaves headroom for year-long continuous transcripts while still
catching accidental multi-GB binary mines.
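For illustration, a size check that keeps the 500 MB rail but reports what it skips might look like the sketch below. The logging and the helper names (`size_skip_reason`, `should_mine`) are assumptions for the example and are not part of this change, which only raises the cap.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("mempalace.miner")  # logger name is an assumption

MAX_FILE_SIZE = 500 * 1024 * 1024  # 500 MB, the new cap described above


def size_skip_reason(size_bytes: int, max_size: int = MAX_FILE_SIZE) -> Optional[str]:
    """Return a human-readable reason if the file is over the cap, else None."""
    if size_bytes > max_size:
        return f"{size_bytes} bytes exceeds cap of {max_size} bytes"
    return None


def should_mine(filepath: Path, max_size: int = MAX_FILE_SIZE) -> bool:
    """Hypothetical wrapper: skip oversized files, but leave a log line behind."""
    reason = size_skip_reason(filepath.stat().st_size, max_size)
    if reason is not None:
        logger.warning("skipping %s: %s", filepath, reason)
        return False
    return True
```

An 11 MB transcript that the old 10 MB cap dropped now passes, while a multi-GB binary is still rejected and leaves a warning.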
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not
`.jsonl`. Every jsonl file encountered in a mined directory was
silently skipped at miner.py:722:
    if filepath.suffix.lower() not in READABLE_EXTENSIONS:
        continue
Claude Code transcripts, ChatGPT exports, and every other tool writing
line-delimited JSON ship as `.jsonl`. Users running `mempalace mine`
against a directory of transcripts saw the command complete with no
error and no log line — and their conversations never reached the
palace. Silent data loss.
This adds `.jsonl` to the whitelist alongside `.json`. JSONL is plain
line-oriented text, so the existing chunking pipeline handles it the
same way it handles any other text file.
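A minimal sketch of the extension filter with `.jsonl` included. The set below is an illustrative subset, not the real READABLE_EXTENSIONS, and `partition_by_extension` is a hypothetical helper that also returns the skipped paths so they could be reported rather than dropped silently.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Illustrative subset only; the real READABLE_EXTENSIONS in miner.py is larger.
READABLE_EXTENSIONS = {".txt", ".md", ".json", ".jsonl"}


def partition_by_extension(paths: Iterable[str]) -> Tuple[List[Path], List[Path]]:
    """Split paths into (readable, skipped) using a case-insensitive suffix check."""
    readable: List[Path] = []
    skipped: List[Path] = []
    for path in map(Path, paths):
        if path.suffix.lower() in READABLE_EXTENSIONS:
            readable.append(path)
        else:
            skipped.append(path)
    return readable, skipped


readable, skipped = partition_by_extension(["a.JSONL", "b.json", "c.bin"])
```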
Tests: tests/test_miner_jsonl_visibility.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Draft plugin specification for source adapters, mirroring RFC 001's
role for storage backends. Formalizes the contract seven community
ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's
metadata-only mode have been reinventing ad-hoc, so adapter authors
can build to a stable surface.
Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata
ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim
promise with a verifiable one; byte_preserving adapters declare
the empty set, declared_lossy adapters enumerate. Existing
miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata);
no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the
filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as
json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat
closet_hints. Closes existing gap where convo drawers get no
closets
- No per-drawer field renames: source_file, filed_at, source_mtime,
added_by, normalize_version, entities, ingest_mode all preserved.
Spec adds adapter_name, adapter_version, privacy_class
§9 enumerates the cleanup PR prerequisites (mempalace/sources/
module, PalaceContext facade, KnowledgeGraph.add_triple gaining
backwards-compatible source_drawer_id + adapter_name params).
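As a rough illustration of the shape these decisions imply, the sketch below shows a lazy, byte-preserving adapter. Only the names `ingest`, `is_current`, `SourceItemMetadata`, `adapter_name`, and `adapter_version` come from the text above. Everything else (the field layout, the `Drawer` and `InMemoryPalace` classes, the `declared_transformations` attribute, and the choice to call `is_current` from inside the adapter) is an assumption made for the example and may not match the spec.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple, Union


@dataclass(frozen=True)
class SourceItemMetadata:  # field layout is hypothetical
    source_file: str
    source_mtime: float


@dataclass(frozen=True)
class Drawer:  # hypothetical stand-in for a stored chunk
    source_file: str
    content: str


class InMemoryPalace:
    """Hypothetical stand-in: the palace itself answers 'is this item current?'."""

    def __init__(self) -> None:
        self._seen: Dict[str, float] = {}

    def is_current(self, item: str, metadata: SourceItemMetadata) -> bool:
        return self._seen.get(item) == metadata.source_mtime

    def record(self, metadata: SourceItemMetadata) -> None:
        self._seen[metadata.source_file] = metadata.source_mtime


class NotesAdapter:
    adapter_name = "notes-example"
    adapter_version = "0.1"
    declared_transformations: frozenset = frozenset()  # empty set: byte-preserving

    def __init__(self, notes: Dict[str, Tuple[float, str]]) -> None:
        self.notes = notes  # {path: (mtime, text)}

    def ingest(self, palace: InMemoryPalace) -> Iterator[Union[SourceItemMetadata, Drawer]]:
        for path, (mtime, text) in self.notes.items():
            metadata = SourceItemMetadata(path, mtime)
            if palace.is_current(path, metadata):
                continue  # no sidecar state: the palace is the cursor
            yield metadata  # lazy style: metadata ahead of the drawer
            yield Drawer(path, text)


palace = InMemoryPalace()
adapter = NotesAdapter({"a.md": (1.0, "hello")})
first = list(adapter.ingest(palace))
palace.record(first[0])
second = list(adapter.ingest(palace))
```

The first pass yields the metadata item followed by its drawer. Once the palace has recorded that item, the second pass yields nothing.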
Tracking issue: #989
Extract 2002-line monolith into landing/ subfolder:
- 8 section components (FolioHeader, HeroSection, ForgettingSection, AnatomySection, DialectSection, MechanicsSection, InstallSection, CatalogFooter)
- useLandingEffects.js composable for all vanilla-JS effects
- landing.css for all styles
- Landing.vue reduced to 28-line orchestrator
Also restores upstream hero lede text ("permanent. Designed for total recall.").
- Landing: replace nonexistent `mempalace remember` CLI demo with real
`mempalace mine ./notes`
- Landing: soften unverifiable absolutes ("forever available",
"100% recall by design", "<50 ms", "90%+ compression",
"two-thousand-year-old", "tens of thousands of entries")
- MCP tool count: 19 → 29 across mcp-integration, claude-code, openclaw,
and modules; expand tool overview with Drawers, Tunnels, and System
categories to match mcp_server.py
- Wake-up token range: ~170–900 → ~600–900 in cli/api-reference/python-api
to match cli.py help text and concept docs
- Gemini CLI: move `--scope user` before target name and add `--`
separator so `-m mempalace.mcp_server` isn't parsed as Gemini flags
Also fix miner.py checkmark and box-drawing/arrow chars (─, →) in
both miner.py and split_mega_files.py that would crash on cp1251/cp1252.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Windows terminals using cp1251/cp1252 crash on the Unicode ✓ (U+2713)
in progress output. Replace it with an ASCII `+` in convo_miner.py and
split_mega_files.py.
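The failure comes from encoding: U+2713 has no mapping in cp1251 or cp1252, so writing it to such a stream raises UnicodeEncodeError. The sketch below shows an alternative that probes the target encoding instead of hard-coding the ASCII marker. `safe_marker` is a hypothetical helper, not what this commit does.

```python
from typing import Optional


def safe_marker(encoding: Optional[str], preferred: str = "\u2713", fallback: str = "+") -> str:
    """Return the check mark if the encoding can represent it, else an ASCII fallback."""
    try:
        preferred.encode(encoding or "ascii")
    except (UnicodeEncodeError, LookupError):
        return fallback
    return preferred


# Typical use would be: print(safe_marker(sys.stdout.encoding), "mined 12 files")
cp1252_marker = safe_marker("cp1252")
utf8_marker = safe_marker("utf-8")
```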
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to platform encoding, breaking onboarding tests and any source code that
reads JSON/markdown with non-ASCII content.
5 files, 8 call sites fixed.
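The usual remedy for this class of bug is to pass the encoding explicitly at each call site, which is presumably what the fixed call sites now do. A minimal sketch, with a hypothetical helper name and a temporary file standing in for real project data:

```python
import json
import tempfile
from pathlib import Path


def read_json_utf8(path: Path) -> dict:
    """Read JSON as UTF-8 regardless of the platform's default encoding."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "people.json"
    # Write non-ASCII content as UTF-8, then read it back the same way.
    sample.write_text(json.dumps({"name": "朱宜振"}, ensure_ascii=False), encoding="utf-8")
    loaded = read_json_utf8(sample)
```

Without `encoding="utf-8"`, the same `read_text()` call on a GBK-locale Windows machine would decode the bytes with the platform codec and either fail or return garbled text.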
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.
This adds entity sections for both locales:
- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
surnames covering >95% of Taiwanese / PRC names), length capped
at {1,2} trailing chars so greedy matches don't swallow the
trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
script-aware wrap (introduced in #932) fires `\b` at CJK↔non-CJK
transitions. This is the same mechanism used for Devanagari,
applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
name with no whitespace, so patterns are written as `{name}說`,
`{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `:`, Chinese quotes
「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
question words, conjunctions, UI nouns, and politeness forms.
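To illustrate the `person_verb_patterns` point, the sketch below formats the three templates quoted above with a name and counts matches. The helper name and the sample sentence are made up for the example, and the real locale file may hold more templates.

```python
import re

# Templates quoted in the list above: the verb attaches directly to the name.
PERSON_VERB_TEMPLATES = ["{name}說", "{name}問", "{name}決定"]


def count_person_verb_hits(name: str, text: str) -> int:
    """Count occurrences of name+verb with no whitespace or \\b in between."""
    total = 0
    for template in PERSON_VERB_TEMPLATES:
        pattern = template.format(name=re.escape(name))
        total += len(re.findall(pattern, text))
    return total


sample_text = "朱宜振說今天開會。後來朱宜振決定延期。"
hits = count_person_verb_hits("朱宜振", sample_text)
```

Here `hits` counts one match for 說 and one for 決定. A name that never appears directly before one of the verbs scores zero.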
**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.
**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.
Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`
Full suite: 957 passed, 0 failed.
Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.
Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.
Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
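A small demonstration of the gap, and of the idea behind a shared helper. The Latin+Cyrillic pattern and the function below are illustrative stand-ins, not the real i18n candidate_patterns or the actual _candidate_entity_words() implementation.

```python
import re
from typing import Iterable, List, Pattern

ASCII_ONLY = re.compile(r"[A-Z][a-z]{2,}")
# Hypothetical Latin+Cyrillic candidate pattern, for illustration only.
LATIN_CYRILLIC = re.compile(r"\b[A-ZА-ЯЁ][a-zа-яё]{2,}\b")


def candidate_entity_words(text: str, patterns: Iterable[Pattern[str]]) -> List[str]:
    """Collect unique candidate words, in first-seen order, across all patterns."""
    found: List[str] = []
    for pattern in patterns:
        for word in pattern.findall(text):
            if word not in found:
                found.append(word)
    return found


text = "Alice and Мария reviewed the notes"
ascii_hits = candidate_entity_words(text, [ASCII_ONLY])
wide_hits = candidate_entity_words(text, [LATIN_CYRILLIC])
```

The ASCII-only regex finds only `Alice`, while the wider pattern also finds `Мария`. That is the kind of name the three hardcoded code paths were missing.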
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.
Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.
Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.
Default behavior (no boundary_chars) keeps standard \b, so the en,
pt-br, ru, and it locales are unchanged.
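The mechanism can be sketched as below. The function names mirror the ones listed under Changes, but the bodies are an assumption about how such a lookaround could be built, not the actual loader code. The string replace is deliberately naive and would also rewrite an escaped `\\b`.

```python
import re


def script_boundary(boundary_chars: str) -> str:
    """Build a \\b replacement that treats boundary_chars as inside-word characters."""
    inside = f"[{boundary_chars}]"
    # A boundary is a transition into or out of the inside-word class.
    return f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"


def expand_b(pattern: str, boundary_chars: str) -> str:
    """Naively replace every \\b in the pattern with the script-aware boundary."""
    return pattern.replace(r"\b", script_boundary(boundary_chars))


HINDI_BOUNDARY_CHARS = r"\w\u0900-\u097F"  # the example value quoted above

words = ["राज", "ने", "कहा"]
raw_pattern = r"\b" + r"\s+".join(words) + r"\b"
text = " ".join(words)

plain_match = re.search(raw_pattern, text)
aware_match = re.search(expand_b(raw_pattern, HINDI_BOUNDARY_CHARS), text)
```

With the standard `\b`, the trailing matra of कहा is not a `\w` character, so no boundary exists at the end of the string and `plain_match` is `None`. The expanded pattern matches the whole phrase.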
Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
_wrap_candidate, _collect_entity_section; candidate_patterns are now
returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
(name extraction with/without boundary_chars, person-verb firing,
English regression)
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.
The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.
Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.
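A sketch of the lookup, assuming the set of on-disk filename stems is already known. The signature, and the choice to return the input unchanged when nothing matches, are assumptions for the example. The real _canonical_lang takes only `lang`.

```python
from typing import Iterable, Tuple


def canonical_lang(lang: str, available_stems: Iterable[str]) -> str:
    """Resolve any casing of a language tag to the on-disk filename stem."""
    wanted = lang.strip().lower()
    for stem in available_stems:
        if stem.lower() == wanted:
            return stem
    return lang  # unknown tag: leave it for the caller to handle


def cache_key(languages: Iterable[str], available_stems: Iterable[str]) -> Tuple[str, ...]:
    """Build the cache key from canonical stems so casing variants share one entry."""
    stems = list(available_stems)
    return tuple(canonical_lang(lang, stems) for lang in languages)


STEMS = ["en", "pt-br", "zh-CN"]  # mixed conventions, as on disk
```

Under this sketch, `PT-BR` and `pt-br` both resolve to `pt-br`, and `zh-cn` resolves to `zh-CN`, so the two spellings produce the same cache key.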
Closes #927
Introduces a version label to the command-line interface, displaying the current MemPalace version in the help text. Adds a `--version` flag to allow users to easily check the version and exit.
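With argparse, this is typically done through the built-in `version` action. The sketch below is a generic illustration. The placeholder version string, the description text, and the `build_parser` name are assumptions, not the actual cli.py code.

```python
import argparse

VERSION = "0.0.0"  # placeholder; the real value would come from the package


def build_parser(version: str = VERSION) -> argparse.ArgumentParser:
    """Build a parser that shows the version in help and supports --version."""
    parser = argparse.ArgumentParser(
        prog="mempalace",
        description=f"MemPalace {version}",
    )
    # action="version" prints the string and exits with status 0.
    parser.add_argument("--version", action="version", version=f"%(prog)s {version}")
    return parser


help_text = build_parser().format_help()
```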
CLI strings, AAAK instruction, regex patterns, and entity section
with person-verb, pronoun, dialogue, and candidate patterns for
Latin+diacritics names (Joao, Ines, Angela).
Follows the i18n entity framework from #911.