Commit Graph

444 Commits

Author SHA1 Message Date
MSL fed69935d3 Add tandem sweeper: message-level safety net for dropped transcripts
The primary miners (miner.py, convo_miner.py) operate at file
granularity and can drop data for several reasons: size caps, silent
OSError on read, dedup false positives, extensions the project miner
does not recognize. Even with tonight's hotfixes, any future bug in
the file-level path risks silent data loss.

The sweeper is a second, cooperating miner that works at MESSAGE
granularity:

  - Parses Claude Code .jsonl line by line, yielding only
    user/assistant records (filters progress, file-history-snapshot,
    etc. noise).
  - For each session_id, queries the palace for max(timestamp) and
    treats that as the cursor.
  - Ingests only messages newer than the cursor, as one small drawer
    per exchange (never hits a size cap — each drawer is 1-5 KB).
  - Deterministic drawer IDs from session_id + message UUID make
    reruns idempotent; crash mid-sweep is safe.

Tandem coordination is free: if the primary miner committed up to
timestamp T, the sweeper resumes from T. If the primary miner missed
everything, the sweeper catches it all. Neither duplicates the other.

Smoke test on a real Claude Code transcript:
  1st run: +39 drawers, 0 already present
  2nd run: +0 drawers, 39 already present  (perfect idempotence)

Opt-in via:
  mempalace sweep <file.jsonl>
  mempalace sweep <transcript-dir>

No changes to existing miners. No schema migration. Purely additive.

Tests: tests/test_sweeper.py (7 tests covering parsing, tandem
coordination, idempotency, resume-from-cursor, metadata correctness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:06 -03:00
MSL 6f33d52681 Raise convo_miner MAX_FILE_SIZE cap 10 MB → 500 MB
Mirrors the miner.py fix in this same branch. convo_miner.py had the
exact same 10 MB cap at line 58 that silently dropped long transcripts
via continue. Long Claude Code sessions, multi-year ChatGPT exports,
and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern,
different file.

Raised to 500 MB to match miner.py for consistency; downstream chunking
means source file size does not affect storage or embedding cost.

Tests: tests/test_convo_miner_size_cap.py (1 test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
MSL d137d12313 Raise MAX_FILE_SIZE cap from 10 MB to 500 MB
Long Claude Code sessions routinely produce transcripts larger than 10
MB. The previous cap at miner.py:65 silently dropped them at line 732
with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — same
silent-failure pattern as the .jsonl extension bug.

The cap exists as a safety rail against pathological binaries, not as
a limit on legitimate text. Downstream chunking at 800 chars per drawer
means source file size does not affect storage or embedding cost.

500 MB leaves headroom for year-long continuous transcripts while still
catching accidental multi-GB binary mines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
MSL 560fdbdc9f Fix silent drop of .jsonl files in project miner
mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not
`.jsonl`. Every jsonl file encountered in a mined directory was
silently skipped at miner.py:722:

    if filepath.suffix.lower() not in READABLE_EXTENSIONS:
        continue

Claude Code transcripts, ChatGPT exports, and every other tool writing
line-delimited JSON ship as `.jsonl`. Users running `mempalace mine`
against a directory of transcripts saw the command complete with no
error and no log line — and their conversations never reached the
palace. Silent data loss.

Adding `.jsonl` to the whitelist alongside `.json`. jsonl is text
line-by-line; the existing chunking pipeline handles it the same way
it handles any other text file.

Tests: tests/test_miner_jsonl_visibility.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
Igor Lins e Silva e4a2cd48a2 Merge pull request #984 from domiscd/feat/landing-page-update
feat/landing-page: Improve landing page readability
2026-04-17 19:47:39 -03:00
Dominique Deschatre 2e3e0b979c Update landing.css 2026-04-17 19:40:25 -03:00
Dominique Deschatre 9e8281aab5 (landing) svg icons animations 2026-04-17 19:37:30 -03:00
Dominique Deschatre e5f5009f80 (landing) added Closets section 2026-04-17 19:18:10 -03:00
Dominique Deschatre 89f0eb5cb3 refactor(website): split Landing.vue into section components
Extract 2002-line monolith into landing/ subfolder:
- 8 section components (FolioHeader, HeroSection, ForgettingSection, AnatomySection, DialectSection, MechanicsSection, InstallSection, CatalogFooter)
- useLandingEffects.js composable for all vanilla-JS effects
- landing.css for all styles
- Landing.vue reduced to 28-line orchestrator

Also restores upstream hero lede text ("permanent. Designed for total recall.").
2026-04-17 18:49:41 -03:00
Dominique Deschatre 8c3d1ba86c Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update
Co-authored-by: Copilot <copilot@github.com>
2026-04-17 17:00:47 -03:00
Dominique Deschatre 28d4f67ba2 landing hero container 2026-04-17 15:53:50 -03:00
Igor Lins e Silva 41bff266a4 Merge pull request #918 from almirus/develop
feat(cli): add version display and version flag to CLI
2026-04-17 00:29:55 -03:00
Igor Lins e Silva 596f3d3a8e Merge pull request #964 from MemPalace/fix/website-false-claims
fix(website): correct false claims and stale numbers in live docs
2026-04-16 23:38:08 -03:00
Igor Lins e Silva 0cb9ee5c58 fix(website): correct false claims and stale numbers in live docs
- Landing: replace nonexistent `mempalace remember` CLI demo with real
  `mempalace mine ./notes`
- Landing: soften unverifiable absolutes ("forever available",
  "100% recall by design", "<50 ms", "90%+ compression",
  "two-thousand-year-old", "tens of thousands of entries")
- MCP tool count: 19 → 29 across mcp-integration, claude-code, openclaw,
  and modules; expand tool overview with Drawers, Tunnels, and System
  categories to match mcp_server.py
- Wake-up token range: ~170–900 → ~600–900 in cli/api-reference/python-api
  to match cli.py help text and concept docs
- Gemini CLI: move `--scope user` before target name and add `--`
  separator so `-m mempalace.mcp_server` isn't parsed as Gemini flags
2026-04-16 23:31:35 -03:00
Igor Lins e Silva 51919fef0c Merge pull request #963 from domiscd/feat/landing-page-update
feat(website): update landing page
2026-04-16 22:37:16 -03:00
Dominique Deschatre c8727b3a2d chore(website): add Google Analytics 2026-04-16 22:34:37 -03:00
Dominique Deschatre 44c525ddd3 Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update
# Conflicts:
#	website/index.md
2026-04-16 22:31:22 -03:00
Dominique Deschatre d8ac4c3abb new landing page pt 2 2026-04-16 22:24:15 -03:00
Dominique Deschatre 9893fa2383 new landing page 2026-04-16 21:46:03 -03:00
Igor Lins e Silva 55a004fe1e Merge pull request #931 from mvalentsev/fix/i18n-entity-metadata
fix: use i18n candidate patterns for entity extraction in miner and palace
2026-04-16 15:54:01 -03:00
Igor Lins e Silva c5e249bba8 Merge pull request #946 from mvalentsev/fix/utf8-read-text
fix: add explicit UTF-8 encoding to read_text() calls (#776)
2026-04-16 15:52:42 -03:00
Igor Lins e Silva 65f99ad7e6 Merge pull request #928 from arnoldwender/fix/i18n-lang-case-insensitive
fix(i18n): resolve language codes case-insensitively (#927)
2026-04-16 15:44:36 -03:00
Igor Lins e Silva 29112fab82 Merge pull request #778 from dominosaurs/feat/id-lang
feat: add Indonesian language support
2026-04-16 15:44:26 -03:00
Igor Lins e Silva 4215be3926 Merge pull request #773 from tejasashinde/feat/add-i18n-hindi
feat: add Hindi language support to i18n module
2026-04-16 15:44:08 -03:00
mvalentsev 09fe2dda3c fix: add explicit UTF-8 encoding to read_text() calls (#776)
On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to platform encoding, breaking onboarding tests and any source code that
reads JSON/markdown with non-ASCII content.

5 files, 8 call sites fixed.
2026-04-16 16:00:29 +05:00
🍕 939d4c1e74 feat: Update Indonesian translations
Refine AAAK instruction and expand entity detection patterns.
2026-04-16 17:43:51 +08:00
🍕 88f5b5fa0e Add Indonesian language support
Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.

Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.
2026-04-16 16:15:47 +08:00
mvalentsev cde0f5b9e7 remove unnecessary comment 2026-04-16 10:38:38 +05:00
mvalentsev 973bd62a9a fix: use pre-wrapped candidate patterns after #932 refactor 2026-04-16 10:37:18 +05:00
mvalentsev 8bf940f861 fix: use i18n candidate patterns for entity extraction in miner and palace
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
2026-04-16 10:35:40 +05:00
tejasashinde 21da870bd0 fix(i18n/hi): add boundary_chars and update action_pattern for Devanagari-aware matching 2026-04-16 09:21:21 +05:30
Igor Lins e Silva d4c942417a Merge pull request #932 from MemPalace/fix/entity-detector-non-latin-boundaries
fix(entity_detector): script-aware word boundaries for combining-mark scripts
2026-04-15 22:38:59 -03:00
Igor Lins e Silva f895bc58e6 fix(entity_detector): script-aware word boundaries for combining-mark scripts
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.

Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
2026-04-15 22:18:52 -03:00
Arnold Wender 6caac50138 fix(i18n): use Optional[str] for Python 3.9 compatibility
PEP 604 union syntax (str | None) requires Python 3.10+. The project
supports 3.9 per CI matrix, so use typing.Optional instead.
2026-04-15 23:37:12 +02:00
Arnold Wender 0174b93d0f fix(i18n): resolve language codes case-insensitively (#927)
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.

The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.

Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.

Closes #927
2026-04-15 23:33:42 +02:00
Igor Lins e Silva 122ce38811 Merge pull request #907 from Archetipo95/feat/italian-i18n-support
feat: add Italian language support
2026-04-15 18:05:13 -03:00
Igor Lins e Silva 57b0b14192 Merge pull request #156 from mvalentsev/feat/pt-br-entity-detection
feat: add Brazilian Portuguese support to entity_detector (closes #117)
2026-04-15 17:53:30 -03:00
almirus 10cdd93cec feat(cli): add version display and version flag to CLI
Introduces a version label to the command-line interface, displaying the current MemPalace version in the help text. Adds a `--version` flag to allow users to easily check the version and exit.
2026-04-15 21:44:20 +03:00
mvalentsev 4221589df2 fix(i18n): address review feedback on pt-br.json
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
2026-04-15 23:32:31 +05:00
mvalentsev 3d13a72ae0 feat(i18n): add Brazilian Portuguese locale with entity detection (closes #117)
CLI strings, AAAK instruction, regex patterns, and entity section
with person-verb, pronoun, dialogue, and candidate patterns for
Latin+diacritics names (Joao, Ines, Angela).

Follows the i18n entity framework from #911.
2026-04-15 23:32:31 +05:00
Tejas Shinde 33a98fb9d1 Updated hi.json to support infra for entity,pronoun_patterns,dialogue_patterns,direct_address_pattern, project_verb_patterns and stopwords 2026-04-15 23:33:24 +05:30
Tejas Shinde ce3ae0a668 Merge branch 'MemPalace:develop' into feat/add-i18n-hindi 2026-04-15 23:19:57 +05:30
Martin Masevski 69453b2180 feat: add italian entity patterns 2026-04-15 19:18:23 +02:00
Martin Masevski 2e998db0b9 feat: add italian i18n support 2026-04-15 19:15:55 +02:00
Igor Lins e Silva 73a2f82d5b Merge pull request #760 from mvalentsev/feat/i18n-russian
feat: add Russian language support (ru.json)
2026-04-15 13:46:04 -03:00
Igor Lins e Silva 312b3b5f0e Merge pull request #758 from mvalentsev/fix/i18n-review-issues
fix: address i18n review issues from PR #718
2026-04-15 13:45:49 -03:00
mvalentsev 4b998de77a feat(i18n): expand Russian entity stopwords with prepositions and conjunctions
Adds 34 prepositions and conjunctions to reduce false positives
in entity detection when these words appear sentence-initial.

Co-Authored-By: almirus <almirus@users.noreply.github.com>
2026-04-15 21:14:51 +05:00
mvalentsev 3e49522a42 fix(i18n): apply review feedback on ru.json (#760)
- mine_skip: "повторной раскопки" -> "повторной обработки"
- quote_pattern: add Russian guillemet quotes «»

Co-Authored-By: almirus <almirus@users.noreply.github.com>
2026-04-15 20:17:16 +05:00
mvalentsev d6bd7de5f6 feat(i18n): add entity detection section to Russian locale
Cyrillic candidate/multi-word patterns, person-verb patterns
(сказал, спросил, ответил, etc.), pronoun patterns, dialogue
markers, direct address, and Russian stopwords.

Follows the i18n entity framework from #911.
2026-04-15 18:16:25 +05:00
mvalentsev b87ada3c96 feat: add Russian language support to i18n module
Add ru.json with full Russian translations for CLI strings, palace
terminology, AAAK compression instruction, and regex patterns for
topic/action extraction with Cyrillic character classes.

No code changes needed -- the i18n module auto-discovers language
files via *.json glob in the i18n directory.
2026-04-15 18:15:15 +05:00