Commit Graph

443 Commits

Author SHA1 Message Date
MSL 6f33d52681 Raise convo_miner MAX_FILE_SIZE cap 10 MB → 500 MB
Mirrors the miner.py fix in this same branch. convo_miner.py had the
exact same 10 MB cap at line 58 that silently dropped long transcripts
via continue. Long Claude Code sessions, multi-year ChatGPT exports,
and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern,
different file.

Raised to 500 MB to match miner.py for consistency; downstream chunking
means source file size does not affect storage or embedding cost.

Tests: tests/test_convo_miner_size_cap.py (1 test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
MSL d137d12313 Raise MAX_FILE_SIZE cap from 10 MB to 500 MB
Long Claude Code sessions routinely produce transcripts larger than 10
MB. The previous cap at miner.py:65 silently dropped them at line 732
with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — same
silent-failure pattern as the .jsonl extension bug.

The cap exists as a safety rail against pathological binaries, not as
a limit on legitimate text. Downstream chunking at 800 chars per drawer
means source file size does not affect storage or embedding cost.

500 MB leaves headroom for year-long continuous transcripts while still
catching accidental multi-GB binary mines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
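The size check described in d137d12313 can be sketched roughly as follows. This is an illustration under assumptions, not the code in miner.py: the helper name `within_size_cap`, the logger, and reading "500 MB" as 500 × 1024 × 1024 bytes are all hypothetical, and the warning is added here only to show one way to make the skip visible. The commit itself changes the constant.

```python
import logging
from pathlib import Path

logger = logging.getLogger("miner_sketch")  # hypothetical logger name

# Assumption: "500 MB" is read as 500 * 1024 * 1024 bytes.
MAX_FILE_SIZE = 500 * 1024 * 1024


def within_size_cap(filepath: Path, max_size: int = MAX_FILE_SIZE) -> bool:
    """Return True if the file is small enough to mine; log when it is not."""
    size = filepath.stat().st_size
    if size > max_size:
        # Unlike a bare `continue`, this leaves a trace of the skip.
        logger.warning(
            "Skipping %s: %d bytes exceeds cap of %d", filepath, size, max_size
        )
        return False
    return True
```

A caller would use it as `if not within_size_cap(filepath): continue`, which keeps the safety rail against very large binaries while recording every file it drops.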
MSL 560fdbdc9f Fix silent drop of .jsonl files in project miner
mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not
`.jsonl`. Every jsonl file encountered in a mined directory was
silently skipped at miner.py:722:

    if filepath.suffix.lower() not in READABLE_EXTENSIONS:
        continue

Claude Code transcripts, ChatGPT exports, and every other tool writing
line-delimited JSON ship as `.jsonl`. Users running `mempalace mine`
against a directory of transcripts saw the command complete with no
error and no log line — and their conversations never reached the
palace. Silent data loss.

Add `.jsonl` to the whitelist alongside `.json`. JSONL is plain text
with one JSON record per line, so the existing chunking pipeline handles
it the same way it handles any other text file.

Tests: tests/test_miner_jsonl_visibility.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
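A minimal sketch of the extension filter from 560fdbdc9f, assuming a small illustrative whitelist (the real `READABLE_EXTENSIONS` set is larger and its other members are not shown in the commit). The helper `split_by_readability` is hypothetical; it returns the skipped paths as well, so a caller could report them instead of dropping them silently.

```python
from pathlib import Path

# Illustrative subset only; the entries other than .json/.jsonl are assumptions.
READABLE_EXTENSIONS = {".json", ".jsonl", ".md", ".txt"}


def split_by_readability(paths):
    """Partition paths into (readable, skipped) so skips can be reported."""
    readable, skipped = [], []
    for filepath in paths:
        # Same test as the commit quotes, with the skip recorded.
        if filepath.suffix.lower() in READABLE_EXTENSIONS:
            readable.append(filepath)
        else:
            skipped.append(filepath)
    return readable, skipped


readable, skipped = split_by_readability(
    [Path("sessions/2026-04.jsonl"), Path("notes/todo.md"), Path("dump.bin")]
)
```

With `.jsonl` in the set, the transcript path lands in `readable`; only `dump.bin` ends up in `skipped`.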
Igor Lins e Silva e4a2cd48a2 Merge pull request #984 from domiscd/feat/landing-page-update
feat/landing-page: Improve landing page readability
2026-04-17 19:47:39 -03:00
Dominique Deschatre 2e3e0b979c Update landing.css 2026-04-17 19:40:25 -03:00
Dominique Deschatre 9e8281aab5 (landing) svg icons animations 2026-04-17 19:37:30 -03:00
Dominique Deschatre e5f5009f80 (landing) added Closets section 2026-04-17 19:18:10 -03:00
Dominique Deschatre 89f0eb5cb3 refactor(website): split Landing.vue into section components
Extract 2002-line monolith into landing/ subfolder:
- 8 section components (FolioHeader, HeroSection, ForgettingSection, AnatomySection, DialectSection, MechanicsSection, InstallSection, CatalogFooter)
- useLandingEffects.js composable for all vanilla-JS effects
- landing.css for all styles
- Landing.vue reduced to 28-line orchestrator

Also restores upstream hero lede text ("permanent. Designed for total recall.").
2026-04-17 18:49:41 -03:00
Dominique Deschatre 8c3d1ba86c Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update
Co-authored-by: Copilot <copilot@github.com>
2026-04-17 17:00:47 -03:00
Dominique Deschatre 28d4f67ba2 landing hero container 2026-04-17 15:53:50 -03:00
Igor Lins e Silva 41bff266a4 Merge pull request #918 from almirus/develop
feat(cli): add version display and version flag to CLI
2026-04-17 00:29:55 -03:00
Igor Lins e Silva 596f3d3a8e Merge pull request #964 from MemPalace/fix/website-false-claims
fix(website): correct false claims and stale numbers in live docs
2026-04-16 23:38:08 -03:00
Igor Lins e Silva 0cb9ee5c58 fix(website): correct false claims and stale numbers in live docs
- Landing: replace nonexistent `mempalace remember` CLI demo with real
  `mempalace mine ./notes`
- Landing: soften unverifiable absolutes ("forever available",
  "100% recall by design", "<50 ms", "90%+ compression",
  "two-thousand-year-old", "tens of thousands of entries")
- MCP tool count: 19 → 29 across mcp-integration, claude-code, openclaw,
  and modules; expand tool overview with Drawers, Tunnels, and System
  categories to match mcp_server.py
- Wake-up token range: ~170–900 → ~600–900 in cli/api-reference/python-api
  to match cli.py help text and concept docs
- Gemini CLI: move `--scope user` before target name and add `--`
  separator so `-m mempalace.mcp_server` isn't parsed as Gemini flags
2026-04-16 23:31:35 -03:00
Igor Lins e Silva 51919fef0c Merge pull request #963 from domiscd/feat/landing-page-update
feat(website): update landing page
2026-04-16 22:37:16 -03:00
Dominique Deschatre c8727b3a2d chore(website): add Google Analytics 2026-04-16 22:34:37 -03:00
Dominique Deschatre 44c525ddd3 Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update
# Conflicts:
#	website/index.md
2026-04-16 22:31:22 -03:00
Dominique Deschatre d8ac4c3abb new landing page pt 2 2026-04-16 22:24:15 -03:00
Dominique Deschatre 9893fa2383 new landing page 2026-04-16 21:46:03 -03:00
Igor Lins e Silva 55a004fe1e Merge pull request #931 from mvalentsev/fix/i18n-entity-metadata
fix: use i18n candidate patterns for entity extraction in miner and palace
2026-04-16 15:54:01 -03:00
Igor Lins e Silva c5e249bba8 Merge pull request #946 from mvalentsev/fix/utf8-read-text
fix: add explicit UTF-8 encoding to read_text() calls (#776)
2026-04-16 15:52:42 -03:00
Igor Lins e Silva 65f99ad7e6 Merge pull request #928 from arnoldwender/fix/i18n-lang-case-insensitive
fix(i18n): resolve language codes case-insensitively (#927)
2026-04-16 15:44:36 -03:00
Igor Lins e Silva 29112fab82 Merge pull request #778 from dominosaurs/feat/id-lang
feat: add Indonesian language support
2026-04-16 15:44:26 -03:00
Igor Lins e Silva 4215be3926 Merge pull request #773 from tejasashinde/feat/add-i18n-hindi
feat: add Hindi language support to i18n module
2026-04-16 15:44:08 -03:00
mvalentsev 09fe2dda3c fix: add explicit UTF-8 encoding to read_text() calls (#776)
On Windows with a non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to the platform encoding, which breaks the onboarding tests and any
source code that reads JSON or markdown with non-ASCII content.

5 files, 8 call sites fixed.
2026-04-16 16:00:29 +05:00
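The pattern applied in 09fe2dda3c can be illustrated as below. The helper `load_json_utf8` and the file name are hypothetical; the only point taken from the commit is passing `encoding="utf-8"` to `Path.read_text()` so the result does not depend on the locale.

```python
import json
import tempfile
from pathlib import Path


def load_json_utf8(path: Path):
    """Read JSON with an explicit encoding instead of the locale default."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "entities.json"  # hypothetical file name
    # Write non-ASCII content as UTF-8 bytes, independent of the locale.
    sample.write_text(
        json.dumps({"name": "Jo\u00e3o"}, ensure_ascii=False), encoding="utf-8"
    )
    loaded = load_json_utf8(sample)
```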
🍕 939d4c1e74 feat: Update Indonesian translations
Refine AAAK instruction and expand entity detection patterns.
2026-04-16 17:43:51 +08:00
🍕 88f5b5fa0e Add Indonesian language support
Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.

Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.
2026-04-16 16:15:47 +08:00
mvalentsev cde0f5b9e7 remove unnecessary comment 2026-04-16 10:38:38 +05:00
mvalentsev 973bd62a9a fix: use pre-wrapped candidate patterns after #932 refactor 2026-04-16 10:37:18 +05:00
mvalentsev 8bf940f861 fix: use i18n candidate patterns for entity extraction in miner and palace
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
2026-04-16 10:35:40 +05:00
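To make the gap described in 8bf940f861 concrete, the sketch below compares the hardcoded ASCII regex with a helper that iterates over locale-supplied patterns. The function `candidate_entity_words` and both patterns are assumptions for illustration; the real `_candidate_entity_words()` and the patterns in the locale JSON files may differ.

```python
import re

# Hypothetical stand-ins for candidate patterns loaded from locale files.
CANDIDATE_PATTERNS = [
    r"\b([A-Z\u00C0-\u00DE][a-z\u00DF-\u00FF]{2,})\b",            # Latin + diacritics
    r"\b([\u0410-\u042F\u0401][\u0430-\u044F\u0451]{2,})\b",      # Cyrillic
]


def candidate_entity_words(text, patterns=CANDIDATE_PATTERNS):
    """Collect capitalized candidate words across all patterns, without duplicates."""
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            word = match.group(1)
            if word not in found:
                found.append(word)
    return found


# "Анна met João and Maria", written with escapes to keep the source ASCII-safe.
text = "\u0410\u043d\u043d\u0430 met Jo\u00e3o and Maria"
legacy = re.findall(r"[A-Z][a-z]{2,}", text)  # the old ASCII-only regex
found = candidate_entity_words(text)
```

Here `legacy` contains only `Maria`, while `found` also picks up the accented and Cyrillic names.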
tejasashinde 21da870bd0 fix(i18n/hi): add boundary_chars and update action_pattern for Devanagari-aware matching 2026-04-16 09:21:21 +05:30
Igor Lins e Silva d4c942417a Merge pull request #932 from MemPalace/fix/entity-detector-non-latin-boundaries
fix(entity_detector): script-aware word boundaries for combining-mark scripts
2026-04-15 22:38:59 -03:00
Igor Lins e Silva f895bc58e6 fix(entity_detector): script-aware word boundaries for combining-mark scripts
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode combining marks (category Mc, Spacing Combining,
or Mn, Nonspacing), which \w does not match. This means \b splits
mid-word on every matra: names like अनीता (Anita) truncate to अनीत, and
person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b
fails after the final matra of कहा.

The same issue affects Arabic, Hebrew, Thai, Tamil, and every other
script whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
2026-04-15 22:18:52 -03:00
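The boundary problem and the lookaround fix from f895bc58e6 can be reproduced in a few lines. This is a sketch under assumptions: `expand_b` here is a hypothetical, simplified stand-in for the loader's `_expand_b`, and the exact lookaround the project emits is not shown in the commit. The `boundary_chars` value is the Hindi example quoted in the message.

```python
import re


def expand_b(pattern: str, boundary_chars: str) -> str:
    """Replace each \\b with a lookaround that treats boundary_chars as in-word.

    Simplification: a plain str.replace also rewrites an escaped backslash
    followed by "b"; a real loader would need to tokenize the pattern.
    """
    inside = f"[{boundary_chars}]"
    # A boundary is a transition between "inside-word" and "not inside-word".
    boundary = f"(?:(?<!{inside})(?={inside})|(?<={inside})(?!{inside}))"
    return pattern.replace(r"\b", boundary)


HINDI_BOUNDARY_CHARS = r"\w\u0900-\u097F"  # value quoted in the commit message
name = "\u0905\u0928\u0940\u0924\u093e"    # the name Anita in Devanagari
pattern = r"\b[\u0900-\u097F]+\b"          # hypothetical candidate pattern

# Standard \b: the trailing matra is not \w, so the match stops one char early.
plain = re.search(pattern, name).group()
# Script-aware boundary: the whole name matches.
aware = re.search(expand_b(pattern, HINDI_BOUNDARY_CHARS), name).group()
```

`plain` holds the first four code points of the name (the truncated form the commit describes), while `aware` holds all five.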
Arnold Wender 6caac50138 fix(i18n): use Optional[str] for Python 3.9 compatibility
PEP 604 union syntax (str | None) requires Python 3.10+. The project
supports 3.9 per CI matrix, so use typing.Optional instead.
2026-04-15 23:37:12 +02:00
Arnold Wender 0174b93d0f fix(i18n): resolve language codes case-insensitively (#927)
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.

The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.

Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.

Closes #927
2026-04-15 23:33:42 +02:00
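A minimal sketch of the resolution step described in 0174b93d0f, assuming locale files live as `*.json` in a single directory. The function below is a hypothetical stand-in for `_canonical_lang`: it takes the directory as a parameter and returns `None` for unknown codes, both of which are assumptions. It uses `Optional[str]` rather than `str | None` to stay compatible with Python 3.9.

```python
import tempfile
from pathlib import Path
from typing import Optional


def canonical_lang(lang: str, i18n_dir: Path) -> Optional[str]:
    """Map any casing of a language code to the on-disk filename stem."""
    wanted = lang.lower()
    for locale_file in sorted(i18n_dir.glob("*.json")):
        if locale_file.stem.lower() == wanted:
            return locale_file.stem
    return None  # unknown language: let the caller warn or fall back


with tempfile.TemporaryDirectory() as tmp:
    i18n_dir = Path(tmp)
    # Mixed conventions on disk, as in the commit message.
    for stem in ("pt-br", "zh-CN"):
        (i18n_dir / f"{stem}.json").write_text("{}", encoding="utf-8")
    resolved = [canonical_lang(code, i18n_dir) for code in ("PT-BR", "zh-cn", "xx")]
```

Using the returned stem as the cache key means `PT-BR` and `pt-br` share one entry instead of producing two.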
Igor Lins e Silva 122ce38811 Merge pull request #907 from Archetipo95/feat/italian-i18n-support
feat: add Italian language support
2026-04-15 18:05:13 -03:00
Igor Lins e Silva 57b0b14192 Merge pull request #156 from mvalentsev/feat/pt-br-entity-detection
feat: add Brazilian Portuguese support to entity_detector (closes #117)
2026-04-15 17:53:30 -03:00
almirus 10cdd93cec feat(cli): add version display and version flag to CLI
Introduces a version label to the command-line interface, displaying the current MemPalace version in the help text. Adds a `--version` flag to allow users to easily check the version and exit.
2026-04-15 21:44:20 +03:00
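One common way to get the behavior described in 10cdd93cec is argparse's built-in `version` action. The sketch below is an assumption about the approach, not the code in cli.py; `build_parser` and the version string passed to it are hypothetical.

```python
import argparse


def build_parser(version: str) -> argparse.ArgumentParser:
    """Build a parser that shows the version in help and supports --version."""
    parser = argparse.ArgumentParser(
        prog="mempalace",
        description=f"MemPalace {version}",  # version label in the help text
    )
    # action="version" prints the string and exits with status 0.
    parser.add_argument(
        "--version", action="version", version=f"%(prog)s {version}"
    )
    return parser
```

Calling `build_parser("1.2.3").parse_args(["--version"])` prints `mempalace 1.2.3` and exits.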
mvalentsev 4221589df2 fix(i18n): address review feedback on pt-br.json
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
2026-04-15 23:32:31 +05:00
mvalentsev 3d13a72ae0 feat(i18n): add Brazilian Portuguese locale with entity detection (closes #117)
CLI strings, AAAK instruction, regex patterns, and entity section
with person-verb, pronoun, dialogue, and candidate patterns for
Latin+diacritics names (Joao, Ines, Angela).

Follows the i18n entity framework from #911.
2026-04-15 23:32:31 +05:00
Tejas Shinde 33a98fb9d1 Update hi.json to support the entity infrastructure: pronoun_patterns, dialogue_patterns, direct_address_pattern, project_verb_patterns, and stopwords 2026-04-15 23:33:24 +05:30
Tejas Shinde ce3ae0a668 Merge branch 'MemPalace:develop' into feat/add-i18n-hindi 2026-04-15 23:19:57 +05:30
Martin Masevski 69453b2180 feat: add Italian entity patterns 2026-04-15 19:18:23 +02:00
Martin Masevski 2e998db0b9 feat: add Italian i18n support 2026-04-15 19:15:55 +02:00
Igor Lins e Silva 73a2f82d5b Merge pull request #760 from mvalentsev/feat/i18n-russian
feat: add Russian language support (ru.json)
2026-04-15 13:46:04 -03:00
Igor Lins e Silva 312b3b5f0e Merge pull request #758 from mvalentsev/fix/i18n-review-issues
fix: address i18n review issues from PR #718
2026-04-15 13:45:49 -03:00
mvalentsev 4b998de77a feat(i18n): expand Russian entity stopwords with prepositions and conjunctions
Adds 34 prepositions and conjunctions to reduce false positives
in entity detection when these words appear sentence-initial.

Co-Authored-By: almirus <almirus@users.noreply.github.com>
2026-04-15 21:14:51 +05:00
mvalentsev 3e49522a42 fix(i18n): apply review feedback on ru.json (#760)
- mine_skip: "повторной раскопки" -> "повторной обработки"
- quote_pattern: add Russian guillemet quotes «»

Co-Authored-By: almirus <almirus@users.noreply.github.com>
2026-04-15 20:17:16 +05:00
mvalentsev d6bd7de5f6 feat(i18n): add entity detection section to Russian locale
Cyrillic candidate/multi-word patterns, person-verb patterns
(сказал, спросил, ответил, etc.), pronoun patterns, dialogue
markers, direct address, and Russian stopwords.

Follows the i18n entity framework from #911.
2026-04-15 18:16:25 +05:00
mvalentsev b87ada3c96 feat: add Russian language support to i18n module
Add ru.json with full Russian translations for CLI strings, palace
terminology, AAAK compression instruction, and regex patterns for
topic/action extraction with Cyrillic character classes.

No code changes needed -- the i18n module auto-discovers language
files via *.json glob in the i18n directory.
2026-04-15 18:15:15 +05:00
Igor Lins e Silva 3bac3654c4 Merge pull request #911 from MemPalace/refactor/entity-detector-i18n
refactor(entity_detector): make multi-language extensible via i18n JSON
2026-04-15 09:40:36 -03:00