Commit Graph

9 Commits

Author SHA1 Message Date
Igor Lins e Silva 6aebf458ff fix(entity): reduce noise in regex-based detection
The pattern-matching detector had several systematic false positives that
crowded the init review with nonsense. Concrete fixes:

- CamelCase extraction: add `[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+` to
  candidate patterns so `MemPalace`, `ChromaDB`, `OpenAI`, `ChatGPT` are
  visible. Previously `MemPalace` fragmented into `Mem` + `Palace`.
- Dialogue `^NAME:\s` requires >=2 matches to count. A single metadata
  line like `Created: 2026-04-21` was scoring as dialogue and classifying
  `Created` as a person.
- Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b`
  (version-only). The previous `\b{name}[-v]\w+` matched `context-manager`,
  `multi-word`, etc. - every hyphenated compound.
- Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files during scan. They
  produce pure-English-prose noise (`Contributor`, `Software`, `Covered`,
  `Before`).
- Extra SKIP_DIRS: `.terraform`, `vendor`, `target`.
- Expand stopword list with capitalized participles/descriptors that
  commonly appear at sentence start: `created`, `updated`, `extracted`,
  `processed`, `total`, `summary`, `auto`, `multi`, `hybrid`, `context`,
  `bridge`, `batch`, `local`, `native`, `never`, `before`, `after`, etc.
- classify_entity: high-pronoun single-category signal now classifies as
  person. A diary's main character gets referenced with pronouns, not
  dialogue markers - requiring two signal categories demoted `Lu` (16
  pronoun hits across 30 mentions) to uncertain. Gate on
  `pronoun_hits >= 5 AND pronoun_hits / frequency >= 0.2` so common
  sentence-start words (`Never`, `Before`) with incidental proximity
  stay uncertain.
2026-04-24 00:20:32 -03:00
Lman Chu c88b8a2e17 style: fix ruff format for test_entity_detector.py
Collapse implicit string concatenation to single-line strings
to satisfy ruff format --check in CI.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-04-17 06:40:41 +08:00
Lman Chu 683e940f70 feat(i18n): add Traditional + Simplified Chinese entity detection
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.

This adds entity sections for both locales:

- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
  surnames covering >95% of Taiwanese / PRC names), length capped
  at {1,2} trailing chars so greedy matches don't swallow the
  trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
  script-aware wrap (introduced in #932) fires `\b` at CJK↔non-CJK
  transitions. This is the same mechanism used for Devanagari,
  applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
  name with no whitespace, so patterns are written as `{name}說`,
  `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `:`, Chinese quotes
  「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
  question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.
2026-04-16 17:43:09 +08:00
Igor Lins e Silva f895bc58e6 fix(entity_detector): script-aware word boundaries for combining-mark scripts
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode category Mc (Mark, Spacing Combining) — not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.

Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
2026-04-15 22:18:52 -03:00
Igor Lins e Silva c722c91e2a test: document orphan-locale recovery for _temp_locale helper 2026-04-15 08:54:23 -03:00
Igor Lins e Silva b214aced90 refactor(entity_detector): make multi-language extensible via i18n JSON
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
2026-04-15 08:52:42 -03:00
Tal Muskal abd52534bb test: bring coverage to 85%, set threshold to 85, reset version to 3.0.11
- Add tests for config, convo_miner, spellcheck, knowledge_graph
- Fix Windows PermissionError in test cleanup (chromadb file locks)
- Add UTF-8 encoding to split_mega_files, entity_registry, hooks_cli
- Fix mcp_server parse_known_args logging for unknown args
- Set coverage threshold to 85 in pyproject.toml and CI
- Reset all version files to 3.0.11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 21:38:12 +03:00
Tal Muskal 9ca70264f3 style: format test files with ruff
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 21:08:49 +03:00
Tal Muskal 03e9b57108 test: add comprehensive test coverage (35% → 58%, threshold 50%)
Add 180+ new tests across 10 test files covering previously untested modules:
- instructions_cli (0% → 100%), hooks_cli (73% → 96%), spellcheck (28% → 84%)
- palace_graph (9% → 91%), general_extractor (0% → 92%), entity_detector (0% → 69%)
- entity_registry (0% → 70%), room_detector_local (0% → 55%), layers (0% → 28%)
- onboarding (0% → 36%)

Also fixes Windows encoding bug in onboarding.py (write_text without encoding="utf-8").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 20:54:56 +03:00