fix(entity_detector): script-aware word boundaries for combining-mark scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode combining marks (category Mc or Mn), not \w.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.
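
A minimal reproduction of the failure described above, using only the
standard library. The character range [\u0900-\u097F] is an assumed stand-in
for a Hindi candidate pattern; the project's real patterns may differ.

```python
import re
import unicodedata

name = "अनीता"          # Anita
sentence = "राज ने कहा"  # "Raj said"

# Unicode category and \w status of the three matras named above
# (U+093E, U+0940, U+0941).
matra_info = {
    f"U+{ord(ch):04X}": (unicodedata.category(ch), re.fullmatch(r"\w", ch) is not None)
    for ch in "\u093e\u0940\u0941"
}

# Naive wrapping with \b, as done before this commit. The trailing \b cannot
# match after the final matra, so the regex backtracks and drops it.
naive_name = re.compile(r"\b([\u0900-\u097F]+)\b")
first_candidate = naive_name.findall(name + " ने कहा")[0]

# Person-verb pattern: the final \b sits between a matra and end-of-string,
# both non-\w, so there is no boundary and the search finds nothing.
person_verb = re.compile(r"\bराज\s+ने\s+कहा\b")
verb_match = person_verb.search(sentence)

print(matra_info)
print(first_candidate == name[:-1])  # name minus its final matra
print(verb_match)
```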

The same issue affects Arabic, Hebrew, Thai, Tamil, and any other script
whose words can contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b, so the en, pt-br,
ru, and it locales are unchanged.
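
The approach above can be sketched roughly as follows. This is an
illustration under assumptions, not the code in mempalace/i18n/__init__.py:
the function names script_boundary, expand_b and wrap_candidate are
hypothetical stand-ins, and the sketch does not handle \b inside a
character class (where it means backspace).

```python
import re

# Matches a \b escape that is not itself escaped (even number of
# preceding backslashes, captured in group 1).
_B_ESCAPE = re.compile(r"(?<!\\)((?:\\\\)*)\\b")


def script_boundary(boundary_chars):
    """Zero-width analogue of \\b where 'word character' means any
    character in boundary_chars."""
    cls = f"[{boundary_chars}]"
    return f"(?:(?<!{cls})(?={cls})|(?<={cls})(?!{cls}))"


def expand_b(pattern, boundary_chars=None):
    """Replace every \\b in pattern; leave it untouched by default."""
    if not boundary_chars:
        return pattern
    boundary = script_boundary(boundary_chars)
    # A function replacement avoids backslash processing in the new text.
    return _B_ESCAPE.sub(lambda m: m.group(1) + boundary, pattern)


def wrap_candidate(pattern, boundary_chars=None):
    """Return the pattern with boundary and capture group already applied."""
    if not boundary_chars:
        return rf"\b({pattern})\b"
    cls = f"[{boundary_chars}]"
    return f"(?<!{cls})({pattern})(?!{cls})"


HINDI_BOUNDARY = r"\w\u0900-\u097F"  # decoded form of the example value above
name = "अनीता"

hindi_names = re.findall(
    wrap_candidate(r"[\u0900-\u097F]+", HINDI_BOUNDARY), name + " ने कहा"
)
hindi_verb = re.search(expand_b(r"\bराज\s+ने\s+कहा\b", HINDI_BOUNDARY), "राज ने कहा")
english = re.findall(wrap_candidate(r"[A-Z][a-z]+"), "Alice met Bob")

print(hindi_names[0] == name, hindi_verb is not None, english)
```

With boundary_chars set, the full name is captured and the person-verb
pattern fires; without it, the helpers fall back to plain \b, which is why
locales that do not declare the field behave as before.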

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
Igor Lins e Silva
2026-04-15 22:18:52 -03:00
parent 122ce38811
commit f895bc58e6
3 changed files with 191 additions and 48 deletions
mempalace/entity_detector.py: +6 -6
@@ -134,10 +134,10 @@ def extract_candidates(text: str, languages=("en",)) -> dict:
     counts: defaultdict = defaultdict(int)
-    # Single-word candidates — one pattern per language
-    for raw_pat in patterns["candidate_patterns"]:
+    # Single-word candidates — one pre-wrapped pattern per language
+    for wrapped_pat in patterns["candidate_patterns"]:
         try:
-            rx = re.compile(rf"\b({raw_pat})\b")
+            rx = re.compile(wrapped_pat)
         except re.error:
             continue
         for word in rx.findall(text):
@@ -147,10 +147,10 @@ def extract_candidates(text: str, languages=("en",)) -> dict:
                 continue
             counts[word] += 1
-    # Multi-word candidates — one pattern per language
-    for raw_pat in patterns["multi_word_patterns"]:
+    # Multi-word candidates — one pre-wrapped pattern per language
+    for wrapped_pat in patterns["multi_word_patterns"]:
         try:
-            rx = re.compile(rf"\b({raw_pat})\b")
+            rx = re.compile(wrapped_pat)
         except re.error:
             continue
         for phrase in rx.findall(text):