fix(entity_detector): script-aware word boundaries for combining-mark scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) like ा ी ु are Unicode combining marks (category Mc, Spacing Combining, or Mn, Nonspacing), which Python does not count as \w. This means \b fires mid-word at every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because the trailing \b fails after the final matra of कहा.

The same issue affects Arabic, Hebrew, Thai, Tamil, and every other script whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b, so en, pt-br, ru, and it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b, _wrap_candidate, _collect_entity_section; candidate_patterns are now returned fully wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles the pre-wrapped candidate patterns directly instead of re-wrapping them with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries (name extraction with/without boundary_chars, person-verb firing, English regression)
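
For illustration only, here is a minimal sketch of how a script-aware boundary of the kind described above could be built from lookarounds. The helper names (script_boundary, expand_b, wrap_candidate) and the toy candidate pattern are assumptions made for this example; the actual _script_boundary, _expand_b, and _wrap_candidate in mempalace/i18n/__init__.py are not shown on this page and may differ.

```python
import re
from typing import Optional

# Hypothetical helpers, written for illustration. They are NOT the
# implementations from mempalace/i18n/__init__.py.


def script_boundary(boundary_chars: str) -> str:
    """Zero-width pattern that acts like \\b, but treats every character
    in `boundary_chars` (character-class syntax) as inside-word."""
    inside = f"[{boundary_chars}]"
    return (
        f"(?:(?<!{inside})(?={inside})"  # word start: outside, then inside
        f"|(?<={inside})(?!{inside}))"   # word end: inside, then outside
    )


def expand_b(pattern: str, boundary_chars: Optional[str]) -> str:
    """Replace each \\b with the script-aware boundary.
    Naive on purpose: it does not special-case an escaped backslash
    followed by 'b', or [\\b] (backspace) inside a character class."""
    if not boundary_chars:
        return pattern  # default locales keep the standard \b
    return pattern.replace(r"\b", script_boundary(boundary_chars))


def wrap_candidate(raw_pat: str, boundary_chars: Optional[str]) -> str:
    """Return boundary + capture group + boundary, ready for re.compile."""
    edge = script_boundary(boundary_chars) if boundary_chars else r"\b"
    return f"{edge}({raw_pat}){edge}"


HINDI = r"\w\u0900-\u097F"   # same character set as the Hindi example above
NAME = r"[\u0900-\u097F]+"   # toy candidate pattern, assumed for the demo
TEXT = "अनीता ने कहा"

plain = re.findall(wrap_candidate(NAME, None), TEXT)
aware = re.findall(wrap_candidate(NAME, HINDI), TEXT)
print(plain[0])  # the name is cut before its final matra
print(aware)     # whole words, matras included

verb = r"\bराज\s+ने\s+कहा\b"
print(re.search(verb, "राज ने कहा"))                   # None with plain \b
print(re.search(expand_b(verb, HINDI), "राज ने कहा"))  # a match object
```

One caveat of this sketch: the Devanagari block U+0900-U+097F also contains the danda (।), so a range this broad would treat sentence-final punctuation as inside-word. Whether that matters depends on the locale's patterns.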
2026-04-15 22:18:52 -03:00
parent 122ce38811
commit f895bc58e6
3 changed files with 191 additions and 48 deletions
@@ -134,10 +134,10 @@ def extract_candidates(text: str, languages=("en",)) -> dict:

    counts: defaultdict = defaultdict(int)

-    # Single-word candidates — one pattern per language
-    for raw_pat in patterns["candidate_patterns"]:
+    # Single-word candidates — one pre-wrapped pattern per language
+    for wrapped_pat in patterns["candidate_patterns"]:
        try:
-            rx = re.compile(rf"\b({raw_pat})\b")
+            rx = re.compile(wrapped_pat)
        except re.error:
            continue
        for word in rx.findall(text):
@@ -147,10 +147,10 @@ def extract_candidates(text: str, languages=("en",)) -> dict:
                continue
            counts[word] += 1

-    # Multi-word candidates — one pattern per language
-    for raw_pat in patterns["multi_word_patterns"]:
+    # Multi-word candidates — one pre-wrapped pattern per language
+    for wrapped_pat in patterns["multi_word_patterns"]:
        try:
-            rx = re.compile(rf"\b({raw_pat})\b")
+            rx = re.compile(wrapped_pat)
        except re.error:
            continue
        for phrase in rx.findall(text):