fix(entity_detector): script-aware word boundaries for combining-mark scripts

Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras) such as ा ी ु are Unicode combining marks (category Mc, Spacing Combining, or Mn, Nonspacing), and \w does not match them. As a result \b splits mid-word at every matra: names like अनीता (Anita) truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b never match because \b fails after the final matra of कहा. The same issue affects Arabic, Hebrew, Thai, Tamil, and any other script whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n loader replaces every \b in that locale's patterns with a script-aware lookaround that treats the declared characters as "inside-word", and pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b, so en, pt-br, ru, and it are unchanged.

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b, _wrap_candidate, _collect_entity_section; candidate_patterns are now returned fully wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries (name extraction with/without boundary_chars, person-verb firing, English regression)
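
The boundary problem and the lookaround fix can be reproduced in a few lines. The sketch below is illustrative only: script_boundary and expand_b are hypothetical stand-ins for the loader's _script_boundary and _expand_b, whose real implementations are not shown in this commit page, and the candidate pattern [\u0900-\u097F]+ is an assumed example rather than the pattern shipped in the Hindi locale.

```python
import re

TEXT = "अनीता ने कहा"               # "Anita said"
HINDI_CHARS = r"\w\u0900-\u097F"    # boundary_chars value quoted in the message


def script_boundary(boundary_chars: str) -> str:
    """Zero-width boundary that treats boundary_chars as inside-word (hypothetical)."""
    cls = f"[{boundary_chars}]"
    # Either a word starts here, or a word ends here.
    return rf"(?:(?<!{cls})(?={cls})|(?<={cls})(?!{cls}))"


def expand_b(pattern: str, boundary_chars: str) -> str:
    """Replace each \\b escape in pattern with the script-aware boundary (hypothetical)."""
    boundary = script_boundary(boundary_chars)
    # Walk escapes pairwise so an escaped backslash followed by 'b' is left alone.
    return re.sub(
        r"\\(.)",
        lambda m: boundary if m.group(1) == "b" else m.group(0),
        pattern,
        flags=re.DOTALL,
    )


naive = r"\b([\u0900-\u097F]+)\b"   # assumed candidate pattern, wrapped the old way
aware = expand_b(naive, HINDI_CHARS)

print(re.findall(naive, TEXT))      # first hit is the truncated 'अनीत'
print(re.findall(aware, TEXT))      # ['अनीता', 'ने', 'कहा']

person_verb = r"\bराज\s+ने\s+कहा\b"
print(re.search(person_verb, "राज ने कहा"))                                # None
print(bool(re.search(expand_b(person_verb, HINDI_CHARS), "राज ने कहा")))   # True
```

One limitation of this sketch: a \b written inside a character class (where it means backspace) would also be rewritten, so a production version would need to track class context.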

--- a/mempalace/entity_detector.py
+++ b/mempalace/entity_detector.py
@@ -134,10 +134,10 @@ def extract_candidates(text: str, languages=("en",)) -> dict:
 
     counts: defaultdict = defaultdict(int)
 
-    # Single-word candidates — one pattern per language
-    for raw_pat in patterns["candidate_patterns"]:
+    # Single-word candidates — one pre-wrapped pattern per language
+    for wrapped_pat in patterns["candidate_patterns"]:
         try:
-            rx = re.compile(rf"\b({raw_pat})\b")
+            rx = re.compile(wrapped_pat)
         except re.error:
             continue
         for word in rx.findall(text):
@@ -147,10 +147,10 @@ def extract_candidates(text: str, languages=("en",)) -> dict:
                 continue
             counts[word] += 1
 
-    # Multi-word candidates — one pattern per language
-    for raw_pat in patterns["multi_word_patterns"]:
+    # Multi-word candidates — one pre-wrapped pattern per language
+    for wrapped_pat in patterns["multi_word_patterns"]:
         try:
-            rx = re.compile(rf"\b({raw_pat})\b")
+            rx = re.compile(wrapped_pat)
         except re.error:
             continue
         for phrase in rx.findall(text):
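
To show why the consumer can now call re.compile on the pattern as-is, here is a minimal sketch of the producer/consumer split. It assumes a hypothetical wrap_candidate helper standing in for the loader's _wrap_candidate (not shown above) and a simplified count_candidates loop that mirrors only the lines visible in the diff; the filtering that extract_candidates does between findall and the counter is omitted.

```python
import re
from collections import defaultdict
from typing import Iterable, Optional


def wrap_candidate(raw_pat: str, boundary_chars: Optional[str] = None) -> str:
    """Return raw_pat wrapped in a boundary plus one capture group (hypothetical)."""
    if boundary_chars is None:
        # Default locales keep the standard \b wrapping.
        return rf"\b({raw_pat})\b"
    cls = f"[{boundary_chars}]"
    # Script-aware boundary: a transition into or out of the declared characters.
    boundary = rf"(?:(?<!{cls})(?={cls})|(?<={cls})(?!{cls}))"
    return f"{boundary}({raw_pat}){boundary}"


def count_candidates(text: str, wrapped_patterns: Iterable[str]) -> dict:
    """Compile each pre-wrapped pattern directly, skipping ones that fail to compile."""
    counts: defaultdict = defaultdict(int)
    for wrapped_pat in wrapped_patterns:
        try:
            rx = re.compile(wrapped_pat)
        except re.error:
            continue
        # With exactly one capture group, findall returns the captured strings.
        for word in rx.findall(text):
            counts[word] += 1
    return dict(counts)


patterns = [
    wrap_candidate(r"[A-Z][a-z]+"),                         # en-style, plain \b
    wrap_candidate(r"[\u0900-\u097F]+", r"\w\u0900-\u097F"),  # assumed Hindi pattern
    "(",                                                    # invalid, silently skipped
]
print(count_candidates("Anita and अनीता met अनीता", patterns))
# {'Anita': 1, 'अनीता': 2}
```

Keeping the wrapping in the loader means the detector stays locale-agnostic: it never needs to know whether a pattern uses \b or a lookaround boundary.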