fix(entity): reduce noise in regex-based detection
The pattern-matching detector had several systematic false positives that
crowded the init review with nonsense. Concrete fixes:
- CamelCase extraction: add `[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+` to
the candidate patterns so `MemPalace`, `ChromaDB`, `OpenAI`, `ChatGPT`
are captured whole. Previously `MemPalace` was fragmented into `Mem` +
`Palace`.
- Dialogue `^NAME:\s` requires >=2 matches to count. A single metadata
line like `Created: 2026-04-21` was scoring as dialogue and classifying
`Created` as a person.
- Versioned/hyphenated pattern tightened to `\b{name}[-_]v?\d+(?:\.\d+)*\b`
(version-only). The previous `\b{name}[-v]\w+` matched `context-manager`,
`multi-word`, etc. - every hyphenated compound.
- Skip LICENSE/COPYING/NOTICE/AUTHORS/PATENTS files during scan. They
produce pure-English-prose noise (`Contributor`, `Software`, `Covered`,
`Before`).
- Extra SKIP_DIRS: `.terraform`, `vendor`, `target`.
- Expand stopword list with capitalized participles/descriptors that
commonly appear at sentence start: `created`, `updated`, `extracted`,
`processed`, `total`, `summary`, `auto`, `multi`, `hybrid`, `context`,
`bridge`, `batch`, `local`, `native`, `never`, `before`, `after`, etc.
- classify_entity: high-pronoun single-category signal now classifies as
person. A diary's main character gets referenced with pronouns, not
dialogue markers - requiring two signal categories demoted `Lu` (16
pronoun hits across 30 mentions) to uncertain. Gate on
`pronoun_hits >= 5 AND pronoun_hits / frequency >= 0.2` so common
sentence-start words (`Never`, `Before`) with incidental proximity
stay uncertain.
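
For reviewers, a minimal sketch of the rules described above, not the detector's actual code. The regexes and the pronoun thresholds are taken from this message; the helper names (`versioned_pattern`, `old_versioned_pattern`, `counts_as_dialogue`, `high_pronoun_signal`) and the use of `re.escape` on the name are assumptions made for illustration.

```python
import re

# CamelCase candidate pattern quoted in the first bullet.
CAMEL_CASE = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+|[A-Z]{2,})+")


def versioned_pattern(name: str) -> re.Pattern:
    """Version-only pattern from this commit (hypothetical helper)."""
    # Escaping the name is an assumption, not something the commit states.
    return re.compile(rf"\b{re.escape(name)}[-_]v?\d+(?:\.\d+)*\b")


def old_versioned_pattern(name: str) -> re.Pattern:
    """Previous pattern, kept only to show why it over-matched."""
    return re.compile(rf"\b{re.escape(name)}[-v]\w+")


def counts_as_dialogue(name: str, text: str) -> bool:
    """Require at least two `^NAME:\\s` lines before treating it as dialogue."""
    hits = re.findall(rf"^{re.escape(name)}:\s", text, flags=re.MULTILINE)
    return len(hits) >= 2


def high_pronoun_signal(pronoun_hits: int, frequency: int) -> bool:
    """Gate from the last bullet: >= 5 hits and a hit ratio of at least 0.2."""
    return frequency > 0 and pronoun_hits >= 5 and pronoun_hits / frequency >= 0.2


if __name__ == "__main__":
    print(CAMEL_CASE.findall("MemPalace stores vectors in ChromaDB."))
    print(bool(old_versioned_pattern("context").search("context-manager")))
    print(bool(versioned_pattern("context").search("context-manager")))
    print(counts_as_dialogue("Created", "Created: 2026-04-21\nbody text"))
    print(high_pronoun_signal(16, 30), high_pronoun_signal(2, 21))
```

The `Lu` case (16 hits over 30 mentions) clears both thresholds, while `Never` (2 hits over 21 mentions) fails the absolute count before the ratio is even considered.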
@@ -148,6 +148,33 @@ def test_classify_entity_pronoun_only_is_uncertain():
     assert result["type"] == "uncertain"
 
 
+def test_classify_entity_high_pronoun_signal_is_person():
+    """A diary's main character hit by many pronouns should still classify
+    as a person even with only the pronoun signal category. Example from
+    real data: `Lu` has 16 pronoun hits out of 30 mentions."""
+    scores = {
+        "person_score": 32,
+        "project_score": 0,
+        "person_signals": ["pronoun nearby (16x)"],
+        "project_signals": [],
+    }
+    result = classify_entity("Lu", 30, scores)
+    assert result["type"] == "person"
+
+
+def test_classify_entity_low_pronoun_proximity_is_uncertain():
+    """Common sentence-start words (Never, Before) get a few pronouns nearby
+    incidentally. The ratio stays low (<20%), so they stay uncertain."""
+    scores = {
+        "person_score": 4,
+        "project_score": 0,
+        "person_signals": ["pronoun nearby (2x)"],
+        "project_signals": [],
+    }
+    result = classify_entity("Never", 21, scores)
+    assert result["type"] == "uncertain"
+
+
 def test_classify_entity_mixed_signals():
     scores = {
         "person_score": 5,