Commit Graph

147 Commits

Author SHA1 Message Date
eldar702 4949aab68b fix: guard None metadata/doc in tool_check_duplicate and Layer1/Layer2
Chroma 1.5.x can return ``None`` inside the ``metadatas`` / ``documents``
lists of a query/get result for partially-flushed rows. The codebase
already has a systemic None-guard pattern (merged #999, #1013, #1019)
but three call sites were still unguarded:

* ``mcp_server.tool_check_duplicate`` (``mcp_server.py:487-488``) —
  ``meta = results["metadatas"][0][i]`` followed by ``meta.get(...)``
  raises ``AttributeError: 'NoneType' object has no attribute 'get'``.
  The broad ``except Exception`` wrapper (line 504) swallows it and
  returns an uninformative ``"Duplicate check failed"``.

* ``layers.Layer1.generate`` (``layers.py:126``) — iterates
  ``zip(docs, metas)`` and calls ``meta.get(key)`` in the importance
  loop. A single None metadata blows up the entire wake-up render.

* ``layers.Layer2.retrieve`` (``layers.py:224``) — same pattern, same
  crash path for the on-demand render.

Apply the same ``meta = meta or {}`` / ``doc = doc or ""`` idiom used
by the merged guards in the search path. Three-line additions, no
behaviour change on well-formed results.
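
A minimal sketch of the guard idiom described above. The function name `render_entries` and the output shape are hypothetical stand-ins for the guarded loops, not the project's actual code:

```python
def render_entries(docs, metas):
    """Render (doc, meta) pairs, tolerating None in either list."""
    entries = []
    for doc, meta in zip(docs, metas):
        meta = meta or {}  # None metadata becomes an empty dict, so .get() is safe
        doc = doc or ""    # None document becomes an empty string
        entries.append(
            {
                "wing": meta.get("wing", "?"),
                "room": meta.get("room", "?"),
                "content": doc,
            }
        )
    return entries


# Second row simulates a partially-flushed result: both values are None.
entries = render_entries(["hello", None], [{"wing": "w1", "room": "r1"}, None])
```

Well-formed rows pass through unchanged; only the `None` rows pick up the sentinel values.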

Tests added:

* ``test_check_duplicate_handles_none_metadata`` — mocks the collection
  query to return ``None`` for one metadata and document, asserts the
  call does not crash and the sentinel-rendered entry has wing/room "?"
  and empty content.
* ``test_layer1_handles_none_metadata`` / ``_handles_none_document``
* ``test_layer2_handles_none_metadata``

Relationship to other open PRs:

* **#1019** guarded ``searcher.py`` loops. This PR extends the same
  guard to the three call sites #1019 did not touch.
* **#979** fixed ``tool_check_duplicate`` negative similarity but left
  the None-metadata path unguarded.
* Does not overlap **#1013** (``Layer3.search_raw``) or **#999**.
2026-04-19 11:13:50 +03:00
Igor Lins e Silva 66090b2bcb Merge pull request #1014 from MemPalace/refactor/rfc-002-sources-scaffolding
refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext
2026-04-18 18:44:52 -03:00
Ben Sigman 1b89b49b78 Merge pull request #999 from jphein/fix/searcher-none-metadata
fix(searcher): guard against None metadata in CLI print path
2026-04-18 13:41:52 -07:00
Igor Lins e Silva 89904ed03f fix(sources): address Copilot review on #1014
Five findings from the automated review, fixed with targeted tests where
behavior changed:

1. Transformation Protocol (transforms.py). The registry mixed a bytes-to-str
   transform (utf8_replace_invalid) with str-to-str transforms under a single
   Callable[..., str] type, misleading static type checkers and adapter
   authors. Introduced a Transformation Protocol with __call__(data: bytes|str)
   -> str and retyped the registry + get_transformation return.

2. Drawer-id collision risk (context.py). Switched _build_drawer_id from
   sha1[:16]=64 bits to sha256[:24]=96 bits. 64 bits sits uncomfortably
   close to the birthday bound for palace-sized corpora; 96 bits keeps the
   collision probability negligible while preserving the existing
   <prefix>_<chunk> layout adapters rely on.

3. Fresh-schema KG columns (knowledge_graph.py). source_drawer_id and
   adapter_name now live in the canonical CREATE TABLE so new palaces don't
   take an ALTER round-trip on first open. _migrate_schema stays for legacy
   palaces (SQLite has no ADD COLUMN IF NOT EXISTS, so PRAGMA introspection
   is still needed there).

4. Identity-shim comment (transforms.py). Comment said the adapter-specific
   transforms "raise if invoked without adapter context" but they return
   the input unchanged. Updated the comment to match the actual identity-
   shim behavior Copilot suggested.

5. Test docstring (test_sources.py). Comment mentioned default_factory=list
   but SourceRef.options uses default_factory=dict. Corrected.

Tests: 1020 passed (up from 1018), +2 new tests for the sha256 id shape
and the fresh-schema column presence on new palaces.
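
The bit-width arithmetic in item 2 can be checked with a short sketch. `short_digest` and `approx_collision_probability` are illustrative helpers, not the project's `_build_drawer_id`; the input bytes and the corpus size of one million items are assumptions chosen only for the comparison:

```python
import hashlib


def short_digest(data: bytes, hex_chars: int = 24) -> str:
    """Truncated SHA-256 hex digest; each hex character carries 4 bits."""
    return hashlib.sha256(data).hexdigest()[:hex_chars]


def approx_collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: p ~= n * (n - 1) / 2^(bits + 1)."""
    return n_items * (n_items - 1) / 2 ** (bits + 1)


digest = short_digest(b"example-source#0")     # 24 hex chars = 96 bits
p64 = approx_collision_probability(10**6, 64)  # 16 hex chars
p96 = approx_collision_probability(10**6, 96)  # 24 hex chars
```

Going from 64 to 96 bits divides the approximate collision probability by 2^32 for any fixed corpus size.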
2026-04-18 17:17:50 -03:00
jp 3f0cfd5ed4 fix(mcp): guard tool_status/list_wings/list_rooms/get_taxonomy against None metadata
Four more MCP handlers iterate a metadata list and call m.get(...)
unconditionally. When the cache contains a None entry (drawers with no
metadata, common on older mining paths), the try block catches the
AttributeError and marks the response "partial: true" with an
error message — visible as {"error": "'NoneType' object has no
attribute 'get'", "partial": true} returned from mempalace_status even
though the palace data is otherwise fetchable.

Same m = m or {} guard we applied to searcher.py (d3a2d22, a51c3c2)
and miner.status() (66f08a1). None-metadata drawers now roll up under
the existing "unknown" fallback bucket instead of poisoning the
response with a misleading partial flag.

Regression test: mock the metadata cache with a None in the middle,
assert tool_status returns clean counts and no error/partial fields.
Verified the test fails without the guard.

998 tests pass.
2026-04-18 12:38:23 -07:00
Igor Lins e Silva 552e9927b7 refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext
Lands the read-side contract so third-party adapter authors (@Perseusxrltd,
@JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a
stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only
  ingest() / describe_schema() and default is_current() / source_summary()
  / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata,
  DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3,
  §5.2). Error classes: SourceNotFoundError, AuthRequiredError,
  AdapterClosedError, TransformationViolationError, SchemaConformanceError
  (§2.7). Class-level identity contract: name / adapter_version /
  capabilities / supported_modes / declared_transformations /
  default_privacy_class (§2.1, §1.4, §1.5, §6).

- mempalace/sources/transforms.py: reference implementations of the 13
  reserved transformations (§1.4). Seven are pure functions
  (utf8_replace_invalid, newline_normalize, whitespace_trim,
  whitespace_collapse_internal, line_trim, line_join_spaces,
  blank_line_drop); the other six are identity shims for the
  adapter-specific ones (strip_tool_chrome, tool_result_truncate,
  tool_result_omitted, spellcheck_user, synthesized_marker,
  speaker_role_assignment) that the conversations adapter will override
  when migrated. get_transformation(name) resolves by reserved name.

- mempalace/sources/registry.py: entry-point discovery via
  importlib.metadata.entry_points(group="mempalace.sources") + explicit
  register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source()
  implements the §3.3 priority order; crucially, no auto-detection on the
  read side (§3.3 is explicit about that — user intent never inferred from
  on-disk artifacts).

- mempalace/sources/context.py: PalaceContext facade (§9) bundling the
  drawer/closet collections, knowledge graph, palace path, adapter identity,
  and progress hooks core passes into adapter.ingest(). upsert_drawer()
  applies the spec-mandated adapter_name/adapter_version stamps from §5.1.
  skip_current_item() signals laziness; emit() dispatches to hooks and
  swallows hook exceptions.

- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id
  and adapter_name kwargs (§5.5). Backwards-compatible column migration
  auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA
  table_info then ALTER TABLE ADD COLUMN), matching the pattern used for
  any new palace-side provenance fields.

- pyproject.toml: mempalace.sources entry-point group declared. Empty on
  the first-party side for now — miners migrate in a follow-up; the group
  being present means third-party packages can begin registering today.
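
The backwards-compatible column migration described for knowledge_graph.py (PRAGMA table_info, then ALTER TABLE ADD COLUMN) might look roughly like the sketch below. The helper name `add_missing_columns` and the `triples` table layout are assumptions for illustration; only the two column names come from the commit message:

```python
import sqlite3


def add_missing_columns(conn: sqlite3.Connection, table: str, columns: dict) -> list:
    """Add any column in `columns` (name -> SQL type) that `table` lacks.

    Table and column names are interpolated directly, so they must be
    trusted constants, never user input.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
wanted = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}
first_run = add_missing_columns(conn, "triples", wanted)   # legacy schema: adds both
second_run = add_missing_columns(conn, "triples", wanted)  # already migrated: no-op
```

Introspecting first makes the migration safe to run on every open, since a second pass finds nothing to add.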

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename
  that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into
  the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The
  format-detection if-chain in normalize.py becomes per-format plugins;
  declared_transformations enumerates what the current pipeline already
  does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-
  trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for
  secrets_possible scanning.

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering
the ABC instantiation rules, typed records, all reserved transformations,
the registry register/get/unregister surface, PalaceContext upsert + skip +
emit semantics, and both the new KG provenance kwargs and backwards-
compatible legacy-schema migration.

Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10
cleanup — sibling PR on the write side).
2026-04-18 16:05:32 -03:00
Igor Lins e Silva 2b9f17c401 Merge pull request #995 from MemPalace/refactor/rfc-001-cleanup
refactor(backends): RFC 001 §10 cleanup — typed results, PalaceRef, registry
2026-04-18 15:56:12 -03:00
jp 7690574dde fix(searcher): guard API path + closet loop against None metadata too
Per Copilot review on the CLI-only PR (#999): search_memories() has the
same vulnerability in two additional spots, since ChromaDB can return
None entries in the inner metadatas list for either the drawer query or
the closets query. Without guards, the API path crashes with:

    AttributeError: 'NoneType' object has no attribute 'get'

at either `cmeta.get("source_file", "")` in the closet boost lookup or
`meta.get("source_file", "") or ""` in the drawer scoring loop.

Applies the matching `meta = meta or {}` / `cmeta = cmeta or {}`
guard at both sites and adds an API-path regression test that mocks a
drawer query result with a None metadata entry and asserts both hits
render, with the None-metadata hit carrying the existing `"unknown"`
sentinel values the scoring loop already writes for missing keys.

Verified both the new API test and the existing CLI test fail without
the guards (AttributeError) and pass with them.
2026-04-18 10:37:05 -07:00
jp feba7e8043 fix(miner): same None-metadata guard for status() histogram loop
`status()` walks `col.get(include=["metadatas"])` and buckets each drawer
into a `wing_rooms[wing][room]` histogram. The same ChromaDB return shape
fixed in the search print path — `None` entries in the `metadatas` list
for drawers with no stored metadata — crashes the status command with:

    AttributeError: 'NoneType' object has no attribute 'get'

Applies the matching ``m = m or {}`` guard so None-metadata drawers roll
up under the existing `?/?` fallback bucket instead of killing the
command mid-tally. Reproduced on a 135K-drawer palace where two drawers
had `metadata=None`; both now show under `WING: ? / ROOM: ?` in the
tally while the command prints the full histogram as designed.

Adds a regression test that feeds `status()` a fake collection whose
`get()` returns a `None` in the middle of the metadatas list and asserts
both the fallback bucket and the real wing render.
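
A rough sketch of the guarded histogram loop, assuming a flat list of metadata dicts. `tally_wing_rooms` is a hypothetical name; the real `status()` also handles collection access and printing:

```python
from collections import defaultdict


def tally_wing_rooms(metadatas):
    """Bucket drawers into wing -> room -> count, tolerating None entries."""
    wing_rooms = defaultdict(lambda: defaultdict(int))
    for m in metadatas:
        m = m or {}  # None metadata falls through to the "?" defaults below
        wing_rooms[m.get("wing", "?")][m.get("room", "?")] += 1
    return wing_rooms


# A None in the middle of the list must not abort the tally.
tally = tally_wing_rooms(
    [{"wing": "code", "room": "api"}, None, {"wing": "code", "room": "api"}]
)
```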
2026-04-18 10:26:11 -07:00
jp a3c778210b fix(searcher): guard against None metadata in CLI print path
`col.query(...)` can return `None` entries in the inner ``metadatas`` list
for drawers whose metadata was never set (older palaces, rows written
outside the normal mining path). The CLI `search()` function would render
earlier results successfully and then crash mid-loop with:

    AttributeError: 'NoneType' object has no attribute 'get'

at ``searcher.py:286`` — ``meta.get("source_file", "?")``. The user sees
partial output followed by a traceback, with no indication of which
drawers rendered OK and which were skipped.

Guard with ``meta = meta or {}`` inside the loop so entries with missing
metadata fall back to the existing ``"?"`` defaults instead of crashing,
matching the hit dict assembly in ``search_memories()`` which already
uses ``meta.get("wing", "unknown")`` etc. against the same data.

Adds a regression test that mocks a ChromaDB result with a ``None``
metadata entry in the middle of the inner list and asserts both result
blocks render to stdout.
2026-04-18 10:00:59 -07:00
Igor Lins e Silva efaa39bea9 test(backends): dedup update-length-validation tests
24bf97b (network-download fix) and my earlier Copilot-review commit both
added tests for the same ValueError. Keep the broader one that covers
both 'documents length' and 'metadatas length' mismatches; drop the
narrower duplicate.
2026-04-18 13:53:46 -03:00
Igor Lins e Silva 61dd6e7d9c test(backends): fix Windows file-lock in cache-invalidation test
PermissionError [WinError 32] on Windows when Path.unlink() runs while
chromadb.PersistentClient still holds a handle on chroma.sqlite3. Rewrite
test_chroma_cache_invalidates_when_db_file_missing to prime
backend._clients/_freshness with a sentinel object instead of opening a
real PersistentClient, so the unlink runs against an unheld file.

The assertion is also corrected: after invalidation, ChromaBackend's
_client rebuilds a fresh PersistentClient which re-creates chroma.sqlite3
and re-stats it, so freshness ends up at the post-rebuild stat (not
(0, 0.0) as the assertion previously expected). The meaningful invariant
is "freshness advanced past the pre-unlink value AND the sentinel was
replaced", which the test now checks.

Ref: Windows CI failure on #995.
2026-04-18 13:52:56 -03:00
copilot-swe-agent[bot] 24bf97bb65 fix(tests): avoid ONNX network download in update-length validation tests
test_base_collection_update_default_validates_list_lengths and
test_base_collection_update_default_rejects_mismatched_lengths were
spinning up a real ChromaBackend and calling add(documents=...), which
triggered ChromaDB's default ONNX embedding function and attempted a
network download — failing in offline/sandboxed CI.

BaseCollection.update() validates list lengths before any DB access, so
no items need to be pre-loaded for the length-check to fire. Switch both
tests to use _FakeCollection (same as the rest of the unit tests in this
file) so they are pure in-memory and network-free.

Also fixes a structural bug in test 1: collection._collection.add() was
accidentally placed inside the pytest.raises(ValueError) block, masking
the real assertion.

Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/55fc663e-b256-4b8b-88ce-4271560def8d

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
2026-04-18 16:23:58 +00:00
Igor Lins e Silva 4a088ea8e1 Address Copilot review: cursor tie-break, honest metrics, accurate comments
Six items from the automated review on PR #998:

1. **Cursor tie-break bug (correctness).** The skip condition was
   `rec.timestamp <= cursor`; if multiple messages share the max
   timestamp and only some were ingested before a crash, the rest
   would be lost forever. Changed to `< cursor`, relying on
   deterministic drawer IDs for safe re-attempt at the boundary.
   Regression test
   `test_sweep_recovers_untaken_message_at_cursor_timestamp`.

2. **`drawers_added` counted upserts, not adds.** Added a pre-flight
   `collection.get(ids=batch)` to distinguish new rows from already-
   present ones. Return value now carries `drawers_added`,
   `drawers_already_present`, `drawers_upserted`, and `drawers_skipped`
   separately. Dict-compatible access (`existing.get("ids")`) keeps it
   working on both the raw Chroma return and the typed `GetResult`.

3. **`sweep_directory` hid failures in the summary.** `files_processed`
   used to exclude failed files. Replaced with `files_attempted` (all
   discovered) + `files_succeeded` (subset that completed); CLI output
   shows `succeeded/attempted`.

4. **Coordination claim was overstated.** The primary miners don't
   stamp `session_id`/`timestamp` metadata, so the sweeper coordinates
   only with its own prior runs. Softened docstrings on module and CLI
   command. Uniform cross-miner metadata is flagged as a follow-up.

5. **MAX_FILE_SIZE comments were misleading.** Said source size "does
   not affect storage or embedding cost" — true per-drawer, but source
   size still scales drawer count, embedding work, and memory usage
   (files are read in full, not streamed). Corrected in both
   `miner.py` and `convo_miner.py`.

6. Added the tie-break regression test that reproduces the correctness
   bug from (1).
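
The tie-break fix in item 1 can be illustrated with a small sketch. The record tuples, integer timestamps, and dict-backed store are simplifications assumed here; the point is only the `<` comparison combined with deterministic IDs:

```python
def sweep(records, cursor, store):
    """Ingest records at or after the cursor; return how many were new.

    records: iterable of (drawer_id, timestamp, text)
    store:   dict keyed by deterministic drawer_id (stands in for the palace)
    """
    added = 0
    for drawer_id, timestamp, text in records:
        if cursor is not None and timestamp < cursor:
            continue  # strictly older than the cursor: already ingested
        if drawer_id not in store:
            added += 1
        store[drawer_id] = text  # upsert; re-writing a boundary row is harmless
    return added


records = [("s1_a", 10, "first"), ("s1_b", 20, "second"), ("s1_c", 20, "third")]
# Simulate a crash after "second": "third" shares the max timestamp but is missing.
store = {"s1_a": "first", "s1_b": "second"}
cursor = max(ts for rid, ts, _text in records if rid in store)
recovered = sweep(records, cursor, store)
```

With `<=` in the skip condition, `s1_c` would be skipped on every later run because its timestamp equals the cursor.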

Tests: 970 passed (was 969), ruff + pre-commit clean.

Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
2026-04-18 13:22:18 -03:00
Igor Lins e Silva 42b940d263 fix(backends): address Copilot review on #995
Four defects surfaced by the automated review, fixed with targeted tests:

1. BaseCollection.update() default now validates that documents / metadatas /
   embeddings lengths match ids, raising ValueError instead of silently
   misaligning pairs or raising IndexError (base.py).

2. ChromaCollection.query() now rejects the two ambiguous input shapes up
   front — neither or both of query_texts / query_embeddings, and empty input
   lists — with clear ValueError messages rather than delegating to chromadb's
   less-obvious errors (chroma.py).

3. QueryResult.empty() accepts embeddings_requested=True to preserve the
   outer-query dimension with empty hit lists when the caller asked for
   embeddings, matching the spec rule that included fields carry the outer
   shape even when empty (base.py). ChromaCollection.query() threads this
   through on the empty-result path (chroma.py).

4. ChromaBackend cache-freshness check now matches the semantics from
   mcp_server._get_client (merged via #757) on three edge cases Copilot
   called out: (a) invalidate when chroma.sqlite3 disappears while a cached
   client is held, (b) treat a 0→nonzero stat transition as a change so a
   cache built when the DB did not yet exist is refreshed, (c) re-stat
   after PersistentClient constructs the DB lazily so freshness reflects
   the post-creation state (chroma.py).
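
As an illustration of the length check in item 1, a standalone validator might look like the following. The function name and error wording are assumptions; the commit only states that mismatched lengths raise ValueError before any DB access:

```python
def validate_update_lengths(ids, documents=None, metadatas=None, embeddings=None):
    """Raise ValueError if any provided list is not the same length as ids."""
    provided = (
        ("documents", documents),
        ("metadatas", metadatas),
        ("embeddings", embeddings),
    )
    for name, values in provided:
        if values is not None and len(values) != len(ids):
            raise ValueError(
                f"{name} length {len(values)} does not match ids length {len(ids)}"
            )


# Matching lengths pass silently; omitted fields are not checked.
validate_update_lengths(["a", "b"], documents=["x", "y"])
```

Failing early keeps ids from being paired with the wrong document or metadata.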

Tests: 978 passed (up from 970), 8 new tests covering the fixes.
2026-04-18 13:19:18 -03:00
Igor Lins e Silva 29ce7c7135 Harden sweeper for production: verbatim tool blocks, full session_id, logged failures
Four changes on top of the proposal's initial sweeper draft, driven by
the CLAUDE.md design principles:

1. Drop the 500-char truncation on tool_use / tool_result content in
   _flatten_content. The "verbatim always" principle forbids lossy
   compression of user-adjacent data; a long code-edit diff handed to
   the assistant must round-trip intact. Unknown block types now also
   serialize their full payload instead of just a type marker. New test
   test_parse_preserves_tool_blocks_verbatim covers a 5000-char input.

2. Use the full session_id in drawer IDs (not session_id[:12]). Rules
   out cross-session collisions if a transcript source ever uses
   non-UUID session identifiers or shared prefixes.

3. Replace silent `except Exception: return None` in get_palace_cursor
   with a logger.warning — the exact anti-pattern this PR otherwise
   criticizes in miner.py. The fallback behavior is still safe
   (deterministic IDs make a missed cursor recover on the next run),
   but the failure is now discoverable.

4. sweep_directory now collects per-file failures into the result dict
   and the CLI exits non-zero when any file failed, so a partial-sweep
   outcome is visible rather than swallowed.

Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
2026-04-18 13:14:32 -03:00
MSL fed69935d3 Add tandem sweeper: message-level safety net for dropped transcripts
The primary miners (miner.py, convo_miner.py) operate at file
granularity and can drop data for several reasons: size caps, silent
OSError on read, dedup false positives, extensions the project miner
does not recognize. Even with tonight's hotfixes, any future bug in
the file-level path risks silent data loss.

The sweeper is a second, cooperating miner that works at MESSAGE
granularity:

  - Parses Claude Code .jsonl line by line, yielding only
    user/assistant records (filters progress, file-history-snapshot,
    etc. noise).
  - For each session_id, queries the palace for max(timestamp) and
    treats that as the cursor.
  - Ingests only messages newer than the cursor, as one small drawer
    per exchange (never hits a size cap — each drawer is 1-5 KB).
  - Deterministic drawer IDs from session_id + message UUID make
    reruns idempotent; crash mid-sweep is safe.

Tandem coordination is free: if the primary miner committed up to
timestamp T, the sweeper resumes from T. If the primary miner missed
everything, the sweeper catches it all. Neither duplicates the other.
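
A minimal sketch of the message-level parsing and deterministic IDs described above. The JSON field names (`type`, `uuid`, `sessionId`) and the `sweep_` ID prefix are assumptions made for illustration and may not match the real transcript schema or the sweeper's ID format:

```python
import json

KEPT_TYPES = {"user", "assistant"}


def iter_messages(lines):
    """Yield only user/assistant records from JSONL lines, skipping blanks."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)  # malformed lines raise instead of vanishing
        if record.get("type") in KEPT_TYPES:
            yield record


def drawer_id(session_id: str, message_uuid: str) -> str:
    """Same inputs always give the same ID, so a rerun upserts in place."""
    return f"sweep_{session_id}_{message_uuid}"


sample = [
    '{"type": "user", "uuid": "m1", "sessionId": "s1"}',
    '{"type": "progress"}',
    "",
    '{"type": "assistant", "uuid": "m2", "sessionId": "s1"}',
]
kept = list(iter_messages(sample))
ids = [drawer_id(r["sessionId"], r["uuid"]) for r in kept]
```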

Smoke test on a real Claude Code transcript:
  1st run: +39 drawers, 0 already present
  2nd run: +0 drawers, 39 already present  (perfect idempotence)

Opt-in via:
  mempalace sweep <file.jsonl>
  mempalace sweep <transcript-dir>

No changes to existing miners. No schema migration. Purely additive.

Tests: tests/test_sweeper.py (7 tests covering parsing, tandem
coordination, idempotency, resume-from-cursor, metadata correctness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:06 -03:00
MSL 6f33d52681 Raise convo_miner MAX_FILE_SIZE cap 10 MB → 500 MB
Mirrors the miner.py fix in this same branch. convo_miner.py had the
exact same 10 MB cap at line 58 that silently dropped long transcripts
via continue. Long Claude Code sessions, multi-year ChatGPT exports,
and lifetime Slack dumps all exceed 10 MB. Same silent-drop pattern,
different file.

Raised to 500 MB to match miner.py for consistency; downstream chunking
means source file size does not affect storage or embedding cost.

Tests: tests/test_convo_miner_size_cap.py (1 test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
MSL d137d12313 Raise MAX_FILE_SIZE cap from 10 MB to 500 MB
Long Claude Code sessions routinely produce transcripts larger than 10
MB. The previous cap at miner.py:65 silently dropped them at line 732
with `if filepath.stat().st_size > MAX_FILE_SIZE: continue` — same
silent-failure pattern as the .jsonl extension bug.

The cap exists as a safety rail against pathological binaries, not as
a limit on legitimate text. Downstream chunking at 800 chars per drawer
means source file size does not affect storage or embedding cost.

500 MB leaves headroom for year-long continuous transcripts while still
catching accidental multi-GB binary mines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
MSL 560fdbdc9f Fix silent drop of .jsonl files in project miner
mempalace/miner.py:READABLE_EXTENSIONS contained `.json` but not
`.jsonl`. Every jsonl file encountered in a mined directory was
silently skipped at miner.py:722:

    if filepath.suffix.lower() not in READABLE_EXTENSIONS:
        continue

Claude Code transcripts, ChatGPT exports, and every other tool writing
line-delimited JSON ship as `.jsonl`. Users running `mempalace mine`
against a directory of transcripts saw the command complete with no
error and no log line — and their conversations never reached the
palace. Silent data loss.

Adding `.jsonl` to the whitelist alongside `.json`. jsonl is text
line-by-line; the existing chunking pipeline handles it the same way
it handles any other text file.

Tests: tests/test_miner_jsonl_visibility.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:52:01 -03:00
Igor Lins e Silva a17a8b734a refactor(backends): typed QueryResult/GetResult, PalaceRef, BaseBackend registry (RFC 001 §10)
Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres,
#700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target
to align against.

Scope (this PR):

- Typed QueryResult / GetResult dataclasses replace Chroma's dict shape at
  the BaseCollection boundary (§1.3). A transitional _DictCompatMixin keeps
  existing callers working while the attribute-access migration proceeds.
- BaseCollection is now kwargs-only across add/upsert/query/get/delete/update
  with ABC defaults for estimated_count/close/health and a non-atomic default
  update() (§1.1–1.2).
- PalaceRef replaces raw path strings at the backend boundary (§2.2).
- BaseBackend ABC with get_collection/close_palace/close/health/detect (§2.3).
- mempalace.backends entry-point group + in-tree registry with
  resolve_backend_for_palace priority order matching §3.2–3.3.
- ChromaCollection normalizes chroma returns into typed results; unknown
  where-clause operators raise UnsupportedFilterError (no silent drop, §1.4).
- ChromaBackend absorbs the inode/mtime client-cache freshness check
  previously duplicated in mcp_server._get_client() (§10 + PR #757).
- searcher.py migrated to typed-attribute access as the reference call
  site; remaining callers land in a follow-up.
- pyproject: chroma registered via [project.entry-points."mempalace.backends"].
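
One way a typed result with a transitional dict-compat shim could be structured is sketched below. The class names and fields are illustrative assumptions, not the definitions in mempalace/backends:

```python
from dataclasses import dataclass, field


class DictCompatMixin:
    """Transitional shim so legacy callers can keep using result["ids"]."""

    def __getitem__(self, key):
        try:
            return getattr(self, key)
        except AttributeError:
            raise KeyError(key) from None

    def get(self, key, default=None):
        return getattr(self, key, default)


@dataclass
class GetResult(DictCompatMixin):
    ids: list = field(default_factory=list)
    documents: list = field(default_factory=list)
    metadatas: list = field(default_factory=list)


result = GetResult(ids=["d1"], documents=["hello"], metadatas=[None])
```

New call sites read `result.ids`; unmigrated ones keep `result["ids"]` or `result.get("ids")` until the shim is removed.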

Out of scope (explicit follow-ups):

- Full caller migration off the dict-compat shim across palace.py,
  mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py,
  palace_graph.py, cli.py, closet_llm.py.
- Embedder injection + three-state EmbedderIdentityMismatchError check (§1.5).
- maintenance_state() / run_maintenance() benchmark hooks (§7.3).
- AbstractBackendContractSuite full coverage (§7.1–7.2).
- mempalace migrate / mempalace verify CLI rewrites through BaseCollection (§8).

Tests: 970 passed (up from 967 on develop); new coverage for typed results,
empty-result outer-shape preservation, $regex rejection, registry lookup,
priority resolver, and PalaceRef-kwarg ChromaBackend.get_collection.

Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).
2026-04-18 12:45:16 -03:00
Igor Lins e Silva 55a004fe1e Merge pull request #931 from mvalentsev/fix/i18n-entity-metadata
fix: use i18n candidate patterns for entity extraction in miner and palace
2026-04-16 15:54:01 -03:00
Igor Lins e Silva c5e249bba8 Merge pull request #946 from mvalentsev/fix/utf8-read-text
fix: add explicit UTF-8 encoding to read_text() calls (#776)
2026-04-16 15:52:42 -03:00
Igor Lins e Silva 65f99ad7e6 Merge pull request #928 from arnoldwender/fix/i18n-lang-case-insensitive
fix(i18n): resolve language codes case-insensitively (#927)
2026-04-16 15:44:36 -03:00
mvalentsev 09fe2dda3c fix: add explicit UTF-8 encoding to read_text() calls (#776)
On Windows with non-UTF-8 locale (e.g. GBK), Path.read_text() defaults
to platform encoding, breaking onboarding tests and any source code that
reads JSON/markdown with non-ASCII content.

5 files, 8 call sites fixed.
2026-04-16 16:00:29 +05:00
🍕 88f5b5fa0e Add Indonesian language support
Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.

Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.
2026-04-16 16:15:47 +08:00
mvalentsev 8bf940f861 fix: use i18n candidate patterns for entity extraction in miner and palace
entity_detector.py was refactored in #911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
2026-04-16 10:35:40 +05:00
Igor Lins e Silva f895bc58e6 fix(entity_detector): script-aware word boundaries for combining-mark scripts
Python's \b is a \w/non-\w transition. Devanagari vowel signs (matras)
like ा ी ु are Unicode combining marks (category Mc or Mn), which \w
does not match.
This means \b splits mid-word on every matra: names like अनीता (Anita)
truncate to अनीत, and person-verb patterns like \bराज\s+ने\s+कहा\b
never match because \b fails after the final matra of कहा.

Same issue affects Arabic, Hebrew, Thai, Tamil, and every other script
whose words contain combining marks.

Fix: locales with combining-mark scripts declare a boundary_chars field
in their entity section (e.g. "\\w\\u0900-\\u097F" for Hindi). The i18n
loader replaces every \b in that locale's patterns with a script-aware
lookaround that treats the declared characters as "inside-word", and
pre-wraps candidate/multi_word patterns with the same boundary.

Default behavior (no boundary_chars) keeps standard \b — en, pt-br, ru,
it are unchanged.
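
The truncation and the lookaround fix can be reproduced with the standard `re` module. This sketch hardcodes the Devanagari range from the Hindi example; the pattern construction is an illustration of the idea, not the loader's actual `_script_boundary` / `_expand_b` code:

```python
import re

DEVANAGARI = r"\u0900-\u097F"
name = "\u0905\u0928\u0940\u0924\u093e"  # अनीता (Anita); the last char is a matra

# Naive: \b needs a \w on exactly one side, so it cannot sit after the final matra.
naive = re.compile(rf"\b[{DEVANAGARI}]+\b")

# Script-aware: treat \w plus the declared range as "inside-word" characters.
inside = rf"\w{DEVANAGARI}"
script_aware = re.compile(rf"(?<![{inside}])[{DEVANAGARI}]+(?![{inside}])")

naive_match = naive.search(name).group()
aware_match = script_aware.search(name).group()
```

The naive pattern backtracks off the trailing matra and returns the name minus its last character, while the lookaround version returns the full name.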

Changes:
- mempalace/i18n/__init__.py: add _script_boundary, _expand_b,
  _wrap_candidate, _collect_entity_section; candidate_patterns are now
  returned fully-wrapped (boundary + capture group applied)
- mempalace/entity_detector.py: extract_candidates compiles pre-wrapped
  candidate patterns directly instead of re-wrapping with \b
- tests/test_entity_detector.py: 5 new tests for Devanagari boundaries
  (name extraction with/without boundary_chars, person-verb firing,
  English regression)
2026-04-15 22:18:52 -03:00
Arnold Wender 0174b93d0f fix(i18n): resolve language codes case-insensitively (#927)
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.

The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.

Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.
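
A sketch of the casing resolution, assuming the available locale stems are passed in as a list. The real `_canonical_lang` presumably reads them from the i18n directory, and its behavior for unknown codes is not stated in the commit, so returning `None` here is an assumption:

```python
def canonical_lang(lang, available_stems):
    """Map any casing of a language code to the on-disk filename stem."""
    wanted = lang.lower()
    for stem in available_stems:
        if stem.lower() == wanted:
            return stem
    return None  # assumption: the caller decides how to handle unknown codes


stems = ["en", "pt-br", "zh-CN"]
resolved_upper = canonical_lang("PT-BR", stems)
resolved_lower = canonical_lang("zh-cn", stems)
```

Routing the cache key through the same function means `PT-BR` and `pt-br` share one entry.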

Closes #927
2026-04-15 23:33:42 +02:00
Igor Lins e Silva 312b3b5f0e Merge pull request #758 from mvalentsev/fix/i18n-review-issues
fix: address i18n review issues from PR #718
2026-04-15 13:45:49 -03:00
Igor Lins e Silva c722c91e2a test: document orphan-locale recovery for _temp_locale helper
2026-04-15 08:54:23 -03:00
Igor Lins e Silva b214aced90 refactor(entity_detector): make multi-language extensible via i18n JSON
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots
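
The merge behavior listed for get_entity_patterns (dedupe lists, union stopwords, fall back to English) could be sketched as follows. The section keys and the in-memory `sections` mapping are assumptions; the real helper loads each locale's JSON:

```python
def merge_entity_patterns(langs, sections, fallback="en"):
    """Merge per-language pattern sections into one dict.

    List-valued keys are concatenated in order with duplicates dropped;
    stopwords are unioned into a set. Unknown languages use the fallback.
    """
    merged = {}
    stopwords = set()
    for lang in langs:
        section = sections.get(lang, sections[fallback])
        for key, values in section.items():
            if key == "stopwords":
                stopwords.update(values)
                continue
            bucket = merged.setdefault(key, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    merged["stopwords"] = stopwords
    return merged


sections = {
    "en": {"person_verbs": ["said", "asked"], "stopwords": ["the"]},
    "pt-br": {"person_verbs": ["disse", "said"], "stopwords": ["o", "the"]},
}
merged = merge_entity_patterns(("en", "pt-br"), sections)
unknown = merge_entity_patterns(("xx",), sections)
```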

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
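
The merge behaviour described for get_entity_patterns (dedupe lists, union stopwords, fall back to English) could look roughly like the sketch below. The in-memory `locales` dict, the key names, and the function name are assumptions for illustration; the real helper reads each locale's JSON and also handles string-valued entries such as candidate_pattern, which this sketch leaves out.

```python
def merge_entity_patterns(locales, langs=("en",)):
    """Merge list-valued pattern sections across the requested languages."""
    merged_lists = {}
    stopwords = set()
    for lang in langs:
        # Unknown locales fall back to the English section.
        section = locales.get(lang) or locales["en"]
        for key, values in section.items():
            if key == "stopwords":
                stopwords.update(values)  # union across languages
                continue
            bucket = merged_lists.setdefault(key, [])
            for item in values:
                if item not in bucket:    # dedupe, keep first-seen order
                    bucket.append(item)
    merged = dict(merged_lists)
    merged["stopwords"] = stopwords
    return merged


# Hypothetical locale data, standing in for the JSON entity sections.
locales = {
    "en": {"person_verbs": ["said", "asked"], "stopwords": ["the", "a"]},
    "pt-br": {"person_verbs": ["disse", "said"], "stopwords": ["o", "a"]},
}
both = merge_entity_patterns(locales, ("en", "pt-br"))
```

Only the locales a caller passes in are visited, so the work grows with the number of enabled languages rather than the number shipped.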
2026-04-15 08:52:42 -03:00
Igor Lins e Silva 56b6a6360f Merge pull request #908 from fatkobra/test/palace-graph-tunnels
test: add palace_graph tunnel helper coverage
2026-04-15 08:23:18 -03:00
fatkobra 966937d620 test: add palace_graph tunnel helper coverage
Adds focused tests for explicit tunnel helpers in `mempalace/palace_graph.py`.

Covered:
- `_load_tunnels`
- `_save_tunnels`
- `create_tunnel`
- `list_tunnels`
- `delete_tunnel`
- `follow_tunnels`
2026-04-15 11:38:18 +02:00
Marcio E. Heiderscheidt e61dc2adf8 fix: add provenance header and speaker IDs to Slack transcript imports (#815)
* fix: add provenance header and speaker IDs to Slack transcript imports

Slack exports are multi-party chats where no speaker is inherently
the "user" or "assistant". The parser previously assigned these roles
purely by position, allowing a crafted export to place attacker text
in the "user" role — making it appear as the memory owner's words
in all future retrieval (data poisoning via stored memory).

Changes:
- Add provenance header marking Slack transcripts as multi-party
  with positional (unverified) role assignment
- Prefix each message with the original speaker ID ([U1], [U2], etc.)
  so downstream consumers can distinguish authors
- Keep user/assistant role alternation for exchange-pair chunking
  compatibility with convo_miner.py

Tests:
- Provenance header presence and content
- Speaker ID preservation in output
- Attacker-first-message attribution verification

Refs: MemPalace/mempalace#809

* fix: move Slack provenance to footer, sanitize speaker IDs, extract constant

- Move provenance notice from header to footer to prevent it becoming
  a standalone ChromaDB drawer via paragraph chunking on exports
  with fewer than 3 exchange pairs (violates verbatim-always principle)
- Sanitize speaker user_id/username: strip brackets, newlines, and
  control characters to prevent chunk-boundary injection via crafted
  Slack exports
- Extract header string to _SLACK_PROVENANCE_FOOTER module constant,
  consistent with _TOOL_RESULT_* constants pattern; tests import it
  instead of duplicating the literal

Refs: MemPalace/mempalace#809
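
A minimal sketch of the speaker-ID sanitization described in the second bullet list, assuming a regex-based approach. The function names and the exact character set are illustrative; the parser's real implementation may differ.

```python
import re

# Square brackets plus ASCII control characters (includes \n, \r, \t).
_UNSAFE_SPEAKER_CHARS = re.compile(r"[\[\]\x00-\x1f\x7f]")


def sanitize_speaker_id(raw: str) -> str:
    """Drop characters that could fake a speaker tag or a chunk boundary."""
    return _UNSAFE_SPEAKER_CHARS.sub("", raw).strip()


def format_message(speaker: str, text: str) -> str:
    """Prefix a message with its sanitized speaker ID, e.g. '[U1] hello'."""
    return f"[{sanitize_speaker_id(speaker)}] {text}"


# A crafted ID that tries to close the tag and start a new paragraph.
crafted = "U1]\n\n[U2"
cleaned = sanitize_speaker_id(crafted)
```

With brackets and newlines removed, a crafted username can no longer end its own tag or introduce the blank line that paragraph chunking would split on.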
2026-04-15 00:27:01 -07:00
sha2fiddy a15094ce60 feat: include created_at timestamp in search results (#846)
* feat: include created_at timestamp in search results (closes #465)

Surface the existing filed_at metadata as created_at in search result
objects returned by search_memories(). Enables temporal reasoning over
search hits without additional queries.

* feat: add fallback for missing filed_at metadata
2026-04-15 00:26:57 -07:00
Mikhail Valentsev ecd44f7cb7 fix(hooks): stop precompact hook from blocking compaction (#856, #858) (#863)
* fix(hooks): stop precompact hook from blocking compaction

The precompact hook unconditionally returned {"decision": "block"},
which in Claude Code means "cancel compaction" with no retry mechanism.
This made /compact permanently broken for all plugin users.

Changed hook_precompact() to mine the transcript synchronously (so data
lands before compaction) and return {"decision": "allow"}. This matches
the standalone bash hook in hooks/ which already uses allow.

Also extracted _get_mine_dir() and _mine_sync() helpers so precompact
can mine from the transcript directory, not just MEMPAL_DIR.

Stop hook behavior is unchanged -- left for #673 which implements the
full silent save path.

Closes #856, closes #858.

* fix: use empty JSON instead of invalid "allow" decision value

Claude Code only recognizes "block" as a top-level decision value.
"allow" is a permissionDecision value for PreToolUse hooks, not a
valid top-level decision. The correct way to not block is to return
empty JSON. Caught by #872.
2026-04-15 00:26:54 -07:00
Arnold Wender b226251ddf fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225) (#864)
* fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225)

Fixes #225.

Several transitive dependencies (chromadb, onnxruntime, posthog) print
banners and warnings to stdout — sometimes at the C level — during the
mcp_server import chain. Because the MCP protocol multiplexes JSON-RPC
over stdio, any non-JSON output on stdout corrupted the message stream
and broke Claude Desktop's parser with errors like:

  MCP mempalace: Unexpected token '*', "**********"... is not valid JSON
  MCP mempalace: Unexpected token 'E', "EP Error D"... is not valid JSON
  MCP mempalace: Unexpected token 'F', "Falling ba"... is not valid JSON

Reproduced on Windows 11 with mempalace 3.0.0 / Python 3.10 / Claude
Desktop 1.1062.0.

Fix: at module load, redirect stdout to stderr at both the Python level
(sys.stdout = sys.stderr) and the file-descriptor level (os.dup2(2, 1))
to catch C-level prints, while preserving the real stdout for later
restore. main() calls _restore_stdout() right before entering the
protocol loop so JSON-RPC responses still go to the real stdout.

Adds tests/test_mcp_stdio_protection.py with three regression tests:
- module-level redirect is in place after import
- _restore_stdout() restores the original stdout (idempotent)
- 'python -m mempalace.mcp_server' with empty stdin emits no stdout
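
The two-level redirect described above can be sketched as follows. The function names and module-level variables are assumptions for illustration and do not reproduce the actual mcp_server code.

```python
import os
import sys

_saved_stdout = None      # original Python-level stdout object
_saved_stdout_fd = None   # duplicate of the original file descriptor 1


def protect_stdout():
    """Send both Python-level and C-level stdout writes to stderr."""
    global _saved_stdout, _saved_stdout_fd
    if _saved_stdout_fd is not None:
        return                        # already redirected
    sys.stdout.flush()
    _saved_stdout = sys.stdout
    _saved_stdout_fd = os.dup(1)      # keep a handle on the real stdout
    os.dup2(2, 1)                     # fd 1 now points at stderr
    sys.stdout = sys.stderr


def restore_stdout():
    """Undo protect_stdout(); calling it more than once is harmless."""
    global _saved_stdout, _saved_stdout_fd
    if _saved_stdout_fd is None:
        return
    sys.stderr.flush()
    os.dup2(_saved_stdout_fd, 1)      # fd 1 points at the real stdout again
    os.close(_saved_stdout_fd)
    sys.stdout = _saved_stdout
    _saved_stdout = None
    _saved_stdout_fd = None
```

Replacing sys.stdout alone would miss output written directly to file descriptor 1 by native libraries, which is why the sketch also uses os.dup2.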

* style: reformat with ruff 0.4 (CI version) for #225
2026-04-15 00:26:51 -07:00
Arnold Wender 0aee6f3ed9 fix(init): auto-add per-project files to .gitignore in git repos (#185) (#866)
Partially addresses #185.

`mempalace init <dir>` writes `mempalace.yaml` and `entities.json` into
the project root. When <dir> is a git repository, those files have no
default protection and risk being committed by accident — the loudest
concern in the original report.

This PR adds `_ensure_mempalace_files_gitignored()` which runs at the
end of cmd_init: if <dir>/.git exists, append the two filenames to
.gitignore (creating it if necessary) under a clearly-marked block.

The helper is conservative:
- only runs when <dir>/.git is present (no-op for non-git projects)
- skips entries already present (no duplicates)
- preserves existing .gitignore content
- handles files without trailing newlines

This does NOT relocate the files to ~/.mempalace/wings/<wing>/ as the
issue's 'Expected' section proposes — that's a behavioral change with
miner/config implications and warrants a separate design discussion.
The gitignore safeguard removes the immediate risk without breaking any
existing flow.

Tests: 5 cases in tests/test_init_gitignore_protection.py covering
no-op, fresh creation, partial append, idempotency, and missing-newline
edge case.
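
A rough sketch of a helper with the four conservative properties listed above. The function name, the marker comment, and the return value are assumptions; the real `_ensure_mempalace_files_gitignored()` may be structured differently.

```python
from pathlib import Path

MARKER = "# mempalace per-project files"  # hypothetical block marker


def ensure_gitignored(project_dir, names=("mempalace.yaml", "entities.json")):
    """Append missing names to .gitignore; return the names that were added."""
    root = Path(project_dir)
    if not (root / ".git").exists():
        return []                          # not a git repo: no-op
    gitignore = root / ".gitignore"
    existing = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    present = {line.strip() for line in existing.splitlines()}
    missing = [name for name in names if name not in present]
    if not missing:
        return []                          # nothing to add, file untouched
    # Start on a fresh line if the file lacks a trailing newline.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    block = prefix + MARKER + "\n" + "\n".join(missing) + "\n"
    gitignore.write_text(existing + block, encoding="utf-8")
    return missing
```

Existing content is written back unchanged ahead of the new block, so a second call finds both names present and leaves the file alone.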
2026-04-15 00:26:41 -07:00
Arnold Wender 6a73eb2e20 fix(searcher): guard against empty ChromaDB query results (#195) (#865)
Fixes #195.

When ChromaDB returns no documents (empty palace, or wing/room filter
that excludes everything), it returns the shape:

    {"documents": [], "metadatas": [], "distances": []}

Indexing `results["documents"][0]` blindly raises IndexError instead of
the expected 'no results' response. Affected: searcher.search(),
searcher.search_memories() (drawer + closet branches plus the
total_before_filter aggregate), and Layer3.search() / Layer3.search_raw().

Adds a tiny private helper `searcher._first_or_empty(results, key)` that
safely extracts the inner list, returning [] for any of: missing key,
empty outer list, [None], or [[]]. layers.py imports the same helper to
avoid duplicating the guard.

Tests: tests/test_empty_chromadb_results.py covers all observed shapes
plus a documentation-style test that pins the original IndexError so
future readers understand why the helper exists.
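
A minimal sketch of such a guard, covering the four shapes listed above. It is written from the description only and is not the actual `searcher._first_or_empty` source.

```python
def first_or_empty(results, key):
    """Return the inner list for `key`, or [] when there is nothing to index."""
    outer = results.get(key) or []   # missing key or None -> []
    if not outer:
        return []                    # empty outer list
    return outer[0] or []            # [None] or [[]] -> []


empty_shape = {"documents": [], "metadatas": [], "distances": []}
docs = first_or_empty(empty_shape, "documents")
```

Callers can then iterate over the result unconditionally instead of indexing `results["documents"][0]` directly.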
2026-04-15 00:26:38 -07:00
Mikhail Valentsev 54a386d925 fix: return empty status instead of error on cold-start palace (#830) (#831)
tool_status() called _get_collection() with the default create=False,
which throws when the ChromaDB collection does not exist yet (valid
palace, zero drawers). The exception was swallowed and status returned
"No palace found" even though init had completed successfully.

Switching to create=True bootstraps an empty collection on first
status call, matching what the write path already does.

Fix suggested by @hkevinchu in the issue.
2026-04-15 00:26:35 -07:00
Marcio E. Heiderscheidt f20f45a2da fix: make entity_registry.research() local-only by default (#811)
* fix: make entity_registry.research() local-only by default

research() previously called _wikipedia_lookup() unconditionally,
sending entity names to en.wikipedia.org on every uncached lookup.
This violates the project's local-first and privacy-by-architecture
principles documented in CLAUDE.md.

Changes:
- research() now returns "unknown" for uncached words by default
- New allow_network=True parameter required for Wikipedia lookups
- Wikipedia 404 now returns "unknown" instead of asserting "person"
  with 0.70 confidence, preventing entity registry poisoning
- Added privacy warning docstring to _wikipedia_lookup()
- Added tests for local-only default, opt-in network, 404 handling,
  and cache-not-persisted-on-local-only behaviour

Refs: MemPalace/mempalace#809

* fix: improve research() cache read path and deduplicate test mocks

- Use .get() instead of .setdefault() for cache reads in research()
  so the local-only path never mutates _data unnecessarily
- Move .setdefault() to the network-write path only
- Use result.setdefault() for word/confirmed keys to ensure
  consistent return shape across all _wikipedia_lookup error paths
- Extract duplicated mock_result dict into _MOCK_SAOIRSE_PERSON
  constant shared by 3 test functions
2026-04-15 00:26:24 -07:00
mvalentsev d565718922 fix: address i18n review issues from PR #718
Three issues flagged by bensig on the i18n PR before merge:

1. ko.json: status_drawers used {drawers} instead of {count}, causing
   the Korean UI to show the raw template string instead of the actual
   drawer count.  All other 7 languages use {count}.

2. Test file was shipped inside the package at mempalace/i18n/test_i18n.py
   with a sys.path.insert hack.  Moved to tests/test_i18n.py per the
   project convention in AGENTS.md.

3. Dialect.from_config() passed lang=config.get("lang") which defaults
   to None, causing __init__ to inherit whatever language was loaded
   earlier via module-level state.  Now defaults to "en" explicitly so
   from_config is deterministic regardless of prior load_lang() calls.

Added two regression tests for the ko.json fix and the state leak.
2026-04-15 11:03:28 +05:00
Igor Lins e Silva 107685930d docs+tests: fix CI after README slim (#875)
The regression-guard tests added in #835 were pinned to the old
README shape (tool table + file-reference table). When #897 slimmed
the README and moved that content to the website, three tests
started failing:

  TestReadmeToolsExistInCode.test_every_readme_tool_exists_in_tools_dict
  TestNoUnlistedTools.test_no_undocumented_tools
  TestReadmeDialectNotLossless.test_readme_dialect_line_not_lossless

Changes in this commit:

1. Update the 3 tests to track the new canonical docs surfaces
   - Tool list -> website/reference/mcp-tools.md
     (tests parse `### \`mempalace_xxx\`` headings instead of
     markdown table rows).
   - dialect.py lossless disclaimer -> website/reference/modules.md
     (any line mentioning dialect.py must not also say "lossless").

2. Fix the website to make "no undocumented tools" true
   Add the 10 tools that existed in TOOLS but were missing from
   website/reference/mcp-tools.md (create_tunnel, delete_tunnel,
   follow_tunnels, list_tunnels, get_drawer, list_drawers,
   update_drawer, hook_settings, memories_filed_away, reconnect).
   Page header now correctly says "all 29 MCP tools".

3. Align pre-commit ruff pin to match CI (0.4.x)
   .pre-commit-config.yaml was pinning ruff v0.9.0, while
   .github/workflows/ci.yml installs ruff>=0.4.0,<0.5. The two
   formatters produce incompatible output (e.g. v0.9.0 reformats
   `assert (x), msg` -> `assert x, (msg)` in a way v0.4.x rejects),
   which would cause the pre-commit hook to modify files that CI
   then flags as unformatted. Pinning the hook to v0.4.10 keeps
   the dev loop and CI in lock-step.

Full suite: 887 passed, 0 failed.
2026-04-14 21:59:55 -03:00
MSL 3094c0bd10 fix: add missing self._lock to KnowledgeGraph.close()
TDD: test first, failed, fixed, passed.

Igor fixed query_relationship/timeline/stats in an earlier commit.
close() was the last method touching self._connection without
holding the lock.

Closes #883.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
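
The locking pattern this fix completes can be sketched with a small stand-in class. `GraphStore` and its methods are hypothetical and only mirror the shape described above (a shared connection guarded by self._lock); they are not the KnowledgeGraph implementation.

```python
import sqlite3
import threading


class GraphStore:
    """Hypothetical stand-in for a class sharing one SQLite connection."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        self._connection = sqlite3.connect(path, check_same_thread=False)

    def stats(self):
        # Every method that touches the connection takes the lock first.
        with self._lock:
            if self._connection is None:
                raise RuntimeError("store is closed")
            return self._connection.execute("SELECT 1").fetchone()[0]

    def close(self):
        # close() takes the same lock, so it cannot run mid-query.
        with self._lock:
            if self._connection is not None:
                self._connection.close()
                self._connection = None
```

Without the lock in close(), another thread could be in the middle of a query when the connection is closed underneath it.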
2026-04-14 13:09:10 -07:00
Igor Lins e Silva c9b3245994 Merge pull request #880 from MemPalace/perf-optimize-regex-compilation-15578943484596502942
Optimize regex compilation in entity extraction
2026-04-14 15:10:34 -03:00
Milla J 3ac75d0fdb feat: add MEMPAL_VERBOSE toggle — developers see diaries in chat (#871)
export MEMPAL_VERBOSE=true  → hook blocks, agent writes diary in chat
export MEMPAL_VERBOSE=false → silent background save (default)

Developers need to see code and diaries being written.
Regular users want zero chat clutter. Now both work.

TDD: tests written first, failed, code fixed, tests pass.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:55:56 -07:00
google-labs-jules[bot] 21793cfb48 perf: optimize regex compilation in entity extraction
Move regular expression compilation to the module level in `dialect.py` to prevent repeated parsing during loop execution.

Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
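
A small sketch of the pattern, assuming a capitalized-word regex similar to the ASCII default mentioned elsewhere in this log. The names are illustrative and not taken from `dialect.py`.

```python
import re

# Compiled once at import time instead of inside the loop below.
_CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-z]+\b")


def extract_candidates(lines):
    """Collect capitalized words from each line using the precompiled regex."""
    found = []
    for line in lines:
        found.extend(_CAPITALIZED_WORD.findall(line))
    return found


names = extract_candidates(["Alice met Bob", "no names here"])
```

Python's re module keeps an internal cache of compiled patterns, so the saving is mostly the per-call cache lookup; hoisting the compile also makes the pattern easy to find and reuse.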
2026-04-14 17:43:26 +00:00
Igor Lins e Silva 4741bc0055 Merge pull request #873 from sha2fiddy/feature/455/kg-sanitize-punctuation
fix: use permissive validator for KG entity values
2026-04-14 14:15:33 -03:00
Igor Lins e Silva e1d24d8087 Merge pull request #812 from Kesshite/fix/security-hook-injection
fix: harden hooks against shell injection, path traversal, and arithmetic injection
2026-04-14 14:10:33 -03:00