A transient chromadb exception inside `_get_collection` was swallowed by
the bare `except Exception: return None`, so every subsequent tool
call silently hit the same poisoned cache. The fix wraps the body
in a `for attempt in range(2)` loop: on attempt 0 failure, log via
`logger.exception(...)` and clear `_client_cache` / `_collection_cache`
/ `_metadata_cache` so the next iteration forces `_get_client()` to
rebuild from scratch — that path now re-runs `quarantine_stale_hnsw`
(per #1322), so the second attempt heals the common stale-handle case
automatically. If both attempts fail, return `None` (matches the prior
contract for permanent failures).
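A minimal sketch of the retry-once shape described above, not the actual
`_get_collection` body. `CollectionCache` and its `open_collection` callable
are hypothetical stand-ins for the module-level caches and the
`_get_client()` rebuild path:

```python
import logging

logger = logging.getLogger("mempalace_mcp")


class CollectionCache:
    """Hypothetical stand-in for the module-level caches in mcp_server."""

    def __init__(self, open_collection):
        # open_collection: zero-argument callable that rebuilds the client
        # and returns a collection (assumed, for illustration only).
        self._open_collection = open_collection
        self._collection = None

    def get(self):
        for attempt in range(2):
            try:
                if self._collection is None:
                    self._collection = self._open_collection()
                return self._collection
            except Exception:
                logger.exception("collection open failed (attempt %d)", attempt)
                # Drop cached state so the next iteration rebuilds from scratch.
                self._collection = None
        # Both attempts failed: keep the prior contract and return None.
        return None
```

The loop bound is what rules out an infinite retry: a second failure falls
out of the loop and returns `None`.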
Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:
- `test_get_collection_retries_once_on_exception` — first attempt raises
via a monkeypatched `_get_client`, second attempt succeeds; assert the
caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures` — both attempts
fail, assert we exhaust the loop and return None (no infinite retry).
Surgical extraction from PR #1286, which carried the same fix idea
(plus a fork-sync bundle that couldn't be merged); credit to the
original author below.
Co-authored-by: Jeffrey Hein <jp@jphein.com>
- B904: chain OSError/collection errors with "raise ... from e" in
normalize.py and searcher.py so the original traceback is preserved.
- B007: rename unused loop variables to _name in dedup, dialect, layers,
and room_detector_local.
- S110/S112: replace bare "try/except/pass" and "try/except/continue"
with logger.debug(..., exc_info=True) in mcp_server, searcher,
palace, palace_graph, miner, convo_miner, and fact_checker so
background failures are observable without changing behaviour.
A module-level logger ("mempalace_mcp", matching mcp_server/searcher)
is added to the five files that didn't already have one. Configured
ruff checks (E/F/W/C901) and ruff --select B, S110, S112 all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without ensure_ascii=False, non-ASCII characters (e.g. Chinese) in tool
results and JSON-RPC responses are escaped as \uXXXX, which causes
downstream MCP clients to receive escaped text instead of the original
characters. This affects all platforms, not just Windows.
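The difference is easy to see with the standard library alone. The payload
below is an illustrative example, not taken from the server code:

```python
import json

payload = {"content": "记忆宫殿"}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(payload)

# With ensure_ascii=False the original characters are written as-is.
readable = json.dumps(payload, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same object, so the change only affects how the
text looks on the wire.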
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the MCP client sends a malformed or null top-level request, validate that the request is a dictionary before calling request.get(), which previously raised AttributeError. The server now returns the standard JSON-RPC error -32600 (Invalid Request) instead of crashing.
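A rough sketch of the guard, assuming a hypothetical `handle_request`
entry point; the real dispatcher and its response helpers may be shaped
differently:

```python
def handle_request(request):
    """Return a JSON-RPC error for non-dict requests instead of raising."""
    if not isinstance(request, dict):
        return {
            "jsonrpc": "2.0",
            "id": None,
            "error": {"code": -32600, "message": "Invalid Request"},
        }
    # Real dispatch is elided; echo the method name for illustration.
    return {
        "jsonrpc": "2.0",
        "id": request.get("id"),
        "result": {"method": request.get("method")},
    }
```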
Race scenario: a KG tool handler calls _get_kg() and gets a live
KnowledgeGraph; another thread fires tool_reconnect() between that
return and the handler's kg.add_triple()/kg.query_entity()/etc call.
tool_reconnect drains _kg_by_path and closes the underlying
sqlite3.Connection; the handler then raises sqlite3.ProgrammingError:
'Cannot operate on a closed database', which surfaces as a -32000
to the MCP client even though the user just asked for a reconnect.
New _call_kg(op) helper wraps each handler's kg call in a one-shot
retry: catch exactly sqlite3.ProgrammingError, evict the stale entry
(only if the cache slot still points at the closed instance — another
thread may have already replaced it), and rerun op against a fresh
_get_kg(). After that single retry the helper gives up, so a sustained
stream of closes surfaces as an error instead of looping.
All five KG handlers (tool_kg_query, tool_kg_add, tool_kg_invalidate,
tool_kg_timeline, tool_kg_stats) now route through _call_kg.
Tests pin the contract:
* retries with a fresh KG and returns the second result
* non-ProgrammingError exceptions propagate without retry
* gives up after exactly one retry on sustained close
tool_reconnect cleared ChromaDB caches but left _kg_by_path entries
intact. After an external replacement of knowledge_graph.sqlite3 the
server kept serving the old open sqlite3.Connection, returning stale
results.
Now iterate _kg_by_path under _kg_cache_lock, call close() best-effort,
and clear the dict so the next tool call reopens the KG from disk.
Two new tests in TestKGLazyCache verify cache invalidation and that a
failing close() does not block the clear.
Inline comments referencing #1136 and #540 add no information the
identifiers do not already convey. PR description carries the context;
code stays quiet.
Swap the module-level KnowledgeGraph singleton for a lazy, per-path
cache keyed by the resolved sqlite path. Import no longer creates a
sqlite file as a side effect, and MCP servers started with --palace
now route KG calls to the correct tenant when MEMPALACE_PALACE_PATH
changes between calls, matching the per-call behavior of _get_client()
on the ChromaDB side.
Default-path behavior is preserved: without --palace at startup, KG
stays on DEFAULT_KG_PATH regardless of env var. The "no --palace but
env var set" case is #540's scope and is not changed here.
The repo's anti-jargon meta-test bans §N markers outside the
sources/backends allowlist. mcp_server.py isn't allowlisted, so the
"RFC 002 §5.5" references added in this PR turned the test red.
Trim to "RFC 002" — section number isn't load-bearing for the description.
The MCP `mempalace_get_drawer` tool returned the entire raw drawer
metadata blob to any connected client, and the `source_file` field
in that blob is the absolute filesystem path written by the miners
(`miner.py`, `convo_miner.py` — `source_file = str(filepath)`). On
a single-user local deployment this is self-disclosure, but in
nested-agent or multi-server MCP topologies the client is a separate
trust domain and the host's directory layout has no documented
client-side use.
Mirror the mitigation that `searcher.search_memories()` already applies
on its own return path: reduce `source_file` to its basename via
`Path(source_file).name` before handing the metadata to the client.
Citations still work — the directory layout does not leak.
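A minimal sketch of the basename reduction. `redact_source_file` is a
hypothetical helper name; the PR applies the same idea inside
`tool_get_drawer()`:

```python
from pathlib import Path


def redact_source_file(metadata):
    """Return a copy of metadata with source_file reduced to its basename."""
    redacted = dict(metadata or {})
    source_file = redacted.get("source_file")
    if source_file:
        redacted["source_file"] = Path(source_file).name
    return redacted
```

Working on a copy leaves the stored metadata untouched, so only the response
sent to the client changes.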
Companion to #1 (omit palace_path from tool_status). Same threat class,
different surface:
- mempalace_status — palace dir path → fixed in #1
- mempalace_get_drawer — per-drawer source_file path → this PR
Other read tools were audited and do not leak host paths:
- mempalace_search — already basenames source_file
- mempalace_list_drawers — returns wing/room/preview only
- mempalace_diary_read — date/timestamp/topic/content only
- mempalace_reconnect — success/message/drawers only
- mempalace_kg_* — entity/predicate strings, counts
- mempalace_check_duplicate — wing/room/preview only
Changes:
- mempalace/mcp_server.py: tool_get_drawer() now basenames metadata.source_file
- tests/test_mcp_server.py: regression test asserting the absolute path
and its parent directory do not appear anywhere in the response
- website/reference/mcp-tools.md: clarify the documented return shape
The MCP `mempalace_status` tool was returning the server's absolute
`_config.palace_path` to any connected client on both the main
(ChromaDB-backed) path and the sqlite fallback path that runs when
HNSW divergence is detected (#1222). On a single-user local deployment
this is self-disclosure, but in nested-agent or multi-server MCP
topologies the client is a separate trust domain and the absolute
path has no documented client-side use.
Clients that legitimately need the palace path continue to have three
documented channels: the `MEMPALACE_PALACE_PATH` env var (primary) or
its legacy `MEMPAL_PALACE_PATH` alias, the `~/.mempalace/config.json`
file, and the `--palace` CLI flag on most subcommands.
Also corrects stale docs that claimed `mempalace_reconnect` returned a
`palace_path` field; the code returns `{success, message, drawers,
vector_disabled[, vector_disabled_reason]}` on success, plus a no-palace
shape and an exception shape.
- mempalace/mcp_server.py: drop palace_path from tool_status() and
_tool_status_via_sqlite() result dicts
- website/reference/mcp-tools.md: update documented return shapes for
mempalace_status (fix) and mempalace_reconnect (stale-docs correction)
Authored-by: Aaron Salsitz (ICCI LLC, @icciaaron). Claude Code was used
as an authoring and review-orchestration tool, with human-in-the-loop
oversight at every step: Aaron wrote the prompts, reviewed each draft,
called for three independent review passes (drafting / post-rebase
technical / CISA-aligned disclosure-leak), and verified the final patch
behavior before commit.
`tool_diary_write` stored the `agent` metadata verbatim after `sanitize_name`
(which preserves case), while `tool_diary_read` filtered by exact match —
so writing as "Claude" and reading as "claude" silently returned zero rows.
Both endpoints now lowercase `agent_name` immediately after sanitization.
The default per-agent wing slug is also stable across casings since it's
derived from the same normalized form.
Behavior change: entries written prior to this fix under mixed-case agent
names will not match the new lowercase filter; documented under v3.3.5
in CHANGELOG with a `mempalace repair` pointer.
Adds a regression test (`test_diary_read_case_insensitive_agent`) and
updates the existing `test_diary_write_and_read` to assert the new
lowercase agent identity.
Closes #1243
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`tool_kg_add` previously accepted only `valid_from` and `source_closet`,
silently dropping `valid_to`, `source_file`, and `source_drawer_id` at
the MCP boundary. Backfilling already-ended historical facts therefore
collapsed to "still current," and adapter provenance never reached
the SQLite layer even though `KnowledgeGraph.add_triple` already
supported every column.
`tool_kg_invalidate` returned the literal string `"today"` whenever the
caller omitted `ended`, hiding the actual stamped date from anyone trying
to verify what got persisted.
Changes:
- Extend `tool_kg_add` signature + MCP input_schema with `valid_to`,
`source_file`, `source_drawer_id`; forward all of them to
`_kg.add_triple` and to the WAL log.
- Resolve `ended` to `date.today().isoformat()` in `tool_kg_invalidate`
before logging / returning, so the response always reports the actual
date stored in `valid_to`.
- Add regression tests for valid_to round-trip, source_file /
source_drawer_id provenance, and the resolved-ended-date contract.
- Leave TODO(#1283) markers so the open ISO-8601 validation PR can drop
`validate_iso_date` over `valid_from` / `valid_to` / `ended` cleanly.
The underlying `KnowledgeGraph.add_triple` already accepted these
kwargs (RFC 002 §5.5) — only the MCP edge needed wiring up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Resolve the EF inside the two reopen branches that actually call
`client.get_collection` / `client.create_collection`, so warm-cache
reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers`
on every tool call).
- Reuse `ChromaBackend._resolve_embedding_function()` instead of
duplicating its try/except + log message + None-fallback.
- Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x
persists the EF *identity* (its `name()`) but not the *instance/
configuration* — `mempalace.embedding` documents this and spoofs
`name()` to `"default"` precisely so the identity check passes; the
bug was the *provider list* (lazy ONNX selection) silently differing.
`mcp_server._get_collection` bypassed `ChromaBackend.get_collection`
and called `client.get_collection` / `client.create_collection` without
`embedding_function=`. ChromaDB 1.x persists only the EF identity (its
`name()`), not the instance or its configuration, so the MCP server's
reopen silently bound
chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop
hook ingest path bound `mempalace.embedding.get_embedding_function()`.
On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple
Silicon, per #1299) the default EF's lazy ONNX provider selection could
SIGSEGV the host process on first `col.add()`, killing the MCP stdio
server and leaving every subsequent tool call returning
`Connection closed` until Claude Code was relaunched. Reads worked
because `col.get(ids=...)` and metadata fetches don't invoke the EF;
the auto-ingest path worked because mining routes through the backend
abstraction. Diary writes were the consistent failure surface.
Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`)
and pass it into both reopen branches. Falls back to the chromadb default
only if `mempalace.embedding.get_embedding_function` itself raises.
Regression test patches the chromadb client class to capture
`embedding_function=` on every `get_collection` / `create_collection`
call from `_get_collection(create=True)` and `_get_collection()`, and
fails if any call omits it.
Follow-up to #1262 / #1289 (which fixed the metadata-mismatch SIGSEGV
path); this addresses the EF-mismatch SIGSEGV path on the same surface.
#1262 split `get_or_create_collection` into `get_collection` + fallback
`create_collection` inside `ChromaBackend.get_collection`, fixing the
chromadb 1.5.x Rust-binding SIGSEGV that fires when stored collection
metadata differs from the call-site's `_HNSW_BLOAT_GUARD` payload.
The MCP server's `_get_collection(create=True)` carries the same metadata
payload at `mcp_server.py:287` and routes through chromadb's Python
client directly, bypassing the backend layer. Both `tool_add_drawer`
and `tool_diary_write` reach this site on every invocation, and the
Stop hook fires `mempalace_diary_write` at session end — which was
exactly the crash path #1089 named.
Apply the same try/except split here so legacy palaces whose stored
metadata predates the bloat-guard expansion no longer crash on the
MCP-server reopen path. Regression test patches
`get_or_create_collection` at the chromadb client class level (not the
instance — chromadb's mtime-change detection rebuilds the client between
calls, so an instance-level spy doesn't survive) and asserts the second
`_get_collection(create=True)` call never reaches it.
tool_kg_query (as_of), tool_kg_add (valid_from), and tool_kg_invalidate
(ended) accepted any string and forwarded it to SQLite without format
validation. Parameterized queries prevent SQL injection, but invalid
date strings silently produce empty result sets — callers cannot
distinguish "no fact at this time" from "your date format was
unrecognized." This is especially painful for natural-language LLM
callers that synthesize dates like "March 2026" or "Jan 2025".
Add sanitize_iso_date() in config.py alongside the other input
validators. It accepts YYYY, YYYY-MM, and YYYY-MM-DD forms; passes
through None/empty; and raises ValueError with a field-named message
on anything else. Call it from the three kg MCP tool wrappers before
values reach the storage layer so the caller gets a clear error
instead of a silent miss.
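One possible shape for the validator, as a sketch only. The `field`
parameter and the error wording are assumptions, and this version checks
the format alone, not calendar validity (for example, month 13 would pass):

```python
import re

# YYYY, YYYY-MM, or YYYY-MM-DD
_ISO_DATE_RE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def sanitize_iso_date(value, field="date"):
    """Accept YYYY, YYYY-MM, or YYYY-MM-DD; pass through None and empty."""
    if value is None or value == "":
        return value
    if not isinstance(value, str) or not _ISO_DATE_RE.fullmatch(value):
        raise ValueError(
            f"{field} must be YYYY, YYYY-MM, or YYYY-MM-DD, got {value!r}"
        )
    return value
```

With this in place, a call such as `sanitize_iso_date("March 2026", "as_of")`
raises a ValueError naming the field instead of reaching the storage layer.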
Closes #1164
Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every
collection-creation call site:
* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the
legacy `create_collection()` path. Preserves existing `hnsw:space`,
`hnsw:num_threads=1` (race fix from #976), and `**ef_kwargs`
(embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection`
path used when a palace is first opened by the MCP server. Without this
third site, MCP-bootstrapped palaces would skip the guard and could
still trigger the original bloat.
Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows into
hundreds of GB sparse, after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.
Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. Final palace 376 MB, link_lists.bin stays
at 0 bytes across both Chroma collection dirs, status and search both
return cleanly. Same workload without the fix bloated the palace to
565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.
Migration note: chromadb 1.5.x exposes a
`collection.modify(configuration={"hnsw": {...}})` retrofit path for
already-created collections (`UpdateHNSWConfiguration`), but this PR
doesn't pursue it — by the time link_lists.bin has bloated the index
is already corrupt and the only known recovery is a fresh mine.
Tests assert both keys land on the persisted collection metadata in
both `ChromaBackend` code paths, which also covers the #1161 "config
silently dropped" concern at CI time. A separate smoke test was used
to verify the metadata round-trips through `chromadb.PersistentClient`
reopen on chromadb 1.5.8.
Closes #344
Supersedes #346
Co-authored-by: robot-rocket-science <robot-rocket-science@users.noreply.github.com>
#976 protects `mempalace mine`, but MCP/direct backend writers still call
ChromaCollection.add/upsert/update/delete without the palace lock. This
moves the lock boundary to the Chroma backend seam so all Chroma writes
share the same palace-level serialization, with a re-entrant guard for
miner paths that already hold the lock.
mine_palace_lock(palace_path) gains a per-thread re-entrant guard
(threading.local + pid-tag against fork inheritance) so
ChromaCollection write methods can take the lock without
self-deadlocking when called from inside miner.mine()'s outer hold.
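A simplified sketch of the re-entrant guard. The `acquire` and `release`
callables are hypothetical stand-ins for the real file lock, and the
bookkeeping in `mine_palace_lock` may differ in detail:

```python
import contextlib
import os
import threading

_held = threading.local()


@contextlib.contextmanager
def reentrant_palace_lock(palace_path, acquire, release):
    # The pid is part of the key so that a forked child, which inherits the
    # parent's thread-local state, does not believe it holds the lock.
    key = (os.getpid(), os.path.abspath(palace_path))
    held = getattr(_held, "keys", None)
    if held is None:
        held = _held.keys = set()
    if key in held:
        # Already held by this thread in this process: re-enter, no locking.
        yield
        return
    acquire()
    held.add(key)
    try:
        yield
    finally:
        held.discard(key)
        release()
```

Only the outermost entry touches the underlying lock, which is what lets a
write method run inside an outer hold without deadlocking.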
ChromaCollection.__init__ accepts an optional palace_path; when set,
add/upsert/update/delete wrap their underlying chromadb call with
mine_palace_lock(palace_path). palace_path=None preserves the legacy
no-lock behaviour for direct callers and tests. ChromaBackend's
get_collection/create_collection pass palace_path through;
mcp_server._get_collection forwards _config.palace_path so all MCP
write tools inherit the wrapping.
Tests: 5 new in tests/test_chroma_collection_lock.py covering opt-in,
writer-blocks-during-mine, re-entrant-inside-mine, two-process
serialization, and a source-level read-path-not-locked pin. Plus 1 new
+ 1 rewritten in tests/test_palace_locks.py for the re-entrant
semantics. 52 passed in 1.01s including the existing test_backends.py
regression suite.
Refs #1161.
Five Copilot review issues + the Python 3.9 CI failure rolled into one
follow-up:
* Replace ``dict | None`` annotated assignment with a type-comment so
module load doesn't evaluate PEP 604 syntax on Python 3.9 (CI red).
* Drop ``mempalace repair rebuild`` — the CLI only ships ``mempalace
repair`` (rebuild) and ``mempalace repair-status``. Updated all
user-facing messages, docstrings, and test assertions.
* Replace ``_get_client()`` in ``tool_search`` with the safe
``_refresh_vector_disabled_flag`` probe so the fallback isn't
defeated by the very chromadb client load it's trying to avoid.
* Short-circuit ``tool_status`` to a pure-sqlite reader
(``_tool_status_via_sqlite``) when divergence is detected so wing /
room counts come back without ever opening the persistent client.
* Wrap the recency-window query in ``_bm25_only_via_sqlite`` with an
``id``-ordered fallback so legacy schemas missing ``created_at``
don't break BM25 search.
New test covers the sqlite-status short-circuit. 1409 tests pass.
When chromadb's HNSW segment freezes at a stale max_elements while
sqlite keeps accumulating embeddings, the next chromadb open segfaults
the MCP server on every tool call. Adds a pure-filesystem capacity probe
(zero chromadb interaction), a `mempalace repair-status` read-only
health check, and a BM25-only sqlite fallback so the palace stays
reachable even when vector search is unavailable.
* `hnsw_capacity_status` reads sqlite + index_metadata.pickle directly
via a tight-allowlist unpickler — no hnswlib import, no segment load.
* MCP server runs the probe at startup and after every reconnect; sets
`_vector_disabled` and routes search to the sqlite FTS5 + BM25 path.
* `tool_status` and `tool_reconnect` surface the fallback state.
* Threshold tuned for chromadb 1.5.x async-flush lag (2× sync_threshold).
Addresses remaining PR #976 review items after rebase on develop.
`get_collection(create=False)` previously returned existing collections without
re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the
unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls
`collection.modify(configuration=UpdateCollectionConfiguration(
hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every
`get_collection` call (including the MCP server's `_get_collection`).
In chromadb 1.5.x the runtime config does not persist to disk across
`PersistentClient` reopens, so the retrofit is re-applied each process start
rather than being a one-shot migration. Fresh palaces keep the metadata-based
pin as primary defense; legacy palaces now also get per-session protection
without requiring `mempalace nuke` + re-mine.
After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and
no longer emits `decision: block`, so the attempt-cap constant was orphaned.
Grep confirms 0 usages in the repo — remove it.
- `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1)
- `_pin_hnsw_threads` swallows all errors (never raises)
- `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace
- 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)
Addresses the six Copilot review comments on the initial commit.
1) #6 (critical) — mcp_server.py `_get_collection` bypassed ChromaBackend
The MCP server creates its palace collection directly via
`chromadb.PersistentClient.get_or_create_collection` in `_get_collection`,
not through `ChromaBackend.get_collection`. That path was missing the
`hnsw:num_threads=1` metadata, so the primary crash surface for #974
and #965 was untouched by the original patch. Fixed by passing
`hnsw:num_threads=1` at the mcp_server create site too. Documented
in a code comment that the setting is only honored at creation
time — existing palaces created before this fix still need a
`mempalace nuke` + re-mine to gain the protection.
2) #3 — mine_global_lock over-serialized mines across unrelated palaces
Replaced the single global lock file `mine_global.lock` with a
per-palace lock keyed by `sha256(os.path.abspath(palace_path))`
(`mine_palace_<hash>.lock`). Mines against the same palace still
collapse to a single runner (the correctness boundary), but mines
against *different* palaces are now free to run in parallel.
`mine_global_lock` is kept as a backward-compatible alias for
`mine_palace_lock` so any external callers that imported the
previous name keep working.
3) #1 — hook_precompact swallowed OSError but not subprocess.TimeoutExpired
`subprocess.run(..., timeout=60)` raises `TimeoutExpired` on slow
palaces. The previous `except OSError` clause didn't catch it, so
the hook could raise and fail to emit any JSON decision — leaving
the harness without a block/passthrough signal. Fixed by catching
`(OSError, subprocess.TimeoutExpired)` together and always falling
through to the block decision so the hook reliably emits a response.
4) #2 + #4 — tests
- tests/test_hooks_cli.py: added
`test_precompact_first_two_attempts_block`,
`test_precompact_passes_through_after_cap`, and
`test_precompact_counter_is_per_session` to lock in the #955
deadlock fix.
- tests/test_palace_locks.py (new): covers `mine_palace_lock`
single-acquire, reuse-after-release, cross-process serialization
on the same palace, non-interference across different palaces,
path normalization, and the `mine_global_lock` back-compat alias.
5) #5 — known limitation, documented but not auto-fixed
Copilot suggested detecting collections missing `hnsw:num_threads=1`
and calling `collection.modify(metadata=...)` to retrofit existing
palaces. Verified against chromadb 1.5.7: `modify(metadata=...)`
replaces metadata rather than merging, and re-passing
`hnsw:space="cosine"` then raises `ValueError: Changing the
distance function of a collection once it is created is not
supported currently.` The HNSW runtime configuration
(`configuration_json`) also does not expose `num_threads` in
chromadb 1.5.x, so the flag appears to be read only at creation
time. Rather than paper over the limitation with a best-effort
`modify` that silently drops `hnsw:space`, documented in the
mcp_server comment that pre-existing palaces need a
`mempalace nuke` + re-mine to gain the protection. Fresh palaces
are always protected.
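For item 2 above, a minimal sketch of how a per-palace lock file name can
be derived. `palace_lock_name` is a hypothetical helper, and using the full
hex digest is an assumption; the real code may truncate it:

```python
import hashlib
import os


def palace_lock_name(palace_path):
    """Map a palace path to a stable lock file name."""
    normalized = os.path.abspath(palace_path)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"mine_palace_{digest}.lock"
```

Because the path is normalized first, `palace` and `./palace` map to the
same lock, while unrelated palaces get different lock files.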
Testing
- pytest tests/test_palace_locks.py tests/test_hooks_cli.py
tests/test_backends.py tests/test_cli.py → **98 passed, 0 failed**.
- Runtime validation with two concurrent `mempalace mine` calls:
- Different palaces → both complete in parallel ✓
- Same palace → one completes, the other exits with
"another `mine` is already running against <palace> — exiting
cleanly." ✓
#1097 fixed mempalace_search to treat empty-string wing/room as
no filter, matching how LLM agents default to filling every optional
parameter with ''. The same pattern wasn't applied to diary_read:
passing wing='' defaulted to wing_<agent_name>, siloing away entries
that hooks had written to project-derived wings per #659.
When wing is empty/omitted, filter only on agent + room=diary so
callers get a unified view of the agent's journal across every wing
it has written to. Explicit wing=<name> continues to scope reads
to that wing only.
Adds a test covering an empty-wing read after writing to both the default
and a non-default wing.
* fix: add wing param to diary_write/diary_read, derive from transcript path
Without a wing override, all diary entries from the stop hook land in
wing_session-hook regardless of which project the session is in, making
per-project diary search impossible.
- tool_diary_write(): add optional `wing` param; sanitize and use it when
provided, fall back to wing_{agent_name} when omitted
- tool_diary_read(): add optional `wing` param for filtering by target wing
- TOOLS dict: expose `wing` in input_schema for both diary tools
- hooks_cli: add _wing_from_transcript_path() helper that extracts the
project name from Claude Code paths like
~/.claude/projects/-home-jp-Projects-kiyo-xhci-fix/... → kiyo-xhci-fix
- hook_stop: derive project wing and append wing= hint to block reason so
Claude writes diary entries to the correct per-project wing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: sanitize wing param, cross-platform paths, tighten test assertions
Addresses Copilot review feedback on #659.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: wing_ prefix + agent filter on diary_read
Addresses bensig's 2-issue review on this PR.
1. _wing_from_transcript_path() was returning bare project names
(e.g. "myproject") while all existing wings follow the wing_*
convention from AAAK_SPEC. Entries landed in wing="myproject"
while diary_read defaulted to wing="wing_<agent_name>" —
orphaning every diary entry written by the stop hook. Now
returns "wing_<project>" and falls back to "wing_sessions".
2. tool_diary_read() did not include agent_name in the ChromaDB
where filter when a custom wing was provided — any caller with
a shared wing could read entries written by other agents.
Add {"agent": agent_name} to the $and clause. Also flagged by
Qudo and left unresolved until now.
Tests updated to expect the wing_ prefix (6 tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The list_drawers response only included count (current page size) with
no total field, making it impossible for callers to know when pagination
is exhausted. A page returning count == limit is ambiguous — it could
be the last exact-fit page or there could be more results.
Add a total field that reports the full number of matching drawers.
For unfiltered requests this uses col.count(); for filtered requests
(wing/room) it uses a lightweight col.get(include=[]) to count
matching IDs without fetching documents.
On Windows, Python defaults sys.stdin/sys.stdout to the system codepage
(e.g. cp1251 on Russian locales, cp1252 on Western European), while MCP
JSON-RPC is always UTF-8. Non-ASCII payloads (Cyrillic, CJK, accented
European) get mis-decoded before reaching handlers, causing json.loads
to fail or tool handlers to receive garbled strings. Both surface to
the client as a generic MCP error -32000.
Reproduction:
1. On Windows with a non-Latin locale, call mempalace_add_drawer or
mempalace_kg_add with Cyrillic/CJK in content or KG object.
2. Server returns: MCP error -32000: Internal tool error.
3. Calling the handler directly from Python works fine -- the bug is
purely in the stdio transport.
Fix:
Reconfigure stdin/stdout to UTF-8 at the start of main(), after
_restore_stdout(). Uses errors="replace" defensively so a lone bad
byte cannot take down the server. Guarded by hasattr(stream, "reconfigure")
for exotic stream replacements that lack the method.
This matches the behaviour of PYTHONUTF8=1 / python -X utf8 without
requiring users to set an env var.
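A small sketch of the reconfigure step. `force_utf8` is a hypothetical
helper name; the fix itself applies the same call to sys.stdin and
sys.stdout inside main():

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    if hasattr(stream, "reconfigure"):
        # errors="replace" keeps a lone bad byte from raising.
        stream.reconfigure(encoding="utf-8", errors="replace")
    return stream


if __name__ == "__main__":
    force_utf8(sys.stdin)
    force_utf8(sys.stdout)
```

Streams without `reconfigure` (for example an `io.StringIO` swapped in by a
test harness) are returned unchanged.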
The MCP server config used `python -m mempalace.mcp_server` which fails
when mempalace is installed via pipx or uv, since the system python
cannot find the module in the isolated venv. Adding a `mempalace-mcp`
console_scripts entry point ensures the MCP server works regardless of
installation method (pip, pipx, uv, conda).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chroma 1.5.x can return ``None`` inside the ``metadatas`` / ``documents``
lists of a query/get result for partially-flushed rows. The codebase
already has a systemic None-guard pattern (merged #999, #1013, #1019)
but three call sites were still unguarded:
* ``mcp_server.tool_check_duplicate`` (``mcp_server.py:487-488``) —
``meta = results["metadatas"][0][i]`` followed by ``meta.get(...)``
raises ``AttributeError: 'NoneType' object has no attribute 'get'``.
The broad ``except Exception`` wrapper (line 504) swallows it and
returns an uninformative ``"Duplicate check failed"``.
* ``layers.Layer1.generate`` (``layers.py:126``) — iterates
``zip(docs, metas)`` and calls ``meta.get(key)`` in the importance
loop. A single None metadata blows up the entire wake-up render.
* ``layers.Layer2.retrieve`` (``layers.py:224``) — same pattern, same
crash path for the on-demand render.
Apply the same ``meta = meta or {}`` / ``doc = doc or ""`` idiom used
by the merged guards in the search path. Three-line additions, no
behaviour change on well-formed results.
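The idiom in isolation, as a sketch. `summarize_matches` is a hypothetical
function and the result dict mimics the nested-list shape of a Chroma query
result:

```python
def summarize_matches(results):
    """Render query rows, tolerating None metadata and None documents."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    rows = []
    for doc, meta in zip(docs, metas):
        meta = meta or {}  # None metadata becomes an empty dict
        doc = doc or ""    # None document becomes an empty string
        rows.append(
            {
                "wing": meta.get("wing", "?"),
                "room": meta.get("room", "?"),
                "content": doc,
            }
        )
    return rows
```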
Tests added:
* ``test_check_duplicate_handles_none_metadata`` — mocks the collection
query to return ``None`` for one metadata and document, asserts the
call does not crash and the sentinel-rendered entry has wing/room "?"
and empty content.
* ``test_layer1_handles_none_metadata`` / ``_handles_none_document``
* ``test_layer2_handles_none_metadata``
Relationship to other open PRs:
* **#1019** guarded ``searcher.py`` loops. This PR extends the same
guard to the three call sites #1019 did not touch.
* **#979** fixed ``tool_check_duplicate`` negative similarity but left
the None-metadata path unguarded.
* Does not overlap **#1013** (``Layer3.search_raw``) or **#999**.
Four more MCP handlers iterate a metadata list and call m.get(...)
unconditionally. When the cache contains a None entry (drawers with no
metadata, common on older mining paths), the try block catches the
AttributeError and marks the response "partial: true" with an
error message — visible as {"error": "'NoneType' object has no
attribute 'get'", "partial": true} returned from mempalace_status even
though the palace data is otherwise fetchable.
Same m = m or {} guard we applied to searcher.py (d3a2d22, a51c3c2)
and miner.status() (66f08a1). None-metadata drawers now roll up under
the existing "unknown" fallback bucket instead of poisoning the
response with a misleading partial flag.
Regression test: mock the metadata cache with a None in the middle,
assert tool_status returns clean counts and no error/partial fields.
Verified the test fails without the guard.
998 tests pass.
agent_name and entry are validated via sanitize_name/sanitize_content,
but topic is stored raw into ChromaDB metadata. Apply the same
sanitize_name guard to reject null bytes, path traversal, and
oversized payloads.
* fix: restrict file permissions on sensitive palace data
On Linux with default umask (022), several files and directories
containing personal data were created world-readable. This patch
applies chmod 0o700 to directories and 0o600 to files immediately
after creation, wrapped in try/except for Windows compatibility.
Files hardened:
- hooks_cli.py: hook_state/ directory and hook.log
- entity_registry.py: entity_registry.json (names, relationships)
- knowledge_graph.py: knowledge_graph.sqlite3 parent directory
- exporter.py: export output directory and wing subdirectories
- config.py: people_map.json (name mappings)
- mcp_server.py: WAL file creation uses atomic os.open (TOCTOU fix)
Refs: MemPalace/mempalace#809
* fix: avoid redundant chmod calls on hot paths
- hooks_cli.py: chmod STATE_DIR and hook.log only on first creation,
not on every _log() call (hooks fire on every Stop event)
- exporter.py: track created wing dirs to skip redundant makedirs +
chmod on the same directory across batches
- mcp_server.py: remove redundant _WAL_FILE.chmod after os.open
already set mode=0o600 atomically
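A sketch of creating a file with restrictive permissions in a single call,
so there is no window between creation and chmod. `open_private_append` is
a hypothetical helper and the flag set is an assumption about how the WAL
file is opened:

```python
import os


def open_private_append(path):
    """Open path for appending, creating it with mode 0o600 if missing."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    # The mode applies only when the file is created; on POSIX the process
    # umask can clear bits but never add them.
    fd = os.open(path, flags, 0o600)
    return os.fdopen(fd, "a", encoding="utf-8")
```

On Windows the mode argument has little effect, which matches the note
above about wrapping permission changes for Windows compatibility.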
Refs: MemPalace/mempalace#809
* fix(mcp): redirect stdout to stderr during import to protect JSON-RPC channel (#225)
Fixes #225.
Several transitive dependencies (chromadb, onnxruntime, posthog) print
banners and warnings to stdout — sometimes at the C level — during the
mcp_server import chain. Because the MCP protocol multiplexes JSON-RPC
over stdio, any non-JSON output on stdout corrupted the message stream
and broke Claude Desktop's parser with errors like:
MCP mempalace: Unexpected token '*', "**********"... is not valid JSON
MCP mempalace: Unexpected token 'E', "EP Error D"... is not valid JSON
MCP mempalace: Unexpected token 'F', "Falling ba"... is not valid JSON
Reproduced on Windows 11 with mempalace 3.0.0 / Python 3.10 / Claude
Desktop 1.1062.0.
Fix: at module load, redirect stdout to stderr at both the Python level
(sys.stdout = sys.stderr) and the file-descriptor level (os.dup2(2, 1))
to catch C-level prints, while preserving the real stdout for later
restore. main() calls _restore_stdout() right before entering the
protocol loop so JSON-RPC responses still go to the real stdout.
Adds tests/test_mcp_stdio_protection.py with three regression tests:
- module-level redirect is in place after import
- _restore_stdout() restores the original stdout (idempotent)
- 'python -m mempalace.mcp_server' with empty stdin emits no stdout
* style: reformat with ruff 0.4 (CI version) for #225