test_legacy_state_backfills_content_hash failed on test-windows because
Path.write_text without an encoding uses the system locale (cp1252 on
Windows). The em dash was written as 0x97, then read back by
diary_ingest as UTF-8 with errors=replace — round-trip produced
different bytes than the in-Python literal, so the assertion comparing
the persisted hash to sha256(text.encode(utf-8)) diverged.
Pin the write side to encoding=utf-8 so the on-disk bytes match what
diary_ingest decodes. No production change.
Address Copilot review on #925:
- Full closet rebuild whenever the content hash differs from prior
state, not only on entry-count growth. Without this, an in-place
edit (same entry count, different body) updated the drawer but
left the closet/search index stale — defeats the verbatim guarantee
at the search layer even if the drawer is correct.
- Legacy size-only skip path now records the computed content_hash
back into state so subsequent runs use the strict hash check
instead of remaining on the size-only path indefinitely.
- Test updates: typo direction in the regression test now matches the
comment (typo "Teh" → fix "The"), assertion now also checks the
closet collection reflects the edit, and a new test exercises the
legacy-state backfill path.
The skip-if-unchanged check compared byte length only, so any in-place
edit preserving total length (typo fix "teh"→"the", word swap) was
silently dropped — a verbatim-storage violation: the user's actual
words never reached the palace.
Switch the gate to sha256(text). State entries gain a "content_hash"
field; the legacy size-only path is preserved when prev_hash is missing
so a post-upgrade run does not re-ingest every untouched diary.
Closes#925
Address Copilot review on #1403: the test seeked unconditionally to
offset 40960 with only `pre_size > 16384` as a guard. If pre_size sat
between 16384 and 40960 + 16384 = 57344 (e.g., on a chromadb version
that allocated fewer pages on init, or a future schema change), the
seek would extend the file with zero-padding and the original pages
would stay intact — quick_check would still pass on the (untouched)
real data, and the regression guard would silently skip detecting a
preflight-ordering regression.
Compute the offset from pre_size, page-aligned, with explicit asserts
that the file is large enough to mangle 4 pages without truncating
the header or extending past EOF.
#1364 added the SQLite quick_check preflight to rebuild_index, but
placed it AFTER backend.get_collection(...). On a SQLite-corrupt
palace, chromadb's rust binding raises pyo3_runtime.PanicException —
which is not a regular Exception subclass — so it propagates past the
existing `except Exception` handlers and the user sees a 30-line stack
trace instead of the friendly abort message #1364 was designed to
deliver. Reproduced with `mempalace repair --yes` against a palace
whose chroma.sqlite3 has 4 mangled pages: pre-fix, panic; post-fix,
the clean abort message and exit code 1.
Two changes:
- mempalace/cli.py cmd_repair: run sqlite_integrity_errors() right
after the basic palace-existence check, BEFORE the max_seq_id
preflight (which itself opens sqlite3) and BEFORE backend =
ChromaBackend(). Exit non-zero so unattended scripts and CI gates
see the failure.
- mempalace/repair.py rebuild_index: same move at the function level
for direct callers (tests, MCP) that bypass cmd_repair.
The new test test_rebuild_index_runs_sqlite_preflight_before_chromadb_open
uses a real chromadb-built palace (no ChromaBackend mock) plus a
real corrupt SQLite (16 KB of mangled pages) so the ordering is
exercised end-to-end. The previously-shipping test for the abort path
mocked both the backend and sqlite_integrity_errors, which is why the
ordering bug shipped CI-green.
Six existing test_cli.py cmd_repair tests used `(palace_dir /
"chroma.sqlite3").write_text("db")` to fake the SQLite file. The new
preflight correctly fails quick_check on those 2-byte stubs, so the
tests now create empty real SQLite DBs the same way the test_repair.py
fixtures already do.
#1357 (max_seq_id preflight) merged into develop while this branch
was in CI, opening a fresh conflict between the two preflight helpers.
mempalace/repair.py:
- Kept both: this branch's sqlite_integrity_errors() / print_sqlite_
integrity_abort() AND develop's maybe_repair_poisoned_max_seq_id_
before_rebuild() from #1357. They check for distinct corruption
classes and run as separate preflights.
tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors test group and
develop's max_seq_id preflight test group; non-overlapping coverage.
Local: 1623 tests pass, ruff lint+format clean against 0.4.x CI pin.
Conflicts opened by #1285 (temp-staging rebuild) and #1312
(collection_name in recovery paths) merging after this branch was
authored.
mempalace/repair.py:
- Kept this branch's sqlite_integrity_errors() and
print_sqlite_integrity_abort() helpers; took develop's rebuild_index
signature with the collection_name parameter from #1312. Normalized
the helper's print indent to 2 spaces to match the rest of the file.
tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors tests and develop's
rebuild_from_sqlite + configured-collection coverage.
- Replaced 7 sites of sqlite_path.write_text("fake") with
sqlite3.connect(...).close() — write_text("fake") fails PRAGMA
quick_check, so the new preflight aborts before the rebuild logic
the tests actually exercise. An empty real SQLite DB passes
quick_check and lets the tests run as intended.
- Took develop's temp-staging assertion shape (delete/create the
__repair_tmp collection in addition to the live drawers collection)
for the existing test_rebuild_index_success test.
Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
Two conflicts, both because #1339 (bloated link payloads) merged into
develop after this branch was authored:
- mempalace/backends/chroma.py: _segment_appears_healthy now stacks
three checks — bloated-link from #1339 (top), missing-metadata-with-
data-floor from this branch (middle), pickle format sniff (bottom).
All three are complementary; #1339 catches structural payload
corruption, this branch catches pickle truncation, the original
catches pickle protocol-byte corruption.
- tests/test_backends.py: kept both new imports (_segment_appears_healthy
from this branch, quarantine_invalid_hnsw_metadata from #1285).
Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
Three conflicts, all from develop landing #1285/#1310/#1312 after this
branch was authored:
- mempalace/cli.py: keep both import sets — this branch's
maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's
RebuildCollectionError / _close_chroma_handles / _extract_drawers /
_rebuild_collection_via_temp added in #1285.
- mempalace/repair.py: keep this branch's
maybe_repair_poisoned_max_seq_id_before_rebuild definition; use
develop's rebuild_index signature with the collection_name parameter
added in #1312. Normalized print indent to 2 spaces matching the
rest of the file.
- tests/test_repair.py: keep both this branch's max_seq_id preflight
tests and develop's rebuild_from_sqlite + configured-collection-name
tests; they exercise distinct code paths and don't overlap.
Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
The collection_name plumbing rebase produced a few unformatted blocks
in test_mcp_server.py and test_searcher.py; bringing them in line with
the 0.4.x CI pin so test-windows / lint stay green.
Three small changes that together address the failure modes in #1296:
1. Add pnpm-lock.yaml and yarn.lock to SKIP_FILENAMES, mirroring the
existing package-lock.json rule. A 24K-line pnpm-lock.yaml produced
~1124 chunks in one batch and tripped onnxruntime bad_alloc on
Windows; pnpm/yarn lockfiles are no more useful to mine than npm's.
2. Skip any file that produces more than MAX_CHUNKS_PER_FILE (500)
chunks, with a clear log line. Catches the broader class — generated
CSV/JSON, build artifacts, etc. — that the named-file SKIP list will
never fully cover. The cap is conservative (500 chunks * 800 chars ≈
400 KB of source) so legitimate hand-written content still mines.
3. Print a partial-progress summary on any exception in _mine_impl, not
just KeyboardInterrupt, then re-raise. Without this, an arbitrary
exception (ONNX bad_alloc, chromadb HNSW error, OS fault) propagates
silently — the operator sees only the last progress line and assumes
the mine succeeded. The new path mirrors the KeyboardInterrupt
summary (files_processed, drawers_filed, last_file) plus the
exception type and message, then re-raises so the original traceback
surfaces and the exit code is non-zero.
Tests cover: SKIP_FILENAMES contents, the chunk-cap path returning
(0, room) with no upserts, and the new mine-aborted summary surfacing
both the partial counters and the exception class.
Develop (post-#1162 lock-plumbing era) refactored the per-open quarantine
pass into ChromaBackend._prepare_palace_for_open. This branch's
inline-expansion form added quarantine_invalid_hnsw_metadata as a third
check, plus a "discard from _quarantined_paths on inode swap" guard so
re-opens against a different physical DB re-run quarantine.
Resolution merges both:
- _prepare_palace_for_open now also calls quarantine_invalid_hnsw_metadata,
gated by the same _quarantined_paths set.
- _client keeps the inode_changed -> _quarantined_paths.discard() guard
before calling the helper, so a fresh inode triggers a fresh pass.
- make_client collapses to a single _prepare_palace_for_open() call.
- test_backends.py keeps both the pickle (#1285) and shutil (develop)
imports — both are used.
The helper opened a chromadb PersistentClient via ChromaBackend and never
closed it, leaving rust-side SQLite/HNSW file locks alive after the
helper returned. On Windows that blocks the in-place archive rename
inside rebuild_from_sqlite with WinError 32 on data_level0.bin,
causing test_rebuild_from_sqlite_in_place_archives_when_opted_in and
test_rebuild_from_sqlite_raises_on_upsert_failure to fail in the
test-windows CI job. No test consumes the returned collection, so
closing the backend in a try/finally is safe and drops the return.
A transient chromadb exception inside `_get_collection` was swallowed by
the bare `except Exception: return None`, leaving every subsequent tool
call hitting the same poisoned cache silently. The fix wraps the body
in a `for attempt in range(2)` loop: on attempt 0 failure, log via
`logger.exception(...)` and clear `_client_cache` / `_collection_cache`
/ `_metadata_cache` so the next iteration forces `_get_client()` to
rebuild from scratch — that path now re-runs `quarantine_stale_hnsw`
(per #1322), so the second attempt heals the common stale-handle case
automatically. If both attempts fail, return `None` (matches the prior
contract for permanent failures).
Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:
- `test_get_collection_retries_once_on_exception` — first attempt raises
via a monkeypatched `_get_client`, second attempt succeeds; assert the
caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures` — both attempts
fail, assert we exhaust the loop and return None (no infinite retry).
Surgical extraction from PR #1286, which carried the same fix idea
(plus a fork-sync bundle that couldn't be merged); credit to the
original author below.
Co-authored-by: Jeffrey Hein <jp@jphein.com>
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on #1310:
- repair.py: drawers collection name now resolves from
MempalaceConfig().collection_name via _drawers_collection_name() (closets
stays fixed by design — AAAK index references drawer IDs by string).
Lines up with the broader configured-collection work in #1312 so that
PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
_rebuild_one_collection so a Chroma "Collection already exists" failure
surfaces as RebuildPartialError with archive_path, not an unstructured
exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
with backend.close() so PersistentClient handles to dest_palace are
released on every exit path. The from-sqlite path post-dates #1285's
lifecycle hardening of the legacy rebuild, so this needed its own
cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
rebuild_from_sqlite returns {} (validation refusal sentinel), so
unattended scripts/CI distinguish "invalid inputs" from a successful
rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
now asserts every backing segment is scope='METADATA', locking in the
segment-layout assumption against future regressions that point the
JOIN at the VECTOR segment.
New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `--mode legacy` and the inline `cli.cmd_repair` rebuild path
call `Collection.count()` as their first read — the same call that
raises `chromadb.errors.InternalError: Failed to apply logs to the
hnsw segment writer` on the corruption class reported in #1308.
Repair would print "Cannot recover — palace may need to be re-mined
from source files" even though the underlying SQLite tables were
fully intact.
The new `--mode from-sqlite` reads `(id, document, metadata)` rows
directly from `chroma.sqlite3` via `segments` → `embeddings` →
`embedding_metadata` joins, never opens a chromadb client against
the corrupt palace, and re-upserts everything into a fresh palace.
- `--source PATH` extracts from a corrupt palace already moved aside
- `--archive-existing` handles the in-place case by renaming the
existing palace to `<palace>.pre-rebuild-<timestamp>` first
- Partial-rebuild failures raise `RebuildPartialError` with the
archive path so users can recover; CLI exits non-zero
- In-place mode calls `SharedSystemClient.clear_system_cache()` to
drop chromadb's process-wide System registry (cross-palace use
does not, to limit blast radius for library callers)
- Source validation runs before any destructive moves
Verified end-to-end recovering a 52,300-row real-world corrupt
palace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``search_memories`` computes ``effective_dist = dist - boost`` where
``boost`` can be as large as ``CLOSET_RANK_BOOSTS[0] == 0.40`` for a
rank-0 closet hit. When the raw drawer distance is small — any
near-exact match — the subtraction goes negative.
Two downstream effects:
1. Line 418 returns ``round(max(0.0, 1 - effective_dist), 3)`` as
``similarity``. With ``effective_dist = -0.30`` that yields
``similarity = 1.30``, outside the documented ``[0, 1]`` range.
The ``max(0.0, ...)`` only prevents negative similarities; it does
not cap above 1.
2. Line 427 stores ``_sort_key: effective_dist`` and line 435 sorts
``scored`` ascending by that key. A negative key drops *below* the
rest, so the strongest hybrid matches end up sorting after weaker
ones — ranking inversion under the exact conditions hybrid retrieval
is supposed to serve best.
Clamp ``effective_dist`` to the valid cosine-distance range ``[0, 2]``.
The boost still wins (closet-backed hit still ranks first), it just no
longer flips the order.
Test added: mock drawer_col (base dist 0.08 / 0.35 for two sources) +
closet_col (rank-0 closet for the 0.08 source) → assert all hits have
``0 <= similarity <= 1`` and ``0 <= effective_distance <= 2``, and that
the closet-boosted source still ranks first.
Relationship to other PRs:
* **#988** clamps the output ``similarity`` alone. That does not fix
the sort-key inversion or the invalid ``effective_distance`` in the
returned dict. This PR clamps at the arithmetic source so both
downstream users of the value stay in range.
* Orthogonal to **#979** (``tool_check_duplicate`` negative similarity).
The alias was placed below an explanatory comment block introduced by
#1305, which trips ruff E402 (module-level import not at top of file).
Moved next to the existing 'from mempalace.hooks_cli import (...)' line.
CI lint went red on develop after #1305 merged with the failing check;
this re-greens it so subsequent PRs do not inherit the failure.
EntityRegistry.save() called Path.write_text() directly, which truncates
the target file and then writes — so a crash mid-write (power loss, OOM,
filesystem-full mid-flush) leaves an empty or half-written
entity_registry.json. The whole people/projects map is lost; the system
falls back to an empty registry on next load.
Switch to the standard atomic-write pattern: serialize to a sibling
.tmp file in the same directory (so os.replace stays on one filesystem),
fsync, chmod 0o600, then os.replace over the target. The replace is
atomic on POSIX and Windows, so any crash leaves the previous registry
intact instead of a truncated file.
Tests cover: no leftover .tmp on success, and previous content preserved
when os.replace itself raises mid-save.
Race scenario: a KG tool handler calls _get_kg() and gets a live
KnowledgeGraph; another thread fires tool_reconnect() between that
return and the handler's kg.add_triple()/kg.query_entity()/etc call.
tool_reconnect drains _kg_by_path and closes the underlying
sqlite3.Connection; the handler then raises sqlite3.ProgrammingError:
'Cannot operate on a closed database', which surfaces as a -32000
to the MCP client even though the user just asked for a reconnect.
New _call_kg(op) helper wraps each handler's kg call in a one-shot
retry: catch exactly sqlite3.ProgrammingError, evict the stale entry
(only if the cache slot still points at the closed instance — another
thread may have already replaced it), and rerun op against a fresh
_get_kg(). Beyond one retry give up so a sustained close-stream
surfaces clearly instead of looping.
All five KG handlers (tool_kg_query, tool_kg_add, tool_kg_invalidate,
tool_kg_timeline, tool_kg_stats) now route through _call_kg.
Tests pin the contract:
* retries with a fresh KG and returns the second result
* non-ProgrammingError exceptions propagate without retry
* gives up after exactly one retry on sustained close
Previously all three streams reconfigured to UTF-8 with errors='strict'.
That kills 'mempalace search' the moment a drawer carrying a surrogate
half (round-tripped from a filename via surrogateescape) hits print(),
losing the rest of the result block. Same hazard for warning lines on
stderr.
Split the policy:
stdin -> surrogateescape (malformed bytes from a redirected file
survive as lone surrogates instead of crashing the read)
stdout -> replace (drawer text with a stray surrogate becomes U+FFFD
instead of UnicodeEncodeError mid-print)
stderr -> replace (same protection for logger / warning paths)
Applied identically in the cli.py and fact_checker.py helpers; the DRY
extraction into a shared module is a separate cleanup ask, kept out of
this fix to keep the diff narrow.
Tests updated for the new per-stream assertion.
The primary `mempalace` console_script (`cli.py:main()`) reads non-ASCII
arguments via piped stdin and writes verbatim drawer text / wing names
through `print()`. On Windows, Python defaults stdio to the system ANSI
codepage (cp1252/cp1251/cp950), so:
- `mempalace search "..." > out.txt` mojibakes any drawer text containing
non-Latin characters
- `mempalace ... < input.txt` mojibakes piped non-ASCII input
Reconfigure stdin/stdout/stderr to UTF-8 (`errors="strict"`) at the top
of `main()`, mirroring the helper added in this PR for fact_checker's
`__main__` block. Wrapped in try/except so a replaced stream (Jupyter,
test harness) logs a warning and continues rather than crashing the CLI.
The reconfigure cascades through every `mempalace` subcommand
(`init`/`mine`/`search`/`status`/`hook`/etc.) and through the interactive
flows that read non-ASCII names via `input()` (onboarding, entity
detector, room detector). With this commit the package's three
user-facing entry points (`mempalace`, `mempalace-mcp`, and
`python -m mempalace.fact_checker`) all reconfigure stdio identically on
Windows.