The narrowed _fix_blob_seq_ids returned early when safe_rows was empty,
but #1177's marker contract requires the marker to be written on every
successful pass — even no-op — so subsequent opens skip the sqlite3
connection entirely. Without this, palaces that have no genuine 0.6.x
BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every
call, defeating the #1090 corruption guard.
Restructured the conditional so the marker write is unconditional after
a successful sqlite scan, regardless of whether any rows were updated.
Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer
during the develop-rebase of this PR. The author's branch predates the
marker contract from #1177 (merged 2026-04-26), so this is a rebase-edge
fix-up rather than a logic change to their narrowing behaviour.
The BLOB-seq_id migration shim (PR #664) ran int.from_bytes(..., 'big')
over every BLOB in max_seq_id, including chromadb 1.5.x's own native
format (b'\x11\x11' + 6 ASCII digits). That conversion yields a ~1.23e18
integer that silently suppresses every subsequent embeddings_queue write
for the affected segment (queue filter is seq_id > start), causing
silent drawer-write drops after a 1.5.x upgrade.
Two-part fix:
1. Shim narrowing (mempalace/backends/chroma.py)
- Drop max_seq_id from the shim loop. chromadb owns that column's
format; we don't reinterpret it.
- Defense-in-depth: skip rows in embeddings whose seq_id BLOB has the
sysdb-10 b'\x11\x11' prefix rather than misconvert.
2. Recovery command (mempalace/repair.py, mempalace/cli.py)
- mempalace repair --mode max-seq-id [--segment <uuid>]
[--from-sidecar <path>] [--dry-run] [--yes] [--no-backup]
- Detects poisoned rows via threshold (seq_id > 2**53).
- Default heuristic: MAX(embeddings.seq_id) over the collection owning
the poisoned segment. Matches METADATA max exactly; VECTOR segments
get a few seq_ids ahead (queue skips an already-indexed window — an
acceptable loss vs. resetting to 0 and re-processing everything).
- --from-sidecar copies clean values from a pre-corruption sqlite db.
- Backs up chroma.sqlite3, closes chroma handles, atomic UPDATEs,
post-repair verification that raises MaxSeqIdVerificationError if
any row is still above threshold.
Tests: 8 new in tests/test_repair.py (detection, heuristic, sidecar,
dry-run, segment filter, no-op, backup, rollback-on-verify-failure).
3 new in tests/test_backends.py (max_seq_id untouched by shim,
sysdb-10 prefix skipped in embeddings, legacy big-endian u64 BLOBs
still convert). Full suite: 1103 passed.
Three new tests cover the marker contract:
- test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is
written after a successful BLOB → INTEGER conversion.
- test_fix_blob_seq_ids_writes_marker_when_already_integer — marker
is written even on a no-op (already-INTEGER) path. The point of
the marker is to skip sqlite3.connect() on subsequent calls, not
to record that a conversion happened.
- test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies
the load-bearing property via monkeypatched sqlite3.connect: when
the marker exists, the function never opens sqlite. This is the
bug #1090 fix — opening Python's sqlite3 against a live ChromaDB
1.5.x WAL DB corrupts the next PersistentClient call, and we
must never do it again post-migration.
Also folded in @igorls's style nit:
- Path(marker).touch() instead of open(marker, "a").close().
Same effect, reads cleaner. Moved Path import to the top of the
module since it didn't yet exist there.
35/35 backend tests pass. The "Production: fresh MCP server + stop
hook + mempalace mine subprocess" checkbox in the PR body is checked
in the PR reply — the canonical 151K palace has been running this
fix since #1177 was filed (via Syncthing-replicated source on the
disks daemon at v1.7.0); zero PersistentClient corruption since.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore three `_pin_hnsw_threads` tests that the previous integrity-gate
commit deleted. The function is still live code on develop (defined at
chroma.py:207, called from chroma.py:705 + mcp_server.py), so the
deletion left the import unused (ruff F401) and dropped coverage on a
function unrelated to this PR's scope. Restored verbatim from main.
Plus three nits @igorls flagged:
- **Thread-safety doc**: `_quarantined_paths` mutation is lock-free;
documented that idempotency of `quarantine_stale_hnsw` is the safety
property (concurrent same-palace calls produce a benign redundant
rename attempt that no-ops, no need for a lock).
- **Pickle protocol assumption**: `_segment_appears_healthy` requires
PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today,
and a future protocol-0/1 emission would conservatively quarantine
+ lazy-rebuild rather than mis-classify as healthy.
- Class-level vs module-level scope: keeping class-level — the
conftest reset is the controlled case, and module-level wouldn't
remove the foot-gun, just relocate it. Conftest reset documented in
the existing comment is the right pattern for test isolation.
Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`)
deferred — that pattern lives in #1177's territory, not #1173's.
37/37 tests pass on the PR branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in #1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.
Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.
Fix: two-stage gate.
1. mtime gate (existing) — gap > stale_seconds is necessary.
2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
tail) and a non-trivial size, WITHOUT deserializing the file.
Healthy segment with mtime drift → keep in place; truncated /
zero-filled / partial-flush → quarantine.
Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.
The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.
4 new tests in test_backends.py:
- test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
- test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
(drift + valid meta — the production case at 06:24)
- test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
(fresh / never-flushed, no meta file)
- test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
size, partial-flush shape)
Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.
Suite 1366/1366 pass.
Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.
CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom on the canonical disks daemon: drift quarantines firing every
10–30 minutes throughout the day under steady write load. Logs show
.drift-* directories accumulating despite the daemon being the only
writer (no Syncthing replication of palace data).
Root cause is a false-positive thrash in the quarantine heuristic:
- chroma.sqlite3 mtime bumps on every write (millisecond cadence).
- HNSW segment files (data_level0.bin) only flush to disk on
chromadb's internal cadence, which can lag minutes behind sqlite
under continuous write load.
Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames
a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and
the cycle repeats as soon as the next batch of writes lands. The 300s
threshold (lowered from 3600s in PR #1173 after a 0.96h-drift production
segfault) is correct for the cross-machine-replication failure mode it
was designed for, but wrong for a daemon-strict deployment whose only
"drift" source is its own benign flush lag.
Fix: gate the proactive quarantine check to the first ``make_client()``
invocation per palace per process (``ChromaBackend._quarantined_paths``
set). Real cold-start drift (replication, partial restore, crashed-mid-
write) still gets caught — that's exactly when a fresh daemon process
opens the palace. Real runtime drift on observed HNSW errors still gets
caught via palace-daemon's ``_auto_repair`` which calls
``quarantine_stale_hnsw`` directly, bypassing this gate.
Two new tests in test_backends.py verify single-fire-per-palace and
per-palace independence. Conftest clears the gate between tests.
Suite 1362/1362 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After rebasing onto current develop:
- chroma.py: keep develop's quarantine_stale_hnsw + UnsupportedFilterError
validation alongside this PR's _pin_hnsw_threads retrofit.
- tests/test_backends.py: combine quarantine_stale_hnsw and
_pin_hnsw_threads test sections; ruff format.
- miner.py: propagate the new `files=` kwarg (added on develop in #1183
for the init -> mine flow) through _mine_impl so the caller can pass
a pre-scanned file list under the global lock.
Addresses remaining PR #976 review items after rebase on develop.
`get_collection(create=False)` previously returned existing collections without
re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the
unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls
`collection.modify(configuration=UpdateCollectionConfiguration(
hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every
`get_collection` call (including the MCP server's `_get_collection`).
In chromadb 1.5.x the runtime config does not persist to disk across
`PersistentClient` reopens, so the retrofit is re-applied each process start
rather than being a one-shot migration. Fresh palaces keep the metadata-based
pin as primary defense; legacy palaces now also get per-session protection
without requiring `mempalace nuke` + re-mine.
After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and
no longer emits `decision: block`, so the attempt-cap constant was orphaned.
Grep confirms 0 usages in the repo — remove it.
- `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1)
- `_pin_hnsw_threads` swallows all errors (never raises)
- `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace
- 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)
Add a helper that renames HNSW segment directories whose
`data_level0.bin` is significantly older than `chroma.sqlite3`. Drift
between the on-disk HNSW graph and the live embeddings table is the
root cause of a segfault class where the Rust graph-walk dereferences
dangling neighbor pointers for entries in the metadata segment that no
longer exist in the HNSW index, crashing in a background thread on
`count()` or `query()`.
Issue #823 describes the same drift as a silent-staleness symptom
(semantic search returns stale results after `add_drawer` because
`data_level0.bin` lags the sqlite metadata under the default
`sync_threshold=1000`). Under heavier load or after an interrupted
write, the same drift can escalate from "silent stale results" to
"SIGSEGV on next open," which is the failure mode observed at
neo-cortex-mcp#2 (chromadb 1.5.5, Python 3.12) and acknowledged at
chroma-core/chroma#2594.
On one 135K-drawer palace where `index_metadata.pickle` claimed 137,813
elements against 135,464 rows in sqlite (2,349-entry drift), fresh
Python processes crashed in `col.count()` 17/20 times; after renaming
the segment dir out of the way and letting ChromaDB rebuild lazily, the
same 20-run check went to 0 crashes.
The recovery path #823 suggests (export / recreate / reimport) is heavy
— it re-embeds every drawer. This helper is lighter: rename the segment
dir so ChromaDB reopens without it, and the indexer rebuilds lazily on
the next write. The original directory is renamed (not deleted) so the
operator can recover if the heuristic misfires.
If `chroma.sqlite3` is more than `stale_seconds` (default 3600) newer
than the segment's `data_level0.bin`, the segment is considered
suspect. One hour is deliberately conservative — normal HNSW flush
cadence is seconds to minutes, so an hour of drift implies a crashed
mid-write, not routine lag.
- Additive: exposes `quarantine_stale_hnsw(palace_path, stale_seconds)`
as a helper. Not wired into `_client()` / startup on this PR — the
goal is to land the primitive first so operators and higher layers
can opt in. A follow-up could call it automatically on palace open
behind an env var or config flag.
- Closes#823 by giving operators a first-class recovery path without
having to install `chromadb-ops` or re-mine.
Four new tests in `tests/test_backends.py`:
- renames drifted segment, preserves original files under `.drift-TS` suffix
- leaves fresh segments alone
- no-op on missing palace path / missing `chroma.sqlite3`
- skips already-quarantined (`.drift-` suffixed) directories
`pytest tests/test_backends.py` → 11 passed. `ruff check` / `ruff format
--check` — clean.
24bf97b (network-download fix) and my earlier Copilot-review commit both
added tests for the same ValueError. Keep the broader one that covers
both 'documents length' and 'metadatas length' mismatches; drop the
narrower duplicate.
PermissionError [WinError 32] on Windows when Path.unlink() runs while
chromadb.PersistentClient still holds a handle on chroma.sqlite3. Rewrite
test_chroma_cache_invalidates_when_db_file_missing to prime
backend._clients/_freshness with a sentinel object instead of opening a
real PersistentClient, so the unlink runs against an unheld file.
The assertion is also corrected: after invalidation, ChromaBackend's
_client rebuilds a fresh PersistentClient which re-creates chroma.sqlite3
and re-stats it, so freshness ends up at the post-rebuild stat (not
(0, 0.0) as the assertion previously expected). The meaningful invariant
is "freshness advanced past the pre-unlink value AND the sentinel was
replaced", which the test now checks.
Ref: Windows CI failure on 995.
test_base_collection_update_default_validates_list_lengths and
test_base_collection_update_default_rejects_mismatched_lengths were
spinning up a real ChromaBackend and calling add(documents=...), which
triggered ChromaDB's default ONNX embedding function and attempted a
network download — failing in offline/sandboxed CI.
BaseCollection.update() validates list lengths before any DB access, so
no items need to be pre-loaded for the length-check to fire. Switch both
tests to use _FakeCollection (same as the rest of the unit tests in this
file) so they are pure in-memory and network-free.
Also fixes a structural bug in test 1: collection._collection.add() was
accidentally placed inside the pytest.raises(ValueError) block, masking
the real assertion.
Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/55fc663e-b256-4b8b-88ce-4271560def8d
Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>
Four defects surfaced by the automated review, fixed with targeted tests:
1. BaseCollection.update() default now validates that documents / metadatas /
embeddings lengths match ids, raising ValueError instead of silently
misaligning pairs or raising IndexError (base.py).
2. ChromaCollection.query() now rejects the two ambiguous input shapes up
front — neither or both of query_texts / query_embeddings, and empty input
lists — with clear ValueError messages rather than delegating to chromadb's
less-obvious errors (chroma.py).
3. QueryResult.empty() accepts embeddings_requested=True to preserve the
outer-query dimension with empty hit lists when the caller asked for
embeddings, matching the spec rule that included fields carry the outer
shape even when empty (base.py). ChromaCollection.query() threads this
through on the empty-result path (chroma.py).
4. ChromaBackend cache-freshness check now matches the semantics from
mcp_server._get_client (merged via #757) on three edge cases Copilot
called out: (a) invalidate when chroma.sqlite3 disappears while a cached
client is held, (b) treat a 0→nonzero stat transition as a change so a
cache built when the DB did not yet exist is refreshed, (c) re-stat
after PersistentClient constructs the DB lazily so freshness reflects
the post-creation state (chroma.py).
Tests: 978 passed (up from 970), 8 new tests covering the fixes.
ChromaDB defaults HNSW index to L2 (Euclidean) distance, but
MemPalace scoring uses 1-distance which requires cosine (range 0-2).
Add metadata={"hnsw:space": "cosine"} to the 4 production and 3 test
call sites that were missing it.
Closes#218
Note from code review: (1) silent exception swallow on migration failure means caller proceeds with potentially corrupt DB — consider returning a boolean or re-raising in a follow-up. (2) No blob length validation before int.from_bytes — malformed rows could produce wrong seq_id values. Both are edge cases; the fix is still valuable for the common chromadb 0.6→1.5 migration path.
* refactor: add stage-1 backend abstraction seam
Introduce the first upstreamable storage seam for MemPalace without
bringing in the PostgreSQL spike or any benchmark artifacts.
This change adds a small backend package with:
- BaseCollection as the minimal collection contract
- ChromaBackend/ChromaCollection as the default implementation
It then routes the main runtime collection consumers through that seam:
- palace.py
- searcher.py
- layers.py
- palace_graph.py
- mcp_server.py
- miner.status()
Behavioral constraints kept for stage 1:
- ChromaDB remains the only backend and the default path
- no config/env backend selection yet
- no PostgreSQL code
- no benchmark or research files
- existing tests stay unchanged
Important compatibility details:
- read paths now call the seam with create=False so they still surface
the existing 'no palace found' behavior instead of silently creating
empty collections
- write paths keep create=True semantics through palace.get_collection()
- layers/searcher retain a chromadb module attribute so the existing
mock-based tests can keep patching PersistentClient unchanged
- ChromaBackend only creates palace directories on create=True, which
preserves mocked read-path tests that use fake read-only paths
Verification:
- python3 -m py_compile mempalace/backends/__init__.py mempalace/backends/base.py mempalace/backends/chroma.py mempalace/palace.py mempalace/searcher.py mempalace/layers.py mempalace/palace_graph.py mempalace/mcp_server.py mempalace/miner.py
- pytest -q # 529 passed, 106 deselected
* refactor: clean up stage-1 seam compatibility shims
Tighten the stage-1 backend abstraction branch after review.
This follow-up does three small things:
- keep the chromadb compatibility hook in searcher.py and layers.py,
but express it through the backends.chroma module so it no longer
reads like an accidental unused import
- fix the palace_graph.py helper alias to avoid the local name collision
flagged by ruff (imported helper vs local _get_collection wrapper)
- preserve the existing mock-based test patch points unchanged while
keeping the new backend seam intact
Why this matters:
- the direct form looked like a
dead import in review, even though it was intentionally preserving the
existing test seam ( and
)
- palace_graph.py had a real lint issue ( redefinition) that was
small but worth fixing before a public PR
Verification:
- /opt/homebrew/bin/ruff check mempalace/backends/__init__.py mempalace/backends/base.py mempalace/backends/chroma.py mempalace/palace.py mempalace/searcher.py mempalace/layers.py mempalace/palace_graph.py mempalace/mcp_server.py mempalace/miner.py
- pytest -q tests/test_layers.py tests/test_searcher.py
- pytest -q # 529 passed, 106 deselected
* docs: explain backend shim imports in search paths
Add short code comments in searcher.py and layers.py explaining why the
module-level `chromadb` alias remains after the stage-1 backend seam
refactor.
The alias is intentional: it preserves the existing mock patch points used
by the current test suite (`mempalace.searcher.chromadb.PersistentClient`
and `mempalace.layers.chromadb.PersistentClient`) while the runtime logic
now flows through the backend abstraction.
This keeps the public PR easier to review because the apparent "unused
import" now has an explicit reason next to it.
Verification:
- /opt/homebrew/bin/ruff check mempalace/searcher.py mempalace/layers.py
- pytest -q tests/test_layers.py tests/test_searcher.py
* refactor: reuse a default backend instance in palace helper
Tighten the stage-1 backend seam by promoting the default Chroma backend
adapter to a module-level singleton in `mempalace/palace.py`.
This keeps the stage-1 scope unchanged — Chroma is still the only backend
wired in this branch — but avoids constructing a fresh `ChromaBackend()`
object on every `get_collection()` call. The backend is stateless today,
so this is a readability/cleanup change rather than a behavioral one.
Why this helps:
- makes `palace.get_collection()` read like a real default factory instead
of an inline constructor call
- keeps the stage-1 branch a little cleaner before opening the public PR
- does not widen the backend surface or change any config/runtime behavior
Verification:
- python3 -m py_compile mempalace/palace.py
- pytest -q tests/test_miner.py tests/test_layers.py tests/test_searcher.py
- pytest -q # 529 passed, 106 deselected
* fix: harden read-only seam behavior and update seam tests
Preserve the stage-1 backend abstraction while closing the real read-path
regression surfaced in PR review.
What changed:
- make ChromaBackend.get_collection(create=False) fail fast when the palace
directory does not exist instead of letting PersistentClient create it as a
side effect
- update miner.status() to call get_collection(..., create=False) so status
keeps the historical 'No palace found' behavior
- remove the temporary chromadb shim aliases from layers.py and searcher.py
now that the tests patch the seam directly
- add focused tests for the new backends package, including ChromaCollection
delegation and ChromaBackend create=True/create=False behavior
- retarget layer/searcher tests to patch the backend seam instead of patching
chromadb.PersistentClient inside production modules
- add a regression test that status() does not create an empty palace when the
target path is missing
Verification:
- ruff check .
- uv run pytest -q
- uv run pytest -q tests/test_backends.py tests/test_cli.py tests/test_mcp_server.py tests/test_layers.py tests/test_searcher.py tests/test_miner.py
Notes:
- the separate benchmark/slow/stress layer was started as a soak but not used
as the merge gate for this PR branch
* refactor: drop duplicate mcp collection cache declaration
Remove a redundant `_collection_cache = None` assignment in
`mempalace/mcp_server.py` left over after the stage-1 backend seam refactor.
This does not change behavior; it only trims review noise in the MCP server
module after the read-path hardening pass.
Verification:
- ruff check mempalace/mcp_server.py
- uv run pytest -q tests/test_mcp_server.py
---------
Co-authored-by: Sergey Kuznetsov <sergey@iterudit.com>