Address Copilot review on #1403: the test seeked unconditionally to
offset 40960 with only `pre_size > 16384` as a guard. If pre_size sat
between 16384 and 40960 + 16384 = 57344 (e.g., on a chromadb version
that allocated fewer pages on init, or a future schema change), the
seek would extend the file with zero-padding and the original pages
would stay intact — quick_check would still pass on the (untouched)
real data, and the regression guard would silently skip detecting a
preflight-ordering regression.
Compute the offset from pre_size, page-aligned, with explicit asserts
that the file is large enough to mangle 4 pages without truncating
the header or extending past EOF.
#1364 added the SQLite quick_check preflight to rebuild_index, but
placed it AFTER backend.get_collection(...). On a SQLite-corrupt
palace, chromadb's rust binding raises pyo3_runtime.PanicException —
which is not a regular Exception subclass — so it propagates past the
existing `except Exception` handlers and the user sees a 30-line stack
trace instead of the friendly abort message #1364 was designed to
deliver. Reproduced with `mempalace repair --yes` against a palace
whose chroma.sqlite3 has 4 mangled pages: pre-fix, panic; post-fix,
the clean abort message and exit code 1.
Two changes:
- mempalace/cli.py cmd_repair: run sqlite_integrity_errors() right
after the basic palace-existence check, BEFORE the max_seq_id
preflight (which itself opens sqlite3) and BEFORE backend =
ChromaBackend(). Exit non-zero so unattended scripts and CI gates
see the failure.
- mempalace/repair.py rebuild_index: same move at the function level
for direct callers (tests, MCP) that bypass cmd_repair.
The new test test_rebuild_index_runs_sqlite_preflight_before_chromadb_open
uses a real chromadb-built palace (no ChromaBackend mock) plus a
real corrupt SQLite (16 KB of mangled pages) so the ordering is
exercised end-to-end. The previously-shipping test for the abort path
mocked both the backend and sqlite_integrity_errors, which is why the
ordering bug shipped CI-green.
Six existing test_cli.py cmd_repair tests used `(palace_dir /
"chroma.sqlite3").write_text("db")` to fake the SQLite file. The new
preflight correctly fails quick_check on those 2-byte stubs, so the
tests now create empty real SQLite DBs the same way the test_repair.py
fixtures already do.
#1357 (max_seq_id preflight) merged into develop while this branch
was in CI, opening a fresh conflict between the two preflight helpers.
mempalace/repair.py:
- Kept both: this branch's sqlite_integrity_errors() / print_sqlite_
integrity_abort() AND develop's maybe_repair_poisoned_max_seq_id_
before_rebuild() from #1357. They check for distinct corruption
classes and run as separate preflights.
tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors test group and
develop's max_seq_id preflight test group; non-overlapping coverage.
Local: 1623 tests pass, ruff lint+format clean against 0.4.x CI pin.
Conflicts opened by #1285 (temp-staging rebuild) and #1312
(collection_name in recovery paths) merging after this branch was
authored.
mempalace/repair.py:
- Kept this branch's sqlite_integrity_errors() and
print_sqlite_integrity_abort() helpers; took develop's rebuild_index
signature with the collection_name parameter from #1312. Normalized
the helper's print indent to 2 spaces to match the rest of the file.
tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors tests and develop's
rebuild_from_sqlite + configured-collection coverage.
- Replaced 7 sites of sqlite_path.write_text("fake") with
sqlite3.connect(...).close() — write_text("fake") fails PRAGMA
quick_check, so the new preflight aborts before the rebuild logic
the tests actually exercise. An empty real SQLite DB passes
quick_check and lets the tests run as intended.
- Took develop's temp-staging assertion shape (delete/create the
__repair_tmp collection in addition to the live drawers collection)
for the existing test_rebuild_index_success test.
Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
Three conflicts, all from develop landing #1285/#1310/#1312 after this
branch was authored:
- mempalace/cli.py: keep both import sets — this branch's
maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's
RebuildCollectionError / _close_chroma_handles / _extract_drawers /
_rebuild_collection_via_temp added in #1285.
- mempalace/repair.py: keep this branch's
maybe_repair_poisoned_max_seq_id_before_rebuild definition; use
develop's rebuild_index signature with the collection_name parameter
added in #1312. Normalized print indent to 2 spaces matching the
rest of the file.
- tests/test_repair.py: keep both this branch's max_seq_id preflight
tests and develop's rebuild_from_sqlite + configured-collection-name
tests; they exercise distinct code paths and don't overlap.
Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
The helper opened a chromadb PersistentClient via ChromaBackend and never
closed it, leaving rust-side SQLite/HNSW file locks alive after the
helper returned. On Windows that blocks the in-place archive rename
inside rebuild_from_sqlite with WinError 32 on data_level0.bin,
causing test_rebuild_from_sqlite_in_place_archives_when_opted_in and
test_rebuild_from_sqlite_raises_on_upsert_failure to fail in the
test-windows CI job. No test consumes the returned collection, so
closing the backend in a try/finally is safe and drops the return.
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on #1310:
- repair.py: drawers collection name now resolves from
MempalaceConfig().collection_name via _drawers_collection_name() (closets
stays fixed by design — AAAK index references drawer IDs by string).
Lines up with the broader configured-collection work in #1312 so that
PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
_rebuild_one_collection so a Chroma "Collection already exists" failure
surfaces as RebuildPartialError with archive_path, not an unstructured
exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
with backend.close() so PersistentClient handles to dest_palace are
released on every exit path. The from-sqlite path post-dates #1285's
lifecycle hardening of the legacy rebuild, so this needed its own
cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
rebuild_from_sqlite returns {} (validation refusal sentinel), so
unattended scripts/CI distinguish "invalid inputs" from a successful
rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
now asserts every backing segment is scope='METADATA', locking in the
segment-layout assumption against future regressions that point the
JOIN at the VECTOR segment.
New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `--mode legacy` and the inline `cli.cmd_repair` rebuild path
call `Collection.count()` as their first read — the same call that
raises `chromadb.errors.InternalError: Failed to apply logs to the
hnsw segment writer` on the corruption class reported in #1308.
Repair would print "Cannot recover — palace may need to be re-mined
from source files" even though the underlying SQLite tables were
fully intact.
The new `--mode from-sqlite` reads `(id, document, metadata)` rows
directly from `chroma.sqlite3` via `segments` → `embeddings` →
`embedding_metadata` joins, never opens a chromadb client against
the corrupt palace, and re-upserts everything into a fresh palace.
- `--source PATH` extracts from a corrupt palace already moved aside
- `--archive-existing` handles the in-place case by renaming the
existing palace to `<palace>.pre-rebuild-<timestamp>` first
- Partial-rebuild failures raise `RebuildPartialError` with the
archive path so users can recover; CLI exits non-zero
- In-place mode calls `SharedSystemClient.clear_system_cache()` to
drop chromadb's process-wide System registry (cross-palace use
does not, to limit blast radius for library callers)
- Source validation runs before any destructive moves
Verified end-to-end recovering a 52,300-row real-world corrupt
palace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rollback cleanup was instantiating a fresh ChromaBackend, so the live backend that had opened the PersistentClient could keep file handles alive during restore. Close the active backend instance instead so rollback and CLI recovery can release Windows-safe locks before copying the backup back into place.
`_compute_heuristic_seq_id` ran `int(row[0])` directly on the result
of `MAX(e.seq_id)`. On palaces where chromadb 1.5.x has been writing
seq_ids natively (8-byte big-endian uint64 BLOB), that raises
`ValueError: invalid literal for int() with base 10: b'...'` before
the dry-run can print, leaving users with no path through the
recovery feature added in #1135 — the only documented un-poison
route for palaces hit by the original PR #664 shim bug.
Decode BLOB return values via `int.from_bytes(val, "big")` and
keep the existing `int(val)` path for INTEGER rows. Regression
test seeds a BLOB row in `embeddings.seq_id` and asserts the
heuristic surfaces the correct integer.
The BLOB-seq_id migration shim (PR #664) ran int.from_bytes(..., 'big')
over every BLOB in max_seq_id, including chromadb 1.5.x's own native
format (b'\x11\x11' + 6 ASCII digits). That conversion yields a ~1.23e18
integer that silently suppresses every subsequent embeddings_queue write
for the affected segment (queue filter is seq_id > start), causing
silent drawer-write drops after a 1.5.x upgrade.
Two-part fix:
1. Shim narrowing (mempalace/backends/chroma.py)
- Drop max_seq_id from the shim loop. chromadb owns that column's
format; we don't reinterpret it.
- Defense-in-depth: skip rows in embeddings whose seq_id BLOB has the
sysdb-10 b'\x11\x11' prefix rather than misconvert.
2. Recovery command (mempalace/repair.py, mempalace/cli.py)
- mempalace repair --mode max-seq-id [--segment <uuid>]
[--from-sidecar <path>] [--dry-run] [--yes] [--no-backup]
- Detects poisoned rows via threshold (seq_id > 2**53).
- Default heuristic: MAX(embeddings.seq_id) over the collection owning
the poisoned segment. Matches METADATA max exactly; VECTOR segments
get a few seq_ids ahead (queue skips an already-indexed window — an
acceptable loss vs. resetting to 0 and re-processing everything).
- --from-sidecar copies clean values from a pre-corruption sqlite db.
- Backs up chroma.sqlite3, closes chroma handles, atomic UPDATEs,
post-repair verification that raises MaxSeqIdVerificationError if
any row is still above threshold.
Tests: 8 new in tests/test_repair.py (detection, heuristic, sidecar,
dry-run, segment filter, no-op, backup, rollback-on-verify-failure).
3 new in tests/test_backends.py (max_seq_id untouched by shim,
sysdb-10 prefix skipped in embeddings, legacy big-endian u64 BLOBs
still convert). Full suite: 1103 passed.
The user-reported case in #1208: a palace with 67,580 drawers had its
HNSW files manually quarantined to recover from corruption. ``mempalace
repair`` then ran cleanly and reported "Drawers found: 10000 ... Repair
complete. 10000 drawers rebuilt." Backup was the v3.3.3 chroma.sqlite3
that did contain the full 67,580 — but the rebuilt collection only had
the first 10K. 85% data loss, no warning.
Root cause: ChromaDB's collection-layer get() silently caps at
``CHROMADB_DEFAULT_GET_LIMIT = 10_000`` rows when reading from a
collection whose segment metadata is stale (typical post-quarantine
state). col.count() returns the same capped value, so neither the
loop bound nor the extraction count flagged the truncation.
Fix is defense-in-depth, not a recovery mechanism. Repair now:
1. After extraction, queries chroma.sqlite3 directly via a read-only
sqlite3 connection: COUNT(*) FROM embeddings JOIN segments JOIN
collections WHERE name='mempalace_drawers'. If that count exceeds
the extracted count, abort with a clear message before any
destructive operation.
2. Falls back to a weaker check when the SQLite query can't run
(chromadb schema drift, locked file): if extracted exactly equals
CHROMADB_DEFAULT_GET_LIMIT, that's a strong-enough cap signal to
refuse without explicit acknowledgement.
3. Adds ``--confirm-truncation-ok`` (CLI) and ``confirm_truncation_ok``
(rebuild_index kwarg) to override after independent verification.
Useful for the rare case of a palace genuinely sized at exactly
10,000 drawers.
The guard logic lives in ``repair.check_extraction_safety()`` so the
two extraction paths (CLI ``cmd_repair`` and the lower-level
``rebuild_index``) share a single implementation. Raises
``TruncationDetected`` carrying the printable message.
Tests: 9 new cases covering the safe path (counts match, SQLite
unreadable but well under cap), both abort paths (SQLite higher than
extracted, unreadable + at cap), the override flag, and end-to-end
behavior of ``rebuild_index`` with the guard wired in. Plus two
``sqlite_drawer_count`` tests for the missing-file and bad-schema
cases.
What's NOT in this PR: actually recovering the missing 57,580
drawers from the user's case. The on-disk SQLite still holds them;
recovery is a separate flow (direct-extract from chroma.sqlite3,
bypass the chromadb collection layer entirely). This PR's job is
to stop repair from making it worse.
Refs #1208.
Prerequisite for RFC 001 (plugin spec, #743). Removes every direct
`import chromadb` outside the ChromaDB backend itself so the core
modules depend only on the backend abstraction layer.
Extends ChromaBackend with make_client, get_or_create_collection,
delete_collection, create_collection, and backend_version. Adds
update() to the BaseCollection contract. Non-backend callers
(mcp_server, dedup, repair, migrate, cli) now go through the
abstraction; tests patch ChromaBackend instead of chromadb.
With this landed, the RFC 001 spec can be enforced and PalaceStore
(#643) can ship as a plugin without touching core modules.