Files
mempalace/tests
bensig 452630e927 fix(repair): refuse to overwrite when extraction looks truncated (#1208)
The user-reported case in #1208: a palace with 67,580 drawers had its
HNSW files manually quarantined to recover from corruption. ``mempalace
repair`` then ran cleanly and reported "Drawers found: 10000 ... Repair
complete. 10000 drawers rebuilt." Backup was the v3.3.3 chroma.sqlite3
that did contain the full 67,580 — but the rebuilt collection only had
the first 10K. 85% data loss, no warning.

Root cause: ChromaDB's collection-layer get() silently caps at
``CHROMADB_DEFAULT_GET_LIMIT = 10_000`` rows when reading from a
collection whose segment metadata is stale (typical post-quarantine
state). col.count() returns the same capped value, so neither the
loop bound nor the extraction count flagged the truncation.

Fix is defense-in-depth, not a recovery mechanism. Repair now:

1. After extraction, queries chroma.sqlite3 directly via a read-only
   sqlite3 connection: COUNT(*) FROM embeddings JOIN segments JOIN
   collections WHERE name='mempalace_drawers'. If that count exceeds
   the extracted count, abort with a clear message before any
   destructive operation.
2. Falls back to a weaker check when the SQLite query can't run
   (chromadb schema drift, locked file): if extracted exactly equals
   CHROMADB_DEFAULT_GET_LIMIT, that's a strong-enough cap signal to
   refuse without explicit acknowledgement.
3. Adds ``--confirm-truncation-ok`` (CLI) and ``confirm_truncation_ok``
   (rebuild_index kwarg) to override after independent verification.
   Useful for the rare case of a palace genuinely sized at exactly
   10,000 drawers.

The guard logic lives in ``repair.check_extraction_safety()`` so the
two extraction paths (CLI ``cmd_repair`` and the lower-level
``rebuild_index``) share a single implementation. Raises
``TruncationDetected`` carrying the printable message.

Tests: 9 new cases covering the safe path (counts match, SQLite
unreadable but well under cap), both abort paths (SQLite higher than
extracted, unreadable + at cap), the override flag, and end-to-end
behavior of ``rebuild_index`` with the guard wired in. Plus two
``sqlite_drawer_count`` tests for the missing-file and bad-schema
cases.

What's NOT in this PR: actually recovering the missing 57,580
drawers from the user's case. The on-disk SQLite still holds them;
recovery is a separate flow (direct-extract from chroma.sqlite3,
bypass the chromadb collection layer entirely). This PR's job is
to stop repair from making it worse.

Refs #1208.
2026-04-25 23:34:05 -07:00
..
2026-04-11 16:16:49 -07:00