mempalace

Author	SHA1	Message	Date
igorls	10733f1df4	fix(backends/chroma): wire quarantine_stale_hnsw into _client() to prevent SIGSEGV on stale HNSW (#1121 , #1132 , #1263 ) PR #1173 wired quarantine_stale_hnsw into the static make_client() helper but not into the instance _client() method. As a result every non-MCP entry point (CLI mining, search, repair, status) — which all use get_collection / _get_or_create_collection / _client() — skipped the cold-start quarantine pass and could SIGSEGV on a stale HNSW segment left over from a partial flush, replicated palace, or crashed-mid-write. Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw) pre-open pass into a single private static helper ChromaBackend._prepare_palace_for_open(). Both make_client() and _client() now route through it, so the _quarantined_paths once-per- palace-per-process gate is preserved (no runtime thrash on hot paths) and behaviour stays identical — the fix is purely about extending the existing protection to the path that was missing it. Tests: - test_client_quarantines_corrupt_segment_on_first_open mirrors the existing make_client test and verifies _client() actually renames a corrupt segment on first open. - test_client_quarantines_only_on_first_call_per_palace verifies the cache gate prevents re-running quarantine across repeated _client() calls — important because _client() is hit on every backend op. Closes #1121. Closes #1132. Closes #1263. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 22:56:36 -03:00
Legion345	d7f4638157	fix(storage): stop ChromaDB from crashing when reopening an existing palace	2026-04-28 13:08:04 -07:00
Igor Lins e Silva	de7801ecff	Merge pull request #1191 from funguf/fix/hnsw-index-bloat-rebased fix: prevent HNSW index bloat from resize+persist cycles	2026-04-27 03:37:57 -03:00
igorls	04c48dd0fe	fix(chroma): write blob-fix marker even when narrowing skips all rows The narrowed _fix_blob_seq_ids returned early when safe_rows was empty, but #1177's marker contract requires the marker to be written on every successful pass — even no-op — so subsequent opens skip the sqlite3 connection entirely. Without this, palaces that have no genuine 0.6.x BLOBs but DO have sysdb-10-prefixed rows would re-open sqlite3 on every call, defeating the #1090 corruption guard. Restructured the conditional so the marker write is unconditional after a successful sqlite scan, regardless of whether any rows were updated. Surfaced by test_fix_blob_seq_ids_writes_marker_when_already_integer during the develop-rebase of this PR. The author's branch predates the marker contract from #1177 (merged 2026-04-26), so this is a rebase-edge fix-up rather than a logic change to their narrowing behaviour.	2026-04-27 03:01:41 -03:00
eblander	f5c8b095dd	fix: narrow _fix_blob_seq_ids shim + add repair --mode max-seq-id The BLOB-seq_id migration shim (PR #664) ran int.from_bytes(..., 'big') over every BLOB in max_seq_id, including chromadb 1.5.x's own native format (b'\x11\x11' + 6 ASCII digits). That conversion yields a ~1.23e18 integer that silently suppresses every subsequent embeddings_queue write for the affected segment (queue filter is seq_id > start), causing silent drawer-write drops after a 1.5.x upgrade. Two-part fix: 1. Shim narrowing (mempalace/backends/chroma.py) - Drop max_seq_id from the shim loop. chromadb owns that column's format; we don't reinterpret it. - Defense-in-depth: skip rows in embeddings whose seq_id BLOB has the sysdb-10 b'\x11\x11' prefix rather than misconvert. 2. Recovery command (mempalace/repair.py, mempalace/cli.py) - mempalace repair --mode max-seq-id [--segment <uuid>] [--from-sidecar <path>] [--dry-run] [--yes] [--no-backup] - Detects poisoned rows via threshold (seq_id > 2**53). - Default heuristic: MAX(embeddings.seq_id) over the collection owning the poisoned segment. Matches METADATA max exactly; VECTOR segments get a few seq_ids ahead (queue skips an already-indexed window — an acceptable loss vs. resetting to 0 and re-processing everything). - --from-sidecar copies clean values from a pre-corruption sqlite db. - Backs up chroma.sqlite3, closes chroma handles, atomic UPDATEs, post-repair verification that raises MaxSeqIdVerificationError if any row is still above threshold. Tests: 8 new in tests/test_repair.py (detection, heuristic, sidecar, dry-run, segment filter, no-op, backup, rollback-on-verify-failure). 3 new in tests/test_backends.py (max_seq_id untouched by shim, sysdb-10 prefix skipped in embeddings, legacy big-endian u64 BLOBs still convert). Full suite: 1103 passed.	2026-04-27 02:57:01 -03:00
Fergus Ching	88a53b2ffa	fix: prevent HNSW index bloat via batch_size + sync_threshold metadata Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every collection-creation call site: * `mempalace/backends/chroma.py` — `get_collection(create=True)` and the legacy `create_collection()` path. Preserves existing `hnsw:space`, `hnsw:num_threads=1` (race fix from #976), and `*ef_kwargs` (embedding-function plumbing from `a4868a3`). `mempalace/mcp_server.py` — the direct `client.get_or_create_collection` path used when a palace is first opened by the MCP server. Without this third site, MCP-bootstrapped palaces would skip the guard and could still trigger the original bloat. Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index resizes and hundreds of persistDirty() calls. persistDirty uses relative seek positioning in link_lists.bin; accumulated seek drift across resize cycles causes the OS to extend the sparse file with zero-filled regions, each cycle compounding the next. Result: link_lists.bin grows into hundreds of GB sparse, after which `status`, `search`, and `repair` all segfault and the palace is unrecoverable. Empirical: rebuilt a palace from scratch on 39,792 drawers across 5 wings with this fix applied. Final palace 376 MB, link_lists.bin stays at 0 bytes across both Chroma collection dirs, status and search both return cleanly. Same workload without the fix bloated the palace to 565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers. Migration note: chromadb 1.5.x exposes a `collection.modify(configuration={"hnsw": {...}})` retrofit path for already-created collections (`UpdateHNSWConfiguration`), but this PR doesn't pursue it — by the time link_lists.bin has bloated the index is already corrupt and the only known recovery is a fresh mine. Tests assert both keys land on the persisted collection metadata in both `ChromaBackend` code paths, which also covers the #1161 "config silently dropped" concern at CI time. A separate smoke test was used to verify the metadata round-trips through `chromadb.PersistentClient` reopen on chromadb 1.5.8. Closes #344 Supersedes #346 Co-authored-by: robot-rocket-science <robot-rocket-science@users.noreply.github.com>	2026-04-27 02:54:19 -03:00
Igor Lins e Silva	025dd03047	Merge pull request #1177 from jphein/fix/blob-seq-marker-guard fix: skip _fix_blob_seq_ids sqlite open on already-migrated palaces (#1090)	2026-04-26 18:28:55 -03:00
Igor Lins e Silva	8a1fd95b1b	style: drop unused pathlib.Path imports in marker tests	2026-04-26 18:23:31 -03:00
Igor Lins e Silva	43aa1aa24e	style: ruff format under CI-pinned 0.4.x	2026-04-26 18:19:15 -03:00
jp	247744296d	fix(blob-seq-marker): tests + style nit per @igorls #1177 review Three new tests cover the marker contract: - test_fix_blob_seq_ids_writes_marker_after_blob_path — marker is written after a successful BLOB → INTEGER conversion. - test_fix_blob_seq_ids_writes_marker_when_already_integer — marker is written even on a no-op (already-INTEGER) path. The point of the marker is to skip sqlite3.connect() on subsequent calls, not to record that a conversion happened. - test_fix_blob_seq_ids_skips_sqlite_when_marker_present — verifies the load-bearing property via monkeypatched sqlite3.connect: when the marker exists, the function never opens sqlite. This is the bug #1090 fix — opening Python's sqlite3 against a live ChromaDB 1.5.x WAL DB corrupts the next PersistentClient call, and we must never do it again post-migration. Also folded in @igorls's style nit: - Path(marker).touch() instead of open(marker, "a").close(). Same effect, reads cleaner. Moved Path import to the top of the module since it didn't yet exist there. 35/35 backend tests pass. The "Production: fresh MCP server + stop hook + mempalace mine subprocess" checkbox in the PR body is checked in the PR reply — the canonical 151K palace has been running this fix since #1177 was filed (via Syncthing-replicated source on the disks daemon at v1.7.0); zero PersistentClient corruption since. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:21:50 -07:00
jp	942aae3ed5	fix(hnsw): address @igorls's #1173 review Restore three `_pin_hnsw_threads` tests that the previous integrity-gate commit deleted. The function is still live code on develop (defined at chroma.py:207, called from chroma.py:705 + mcp_server.py), so the deletion left the import unused (ruff F401) and dropped coverage on a function unrelated to this PR's scope. Restored verbatim from main. Plus three nits @igorls flagged: - Thread-safety doc: `_quarantined_paths` mutation is lock-free; documented that idempotency of `quarantine_stale_hnsw` is the safety property (concurrent same-palace calls produce a benign redundant rename attempt that no-ops, no need for a lock). - Pickle protocol assumption: `_segment_appears_healthy` requires PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today, and a future protocol-0/1 emission would conservatively quarantine + lazy-rebuild rather than mis-classify as healthy. - Class-level vs module-level scope: keeping class-level — the conftest reset is the controlled case, and module-level wouldn't remove the foot-gun, just relocate it. Conftest reset documented in the existing comment is the right pattern for test isolation. Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`) deferred — that pattern lives in #1177's territory, not #1173's. 37/37 tests pass on the PR branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:20:14 -07:00
jp	74ff5e6b98	fix(hnsw): integrity gate in quarantine_stale_hnsw — corruption vs flush-lag Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded the (lowered, in #1173) 300s threshold. ChromaDB 1.5.x flushes HNSW asynchronously and a clean shutdown does not force-flush, so the on- disk HNSW is always meaningfully older than chroma.sqlite3 — that's the steady state, not corruption. Quarantine renamed valid HNSW segments on every cold-start, chromadb created empty replacements, vector recall went to 0/N until rebuild. Confirmed in production on the disks daemon journal, 2026-04-26 06:56:45: three of three HNSW segments quarantined on cold-start with 538-557s mtime gaps (post-clean-shutdown flush lag), leaving a 151,478-drawer palace with vector_ranked=0. Drift directories at *.drift-20260426-065645/ each contained a complete 253MB data_level0.bin plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by the false-positive heuristic. Fix: two-stage gate. 1. mtime gate (existing) — gap > stale_seconds is necessary. 2. integrity gate (new) — sniff index_metadata.pickle for chromadb's expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e tail) and a non-trivial size, WITHOUT deserializing the file. Healthy segment with mtime drift → keep in place; truncated / zero-filled / partial-flush → quarantine. Format-sniff is deliberately non-deserializing — pickle deserialization can execute arbitrary code, and the PROTO+STOP byte presence + size floor is sufficient to distinguish a complete chromadb write from truncation, zero-fill, or a partial flush during process kill. Real load failures (the rare case where the bytes look right but chromadb fails to load) still surface to palace-daemon's _auto_repair, which calls quarantine_stale_hnsw directly on observed HNSW errors and bypasses this gate. The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization — even with the integrity check, repeating the sniff on every reconnect is unnecessary work — but its load-bearing role is now covered by this deeper fix. 4 new tests in test_backends.py: - test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta) - test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone (drift + valid meta — the production case at 06:24) - test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone (fresh / never-flushed, no meta file) - test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor size, partial-flush shape) Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed to renames_corrupt_segment with explicit corrupt meta_bytes — the old "renames any drift" contract is gone. Suite 1366/1366 pass. Coordinated cross-repo with palace-daemon's auto-repair-on-startup workaround (separate agent's commit ed3a892). With this fork-side fix the auto-repair becomes belt-and-suspenders; the structural cause of empty-HNSW-on-restart is addressed at the quarantine layer. CLAUDE.md row 26 + README fork-change-queue row + test count 1363→1366. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 09:40:25 -07:00
jp	e5e7a57930	fix(hnsw): gate quarantine_stale_hnsw to cold-start, not every reconnect Symptom on the canonical disks daemon: drift quarantines firing every 10–30 minutes throughout the day under steady write load. Logs show .drift-* directories accumulating despite the daemon being the only writer (no Syncthing replication of palace data). Root cause is a false-positive thrash in the quarantine heuristic: - chroma.sqlite3 mtime bumps on every write (millisecond cadence). - HNSW segment files (data_level0.bin) only flush to disk on chromadb's internal cadence, which can lag minutes behind sqlite under continuous write load. Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and the cycle repeats as soon as the next batch of writes lands. The 300s threshold (lowered from 3600s in PR #1173 after a 0.96h-drift production segfault) is correct for the cross-machine-replication failure mode it was designed for, but wrong for a daemon-strict deployment whose only "drift" source is its own benign flush lag. Fix: gate the proactive quarantine check to the first ``make_client()`` invocation per palace per process (``ChromaBackend._quarantined_paths`` set). Real cold-start drift (replication, partial restore, crashed-mid- write) still gets caught — that's exactly when a fresh daemon process opens the palace. Real runtime drift on observed HNSW errors still gets caught via palace-daemon's ``_auto_repair`` which calls ``quarantine_stale_hnsw`` directly, bypassing this gate. Two new tests in test_backends.py verify single-fire-per-palace and per-palace independence. Conftest clears the gate between tests. Suite 1362/1362 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 09:40:25 -07:00
Igor Lins e Silva	7773432bca	chore(rebase): reconcile with develop and apply ruff format After rebasing onto current develop: - chroma.py: keep develop's quarantine_stale_hnsw + UnsupportedFilterError validation alongside this PR's _pin_hnsw_threads retrofit. - tests/test_backends.py: combine quarantine_stale_hnsw and _pin_hnsw_threads test sections; ruff format. - miner.py: propagate the new `files=` kwarg (added on develop in #1183 for the init -> mine flow) through _mine_impl so the caller can pass a pre-scanned file list under the global lock.	2026-04-25 04:39:31 -03:00
Felipe Truman	8df944a54d	fix: best-effort HNSW thread-pin retrofit + drop dead attempt-cap constant Addresses remaining PR #976 review items after rebase on develop. `get_collection(create=False)` previously returned existing collections without re-applying `hnsw:num_threads=1`, so palaces created before the fix kept the unsafe parallel-insert path. Add `_pin_hnsw_threads()` helper that calls `collection.modify(configuration=UpdateCollectionConfiguration( hnsw=UpdateHNSWConfiguration(num_threads=1)))` best-effort on every `get_collection` call (including the MCP server's `_get_collection`). In chromadb 1.5.x the runtime config does not persist to disk across `PersistentClient` reopens, so the retrofit is re-applied each process start rather than being a one-shot migration. Fresh palaces keep the metadata-based pin as primary defense; legacy palaces now also get per-session protection without requiring `mempalace nuke` + re-mine. After the rebase on develop, `hook_precompact` delegates to `_mine_sync` and no longer emits `decision: block`, so the attempt-cap constant was orphaned. Grep confirms 0 usages in the repo — remove it. - `_pin_hnsw_threads` retrofits legacy collection (num_threads None -> 1) - `_pin_hnsw_threads` swallows all errors (never raises) - `ChromaBackend.get_collection(create=False)` applies retrofit on legacy palace - 62 tests pass (10 backends + 6 palace locks + 46 hooks_cli)	2026-04-25 04:36:29 -03:00
jp	0c38deaab5	feat(backends): quarantine_stale_hnsw — recover from HNSW/sqlite drift Add a helper that renames HNSW segment directories whose `data_level0.bin` is significantly older than `chroma.sqlite3`. Drift between the on-disk HNSW graph and the live embeddings table is the root cause of a segfault class where the Rust graph-walk dereferences dangling neighbor pointers for entries in the metadata segment that no longer exist in the HNSW index, crashing in a background thread on `count()` or `query()`. Issue #823 describes the same drift as a silent-staleness symptom (semantic search returns stale results after `add_drawer` because `data_level0.bin` lags the sqlite metadata under the default `sync_threshold=1000`). Under heavier load or after an interrupted write, the same drift can escalate from "silent stale results" to "SIGSEGV on next open," which is the failure mode observed at neo-cortex-mcp#2 (chromadb 1.5.5, Python 3.12) and acknowledged at chroma-core/chroma#2594. On one 135K-drawer palace where `index_metadata.pickle` claimed 137,813 elements against 135,464 rows in sqlite (2,349-entry drift), fresh Python processes crashed in `col.count()` 17/20 times; after renaming the segment dir out of the way and letting ChromaDB rebuild lazily, the same 20-run check went to 0 crashes. The recovery path #823 suggests (export / recreate / reimport) is heavy — it re-embeds every drawer. This helper is lighter: rename the segment dir so ChromaDB reopens without it, and the indexer rebuilds lazily on the next write. The original directory is renamed (not deleted) so the operator can recover if the heuristic misfires. If `chroma.sqlite3` is more than `stale_seconds` (default 3600) newer than the segment's `data_level0.bin`, the segment is considered suspect. One hour is deliberately conservative — normal HNSW flush cadence is seconds to minutes, so an hour of drift implies a crashed mid-write, not routine lag. - Additive: exposes `quarantine_stale_hnsw(palace_path, stale_seconds)` as a helper. Not wired into `_client()` / startup on this PR — the goal is to land the primitive first so operators and higher layers can opt in. A follow-up could call it automatically on palace open behind an env var or config flag. - Closes #823 by giving operators a first-class recovery path without having to install `chromadb-ops` or re-mine. Four new tests in `tests/test_backends.py`: - renames drifted segment, preserves original files under `.drift-TS` suffix - leaves fresh segments alone - no-op on missing palace path / missing `chroma.sqlite3` - skips already-quarantined (`.drift-` suffixed) directories `pytest tests/test_backends.py` → 11 passed. `ruff check` / `ruff format --check` — clean.	2026-04-18 18:04:05 -07:00
Igor Lins e Silva	efaa39bea9	test(backends): dedup update-length-validation tests `24bf97b` (network-download fix) and my earlier Copilot-review commit both added tests for the same ValueError. Keep the broader one that covers both 'documents length' and 'metadatas length' mismatches; drop the narrower duplicate.	2026-04-18 13:53:46 -03:00
Igor Lins e Silva	61dd6e7d9c	test(backends): fix Windows file-lock in cache-invalidation test PermissionError [WinError 32] on Windows when Path.unlink() runs while chromadb.PersistentClient still holds a handle on chroma.sqlite3. Rewrite test_chroma_cache_invalidates_when_db_file_missing to prime backend._clients/_freshness with a sentinel object instead of opening a real PersistentClient, so the unlink runs against an unheld file. The assertion is also corrected: after invalidation, ChromaBackend's _client rebuilds a fresh PersistentClient which re-creates chroma.sqlite3 and re-stats it, so freshness ends up at the post-rebuild stat (not (0, 0.0) as the assertion previously expected). The meaningful invariant is "freshness advanced past the pre-unlink value AND the sentinel was replaced", which the test now checks. Ref: Windows CI failure on 995.	2026-04-18 13:52:56 -03:00
copilot-swe-agent[bot]	24bf97bb65	fix(tests): avoid ONNX network download in update-length validation tests test_base_collection_update_default_validates_list_lengths and test_base_collection_update_default_rejects_mismatched_lengths were spinning up a real ChromaBackend and calling add(documents=...), which triggered ChromaDB's default ONNX embedding function and attempted a network download — failing in offline/sandboxed CI. BaseCollection.update() validates list lengths before any DB access, so no items need to be pre-loaded for the length-check to fire. Switch both tests to use _FakeCollection (same as the rest of the unit tests in this file) so they are pure in-memory and network-free. Also fixes a structural bug in test 1: collection._collection.add() was accidentally placed inside the pytest.raises(ValueError) block, masking the real assertion. Agent-Logs-Url: https://github.com/MemPalace/mempalace/sessions/55fc663e-b256-4b8b-88ce-4271560def8d Co-authored-by: igorls <4753812+igorls@users.noreply.github.com>	2026-04-18 16:23:58 +00:00
Igor Lins e Silva	42b940d263	fix(backends): address Copilot review on #995 Four defects surfaced by the automated review, fixed with targeted tests: 1. BaseCollection.update() default now validates that documents / metadatas / embeddings lengths match ids, raising ValueError instead of silently misaligning pairs or raising IndexError (base.py). 2. ChromaCollection.query() now rejects the two ambiguous input shapes up front — neither or both of query_texts / query_embeddings, and empty input lists — with clear ValueError messages rather than delegating to chromadb's less-obvious errors (chroma.py). 3. QueryResult.empty() accepts embeddings_requested=True to preserve the outer-query dimension with empty hit lists when the caller asked for embeddings, matching the spec rule that included fields carry the outer shape even when empty (base.py). ChromaCollection.query() threads this through on the empty-result path (chroma.py). 4. ChromaBackend cache-freshness check now matches the semantics from mcp_server._get_client (merged via #757) on three edge cases Copilot called out: (a) invalidate when chroma.sqlite3 disappears while a cached client is held, (b) treat a 0→nonzero stat transition as a change so a cache built when the DB did not yet exist is refreshed, (c) re-stat after PersistentClient constructs the DB lazily so freshness reflects the post-creation state (chroma.py). Tests: 978 passed (up from 970), 8 new tests covering the fixes.	2026-04-18 13:19:18 -03:00
Igor Lins e Silva	a17a8b734a	refactor(backends): typed QueryResult/GetResult, PalaceRef, BaseBackend registry (RFC 001 §10) Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres, #700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target to align against. Scope (this PR): - Typed QueryResult / GetResult dataclasses replace Chroma's dict shape at the BaseCollection boundary (§1.3). A transitional _DictCompatMixin keeps existing callers working while the attribute-access migration proceeds. - BaseCollection is now kwargs-only across add/upsert/query/get/delete/update with ABC defaults for estimated_count/close/health and a non-atomic default update() (§1.1–1.2). - PalaceRef replaces raw path strings at the backend boundary (§2.2). - BaseBackend ABC with get_collection/close_palace/close/health/detect (§2.3). - mempalace.backends entry-point group + in-tree registry with resolve_backend_for_palace priority order matching §3.2–3.3. - ChromaCollection normalizes chroma returns into typed results; unknown where-clause operators raise UnsupportedFilterError (no silent drop, §1.4). - ChromaBackend absorbs the inode/mtime client-cache freshness check previously duplicated in mcp_server._get_client() (§10 + PR #757). - searcher.py migrated to typed-attribute access as the reference call site; remaining callers land in a follow-up. - pyproject: chroma registered via [project.entry-points."mempalace.backends"]. Out of scope (explicit follow-ups): - Full caller migration off the dict-compat shim across palace.py, mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py, palace_graph.py, cli.py, closet_llm.py. - Embedder injection + three-state EmbedderIdentityMismatchError check (§1.5). - maintenance_state() / run_maintenance() benchmark hooks (§7.3). - AbstractBackendContractSuite full coverage (§7.1–7.2). - mempalace migrate / mempalace verify CLI rewrites through BaseCollection (§8). Tests: 970 passed (up from 967 on develop); new coverage for typed results, empty-result outer-shape preservation, \$regex rejection, registry lookup, priority resolver, and PalaceRef-kwarg ChromaBackend.get_collection. Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).	2026-04-18 12:45:16 -03:00
eblander	1e86892e62	Fix: set cosine distance metadata on all collection creation sites ChromaDB defaults HNSW index to L2 (Euclidean) distance, but MemPalace scoring uses 1-distance which requires cosine (range 0-2). Add metadata={"hnsw:space": "cosine"} to the 4 production and 3 test call sites that were missing it. Closes #218	2026-04-13 11:00:52 -04:00
Jeffrey Hein	abc99f4154	fix: auto-repair BLOB seq_ids from chromadb 0.6→1.5 migration (#664 ) Note from code review: (1) silent exception swallow on migration failure means caller proceeds with potentially corrupt DB — consider returning a boolean or re-raising in a follow-up. (2) No blob length validation before int.from_bytes — malformed rows could produce wrong seq_id values. Both are edge cases; the fix is still valuable for the common chromadb 0.6→1.5 migration path.	2026-04-11 23:06:01 -07:00
Sergey Kuznetsov	ae5196bc8d	Мempalace backend seam (#413 ) * refactor: add stage-1 backend abstraction seam Introduce the first upstreamable storage seam for MemPalace without bringing in the PostgreSQL spike or any benchmark artifacts. This change adds a small backend package with: - BaseCollection as the minimal collection contract - ChromaBackend/ChromaCollection as the default implementation It then routes the main runtime collection consumers through that seam: - palace.py - searcher.py - layers.py - palace_graph.py - mcp_server.py - miner.status() Behavioral constraints kept for stage 1: - ChromaDB remains the only backend and the default path - no config/env backend selection yet - no PostgreSQL code - no benchmark or research files - existing tests stay unchanged Important compatibility details: - read paths now call the seam with create=False so they still surface the existing 'no palace found' behavior instead of silently creating empty collections - write paths keep create=True semantics through palace.get_collection() - layers/searcher retain a chromadb module attribute so the existing mock-based tests can keep patching PersistentClient unchanged - ChromaBackend only creates palace directories on create=True, which preserves mocked read-path tests that use fake read-only paths Verification: - python3 -m py_compile mempalace/backends/__init__.py mempalace/backends/base.py mempalace/backends/chroma.py mempalace/palace.py mempalace/searcher.py mempalace/layers.py mempalace/palace_graph.py mempalace/mcp_server.py mempalace/miner.py - pytest -q # 529 passed, 106 deselected * refactor: clean up stage-1 seam compatibility shims Tighten the stage-1 backend abstraction branch after review. This follow-up does three small things: - keep the chromadb compatibility hook in searcher.py and layers.py, but express it through the backends.chroma module so it no longer reads like an accidental unused import - fix the palace_graph.py helper alias to avoid the local name collision flagged by ruff (imported helper vs local _get_collection wrapper) - preserve the existing mock-based test patch points unchanged while keeping the new backend seam intact Why this matters: - the direct form looked like a dead import in review, even though it was intentionally preserving the existing test seam ( and ) - palace_graph.py had a real lint issue ( redefinition) that was small but worth fixing before a public PR Verification: - /opt/homebrew/bin/ruff check mempalace/backends/__init__.py mempalace/backends/base.py mempalace/backends/chroma.py mempalace/palace.py mempalace/searcher.py mempalace/layers.py mempalace/palace_graph.py mempalace/mcp_server.py mempalace/miner.py - pytest -q tests/test_layers.py tests/test_searcher.py - pytest -q # 529 passed, 106 deselected * docs: explain backend shim imports in search paths Add short code comments in searcher.py and layers.py explaining why the module-level `chromadb` alias remains after the stage-1 backend seam refactor. The alias is intentional: it preserves the existing mock patch points used by the current test suite (`mempalace.searcher.chromadb.PersistentClient` and `mempalace.layers.chromadb.PersistentClient`) while the runtime logic now flows through the backend abstraction. This keeps the public PR easier to review because the apparent "unused import" now has an explicit reason next to it. Verification: - /opt/homebrew/bin/ruff check mempalace/searcher.py mempalace/layers.py - pytest -q tests/test_layers.py tests/test_searcher.py * refactor: reuse a default backend instance in palace helper Tighten the stage-1 backend seam by promoting the default Chroma backend adapter to a module-level singleton in `mempalace/palace.py`. This keeps the stage-1 scope unchanged — Chroma is still the only backend wired in this branch — but avoids constructing a fresh `ChromaBackend()` object on every `get_collection()` call. The backend is stateless today, so this is a readability/cleanup change rather than a behavioral one. Why this helps: - makes `palace.get_collection()` read like a real default factory instead of an inline constructor call - keeps the stage-1 branch a little cleaner before opening the public PR - does not widen the backend surface or change any config/runtime behavior Verification: - python3 -m py_compile mempalace/palace.py - pytest -q tests/test_miner.py tests/test_layers.py tests/test_searcher.py - pytest -q # 529 passed, 106 deselected * fix: harden read-only seam behavior and update seam tests Preserve the stage-1 backend abstraction while closing the real read-path regression surfaced in PR review. What changed: - make ChromaBackend.get_collection(create=False) fail fast when the palace directory does not exist instead of letting PersistentClient create it as a side effect - update miner.status() to call get_collection(..., create=False) so status keeps the historical 'No palace found' behavior - remove the temporary chromadb shim aliases from layers.py and searcher.py now that the tests patch the seam directly - add focused tests for the new backends package, including ChromaCollection delegation and ChromaBackend create=True/create=False behavior - retarget layer/searcher tests to patch the backend seam instead of patching chromadb.PersistentClient inside production modules - add a regression test that status() does not create an empty palace when the target path is missing Verification: - ruff check . - uv run pytest -q - uv run pytest -q tests/test_backends.py tests/test_cli.py tests/test_mcp_server.py tests/test_layers.py tests/test_searcher.py tests/test_miner.py Notes: - the separate benchmark/slow/stress layer was started as a soak but not used as the merge gate for this PR branch * refactor: drop duplicate mcp collection cache declaration Remove a redundant `_collection_cache = None` assignment in `mempalace/mcp_server.py` left over after the stage-1 backend seam refactor. This does not change behavior; it only trims review noise in the MCP server module after the read-path hardening pass. Verification: - ruff check mempalace/mcp_server.py - uv run pytest -q tests/test_mcp_server.py --------- Co-authored-by: Sergey Kuznetsov <sergey@iterudit.com>	2026-04-11 16:16:49 -07:00

24 Commits