376 Commits

Author SHA1 Message Date
Igor Lins e Silva 2fc47a52fc Merge pull request #1004 from coogie/coogie/fix/miner-routing
Tests / test-linux (3.11) (push) Successful in 13m29s
Tests / test-linux (3.13) (push) Successful in 13m6s
Tests / test-linux (3.9) (push) Successful in 13m14s
Tests / lint (push) Successful in 9m24s
Tests / test-windows (push) Has been cancelled
Tests / test-macos (push) Has been cancelled
fix(miner): use token-boundary matching in detect_room
2026-05-09 01:36:51 -03:00
mvalentsev 1822756afe fix(test_sync): use tmp_dir for elsewhere path so it stays absolute on Windows
Windows runs treated `/tmp/elsewhere/x.md` as relative because Windows
absolute paths require a drive letter, so `_classify_drawer` routed
`drawer_out_of_scope` to `no_source` instead of `out_of_scope` and
`test_dry_run_classifies_correctly` failed on test-windows.

`Path(tmp_dir) / "elsewhere" / "x.md"` is absolute on every platform
and still lives outside the project root that the synced_world fixture
exposes via `repo_path`, so the bucket assertions hold cross-platform.
2026-05-09 04:12:18 +05:00
mvalentsev 18f877869b style(test_sync): match CI ruff 0.4.x format 2026-05-09 04:04:56 +05:00
mvalentsev 0ff4121404 fix(sync): symmetric source_file resolve + perf optimizations
CI fix: `_classify_drawer` now resolves `source_file` symmetric to
`project_roots` (which `_normalize_project_dirs` and
`_auto_detect_project_roots` already `.resolve()`). Without this, on
platforms where the temp directory is a symlink (macOS `/var/folders` ->
`/private/var/folders`, Windows 8.3 short-name normalization), every
drawer mis-bucketed as `out_of_scope` and survived prune.

Perf:
- `_resolve_project_root`: early-return on first match (sorted-desc
  precondition).
- `_normalize_project_dirs`: sort `(-len(str(p)), str(p))` desc for
  early-return + deterministic tie-break on equal-length paths.
- `_auto_detect_project_roots`: `seen_sources` dedupe so a 200-chunk
  file costs one disk walk, not 200.
- `sync_palace` main loop: per-file classification cache; registry
  sentinels (`_reg_*`, `room=_registry`, `ingest_mode=registry`) routed
  to "kept" before cache lookup so a sentinel sharing a `source_file`
  with a pruned drawer cannot inherit a stale "gitignored" verdict.
- Closet purge: collapse O(N) per-file purge into one
  `where={"source_file": {"$in": [...]}}` get + one bulk delete.

Tests (5 new in `TestSyncPalace`, 38 total):
- `test_symlinked_project_root_resolves`: pins symmetric resolve via
  real `os.symlink` (skipped on Windows).
- `test_classification_cache_avoids_redundant_disk_hits`: monkeypatch
  counter on `_classify_drawer` asserts `call_count == 1` for 5 chunks
  sharing one source_file.
- `test_closet_batch_purge_single_call`: wraps closets collection with
  `CallCountingCol` (forwards `.get`/`.delete`); asserts
  `delete_calls == 1` and `get_calls == 1`; expected `removed_closets`
  derived from `report["by_source"]` to stay robust to fixture changes.
- `test_registry_check_runs_before_cache_lookup`: a regular drawer
  caches "gitignored" first; a sentinel with the same source_file must
  still be kept.
- `test_normalize_project_dirs_sort_stable_on_equal_length`: pins the
  alphabetical secondary key when paths share length.
2026-05-09 04:01:34 +05:00
mvalentsev 1d3eecbf9d feat(sync): add gitignore-aware drawer prune (#1252)
Add `mempalace sync` CLI command and `mempalace_sync` MCP tool that
prune drawers whose source files are gitignored, deleted, or moved
out of the project. Reuses the existing GitignoreMatcher
infrastructure in mempalace/miner.py so the same gitignore rules
that block ingest also drive the corresponding cleanup.

Closes #1252.
2026-05-09 03:16:03 +05:00
Igor Lins e Silva 3a76360301 fix(hooks): per-target PID guard with atomic claim (#1212, #1206)
The hook PID guard used a single global ``~/.mempalace/hook_state/mine.pid``
file, which failed two ways:

1. ``_mine_already_running`` read-then-spawn was a TOCTOU race. Two
   near-simultaneous Stop hook fires both passed the existence/liveness
   check before either wrote — so both ended up calling
   ``_spawn_mine``.

2. ``_spawn_mine`` unconditionally overwrote the global PID file with
   the new child's PID. The first PID was lost, orphaning the first
   child. The user-visible result in #1212 was two concurrent
   ``mempalace mine`` processes running against the same source, both
   driving HNSW inserts in parallel — exactly the corruption pattern
   the guard was meant to prevent. #1206 reported the same shape from
   the perspective of the user (two mines hung on a 350MB folder).

Replace the global file with per-target slots under
``~/.mempalace/hook_state/mine_pids/``, keyed by sha256 of the mine
sub-arguments (everything after ``mine``). The slot is claimed via
``O_CREAT | O_EXCL`` so the claim is atomic — two simultaneous fires
can never both pass.  Stale slots (PID exists but is dead) are
reclaimed transparently. Different targets (e.g. project mine vs
transcript ingest, or two different MEMPAL_DIRs) get independent
slots and run in parallel.

The mine subprocess receives its slot path via
``MEMPALACE_MINE_PID_FILE`` env var; ``miner._cleanup_mine_pid_file``
reads that var on exit and removes the slot if it points at our PID,
so orphaned slots from crashed mines don't accumulate.

Also routes ``_ingest_transcript`` through ``_spawn_mine`` so the
transcript ingest path now participates in the same dedup — repeated
Stop fires for the same transcript no longer stack parallel mines.

Closes #1212
Closes #1206
2026-05-08 02:09:00 -03:00
Igor Lins e Silva c70d5182dd Merge pull request #1413 from MemPalace/fix/1264-mine-lock-holder-diagnostics
fix(mine): identify lock holder + exit non-zero on contention (#1264)
2026-05-08 01:59:00 -03:00
Igor Lins e Silva 11a35de5ac test(palace): set USERPROFILE too so the lock-path test works on Windows
os.path.expanduser("~") reads HOME on POSIX but USERPROFILE on Windows;
the lock-body bound test was monkeypatching HOME only, so on
test-windows the lock file landed in the runner's real ~/.mempalace
and the tmp_path glob found nothing.

Patch USERPROFILE in addition to HOME, and read the body as bytes so
the byte-0 sentinel doesn't trip a UTF-8 decode warning. Assertion
shifts from line-count to size-bound (still detects unbounded growth
across re-acquires).
2026-05-08 01:34:46 -03:00
Igor Lins e Silva ef8d83cc8a fix(mine): identify lock holder + exit non-zero on contention
When a `mempalace mine` collided with another writer (live mcp_server,
another mine, anything taking mine_palace_lock), the operator saw a
generic "another `mempalace mine` is already running" message and the
CLI exited 0 — making the contention invisible to nohup or scripts
checking $?. The reporter ran a `nohup mempalace mine ... & disown`
and got a 200-byte log with only the auto-defaults warning, no clue
that an MCP server was holding the store.

palace.py: the lock file now records the holder's PID + first three
argv tokens on acquire. A failed acquire reads the file and surfaces
"palace <path> is held by PID N (mempalace mcp_server); wait for it
to finish or stop the holder before retrying" in the
MineAlreadyRunning message. Open mode changes from "w" to "a+" so the
prior holder's identity survives long enough to be read.

miner.mine() now lets MineAlreadyRunning propagate. cmd_mine catches
it, prints the holder-aware message to stderr, and exits non-zero so
shell wrappers detect the contention.

Note: this is a behavior change for in-process callers that depended
on miner.mine() silently swallowing MineAlreadyRunning. The silent
swallow was the bug.

Closes #1264
2026-05-08 01:00:00 -03:00
Igor Lins e Silva 71804c0aa5 fix(hooks): detach Popen children so the hook can exit on Windows
The Stop hook spawns mining subprocesses via subprocess.Popen and then
returns. On Windows the parent stays blocked at session end because the
child inherits stdout/stderr handles and the OS waits for them to
release before the parent can exit — the user-visible symptom is the
"running stop hooks... 3/3" spinner hanging for minutes (#1268).

Add _detached_popen_kwargs() helper that returns the right detach knobs
per platform:
- POSIX: start_new_session=True, stdin=DEVNULL, close_fds=True
- Windows: creationflags=DETACHED_PROCESS|CREATE_NEW_PROCESS_GROUP|
  CREATE_BREAKAWAY_FROM_JOB, stdin=DEVNULL, close_fds=True

Apply to all three fire-and-forget Popen sites in hooks_cli:
_spawn_mine, _ingest_transcript, _desktop_toast. Leave _mine_sync's
subprocess.run alone — that path is intentionally synchronous (the
precompact hook must wait for the mine to finish).

Note: the issue body references mempalace-stop.js, which does not exist
in this repo (the plugin ships shell wrappers calling Python). The
mechanism described — child holds parent open via inherited handles —
is universal, so this fix targets the equivalent symptom in our Python
hook path. Will follow up on the upstream JS file with the reporter.
2026-05-08 00:55:11 -03:00
Igor Lins e Silva ea36a00f5f Merge pull request #1406 from MemPalace/fix/925-diary-content-hash
fix(diary): detect same-size edits via content hash (#925)
2026-05-07 17:53:24 -03:00
Stephen Coogan ead2c5d299 fix(miner): use token-boundary matching in detect_room
Substring checks in path/filename routing caused systemic misrouting
in large monorepos — e.g., "views" ⊂ "interviews" sent every file
under views/ to the interviews room. Switch to separator-bounded
token matching (-, _, ., /) via a _name_matches helper, applied to
priority 1 (path parts) and priority 2 (filename).
2026-05-07 21:44:30 +01:00
Igor Lins e Silva 26bc3d4f91 test(diary): write fixture with explicit utf-8 to fix Windows hash assert
test_legacy_state_backfills_content_hash failed on test-windows because
Path.write_text without an encoding uses the system locale (cp1252 on
Windows). The em dash was written as 0x97, then read back by
diary_ingest as UTF-8 with errors=replace — round-trip produced
different bytes than the in-Python literal, so the assertion comparing
the persisted hash to sha256(text.encode(utf-8)) diverged.

Pin the write side to encoding=utf-8 so the on-disk bytes match what
diary_ingest decodes. No production change.
2026-05-07 17:41:19 -03:00
Igor Lins e Silva ba30ab6951 Merge pull request #1405 from MemPalace/fix/1156-exporter-reject-symlinks
fix(exporter): refuse symlinks at export targets (#1156)
2026-05-07 17:38:23 -03:00
Igor Lins e Silva 2ff6283b32 fix(diary): rebuild closets on hash change + backfill legacy state
Address Copilot review on #925:

- Full closet rebuild whenever the content hash differs from prior
  state, not only on entry-count growth. Without this, an in-place
  edit (same entry count, different body) updated the drawer but
  left the closet/search index stale — defeats the verbatim guarantee
  at the search layer even if the drawer is correct.
- Legacy size-only skip path now records the computed content_hash
  back into state so subsequent runs use the strict hash check
  instead of remaining on the size-only path indefinitely.
- Test updates: typo direction in the regression test now matches the
  comment (typo "Teh" → fix "The"), assertion now also checks the
  closet collection reflects the edit, and a new test exercises the
  legacy-state backfill path.
2026-05-07 12:54:09 -03:00
Igor Lins e Silva 75452380a8 fix(exporter): refuse symlinks at file targets and skip tests on Windows
Address Copilot review on #1156:

- Per-file symlink check via new _safe_open_for_write() helper. Uses
  O_NOFOLLOW on POSIX (close TOCTOU window between islink check and
  open) and falls back to islink + open on Windows. Applied to room
  files and index.md, mirroring the existing dir-level check.
- Tests now wrap os.symlink() in _try_symlink_or_skip() so Windows
  without Developer Mode and restricted CI sandboxes skip rather than
  hard-fail. Added two regression tests for the file-level cases
  (room file, index.md).
2026-05-07 12:51:47 -03:00
Igor Lins e Silva 8e21b5abd4 test(closet_llm): use _ for unused return values per copilot review 2026-05-07 12:49:27 -03:00
Igor Lins e Silva 0d1c1fbcaa fix(diary): detect same-size edits via content hash
The skip-if-unchanged check compared byte length only, so any in-place
edit preserving total length (typo fix "teh"→"the", word swap) was
silently dropped — a verbatim-storage violation: the user's actual
words never reached the palace.

Switch the gate to sha256(text). State entries gain a "content_hash"
field; the legacy size-only path is preserved when prev_hash is missing
so a post-upgrade run does not re-ingest every untouched diary.

Closes #925
2026-05-07 12:42:02 -03:00
Igor Lins e Silva 40e2c8b056 fix(exporter): refuse symlinks at export targets
A symlink pre-placed at the export output_dir or any wing subdirectory
would redirect markdown writes to wherever the symlink points. The
miner already rejects symlinked inputs via Path.is_symlink(); the
exporter should apply the same caution to outputs.

Add _reject_symlink() helper and call it before makedirs on both
output_dir and each wing_dir. Refusal raises ValueError with a clear
message rather than silently falling through.

Closes #1156
2026-05-07 12:40:26 -03:00
Igor Lins e Silva 2a0ed0cb8f fix(closet_llm): retry _call_llm on JSONDecodeError instead of bailing
The retry loop already backs off on HTTP 429/503 and rate-limit-shaped
exceptions, but JSONDecodeError exited on the first failure. Local LLM
runtimes occasionally produce malformed JSON (truncated streams, partial
chunks under load), and the retry was effectively dead for that path.

Mirror the 429/503 branch: sleep with exponential backoff and continue
through all 3 attempts, only returning None after the final failure.

Closes #1155
2026-05-07 12:38:39 -03:00
Igor Lins e Silva 7b151039c9 test(repair): page-align corruption offset in preflight regression test
Address Copilot review on #1403: the test seeked unconditionally to
offset 40960 with only `pre_size > 16384` as a guard. If pre_size sat
between 16384 and 40960 + 16384 = 57344 (e.g., on a chromadb version
that allocated fewer pages on init, or a future schema change), the
seek would extend the file with zero-padding and the original pages
would stay intact — quick_check would still pass on the (untouched)
real data, and the regression guard would silently skip detecting a
preflight-ordering regression.

Compute the offset from pre_size, page-aligned, with explicit asserts
that the file is large enough to mangle 4 pages without truncating
the header or extending past EOF.
2026-05-07 12:07:54 -03:00
Igor Lins e Silva 5134a635ed fix(repair): run SQLite integrity preflight before chromadb open
#1364 added the SQLite quick_check preflight to rebuild_index, but
placed it AFTER backend.get_collection(...). On a SQLite-corrupt
palace, chromadb's rust binding raises pyo3_runtime.PanicException —
which is not a regular Exception subclass — so it propagates past the
existing `except Exception` handlers and the user sees a 30-line stack
trace instead of the friendly abort message #1364 was designed to
deliver. Reproduced with `mempalace repair --yes` against a palace
whose chroma.sqlite3 has 4 mangled pages: pre-fix, panic; post-fix,
the clean abort message and exit code 1.

Two changes:

- mempalace/cli.py cmd_repair: run sqlite_integrity_errors() right
  after the basic palace-existence check, BEFORE the max_seq_id
  preflight (which itself opens sqlite3) and BEFORE backend =
  ChromaBackend(). Exit non-zero so unattended scripts and CI gates
  see the failure.

- mempalace/repair.py rebuild_index: same move at the function level
  for direct callers (tests, MCP) that bypass cmd_repair.

The new test test_rebuild_index_runs_sqlite_preflight_before_chromadb_open
uses a real chromadb-built palace (no ChromaBackend mock) plus a
real corrupt SQLite (16 KB of mangled pages) so the ordering is
exercised end-to-end. The previously-shipping test for the abort path
mocked both the backend and sqlite_integrity_errors, which is why the
ordering bug shipped CI-green.

Six existing test_cli.py cmd_repair tests used `(palace_dir /
"chroma.sqlite3").write_text("db")` to fake the SQLite file. The new
preflight correctly fails quick_check on those 2-byte stubs, so the
tests now create empty real SQLite DBs the same way the test_repair.py
fixtures already do.
2026-05-07 11:52:58 -03:00
Igor Lins e Silva aecd543d75 merge: develop into fix/1362-repair-sqlite-integrity-preflight (round 2)
#1357 (max_seq_id preflight) merged into develop while this branch
was in CI, opening a fresh conflict between the two preflight helpers.

mempalace/repair.py:
- Kept both: this branch's sqlite_integrity_errors() / print_sqlite_
  integrity_abort() AND develop's maybe_repair_poisoned_max_seq_id_
  before_rebuild() from #1357. They check for distinct corruption
  classes and run as separate preflights.

tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors test group and
  develop's max_seq_id preflight test group; non-overlapping coverage.

Local: 1623 tests pass, ruff lint+format clean against 0.4.x CI pin.
2026-05-07 10:45:29 -03:00
Igor Lins e Silva 3893228e03 Merge pull request #1342 from fatkobra/fix/1274-missing-hnsw-metadata-gate
fix(storage): quarantine partial HNSW flush without metadata (#1274)
2026-05-07 10:42:20 -03:00
Igor Lins e Silva f2291b0320 merge: develop into fix/1362-repair-sqlite-integrity-preflight
Conflicts opened by #1285 (temp-staging rebuild) and #1312
(collection_name in recovery paths) merging after this branch was
authored.

mempalace/repair.py:
- Kept this branch's sqlite_integrity_errors() and
  print_sqlite_integrity_abort() helpers; took develop's rebuild_index
  signature with the collection_name parameter from #1312. Normalized
  the helper's print indent to 2 spaces to match the rest of the file.

tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors tests and develop's
  rebuild_from_sqlite + configured-collection coverage.
- Replaced 7 sites of sqlite_path.write_text("fake") with
  sqlite3.connect(...).close() — write_text("fake") fails PRAGMA
  quick_check, so the new preflight aborts before the rebuild logic
  the tests actually exercise. An empty real SQLite DB passes
  quick_check and lets the tests run as intended.
- Took develop's temp-staging assertion shape (delete/create the
  __repair_tmp collection in addition to the live drawers collection)
  for the existing test_rebuild_index_success test.

Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
2026-05-07 10:41:27 -03:00
Igor Lins e Silva a1e19081d9 merge: develop into fix/1274-missing-hnsw-metadata-gate
Two conflicts, both because #1339 (bloated link payloads) merged into
develop after this branch was authored:

- mempalace/backends/chroma.py: _segment_appears_healthy now stacks
  three checks — bloated-link from #1339 (top), missing-metadata-with-
  data-floor from this branch (middle), pickle format sniff (bottom).
  All three are complementary; #1339 catches structural payload
  corruption, this branch catches pickle truncation, the original
  catches pickle protocol-byte corruption.

- tests/test_backends.py: kept both new imports (_segment_appears_healthy
  from this branch, quarantine_invalid_hnsw_metadata from #1285).

Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
2026-05-07 10:35:19 -03:00
Igor Lins e Silva bdaac9d9a6 merge: develop into fix/1295-repair-max-seq-id-preflight
Three conflicts, all from develop landing #1285/#1310/#1312 after this
branch was authored:

- mempalace/cli.py: keep both import sets — this branch's
  maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's
  RebuildCollectionError / _close_chroma_handles / _extract_drawers /
  _rebuild_collection_via_temp added in #1285.

- mempalace/repair.py: keep this branch's
  maybe_repair_poisoned_max_seq_id_before_rebuild definition; use
  develop's rebuild_index signature with the collection_name parameter
  added in #1312. Normalized print indent to 2 spaces matching the
  rest of the file.

- tests/test_repair.py: keep both this branch's max_seq_id preflight
  tests and develop's rebuild_from_sqlite + configured-collection-name
  tests; they exercise distinct code paths and don't overlap.

Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
2026-05-07 10:32:29 -03:00
Igor Lins e Silva 88a2ebb85a Merge pull request #1339 from fatkobra/fix/1218-hnsw-link-payload-health
fix(storage): quarantine bloated HNSW link payloads (#1218)
2026-05-07 10:19:47 -03:00
Igor Lins e Silva e272ed3fbf Merge pull request #1359 from fatkobra/fix/1099-migrate-write-roundtrip
fix(migrate): verify write roundtrip before bailout (#1099)
2026-05-07 09:59:58 -03:00
Igor Lins e Silva 72685f3aef Merge pull request #1312 from mjc/stale-chroma-reconnect
fix: use configured collection in recovery paths
2026-05-07 09:40:29 -03:00
Igor Lins e Silva e9aee19433 fix(tests): apply ruff format after rebase resolution
The collection_name plumbing rebase produced a few unformatted blocks
in test_mcp_server.py and test_searcher.py; bringing them in line with
the 0.4.x CI pin so test-windows / lint stay green.
2026-05-07 09:10:22 -03:00
Mika Cohen ec6d2dde01 fix: use configured collection in recovery paths 2026-05-07 09:10:00 -03:00
Igor Lins e Silva 5488e7bb22 fix(miner): harden Windows mine against ONNX bad_alloc + silent partial exits
Three small changes that together address the failure modes in #1296:

1. Add pnpm-lock.yaml and yarn.lock to SKIP_FILENAMES, mirroring the
   existing package-lock.json rule. A 24K-line pnpm-lock.yaml produced
   ~1124 chunks in one batch and tripped onnxruntime bad_alloc on
   Windows; pnpm/yarn lockfiles are no more useful to mine than npm's.

2. Skip any file that produces more than MAX_CHUNKS_PER_FILE (500)
   chunks, with a clear log line. Catches the broader class — generated
   CSV/JSON, build artifacts, etc. — that the named-file SKIP list will
   never fully cover. The cap is conservative (500 chunks * 800 chars ≈
   400 KB of source) so legitimate hand-written content still mines.

3. Print a partial-progress summary on any exception in _mine_impl, not
   just KeyboardInterrupt, then re-raise. Without this, an arbitrary
   exception (ONNX bad_alloc, chromadb HNSW error, OS fault) propagates
   silently — the operator sees only the last progress line and assumes
   the mine succeeded. The new path mirrors the KeyboardInterrupt
   summary (files_processed, drawers_filed, last_file) plus the
   exception type and message, then re-raises so the original traceback
   surfaces and the exit code is non-zero.

Tests cover: SKIP_FILENAMES contents, the chunk-cap path returning
(0, room) with no upserts, and the new mine-aborted summary surfacing
both the partial counters and the exception class.
2026-05-07 08:56:41 -03:00
Igor Lins e Silva 88493acd0d Merge pull request #1285 from mjc/hnsw-repair
fix: harden Chroma repair preflight and rollback recovery
2026-05-07 08:06:59 -03:00
Igor Lins e Silva be6dc033fd merge: develop into hnsw-repair (resolve chroma.py + test_backends.py conflicts)
Develop (post-#1162 lock-plumbing era) refactored the per-open quarantine
pass into ChromaBackend._prepare_palace_for_open. This branch's
inline-expansion form added quarantine_invalid_hnsw_metadata as a third
check, plus a "discard from _quarantined_paths on inode swap" guard so
re-opens against a different physical DB re-run quarantine.

Resolution merges both:

- _prepare_palace_for_open now also calls quarantine_invalid_hnsw_metadata,
  gated by the same _quarantined_paths set.
- _client keeps the inode_changed -> _quarantined_paths.discard() guard
  before calling the helper, so a fresh inode triggers a fresh pass.
- make_client collapses to a single _prepare_palace_for_open() call.
- test_backends.py keeps both the pickle (#1285) and shutil (develop)
  imports — both are used.
2026-05-07 07:48:45 -03:00
Igor Lins e Silva 670aba974f test(repair): close ChromaBackend in _seed_palace to release Windows file locks
The helper opened a chromadb PersistentClient via ChromaBackend and never
closed it, leaving rust-side SQLite/HNSW file locks alive after the
helper returned. On Windows that blocks the in-place archive rename
inside rebuild_from_sqlite with WinError 32 on data_level0.bin,
causing test_rebuild_from_sqlite_in_place_archives_when_opted_in and
test_rebuild_from_sqlite_raises_on_upsert_failure to fail in the
test-windows CI job. No test consumes the returned collection, so
closing the backend in a try/finally is safe and drops the return.
2026-05-07 07:37:25 -03:00
Igor Lins e Silva 8d8f54a807 Merge remote-tracking branch 'origin/develop' into fix/1308-rebuild-from-sqlite 2026-05-07 07:30:56 -03:00
Igor Lins e Silva e334e257bf fix(mcp): retry _get_collection once on transient failure (#1286)
A transient chromadb exception inside `_get_collection` was swallowed by
the bare `except Exception: return None`, leaving every subsequent tool
call hitting the same poisoned cache silently. The fix wraps the body
in a `for attempt in range(2)` loop: on attempt 0 failure, log via
`logger.exception(...)` and clear `_client_cache` / `_collection_cache`
/ `_metadata_cache` so the next iteration forces `_get_client()` to
rebuild from scratch — that path now re-runs `quarantine_stale_hnsw`
(per #1322), so the second attempt heals the common stale-handle case
automatically. If both attempts fail, return `None` (matches the prior
contract for permanent failures).

Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:
- `test_get_collection_retries_once_on_exception` — first attempt raises
  via a monkeypatched `_get_client`, second attempt succeeds; assert the
  caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures` — both attempts
  fail, assert we exhaust the loop and return None (no infinite retry).

Surgical extraction from PR #1286, which carried the same fix idea
(plus a fork-sync bundle that couldn't be merged); credit to the
original author below.

Co-authored-by: Jeffrey Hein <jp@jphein.com>
2026-05-06 04:52:18 -03:00
Brian potter d92c741084 fix(repair): address PR #1310 review feedback
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on #1310:

- repair.py: drawers collection name now resolves from
  MempalaceConfig().collection_name via _drawers_collection_name() (closets
  stays fixed by design — AAAK index references drawer IDs by string).
  Lines up with the broader configured-collection work in #1312 so that
  PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
  _rebuild_one_collection so a Chroma "Collection already exists" failure
  surfaces as RebuildPartialError with archive_path, not an unstructured
  exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
  with backend.close() so PersistentClient handles to dest_palace are
  released on every exit path. The from-sqlite path post-dates #1285's
  lifecycle hardening of the legacy rebuild, so this needed its own
  cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
  rebuild_from_sqlite returns {} (validation refusal sentinel), so
  unattended scripts/CI distinguish "invalid inputs" from a successful
  rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
  now asserts every backing segment is scope='METADATA', locking in the
  segment-layout assumption against future regressions that point the
  JOIN at the VECTOR segment.

New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 04:37:22 -03:00
Brian potter a7c4ed24d7 fix(repair): add --mode from-sqlite to recover palaces with corrupt HNSW (#1308)
Both `--mode legacy` and the inline `cli.cmd_repair` rebuild path
call `Collection.count()` as their first read — the same call that
raises `chromadb.errors.InternalError: Failed to apply logs to the
hnsw segment writer` on the corruption class reported in #1308.
Repair would print "Cannot recover — palace may need to be re-mined
from source files" even though the underlying SQLite tables were
fully intact.

The new `--mode from-sqlite` reads `(id, document, metadata)` rows
directly from `chroma.sqlite3` via `segments` → `embeddings` →
`embedding_metadata` joins, never opens a chromadb client against
the corrupt palace, and re-upserts everything into a fresh palace.

  - `--source PATH` extracts from a corrupt palace already moved aside
  - `--archive-existing` handles the in-place case by renaming the
    existing palace to `<palace>.pre-rebuild-<timestamp>` first
  - Partial-rebuild failures raise `RebuildPartialError` with the
    archive path so users can recover; CLI exits non-zero
  - In-place mode calls `SharedSystemClient.clear_system_cache()` to
    drop chromadb's process-wide System registry (cross-palace use
    does not, to limit blast radius for library callers)
  - Source validation runs before any destructive moves

Verified end-to-end recovering a 52,300-row real-world corrupt
palace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 04:36:39 -03:00
Igor Lins e Silva 9b24cfc93b Merge pull request #987 from alpiua/fix-mcp-null-payload
fix(mcp): handle null JSON-RPC request payloads safely
2026-05-06 03:33:45 -03:00
Igor Lins e Silva 869ab38095 style: ruff format mcp_server.py + test_mcp_server.py (CI lint) 2026-05-06 02:19:57 -03:00
Oleksii Pylypchuk a85d432b54 feat: add validation for missing name parameter in tools/call requests 2026-05-06 02:19:57 -03:00
Oleksii Pylypchuk 55d79dc8cd fix: include null id in JSON-RPC invalid request error responses and add validation tests 2026-05-06 02:19:57 -03:00
Igor Lins e Silva aac8437979 style: ruff format tests/test_searcher.py (CI lint) 2026-05-06 02:19:54 -03:00
eldar702 5347c2c71c fix(searcher): clamp effective_distance to valid cosine range [0, 2]
``search_memories`` computes ``effective_dist = dist - boost`` where
``boost`` can be as large as ``CLOSET_RANK_BOOSTS[0] == 0.40`` for a
rank-0 closet hit. When the raw drawer distance is small — any
near-exact match — the subtraction goes negative.

Two downstream effects:

1. Line 418 returns ``round(max(0.0, 1 - effective_dist), 3)`` as
   ``similarity``. With ``effective_dist = -0.30`` that yields
   ``similarity = 1.30``, outside the documented ``[0, 1]`` range.
   The ``max(0.0, ...)`` only prevents negative similarities; it does
   not cap above 1.
2. Line 427 stores ``_sort_key: effective_dist`` and line 435 sorts
   ``scored`` ascending by that key. A negative key drops *below* the
   rest, so the strongest hybrid matches end up sorting after weaker
   ones — ranking inversion under the exact conditions hybrid retrieval
   is supposed to serve best.

Clamp ``effective_dist`` to the valid cosine-distance range ``[0, 2]``.
The boost still wins (closet-backed hit still ranks first), it just no
longer flips the order.

Test added: mock drawer_col (base dist 0.08 / 0.35 for two sources) +
closet_col (rank-0 closet for the 0.08 source) → assert all hits have
``0 <= similarity <= 1`` and ``0 <= effective_distance <= 2``, and that
the closet-boosted source still ranks first.

Relationship to other PRs:

* **#988** clamps the output ``similarity`` alone. That does not fix
  the sort-key inversion or the invalid ``effective_distance`` in the
  returned dict. This PR clamps at the arithmetic source so both
  downstream users of the value stay in range.
* Orthogonal to **#979** (``tool_check_duplicate`` negative similarity).
2026-05-06 02:19:54 -03:00
Igor Lins e Silva f854da779f fix(lint): hoist hooks_cli_mod import to top of test_hooks_cli (E402)
The alias was placed below an explanatory comment block introduced by
#1305, which trips ruff E402 (module-level import not at top of file).
Moved next to the existing 'from mempalace.hooks_cli import (...)' line.

CI lint went red on develop after #1305 merged with the failing check;
this re-greens it so subsequent PRs do not inherit the failure.
2026-05-06 01:57:44 -03:00
Igor Lins e Silva 67cda9d455 Merge pull request #1030 from eldar702/fix/none-metadata-residual-guards
fix: guard None metadata/doc in tool_check_duplicate and Layer1/Layer2
2026-05-06 01:51:24 -03:00
Igor Lins e Silva d9ab5b7fd3 Merge pull request #1305 from lcatlett/upstream/respect-absent-palace-dir
fix(hooks): treat absent ~/.mempalace as auto-save off
2026-05-06 01:49:22 -03:00
Igor Lins e Silva ea6f2c0c4c Merge pull request #1162 from imtylervo/fix/palace-write-lock-queue-pattern
fix: serialize ChromaCollection writes through palace lock
2026-05-06 01:48:51 -03:00