Six items from the automated review on PR #998:
1. **Cursor tie-break bug (correctness).** The skip condition was
`rec.timestamp <= cursor`; if multiple messages share the max
timestamp and only some were ingested before a crash, the rest
would be lost forever. Changed to `< cursor`, relying on
deterministic drawer IDs for safe re-attempt at the boundary.
Regression test
`test_sweep_recovers_untaken_message_at_cursor_timestamp`.
2. **`drawers_added` counted upserts, not adds.** Added a pre-flight
`collection.get(ids=batch)` to distinguish new rows from already-
present ones. Return value now carries `drawers_added`,
`drawers_already_present`, `drawers_upserted`, and `drawers_skipped`
separately. Dict-compatible access (`existing.get("ids")`) keeps it
working on both the raw Chroma return and the typed `GetResult`.
3. **`sweep_directory` hid failures in the summary.** `files_processed`
used to exclude failed files. Replaced with `files_attempted` (all
discovered) + `files_succeeded` (subset that completed); CLI output
shows `succeeded/attempted`.
4. **Coordination claim was overstated.** The primary miners don't
stamp `session_id`/`timestamp` metadata, so the sweeper coordinates
only with its own prior runs. Softened docstrings on module and CLI
command. Uniform cross-miner metadata is flagged as a follow-up.
5. **MAX_FILE_SIZE comments were misleading.** Said source size "does
not affect storage or embedding cost" — true per-drawer, but source
size still scales drawer count, embedding work, and memory usage
(files are read in full, not streamed). Corrected in both
`miner.py` and `convo_miner.py`.
6. Added the tie-break regression test that reproduces the correctness
bug from (1).
Tests: 970 passed (was 969), ruff + pre-commit clean.
Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
Four changes on top of the proposal's initial sweeper draft, driven by
the CLAUDE.md design principles:
1. Drop the 500-char truncation on tool_use / tool_result content in
_flatten_content. The "verbatim always" principle forbids lossy
compression of user-adjacent data; a long code-edit diff handed to
the assistant must round-trip intact. Unknown block types now also
serialize their full payload instead of just a type marker. New test
test_parse_preserves_tool_blocks_verbatim covers a 5000-char input.
2. Use the full session_id in drawer IDs (not session_id[:12]). Rules
out cross-session collisions if a transcript source ever uses
non-UUID session identifiers or shared prefixes.
3. Replace silent `except Exception: return None` in get_palace_cursor
with a logger.warning — the exact anti-pattern this PR otherwise
criticizes in miner.py. The fallback behavior is still safe
(deterministic IDs make a missed cursor recover on the next run),
but the failure is now discoverable.
4. sweep_directory now collects per-file failures into the result dict
and the CLI exits non-zero when any file failed, so a partial-sweep
outcome is visible rather than swallowed.
Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
The primary miners (miner.py, convo_miner.py) operate at file
granularity and can drop data for several reasons: size caps, silent
OSError on read, dedup false positives, extensions the project miner
does not recognize. Even with tonight's hotfixes, any future bug in
the file-level path risks silent data loss.
The sweeper is a second, cooperating miner that works at MESSAGE
granularity:
- Parses Claude Code .jsonl line by line, yielding only
user/assistant records (filters progress, file-history-snapshot,
etc. noise).
- For each session_id, queries the palace for max(timestamp) and
treats that as the cursor.
- Ingests only messages newer than the cursor, as one small drawer
per exchange (never hits a size cap — each drawer is 1-5 KB).
- Deterministic drawer IDs from session_id + message UUID make
reruns idempotent; crash mid-sweep is safe.
Tandem coordination is free: if the primary miner committed up to
timestamp T, the sweeper resumes from T. If the primary miner missed
everything, the sweeper catches it all. Neither duplicates the other.
Smoke test on a real Claude Code transcript:
1st run: +39 drawers, 0 already present
2nd run: +0 drawers, 39 already present (perfect idempotence)
Opt-in via:
mempalace sweep <file.jsonl>
mempalace sweep <transcript-dir>
No changes to existing miners. No schema migration. Purely additive.
Tests: tests/test_sweeper.py (7 tests covering parsing, tandem
coordination, idempotency, resume-from-cursor, metadata correctness).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>