Address Copilot review: cursor tie-break, honest metrics, accurate comments

Six items from the automated review on PR #998:

1. **Cursor tie-break bug (correctness).** The skip condition was
   `rec.timestamp <= cursor`; if multiple messages share the max
   timestamp and only some were ingested before a crash, the rest
   would be lost forever. Changed to `< cursor`, so records at the
   boundary timestamp are re-attempted; deterministic drawer IDs make
   that re-attempt a safe re-upsert of the same rows.
   Regression test:
   `test_sweep_recovers_untaken_message_at_cursor_timestamp`.

2. **`drawers_added` counted upserts, not adds.** Added a pre-flight
   `collection.get(ids=batch)` to distinguish new rows from already-
   present ones. Return value now carries `drawers_added`,
   `drawers_already_present`, `drawers_upserted`, and `drawers_skipped`
   separately. Dict-compatible access (`existing.get("ids")`) keeps it
   working on both the raw Chroma return and the typed `GetResult`.

3. **`sweep_directory` hid failures in the summary.** `files_processed`
   used to exclude failed files. Replaced with `files_attempted` (all
   discovered) + `files_succeeded` (subset that completed); CLI output
   shows `succeeded/attempted`.

4. **Coordination claim was overstated.** The primary miners don't
   stamp `session_id`/`timestamp` metadata, so the sweeper coordinates
   only with its own prior runs. Softened docstrings on module and CLI
   command. Uniform cross-miner metadata is flagged as a follow-up.

5. **MAX_FILE_SIZE comments were misleading.** Said source size "does
   not affect storage or embedding cost" — true per-drawer, but source
   size still scales drawer count, embedding work, and memory usage
   (files are read in full, not streamed). Corrected in both
   `miner.py` and `convo_miner.py`.

6. Added the tie-break regression test that reproduces the correctness
   bug from (1).
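
For reviewers, items (1) and (2) can be sketched together as below. This
is a minimal illustration, not the code in `sweeper.py`: a plain dict
stands in for the collection, `drawer_id` and `sweep_records` are
hypothetical names, and `drawers_upserted` is assumed to mean "total rows
written in this run".

```python
import hashlib


def drawer_id(session_id, message_uuid):
    # Hypothetical stand-in for the sweeper's deterministic drawer ID:
    # the same (session, uuid) pair always maps to the same ID.
    key = f"{session_id}:{message_uuid}".encode()
    return "drawer-" + hashlib.sha256(key).hexdigest()[:16]


def sweep_records(records, store):
    # `store` is a dict {drawer_id: metadata} standing in for the collection.
    cursor = max((m["timestamp"] for m in store.values()), default=None)

    pending = []
    for rec in records:
        # Strict `<`: only records older than the cursor are skipped, so
        # records AT the cursor timestamp are re-attempted.
        # Assumes same-format ISO-8601 UTC strings, which sort as text.
        if cursor is not None and rec["timestamp"] < cursor:
            continue
        pending.append(rec)

    ids = [drawer_id(r["session_id"], r["uuid"]) for r in pending]
    # Pre-flight lookup, playing the role of `collection.get(ids=batch)`.
    present = [i for i in ids if i in store]

    for i, rec in zip(ids, pending):
        # Upsert: rewriting an existing ID is harmless.
        store[i] = {"timestamp": rec["timestamp"], "content": rec["content"]}

    return {
        "drawers_added": len(ids) - len(present),
        "drawers_already_present": len(present),
        "drawers_upserted": len(ids),
        "drawers_skipped": len(records) - len(pending),
    }


# Three messages share one timestamp; two were ingested before a crash.
records = [
    {
        "session_id": "s-tie",
        "uuid": f"u-{i}",
        "timestamp": "2026-04-18T11:00:00Z",
        "content": f"msg {i}",
    }
    for i in range(3)
]
store = {
    drawer_id(r["session_id"], r["uuid"]): {
        "timestamp": r["timestamp"],
        "content": r["content"],
    }
    for r in records[:2]
}
result = sweep_records(records, store)
```

With a `<=` skip, all three records would be dropped because each equals
the cursor. With `<`, the two already-stored rows are rewritten in place
and only the third counts as added.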

Tests: 970 passed (was 969), ruff + pre-commit clean.

Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
Igor Lins e Silva
2026-04-18 13:22:18 -03:00
parent 29ce7c7135
commit 4a088ea8e1
5 changed files with 171 additions and 36 deletions
@@ -225,6 +225,72 @@ class TestSweeperTandem:
            "coordination is broken."
        )

    def test_sweep_recovers_untaken_message_at_cursor_timestamp(self, tmp_path):
        """Regression for Copilot PR #998 review: with a `<= cursor` skip,
        any message sharing the max timestamp but not yet ingested (e.g.
        crash mid-batch) would be lost forever. The skip must be `<` and
        tie-break via deterministic drawer ID.

        Scenario: three messages share timestamp T. First sweep ingests
        two of them and the process dies before the third. Second sweep
        must pick up the third, not skip it because cursor == T.
        """
        from mempalace.palace import get_collection
        from mempalace.sweeper import (
            _drawer_id_for_message,
            parse_claude_jsonl,
            sweep,
        )

        shared_ts = "2026-04-18T11:00:00Z"
        lines = [
            {
                "type": "user",
                "timestamp": shared_ts,
                "sessionId": "s-tie",
                "uuid": f"u-{i}",
                "message": {"role": "user", "content": f"msg {i}"},
            }
            for i in range(3)
        ]
        jsonl_path = tmp_path / "tied.jsonl"
        jsonl_path.write_text("\n".join(json.dumps(x) for x in lines) + "\n")
        palace_path = str(tmp_path / "palace")

        # Simulate a partial ingest: write 2 of 3 directly via the backend
        # with the same drawer IDs the sweeper would use.
        col = get_collection(palace_path, create=True)
        recs = list(parse_claude_jsonl(str(jsonl_path)))
        partial_ids = [_drawer_id_for_message(r["session_id"], r["uuid"]) for r in recs[:2]]
        col.upsert(
            ids=partial_ids,
            documents=[f"USER: {r['content']}" for r in recs[:2]],
            metadatas=[
                {
                    "session_id": r["session_id"],
                    "timestamp": r["timestamp"],
                    "message_uuid": r["uuid"],
                    "role": r["role"],
                    "ingest_mode": "sweep",
                }
                for r in recs[:2]
            ],
        )

        # Now run the sweeper. It must pick up the 3rd message, not skip
        # it because cursor == its timestamp.
        result = sweep(str(jsonl_path), palace_path)
        assert result["drawers_added"] == 1, (
            f"Sweeper lost the untaken message at cursor timestamp. "
            f"Expected drawers_added=1 (the 3rd record), got "
            f"{result['drawers_added']}. Cursor skip is still `<=` "
            "instead of `<`, or tie-break via drawer-id is broken."
        )
        assert result["drawers_already_present"] == 2, (
            f"Expected 2 drawers already present (the partial ingest), "
            f"got {result['drawers_already_present']}."
        )


class TestSweeperDrawerMetadata:
    """Each drawer must carry the metadata the tandem-miner coordination