Address Copilot review: cursor tie-break, honest metrics, accurate comments

Six items from the automated review on PR #998:

1. **Cursor tie-break bug (correctness).** The skip condition was `rec.timestamp <= cursor`; if multiple messages share the max timestamp and only some were ingested before a crash, the rest would be lost forever. Changed to `< cursor`, relying on deterministic drawer IDs for safe re-attempt at the boundary. Regression test `test_sweep_recovers_untaken_message_at_cursor_timestamp`.

2. **`drawers_added` counted upserts, not adds.** Added a pre-flight `collection.get(ids=batch)` to distinguish new rows from already-present ones. Return value now carries `drawers_added`, `drawers_already_present`, `drawers_upserted`, and `drawers_skipped` separately. Dict-compatible access (`existing.get("ids")`) keeps it working on both the raw Chroma return and the typed `GetResult`.

3. **`sweep_directory` hid failures in the summary.** `files_processed` used to exclude failed files. Replaced with `files_attempted` (all discovered) + `files_succeeded` (subset that completed); CLI output shows `succeeded/attempted`.

4. **Coordination claim was overstated.** The primary miners don't stamp `session_id`/`timestamp` metadata, so the sweeper coordinates only with its own prior runs. Softened docstrings on module and CLI command. Uniform cross-miner metadata is flagged as a follow-up.

5. **MAX_FILE_SIZE comments were misleading.** Said source size "does not affect storage or embedding cost". That is true per-drawer, but source size still scales drawer count, embedding work, and memory usage (files are read in full, not streamed). Corrected in both `miner.py` and `convo_miner.py`.

6. Added the tie-break regression test that reproduces the correctness bug from (1).

Tests: 970 passed (was 969), ruff + pre-commit clean.

Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
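Items 1 and 2 can be illustrated with a small self-contained sketch. This is not the sweeper's actual code: `Record`, `sweep_records`, and the dict standing in for the collection are hypothetical, and the exact meaning given to each counter here is an assumption. It only shows why a strict `<` skip combined with idempotent upserts on deterministic IDs recovers a message stranded at the cursor timestamp, and how a pre-flight lookup separates real adds from re-upserts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    drawer_id: str  # assumed deterministic: same message always maps to the same ID
    timestamp: str  # ISO-8601 UTC in one fixed format, so string order equals time order
    content: str


def sweep_records(records, store, cursor=None):
    """Upsert every record that is not strictly older than `cursor`.

    `store` is a plain dict standing in for the collection (hypothetical).
    """
    # Skip only records with timestamp < cursor. Records AT the cursor are
    # re-attempted, which is safe because the upsert below is idempotent.
    candidates = [r for r in records if cursor is None or r.timestamp >= cursor]

    # Pre-flight lookup: which candidate IDs are already stored?
    already_present = {r.drawer_id for r in candidates if r.drawer_id in store}

    for r in candidates:
        store[r.drawer_id] = r.content  # upsert: insert or overwrite

    return {
        "drawers_added": len(candidates) - len(already_present),
        "drawers_already_present": len(already_present),
        "drawers_upserted": len(candidates),
        "drawers_skipped": len(records) - len(candidates),
    }
```

With a `<=` skip, the filter would read `r.timestamp > cursor`, and a record sharing the cursor timestamp that was never written would be dropped on every later run.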
@@ -225,6 +225,72 @@ class TestSweeperTandem:
            "coordination is broken."
        )

    def test_sweep_recovers_untaken_message_at_cursor_timestamp(self, tmp_path):
        """Regression for Copilot PR #998 review: with a `<= cursor` skip,
        any message sharing the max timestamp but not yet ingested (e.g.
        crash mid-batch) would be lost forever. The skip must be `<` and
        tie-break via deterministic drawer ID.

        Scenario: three messages share timestamp T. First sweep ingests
        two of them and the process dies before the third. Second sweep
        must pick up the third — not skip it because cursor == T.
        """
        from mempalace.palace import get_collection
        from mempalace.sweeper import (
            _drawer_id_for_message,
            parse_claude_jsonl,
            sweep,
        )

        shared_ts = "2026-04-18T11:00:00Z"
        lines = [
            {
                "type": "user",
                "timestamp": shared_ts,
                "sessionId": "s-tie",
                "uuid": f"u-{i}",
                "message": {"role": "user", "content": f"msg {i}"},
            }
            for i in range(3)
        ]
        jsonl_path = tmp_path / "tied.jsonl"
        jsonl_path.write_text("\n".join(json.dumps(x) for x in lines) + "\n")

        palace_path = str(tmp_path / "palace")
        # Simulate a partial ingest: write 2 of 3 directly via the backend
        # with the same drawer IDs the sweeper would use.
        col = get_collection(palace_path, create=True)
        recs = list(parse_claude_jsonl(str(jsonl_path)))
        partial_ids = [_drawer_id_for_message(r["session_id"], r["uuid"]) for r in recs[:2]]
        col.upsert(
            ids=partial_ids,
            documents=[f"USER: {r['content']}" for r in recs[:2]],
            metadatas=[
                {
                    "session_id": r["session_id"],
                    "timestamp": r["timestamp"],
                    "message_uuid": r["uuid"],
                    "role": r["role"],
                    "ingest_mode": "sweep",
                }
                for r in recs[:2]
            ],
        )

        # Now run the sweeper. It must pick up the 3rd message, not skip
        # it because cursor == its timestamp.
        result = sweep(str(jsonl_path), palace_path)
        assert result["drawers_added"] == 1, (
            f"Sweeper lost the untaken message at cursor timestamp. "
            f"Expected drawers_added=1 (the 3rd record), got "
            f"{result['drawers_added']}. Cursor skip is still `<=` "
            "instead of `<`, or tie-break via drawer-id is broken."
        )
        assert result["drawers_already_present"] == 2, (
            f"Expected 2 drawers already present (the partial ingest), "
            f"got {result['drawers_already_present']}."
        )


class TestSweeperDrawerMetadata:
    """Each drawer must carry the metadata the tandem-miner coordination
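
The regression test depends on `_drawer_id_for_message(session_id, uuid)` returning the same ID for the same message on every run, which is what makes re-upserting at the cursor boundary harmless. The diff does not show that helper, so the following is only a guess at the general shape of such a function: the name `drawer_id_for_message_sketch`, the `drawer_` prefix, and the choice of SHA-256 are all assumptions, not the project's implementation.

```python
import hashlib
import json


def drawer_id_for_message_sketch(session_id: str, message_uuid: str) -> str:
    """Hypothetical deterministic ID: a stable hash of (session_id, message_uuid)."""
    # JSON-encode the pair so the boundary between the two fields is
    # unambiguous; ("a:b", "c") and ("a", "b:c") must not collide.
    key = json.dumps([session_id, message_uuid], ensure_ascii=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"drawer_{digest[:24]}"
```

Because the ID depends only on the message's identity and not on ingest time or batch position, a second sweep that re-reads an already stored message overwrites the same row instead of creating a duplicate.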