Address Copilot review: cursor tie-break, honest metrics, accurate comments
Six items from the automated review on PR #998:

1. **Cursor tie-break bug (correctness).** The skip condition was `rec.timestamp <= cursor`; if multiple messages share the max timestamp and only some were ingested before a crash, the rest would be lost forever. Changed to `< cursor`, relying on deterministic drawer IDs for safe re-attempt at the boundary. Regression test `test_sweep_recovers_untaken_message_at_cursor_timestamp`.

2. **`drawers_added` counted upserts, not adds.** Added a pre-flight `collection.get(ids=batch)` to distinguish new rows from already-present ones. Return value now carries `drawers_added`, `drawers_already_present`, `drawers_upserted`, and `drawers_skipped` separately. Dict-compatible access (`existing.get("ids")`) keeps it working on both the raw Chroma return and the typed `GetResult`.

3. **`sweep_directory` hid failures in the summary.** `files_processed` used to exclude failed files. Replaced with `files_attempted` (all discovered) + `files_succeeded` (subset that completed); CLI output shows `succeeded/attempted`.

4. **Coordination claim was overstated.** The primary miners don't stamp `session_id`/`timestamp` metadata, so the sweeper coordinates only with its own prior runs. Softened docstrings on module and CLI command. Uniform cross-miner metadata is flagged as a follow-up.

5. **MAX_FILE_SIZE comments were misleading.** They said source size "does not affect storage or embedding cost". That is true per-drawer, but source size still scales drawer count, embedding work, and memory usage (files are read in full, not streamed). Corrected in both `miner.py` and `convo_miner.py`.

6. Added the tie-break regression test that reproduces the correctness bug from (1).

Tests: 970 passed (was 969), ruff + pre-commit clean.

Co-Authored-By: MSL <232237854+milla-jovovich@users.noreply.github.com>
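The tie-break fix in item 1 and the add/upsert distinction in item 2 can be illustrated with a minimal, self-contained sketch. This is not the project's code: `Record`, `drawer_id`, and `sweep` are hypothetical names, and a plain `dict` stands in for the Chroma collection. It only shows why a strict `<` comparison plus deterministic IDs makes a re-run at the cursor timestamp safe.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    session_id: str
    timestamp: float
    text: str


def drawer_id(rec: Record) -> str:
    """Deterministic ID: the same record always maps to the same key."""
    key = f"{rec.session_id}|{rec.timestamp}|{rec.text}".encode()
    return hashlib.sha256(key).hexdigest()[:16]


def sweep(records, store: dict, cursor=None) -> dict:
    """Ingest records at or after the cursor and report separate counts."""
    added = already_present = skipped = 0
    for rec in records:
        # Strict `<`: records AT the cursor timestamp are re-attempted,
        # since some of them may not have been written before a crash.
        if cursor is not None and rec.timestamp < cursor:
            skipped += 1
            continue
        did = drawer_id(rec)
        if did in store:
            already_present += 1  # re-write is harmless: same ID, same text
        else:
            added += 1
        store[did] = rec.text  # upsert into the stand-in store
    return {
        "drawers_added": added,
        "drawers_already_present": already_present,
        "drawers_skipped": skipped,
    }


# Three messages; the last two share the maximum timestamp.
records = [
    Record("s1", 1.0, "a"),
    Record("s1", 2.0, "b"),
    Record("s1", 2.0, "c"),
]
store = {}
sweep(records[:2], store)  # simulated crash: "c" was never written
result = sweep(records, store, cursor=2.0)  # next run resumes at the cursor
```

With `<=` in place of `<`, the second run would skip both records at timestamp `2.0` and `"c"` would never be stored. With `<`, `"b"` is counted as already present and `"c"` as newly added.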
@@ -58,8 +58,11 @@ CHUNK_SIZE = 800 # chars per drawer — align with miner.py
 MAX_FILE_SIZE = 500 * 1024 * 1024 # 500 MB — skip files larger than this.
 # Matches miner.py at 500 MB. Long Claude Code sessions, multi-year
 # ChatGPT exports, and lifetime Slack dumps routinely exceed 10 MB; the
-# cap at that level silently dropped them with `continue`. Source size
-# does not affect storage or embedding cost — chunking happens downstream.
+# cap at that level silently dropped them with `continue`. Per-drawer
+# size is bounded by CHUNK_SIZE, but larger source files still produce
+# more drawers and therefore more embedding/storage work — and content
+# is normalized and loaded fully into memory before chunking, so memory
+# use also scales with source size.
 
 
 def _register_file(collection, source_file: str, wing: str, agent: str):
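To make the corrected comment concrete, here is a rough back-of-the-envelope sketch of how drawer count grows with source size. It assumes fixed-size, non-overlapping chunks and one character per byte, neither of which is confirmed by the diff. `estimated_drawers` is a hypothetical helper, not a function from the repository.

```python
import math

CHUNK_SIZE = 800  # chars per drawer, as shown in the hunk header
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500 MB cap from the diff


def estimated_drawers(source_chars: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Rough drawer count, assuming non-overlapping fixed-size chunks."""
    return math.ceil(source_chars / chunk_size)


small = estimated_drawers(10 * 1024 * 1024)  # a 10 MB source
large = estimated_drawers(MAX_FILE_SIZE)     # a source at the 500 MB cap
```

Under these assumptions a 10 MB source yields about 13 thousand drawers and a 500 MB source about 655 thousand, so embedding and storage work grow roughly linearly with source size even though each drawer stays bounded by `CHUNK_SIZE`.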