fix: prevent HNSW index bloat via batch_size + sync_threshold metadata

Sets `hnsw:batch_size` and `hnsw:sync_threshold` to 50_000 at every
collection-creation call site:
* `mempalace/backends/chroma.py` — `get_collection(create=True)` and the
legacy `create_collection()` path. Preserves existing `hnsw:space`,
`hnsw:num_threads=1` (race fix from #976), and `**ef_kwargs`
(embedding-function plumbing from a4868a3).
* `mempalace/mcp_server.py` — the direct `client.get_or_create_collection`
path used when a palace is first opened by the MCP server. Without this
third site, MCP-bootstrapped palaces would skip the guard and could
still trigger the original bloat.
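
As a rough illustration of the call sites above, the shared metadata could be assembled as in the sketch below. The helper name `build_collection_metadata` and the constant `HNSW_BLOAT_GUARD` are hypothetical and do not come from the mempalace code; only the `hnsw:*` keys, the cosine space, and the 50_000 value are taken from this commit.

```python
# Hypothetical helper; the names are illustrative, not taken from mempalace.
HNSW_BLOAT_GUARD = {
    "hnsw:batch_size": 50_000,
    "hnsw:sync_threshold": 50_000,
}


def build_collection_metadata(space="cosine", extra=None):
    """Return collection metadata with the bloat guard applied last."""
    metadata = {"hnsw:space": space, "hnsw:num_threads": 1}
    metadata.update(extra or {})
    # Apply the guard last so a caller-supplied dict cannot drop or shrink it.
    metadata.update(HNSW_BLOAT_GUARD)
    return metadata
```

The resulting dict would then be passed as `metadata=` to `client.get_or_create_collection(...)`, alongside whatever embedding-function kwargs the call site already forwards.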

Without these defaults, mining ~10K+ drawers triggers ~30 HNSW index
resizes and hundreds of persistDirty() calls. persistDirty uses relative
seek positioning in link_lists.bin; accumulated seek drift across resize
cycles causes the OS to extend the sparse file with zero-filled regions,
each cycle compounding the next. Result: link_lists.bin grows to
hundreds of GB (sparse), after which `status`, `search`, and `repair` all
segfault and the palace is unrecoverable.

Empirical: rebuilt a palace from scratch on 39,792 drawers across 5
wings with this fix applied. The final palace is 376 MB, link_lists.bin
stays at 0 bytes in both Chroma collection dirs, and `status` and
`search` both return cleanly. The same workload without the fix bloated
the palace to 565 GB sparse (30 GB on disk) and segfaulted at ~15K drawers.
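
The "sparse" versus "on disk" figures above can be reproduced for any file by comparing its apparent size with its allocated blocks. The sketch below is a minimal example assuming a POSIX platform; the function name `sparse_file_sizes` is hypothetical and not part of mempalace.

```python
import os


def sparse_file_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    Assumes a POSIX platform: st_blocks counts 512-byte units and is
    not available on Windows.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

Called on `link_lists.bin` inside a Chroma collection dir, a large gap between the two numbers would indicate the sparse growth described here; in a healthy palace both values stay small.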

Migration note: chromadb 1.5.x exposes a
`collection.modify(configuration={"hnsw": {...}})` retrofit path for
already-created collections (`UpdateHNSWConfiguration`), but this PR
does not pursue it: by the time link_lists.bin has bloated, the index
is already corrupt and the only known recovery is a fresh mine.

Tests assert both keys land on the persisted collection metadata in
both `ChromaBackend` code paths, which also covers the #1161 "config
silently dropped" concern at CI time. A separate smoke test was used
to verify the metadata round-trips through `chromadb.PersistentClient`
reopen on chromadb 1.5.8.

Closes #344
Supersedes #346
Co-authored-by: robot-rocket-science <robot-rocket-science@users.noreply.github.com>

@@ -336,6 +336,42 @@ def test_chroma_backend_creates_collection_with_cosine_distance(tmp_path):
     assert col.metadata.get("hnsw:space") == "cosine"
 
 
+def test_chroma_backend_sets_hnsw_bloat_guard_on_creation(tmp_path):
+    """The HNSW guard from #344 must land on freshly-created collection metadata.
+
+    Without batch_size + sync_threshold, mining ~10K+ drawers triggers the
+    resize+persist drift that bloats link_lists.bin into hundreds of GB sparse
+    and segfaults `status` / `search` / `repair`. The guard belongs at
+    collection-creation time so every fresh palace gets it without needing
+    a runtime retrofit. Asserting both keys land on the persisted metadata
+    also covers the #1161 "config silently dropped" concern at CI time.
+    """
+    palace_path = tmp_path / "palace"
+
+    ChromaBackend().get_collection(
+        str(palace_path),
+        collection_name="mempalace_drawers",
+        create=True,
+    )
+
+    client = chromadb.PersistentClient(path=str(palace_path))
+    col = client.get_collection("mempalace_drawers")
+    assert col.metadata.get("hnsw:batch_size") == 50_000
+    assert col.metadata.get("hnsw:sync_threshold") == 50_000
+
+
+def test_chroma_backend_create_collection_sets_hnsw_bloat_guard(tmp_path):
+    """Same guard must apply via the legacy create_collection() path."""
+    palace_path = tmp_path / "palace"
+
+    ChromaBackend().create_collection(str(palace_path), "mempalace_drawers")
+
+    client = chromadb.PersistentClient(path=str(palace_path))
+    col = client.get_collection("mempalace_drawers")
+    assert col.metadata.get("hnsw:batch_size") == 50_000
+    assert col.metadata.get("hnsw:sync_threshold") == 50_000
+
+
 def test_fix_blob_seq_ids_converts_blobs_to_integers(tmp_path):
     """Simulate a ChromaDB 0.6.x database with BLOB seq_ids and verify repair."""
     db_path = tmp_path / "chroma.sqlite3"