Igor Lins e Silva e8017ca2ec bench: add per-room recall threshold test

Concentrates all drawers into a single wing+room to isolate the
embedding model's retrieval limit independent of palace filtering.
Confirms recall degrades to ~0.4-0.5 at 5K drawers per room even
with wing+room filters applied — the spatial structure helps by
keeping buckets small, but can't fix the underlying embedding ceiling.

2026-04-08 05:06:31 -03:00


MemPalace Scale Benchmark Suite

106 tests that benchmark mempalace at scale to validate real-world performance limits.

Why

MemPalace scores well on academic benchmarks (96.6% R@5 on LongMemEval), but there is no empirical data on how it behaves at scale. Key unknowns:

tool_status() loads ALL metadata into memory — at what palace size does this OOM?
PersistentClient is re-instantiated on every MCP call — what's the overhead?
Modified files are never re-ingested — what's the skip-check cost at scale?
How does query latency degrade as the palace grows from 1K to 100K drawers?
Does wing/room filtering actually improve retrieval, and by how much?
At what per-room drawer count does recall break regardless of filtering?

This suite finds those answers.

Quick Start

# Fast smoke test (~2 min)
uv run pytest tests/benchmarks/ -v --bench-scale=small -m "benchmark and not slow"

# Full small scale (~35 min)
uv run pytest tests/benchmarks/ -v --bench-scale=small

# Medium scale with JSON report
uv run pytest tests/benchmarks/ -v --bench-scale=medium --bench-report=results.json

# Stress test (local only, very slow)
uv run pytest tests/benchmarks/ -v --bench-scale=stress -m stress

Scale Levels

Level	Drawers	Wings	Rooms/Wing	KG Triples	Use case
small	1,000	3	5	200	CI, quick checks
medium	10,000	8	12	2,000	Pre-release testing
large	50,000	15	20	10,000	Scale limit finding
stress	100,000	25	30	50,000	Breaking point
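
The table also implies an average bucket size per room, which matters for the per-room recall question raised earlier. The sketch below works out that arithmetic. It assumes drawers are spread evenly across every wing and room, which the generator may not actually do, and the names `SCALES` and `avg_drawers_per_room` are illustrative rather than part of the suite.

```python
# Scale levels copied from the table above.
SCALES = {
    "small":  {"drawers": 1_000,   "wings": 3,  "rooms_per_wing": 5},
    "medium": {"drawers": 10_000,  "wings": 8,  "rooms_per_wing": 12},
    "large":  {"drawers": 50_000,  "wings": 15, "rooms_per_wing": 20},
    "stress": {"drawers": 100_000, "wings": 25, "rooms_per_wing": 30},
}


def avg_drawers_per_room(level: str) -> float:
    """Average bucket size, ASSUMING an even spread of drawers over all rooms."""
    cfg = SCALES[level]
    return cfg["drawers"] / (cfg["wings"] * cfg["rooms_per_wing"])


for level in SCALES:
    print(f"{level:>6}: {avg_drawers_per_room(level):7.1f} drawers/room")
```

Under that even-spread assumption, the averages come out to roughly 67, 104, 167, and 133 drawers per room, so a room only reaches thousands of drawers when a test deliberately concentrates them, as the recall threshold test does.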

Test Modules

Critical Path

File	What it tests
`test_mcp_bench.py`	MCP tool response times, unbounded metadata fetch, client re-instantiation overhead
`test_chromadb_stress.py`	ChromaDB breaking point, query degradation curve, batch vs sequential insert
`test_memory_profile.py`	RSS/heap growth over repeated operations, leak detection

Performance Baselines

File	What it tests
`test_ingest_bench.py`	Mining throughput (files/sec, drawers/sec), peak RSS, chunking speed, re-ingest skip overhead
`test_search_bench.py`	Query latency vs palace size, recall@k with planted needles, concurrent queries, n_results scaling

Architectural Validation

File	What it tests
`test_palace_boost.py`	Retrieval improvement from wing/room filtering at different scales
`test_recall_threshold.py`	Per-room recall ceiling — isolates embedding model limit with all drawers in one bucket
`test_knowledge_graph_bench.py`	Triple insertion rate, temporal query accuracy, SQLite concurrent access
`test_layers_bench.py`	MemoryStack wake-up cost, Layer1 unbounded fetch, token budget compliance

Architecture

tests/benchmarks/
  conftest.py              # --bench-scale / --bench-report CLI options, fixtures, markers
  data_generator.py        # Deterministic data factory (seeded RNG, planted needles)
  report.py                # JSON report writer + regression checker
  test_*.py                # 9 test modules (106 tests total)

Data Generator

PalaceDataGenerator(seed=42, scale="small") produces deterministic, realistic test data:

generate_project_tree() — writes real files + mempalace.yaml for mine() to ingest
populate_palace_directly() — bypasses mining, inserts directly into ChromaDB (10-100x faster for search/MCP benchmarks)
generate_kg_triples() — entity-relationship triples with temporal validity
generate_search_queries() — queries with known-good answers for recall measurement
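
Determinism here comes from seeding a private random number generator. The sketch below illustrates the idea only; it is not the actual `PalaceDataGenerator` code, and the function name, ID format, and wing/room naming are all hypothetical.

```python
import random
from typing import Dict, List


def generate_drawers(seed: int, n: int, wings: int, rooms_per_wing: int) -> List[Dict[str, str]]:
    """Assign n synthetic drawers to wings/rooms using a seeded, local RNG."""
    # A dedicated Random instance keeps results independent of global random state.
    rng = random.Random(seed)
    return [
        {
            "id": f"drawer_{i:05d}",                      # hypothetical ID format
            "wing": f"wing_{rng.randrange(wings)}",        # hypothetical naming
            "room": f"room_{rng.randrange(rooms_per_wing)}",
        }
        for i in range(n)
    ]


a = generate_drawers(seed=42, n=1_000, wings=3, rooms_per_wing=5)
b = generate_drawers(seed=42, n=1_000, wings=3, rooms_per_wing=5)
print(a == b)  # True: same seed and parameters give identical data
```

Using `random.Random(seed)` instead of `random.seed(seed)` avoids interference from other code that touches the module-level generator during a test run.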

Planted needles: Unique identifiable content (e.g., NEEDLE_0042: PostgreSQL vacuum autovacuum threshold...) seeded into specific wings/rooms. Search queries target these needles, enabling recall@k measurement without an LLM judge.
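
Because each query targets a known needle, recall@k reduces to checking whether that needle's ID shows up in the top k results. The following is a minimal sketch of that check, assuming search results arrive as ranked lists of document text; the helper names and the sample `results` mapping are made up for illustration.

```python
from typing import Dict, List, Sequence


def recall_at_k(retrieved: Sequence[str], needle_id: str, k: int = 5) -> float:
    """Return 1.0 if the planted needle appears in the top-k results, else 0.0."""
    return 1.0 if any(needle_id in doc for doc in retrieved[:k]) else 0.0


def mean_recall_at_k(results: Dict[str, List[str]], k: int = 5) -> float:
    """Average recall@k over all queries; keys are the needle IDs being targeted."""
    if not results:
        return 0.0
    hits = sum(recall_at_k(docs, needle, k) for needle, docs in results.items())
    return hits / len(results)


# Hypothetical search output: needle ID -> ranked document texts.
results = {
    "NEEDLE_0042": ["unrelated text", "NEEDLE_0042: PostgreSQL vacuum autovacuum threshold"],
    "NEEDLE_0007": ["noise a", "noise b", "noise c"],
}
print(mean_recall_at_k(results, k=5))  # 0.5: one of the two needles was found
```

Substring matching on the ID is safe here only because the IDs are zero-padded to a fixed width, so one ID cannot be a prefix of another.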

JSON Reports

When run with --bench-report=path.json, the suite writes a machine-readable report:

{
  "timestamp": "2026-04-07T...",
  "git_sha": "abc123",
  "scale": "small",
  "system": {"os": "linux", "cpu_count": 8},
  "results": {
    "mcp_status": {"latency_ms_at_1000": 45.2, "rss_delta_mb_at_5000": 12.3},
    "search": {"avg_latency_ms_at_5000": 23.1, "recall_at_5": 0.92},
    "chromadb_insert": {"sequential_ms": 8500, "batched_ms": 1200, "speedup_ratio": 7.1}
  }
}

Regression Detection

from tests.benchmarks.report import check_regression

regressions = check_regression("current.json", "baseline.json", threshold=0.2)
# Returns list of metric descriptions that degraded beyond 20%
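
The internals of `check_regression` are not shown here. As a rough illustration of what a threshold comparison over two reports could look like, the sketch below works on already-loaded report dictionaries in the JSON layout above. The function `find_regressions` and the name-based rule for deciding which metrics are higher-is-better are assumptions, not the suite's actual logic.

```python
from typing import Dict, List

# ASSUMPTION: metrics whose names contain these tags improve as they increase;
# everything else (latency, RSS, elapsed ms) is treated as lower-is-better.
HIGHER_IS_BETTER = ("recall", "speedup")


def flatten(results: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Turn {"search": {"recall_at_5": 0.92}} into {"search.recall_at_5": 0.92}."""
    return {
        f"{group}.{name}": value
        for group, metrics in results.items()
        for name, value in metrics.items()
    }


def find_regressions(current: dict, baseline: dict, threshold: float = 0.2) -> List[str]:
    """List metrics that got worse by more than `threshold` (relative change)."""
    cur, base = flatten(current["results"]), flatten(baseline["results"])
    regressions = []
    for key, old in base.items():
        new = cur.get(key)
        if new is None or old == 0:
            continue  # metric missing, or relative change undefined
        change = (new - old) / abs(old)
        if any(tag in key for tag in HIGHER_IS_BETTER):
            change = -change  # a drop in recall or speedup counts as degradation
        if change > threshold:
            regressions.append(f"{key}: {old} -> {new} ({change:+.0%} worse)")
    return regressions


baseline = {"results": {"search": {"avg_latency_ms_at_5000": 23.1, "recall_at_5": 0.92}}}
current = {"results": {"search": {"avg_latency_ms_at_5000": 30.0, "recall_at_5": 0.90}}}
print(find_regressions(current, baseline))
```

In this example only the latency metric is reported: it rose by about 30%, while recall dropped by about 2%, which stays under the 20% threshold.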

CI Integration

The GitHub Actions workflow runs benchmarks on PRs at small scale:

benchmark:
  runs-on: ubuntu-latest
  if: github.event_name == 'pull_request'
  # Runs: pytest tests/benchmarks/ -m "benchmark and not stress and not slow" --bench-scale=small

The existing unit-test run excludes the benchmarks with --ignore=tests/benchmarks, so the two suites stay isolated.

Markers

@pytest.mark.benchmark — all benchmark tests
@pytest.mark.slow — tests taking >30s even at small scale
@pytest.mark.stress — tests that should only run at large/stress scale
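
For readers unfamiliar with how custom options and markers are wired into pytest, the sketch below shows one way a `conftest.py` could register them using the standard `pytest_addoption` and `pytest_configure` hooks. It is not the suite's actual `conftest.py`: the defaults, help strings, and the `choices` restriction are assumptions.

```python
# Sketch of a conftest.py; pytest discovers these hook functions by name.

MARKERS = (
    "benchmark: all benchmark tests",
    "slow: tests taking >30s even at small scale",
    "stress: tests that should only run at large/stress scale",
)


def pytest_addoption(parser):
    """Register the --bench-scale and --bench-report command-line options."""
    parser.addoption(
        "--bench-scale",
        action="store",
        default="small",  # ASSUMPTION: the real default is not stated in this README
        choices=("small", "medium", "large", "stress"),
        help="Benchmark scale level",
    )
    parser.addoption(
        "--bench-report",
        action="store",
        default=None,
        help="Path for the JSON report",
    )


def pytest_configure(config):
    """Declare the custom markers so pytest does not warn about unknown marks."""
    for line in MARKERS:
        config.addinivalue_line("markers", line)
```

A fixture can then read the chosen level with `request.config.getoption("--bench-scale")`.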

Dependencies

Only one new dependency beyond the existing dev stack: psutil (for cross-platform RSS measurement). tracemalloc and resource are stdlib.
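
As an illustration of the stdlib side of this, the sketch below uses `tracemalloc` to measure net Python heap growth across repeated calls, which is the basic signal a leak check looks for. It is not taken from `test_memory_profile.py`; the helper name and the toy `leaky`/`clean` operations are hypothetical. RSS measured through psutil would be read separately and is not shown.

```python
import tracemalloc


def heap_growth_bytes(operation, repeats: int = 50) -> int:
    """Run `operation` repeatedly and return net traced-heap growth in bytes."""
    tracemalloc.start()
    try:
        operation()  # warm-up so one-time allocations are not counted as growth
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(repeats):
            operation()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


leaky_store = []


def leaky():
    # Retains 10 KB per call, so the heap grows with every repeat.
    leaky_store.append(bytearray(10_000))


def clean():
    # Allocates the same amount but releases it when the function returns.
    _ = bytearray(10_000)


print("leaky:", heap_growth_bytes(leaky))
print("clean:", heap_growth_bytes(clean))
```

`tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries would show up in RSS but not in these numbers.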
