Files
mempalace/tests/benchmarks/README.md
T
Igor Lins e Silva e8017ca2ec bench: add per-room recall threshold test
Concentrates all drawers into a single wing+room to isolate the
embedding model's retrieval limit independent of palace filtering.
Confirms recall degrades to ~0.4-0.5 at 5K drawers per room even
with wing+room filters applied — the spatial structure helps by
keeping buckets small, but can't fix the underlying embedding ceiling.
2026-04-08 05:06:31 -03:00

5.3 KiB

MemPalace Scale Benchmark Suite

106 tests that benchmark mempalace at scale to validate real-world performance limits.

Why

MemPalace has strong academic scores (96.6% R@5 on LongMemEval) but no empirical data on how it behaves at scale. Key unknowns:

  • tool_status() loads ALL metadata into memory — at what palace size does this OOM?
  • PersistentClient is re-instantiated on every MCP call — what's the overhead?
  • Modified files are never re-ingested — what's the skip-check cost at scale?
  • How does query latency degrade as the palace grows from 1K to 100K drawers?
  • Does wing/room filtering actually improve retrieval, and by how much?
  • At what per-room drawer count does recall break regardless of filtering?

This suite finds those answers.

Quick Start

# Fast smoke test (~2 min)
uv run pytest tests/benchmarks/ -v --bench-scale=small -m "benchmark and not slow"

# Full small scale (~35 min)
uv run pytest tests/benchmarks/ -v --bench-scale=small

# Medium scale with JSON report
uv run pytest tests/benchmarks/ -v --bench-scale=medium --bench-report=results.json

# Stress test (local only, very slow)
uv run pytest tests/benchmarks/ -v --bench-scale=stress -m stress

Scale Levels

Level Drawers Wings Rooms/Wing KG Triples Use case
small 1,000 3 5 200 CI, quick checks
medium 10,000 8 12 2,000 Pre-release testing
large 50,000 15 20 10,000 Scale limit finding
stress 100,000 25 30 50,000 Breaking point

Test Modules

Critical Path

File What it tests
test_mcp_bench.py MCP tool response times, unbounded metadata fetch, client re-instantiation overhead
test_chromadb_stress.py ChromaDB breaking point, query degradation curve, batch vs sequential insert
test_memory_profile.py RSS/heap growth over repeated operations, leak detection

Performance Baselines

File What it tests
test_ingest_bench.py Mining throughput (files/sec, drawers/sec), peak RSS, chunking speed, re-ingest skip overhead
test_search_bench.py Query latency vs palace size, recall@k with planted needles, concurrent queries, n_results scaling

Architectural Validation

File What it tests
test_palace_boost.py Retrieval improvement from wing/room filtering at different scales
test_recall_threshold.py Per-room recall ceiling — isolates embedding model limit with all drawers in one bucket
test_knowledge_graph_bench.py Triple insertion rate, temporal query accuracy, SQLite concurrent access
test_layers_bench.py MemoryStack wake-up cost, Layer1 unbounded fetch, token budget compliance

Architecture

tests/benchmarks/
  conftest.py              # --bench-scale / --bench-report CLI options, fixtures, markers
  data_generator.py        # Deterministic data factory (seeded RNG, planted needles)
  report.py                # JSON report writer + regression checker
  test_*.py                # 9 test modules (106 tests total)

Data Generator

PalaceDataGenerator(seed=42, scale="small") produces deterministic, realistic test data:

  • generate_project_tree() — writes real files + mempalace.yaml for mine() to ingest
  • populate_palace_directly() — bypasses mining, inserts directly into ChromaDB (10-100x faster for search/MCP benchmarks)
  • generate_kg_triples() — entity-relationship triples with temporal validity
  • generate_search_queries() — queries with known-good answers for recall measurement

Planted needles: Unique identifiable content (e.g., NEEDLE_0042: PostgreSQL vacuum autovacuum threshold...) seeded into specific wings/rooms. Search queries target these needles, enabling recall@k measurement without an LLM judge.

JSON Reports

When run with --bench-report=path.json, produces machine-readable output:

{
  "timestamp": "2026-04-07T...",
  "git_sha": "abc123",
  "scale": "small",
  "system": {"os": "linux", "cpu_count": 8},
  "results": {
    "mcp_status": {"latency_ms_at_1000": 45.2, "rss_delta_mb_at_5000": 12.3},
    "search": {"avg_latency_ms_at_5000": 23.1, "recall_at_5": 0.92},
    "chromadb_insert": {"sequential_ms": 8500, "batched_ms": 1200, "speedup_ratio": 7.1}
  }
}

Regression Detection

from tests.benchmarks.report import check_regression

regressions = check_regression("current.json", "baseline.json", threshold=0.2)
# Returns list of metric descriptions that degraded beyond 20%

CI Integration

The GitHub Actions workflow runs benchmarks on PRs at small scale:

benchmark:
  runs-on: ubuntu-latest
  if: github.event_name == 'pull_request'
  # Runs: pytest tests/benchmarks/ -m "benchmark and not stress and not slow" --bench-scale=small

Existing unit tests are isolated with --ignore=tests/benchmarks.

Markers

  • @pytest.mark.benchmark — all benchmark tests
  • @pytest.mark.slow — tests taking >30s even at small scale
  • @pytest.mark.stress — tests that should only run at large/stress scale

Dependencies

Only one new dependency beyond the existing dev stack: psutil (for cross-platform RSS measurement). tracemalloc and resource are stdlib.