c4baceccb4
Follow-up to #766 which covers version.py, pyproject.toml, README, CHANGELOG, and CONTRIBUTING. These 11 files still had the old org name in URLs: - website/ (VitePress config + 6 docs pages) - .claude-plugin/ (plugin.json repository, README marketplace command) - .codex-plugin/ (plugin.json URLs, README links) Author name fields are intentionally unchanged.
96 lines
3.4 KiB
Markdown
96 lines
3.4 KiB
Markdown
# Benchmarks
|
||
|
||
Curated summary of MemPalace benchmark results. For the full 725-line progression with every experiment, see [`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md) in the repository.
|
||
|
||
## The Core Finding
|
||
|
||
MemPalace's benchmarked raw baseline stores the source text and searches it with ChromaDB's default embeddings. No extraction layer or summarization step is required for that baseline.
|
||
|
||
**And it scores 96.6% on LongMemEval.**
|
||
|
||
## LongMemEval Results
|
||
|
||
| Mode | R@5 | LLM Required | Cost/query |
|
||
|------|-----|-------------|------------|
|
||
| Raw ChromaDB | **96.6%** | None | $0 |
|
||
| Hybrid v3 + rerank | 99.4% | Haiku | ~$0.001 |
|
||
| Palace + rerank | 99.4% | Haiku | ~$0.001 |
|
||
| **Hybrid v4 + rerank** | **100%** | Haiku | ~$0.001 |
|
||
|
||
The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The 100% result uses optional Haiku reranking.
|
||
|
||
### Per-Category Breakdown (Raw, 96.6%)
|
||
|
||
| Question Type | R@5 | Count |
|
||
|---------------|-----|-------|
|
||
| Knowledge update | 99.0% | 78 |
|
||
| Multi-session | 98.5% | 133 |
|
||
| Temporal reasoning | 96.2% | 133 |
|
||
| Single-session user | 95.7% | 70 |
|
||
| Single-session preference | 93.3% | 30 |
|
||
| Single-session assistant | 92.9% | 56 |
|
||
|
||
### Held-Out Validation
|
||
|
||
**98.4% R@5** on 450 questions that hybrid_v4 was never tuned on — confirming the improvements generalize.
|
||
|
||
## Comparison vs Published Systems
|
||
|
||
| System | LongMemEval R@5 | API Required | Cost |
|
||
|--------|----------------|--------------|------|
|
||
| **MemPalace (hybrid)** | **100%** | Optional | Free |
|
||
| Supermemory ASMR | ~99% | Yes | — |
|
||
| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
|
||
| Mastra | 94.87% | Yes | API costs |
|
||
| Hindsight | 91.4% | Yes | API costs |
|
||
| Mem0 | ~85% | Yes | $19–249/mo |
|
||
|
||
## Other Benchmarks
|
||
|
||
### ConvoMem (Salesforce, 75K+ QA pairs)
|
||
|
||
| System | Score |
|
||
|--------|-------|
|
||
| **MemPalace** | **92.9%** |
|
||
| Gemini (long context) | 70–82% |
|
||
| Block extraction | 57–71% |
|
||
| Mem0 (RAG) | 30–45% |
|
||
|
||
On this benchmark, MemPalace materially outperforms the Mem0 result cited in the comparison table.
|
||
|
||
### LoCoMo (1,986 multi-hop QA pairs)
|
||
|
||
| Mode | R@10 | LLM |
|
||
|------|------|-----|
|
||
| Hybrid v5 + Sonnet rerank (top-50) | **100%** | Sonnet |
|
||
| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
|
||
| Hybrid v5 (top-10, no rerank) | **88.9%** | None |
|
||
| Session, no rerank (top-10) | 60.3% | None |
|
||
|
||
### MemBench (ACL 2025, 8,500 items)
|
||
|
||
**80.3% R@5** overall. Strongest categories: aggregative (99.3%), comparative (98.4%), lowlevel_rec (99.8%).
|
||
|
||
## Reproducing Results
|
||
|
||
All benchmarks are reproducible with public datasets:
|
||
|
||
```bash
|
||
git clone https://github.com/MemPalace/mempalace.git
|
||
cd mempalace
|
||
pip install chromadb pyyaml
|
||
|
||
# Download LongMemEval data
|
||
curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
|
||
https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
|
||
|
||
# Run raw baseline (96.6%, no API key needed)
|
||
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json
|
||
```
|
||
|
||
::: tip
|
||
Results are deterministic. Same data + same script = same result every time. Every result JSONL file contains every question, every retrieved document, every score.
|
||
:::
|
||
|
||
For complete reproduction instructions, benchmark integrity notes, and the full score progression, see the [full benchmark documentation](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
|