Igor Lins e Silva 61d02e10fe benchmarks: add v3.3.0 reproduction results + 50/450 split

Addresses #875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).

Scorecard — all reproduce exactly:

  LongMemEval
    raw R@5                            96.6% (500/500)   ✅
    hybrid_v4 held-out 450 R@5         98.4% (442/450)   ✅
    hybrid_v4 + minimax rerank R@5     99.2% (496/500)   *
    hybrid_v4 + minimax rerank R@10   100.0% (500/500)   *

  LoCoMo (session, top-10)
    raw                                60.3% (1986q)     ✅
    hybrid v5                          88.9% (1986q)     ✅

  ConvoMem all-categories (250 items)   92.9%            ✅
  MemBench all-categories (8500)        80.3%            ✅

* The minimax-m2.7:cloud rerank run replicates the "100%" claim
  with a different LLM family (no Anthropic dependency). R@10 is
  a perfect reproduction; R@5 misses 4 questions that the
  published Haiku run caught — consistent with BENCHMARKS.md's own
  disclosure that hybrid_v4 includes three question-specific fixes
  developed by inspecting misses, i.e. teaching to the test.

The committed 50/450 split is the deterministic (seed=42) split
BENCHMARKS.md references but wasn't previously in the repo.

Full result JSONLs include every question, every retrieved id,
and every score — auditable end-to-end.

2026-04-14 21:21:11 -03:00

BENCHMARKS.md

bench: add benchmark runners, results docs, and test suite

2026-04-04 18:33:42 -07:00

convomem_bench.py

fix(bench): remove global SSL verification bypass in convomem_bench (#176)

2026-04-11 23:14:12 -07:00

HYBRID_MODE.md

fix(bench): remove hardcoded credential paths from benchmark runners (#177)

2026-04-11 23:14:36 -07:00

lme_split_50_450.json

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

locomo_bench.py

benchmarks: add --llm-backend ollama for non-Anthropic rerank

2026-04-14 21:20:14 -03:00

longmemeval_bench.py

benchmarks: apply ruff-format to llm_rerank (trivial line wrap)

2026-04-14 21:20:54 -03:00

membench_bench.py

bench: add benchmark runners, results docs, and test suite

2026-04-04 18:33:42 -07:00

README.md

bench: add benchmark runners, results docs, and test suite

2026-04-04 18:33:42 -07:00

results_convomem_raw_top10_20260414_1649.json

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

results_locomo_hybrid_session_top10_20260414_1649.json

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

results_locomo_raw_session_top10_20260414_1634.json

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

results_membench_hybrid_all_movie_top5_20260414_1656.json

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

results_mempal_hybrid_v4_held_out_session_20260414_1634.jsonl

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

results_mempal_hybrid_v4_llmrerank_session_20260414_1654.jsonl

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

results_mempal_hybrid_v4_llmrerank_session_20260414_1659.jsonl

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

results_mempal_raw_session_20260414_1629.jsonl

benchmarks: add v3.3.0 reproduction results + 50/450 split

2026-04-14 21:21:11 -03:00

README.md

MemPal Benchmarks — Reproduction Guide

Run the exact same benchmarks we report. Clone, install, run.

Setup

git clone -b ben/benchmarking https://github.com/aya-thekeeper/mempal.git
cd mempal
pip install chromadb pyyaml

Benchmark 1: LongMemEval (500 questions)

Tests retrieval across ~53 conversation sessions per question. The standard benchmark for AI memory.

# Download data
mkdir -p /tmp/longmemeval-data
curl -fsSL -o /tmp/longmemeval-data/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json

# Run (raw mode — our headline 96.6% result)
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json

# Run with AAAK compression (84.2%)
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --mode aaak

# Run with room-based boosting (89.4%)
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --mode rooms

# Quick test on 20 questions first
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --limit 20

# Turn-level granularity
python benchmarks/longmemeval_bench.py /tmp/longmemeval-data/longmemeval_s_cleaned.json --granularity turn

Expected output (raw mode, full 500):

Recall@5:  0.966
Recall@10: 0.982
NDCG@10:   0.889
Time:      ~5 minutes on Apple Silicon
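
For readers who want to sanity-check the metric names above, here is a minimal sketch of how Recall@k and NDCG@k are commonly computed for a single question with binary relevance. It assumes "recall" means "at least one gold session appears in the top k" and that ranked ids are unique; the function names and ids are illustrative, and the exact definitions used by longmemeval_bench.py may differ.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant id appears in the top k results, else 0.0.

    Assumption: "recall-any" semantics; the real runner may score differently.
    """
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: discounted gain of hits divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank 0 gets discount log2(2) = 1
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Hypothetical example: the single gold session "s2" is ranked second.
ranked = ["s7", "s2", "s9"]
gold = {"s2"}
r1 = recall_at_k(ranked, gold, 1)
r5 = recall_at_k(ranked, gold, 5)
ndcg10 = ndcg_at_k(ranked, gold, 10)
print(r1, r5, round(ndcg10, 4))
```

Averaging these per-question values over all 500 questions would give benchmark-level figures of the kind listed above.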

Benchmark 2: LoCoMo (1,986 QA pairs)

Tests multi-hop reasoning across 10 long conversations (19-32 sessions each, 400-600 dialog turns).

# Clone LoCoMo
git clone https://github.com/snap-research/locomo.git /tmp/locomo

# Run (session granularity — our 60.3% result)
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --granularity session

# Dialog granularity (harder — 48.0%)
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --granularity dialog

# Higher top-k (77.8% at top-50)
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --top-k 50

# Quick test on 1 conversation
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json --limit 1

Expected output (session, top-10, full 10 conversations):

Avg Recall: 0.603
Temporal:   0.692
Time:       ~2 minutes
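
The output above reports an overall average alongside a per-category figure (Temporal). As a rough sketch of how such numbers could be derived, the code below scores each question by the fraction of its evidence sessions found in the top k, then averages per category. The fractional-recall definition, the category labels, and the function names are assumptions for illustration, not taken from locomo_bench.py.

```python
from collections import defaultdict

def fractional_recall(retrieved_ids, evidence_ids, k=10):
    """Fraction of evidence ids that appear among the top-k retrieved ids."""
    if not evidence_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(evidence_ids)) / len(evidence_ids)

def average_by_category(rows):
    """rows: iterable of (category, recall) pairs -> {category: mean recall}."""
    grouped = defaultdict(list)
    for category, recall in rows:
        grouped[category].append(recall)
    return {category: sum(vals) / len(vals) for category, vals in grouped.items()}

# Hypothetical questions: one multi-hop question finds 1 of its 2 evidence sessions.
rows = [
    ("temporal", fractional_recall(["s3", "s1", "s8"], ["s1", "s4"])),
    ("temporal", 1.0),
    ("single_hop", 0.0),
]
per_category = average_by_category(rows)
overall = sum(recall for _, recall in rows) / len(rows)
print(per_category, overall)
```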

Benchmark 3: ConvoMem (Salesforce, 75K+ QA pairs)

Tests six categories of conversational memory. Downloads from HuggingFace automatically.

# Run all categories, 50 items each (our 92.9% result)
python benchmarks/convomem_bench.py --category all --limit 50

# Single category
python benchmarks/convomem_bench.py --category user_evidence --limit 100

# Quick test
python benchmarks/convomem_bench.py --category user_evidence --limit 10

Categories available: user_evidence, assistant_facts_evidence, changing_evidence, abstention_evidence, preference_evidence, implicit_connection_evidence

Expected output (all categories, 50 each):

Avg Recall: 0.929
Assistant Facts: 1.000
User Facts:      0.980
Time:            ~2 minutes

What Each Benchmark Tests

| Benchmark | What it measures | Why it matters |
|---|---|---|
| LongMemEval | Can you find a fact buried in 53 sessions? | Tests basic retrieval quality (the "needle in a haystack") |
| LoCoMo | Can you connect facts across conversations over weeks? | Tests multi-hop reasoning and temporal understanding |
| ConvoMem | Does your memory system work at scale? | Tests all memory types: facts, preferences, changes, abstention |

Results Files

Raw results are in benchmarks/results_*.jsonl and benchmarks/results_*.json. Each file contains every question, every retrieved document, and every score — fully auditable.
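
One way to audit a .jsonl results file is to recompute the aggregate score from its per-question records. The sketch below is an assumption-laden example: the field name `recall_at_5` and the sample records are hypothetical, so check the actual keys in the file before relying on it. The .json result files may use a different layout and would need `json.load` instead.

```python
import json

def mean_score(jsonl_lines, field):
    """Average a numeric field over JSONL records, skipping blank lines.

    Returns (mean, number_of_records_counted).
    """
    scores = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if field in record:
            scores.append(float(record[field]))
    if not scores:
        raise ValueError(f"no records contain field {field!r}")
    return sum(scores) / len(scores), len(scores)

# Hypothetical in-memory sample; the real files may use other key names.
sample = [
    '{"question_id": "q1", "recall_at_5": 1.0}',
    "",
    '{"question_id": "q2", "recall_at_5": 0.0}',
]
mean, count = mean_score(sample, "recall_at_5")
print(mean, count)
```

For a file on disk, an open file handle can be passed directly, since iterating over it yields lines: `with open(path) as fh: mean_score(fh, "recall_at_5")`.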

Requirements

Python 3.9+
chromadb and pyyaml (the only dependencies, installed in Setup)
~300MB disk for LongMemEval data
~2 to 5 minutes for each full benchmark run (see the per-benchmark timings above)
No API key. No internet during benchmark (after data download). No GPU.
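
Before a long run, it can help to confirm that the environment meets these requirements. The following preflight check is a small sketch using only the standard library; the helper name `missing_requirements` is made up for this example, and `yaml` is the import name of the pyyaml package installed in Setup.

```python
import importlib.util
import sys

def missing_requirements(modules=("chromadb", "yaml"), min_python=(3, 9)):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required")
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems

issues = missing_requirements()
print("environment OK" if not issues else "\n".join(issues))
```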

Next Benchmarks (Planned)

Scale testing — ConvoMem at 50/100/300 conversations per item
Hybrid AAAK — search raw text, deliver AAAK-compressed results
End-to-end QA — retrieve + generate answer + measure F1 (needs LLM API key)