Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update

# Conflicts:
#	website/index.md
Dominique Deschatre
2026-04-16 22:31:22 -03:00
99 changed files with 337031 additions and 1734 deletions
+7 -8
@@ -80,12 +80,11 @@ The knowledge graph uses SQLite with two tables:
Database location: `~/.mempalace/knowledge_graph.sqlite3`
## Comparison
## Related Work
| Feature | MemPalace | Zep (Graphiti) |
|---------|-----------|----------------|
| Storage | SQLite (local) | Neo4j (cloud) |
| Cost | Free | $25/mo+ |
| Temporal validity | Yes | Yes |
| Self-hosted | Always | Enterprise only |
| Privacy | Everything local | SOC 2, HIPAA |
Temporal entity-relationship graphs are a familiar pattern — Zep's
Graphiti, for example, also exposes a bi-temporal model. MemPalace's
knowledge graph is local-first (SQLite, everything on disk) and free;
Zep is a managed service backed by Neo4j with its own pricing, SLAs,
and compliance surface. See Zep's own [documentation](https://www.getzep.com/)
for authoritative details on their deployment model.
+2 -9
@@ -92,16 +92,9 @@ The original stored text chunks. This is the primary retrieval layer used by the
## Why Structure Matters
Tested on 22,000+ real conversation memories:
Wing and room identifiers become metadata filters at query time. Narrowing a search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which is useful when you have many unrelated projects or people filed in the same palace.
| Search scope | R@10 | Improvement |
|-------------|------|-------------|
| All closets | 60.9% | baseline |
| Within wing | 73.1% | +12% |
| Wing + hall | 84.8% | +24% |
| Wing + room | 94.8% | +34% |
The practical point is that structure improves retrieval. In the project benchmarks, narrowing the search scope by wing and room outperformed searching the entire corpus at once.
This is standard metadata filtering in the underlying vector store, not a novel retrieval mechanism. The useful property here is operational — clear scoping rules that a human or an agent can apply predictably — not a magic retrieval boost.
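As an illustration of the scoping described above, here is a minimal pure-Python sketch of metadata filtering followed by similarity ranking. It is not MemPalace's or the vector store's implementation: the `scoped_search` function, the drawer dictionaries, and the two-dimensional vectors are all hypothetical and exist only to show that a wing or room filter shrinks the candidate set before anything is scored.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(drawers, query_vec, wing=None, room=None, top_k=5):
    """Filter by metadata first, then rank only the surviving candidates."""
    candidates = [
        d for d in drawers
        if (wing is None or d["wing"] == wing)
        and (room is None or d["room"] == room)
    ]
    ranked = sorted(
        candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True
    )
    return ranked[:top_k]


# Hypothetical drawers: two wings that both contain a "deploy" room.
drawers = [
    {"id": "a", "wing": "myapp", "room": "deploy", "vec": [1.0, 0.0]},
    {"id": "b", "wing": "myapp", "room": "auth", "vec": [0.9, 0.1]},
    {"id": "c", "wing": "other", "room": "deploy", "vec": [1.0, 0.0]},
]

in_wing = scoped_search(drawers, [1.0, 0.0], wing="myapp")
in_room = scoped_search(drawers, [1.0, 0.0], wing="myapp", room="auth")
```

Drawer `c` matches the query as well as drawer `a` does, but it is never scored when the search is scoped to the `myapp` wing. That is the whole effect: the filter changes which candidates are eligible, not how they are ranked.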
## Navigation
+7 -1
@@ -34,14 +34,20 @@ Three steps: **init**, **mine**, **search**.
### 1. Initialize Your Palace
`mempalace init` requires a project directory to scan. Pass a path,
or `.` to use the current directory.
```bash
mempalace init ~/projects/myapp
# or, from inside the project:
mempalace init .
```
This scans your project directory and:
- Detects people and projects from file content
- Creates rooms from your folder structure
- Sets up `~/.mempalace/` config directory
- Ensures the `~/.mempalace/` config directory exists
### 2. Mine Your Data
+7 -14
@@ -23,23 +23,16 @@ mempalace search "deploy process" --results 10
## How Search Works
1. Your query is embedded using ChromaDB's default model (`all-MiniLM-L6-v2`)
2. The embedding is compared against all drawers using cosine similarity
3. Optional wing/room filters narrow the search scope
4. Results are returned with similarity scores and source metadata
1. Your query is embedded using the vector store's default model (`all-MiniLM-L6-v2` with the default ChromaDB backend).
2. The embedding is compared against all drawers using cosine similarity.
3. Optional wing/room filters narrow the search scope — standard metadata filtering in the underlying vector store.
4. Results are returned with similarity scores and source metadata.
### Why Structure Matters
### Why Scoping Matters
Tested on 22,000+ real conversation memories:
Wing/room filtering is useful when a single palace contains many unrelated projects or people. Narrowing the search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which keeps retrieval predictable as the palace grows.
```
Search all closets: 60.9% R@10
Search within wing: 73.1% (+12%)
Search wing + hall: 84.8% (+24%)
Search wing + room: 94.8% (+34%)
```
Wings and rooms aren't cosmetic — they're a **34% retrieval improvement**.
This is a metadata-filter feature of the vector store, not a novel retrieval mechanism. Treat it as an operational convenience: clear scoping rules that a human or an agent can apply predictably.
## Programmatic Search
+1
@@ -0,0 +1 @@
mempalaceofficial.com
+102 -50
@@ -1,28 +1,51 @@
# Benchmarks
Curated summary of MemPalace benchmark results. For the full 725-line progression with every experiment, see [`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md) in the repository.
Curated summary of MemPalace's reproducible benchmark results. For the
complete progression with every experiment, see
[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
All headline numbers on this page are reproducible from the committed
repository — datasets, scripts, and per-question result JSONLs are all
checked in.
## The Core Finding
MemPalace's benchmarked raw baseline stores the source text and searches it with ChromaDB's default embeddings. No extraction layer or summarization step is required for that baseline.
MemPalace's benchmarked raw baseline stores the source text and searches
it with the vector store's default embeddings. No extraction or
summarisation step is required for that baseline, and it reproduces at
**96.6% R@5** on LongMemEval with no LLM at any stage.
**And it scores 96.6% on LongMemEval.**
## LongMemEval — Retrieval Recall
## LongMemEval Results
Retrieval recall asks: is the labelled session for this question inside
the top-K retrieved sessions? It is not the same metric as end-to-end QA
accuracy; a system can have perfect retrieval recall and poor QA answer
quality, and vice versa.
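The definition above can be sketched in a few lines of Python. This is an illustrative reading of the metric, not the code in `benchmarks/longmemeval_bench.py`: the function names are hypothetical, and counting a question as a hit when any one of its labelled sessions appears in the top K is an assumption made for the example.

```python
def hit_at_k(retrieved_ids, labelled_ids, k):
    """True if at least one labelled session is among the top-k retrieved."""
    return any(session_id in labelled_ids for session_id in retrieved_ids[:k])


def recall_at_k(results, k):
    """Fraction of questions whose labelled session is retrieved in the top k.

    `results` is a list of (retrieved_ids, labelled_ids) pairs, one per question.
    """
    if not results:
        return 0.0
    hits = sum(hit_at_k(retrieved, labelled, k) for retrieved, labelled in results)
    return hits / len(results)


# Two hypothetical questions: the first is answered at rank 2, the second never.
results = [
    (["s3", "s1", "s9"], {"s1"}),
    (["s4", "s5", "s6"], {"s7"}),
]
```

Note that nothing here looks at an answer string. The metric only checks whether the right session was retrieved, which is why it cannot be compared with QA accuracy.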
| Mode | R@5 | LLM Required | Cost/query |
|------|-----|-------------|------------|
| Raw ChromaDB | **96.6%** | None | $0 |
| Hybrid v3 + rerank | 99.4% | Haiku | ~$0.001 |
| Palace + rerank | 99.4% | Haiku | ~$0.001 |
| **Hybrid v4 + rerank** | **100%** | Haiku | ~$0.001 |
**Full 500 questions:**
The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The 100% result uses optional Haiku reranking.
| Mode | R@5 | LLM required | Cost/query |
|---|---|---|---|
| Raw — vector search over verbatim sessions | **96.6%** | None | $0 |
| Hybrid v4 — keyword/temporal/preference boosts, no LLM | 98.6% | None | $0 |
| Hybrid v4 + LLM rerank (minimax-m2.7 via Ollama) | 99.2% | Any capable model | $0 local / varies cloud |
### Per-Category Breakdown (Raw, 96.6%)
**Held-out set (450 questions, never used during `hybrid_v4` development):**
| Question Type | R@5 | Count |
|---------------|-----|-------|
| Mode | R@5 | R@10 | NDCG@10 |
|---|---|---|---|
| Hybrid v4 | **98.4%** | 99.8% | 0.938 |
The held-out figure is the honest generalisable number. The full-500
scores are higher but include the 50 "dev" questions that hybrid_v4's
three targeted fixes (quoted-phrase boost, person-name boost, nostalgia
patterns) were developed against. `benchmarks/BENCHMARKS.md` calls this
"teaching to the test" and the held-out 98.4% is the clean number to
quote when a single R@5 figure is needed for the hybrid pipeline.
### Per-category breakdown (raw, 96.6%)
| Question type | R@5 | Count |
|---|---|---|
| Knowledge update | 99.0% | 78 |
| Multi-session | 98.5% | 133 |
| Temporal reasoning | 96.2% | 133 |
@@ -30,66 +53,95 @@ The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The
| Single-session preference | 93.3% | 30 |
| Single-session assistant | 92.9% | 56 |
### Held-Out Validation
## LoCoMo — Retrieval Recall
**98.4% R@5** on 450 questions that hybrid_v4 was never tuned on — confirming the improvements generalize.
LoCoMo contains 1,986 questions across 10 long conversations (19-32
sessions each).
## Comparison vs Published Systems
| Mode | R@10 | LLM required |
|---|---|---|
| Session, no rerank, top-10 | 60.3% | None |
| Hybrid v5 (keyword + predicate boosts), top-10 | 88.9% | None |
| System | LongMemEval R@5 | API Required | Cost |
|--------|----------------|--------------|------|
| **MemPalace (hybrid)** | **100%** | Optional | Free |
| Supermemory ASMR | ~99% | Yes | — |
| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
| Mastra | 94.87% | Yes | API costs |
| Hindsight | 91.4% | Yes | API costs |
| Mem0 | ~85% | Yes | $19-249/mo |
We do not publish a "100% R@10" headline for LoCoMo. A reported 100% in
earlier drafts used `top_k=50`, which exceeds the per-conversation
session count (19-32), so the retrieval stage returns every session in
every conversation by construction. That number measures an LLM's
reading comprehension over the whole conversation, not retrieval. The
honest retrieval-recall number for LoCoMo is the top-10 figure.
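The argument above is easy to check with a toy simulation. The sketch below is hypothetical and does not touch the LoCoMo data: it assumes a conversation of at most 32 sessions and a retriever that ranks sessions in random order, then measures how often the labelled session lands in the top K.

```python
import random


def hit_at_k(ranking, gold_id, k):
    """True if the labelled session is among the first k ranked sessions."""
    return gold_id in ranking[:k]


def hit_rate_random_retriever(n_sessions, k, trials=200, seed=0):
    """Hit rate of a retriever that ranks the sessions in random order."""
    rng = random.Random(seed)
    sessions = [f"session_{i}" for i in range(n_sessions)]
    hits = 0
    for _ in range(trials):
        gold = rng.choice(sessions)
        ranking = sessions[:]
        rng.shuffle(ranking)  # no retrieval signal at all
        hits += hit_at_k(ranking, gold, k)
    return hits / trials


# k larger than the corpus: every session is returned, so every trial hits.
saturated = hit_rate_random_retriever(n_sessions=32, k=50)
# k smaller than the corpus: the same random retriever misses most of the time.
meaningful = hit_rate_random_retriever(n_sessions=32, k=10)
```

With `k=50` the slice `ranking[:50]` covers all 32 sessions, so even a random ranking scores perfectly. Only a `k` below the session count says anything about retrieval quality.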
## Other Benchmarks
### ConvoMem (Salesforce, 75K+ QA pairs)
**ConvoMem** (Salesforce; 50 items per category × 5 categories = 250
items): MemPalace raw retrieval reaches **92.9% avg recall**. Strongest
categories: Assistant Facts 100%, User Facts 98%. Weakest: Preferences
86%. The Salesforce dataset contains ~75K items in total; our headline
number is from the 250-item sample the benchmark script was designed
around.
| System | Score |
|--------|-------|
| **MemPalace** | **92.9%** |
| Gemini (long context) | 70-82% |
| Block extraction | 57-71% |
| Mem0 (RAG) | 30-45% |
**MemBench** (ACL 2025; 8,500 items, all topics): MemPalace hybrid
top-5 reaches **80.3% R@5 overall**. Strongest: aggregative 99.3%,
comparative 98.4%, lowlevel_rec 99.8%. Weakest: noisy 43.4%
(distractor-heavy by design), conditional 57.3%.
On this benchmark, MemPalace materially outperforms the Mem0 result cited in the comparison table.
## Why We Don't Publish a Cross-System Comparison Table
### LoCoMo (1,986 multi-hop QA pairs)
Previous versions of this page placed MemPalace's retrieval recall (R@5)
next to other projects' end-to-end QA accuracy figures under a single
"LongMemEval R@5" column. Those are different metrics and are not
comparable. A system can have 100% retrieval recall and 40% QA
accuracy, and vice versa.
| Mode | R@10 | LLM |
|------|------|-----|
| Hybrid v5 + Sonnet rerank (top-50) | **100%** | Sonnet |
| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
| Hybrid v5 (top-10, no rerank) | **88.9%** | None |
| Session, no rerank (top-10) | 60.3% | None |
If you are evaluating memory systems against MemPalace and want a fair
comparison, use the retrieval-recall numbers above and the benchmark
scripts in the repo; or pick the metric the other project publishes and
compare on that. Each project's published source is the correct
reference:
### MemBench (ACL 2025, 8,500 items)
**80.3% R@5** overall. Strongest categories: aggregative (99.3%), comparative (98.4%), lowlevel_rec (99.8%).
- [Mastra — Observational Memory](https://mastra.ai/research/observational-memory)
(their published metric is binary QA accuracy with GPT-5-mini)
- [Mem0 — Research](https://mem0.ai/research)
(their published LoCoMo metric is end-to-end QA accuracy, not retrieval recall)
- [Supermemory — ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/)
(their published metric is QA accuracy; authors explicitly frame the
ensemble as an experimental proof-of-concept, not production)
## Reproducing Results
All benchmarks are reproducible with public datasets:
Every benchmark runs deterministically from this repository.
```bash
git clone https://github.com/MemPalace/mempalace.git
cd mempalace
pip install chromadb pyyaml
pip install -e ".[dev]"
# Download LongMemEval data
# LongMemEval — raw (96.6%)
curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
# Run raw baseline (96.6%, no API key needed)
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json
# LongMemEval — hybrid v4 on the held-out 450 (98.4%)
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
--mode hybrid_v4 --held-out --split-file benchmarks/lme_split_50_450.json
# LoCoMo — session, top-10 (60.3%)
git clone https://github.com/snap-research/locomo.git /tmp/locomo
python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json \
--granularity session --top-k 10
# LongMemEval — hybrid v4 + rerank, any OpenAI-compatible endpoint
python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
--mode hybrid_v4 --llm-rerank \
--llm-backend ollama --llm-model <your-model-tag>
```
::: tip
Results are deterministic. Same data + same script = same result every time. Every result JSONL file contains every question, every retrieved document, every score.
Results are deterministic: same data, same script, same split seed →
same score. The committed `benchmarks/results_*.jsonl` files include
every question, every retrieved corpus id, and every score, so every
individual answer is auditable — not just the aggregate.
:::
For complete reproduction instructions, benchmark integrity notes, and the full score progression, see the [full benchmark documentation](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
For the complete progression (hybrid v1 → v4, diary mode, palace mode,
LoCoMo architecture iterations, methodology integrity notes), see
[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
+17 -11
@@ -4,23 +4,29 @@ All commands accept `--palace <path>` to override the default palace location.
## `mempalace init`
Detect rooms from your folder structure and set up the palace.
Scan a project directory for people, projects, and rooms, and set up the palace.
```bash
mempalace init <dir>
mempalace init <dir> --yes # non-interactive mode
mempalace init <dir> # <dir> is required
mempalace init <dir> --yes # non-interactive mode
mempalace init ~/projects/myapp # example
mempalace init . # initialize from the current directory
```
| Option | Description |
|--------|-------------|
| `<dir>` | Project directory to scan |
| `--yes` | Auto-accept all detected entities |
| Option | Description |
|---------|------------------------------------------------------------------------------|
| `<dir>` | **Required.** Project directory to scan. Pass `.` for the current directory. |
| `--yes` | Auto-accept all detected entities |
What it does:
1. Scans for people and projects in file content
2. Detects rooms from folder structure
3. Creates `~/.mempalace/` config directory
4. Saves detected entities to `<dir>/entities.json`
1. Scans `<dir>` for people and projects in file content
2. Detects rooms from `<dir>`'s folder structure
3. Saves detected entities to `<dir>/entities.json`
4. Ensures the global `~/.mempalace/` config directory exists
Running `mempalace init` with no argument will exit with
`error: the following arguments are required: dir`.
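The error text quoted above has the shape that Python's `argparse` produces for a missing positional argument. The sketch below shows how a required `<dir>` and an optional `--yes` could be declared; it assumes an `argparse`-based CLI and is not taken from the MemPalace source, so `build_parser` and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser with one subcommand mirroring `mempalace init`."""
    parser = argparse.ArgumentParser(prog="mempalace")
    subcommands = parser.add_subparsers(dest="command", required=True)

    init = subcommands.add_parser("init", help="Scan a project directory")
    # A positional argument without nargs="?" is required by argparse.
    init.add_argument("dir", help="Project directory to scan; pass . for the current directory")
    init.add_argument("--yes", action="store_true", help="Auto-accept all detected entities")
    return parser


args = build_parser().parse_args(["init", ".", "--yes"])

# Omitting <dir> makes argparse print a usage error and exit with status 2.
try:
    build_parser().parse_args(["init"])
    missing_dir_exit_code = None
except SystemExit as exc:
    missing_dir_exit_code = exc.code
```

Declaring `dir` as a plain positional is what makes it mandatory; no extra validation code is needed for the no-argument case.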
## `mempalace mine`
+1 -1
@@ -68,7 +68,7 @@ If you're planning a significant change, open an issue first. Key principles:
- **Verbatim first** — never summarize user content. Store exact words.
- **Local first** — everything runs on the user's machine. No cloud dependencies.
- **Zero API by default** — core features must work without any API key.
- **Palace structure matters** — wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement.
- **Palace structure is scoping, not magic** — wings, halls, and rooms act as metadata filters in the underlying vector store. They make scoping predictable when a palace holds many unrelated projects; they are not a novel retrieval mechanism.
## Community
+133 -1
@@ -1,6 +1,6 @@
# MCP Tools Reference
Detailed parameter schemas for all 19 MCP tools.
Detailed parameter schemas for all 29 MCP tools.
## Palace — Read Tools
@@ -114,6 +114,48 @@ Delete a drawer by ID. Irreversible.
---
### `mempalace_get_drawer`
Fetch a single drawer by ID — returns full content and metadata.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `drawer_id` | string | **Yes** | ID of the drawer to fetch |
**Returns:** `{ drawer: { id, wing, room, content, ... } }`
---
### `mempalace_list_drawers`
List drawers with pagination. Optional wing/room filter. Returns IDs, wings, rooms, and content previews.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `wing` | string | No | Filter by wing |
| `room` | string | No | Filter by room |
| `limit` | integer | No | Max results per page (default 20, max 100) |
| `offset` | integer | No | Offset for pagination (default 0) |
**Returns:** `{ drawers: [...], total, limit, offset }`
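To make the `limit`/`offset` contract concrete, here is a small in-memory sketch of the pagination described in the table. It is not the MCP server's code: `list_drawers` operates on a plain list of hypothetical drawer dictionaries, and clamping an out-of-range `limit` to 1..100 is an assumption (the real tool might reject such values instead).

```python
def list_drawers(drawers, wing=None, room=None, limit=20, offset=0):
    """Return one page of drawers plus the total number of matches."""
    limit = max(1, min(limit, 100))  # assumed behaviour: clamp to the documented max
    offset = max(0, offset)
    matching = [
        d for d in drawers
        if (wing is None or d["wing"] == wing)
        and (room is None or d["room"] == room)
    ]
    page = matching[offset:offset + limit]
    return {"drawers": page, "total": len(matching), "limit": limit, "offset": offset}


# 250 hypothetical drawers split evenly across two wings.
drawers = [
    {"id": f"d{i}", "wing": "myapp" if i % 2 == 0 else "other", "room": "notes"}
    for i in range(250)
]

first_page = list_drawers(drawers, wing="myapp")
last_page = list_drawers(drawers, wing="myapp", limit=500, offset=100)
```

A caller can walk the whole result set by advancing `offset` by `limit` until `offset >= total`.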
---
### `mempalace_update_drawer`
Update an existing drawer's content and/or metadata (wing, room). Fetches the existing drawer first; returns an error if not found.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `drawer_id` | string | **Yes** | ID of the drawer to update |
| `content` | string | No | New content (omit to keep existing) |
| `wing` | string | No | New wing (omit to keep existing) |
| `room` | string | No | New room (omit to keep existing) |
**Returns:** `{ success, drawer_id, updated_fields }`
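The "omit to keep existing" rule can be sketched as a partial update over a dictionary-backed store. This is an illustration only: `update_drawer` below is a hypothetical stand-in for the tool, the store is a plain `dict`, and the shape of the error response is an assumption, since the reference above only says that an error is returned when the drawer is not found.

```python
def update_drawer(store, drawer_id, content=None, wing=None, room=None):
    """Apply only the fields that were supplied; leave the rest untouched."""
    drawer = store.get(drawer_id)  # fetch the existing drawer first
    if drawer is None:
        # Assumed error shape; the real tool's error format is not documented here.
        return {"success": False, "error": f"drawer not found: {drawer_id}"}

    supplied = {"content": content, "wing": wing, "room": room}
    updated_fields = [name for name, value in supplied.items() if value is not None]
    for name in updated_fields:
        drawer[name] = supplied[name]
    return {"success": True, "drawer_id": drawer_id, "updated_fields": updated_fields}


store = {"d1": {"content": "old notes", "wing": "myapp", "room": "deploy"}}

moved = update_drawer(store, "d1", room="release")
missing = update_drawer(store, "nope", content="x")
```

Because `None` means "not supplied" in this sketch, it cannot express clearing a field; a real implementation would need a separate sentinel for that.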
---
## Knowledge Graph Tools
### `mempalace_kg_query`
@@ -221,6 +263,61 @@ Palace graph overview: nodes, tunnels, edges, connectivity.
---
### `mempalace_create_tunnel`
Create a cross-wing tunnel linking two palace locations. Use when content in one project relates to another — e.g., an API design in `project_api` connects to a database schema in `project_database`.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `source_wing` | string | **Yes** | Wing of the source |
| `source_room` | string | **Yes** | Room in the source wing |
| `target_wing` | string | **Yes** | Wing of the target |
| `target_room` | string | **Yes** | Room in the target wing |
| `label` | string | No | Description of the connection |
| `source_drawer_id` | string | No | Specific source drawer ID |
| `target_drawer_id` | string | No | Specific target drawer ID |
**Returns:** `{ success, tunnel_id, source, target }`
---
### `mempalace_list_tunnels`
List all explicit cross-wing tunnels. Optionally filter by wing.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `wing` | string | No | Filter tunnels by wing (source or target) |
**Returns:** `{ tunnels: [...], count }`
---
### `mempalace_delete_tunnel`
Delete an explicit tunnel by its ID.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `tunnel_id` | string | **Yes** | Tunnel ID to delete |
**Returns:** `{ success, tunnel_id }`
---
### `mempalace_follow_tunnels`
Follow tunnels from a room to see what it connects to in other wings. Returns connected rooms with drawer previews.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `wing` | string | **Yes** | Wing to start from |
| `room` | string | **Yes** | Room to follow tunnels from |
**Returns:** `[{ wing, room, label, previews }]`
---
## Agent Diary Tools
### `mempalace_diary_write`
@@ -247,3 +344,38 @@ Read recent diary entries.
| `last_n` | integer | No | Number of recent entries (default: 10) |
**Returns:** `{ agent, entries: [{ date, timestamp, topic, content }], total, showing }`
---
## System Tools
### `mempalace_hook_settings`
Get or set auto-save hook behaviour. `silent_save=true` saves directly without MCP-level clutter; `silent_save=false` uses the legacy blocking path. `desktop_toast=true` surfaces a desktop notification when a save completes. Call with no arguments to view the current settings.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `silent_save` | boolean | No | `true` = silent direct save, `false` = blocking MCP calls |
| `desktop_toast` | boolean | No | `true` = show desktop toast via `notify-send` |
**Returns:** `{ silent_save, desktop_toast }`
---
### `mempalace_memories_filed_away`
Check whether a recent palace checkpoint was saved. Returns message count and timestamp of the last save.
**Parameters:** None
**Returns:** `{ filed, message_count, timestamp }`
---
### `mempalace_reconnect`
Force a reconnect to the palace database. Use this after external scripts or CLI commands modified the palace directly, which can leave the in-memory HNSW index stale.
**Parameters:** None
**Returns:** `{ success, palace_path }`