Merge remote-tracking branch 'upstream/develop' into feat/landing-page-update

# Conflicts:
#	website/index.md
2026-04-16 22:31:22 -03:00
parent d8ac4c3abb 55a004fe1e
commit 44c525ddd3
99 changed files with 337031 additions and 1734 deletions
@@ -80,12 +80,11 @@ The knowledge graph uses SQLite with two tables:

 Database location: `~/.mempalace/knowledge_graph.sqlite3`

-## Comparison
+## Related Work

-| Feature | MemPalace | Zep (Graphiti) |
-|---------|-----------|----------------|
-| Storage | SQLite (local) | Neo4j (cloud) |
-| Cost | Free | $25/mo+ |
-| Temporal validity | Yes | Yes |
-| Self-hosted | Always | Enterprise only |
-| Privacy | Everything local | SOC 2, HIPAA |
+Temporal entity-relationship graphs are a familiar pattern — Zep's
+Graphiti, for example, also exposes a bi-temporal model. MemPalace's
+knowledge graph is local-first (SQLite, everything on disk) and free;
+Zep is a managed service backed by Neo4j with its own pricing, SLAs,
+and compliance surface. See Zep's own [documentation](https://www.getzep.com/)
+for authoritative details on their deployment model.
@@ -92,16 +92,9 @@ The original stored text chunks. This is the primary retrieval layer used by the

 ## Why Structure Matters

-Tested on 22,000+ real conversation memories:
+Wing and room identifiers become metadata filters at query time. Narrowing a search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which is useful when you have many unrelated projects or people filed in the same palace.

-| Search scope | R@10 | Improvement |
-|-------------|------|-------------|
-| All closets | 60.9% | baseline |
-| Within wing | 73.1% | +12% |
-| Wing + hall | 84.8% | +24% |
-| Wing + room | 94.8% | +34% |
-
-The practical point is that structure improves retrieval. In the project benchmarks, narrowing the search scope by wing and room outperformed searching the entire corpus at once.
+This is standard metadata filtering in the underlying vector store, not a novel retrieval mechanism. The useful property here is operational — clear scoping rules that a human or an agent can apply predictably — not a magic retrieval boost.

 ## Navigation

@@ -34,14 +34,20 @@ Three steps: **init**, **mine**, **search**.

 ### 1. Initialize Your Palace

+`mempalace init` requires a project directory to scan. Pass a path,
+or `.` to use the current directory.
+
 ```bash
 mempalace init ~/projects/myapp
+# or, from inside the project:
+mempalace init .
 ```

 This scans your project directory and:
+
 - Detects people and projects from file content
 - Creates rooms from your folder structure
-- Sets up `~/.mempalace/` config directory
+- Ensures the `~/.mempalace/` config directory exists

 ### 2. Mine Your Data

@@ -23,23 +23,16 @@ mempalace search "deploy process" --results 10

 ## How Search Works

-1. Your query is embedded using ChromaDB's default model (`all-MiniLM-L6-v2`)
-2. The embedding is compared against all drawers using cosine similarity
-3. Optional wing/room filters narrow the search scope
-4. Results are returned with similarity scores and source metadata
+1. Your query is embedded using the vector store's default model (`all-MiniLM-L6-v2` with the default ChromaDB backend).
+2. The embedding is compared against all drawers using cosine similarity.
+3. Optional wing/room filters narrow the search scope — standard metadata filtering in the underlying vector store.
+4. Results are returned with similarity scores and source metadata.

-### Why Structure Matters
+### Why Scoping Matters

-Tested on 22,000+ real conversation memories:
+Wing/room filtering is useful when a single palace contains many unrelated projects or people. Narrowing the search to a specific wing (or wing + room) means the vector store only scores candidates inside that scope, which keeps retrieval predictable as the palace grows.

-```
-Search all closets:          60.9%  R@10
-Search within wing:          73.1%  (+12%)
-Search wing + hall:          84.8%  (+24%)
-Search wing + room:          94.8%  (+34%)
-```
-
-Wings and rooms aren't cosmetic — they're a **34% retrieval improvement**.
+This is a metadata-filter feature of the vector store, not a novel retrieval mechanism. Treat it as an operational convenience: clear scoping rules that a human or an agent can apply predictably.
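
The scoping behaviour described above can be illustrated with a minimal, self-contained sketch. It is not MemPalace's or ChromaDB's code: the in-memory `DRAWERS` list, the field names, and the `scoped_search` helper are all hypothetical, and the two-dimensional vectors stand in for real embeddings. It only shows the order of operations: filter on wing/room metadata first, then rank the survivors by cosine similarity.

```python
import math

# Hypothetical in-memory drawers; field names and vectors are illustrative only.
DRAWERS = [
    {"id": "d1", "wing": "myapp", "room": "deploy", "vec": [1.0, 0.0]},
    {"id": "d2", "wing": "myapp", "room": "auth", "vec": [0.9, 0.1]},
    {"id": "d3", "wing": "other", "room": "deploy", "vec": [1.0, 0.0]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(query_vec, drawers, wing=None, room=None, n_results=5):
    """Filter by metadata, then rank the remaining drawers by similarity."""
    # Step 1: metadata filter. Drawers outside the scope are never scored.
    candidates = [
        d for d in drawers
        if (wing is None or d["wing"] == wing)
        and (room is None or d["room"] == room)
    ]
    # Step 2: rank only the in-scope candidates.
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [(d["id"], round(cosine(query_vec, d["vec"]), 3)) for d in ranked[:n_results]]


unscoped = scoped_search([1.0, 0.0], DRAWERS)
scoped = scoped_search([1.0, 0.0], DRAWERS, wing="myapp")
```

In this toy data, `d3` is an exact match for the query but lives in another wing, so it appears in `unscoped` and is absent from `scoped`. That is the whole effect: scoping changes which candidates are eligible, not how they are scored.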

 ## Programmatic Search

@@ -0,0 +1 @@
+mempalaceofficial.com
@@ -1,28 +1,51 @@
 # Benchmarks

-Curated summary of MemPalace benchmark results. For the full 725-line progression with every experiment, see [`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md) in the repository.
+Curated summary of MemPalace's reproducible benchmark results. For the
+complete progression with every experiment, see
+[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
+All headline numbers on this page are reproducible from the committed
+repository — datasets, scripts, and per-question result JSONLs are all
+checked in.

 ## The Core Finding

-MemPalace's benchmarked raw baseline stores the source text and searches it with ChromaDB's default embeddings. No extraction layer or summarization step is required for that baseline.
+MemPalace's benchmarked raw baseline stores the source text and searches
+it with the vector store's default embeddings. No extraction or
+summarisation step is required for that baseline, and it reproduces at
+**96.6% R@5** on LongMemEval with no LLM at any stage.

-**And it scores 96.6% on LongMemEval.**
+## LongMemEval — Retrieval Recall

-## LongMemEval Results
+Retrieval recall asks: is the labelled session for this question inside
+the top-K retrieved sessions? It is not the same metric as end-to-end QA
+accuracy; a system can have perfect retrieval recall and poor QA answer
+quality, and vice versa.

-| Mode | R@5 | LLM Required | Cost/query |
-|------|-----|-------------|------------|
-| Raw ChromaDB | **96.6%** | None | $0 |
-| Hybrid v3 + rerank | 99.4% | Haiku | ~$0.001 |
-| Palace + rerank | 99.4% | Haiku | ~$0.001 |
-| **Hybrid v4 + rerank** | **100%** | Haiku | ~$0.001 |
+**Full 500 questions:**

-The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The 100% result uses optional Haiku reranking.
+| Mode | R@5 | LLM required | Cost/query |
+|---|---|---|---|
+| Raw — vector search over verbatim sessions | **96.6%** | None | $0 |
+| Hybrid v4 — keyword/temporal/preference boosts, no LLM | 98.6% | None | $0 |
+| Hybrid v4 + LLM rerank (minimax-m2.7 via Ollama) | 99.2% | Any capable model | $0 local / varies cloud |

-### Per-Category Breakdown (Raw, 96.6%)
+**Held-out set (450 questions, never used during `hybrid_v4` development):**

-| Question Type | R@5 | Count |
-|---------------|-----|-------|
+| Mode | R@5 | R@10 | NDCG@10 |
+|---|---|---|---|
+| Hybrid v4 | **98.4%** | 99.8% | 0.938 |
+
+The held-out figure is the honest generalisable number. The full-500
+scores are higher but include the 50 "dev" questions that hybrid_v4's
+three targeted fixes (quoted-phrase boost, person-name boost, nostalgia
+patterns) were developed against. `benchmarks/BENCHMARKS.md` calls this
+"teaching to the test" and the held-out 98.4% is the clean number to
+quote when a single R@5 figure is needed for the hybrid pipeline.
+
+### Per-category breakdown (raw, 96.6%)
+
+| Question type | R@5 | Count |
+|---|---|---|
 | Knowledge update | 99.0% | 78 |
 | Multi-session | 98.5% | 133 |
 | Temporal reasoning | 96.2% | 133 |
@@ -30,66 +53,95 @@ The 96.6% raw score requires no API key, no cloud, and no LLM at any stage. The
 | Single-session preference | 93.3% | 30 |
 | Single-session assistant | 92.9% | 56 |

-### Held-Out Validation
+## LoCoMo — Retrieval Recall

-**98.4% R@5** on 450 questions that hybrid_v4 was never tuned on — confirming the improvements generalize.
+LoCoMo contains 1,986 questions across 10 long conversations (19–32
+sessions each).

-## Comparison vs Published Systems
+| Mode | R@10 | LLM required |
+|---|---|---|
+| Session, no rerank, top-10 | 60.3% | None |
+| Hybrid v5 (keyword + predicate boosts), top-10 | 88.9% | None |

-| System | LongMemEval R@5 | API Required | Cost |
-|--------|----------------|--------------|------|
-| **MemPalace (hybrid)** | **100%** | Optional | Free |
-| Supermemory ASMR | ~99% | Yes | — |
-| **MemPalace (raw)** | **96.6%** | **None** | **Free** |
-| Mastra | 94.87% | Yes | API costs |
-| Hindsight | 91.4% | Yes | API costs |
-| Mem0 | ~85% | Yes | $19–249/mo |
+We do not publish a "100% R@10" headline for LoCoMo. A reported 100% in
+earlier drafts used `top_k=50`, which exceeds the per-conversation
+session count (19–32) — so the retrieval stage returns every session in
+every conversation by construction. That number measures an LLM's
+reading comprehension over the whole conversation, not retrieval. The
+honest retrieval-recall number for LoCoMo is the top-10 figure.

 ## Other Benchmarks

-### ConvoMem (Salesforce, 75K+ QA pairs)
+**ConvoMem** (Salesforce; 50 items per category × 5 categories = 250
+items): MemPalace raw retrieval reaches **92.9% avg recall**. Strongest
+categories: Assistant Facts 100%, User Facts 98%. Weakest: Preferences
+86%. The Salesforce dataset contains ~75K items in total; our headline
+number is from the 250-item sample the benchmark script was designed
+around.

-| System | Score |
-|--------|-------|
-| **MemPalace** | **92.9%** |
-| Gemini (long context) | 70–82% |
-| Block extraction | 57–71% |
-| Mem0 (RAG) | 30–45% |
+**MemBench** (ACL 2025; 8,500 items, all topics): MemPalace hybrid
+top-5 reaches **80.3% R@5 overall**. Strongest: aggregative 99.3%,
+comparative 98.4%, lowlevel_rec 99.8%. Weakest: noisy 43.4%
+(distractor-heavy by design), conditional 57.3%.

-On this benchmark, MemPalace materially outperforms the Mem0 result cited in the comparison table.
+## Why We Don't Publish a Cross-System Comparison Table

-### LoCoMo (1,986 multi-hop QA pairs)
+Previous versions of this page placed MemPalace's retrieval recall (R@5)
+next to other projects' end-to-end QA accuracy figures under a single
+"LongMemEval R@5" column. Those are different metrics and are not
+comparable. A system can have 100% retrieval recall and 40% QA
+accuracy, and vice versa.

-| Mode | R@10 | LLM |
-|------|------|-----|
-| Hybrid v5 + Sonnet rerank (top-50) | **100%** | Sonnet |
-| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
-| Hybrid v5 (top-10, no rerank) | **88.9%** | None |
-| Session, no rerank (top-10) | 60.3% | None |
+If you are evaluating memory systems against MemPalace and want a fair
+comparison, use the retrieval-recall numbers above and the benchmark
+scripts in the repo; or pick the metric the other project publishes and
+compare on that. Each project's published source is the correct
+reference:

-### MemBench (ACL 2025, 8,500 items)
-
-**80.3% R@5** overall. Strongest categories: aggregative (99.3%), comparative (98.4%), lowlevel_rec (99.8%).
+- [Mastra — Observational Memory](https://mastra.ai/research/observational-memory)
+  (their published metric is binary QA accuracy with GPT-5-mini)
+- [Mem0 — Research](https://mem0.ai/research)
+  (their published LoCoMo metric is end-to-end QA accuracy, not retrieval recall)
+- [Supermemory — ASMR post](https://supermemory.ai/blog/we-broke-the-frontier-in-agent-memory-introducing-99-sota-memory-system/)
+  (their published metric is QA accuracy; authors explicitly frame the
+  ensemble as an experimental proof-of-concept, not production)

 ## Reproducing Results

-All benchmarks are reproducible with public datasets:
+Every benchmark runs deterministically from this repository.

 ```bash
 git clone https://github.com/MemPalace/mempalace.git
 cd mempalace
-pip install chromadb pyyaml
+pip install -e ".[dev]"

-# Download LongMemEval data
+# LongMemEval — raw (96.6%)
 curl -fsSL -o /tmp/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
-
-# Run raw baseline (96.6%, no API key needed)
 python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json
+
+# LongMemEval — hybrid v4 on the held-out 450 (98.4%)
+python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
+  --mode hybrid_v4 --held-out --split-file benchmarks/lme_split_50_450.json
+
+# LoCoMo — session, top-10 (60.3%)
+git clone https://github.com/snap-research/locomo.git /tmp/locomo
+python benchmarks/locomo_bench.py /tmp/locomo/data/locomo10.json \
+  --granularity session --top-k 10
+
+# LongMemEval — hybrid v4 + rerank, any OpenAI-compatible endpoint
+python benchmarks/longmemeval_bench.py /tmp/longmemeval_s_cleaned.json \
+  --mode hybrid_v4 --llm-rerank \
+  --llm-backend ollama --llm-model <your-model-tag>
 ```

 ::: tip
-Results are deterministic. Same data + same script = same result every time. Every result JSONL file contains every question, every retrieved document, every score.
+Results are deterministic: same data, same script, same split seed →
+same score. The committed `benchmarks/results_*.jsonl` files include
+every question, every retrieved corpus id, and every score, so every
+individual answer is auditable — not just the aggregate.
 :::

-For complete reproduction instructions, benchmark integrity notes, and the full score progression, see the [full benchmark documentation](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
+For the complete progression (hybrid v1 → v4, diary mode, palace mode,
+LoCoMo architecture iterations, methodology integrity notes), see
+[`benchmarks/BENCHMARKS.md`](https://github.com/MemPalace/mempalace/blob/main/benchmarks/BENCHMARKS.md).
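
The LoCoMo note above (a `top_k` of 50 against conversations of 19 to 32 sessions) can be made concrete with a small sketch of retrieval recall. This is an illustration under stated assumptions, not the benchmark scripts' implementation: `recall_at_k` and `mean_recall` are hypothetical helpers, a question is counted as a hit if any labelled session appears in the top K, and the data is a toy conversation of 20 sessions.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Return 1.0 if any labelled session is among the top-k retrieved ids, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0


def mean_recall(questions, k):
    """Average recall@k over a list of questions."""
    return sum(recall_at_k(q["ranked"], q["gold"], k) for q in questions) / len(questions)


# Toy conversation with 20 sessions, retrieved in the order s0, s1, ..., s19.
sessions = [f"s{i}" for i in range(20)]
questions = [
    {"ranked": sessions, "gold": ["s14"]},  # labelled session retrieved at rank 15
    {"ranked": sessions, "gold": ["s2"]},   # labelled session retrieved at rank 3
]

r_at_10 = mean_recall(questions, 10)
r_at_50 = mean_recall(questions, 50)
```

Here `r_at_10` is 0.5 because one labelled session sits at rank 15, while `r_at_50` is 1.0 regardless of ranking quality: once K is at least the number of sessions in the conversation, every session is returned and recall is 100% by construction.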
@@ -4,23 +4,29 @@ All commands accept `--palace <path>` to override the default palace location.

 ## `mempalace init`

-Detect rooms from your folder structure and set up the palace.
+Scan a project directory for people, projects, and rooms, and set up the palace.

 ```bash
-mempalace init <dir>
-mempalace init <dir> --yes  # non-interactive mode
+mempalace init <dir>                 # <dir> is required
+mempalace init <dir> --yes           # non-interactive mode
+mempalace init ~/projects/myapp      # example
+mempalace init .                     # initialize from the current directory
 ```

-| Option | Description |
-|--------|-------------|
-| `<dir>` | Project directory to scan |
-| `--yes` | Auto-accept all detected entities |
+| Option  | Description                                                                  |
+|---------|------------------------------------------------------------------------------|
+| `<dir>` | **Required.** Project directory to scan. Pass `.` for the current directory. |
+| `--yes` | Auto-accept all detected entities                                            |

 What it does:
-1. Scans for people and projects in file content
-2. Detects rooms from folder structure
-3. Creates `~/.mempalace/` config directory
-4. Saves detected entities to `<dir>/entities.json`
+
+1. Scans `<dir>` for people and projects in file content
+2. Detects rooms from `<dir>`'s folder structure
+3. Saves detected entities to `<dir>/entities.json`
+4. Ensures the global `~/.mempalace/` config directory exists
+
+Running `mempalace init` with no argument will exit with
+`error: the following arguments are required: dir`.
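
The error text above matches the wording Python's standard `argparse` module uses for a missing positional argument. As a minimal sketch, assuming the CLI is built on `argparse` (the source does not confirm this), a required `<dir>` positional behaves as follows. `build_parser` is a hypothetical stand-in, not MemPalace's actual parser.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented `init` arguments."""
    parser = argparse.ArgumentParser(prog="mempalace init")
    # A positional argument is required by default in argparse.
    parser.add_argument("dir", help="project directory to scan")
    parser.add_argument("--yes", action="store_true", help="auto-accept detected entities")
    return parser


# Passing "." satisfies the required positional.
args = build_parser().parse_args(["."])

# Omitting it makes argparse print a usage error to stderr and raise SystemExit.
try:
    build_parser().parse_args([])
    exit_code = 0
except SystemExit as exc:
    exit_code = exc.code
```

With no arguments, `argparse` reports that `dir` is required and exits with status 2, which is why scripts calling `mempalace init` should always pass a path or `.` explicitly.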

 ## `mempalace mine`

@@ -68,7 +68,7 @@ If you're planning a significant change, open an issue first. Key principles:
 - **Verbatim first** — never summarize user content. Store exact words.
 - **Local first** — everything runs on the user's machine. No cloud dependencies.
 - **Zero API by default** — core features must work without any API key.
-- **Palace structure matters** — wings, halls, and rooms aren't cosmetic — they drive a 34% retrieval improvement.
+- **Palace structure is scoping, not magic** — wings, halls, and rooms act as metadata filters in the underlying vector store. They make scoping predictable when a palace holds many unrelated projects; they are not a novel retrieval mechanism.

 ## Community

@@ -1,6 +1,6 @@
 # MCP Tools Reference

-Detailed parameter schemas for all 19 MCP tools.
+Detailed parameter schemas for all 29 MCP tools.

 ## Palace — Read Tools

@@ -114,6 +114,48 @@ Delete a drawer by ID. Irreversible.

 ---

+### `mempalace_get_drawer`
+
+Fetch a single drawer by ID — returns full content and metadata.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `drawer_id` | string | **Yes** | ID of the drawer to fetch |
+
+**Returns:** `{ drawer: { id, wing, room, content, ... } }`
+
+---
+
+### `mempalace_list_drawers`
+
+List drawers with pagination. Optional wing/room filter. Returns IDs, wings, rooms, and content previews.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `wing` | string | No | Filter by wing |
+| `room` | string | No | Filter by room |
+| `limit` | integer | No | Max results per page (default 20, max 100) |
+| `offset` | integer | No | Offset for pagination (default 0) |
+
+**Returns:** `{ drawers: [...], total, limit, offset }`
+
+---
+
+### `mempalace_update_drawer`
+
+Update an existing drawer's content and/or metadata (wing, room). Fetches the existing drawer first; returns an error if not found.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `drawer_id` | string | **Yes** | ID of the drawer to update |
+| `content` | string | No | New content (omit to keep existing) |
+| `wing` | string | No | New wing (omit to keep existing) |
+| `room` | string | No | New room (omit to keep existing) |
+
+**Returns:** `{ success, drawer_id, updated_fields }`
+
+---
+
 ## Knowledge Graph Tools

 ### `mempalace_kg_query`
@@ -221,6 +263,61 @@ Palace graph overview: nodes, tunnels, edges, connectivity.

 ---

+### `mempalace_create_tunnel`
+
+Create a cross-wing tunnel linking two palace locations. Use when content in one project relates to another — e.g., an API design in `project_api` connects to a database schema in `project_database`.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `source_wing` | string | **Yes** | Wing of the source |
+| `source_room` | string | **Yes** | Room in the source wing |
+| `target_wing` | string | **Yes** | Wing of the target |
+| `target_room` | string | **Yes** | Room in the target wing |
+| `label` | string | No | Description of the connection |
+| `source_drawer_id` | string | No | Specific source drawer ID |
+| `target_drawer_id` | string | No | Specific target drawer ID |
+
+**Returns:** `{ success, tunnel_id, source, target }`
+
+---
+
+### `mempalace_list_tunnels`
+
+List all explicit cross-wing tunnels. Optionally filter by wing.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `wing` | string | No | Filter tunnels by wing (source or target) |
+
+**Returns:** `{ tunnels: [...], count }`
+
+---
+
+### `mempalace_delete_tunnel`
+
+Delete an explicit tunnel by its ID.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `tunnel_id` | string | **Yes** | Tunnel ID to delete |
+
+**Returns:** `{ success, tunnel_id }`
+
+---
+
+### `mempalace_follow_tunnels`
+
+Follow tunnels from a room to see what it connects to in other wings. Returns connected rooms with drawer previews.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `wing` | string | **Yes** | Wing to start from |
+| `room` | string | **Yes** | Room to follow tunnels from |
+
+**Returns:** `[{ wing, room, label, previews }]`
+
+---
+
 ## Agent Diary Tools

 ### `mempalace_diary_write`
@@ -247,3 +344,38 @@ Read recent diary entries.
 | `last_n` | integer | No | Number of recent entries (default: 10) |

 **Returns:** `{ agent, entries: [{ date, timestamp, topic, content }], total, showing }`
+
+---
+
+## System Tools
+
+### `mempalace_hook_settings`
+
+Get or set auto-save hook behaviour. `silent_save=true` saves directly without MCP-level clutter; `silent_save=false` uses the legacy blocking path. `desktop_toast=true` surfaces a desktop notification when a save completes. Call with no arguments to view the current settings.
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `silent_save` | boolean | No | `true` = silent direct save, `false` = blocking MCP calls |
+| `desktop_toast` | boolean | No | `true` = show desktop toast via `notify-send` |
+
+**Returns:** `{ silent_save, desktop_toast }`
+
+---
+
+### `mempalace_memories_filed_away`
+
+Check whether a recent palace checkpoint was saved. Returns message count and timestamp of the last save.
+
+**Parameters:** None
+
+**Returns:** `{ filed, message_count, timestamp }`
+
+---
+
+### `mempalace_reconnect`
+
+Force a reconnect to the palace database. Use this after external scripts or CLI commands modified the palace directly, which can leave the in-memory HNSW index stale.
+
+**Parameters:** None
+
+**Returns:** `{ success, palace_path }`