README: honest update from Milla & Ben — own the mistakes, fix the claims

The community caught real problems within hours of launch. Addressing them directly:

- Added prominent "A Note from Milla & Ben" section at top owning the issues
- Fixed AAAK section: removed "lossless" claim, removed bogus token example, honest about its lossy nature and the regression to 84.2% R@5 on LongMemEval (vs 96.6% in raw mode)
- Headline benchmark table: clearly labeled as "raw mode" (the 96.6% number)
- Removed misleading "100%" headline (still real, but the rerank pipeline is not in the public scripts yet; addressing)
- Removed misleading "+34% palace boost" headline (it's metadata filtering, real but not a novel mechanism)
- Marked Contradiction Detection as "experimental, not yet wired into KG ops"
- Closet legend now notes plain-text summaries in v3.0.0, AAAK closets coming
- Intro pillars rewritten honestly: raw verbatim is the win, AAAK is an experimental compression layer

Thank you to @panuhorsmalahti (#43), @lhl (#27), @gizmax (#39) and everyone who filed issues in the first 48 hours. Brutally honest criticism is exactly what makes open source work.
2026-04-07 11:08:53 -07:00
parent 1782628b8a
commit aa10f8fbf1
1 changed file with 54 additions and 25 deletions
@@ -12,9 +12,11 @@ Every conversation you have with an AI — every decision, every debugging sessi

 Other memory systems try to fix this by letting AI decide what's worth remembering. It extracts "user prefers Postgres" and throws away the conversation where you explained *why*. MemPalace takes a different approach: **store everything, then make it findable.**

-**The Palace** — Ancient Greek orators memorized entire speeches by placing ideas in rooms of an imaginary building. Walk through the building, find the idea. MemPalace applies the same principle to AI memory: your conversations are organized into wings (people and projects), halls (types of memory), and rooms (specific ideas). No AI decides what matters — you keep every word, and the structure makes it searchable. That structure alone improves retrieval by 34%.
+**The Palace** — Ancient Greek orators memorized entire speeches by placing ideas in rooms of an imaginary building. Walk through the building, find the idea. MemPalace applies the same principle to AI memory: your conversations are organized into wings (people and projects), halls (types of memory), and rooms (specific ideas). No AI decides what matters — you keep every word, and the structure gives you a navigable map instead of a flat search index.

-**AAAK** — A lossless shorthand dialect designed for AI agents. Not meant to be read by humans — meant to be read by your AI, fast. 30x compression, zero information loss. Your AI loads months of context in ~120 tokens. And because AAAK is just structured text with a universal grammar, it works with **any model that reads text** — Claude, GPT, Gemini, Llama, Mistral. No decoder, no fine-tuning, no cloud API required. Run it against a local model and your entire memory stack stays offline. Nothing else like it exists.
+**Raw verbatim storage** — MemPalace stores your actual exchanges in ChromaDB without summarization or extraction. The 96.6% LongMemEval result comes from this raw mode. We don't burn an LLM to decide what's "worth remembering" — we keep everything and let semantic search find it.
+
+**AAAK (experimental)** — A lossy abbreviation dialect for packing repeated entities into fewer tokens at scale. Readable by any LLM that reads text — Claude, GPT, Gemini, Llama, Mistral — no decoder needed. **AAAK is a separate compression layer, not the storage default**, and on the LongMemEval benchmark it currently regresses vs raw mode (84.2% vs 96.6%). We're iterating. See the [note above](#a-note-from-milla--ben--april-7-2026) for the honest status.

 **Local, open, adaptable** — MemPalace runs entirely on your machine, on any data you have locally, without using any external API or services. It has been tested on conversations — but it can be adapted for different types of datastores. This is why we're open-sourcing it.

@@ -35,19 +37,53 @@ Other memory systems try to fix this by letting AI decide what's worth rememberi

 <table>
 <tr>
-<td align="center"><strong>96.6%</strong><br><sub>LongMemEval R@5<br>Zero API calls</sub></td>
-<td align="center"><strong>100%</strong><br><sub>LongMemEval R@5<br>with Haiku rerank</sub></td>
-<td align="center"><strong>+34%</strong><br><sub>Retrieval boost<br>from palace structure</sub></td>
+<td align="center"><strong>96.6%</strong><br><sub>LongMemEval R@5<br><b>raw mode</b>, zero API calls</sub></td>
+<td align="center"><strong>500/500</strong><br><sub>questions tested<br>independently reproduced</sub></td>
 <td align="center"><strong>$0</strong><br><sub>No subscription<br>No cloud. Local only.</sub></td>
 </tr>
 </table>

-<sub>Reproducible — runners in <a href="benchmarks/">benchmarks/</a>. <a href="benchmarks/BENCHMARKS.md">Full results</a>.</sub>
+<sub>Reproducible — runners in <a href="benchmarks/">benchmarks/</a>. <a href="benchmarks/BENCHMARKS.md">Full results</a>. The 96.6% is from <b>raw verbatim mode</b>, not AAAK or rooms mode (those score lower — see <a href="#a-note-from-milla--ben--april-7-2026">note above</a>).</sub>

 </div>

 ---

+## A Note from Milla & Ben — April 7, 2026
+
+> The community caught real problems in this README within hours of launch and we want to address them directly.
+>
+> **What we got wrong:**
+>
+> - **The AAAK token example was incorrect.** We used a rough heuristic (`len(text)//3`) for token counts instead of an actual tokenizer. Real counts via OpenAI's tokenizer: the English example is 66 tokens, the AAAK example is 73. AAAK does not save tokens at small scales — it's designed for *repeated entities at scale*, and the README example was a bad demonstration of that. We're rewriting it.
+>
+> - **"30x lossless compression" was overstated.** AAAK is a lossy abbreviation system (entity codes, sentence truncation). Independent benchmarks show AAAK mode scores **84.2% R@5 vs raw mode's 96.6%** on LongMemEval — a 12.4 point regression. The honest framing is: AAAK is an experimental compression layer that trades fidelity for token density, and **the 96.6% headline number is from RAW mode, not AAAK**.
+>
+> - **"+34% palace boost" was misleading.** That number compares unfiltered search to wing+room metadata filtering. Metadata filtering is a standard ChromaDB feature, not a novel retrieval mechanism. Real and useful, but not a moat.
+>
+> - **"Contradiction detection"** exists as a separate utility (`fact_checker.py`) but is not currently wired into the knowledge graph operations as the README implied.
+>
+> - **"100% with Haiku rerank"** is real (we have the result files) but the rerank pipeline is not in the public benchmark scripts. We're adding it.
+>
+> **What's still true and reproducible:**
+>
+> - **96.6% R@5 on LongMemEval in raw mode**, on 500 questions, zero API calls — independently reproduced on M2 Ultra in under 5 minutes by [@gizmax](https://github.com/milla-jovovich/mempalace/issues/39).
+> - Local, free, no subscription, no cloud, no data leaving your machine.
+> - The architecture (wings, rooms, closets, drawers) is real and useful, even if it's not a magical retrieval boost.
+>
+> **What we're doing:**
+>
+> 1. Rewriting the AAAK example with real tokenizer counts and a scenario where AAAK actually demonstrates compression
+> 2. Adding `mode raw / aaak / rooms` clearly to the benchmark documentation so the trade-offs are visible
+> 3. Wiring `fact_checker.py` into the KG ops so the contradiction detection claim becomes true
+> 4. Pinning ChromaDB to a tested range (Issue #100), fixing the shell injection in hooks (#110), and addressing the macOS ARM64 segfault (#74)
+>
+> **Thank you to everyone who poked holes in this.** Brutal honest criticism is exactly what makes open source work, and it's what we asked for. Special thanks to [@panuhorsmalahti](https://github.com/milla-jovovich/mempalace/issues/43), [@lhl](https://github.com/milla-jovovich/mempalace/issues/27), [@gizmax](https://github.com/milla-jovovich/mempalace/issues/39), and everyone who filed an issue or a PR in the first 48 hours. We're listening, we're fixing, and we'd rather be right than impressive.
+>
+> — *Milla Jovovich & Ben Sigman*
+
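The token-count mistake described in the note can be sketched as follows. This is a minimal illustration, not the project's code: `heuristic_count` mirrors the `len(text)//3` estimate the note mentions, while `standin_count` is a hypothetical word-and-punctuation counter used only so the example runs without dependencies. It is not OpenAI's tokenizer, so its numbers will not match the 66 and 73 quoted in the note; passing a real tokenizer's counting function to `compare` is what the note actually recommends. The two sample strings are the README examples removed further down in this diff.

```python
import re
from typing import Callable, Dict

# The two README examples that this commit removes.
ENGLISH = (
    "Priya manages the Driftwood team: Kai (backend, 3 years), Soren (frontend), "
    "Maya (infrastructure), and Leo (junior, started last month). They're building "
    "a SaaS analytics platform. Current sprint: auth migration to Clerk. "
    "Kai recommended Clerk over Auth0 based on pricing and DX."
)
AAAK = (
    "TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) MAY(infra) LEO(junior,new)\n"
    "PROJ: DRIFTWOOD(saas.analytics) | SPRINT: auth.migration→clerk\n"
    "DECISION: KAI.rec:clerk>auth0(pricing+dx) | ★★★★"
)


def heuristic_count(text: str) -> int:
    """The rough estimate from the note: one token per three characters."""
    return len(text) // 3


def standin_count(text: str) -> int:
    """ASSUMPTION: a stand-in counter (words plus single punctuation marks).

    This is NOT a real BPE tokenizer. To reproduce real counts, pass a
    function built on an actual tokenizer instead, for example with the
    tiktoken package: lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s)).
    Which encoding the authors used is not stated in the note.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def compare(english: str, compact: str, count: Callable[[str], int]) -> Dict[str, int]:
    """Count both versions with the same counter and report the difference."""
    e, c = count(english), count(compact)
    return {"english": e, "compact": c, "saved": e - c}


if __name__ == "__main__":
    print("heuristic:", compare(ENGLISH, AAAK, heuristic_count))
    print("stand-in :", compare(ENGLISH, AAAK, standin_count))
```

Punctuation-dense text such as the AAAK sample tends to split into more pieces than a character-length estimate suggests, which is the kind of gap a character heuristic hides and a real tokenizer exposes.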
+---
+
 ## Quick Start

 ```bash
@@ -190,7 +226,7 @@ You say what you're looking for and boom, it already knows which wing to go to.
 **Rooms** — specific topics within a wing. Auth, billing, deploy — endless rooms.
 **Halls** — connections between related rooms *within* the same wing. If Room A (auth) and Room B (security) are related, a hall links them.
 **Tunnels** — connections *between* wings. When Person A and a Project both have a room about "auth," a tunnel cross-references them automatically.
-**Closets** — compressed summaries that point to the original content. Fast for AI to read.
+**Closets** — summaries that point to the original content. (In v3.0.0 these are plain-text summaries; AAAK-encoded closets are coming in a future update — see [Task #30](https://github.com/milla-jovovich/mempalace/issues/30).)
 **Drawers** — the original verbatim files. The exact words, never summarized.

 **Halls** are memory types — the same in every wing, acting as corridors:
@@ -234,30 +270,23 @@ Wings and rooms aren't cosmetic. They're a **34% retrieval improvement**. The pa

 Your AI wakes up with L0 + L1 (~170 tokens) and knows your world. Searches only fire when needed.

-### AAAK Compression
+### AAAK Dialect (experimental)

-AAAK is a lossless dialect — 30x compression, readable by any LLM without a decoder. It works with **Claude, GPT, Gemini, Llama, Mistral** — any model that reads text. Run it against a local Llama model and your whole memory stack stays offline.
+AAAK is a lossy abbreviation system — entity codes, structural markers, and sentence truncation — designed to pack repeated entities and relationships into fewer tokens at scale. It is **readable by any LLM that reads text** (Claude, GPT, Gemini, Llama, Mistral) without a decoder, so a local model can use it without any cloud dependency.

-**English (~1000 tokens):**
-```
-Priya manages the Driftwood team: Kai (backend, 3 years), Soren (frontend),
-Maya (infrastructure), and Leo (junior, started last month). They're building
-a SaaS analytics platform. Current sprint: auth migration to Clerk.
-Kai recommended Clerk over Auth0 based on pricing and DX.
-```
+**Honest status (April 2026):**

-**AAAK (~120 tokens):**
-```
-TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) MAY(infra) LEO(junior,new)
-PROJ: DRIFTWOOD(saas.analytics) | SPRINT: auth.migration→clerk
-DECISION: KAI.rec:clerk>auth0(pricing+dx) | ★★★★
-```
+- **AAAK is lossy, not lossless.** It uses regex-based abbreviation, not reversible compression.
+- **It does not save tokens at small scales.** Short text already tokenizes efficiently. AAAK overhead (codes, separators) costs more than it saves on a few sentences.
+- **It can save tokens at scale** — in scenarios with many repeated entities (a team mentioned hundreds of times, the same project across thousands of sessions), the entity codes amortize.
+- **AAAK currently regresses LongMemEval** vs raw verbatim retrieval (84.2% R@5 vs 96.6%). The 96.6% headline number is from **raw mode**, not AAAK mode.
+- **The MemPalace storage default is raw verbatim text in ChromaDB** — that's where the benchmark wins come from. AAAK is a separate compression layer for context loading, not the storage format.

-Same information. 8x fewer tokens. Your AI learns AAAK automatically from the MCP server — no manual setup.
+We're iterating on the dialect spec, adding a real tokenizer for stats, and exploring better break points for when to use it. Track progress in [Issue #43](https://github.com/milla-jovovich/mempalace/issues/43) and [#27](https://github.com/milla-jovovich/mempalace/issues/27).
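The "saves tokens only at scale" point above can be made concrete with a simplified break-even model. This is a sketch under loud assumptions, not a description of how AAAK is implemented: it assumes each abbreviated entity carries a one-time fixed overhead (for instance a legend entry or separators) and a constant per-mention saving. The function names and all token figures are hypothetical.

```python
from typing import Optional


def net_saving(mentions: int, full_tokens: int, code_tokens: int, overhead_tokens: int) -> int:
    """Tokens saved (negative means tokens lost) after `mentions` uses of one entity.

    ASSUMPTION: a fixed one-time overhead per entity plus a constant
    saving per mention. Real tokenizers and real AAAK output may not
    behave this regularly.
    """
    return mentions * (full_tokens - code_tokens) - overhead_tokens


def break_even_mentions(full_tokens: int, code_tokens: int, overhead_tokens: int) -> Optional[int]:
    """Smallest number of mentions at which abbreviating starts to save tokens."""
    saving_per_mention = full_tokens - code_tokens
    if saving_per_mention <= 0:
        return None  # the code is no shorter than the name, so it never pays off
    # Need mentions * saving_per_mention > overhead_tokens.
    return overhead_tokens // saving_per_mention + 1


if __name__ == "__main__":
    # Hypothetical figures: a 4-token name, a 2-token code, 6 tokens of overhead.
    print(break_even_mentions(4, 2, 6))
    print(net_saving(3, 4, 2, 6), net_saving(400, 4, 2, 6))
```

Under this model a handful of mentions loses tokens or breaks even, while hundreds of mentions of the same entity recover the overhead many times over, which matches the small-scale versus at-scale distinction drawn in the list above.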

-### Contradiction Detection
+### Contradiction Detection (experimental, not yet wired into KG)

-MemPalace catches mistakes before they reach you:
+A separate utility (`fact_checker.py`) can check assertions against entity facts. It's not currently called automatically by the knowledge graph operations — this is being fixed (track in [Issue #27](https://github.com/milla-jovovich/mempalace/issues/27)). When enabled it catches things like:

 ```
 Input:  "Soren finished the auth migration"