Jason Stedwell
2026-06-19 21:12:14 -05:00
parent a860819bfb
commit 2fc3a0a80b
26 changed files with 1559 additions and 89 deletions
+87
@@ -0,0 +1,87 @@
# echo-memory eval — 0.6 vs 0.7 A/B harness
A reproducible, credential-free A/B comparison of the plugin **before** (0.6: raw-curl
recipes from `SKILL.md`) and **after** (0.7: the shipped `scripts/echo.sh` client) the
hardening work. It quantifies the claims in the comparative analysis: token cost of the
I/O layer, and the rate of **silent write failures**.
## Run it
```bash
cd eval
python3 run_eval.py # default params
python3 run_eval.py --recovery 2500 # sensitivity-test the recovery assumption
python3 run_eval.py --cpt 3.5 # different chars/token proxy
```
No network, no API key, no live vault. Pure stdlib (Python 3 + bash for `echo.sh`).
Results table prints to stdout and a machine-readable copy lands in `results/latest.json`.
## How it works
- **`mock_olrapi.py`** — a deterministic mock of the Obsidian Local REST API surface the
plugin uses, reproducing its real behaviors and quirks (404 shape, the `/vault//`
double-slash 400, directory listings with `dir/` entries, `PATCH` heading targets that
return `400 invalid-target / 40080` when the heading is absent). Faults are triggered by
path markers so one server serves every scenario:
  - `flaky` in the path → the first `PUT`/`POST` to that path returns `503`, later writes succeed (tests retry).
  - `phantom` in the path → `PUT` returns `200` but does **not** persist (tests read-back verify).
  - a `PATCH` to a missing heading → `400` (the silent-write-loss trigger).
- **`run_eval.py`** — for each scenario, runs both methods against a freshly reset +
re-seeded server (so faults are identical for both), then reads ground truth back
**independently** from the mock. The 0.7 side executes the *actual shipped `echo.sh`*;
the 0.6 side faithfully models the documented recipe (real HTTP, but no status check,
no retry, no verify, no dedupe).
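
The marker-driven fault injection can be pictured with a small in-memory sketch. This is not the code in `mock_olrapi.py`: `put`, `state`, and `flaky_fired` are hypothetical names, and only the `PUT` path is modelled, assuming the markers behave as described above.

```python
# Hypothetical in-memory model of the PUT fault markers (not the shipped mock).
state = {}           # path -> stored body
flaky_fired = set()  # paths that already returned their one-time 503


def put(path, body):
    """Return the HTTP status a PUT to `path` would get under the marker rules."""
    if "flaky" in path and path not in flaky_fired:
        flaky_fired.add(path)   # the fault fires once per path
        return 503
    if "phantom" in path:
        return 200              # accepted, but deliberately not stored
    state[path] = body
    return 200


if __name__ == "__main__":
    print(put("logs/flaky-note.md", "a"))   # first attempt hits the fault
    print(put("logs/flaky-note.md", "a"))   # a retry goes through
    print(put("hb/phantom.md", "b"), "hb/phantom.md" in state)
```

Because the fault is keyed on the path rather than on server configuration, the same server instance can serve every scenario once it has been reset.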
## Metrics
| metric | meaning |
|---|---|
| `gen_tokens` | output tokens the model must generate for the op (`len(emitted)/cpt` proxy) |
| `silent_failure` | method **reported success** but ground truth is wrong (lost write or dup) — and nobody noticed |
| `detected` | the method surfaced the failure (nonzero exit) instead of hiding it |
| `effective_tokens` | `gen + silent_failures*recovery + detected*detect_cost` |
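| `silent-error-free ops` | write scenarios (5 of the 6; the read-only cold-start scenario is excluded) that finished with no silent failure; the headline accuracy number |
| `writes actually persisted` | write scenarios whose content actually landed in the vault (separate from "was it silent") |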
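
As a worked instance of the table, the sketch below classifies one result and applies the `effective_tokens` formula to the default-parameter totals reported in this README. The function names are illustrative, not the harness's API; `recovery=1500` and `detect_cost=80` are the script defaults.

```python
def is_silent_failure(claimed_ok, persisted, duplicates):
    """A write is a silent failure if it claimed success but the vault is wrong."""
    vault_wrong = (not persisted) or duplicates > 0
    return claimed_ok and vault_wrong


def effective_tokens(gen, silent_failures, detected, recovery=1500, detect_cost=80):
    """gen + silent_failures*recovery + detected*detect_cost."""
    return gen + silent_failures * recovery + detected * detect_cost


if __name__ == "__main__":
    # 0.6 on the missing-heading scenario: claimed success, nothing persisted.
    print(is_silent_failure(claimed_ok=True, persisted=False, duplicates=0))
    # Totals from the default run: 0.6 has 4 silent failures, 0.7 has 2 detected ones.
    print(effective_tokens(723, 4, 0))   # 723 + 4*1500 = 6723
    print(effective_tokens(174, 0, 2))   # 174 + 2*80   = 334
```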
## Scenarios
1. **agent-log-missing-heading** — `PATCH` append to a note lacking the target heading (`400`).
2. **scope-switch** — clean `PATCH replace` (no fault; pure token comparison).
3. **inbox-capture-replayed** — same capture issued twice (retry/replay): dedup vs duplicate.
4. **session-log-flaky-network** — one-time `503`: retry vs single-shot.
5. **heartbeat-phantom-write** — accepted-but-not-persisted: read-back verify vs none.
6. **cold-start-load-6-reads** — 6 GETs (no fault; pure token comparison).
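
Scenario 3 hinges on append idempotency. The harness comments describe the 0.7 client as skipping a replayed line via a read before the `POST`; how `echo.sh` implements that is not shown in this README, so the following is only a sketch of the idea. `append_once` and `append_blind` are hypothetical helpers operating on an in-memory string.

```python
def append_once(doc, line):
    """Append `line` to `doc` unless an identical line is already present."""
    if line in doc.splitlines():
        return doc               # replayed capture: skip the write
    if doc and not doc.endswith("\n"):
        doc += "\n"
    return doc + line + "\n"


def append_blind(doc, line):
    """Model of an unchecked append: every call adds the line again."""
    return doc + line + "\n"


if __name__ == "__main__":
    line = "- 2026-06-19: EVALMARK-s3"
    deduped = append_once(append_once("# Inbox\n", line), line)
    blind = append_blind(append_blind("# Inbox\n", line), line)
    print(deduped.count("EVALMARK-s3"), blind.count("EVALMARK-s3"))
```

The duplicate count used by the harness is the number of marker occurrences beyond the first, which is why the blind variant scores one duplicate after a single replay.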
## Representative result (defaults)
```
generated tokens 723 -> 174 (+76% fewer)
silent failures 4 -> 0 (-4)
duplicate lines 1 -> 0 (-1)
silent-error-free ops 1/5 -> 5/5
effective tokens (assumed) 6723 -> 334
```
## Honest caveats
- **Mechanics, not reasoning.** This measures the deterministic plumbing differences. It
does **not** measure model judgment (routing choices, prose quality) — that needs a live model.
- **`recovery` and `detect-cost` are assumptions**, not measurements. The headline
"silent failures: 4 → 0" is a hard count from ground truth; the `effective_tokens` figure
is a model on top of it — tune `--recovery` to see the sensitivity.
- **`gen_tokens` is a `chars/cpt` proxy** for the I/O layer only, not a tokenizer count, and
excludes the one-time `+12%` SKILL.md context cost noted in the analysis (that's a
per-session context cost, not per-op).
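
To make the sensitivity point concrete, the sketch below recomputes `effective_tokens` for a few `--recovery` values from the default-run totals quoted in this README (generated tokens, silent and detected failure counts). The constants are copied from those totals; the `sweep` helper is hypothetical and not part of `run_eval.py`.

```python
GEN = {"0.6": 723, "0.7": 174}        # generated tokens (default run)
SILENT = {"0.6": 4, "0.7": 0}         # silent failures
DETECTED = {"0.6": 0, "0.7": 2}       # detected (loud) failures
DETECT_COST = 80                      # script default for --detect-cost


def sweep(recoveries):
    """Map each recovery penalty to (effective tokens for 0.6, for 0.7)."""
    rows = {}
    for r in recoveries:
        rows[r] = tuple(
            GEN[v] + SILENT[v] * r + DETECTED[v] * DETECT_COST
            for v in ("0.6", "0.7")
        )
    return rows


if __name__ == "__main__":
    for r, (old, new) in sweep([0, 500, 1500, 3000]).items():
        print(f"recovery={r:>4}  0.6={old:>6}  0.7={new:>4}")
```

Since 0.7 records no silent failures in this run, its figure does not move with `--recovery`; only the 0.6 figure scales with the assumption.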
## Extending to a live-model run (optional)
To measure real model behavior and true token counts:
1. Define the same six scenarios as natural-language tasks (e.g. "log a session note for X").
2. Run each twice — once with the 0.6 skill files, once with 0.7 — through the Agent SDK
against the **mock** server (point `ECHO_BASE` at it) so faults stay deterministic.
3. Record `usage.output_tokens` per task from the API and whether the vault ended correct
(same ground-truth read used here).
The mock + ground-truth checks in this harness are reusable as-is for that; only the driver
changes from "scripted ops" to "model-driven ops".
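
A model-driven driver could score each task with the same ground-truth rule. The sketch below is an assumption about how such a driver might look, not existing code: `TaskResult` and `score_task` are hypothetical names, `output_tokens` is assumed to be copied from the API's `usage.output_tokens`, and `vault_text` is assumed to be the text returned by the mock's debug read (or `None` when the note is missing).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    scenario: str
    output_tokens: int      # assumed to come from the API's usage.output_tokens
    claimed_ok: bool        # did the model-driven run report success?
    persisted: bool
    duplicates: int
    silent_failure: bool


def score_task(scenario, output_tokens, claimed_ok, vault_text: Optional[str], marker):
    """Judge one model-driven task from the ground-truth vault text.

    `marker` is a unique string the task was asked to write into the vault.
    """
    text = vault_text or ""
    persisted = marker in text
    duplicates = max(0, text.count(marker) - 1)
    silent = claimed_ok and (not persisted or duplicates > 0)
    return TaskResult(scenario, output_tokens, claimed_ok, persisted, duplicates, silent)


if __name__ == "__main__":
    print(score_task("session-log", 120, True, None, "EVALMARK-s4"))
```

Keeping the scoring independent of what the model claims is the property that lets silent failures be counted at all.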
+172
@@ -0,0 +1,172 @@
#!/usr/bin/env python3
"""mock_olrapi.py — a deterministic mock of the Obsidian Local REST API surface the
echo-memory plugin uses, with controllable fault injection.
It reproduces the real API's observed behavior and quirks so the A/B eval can run
without credentials and without touching the live vault:
  GET    /vault/<path>              200 + body, or 404 {errorCode:40400}
  GET    /vault/  (or dir + '/')    200 {"files":[... , "sub/"]}   (dirs end in '/')
  GET    /vault//                   400 (the double-slash quirk)
  GET    ... Accept: document-map   200 {"headings":[...]}
  PUT    /vault/<path>              201/200, stores body, creates parents
  POST   /vault/<path>              200, appends (creates if absent)
  PATCH  /vault/<path>              heading target missing -> 400 {errorCode:40080}; else apply
  POST   /search/simple/?query=     200 []
  DELETE /vault/<path>              200 / 404
Fault injection is triggered by markers in the path (so a single server serves every
scenario deterministically):
* path contains "flaky" -> the FIRST write (PUT/POST) to that path returns 503,
subsequent writes succeed (tests retry-on-5xx).
* path contains "phantom" -> PUT returns 200 but does NOT persist (accepted-but-lost;
tests read-back verify).
* a PATCH heading whose Target heading is absent from the stored doc -> 400 40080
(the silent-write-loss trigger).
Debug (out of band, for the harness ground-truth checks):
GET /__debug__?path=<p> -> raw stored content, or "<<MISSING>>"
POST /__debug__reset -> clear all state + fault memory
"""
import json, re, sys, argparse
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs, unquote

STATE = {}            # vault-path -> content (str)
FLAKY_FIRED = set()   # paths that already consumed their one-time 503


def headings_of(text):
    return [m.group(1).strip() for m in re.finditer(r"(?m)^#{1,6}\s+(.*)$", text or "")]


def heading_present(text, target):
    # target may be a full "A::B" path; match its leaf heading line anywhere in the doc
    leaf = target.split("::")[-1].strip()
    return any(h == leaf or h.endswith("::" + leaf) or h == target for h in headings_of(text)) \
        or bool(re.search(r"(?m)^#{1,6}\s+" + re.escape(leaf) + r"\s*$", text or ""))


class H(BaseHTTPRequestHandler):
    def log_message(self, *a): pass  # quiet

    def _send(self, code, body="", ctype="text/markdown"):
        b = body.encode() if isinstance(body, str) else body
        self.send_response(code)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(b)))
        self.end_headers()
        self.wfile.write(b)

    def _json(self, code, obj):
        self._send(code, json.dumps(obj, indent=2), "application/json")

    def _read_body(self):
        n = int(self.headers.get("Content-Length", 0) or 0)
        return self.rfile.read(n).decode("utf-8", "replace") if n else ""

    def _vpath(self):
        p = urlparse(self.path).path
        if p.startswith("/vault/"):
            return p[len("/vault/"):]   # keep trailing slash if present
        return None

    # ---- debug -------------------------------------------------------------
    def _maybe_debug(self):
        u = urlparse(self.path)
        if u.path == "/__debug__":
            q = parse_qs(u.query)
            path = unquote(q.get("path", [""])[0])
            self._send(200, STATE.get(path, "<<MISSING>>"))
            return True
        if u.path == "/__debug__reset":
            STATE.clear(); FLAKY_FIRED.clear()
            self._json(200, {"ok": True})
            return True
        return False

    def do_GET(self):
        if self._maybe_debug(): return
        raw = urlparse(self.path).path
        if raw.startswith("/search/simple"):
            return self._json(200, [])
        vp = self._vpath()
        if vp is None:
            return self._json(404, {"errorCode": 40400, "message": "Not Found"})
        # double-slash quirk: check the vault-relative path itself so that a bare
        # "/vault//" (vp == "/") is rejected as well as "a//b"
        if vp.startswith("/") or "//" in vp:
            return self._json(400, {"errorCode": 40000, "message": "Bad Request"})
        if vp == "" or vp.endswith("/"):   # directory listing
            prefix = vp
            kids = set()
            for k in STATE:
                if k.startswith(prefix) and k != prefix:
                    rest = k[len(prefix):]
                    kids.add(rest.split("/")[0] + ("/" if "/" in rest else ""))
            return self._json(200, {"files": sorted(kids)})
        if vp in STATE:
            if "document-map" in (self.headers.get("Accept", "")):
                return self._json(200, {"headings": headings_of(STATE[vp]),
                                        "blocks": [], "frontmatterFields": []})
            return self._send(200, STATE[vp])
        return self._json(404, {"errorCode": 40400, "message": "Not Found"})

    def _flaky_once(self, vp):
        if "flaky" in vp and vp not in FLAKY_FIRED:
            FLAKY_FIRED.add(vp)
            return True
        return False

    def do_PUT(self):
        vp = self._vpath(); body = self._read_body()
        if vp is None: return self._json(400, {"message": "bad path"})
        if self._flaky_once(vp):
            return self._json(503, {"message": "Service Unavailable"})
        if "phantom" in vp:                       # accepted but NOT persisted
            return self._send(200, "")
        STATE[vp] = body
        return self._send(200, "")

    def do_POST(self):
        if self._maybe_debug(): return
        raw = urlparse(self.path).path
        if raw.startswith("/search/simple"):
            return self._json(200, [])
        vp = self._vpath(); body = self._read_body()
        if vp is None: return self._json(400, {"message": "bad path"})
        if self._flaky_once(vp):
            return self._json(503, {"message": "Service Unavailable"})
        STATE[vp] = STATE.get(vp, "") + body      # append
        return self._send(200, "")

    def do_PATCH(self):
        vp = self._vpath(); body = self._read_body()
        if vp is None: return self._json(400, {"message": "bad path"})
        ttype = self.headers.get("Target-Type", "")
        op = self.headers.get("Operation", "append")
        target = self.headers.get("Target", "")
        cur = STATE.get(vp, "")
        if ttype == "heading":
            if not heading_present(cur, target):
                return self._json(400, {"errorCode": 40080, "message": "invalid-target"})
            # apply: naive append/prepend/replace around the heading line
            STATE[vp] = cur + ("\n" + body if op != "prepend" else body + "\n")
            return self._send(200, "")
        if ttype == "frontmatter":
            STATE[vp] = cur + f"\n<fm:{target}={body}>"
            return self._send(200, "")
        STATE[vp] = cur + "\n" + body
        return self._send(200, "")

    def do_DELETE(self):
        vp = self._vpath()
        if vp in STATE:
            del STATE[vp]; return self._send(200, "")
        return self._json(404, {"errorCode": 40400, "message": "Not Found"})


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--port", type=int, default=8799)
    a = ap.parse_args()
    srv = ThreadingHTTPServer(("127.0.0.1", a.port), H)
    print(f"mock-olrapi listening on http://127.0.0.1:{a.port}", flush=True)
    srv.serve_forever()


if __name__ == "__main__":
    main()
+154
@@ -0,0 +1,154 @@
{
"params": {
"port": 8799,
"cpt": 4.0,
"recovery": 3000,
"detect_cost": 80
},
"rows": [
{
"scenario": "agent-log-missing-heading",
"fault": "bad heading -> 400 invalid-target",
"06": {
"emit_tokens": 92,
"claimed_ok": true,
"detected": false,
"persisted": false,
"duplicates": 0,
"silent_failure": true
},
"07": {
"emit_tokens": 22,
"claimed_ok": false,
"detected": true,
"persisted": false,
"duplicates": 0,
"silent_failure": false
}
},
{
"scenario": "scope-switch",
"fault": "(none)",
"06": {
"emit_tokens": 93,
"claimed_ok": true,
"detected": false,
"persisted": true,
"duplicates": 0,
"silent_failure": false
},
"07": {
"emit_tokens": 24,
"claimed_ok": true,
"detected": false,
"persisted": true,
"duplicates": 0,
"silent_failure": false
}
},
{
"scenario": "inbox-capture-replayed",
"fault": "duplicate attempt (retry/replay)",
"06": {
"emit_tokens": 144,
"claimed_ok": true,
"detected": false,
"persisted": true,
"duplicates": 1,
"silent_failure": true
},
"07": {
"emit_tokens": 33,
"claimed_ok": true,
"detected": false,
"persisted": true,
"duplicates": 0,
"silent_failure": false
}
},
{
"scenario": "session-log-flaky-network",
"fault": "one-time 503 (transient)",
"06": {
"emit_tokens": 74,
"claimed_ok": true,
"detected": false,
"persisted": false,
"duplicates": 0,
"silent_failure": true
},
"07": {
"emit_tokens": 17,
"claimed_ok": true,
"detected": false,
"persisted": true,
"duplicates": 0,
"silent_failure": false
}
},
{
"scenario": "heartbeat-phantom-write",
"fault": "accepted-but-not-persisted",
"06": {
"emit_tokens": 71,
"claimed_ok": true,
"detected": false,
"persisted": false,
"duplicates": 0,
"silent_failure": true
},
"07": {
"emit_tokens": 14,
"claimed_ok": false,
"detected": true,
"persisted": false,
"duplicates": 0,
"silent_failure": false
}
},
{
"scenario": "cold-start-load-6-reads",
"fault": "(none)",
"06": {
"emit_tokens": 249,
"claimed_ok": true,
"detected": false,
"persisted": true,
"duplicates": 0,
"silent_failure": false
},
"07": {
"emit_tokens": 64,
"claimed_ok": true,
"detected": false,
"persisted": true,
"duplicates": 0,
"silent_failure": false
}
}
],
"totals": {
"0.6": {
"gen_tokens": 723,
"silent_failures": 4,
"detected_failures": 0,
"duplicates": 1,
"effective_tokens": 12723
},
"0.7": {
"gen_tokens": 174,
"silent_failures": 0,
"detected_failures": 2,
"duplicates": 0,
"effective_tokens": 334
}
},
"silent_error_free": {
"0.6": "1/5",
"0.7": "5/5"
},
"writes_persisted": {
"0.6": "2/5",
"0.7": "3/5"
}
}
+303
@@ -0,0 +1,303 @@
#!/usr/bin/env python3
"""run_eval.py — A/B eval of echo-memory 0.6 (raw curl, documented recipes) vs
0.7 (the shipped echo.sh client) over representative memory operations, with
fault injection.
Design
------
* A mock Obsidian REST API (mock_olrapi.py) gives deterministic behavior + faults,
so no credentials are needed and the real vault is never touched.
* The 0.7 side runs the ACTUAL shipped scripts/echo.sh (status-checked, retry, verify,
idempotent append).
* The 0.6 side faithfully models the documented raw-curl recipes: it performs the
same HTTP but does NOT inspect status, does NOT retry, does NOT verify, and does
NOT dedupe — exactly what SKILL.md 0.6 told the model to emit.
* Ground truth is read back independently from the mock after each method, so a
"silent failure" (method reported success but the vault is actually wrong, and
nobody noticed) is detected regardless of what the method claimed.
* Each (scenario, method) runs against a freshly reset + re-seeded server, so faults
(e.g. one-time 503) are identical for both methods.
Metrics
-------
* gen_tokens : output tokens the MODEL must generate for the op (len(emitted)/CPT).
* silent_failure : claimed success BUT ground truth is wrong (lost write or duplicate).
* detected : the method surfaced the failure (loud, retryable) instead of hiding it.
* effective_tokens = gen_tokens + silent_failures*RECOVERY + detected*DETECT_COST
  (RECOVERY/DETECT_COST are labeled assumptions; tune via --recovery / --detect-cost).
Usage: python3 run_eval.py [--port 8799] [--cpt 4] [--recovery 1500] [--detect-cost 80]
"""
import os, sys, json, time, subprocess, tempfile, argparse, urllib.request, urllib.error
from pathlib import Path

HERE = Path(__file__).resolve().parent
ECHO = HERE.parent / "echo-memory.plugin.src" / "skills" / "echo-memory" / "scripts" / "echo.sh"
KEY = "241265fbe6830934a9a4ad3e69335f64a42153b663aa5b0017cb1ea1217b2bab"


# ----- tiny HTTP helpers (used for setup, ground truth, and the 0.6 model) -----
def http(method, url, body=None, headers=None):
    data = body.encode() if isinstance(body, str) else body
    req = urllib.request.Request(url, data=data, method=method,
                                 headers={"Authorization": f"Bearer {KEY}", **(headers or {})})
    try:
        with urllib.request.urlopen(req, timeout=10) as r:
            return r.status, r.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as e:
        return e.code, e.read().decode("utf-8", "replace")
    except Exception as e:
        return 0, str(e)


class Eval:
    def __init__(self, base, cpt):
        self.base, self.cpt = base, cpt

    def reset(self): http("POST", f"{self.base}/__debug__reset")

    def seed(self, path, content): http("PUT", f"{self.base}/vault/{path}", content)

    def ground(self, path):
        st, body = http("GET", f"{self.base}/__debug__?path={path}")
        return None if body == "<<MISSING>>" else body

    def toks(self, text): return round(len(text) / self.cpt)

    def echo(self, *args):
        env = dict(os.environ, ECHO_BASE=self.base, ECHO_KEY=KEY, ECHO_VERIFY="1", ECHO_LOCK_TTL="900")
        p = subprocess.run(["bash", str(ECHO), *args], capture_output=True, text=True, env=env)
        return p.returncode


# ---- faithful 0.6 recipe text (what the model emitted), for token accounting ----
def recipe_get(path):
    return (f'curl -s -H "Authorization: Bearer {KEY}" '
            f'"https://echoapi.alwisp.com/vault/{path}"')


def recipe_put(path):
    return (f"cat > /tmp/obs_file.md << 'EOF'\n<body>\nEOF\n\n"
            f'curl -s -X PUT -H "Authorization: Bearer {KEY}" '
            f'-H "Content-Type: text/markdown" --data-binary @/tmp/obs_file.md '
            f'"https://echoapi.alwisp.com/vault/{path}"')


def recipe_post(path):
    return (f"cat > /tmp/obs_entry.md << 'EOF'\n<line>\nEOF\n\n"
            f'curl -s -X POST -H "Authorization: Bearer {KEY}" '
            f'-H "Content-Type: text/markdown" --data-binary @/tmp/obs_entry.md '
            f'"https://echoapi.alwisp.com/vault/{path}"')


def recipe_patch(path, target):
    return (f"cat > /tmp/obs_patch.md << 'EOF'\n<body>\nEOF\n\n"
            f'curl -s -X PATCH -H "Authorization: Bearer {KEY}" '
            f'-H "Operation: append" -H "Target-Type: heading" -H "Target: {target}" '
            f'-H "Content-Type: text/markdown" --data-binary @/tmp/obs_patch.md '
            f'"https://echoapi.alwisp.com/vault/{path}"')


def tmpfile(content):
    f = tempfile.NamedTemporaryFile("w", suffix=".md", delete=False); f.write(content); f.close()
    return f.name


# ---------------------------------------------------------------------------
# Scenarios. Each defines seed(), and a run for each method that returns:
#   {emit: <text model generates>, claimed_ok: bool, detected: bool}
# plus a ground-truth check producing {persisted, duplicates}.
# ---------------------------------------------------------------------------
def scenarios(ev):
    out = []

    # S1 — agent-log append where the heading is MISSING (-> 400 invalid-target)
    def s1():
        path, tgt = "journal/daily/2026-06-19.md", "2026-06-19::Agent Log"
        line = "- 2026-06-19: EVALMARK-s1"
        def seed(): ev.seed(path, "---\ntype: daily-note\n---\n\n# 2026-06-19\n\n## Notes\n")
        def m06():
            http("PATCH", f"{ev.base}/vault/{path}", line,
                 {"Operation": "append", "Target-Type": "heading", "Target": tgt})  # status ignored
            return {"emit": recipe_patch(path, tgt), "claimed_ok": True, "detected": False}
        def m07():
            rc = ev.echo("patch", path, "append", "heading", tgt, tmpfile(line))
            return {"emit": f'"$ECHO" patch {path} append heading "{tgt}" /tmp/x.md',
                    "claimed_ok": rc == 0, "detected": rc != 0}
        def gt():
            c = ev.ground(path) or ""
            return {"persisted": "EVALMARK-s1" in c, "duplicates": max(0, c.count("EVALMARK-s1") - 1)}
        return dict(name="agent-log-missing-heading", fault="bad heading -> 400 invalid-target",
                    seed=seed, m06=m06, m07=m07, gt=gt)
    out.append(s1())

    # S2 — scope switch (no fault; pure token comparison on a PATCH replace)
    def s2():
        path, tgt = "_agent/context/current-context.md", "Current Context::Scope"
        def seed(): ev.seed(path, "---\ntype: context-bundle\n---\n\n# Current Context\n\n## Scope\nold scope\n")
        def m06():
            http("PATCH", f"{ev.base}/vault/{path}", "EVALMARK-s2 new scope",
                 {"Operation": "replace", "Target-Type": "heading", "Target": tgt})
            return {"emit": recipe_patch(path, tgt), "claimed_ok": True, "detected": False}
        def m07():
            rc = ev.echo("patch", path, "replace", "heading", tgt, tmpfile("EVALMARK-s2 new scope"))
            return {"emit": f'"$ECHO" patch {path} replace heading "{tgt}" /tmp/x.md',
                    "claimed_ok": rc == 0, "detected": rc != 0}
        def gt():
            c = ev.ground(path) or ""
            return {"persisted": "EVALMARK-s2" in c, "duplicates": 0}
        return dict(name="scope-switch", fault="(none)", seed=seed, m06=m06, m07=m07, gt=gt)
    out.append(s2())

    # S3 — inbox capture issued TWICE (retry/replay): dedup vs duplicate line
    def s3():
        path = "inbox/captures/inbox.md"; line = "- 2026-06-19: EVALMARK-s3"
        def seed(): ev.seed(path, "# Inbox\n")
        def m06():
            http("POST", f"{ev.base}/vault/{path}", line + "\n")  # no idempotency check
            http("POST", f"{ev.base}/vault/{path}", line + "\n")
            return {"emit": recipe_post(path) + "\n# (repeated on retry)\n" + recipe_post(path),
                    "claimed_ok": True, "detected": False}
        def m07():
            rc1 = ev.echo("append", path, line)
            rc2 = ev.echo("append", path, line)  # second is skipped by read-before-POST
            return {"emit": f'"$ECHO" append {path} "{line}"\n"$ECHO" append {path} "{line}"',
                    "claimed_ok": rc1 == 0 and rc2 == 0, "detected": False}
        def gt():
            c = ev.ground(path) or ""
            return {"persisted": "EVALMARK-s3" in c, "duplicates": max(0, c.count("EVALMARK-s3") - 1)}
        return dict(name="inbox-capture-replayed", fault="duplicate attempt (retry/replay)",
                    seed=seed, m06=m06, m07=m07, gt=gt)
    out.append(s3())

    # S4 — session-log PUT under a flaky network (one-time 503 then success)
    def s4():
        path = "_agent/sessions/2026-06-19-1430-flaky-eval.md"
        body = "---\ntype: session-log\n---\n\n# Session\nEVALMARK-s4\n"
        def seed(): pass  # file does not pre-exist
        def m06():
            http("PUT", f"{ev.base}/vault/{path}", body)  # single shot, no retry, status ignored
            return {"emit": recipe_put(path), "claimed_ok": True, "detected": False}
        def m07():
            rc = ev.echo("put", path, tmpfile(body))  # echo.sh retries the 503 once
            return {"emit": f'"$ECHO" put {path} /tmp/x.md', "claimed_ok": rc == 0, "detected": rc != 0}
        def gt():
            c = ev.ground(path) or ""
            return {"persisted": "EVALMARK-s4" in c, "duplicates": 0}
        return dict(name="session-log-flaky-network", fault="one-time 503 (transient)",
                    seed=seed, m06=m06, m07=m07, gt=gt)
    out.append(s4())

    # S5 — heartbeat PUT accepted-but-not-persisted (proxy hiccup): verify catches it
    def s5():
        path = "_agent/heartbeat/phantom-eval.md"; body = "EVALMARK-s5 @ 2026-06-19T14:30:00Z\n"
        def seed(): pass
        def m06():
            http("PUT", f"{ev.base}/vault/{path}", body)  # 200 returned; no read-back verify
            return {"emit": recipe_put(path), "claimed_ok": True, "detected": False}
        def m07():
            rc = ev.echo("put", path, tmpfile(body))  # verify GET -> 404 -> die
            return {"emit": f'"$ECHO" put {path} /tmp/x.md', "claimed_ok": rc == 0, "detected": rc != 0}
        def gt():
            c = ev.ground(path) or ""
            return {"persisted": "EVALMARK-s5" in c, "duplicates": 0}
        return dict(name="heartbeat-phantom-write", fault="accepted-but-not-persisted",
                    seed=seed, m06=m06, m07=m07, gt=gt)
    out.append(s5())

    # S6 — cold-start load: 6 reads (no fault; pure token comparison)
    def s6():
        paths = ["_agent/echo-vault.md", "_agent/memory/semantic/operator-preferences.md",
                 "_agent/context/current-context.md", "_agent/heartbeat/last-session.md",
                 "journal/daily/2026-06-19.md", "inbox/captures/inbox.md"]
        def seed():
            for p in paths: ev.seed(p, f"# {p}\nEVALMARK-s6\n")
        def m06():
            for p in paths: http("GET", f"{ev.base}/vault/{p}")
            return {"emit": "\n".join(recipe_get(p) for p in paths), "claimed_ok": True, "detected": False}
        def m07():
            for p in paths: ev.echo("get", p)
            return {"emit": "\n".join(f'"$ECHO" get {p}' for p in paths), "claimed_ok": True, "detected": False}
        def gt(): return {"persisted": True, "duplicates": 0}
        return dict(name="cold-start-load-6-reads", fault="(none)", seed=seed, m06=m06, m07=m07, gt=gt)
    out.append(s6())

    return out


def run_method(ev, scn, key):
    ev.reset(); scn["seed"]()
    res = scn[key]()
    g = scn["gt"]()
    # a write op "should persist"; reads (s6) and replays are judged by their own gt
    bad = (not g["persisted"] and scn["name"] != "cold-start-load-6-reads") or g["duplicates"] > 0
    silent = bool(res["claimed_ok"] and bad)
    return {"emit_tokens": ev.toks(res["emit"]), "claimed_ok": res["claimed_ok"],
            "detected": res["detected"], "persisted": g["persisted"],
            "duplicates": g["duplicates"], "silent_failure": silent}


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--port", type=int, default=8799)
    ap.add_argument("--cpt", type=float, default=4.0, help="chars per token")
    ap.add_argument("--recovery", type=int, default=1500, help="token penalty per silent failure (recovery loop)")
    ap.add_argument("--detect-cost", type=int, default=80, help="token cost to re-issue a corrected call after a detected failure")
    a = ap.parse_args()
    base = f"http://127.0.0.1:{a.port}"

    srv = subprocess.Popen([sys.executable, str(HERE / "mock_olrapi.py"), "--port", str(a.port)],
                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    try:
        for _ in range(50):  # wait for ready
            try:
                urllib.request.urlopen(f"{base}/__debug__reset", data=b"", timeout=1); break
            except Exception: time.sleep(0.1)
        else:
            # never became ready: fail loudly rather than scoring against a dead server
            raise RuntimeError(f"mock server did not become ready at {base}")

        ev = Eval(base, a.cpt)
        rows, agg = [], {"06": {}, "07": {}}
        for scn in scenarios(ev):
            r06 = run_method(ev, scn, "m06")
            r07 = run_method(ev, scn, "m07")
            rows.append({"scenario": scn["name"], "fault": scn["fault"], "06": r06, "07": r07})

        def totals(side):
            gen = sum(r[side]["emit_tokens"] for r in rows)
            sil = sum(1 for r in rows if r[side]["silent_failure"])
            det = sum(1 for r in rows if r[side]["detected"])
            dup = sum(r[side]["duplicates"] for r in rows)
            eff = gen + sil * a.recovery + det * a.detect_cost
            return dict(gen_tokens=gen, silent_failures=sil, detected_failures=det,
                        duplicates=dup, effective_tokens=eff)
        T06, T07 = totals("06"), totals("07")

        # ---- report ----
        line = "=" * 78
        print(f"\n{line}\nECHO MEMORY — 0.6 vs 0.7 A/B EVAL")
        print(f"(mock OLRAPI; chars/token={a.cpt}, recovery-penalty={a.recovery}, detect-cost={a.detect_cost})\n{line}")
        hdr = f"{'scenario':28} {'fault':32} {'gen06':>5} {'gen07':>5} {'silent06':>8} {'silent07':>8}"
        print(hdr); print("-" * len(hdr))
        for r in rows:
            print(f"{r['scenario']:28} {r['fault']:32} "
                  f"{r['06']['emit_tokens']:>5} {r['07']['emit_tokens']:>5} "
                  f"{('YES' if r['06']['silent_failure'] else '-'):>8} "
                  f"{('YES' if r['07']['silent_failure'] else '-'):>8}")
        print("-" * len(hdr))

        def pct(old, new): return f"{(old-new)/old*100:+.0f}%" if old else "n/a"
        print(f"\n{'metric':30} {'0.6':>10} {'0.7':>10} {'delta':>10}")
        print("-" * 62)
        for label, k in [("generated tokens", "gen_tokens"),
                         ("silent failures", "silent_failures"),
                         ("duplicate lines", "duplicates"),
                         ("detected (loud) failures", "detected_failures"),
                         ("effective tokens (incl. recovery)", "effective_tokens")]:
            o, n = T06[k], T07[k]
            d = pct(o, n) if "token" in k else f"{n-o:+d}"
            print(f"{label:30} {o:>10} {n:>10} {d:>10}")

        wrows = [r for r in rows if r['scenario'] != 'cold-start-load-6-reads']
        denom = len(wrows)
        sf06 = sum(1 for r in wrows if not r['06']['silent_failure'])   # silent-error-free
        sf07 = sum(1 for r in wrows if not r['07']['silent_failure'])
        pr06 = sum(1 for r in wrows if r['06']['persisted'])            # write actually landed
        pr07 = sum(1 for r in wrows if r['07']['persisted'])
        print("-" * 62)
        print(f"{'silent-error-free ops':30} {str(sf06)+'/'+str(denom):>10} {str(sf07)+'/'+str(denom):>10}")
        print(f"{'writes actually persisted':30} {str(pr06)+'/'+str(denom):>10} {str(pr07)+'/'+str(denom):>10}")
        print(f" (the 2 un-persisted 0.7 ops are bad-heading + phantom: not persistable in a single")
        print(f" call, but 0.7 fails LOUD so the agent's ensure-heading/retry path can recover.)")
        print(f"\nNOTE: recovery-penalty and detect-cost are tunable ASSUMPTIONS, not measured.")
        print(f" gen tokens are a chars/{a.cpt} proxy for output tokens of the I/O layer only.")
        print(f" This harness measures mechanics, not model reasoning quality.\n")

        report = {"params": vars(a), "rows": rows, "totals": {"0.6": T06, "0.7": T07},
                  "silent_error_free": {"0.6": f"{sf06}/{denom}", "0.7": f"{sf07}/{denom}"},
                  "writes_persisted": {"0.6": f"{pr06}/{denom}", "0.7": f"{pr07}/{denom}"}}
        out_path = HERE / "results" / "latest.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)  # results/ may be absent on a fresh checkout
        out_path.write_text(json.dumps(report, indent=2))
        print(f"wrote {out_path}")
    finally:
        srv.terminate()


if __name__ == "__main__":
    main()