ver-0.7

2026-06-19 21:12:14 -05:00
parent a860819bfb
commit 2fc3a0a80b
26 changed files with 1559 additions and 89 deletions
@@ -0,0 +1,87 @@
+# echo-memory eval — 0.6 vs 0.7 A/B harness
+
+A reproducible, credential-free A/B comparison of the plugin **before** (0.6: raw-curl
+recipes from `SKILL.md`) and **after** (0.7: the shipped `scripts/echo.sh` client) the
+hardening work. It quantifies the claims in the comparative analysis: token cost of the
+I/O layer, and the rate of **silent write failures**.
+
+## Run it
+
+```bash
+cd eval
+python3 run_eval.py                 # default params
+python3 run_eval.py --recovery 2500 # sensitivity-test the recovery assumption
+python3 run_eval.py --cpt 3.5       # different chars/token proxy
+```
+
+No network, no API key, no live vault. Pure stdlib (Python 3 + bash for `echo.sh`).
+Results table prints to stdout and a machine-readable copy lands in `results/latest.json`.
+
+## How it works
+
+- **`mock_olrapi.py`** — a deterministic mock of the Obsidian Local REST API surface the
+  plugin uses, reproducing its real behaviors and quirks (404 shape, the `/vault//`
+  double-slash 400, directory listings with `dir/` entries, `PATCH` heading targets that
+  return `400 invalid-target / 40080` when the heading is absent). Faults are triggered by
+  path markers so one server serves every scenario:
+  - `flaky` in the path → the first `PUT`/`POST` to that path returns `503`, later writes succeed (tests retry).
+  - `phantom` in the path → `PUT` returns `200` but does **not** persist (tests read-back verify).
+  - a `PATCH` to a missing heading → `400` (the silent-write-loss trigger).
+- **`run_eval.py`** — for each scenario, runs both methods against a freshly reset +
+  re-seeded server (so faults are identical for both), then reads ground truth back
+  **independently** from the mock. The 0.7 side executes the *actual shipped `echo.sh`*;
+  the 0.6 side faithfully models the documented recipe (real HTTP, but no status check,
+  no retry, no verify, no dedupe).
+
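The ground-truth comparison described above reduces to a small predicate. A minimal sketch, assuming a hypothetical `classify` helper: the argument names mirror the fields in the harness's result rows, but the function itself is illustrative and not part of the shipped code.

```python
def classify(claimed_ok: bool, persisted: bool, duplicates: int, is_write: bool = True) -> str:
    """Label one (scenario, method) outcome from the method's claim and the ground truth."""
    # The vault is "wrong" if a write did not land, or if it landed more than once.
    wrong = (is_write and not persisted) or duplicates > 0
    if not wrong:
        return "ok"
    # A wrong vault only counts as silent when the method still reported success.
    return "silent_failure" if claimed_ok else "detected"


# 0.6 on the missing-heading scenario: claims success, nothing persisted.
print(classify(claimed_ok=True, persisted=False, duplicates=0))
# 0.7 on the same scenario: nonzero exit, nothing persisted.
print(classify(claimed_ok=False, persisted=False, duplicates=0))
# 0.6 on the replayed capture: persisted, but with one duplicate line.
print(classify(claimed_ok=True, persisted=True, duplicates=1))
```

The point of reading ground truth independently is that `claimed_ok` is never trusted on its own; it only decides whether a wrong vault was loud or silent.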
+## Metrics
+
+| metric | meaning |
+|---|---|
+| `gen_tokens` | output tokens the model must generate for the op (`len(emitted)/cpt` proxy) |
+| `silent_failure` | method **reported success** but ground truth is wrong (lost write or dup) — and nobody noticed |
+| `detected` | the method surfaced the failure (nonzero exit) instead of hiding it |
+| `effective_tokens` | `gen + silent_failures*recovery + detected*detect_cost` |
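+| `silent-error-free ops` | write scenarios (5 of the 6; the read-only cold-start load is excluded) with no silent failure; the headline accuracy number |
+| `writes actually persisted` | write scenarios whose content actually landed in the vault, whether or not the failure was reported |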
+
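As a worked example of the `effective_tokens` formula, the sketch below plugs in the totals reported by the harness. The helper name `effective_tokens` is hypothetical; the default penalties are the `--recovery` and `--detect-cost` defaults of `run_eval.py`.

```python
def effective_tokens(gen: int, silent_failures: int, detected: int,
                     recovery: int = 1500, detect_cost: int = 80) -> int:
    """gen + silent_failures*recovery + detected*detect_cost, as in the metrics table."""
    return gen + silent_failures * recovery + detected * detect_cost


print(effective_tokens(723, 4, 0))                  # 0.6 at defaults: 6723
print(effective_tokens(174, 0, 2))                  # 0.7 at defaults: 334
print(effective_tokens(723, 4, 0, recovery=3000))   # 0.6 with a doubled penalty: 12723
```

The 0.7 figure does not move with `recovery` because it has no silent failures; only its two detected failures contribute, at `detect_cost` each.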
+## Scenarios
+
+1. **agent-log-missing-heading** — `PATCH` append to a note lacking the target heading (`400`).
+2. **scope-switch** — clean `PATCH replace` (no fault; pure token comparison).
+3. **inbox-capture-replayed** — same capture issued twice (retry/replay): dedup vs duplicate.
+4. **session-log-flaky-network** — one-time `503`: retry vs single-shot.
+5. **heartbeat-phantom-write** — accepted-but-not-persisted: read-back verify vs none.
+6. **cold-start-load-6-reads** — 6 GETs (no fault; pure token comparison).
+
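Scenarios 3 to 5 exercise three client-side safeguards: dedupe on replay, retry on a transient `5xx`, and read-back verification. The sketch below shows the general shape of those safeguards in Python against an in-memory stand-in. It is an assumption-laden illustration, not how `echo.sh` is implemented: `FakeVault`, `safe_put`, and `safe_append` are hypothetical names, and the append is modelled as read-modify-`PUT` rather than the `POST` used by the harness.

```python
class FakeVault:
    """In-memory stand-in for the mock server (hypothetical, illustration only)."""

    def __init__(self):
        self.files = {}
        self.flaky_fired = set()

    def put(self, path: str, body: str) -> int:
        if "flaky" in path and path not in self.flaky_fired:
            self.flaky_fired.add(path)
            return 503                      # one-time transient failure
        if "phantom" not in path:           # phantom paths accept but never persist
            self.files[path] = body
        return 200

    def get(self, path: str):
        return self.files.get(path)         # None models a 404


def safe_put(vault: FakeVault, path: str, body: str, retries: int = 1) -> bool:
    """PUT with a status check, retry on 5xx, and read-back verification."""
    for _ in range(retries + 1):
        status = vault.put(path, body)
        if status < 500:
            break
    if status >= 400:
        return False                        # fail loudly instead of assuming success
    return vault.get(path) == body          # verify the write actually landed


def safe_append(vault: FakeVault, path: str, line: str) -> bool:
    """Append a line only if it is not already present (read-before-write dedupe)."""
    current = vault.get(path) or ""
    if line in current.splitlines():
        return True                         # replayed request: nothing to do
    return safe_put(vault, path, current + line + "\n")


vault = FakeVault()
print(safe_put(vault, "sessions/flaky-note.md", "hello\n"))     # retried once, then lands
print(safe_put(vault, "heartbeat/phantom-note.md", "beat\n"))   # accepted but not persisted
safe_append(vault, "inbox.md", "- item")
safe_append(vault, "inbox.md", "- item")
print(vault.files["inbox.md"].count("- item"))
```

Each safeguard maps to one fault: the retry absorbs the one-time `503`, the read-back turns the phantom write into a reported failure, and the presence check keeps the replayed capture from producing a second line.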
+## Representative result (defaults)
+
+```
+generated tokens             723 -> 174   (+76% fewer)
+silent failures                4 ->   0   (-4)
+duplicate lines                1 ->   0   (-1)
+silent-error-free ops        1/5 -> 5/5
+effective tokens (assumed)  6723 -> 334
+```
+
+## Honest caveats
+
+- **Mechanics, not reasoning.** This measures the deterministic plumbing differences. It
+  does **not** measure model judgment (routing choices, prose quality) — that needs a live model.
+- **`recovery` and `detect-cost` are assumptions**, not measurements. The headline
+  "silent failures: 4 → 0" is a hard count from ground truth; the `effective_tokens` figure
+  is a model on top of it and scales with `--recovery`: the 0.6 total is 6723 at the
+  default of 1500 and 12723 at 3000.
+- **`gen_tokens` is a `chars/cpt` proxy** for the I/O layer only, not a tokenizer count, and
+  excludes the one-time `+12%` SKILL.md context cost noted in the analysis (that's a
+  per-session context cost, not per-op).
+
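To make the `chars/cpt` proxy concrete, here is a minimal sketch that applies it to one 0.7-style command string. `proxy_tokens` is a hypothetical helper name; the rounding matches the `round(len(text) / cpt)` rule used by the harness.

```python
def proxy_tokens(emitted: str, cpt: float = 4.0) -> int:
    """Approximate output tokens as characters divided by chars-per-token."""
    return round(len(emitted) / cpt)


cmd = '"$ECHO" get _agent/echo-vault.md'
print(len(cmd))                      # 32 characters
print(proxy_tokens(cmd))             # 8 at the default cpt of 4.0
print(proxy_tokens(cmd, cpt=3.5))    # 9 at cpt 3.5
```

Because both methods are scored with the same divisor, changing `--cpt` shifts the absolute counts but leaves the relative saving almost unchanged, apart from rounding.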
+## Extending to a live-model run (optional)
+
+To measure real model behavior and true token counts:
+1. Define the same six scenarios as natural-language tasks (e.g. "log a session note for X").
+2. Run each twice — once with the 0.6 skill files, once with 0.7 — through the Agent SDK
+   against the **mock** server (point `ECHO_BASE` at it) so faults stay deterministic.
+3. Record `usage.output_tokens` per task from the API and whether the vault ended correct
+   (same ground-truth read used here).
+
+The mock + ground-truth checks in this harness are reusable as-is for that; only the driver
+changes from "scripted ops" to "model-driven ops".
@@ -0,0 +1,172 @@
+#!/usr/bin/env python3
+"""mock_olrapi.py — a deterministic mock of the Obsidian Local REST API surface the
+echo-memory plugin uses, with controllable fault injection.
+
+It reproduces the real API's observed behavior and quirks so the A/B eval can run
+without credentials and without touching the live vault:
+
+  GET  /vault/<path>            200 + body, or 404 {errorCode:40400}
+  GET  /vault/ (or dir + '/')   200 {"files":[... , "sub/"]}  (dirs end in '/')
+  GET  /vault//                 400  (the double-slash quirk)
+  GET  ... Accept: document-map 200 {"headings":[...]}
+  PUT  /vault/<path>            201/200, stores body, creates parents
+  POST /vault/<path>            200, appends (creates if absent)
+  PATCH /vault/<path>           heading target missing -> 400 {errorCode:40080}; else apply
+  POST /search/simple/?query=   200 []
+  DELETE /vault/<path>          200 / 404
+
+Fault injection is triggered by markers in the path (so a single server serves every
+scenario deterministically):
+  * path contains "flaky"   -> the FIRST write (PUT/POST) to that path returns 503,
+                               subsequent writes succeed (tests retry-on-5xx).
+  * path contains "phantom" -> PUT returns 200 but does NOT persist (accepted-but-lost;
+                               tests read-back verify).
+  * a PATCH heading whose Target heading is absent from the stored doc -> 400 40080
+                               (the silent-write-loss trigger).
+
+Debug (out of band, for the harness ground-truth checks):
+  GET  /__debug__?path=<p>      -> raw stored content, or "<<MISSING>>"
+  POST /__debug__reset          -> clear all state + fault memory
+"""
+import json, re, sys, argparse
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from urllib.parse import urlparse, parse_qs, unquote
+
+STATE = {}            # vault-path -> content (str)
+FLAKY_FIRED = set()   # paths that already consumed their one-time 503
+
+def headings_of(text):
+    return [m.group(1).strip() for m in re.finditer(r"(?m)^#{1,6}\s+(.*)$", text or "")]
+
+def heading_present(text, target):
+    # target may be a full "A::B" path; match its leaf heading line anywhere in the doc
+    leaf = target.split("::")[-1].strip()
+    return any(h == leaf or h.endswith("::" + leaf) or h == target for h in headings_of(text)) \
+        or bool(re.search(r"(?m)^#{1,6}\s+" + re.escape(leaf) + r"\s*$", text or ""))
+
+class H(BaseHTTPRequestHandler):
+    def log_message(self, *a): pass  # quiet
+
+    def _send(self, code, body="", ctype="text/markdown"):
+        b = body.encode() if isinstance(body, str) else body
+        self.send_response(code)
+        self.send_header("Content-Type", ctype)
+        self.send_header("Content-Length", str(len(b)))
+        self.end_headers()
+        self.wfile.write(b)
+
+    def _json(self, code, obj):
+        self._send(code, json.dumps(obj, indent=2), "application/json")
+
+    def _read_body(self):
+        n = int(self.headers.get("Content-Length", 0) or 0)
+        return self.rfile.read(n).decode("utf-8", "replace") if n else ""
+
+    def _vpath(self):
+        p = urlparse(self.path).path
+        if p.startswith("/vault/"):
+            return p[len("/vault/"):]   # keep trailing slash if present
+        return None
+
+    # ---- debug -------------------------------------------------------------
+    def _maybe_debug(self):
+        u = urlparse(self.path)
+        if u.path == "/__debug__":
+            q = parse_qs(u.query)
+            path = unquote(q.get("path", [""])[0])
+            self._send(200, STATE.get(path, "<<MISSING>>"))
+            return True
+        if u.path == "/__debug__reset":
+            STATE.clear(); FLAKY_FIRED.clear()
+            self._json(200, {"ok": True})
+            return True
+        return False
+
+    def do_GET(self):
+        if self._maybe_debug(): return
+        raw = urlparse(self.path).path
+        if raw.startswith("/search/simple"):
+            return self._json(200, [])
+        vp = self._vpath()
+        if vp is None:
+            return self._json(404, {"errorCode": 40400, "message": "Not Found"})
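+        # double-slash quirk: for /vault// the vault-relative path is "/", so test the
+        # leading slash as well as any embedded "//" (e.g. /vault/a//b)
+        if vp.startswith("/") or "//" in vp: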
+            return self._json(400, {"errorCode": 40000, "message": "Bad Request"})
+        if vp == "" or vp.endswith("/"):                  # directory listing
+            prefix = vp
+            kids = set()
+            for k in STATE:
+                if k.startswith(prefix) and k != prefix:
+                    rest = k[len(prefix):]
+                    kids.add(rest.split("/")[0] + ("/" if "/" in rest else ""))
+            return self._json(200, {"files": sorted(kids)})
+        if vp in STATE:
+            if "document-map" in (self.headers.get("Accept", "")):
+                return self._json(200, {"headings": headings_of(STATE[vp]),
+                                        "blocks": [], "frontmatterFields": []})
+            return self._send(200, STATE[vp])
+        return self._json(404, {"errorCode": 40400, "message": "Not Found"})
+
+    def _flaky_once(self, vp):
+        if "flaky" in vp and vp not in FLAKY_FIRED:
+            FLAKY_FIRED.add(vp)
+            return True
+        return False
+
+    def do_PUT(self):
+        vp = self._vpath(); body = self._read_body()
+        if vp is None: return self._json(400, {"message": "bad path"})
+        if self._flaky_once(vp):
+            return self._json(503, {"message": "Service Unavailable"})
+        if "phantom" in vp:                  # accepted but NOT persisted
+            return self._send(200, "")
+        STATE[vp] = body
+        return self._send(200, "")
+
+    def do_POST(self):
+        if self._maybe_debug(): return
+        raw = urlparse(self.path).path
+        if raw.startswith("/search/simple"):
+            return self._json(200, [])
+        vp = self._vpath(); body = self._read_body()
+        if vp is None: return self._json(400, {"message": "bad path"})
+        if self._flaky_once(vp):
+            return self._json(503, {"message": "Service Unavailable"})
+        STATE[vp] = STATE.get(vp, "") + body  # append
+        return self._send(200, "")
+
+    def do_PATCH(self):
+        vp = self._vpath(); body = self._read_body()
+        if vp is None: return self._json(400, {"message": "bad path"})
+        ttype = self.headers.get("Target-Type", "")
+        op    = self.headers.get("Operation", "append")
+        target = self.headers.get("Target", "")
+        cur = STATE.get(vp, "")
+        if ttype == "heading":
+            if not heading_present(cur, target):
+                return self._json(400, {"errorCode": 40080, "message": "invalid-target"})
+            # naive apply: the body always lands at the end of the note, whatever the
+            # operation; that is enough for the marker-based ground-truth checks
+            STATE[vp] = cur + ("\n" + body if op != "prepend" else body + "\n")
+            return self._send(200, "")
+        if ttype == "frontmatter":
+            STATE[vp] = cur + f"\n<fm:{target}={body}>"
+            return self._send(200, "")
+        STATE[vp] = cur + "\n" + body
+        return self._send(200, "")
+
+    def do_DELETE(self):
+        vp = self._vpath()
+        if vp in STATE:
+            del STATE[vp]; return self._send(200, "")
+        return self._json(404, {"errorCode": 40400, "message": "Not Found"})
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--port", type=int, default=8799)
+    a = ap.parse_args()
+    srv = ThreadingHTTPServer(("127.0.0.1", a.port), H)
+    print(f"mock-olrapi listening on http://127.0.0.1:{a.port}", flush=True)
+    srv.serve_forever()
+
+if __name__ == "__main__":
+    main()
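The heading check that drives the `400 invalid-target` fault can be exercised in isolation. The sketch below copies the regex from `headings_of` and uses a simplified leaf match; `leaf_present` is a hypothetical reduction of `heading_present`, and the document is the seed used by the missing-heading scenario.

```python
import re


def headings_of(text):
    """Return the text of every Markdown ATX heading (levels 1 to 6)."""
    return [m.group(1).strip() for m in re.finditer(r"(?m)^#{1,6}\s+(.*)$", text or "")]


def leaf_present(text, target):
    """True if the last '::'-separated segment of target is a heading in text."""
    leaf = target.split("::")[-1].strip()
    return leaf in headings_of(text)


doc = "---\ntype: daily-note\n---\n\n# 2026-06-19\n\n## Notes\n"
print(headings_of(doc))                               # ['2026-06-19', 'Notes']
print(leaf_present(doc, "2026-06-19::Agent Log"))     # False, so PATCH would get 400
print(leaf_present(doc, "2026-06-19::Notes"))         # True
```

Only the leaf segment is compared, so the parent part of an `A::B` target is not validated against the document's heading hierarchy in this sketch.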
@@ -0,0 +1,154 @@
+{
+  "params": {
+    "port": 8799,
+    "cpt": 4.0,
+    "recovery": 3000,
+    "detect_cost": 80
+  },
+  "rows": [
+    {
+      "scenario": "agent-log-missing-heading",
+      "fault": "bad heading -> 400 invalid-target",
+      "06": {
+        "emit_tokens": 92,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": false,
+        "duplicates": 0,
+        "silent_failure": true
+      },
+      "07": {
+        "emit_tokens": 22,
+        "claimed_ok": false,
+        "detected": true,
+        "persisted": false,
+        "duplicates": 0,
+        "silent_failure": false
+      }
+    },
+    {
+      "scenario": "scope-switch",
+      "fault": "(none)",
+      "06": {
+        "emit_tokens": 93,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": true,
+        "duplicates": 0,
+        "silent_failure": false
+      },
+      "07": {
+        "emit_tokens": 24,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": true,
+        "duplicates": 0,
+        "silent_failure": false
+      }
+    },
+    {
+      "scenario": "inbox-capture-replayed",
+      "fault": "duplicate attempt (retry/replay)",
+      "06": {
+        "emit_tokens": 144,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": true,
+        "duplicates": 1,
+        "silent_failure": true
+      },
+      "07": {
+        "emit_tokens": 33,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": true,
+        "duplicates": 0,
+        "silent_failure": false
+      }
+    },
+    {
+      "scenario": "session-log-flaky-network",
+      "fault": "one-time 503 (transient)",
+      "06": {
+        "emit_tokens": 74,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": false,
+        "duplicates": 0,
+        "silent_failure": true
+      },
+      "07": {
+        "emit_tokens": 17,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": true,
+        "duplicates": 0,
+        "silent_failure": false
+      }
+    },
+    {
+      "scenario": "heartbeat-phantom-write",
+      "fault": "accepted-but-not-persisted",
+      "06": {
+        "emit_tokens": 71,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": false,
+        "duplicates": 0,
+        "silent_failure": true
+      },
+      "07": {
+        "emit_tokens": 14,
+        "claimed_ok": false,
+        "detected": true,
+        "persisted": false,
+        "duplicates": 0,
+        "silent_failure": false
+      }
+    },
+    {
+      "scenario": "cold-start-load-6-reads",
+      "fault": "(none)",
+      "06": {
+        "emit_tokens": 249,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": true,
+        "duplicates": 0,
+        "silent_failure": false
+      },
+      "07": {
+        "emit_tokens": 64,
+        "claimed_ok": true,
+        "detected": false,
+        "persisted": true,
+        "duplicates": 0,
+        "silent_failure": false
+      }
+    }
+  ],
+  "totals": {
+    "0.6": {
+      "gen_tokens": 723,
+      "silent_failures": 4,
+      "detected_failures": 0,
+      "duplicates": 1,
+      "effective_tokens": 12723
+    },
+    "0.7": {
+      "gen_tokens": 174,
+      "silent_failures": 0,
+      "detected_failures": 2,
+      "duplicates": 0,
+      "effective_tokens": 334
+    }
+  },
+  "silent_error_free": {
+    "0.6": "1/5",
+    "0.7": "5/5"
+  },
+  "writes_persisted": {
+    "0.6": "2/5",
+    "0.7": "3/5"
+  }
+}
@@ -0,0 +1,303 @@
+#!/usr/bin/env python3
+"""run_eval.py — A/B eval of echo-memory 0.6 (raw curl, documented recipes) vs
+0.7 (the shipped echo.sh client) over representative memory operations, with
+fault injection.
+
+Design
+------
+* A mock Obsidian REST API (mock_olrapi.py) gives deterministic behavior + faults,
+  so no credentials are needed and the real vault is never touched.
+* The 0.7 side runs the ACTUAL shipped scripts/echo.sh (status-checked, retry, verify,
+  idempotent append).
+* The 0.6 side faithfully models the documented raw-curl recipes: it performs the
+  same HTTP but does NOT inspect status, does NOT retry, does NOT verify, and does
+  NOT dedupe — exactly what SKILL.md 0.6 told the model to emit.
+* Ground truth is read back independently from the mock after each method, so a
+  "silent failure" (method reported success but the vault is actually wrong, and
+  nobody noticed) is detected regardless of what the method claimed.
+* Each (scenario, method) runs against a freshly reset + re-seeded server, so faults
+  (e.g. one-time 503) are identical for both methods.
+
+Metrics
+-------
+* gen_tokens     : output tokens the MODEL must generate for the op (len(emitted)/CPT).
+* silent_failure : claimed success BUT ground truth is wrong (lost write or duplicate).
+* detected       : the method surfaced the failure (loud, retryable) instead of hiding it.
+* effective_tokens = gen_tokens + silent_failures*RECOVERY + detected*DETECT_COST
+                     (RECOVERY/DETECT_COST are labeled assumptions; tune via --recovery / --detect-cost).
+
+Usage:  python3 run_eval.py [--port 8799] [--cpt 4] [--recovery 1500] [--detect-cost 80]
+"""
+import os, sys, json, time, subprocess, tempfile, argparse, urllib.request, urllib.error
+from pathlib import Path
+
+HERE = Path(__file__).resolve().parent
+ECHO = HERE.parent / "echo-memory.plugin.src" / "skills" / "echo-memory" / "scripts" / "echo.sh"
+KEY  = "241265fbe6830934a9a4ad3e69335f64a42153b663aa5b0017cb1ea1217b2bab"
+
+# ----- tiny HTTP helpers (used for setup, ground truth, and the 0.6 model) -----
+def http(method, url, body=None, headers=None):
+    data = body.encode() if isinstance(body, str) else body
+    req = urllib.request.Request(url, data=data, method=method,
+                                 headers={"Authorization": f"Bearer {KEY}", **(headers or {})})
+    try:
+        with urllib.request.urlopen(req, timeout=10) as r:
+            return r.status, r.read().decode("utf-8", "replace")
+    except urllib.error.HTTPError as e:
+        return e.code, e.read().decode("utf-8", "replace")
+    except Exception as e:
+        return 0, str(e)
+
+class Eval:
+    def __init__(self, base, cpt):
+        self.base, self.cpt = base, cpt
+    def reset(self): http("POST", f"{self.base}/__debug__reset")
+    def seed(self, path, content): http("PUT", f"{self.base}/vault/{path}", content)
+    def ground(self, path):
+        st, body = http("GET", f"{self.base}/__debug__?path={path}")
+        return None if body == "<<MISSING>>" else body
+    def toks(self, text): return round(len(text) / self.cpt)
+    def echo(self, *args):
+        env = dict(os.environ, ECHO_BASE=self.base, ECHO_KEY=KEY, ECHO_VERIFY="1", ECHO_LOCK_TTL="900")
+        p = subprocess.run(["bash", str(ECHO), *args], capture_output=True, text=True, env=env)
+        return p.returncode
+
+# ---- faithful 0.6 recipe text (what the model emitted), for token accounting ----
+def recipe_get(path):
+    return (f'curl -s -H "Authorization: Bearer {KEY}" '
+            f'"https://echoapi.alwisp.com/vault/{path}"')
+def recipe_put(path):
+    return (f"cat > /tmp/obs_file.md << 'EOF'\n<body>\nEOF\n\n"
+            f'curl -s -X PUT -H "Authorization: Bearer {KEY}" '
+            f'-H "Content-Type: text/markdown" --data-binary @/tmp/obs_file.md '
+            f'"https://echoapi.alwisp.com/vault/{path}"')
+def recipe_post(path):
+    return (f"cat > /tmp/obs_entry.md << 'EOF'\n<line>\nEOF\n\n"
+            f'curl -s -X POST -H "Authorization: Bearer {KEY}" '
+            f'-H "Content-Type: text/markdown" --data-binary @/tmp/obs_entry.md '
+            f'"https://echoapi.alwisp.com/vault/{path}"')
+def recipe_patch(path, target):
+    return (f"cat > /tmp/obs_patch.md << 'EOF'\n<body>\nEOF\n\n"
+            f'curl -s -X PATCH -H "Authorization: Bearer {KEY}" '
+            f'-H "Operation: append" -H "Target-Type: heading" -H "Target: {target}" '
+            f'-H "Content-Type: text/markdown" --data-binary @/tmp/obs_patch.md '
+            f'"https://echoapi.alwisp.com/vault/{path}"')
+
+def tmpfile(content):
+    f = tempfile.NamedTemporaryFile("w", suffix=".md", delete=False); f.write(content); f.close()
+    return f.name
+
+# ---------------------------------------------------------------------------
+# Scenarios. Each defines seed(), and a run for each method that returns:
+#   {emit: <text model generates>, claimed_ok: bool, detected: bool}
+# plus a ground-truth check producing {persisted, duplicates}.
+# ---------------------------------------------------------------------------
+def scenarios(ev):
+    out = []
+
+    # S1 — agent-log append where the heading is MISSING (-> 400 invalid-target)
+    def s1():
+        path, tgt = "journal/daily/2026-06-19.md", "2026-06-19::Agent Log"
+        line = "- 2026-06-19: EVALMARK-s1"
+        def seed(): ev.seed(path, "---\ntype: daily-note\n---\n\n# 2026-06-19\n\n## Notes\n")
+        def m06():
+            http("PATCH", f"{ev.base}/vault/{path}", line,
+                 {"Operation": "append", "Target-Type": "heading", "Target": tgt})  # status ignored
+            return {"emit": recipe_patch(path, tgt), "claimed_ok": True, "detected": False}
+        def m07():
+            rc = ev.echo("patch", path, "append", "heading", tgt, tmpfile(line))
+            return {"emit": f'"$ECHO" patch {path} append heading "{tgt}" /tmp/x.md',
+                    "claimed_ok": rc == 0, "detected": rc != 0}
+        def gt():
+            c = ev.ground(path) or ""
+            return {"persisted": "EVALMARK-s1" in c, "duplicates": max(0, c.count("EVALMARK-s1") - 1)}
+        return dict(name="agent-log-missing-heading", fault="bad heading -> 400 invalid-target",
+                    seed=seed, m06=m06, m07=m07, gt=gt)
+    out.append(s1())
+
+    # S2 — scope switch (no fault; pure token comparison on a PATCH replace)
+    def s2():
+        path, tgt = "_agent/context/current-context.md", "Current Context::Scope"
+        def seed(): ev.seed(path, "---\ntype: context-bundle\n---\n\n# Current Context\n\n## Scope\nold scope\n")
+        def m06():
+            http("PATCH", f"{ev.base}/vault/{path}", "EVALMARK-s2 new scope",
+                 {"Operation": "replace", "Target-Type": "heading", "Target": tgt})
+            return {"emit": recipe_patch(path, tgt), "claimed_ok": True, "detected": False}
+        def m07():
+            rc = ev.echo("patch", path, "replace", "heading", tgt, tmpfile("EVALMARK-s2 new scope"))
+            return {"emit": f'"$ECHO" patch {path} replace heading "{tgt}" /tmp/x.md',
+                    "claimed_ok": rc == 0, "detected": rc != 0}
+        def gt():
+            c = ev.ground(path) or ""
+            return {"persisted": "EVALMARK-s2" in c, "duplicates": 0}
+        return dict(name="scope-switch", fault="(none)", seed=seed, m06=m06, m07=m07, gt=gt)
+    out.append(s2())
+
+    # S3 — inbox capture issued TWICE (retry/replay): dedup vs duplicate line
+    def s3():
+        path = "inbox/captures/inbox.md"; line = "- 2026-06-19: EVALMARK-s3"
+        def seed(): ev.seed(path, "# Inbox\n")
+        def m06():
+            http("POST", f"{ev.base}/vault/{path}", line + "\n")  # no idempotency check
+            http("POST", f"{ev.base}/vault/{path}", line + "\n")
+            return {"emit": recipe_post(path) + "\n# (repeated on retry)\n" + recipe_post(path),
+                    "claimed_ok": True, "detected": False}
+        def m07():
+            rc1 = ev.echo("append", path, line)
+            rc2 = ev.echo("append", path, line)  # second is skipped by read-before-POST
+            return {"emit": f'"$ECHO" append {path} "{line}"\n"$ECHO" append {path} "{line}"',
+                    "claimed_ok": rc1 == 0 and rc2 == 0, "detected": False}
+        def gt():
+            c = ev.ground(path) or ""
+            return {"persisted": "EVALMARK-s3" in c, "duplicates": max(0, c.count("EVALMARK-s3") - 1)}
+        return dict(name="inbox-capture-replayed", fault="duplicate attempt (retry/replay)",
+                    seed=seed, m06=m06, m07=m07, gt=gt)
+    out.append(s3())
+
+    # S4 — session-log PUT under a flaky network (one-time 503 then success)
+    def s4():
+        path = "_agent/sessions/2026-06-19-1430-flaky-eval.md"
+        body = "---\ntype: session-log\n---\n\n# Session\nEVALMARK-s4\n"
+        def seed(): pass  # file does not pre-exist
+        def m06():
+            http("PUT", f"{ev.base}/vault/{path}", body)  # single shot, no retry, status ignored
+            return {"emit": recipe_put(path), "claimed_ok": True, "detected": False}
+        def m07():
+            rc = ev.echo("put", path, tmpfile(body))  # echo.sh retries the 503 once
+            return {"emit": f'"$ECHO" put {path} /tmp/x.md', "claimed_ok": rc == 0, "detected": rc != 0}
+        def gt():
+            c = ev.ground(path) or ""
+            return {"persisted": "EVALMARK-s4" in c, "duplicates": 0}
+        return dict(name="session-log-flaky-network", fault="one-time 503 (transient)",
+                    seed=seed, m06=m06, m07=m07, gt=gt)
+    out.append(s4())
+
+    # S5 — heartbeat PUT accepted-but-not-persisted (proxy hiccup): verify catches it
+    def s5():
+        path = "_agent/heartbeat/phantom-eval.md"; body = "EVALMARK-s5 @ 2026-06-19T14:30:00Z\n"
+        def seed(): pass
+        def m06():
+            http("PUT", f"{ev.base}/vault/{path}", body)  # 200 returned; no read-back verify
+            return {"emit": recipe_put(path), "claimed_ok": True, "detected": False}
+        def m07():
+            rc = ev.echo("put", path, tmpfile(body))  # verify GET -> 404 -> die
+            return {"emit": f'"$ECHO" put {path} /tmp/x.md', "claimed_ok": rc == 0, "detected": rc != 0}
+        def gt():
+            c = ev.ground(path) or ""
+            return {"persisted": "EVALMARK-s5" in c, "duplicates": 0}
+        return dict(name="heartbeat-phantom-write", fault="accepted-but-not-persisted",
+                    seed=seed, m06=m06, m07=m07, gt=gt)
+    out.append(s5())
+
+    # S6 — cold-start load: 6 reads (no fault; pure token comparison)
+    def s6():
+        paths = ["_agent/echo-vault.md", "_agent/memory/semantic/operator-preferences.md",
+                 "_agent/context/current-context.md", "_agent/heartbeat/last-session.md",
+                 "journal/daily/2026-06-19.md", "inbox/captures/inbox.md"]
+        def seed():
+            for p in paths: ev.seed(p, f"# {p}\nEVALMARK-s6\n")
+        def m06():
+            for p in paths: http("GET", f"{ev.base}/vault/{p}")
+            return {"emit": "\n".join(recipe_get(p) for p in paths), "claimed_ok": True, "detected": False}
+        def m07():
+            for p in paths: ev.echo("get", p)
+            return {"emit": "\n".join(f'"$ECHO" get {p}' for p in paths), "claimed_ok": True, "detected": False}
+        def gt(): return {"persisted": True, "duplicates": 0}
+        return dict(name="cold-start-load-6-reads", fault="(none)", seed=seed, m06=m06, m07=m07, gt=gt)
+    out.append(s6())
+
+    return out
+
+def run_method(ev, scn, key):
+    ev.reset(); scn["seed"]()
+    res = scn[key]()
+    g = scn["gt"]()
+    # a write op "should persist"; reads (s6) and replays are judged by their own gt
+    bad = (not g["persisted"] and scn["name"] != "cold-start-load-6-reads") or g["duplicates"] > 0
+    silent = bool(res["claimed_ok"] and bad)
+    return {"emit_tokens": ev.toks(res["emit"]), "claimed_ok": res["claimed_ok"],
+            "detected": res["detected"], "persisted": g["persisted"],
+            "duplicates": g["duplicates"], "silent_failure": silent}
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--port", type=int, default=8799)
+    ap.add_argument("--cpt", type=float, default=4.0, help="chars per token")
+    ap.add_argument("--recovery", type=int, default=1500, help="token penalty per silent failure (recovery loop)")
+    ap.add_argument("--detect-cost", type=int, default=80, help="token cost to re-issue a corrected call after a detected failure")
+    a = ap.parse_args()
+
+    base = f"http://127.0.0.1:{a.port}"
+    srv = subprocess.Popen([sys.executable, str(HERE / "mock_olrapi.py"), "--port", str(a.port)],
+                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
+    try:
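+        for _ in range(50):  # wait for ready
+            try:
+                urllib.request.urlopen(f"{base}/__debug__reset", data=b"", timeout=1); break
+            except Exception: time.sleep(0.1)
+        else:
+            # never reachable: abort rather than score every op against a dead server
+            raise RuntimeError(f"mock server did not become ready at {base}")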
+        ev = Eval(base, a.cpt)
+        rows = []
+        for scn in scenarios(ev):
+            r06 = run_method(ev, scn, "m06")
+            r07 = run_method(ev, scn, "m07")
+            rows.append({"scenario": scn["name"], "fault": scn["fault"], "06": r06, "07": r07})
+
+        def totals(side):
+            gen = sum(r[side]["emit_tokens"] for r in rows)
+            sil = sum(1 for r in rows if r[side]["silent_failure"])
+            det = sum(1 for r in rows if r[side]["detected"])
+            dup = sum(r[side]["duplicates"] for r in rows)
+            eff = gen + sil * a.recovery + det * a.detect_cost
+            return dict(gen_tokens=gen, silent_failures=sil, detected_failures=det,
+                        duplicates=dup, effective_tokens=eff)
+        T06, T07 = totals("06"), totals("07")
+
+        # ---- report ----
+        line = "=" * 78
+        print(f"\n{line}\nECHO MEMORY — 0.6 vs 0.7 A/B EVAL")
+        print(f"(mock OLRAPI; chars/token={a.cpt}, recovery-penalty={a.recovery}, detect-cost={a.detect_cost})\n{line}")
+        hdr = f"{'scenario':28} {'fault':32} {'gen06':>5} {'gen07':>5} {'silent06':>8} {'silent07':>8}"
+        print(hdr); print("-" * len(hdr))
+        for r in rows:
+            print(f"{r['scenario']:28} {r['fault']:32} "
+                  f"{r['06']['emit_tokens']:>5} {r['07']['emit_tokens']:>5} "
+                  f"{('YES' if r['06']['silent_failure'] else '-'):>8} "
+                  f"{('YES' if r['07']['silent_failure'] else '-'):>8}")
+        print("-" * len(hdr))
+
+        def pct(old, new): return f"{(old-new)/old*100:+.0f}%" if old else "n/a"
+        print(f"\n{'metric':30} {'0.6':>10} {'0.7':>10} {'delta':>10}")
+        print("-" * 62)
+        for label, k in [("generated tokens", "gen_tokens"),
+                         ("silent failures", "silent_failures"),
+                         ("duplicate lines", "duplicates"),
+                         ("detected (loud) failures", "detected_failures"),
+                         ("effective tokens (assumed)", "effective_tokens")]:  # label must fit the 30-char column
+            o, n = T06[k], T07[k]
+            d = pct(o, n) if "token" in k else f"{n-o:+d}"
+            print(f"{label:30} {o:>10} {n:>10} {d:>10}")
+        wrows = [r for r in rows if r['scenario'] != 'cold-start-load-6-reads']
+        denom = len(wrows)
+        sf06 = sum(1 for r in wrows if not r['06']['silent_failure'])   # silent-error-free
+        sf07 = sum(1 for r in wrows if not r['07']['silent_failure'])
+        pr06 = sum(1 for r in wrows if r['06']['persisted'])            # write actually landed
+        pr07 = sum(1 for r in wrows if r['07']['persisted'])
+        print("-" * 62)
+        print(f"{'silent-error-free ops':30} {str(sf06)+'/'+str(denom):>10} {str(sf07)+'/'+str(denom):>10}")
+        print(f"{'writes actually persisted':30} {str(pr06)+'/'+str(denom):>10} {str(pr07)+'/'+str(denom):>10}")
+        print(f"  (the 2 un-persisted 0.7 ops are bad-heading + phantom: not persistable in a single")
+        print(f"   call, but 0.7 fails LOUD so the agent's ensure-heading/retry path can recover.)")
+        print(f"\nNOTE: recovery-penalty and detect-cost are tunable ASSUMPTIONS, not measured.")
+        print(f"      gen tokens are a chars/{a.cpt} proxy for output tokens of the I/O layer only.")
+        print(f"      This harness measures mechanics, not model reasoning quality.\n")
+
+        report = {"params": vars(a), "rows": rows, "totals": {"0.6": T06, "0.7": T07},
+                  "silent_error_free": {"0.6": f"{sf06}/{denom}", "0.7": f"{sf07}/{denom}"},
+                  "writes_persisted": {"0.6": f"{pr06}/{denom}", "0.7": f"{pr07}/{denom}"}}
+        out_path = HERE / "results" / "latest.json"
+        out_path.parent.mkdir(parents=True, exist_ok=True)  # results/ may not exist yet
+        out_path.write_text(json.dumps(report, indent=2))
+        print(f"wrote {out_path}")
+    finally:
+        srv.terminate()
+
+if __name__ == "__main__":
+    main()