Jason Stedwell 2fc3a0a80b ver-0.7
2026-06-19 21:12:14 -05:00

echo-memory eval — 0.6 vs 0.7 A/B harness

A reproducible, credential-free A/B comparison of the plugin before and after the hardening work: 0.6 uses the raw-curl recipes from SKILL.md, and 0.7 uses the shipped scripts/echo.sh client. It quantifies two claims from the comparative analysis: the token cost of the I/O layer and the rate of silent write failures.

Run it

cd eval
python3 run_eval.py                 # default params
python3 run_eval.py --recovery 2500 # sensitivity-test the recovery assumption
python3 run_eval.py --cpt 3.5       # different chars/token proxy

No network, no API key, no live vault. Pure stdlib (Python 3 + bash for echo.sh). Results table prints to stdout and a machine-readable copy lands in results/latest.json.

How it works

  • mock_olrapi.py — a deterministic mock of the Obsidian Local REST API surface the plugin uses, reproducing its real behaviors and quirks (404 shape, the /vault// double-slash 400, directory listings with dir/ entries, PATCH heading targets that return 400 invalid-target / 40080 when the heading is absent). Faults are triggered by path markers so one server serves every scenario:
    • flaky in the path → first write returns 503, then succeeds (tests retry).
    • phantom in the path → PUT returns 200 but does not persist (tests read-back verify).
    • a PATCH to a missing heading → 400 (the silent-write-loss trigger).
  • run_eval.py — for each scenario, runs both methods against a freshly reset + re-seeded server (so faults are identical for both), then reads ground truth back independently from the mock. The 0.7 side executes the actual shipped echo.sh; the 0.6 side faithfully models the documented recipe (real HTTP, but no status check, no retry, no verify, no dedupe).
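The path-marker idea can be illustrated with a toy in-memory stand-in. This is a minimal sketch, not the code in mock_olrapi.py: the `MockVault` class, its method names, and the use of 200 for an ordinary successful write are assumptions made for illustration, and it covers only the `flaky` and `phantom` markers.

```python
class MockVault:
    """Toy in-memory vault. Class and method names are hypothetical."""

    def __init__(self):
        self.files = {}         # path -> body that was actually persisted
        self.flaky_hit = set()  # flaky paths that have already failed once

    def put(self, path, body):
        if "flaky" in path and path not in self.flaky_hit:
            self.flaky_hit.add(path)
            return 503          # first write fails, later writes succeed
        if "phantom" in path:
            return 200          # accepted, but deliberately not stored
        self.files[path] = body
        return 200              # assumed success code for this sketch

    def get(self, path):
        if path in self.files:
            return 200, self.files[path]
        return 404, None

    def reset(self):
        """Restore a clean state so both methods face identical faults."""
        self.files.clear()
        self.flaky_hit.clear()


vault = MockVault()
print(vault.put("logs/flaky-session.md", "entry"))     # 503
print(vault.put("logs/flaky-session.md", "entry"))     # 200
print(vault.put("logs/phantom-heartbeat.md", "beat"))  # 200
print(vault.get("logs/phantom-heartbeat.md"))          # (404, None)
```

Because the fault is keyed on the path alone, a single server instance can serve every scenario, and a reset between runs keeps the one-time 503 reproducible for both methods.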

Metrics

| metric | meaning |
| --- | --- |
| gen_tokens | output tokens the model must generate for the op (len(emitted)/cpt proxy) |
| silent_failure | method reported success but ground truth is wrong (lost write or dup), and nobody noticed |
| detected | the method surfaced the failure (nonzero exit) instead of hiding it |
| effective_tokens | gen + silent_failures*recovery + detected*detect_cost |
| silent-error-free ops | the headline accuracy number |
| writes actually persisted | did the single op land (separate from "was it silent") |
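The two token formulas are simple enough to check by hand. The sketch below restates them as functions (the function signatures are illustrative, not taken from run_eval.py) and applies them to the 0.6 side of the representative result reported later in this README. The value recovery=1500 is an inference, not a documented default: it is what 6723 = 723 + 4 × recovery implies if the 0.6 method detects nothing.

```python
def gen_tokens(emitted: str, cpt: float) -> float:
    """Chars-per-token proxy for generated output; not a tokenizer count."""
    return len(emitted) / cpt


def effective_tokens(gen, silent_failures, detected, recovery, detect_cost):
    """gen + silent_failures*recovery + detected*detect_cost."""
    return gen + silent_failures * recovery + detected * detect_cost


# 0.6 side: 723 generated tokens, 4 silent failures, assumed 0 detected.
# recovery=1500 is inferred from the reported 6723, not read from the harness.
print(effective_tokens(723, 4, 0, recovery=1500, detect_cost=0))  # 6723

# Same counts under a harsher recovery assumption (what the formula gives).
print(effective_tokens(723, 4, 0, recovery=2500, detect_cost=0))  # 10723

# The proxy itself: 70 emitted characters at 3.5 chars/token.
print(gen_tokens("x" * 70, cpt=3.5))  # 20.0
```

The silent-failure term dominates the 0.6 total, which is why the recovery assumption deserves a sensitivity check while the generated-token count does not.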

Scenarios

  1. agent-log-missing-heading — PATCH append to a note lacking the target heading (400).
  2. scope-switch — clean PATCH replace (no fault; pure token comparison).
  3. inbox-capture-replayed — same capture issued twice (retry/replay): dedup vs duplicate.
  4. session-log-flaky-network — one-time 503: retry vs single-shot.
  5. heartbeat-phantom-write — accepted-but-not-persisted: read-back verify vs none.
  6. cold-start-load-6-reads — 6 GETs (no fault; pure token comparison).
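The retry and read-back scenarios come down to the difference between a fire-and-forget write and one that checks its own result. The following is a rough sketch of that contrast, not a port of echo.sh or of the 0.6 recipe: `naive_write`, `hardened_write`, the `retries` parameter, and the fake server functions are all hypothetical, and deduplication is left out.

```python
def naive_write(put, path, body):
    """0.6-style model: send once, ignore the status, report success."""
    put(path, body)
    return True


def hardened_write(put, get, path, body, retries=2):
    """0.7-style sketch: check status, retry on 503, verify by read-back."""
    for _ in range(retries + 1):
        status = put(path, body)
        if status == 503:
            continue                 # transient failure, try again
        if status != 200:
            return False             # surface any other error
        read_status, found = get(path)
        return read_status == 200 and found == body
    return False                     # retries exhausted


# Tiny fake server using the same path markers as the mock (illustrative only).
files, failed_once = {}, set()

def fake_put(path, body):
    if "flaky" in path and path not in failed_once:
        failed_once.add(path)
        return 503
    if "phantom" not in path:
        files[path] = body
    return 200

def fake_get(path):
    return (200, files[path]) if path in files else (404, None)


print(naive_write(fake_put, "a/phantom.md", "x"), "a/phantom.md" in files)  # True False
print(hardened_write(fake_put, fake_get, "b/phantom.md", "x"))              # False
print(hardened_write(fake_put, fake_get, "c/flaky.md", "x"))                # True
```

The first line of output is a silent failure in the sense of the metrics: success was reported and nothing was stored. The hardened path turns the same fault into a detected failure, and it recovers from the one-time 503 without surfacing anything.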

Representative result (defaults)

generated tokens             723 -> 174   (76% fewer)
silent failures                4 ->   0   (-4)
duplicate lines                1 ->   0   (-1)
silent-error-free ops        1/5 -> 5/5
effective tokens (assumed)  6723 -> 334

Honest caveats

  • Mechanics, not reasoning. This measures the deterministic plumbing differences. It does not measure model judgment (routing choices, prose quality) — that needs a live model.
  • recovery and detect-cost are assumptions, not measurements. The headline "silent failures: 4 → 0" is a hard count from ground truth; the effective_tokens figure is a model on top of it — tune --recovery to see the sensitivity.
  • gen_tokens is a chars/cpt proxy for the I/O layer only, not a tokenizer count, and excludes the one-time +12% SKILL.md context cost noted in the analysis (that's a per-session context cost, not per-op).

Extending to a live-model run (optional)

To measure real model behavior and true token counts:

  1. Define the same six scenarios as natural-language tasks (e.g. "log a session note for X").
  2. Run each twice — once with the 0.6 skill files, once with 0.7 — through the Agent SDK against the mock server (point ECHO_BASE at it) so faults stay deterministic.
  3. Record usage.output_tokens per task from the API and whether the vault ended correct (same ground-truth read used here).

The mock + ground-truth checks in this harness are reusable as-is for that; only the driver changes from "scripted ops" to "model-driven ops".
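For a model-driven run, the per-task bookkeeping could look like the sketch below. It assumes a hypothetical driver has already produced one record per task; `TaskResult` and `tally` are illustrative names, and nothing here calls the Agent SDK or any real API. The classification mirrors the metric definitions above: a wrong vault with a success report is a silent failure, and a wrong vault with a reported error is detected.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    output_tokens: int   # in a live run, taken from usage.output_tokens
    reported_ok: bool    # did the agent claim the task succeeded?
    vault_correct: bool  # independent ground-truth read from the mock


def tally(results):
    """Aggregate token usage and failure classes across tasks."""
    totals = {"output_tokens": 0, "clean": 0, "silent_failures": 0, "detected": 0}
    for r in results:
        totals["output_tokens"] += r.output_tokens
        if r.vault_correct:
            totals["clean"] += 1            # vault ended in the right state
        elif r.reported_ok:
            totals["silent_failures"] += 1  # wrong state, success claimed
        else:
            totals["detected"] += 1         # wrong state, failure surfaced
    return totals


# Made-up records purely to exercise the function; not measured data.
sample = [
    TaskResult(output_tokens=120, reported_ok=True, vault_correct=True),
    TaskResult(output_tokens=90, reported_ok=True, vault_correct=False),
    TaskResult(output_tokens=60, reported_ok=False, vault_correct=False),
]
print(tally(sample))
```

Keeping the ground-truth check separate from what the agent reports is the point of the design: the harness never has to trust the method under test.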