88 lines
4.4 KiB
Markdown
88 lines
4.4 KiB
Markdown
# echo-memory eval — 0.6 vs 0.7 A/B harness
|
|
|
|
A reproducible, credential-free A/B comparison of the plugin **before** (0.6: raw-curl
|
|
recipes from `SKILL.md`) and **after** (0.7: the shipped `scripts/echo.sh` client) the
|
|
hardening work. It quantifies the claims in the comparative analysis: token cost of the
|
|
I/O layer, and the rate of **silent write failures**.
|
|
|
|
## Run it
|
|
|
|
```bash
|
|
cd eval
|
|
python3 run_eval.py # default params
|
|
python3 run_eval.py --recovery 2500 # sensitivity-test the recovery assumption
|
|
python3 run_eval.py --cpt 3.5 # different chars/token proxy
|
|
```
|
|
|
|
No network, no API key, no live vault. Pure stdlib (Python 3 + bash for `echo.sh`).
|
|
Results table prints to stdout and a machine-readable copy lands in `results/latest.json`.
|
|
|
|
## How it works
|
|
|
|
- **`mock_olrapi.py`** — a deterministic mock of the Obsidian Local REST API surface the
|
|
plugin uses, reproducing its real behaviors and quirks (404 shape, the `/vault//`
|
|
double-slash 400, directory listings with `dir/` entries, `PATCH` heading targets that
|
|
return `400 invalid-target / 40080` when the heading is absent). Faults are triggered by
|
|
path markers so one server serves every scenario:
|
|
- `flaky` in the path → first write returns `503`, then succeeds (tests retry).
|
|
- `phantom` in the path → `PUT` returns `200` but does **not** persist (tests read-back verify).
|
|
- a `PATCH` to a missing heading → `400` (the silent-write-loss trigger).
|
|
- **`run_eval.py`** — for each scenario, runs both methods against a freshly reset +
|
|
re-seeded server (so faults are identical for both), then reads ground truth back
|
|
**independently** from the mock. The 0.7 side executes the *actual shipped `echo.sh`*;
|
|
the 0.6 side faithfully models the documented recipe (real HTTP, but no status check,
|
|
no retry, no verify, no dedupe).
|
|
|
|
## Metrics
|
|
|
|
| metric | meaning |
|
|
|---|---|
|
|
| `gen_tokens` | output tokens the model must generate for the op (`len(emitted)/cpt` proxy) |
|
|
| `silent_failure` | method **reported success** but ground truth is wrong (lost write or dup) — and nobody noticed |
|
|
| `detected` | the method surfaced the failure (nonzero exit) instead of hiding it |
|
|
| `effective_tokens` | `gen + silent_failures*recovery + detected*detect_cost` |
|
|
| `silent-error-free ops` | the headline accuracy number |
|
|
| `writes actually persisted` | did the single op land (separate from "was it silent") |
|
|
|
|
## Scenarios
|
|
|
|
1. **agent-log-missing-heading** — `PATCH` append to a note lacking the target heading (`400`).
|
|
2. **scope-switch** — clean `PATCH replace` (no fault; pure token comparison).
|
|
3. **inbox-capture-replayed** — same capture issued twice (retry/replay): dedup vs duplicate.
|
|
4. **session-log-flaky-network** — one-time `503`: retry vs single-shot.
|
|
5. **heartbeat-phantom-write** — accepted-but-not-persisted: read-back verify vs none.
|
|
6. **cold-start-load-6-reads** — 6 GETs (no fault; pure token comparison).
|
|
|
|
## Representative result (defaults)
|
|
|
|
```
|
|
generated tokens 723 -> 174 (+76% fewer)
|
|
silent failures 4 -> 0 (-4)
|
|
duplicate lines 1 -> 0 (-1)
|
|
silent-error-free ops 1/5 -> 5/5
|
|
effective tokens (assumed) 6723 -> 334
|
|
```
|
|
|
|
## Honest caveats
|
|
|
|
- **Mechanics, not reasoning.** This measures the deterministic plumbing differences. It
|
|
does **not** measure model judgment (routing choices, prose quality) — that needs a live model.
|
|
- **`recovery` and `detect-cost` are assumptions**, not measurements. The headline
|
|
"silent failures: 4 → 0" is a hard count from ground truth; the `effective_tokens` figure
|
|
is a model on top of it — tune `--recovery` to see the sensitivity.
|
|
- **`gen_tokens` is a `chars/cpt` proxy** for the I/O layer only, not a tokenizer count, and
|
|
excludes the one-time `+12%` SKILL.md context cost noted in the analysis (that's a
|
|
per-session context cost, not per-op).
|
|
|
|
## Extending to a live-model run (optional)
|
|
|
|
To measure real model behavior and true token counts:
|
|
1. Define the same six scenarios as natural-language tasks (e.g. "log a session note for X").
|
|
2. Run each twice — once with the 0.6 skill files, once with 0.7 — through the Agent SDK
|
|
against the **mock** server (point `ECHO_BASE` at it) so faults stay deterministic.
|
|
3. Record `usage.output_tokens` per task from the API and whether the vault ended correct
|
|
(same ground-truth read used here).
|
|
|
|
The mock + ground-truth checks in this harness are reusable as-is for that; only the driver
|
|
changes from "scripted ops" to "model-driven ops".
|