3 Commits

Author SHA1 Message Date
Igor Lins e Silva 8df7b9bf2c benchmarks: add --llm-backend ollama for non-Anthropic rerank
The rerank pipeline was hardcoded to Anthropic's /v1/messages.
Add a backend flag so the same code path can be exercised with
any OpenAI-compatible endpoint — local Ollama, Ollama Cloud,
or any gateway that speaks /v1/chat/completions.

Enables independent verification of the "100% with Haiku rerank"
claim by running the full benchmark with a different LLM family
(e.g. minimax-m2.7:cloud) and zero Anthropic dependency.

Both longmemeval_bench.py and locomo_bench.py:
 - llm_rerank*() gain backend= / base_url= kwargs
 - CLI: --llm-backend {anthropic,ollama}, --llm-base-url
 - API key required only when backend=anthropic (diary/palace modes still require it)
 - Parse last integer in response (reasoning models emit multi-int output)
 - Fallback to message.reasoning when content is empty
 - Raise max_tokens to 1024 for reasoning models
2026-04-14 21:20:14 -03:00
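
The response-handling items in the commit above (take the last integer, fall back to the reasoning field when content is empty) can be sketched as follows. This is a minimal illustration, not the code from longmemeval_bench.py or locomo_bench.py: the function name `extract_choice` and the plain-dict message shape are assumptions made for the example.

```python
import re
from typing import Optional


def extract_choice(message: dict) -> Optional[int]:
    """Return the last integer in a chat reply, or None if there is none.

    `message` is assumed (hypothetically) to be a dict with an optional
    "content" string and an optional "reasoning" string.
    """
    # Use the reasoning text only when the content is missing or empty.
    text = message.get("content") or message.get("reasoning") or ""
    # Reasoning models may mention several candidate indices before
    # settling on one, so the last integer is treated as the answer.
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None
```

Taking the last match rather than the first is a heuristic: it assumes the model states its final pick at the end of the reply, which may not hold for every model.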
travisBREAKS 89206107fa fix(bench): remove hardcoded credential paths from benchmark runners (#177)
The `_load_api_key()` function in longmemeval_bench.py and locomo_bench.py
searched for API keys in a fixed path (`~/.config/lu/keys.json`) using
personal key names (`anthropic_milla`, `anthropic_claude_code_main`).

This leaks internal infrastructure details into the public codebase and
trains contributors to store credentials in a non-standard location
rather than using the standard ANTHROPIC_API_KEY env var.

Simplified the key lookup order to: CLI flag > env var > empty string.
Updated the help text and HYBRID_MODE.md docs to match.

Co-authored-by: Tadao <tadao@travisfixes.com>
2026-04-11 23:14:36 -07:00
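
The lookup order described in the commit above (CLI flag, then env var, then empty string) might look roughly like this. It is a sketch under assumptions rather than the repository's `_load_api_key()`: the name `resolve_api_key` and its parameters are hypothetical, and only the ANTHROPIC_API_KEY variable name comes from the commit message.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Resolve the API key: CLI flag first, then env var, else empty string."""
    # An explicit, non-empty CLI value always wins.
    if cli_value:
        return cli_value
    # Otherwise use the standard environment variable; no file lookups.
    return env.get("ANTHROPIC_API_KEY", "")
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.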
bensig 0f8fa8c7d5 bench: add benchmark runners, results docs, and test suite
Benchmarks: LongMemEval, LoCoMo, ConvoMem, MemBench runners with
methodology docs and hybrid retrieval analysis.

Tests: config, miner, convo_miner, normalize — 9 tests, all passing.
2026-04-04 18:33:42 -07:00