Commit Graph

7 Commits

Author SHA1 Message Date
Igor Lins e Silva 61d02e10fe benchmarks: add v3.3.0 reproduction results + 50/450 split
Addresses #875: every internal BENCHMARKS.md claim reproduced
on Linux x86_64 (v3.3.0 tag, deterministic ChromaDB embeddings,
seed=42 for the LongMemEval dev/held-out split).

Scorecard — all reproduce exactly:

  LongMemEval
    raw R@5                            96.6% (500/500)   
    hybrid_v4 held-out 450 R@5         98.4% (442/450)   
    hybrid_v4 + minimax rerank R@5     99.2% (496/500)   *
    hybrid_v4 + minimax rerank R@10   100.0% (500/500)   *

  LoCoMo (session, top-10)
    raw                                60.3% (1986q)     
    hybrid v5                          88.9% (1986q)     

  ConvoMem all-categories (250 items)   92.9%            
  MemBench all-categories (8500)        80.3%            

* The minimax-m2.7:cloud rerank run replicates the "100%" claim
  with a different LLM family (no Anthropic dependency). R@10 is
  a perfect reproduction; R@5 misses 4 questions that the
  published Haiku run caught — consistent with BENCHMARKS.md's own
  disclosure that hybrid_v4 includes three question-specific fixes
  developed by inspecting misses, i.e. teaching to the test.

The committed 50/450 split is the deterministic (seed=42) split
BENCHMARKS.md references but wasn't previously in the repo.

Full result JSONLs include every question, every retrieved id,
and every score — auditable end-to-end.
2026-04-14 21:21:11 -03:00
Igor Lins e Silva ca0682abe3 benchmarks: apply ruff-format to llm_rerank (trivial line wrap) 2026-04-14 21:20:54 -03:00
Igor Lins e Silva 8df7b9bf2c benchmarks: add --llm-backend ollama for non-Anthropic rerank
The rerank pipeline was hardcoded to Anthropic's /v1/messages.
Add a backend flag so the same code path can be exercised with
any OpenAI-compatible endpoint — local Ollama, Ollama Cloud,
or any gateway that speaks /v1/chat/completions.

Enables independent verification of the "100% with Haiku rerank"
claim by running the full benchmark with a different LLM family
(e.g. minimax-m2.7:cloud) and zero Anthropic dependency.

Both longmemeval_bench.py and locomo_bench.py:
 - llm_rerank*() gain backend= / base_url= kwargs
 - CLI: --llm-backend {anthropic,ollama}, --llm-base-url
 - API key required only when backend=anthropic (diary/palace modes still require it)
 - Parse last integer in response (reasoning models emit multi-int output)
 - Fallback to message.reasoning when content is empty
 - Raise max_tokens to 1024 for reasoning models
2026-04-14 21:20:14 -03:00
travisBREAKS 89206107fa fix(bench): remove hardcoded credential paths from benchmark runners (#177)
The `_load_api_key()` function in longmemeval_bench.py and locomo_bench.py
searched for API keys in a fixed path (`~/.config/lu/keys.json`) using
personal key names (`anthropic_milla`, `anthropic_claude_code_main`).

This leaks internal infrastructure details into the public codebase and
trains contributors to store credentials in a non-standard location
rather than using the standard ANTHROPIC_API_KEY env var.

Simplified to: CLI flag > env var > empty string. Updated help text
and HYBRID_MODE.md docs to match.

Co-authored-by: Tadao <tadao@travisfixes.com>
2026-04-11 23:14:36 -07:00
travisBREAKS d8b2db696f fix(bench): remove global SSL verification bypass in convomem_bench (#176)
The module-level `ssl._create_default_https_context = ssl._create_unverified_context`
disables certificate verification for ALL urllib requests in the process,
not just the benchmark's HuggingFace downloads. This silently exposes
the benchmark runner to MITM attacks.

If a specific environment needs to skip verification (e.g. corporate proxy),
users can set `PYTHONHTTPSVERIFY=0` or pass a custom ssl context per-request
rather than globally patching the ssl module.

Co-authored-by: Tadao <tadao@travisfixes.com>
2026-04-11 23:14:12 -07:00
bensig 6d8c462219 fix: resolve ruff lint and format errors across codebase
Fix E402 import ordering, F841 unused variable, F541 unnecessary
f-strings, F401 unused import, and auto-format 6 files.
2026-04-04 18:37:17 -07:00
bensig 0f8fa8c7d5 bench: add benchmark runners, results docs, and test suite
Benchmarks: LongMemEval, LoCoMo, ConvoMem, MemBench runners with
methodology docs and hybrid retrieval analysis.

Tests: config, miner, convo_miner, normalize — 9 tests, all passing.
2026-04-04 18:33:42 -07:00