- Run ruff format on all benchmark files (fixes CI lint job)
- Fix check_regression() substring ambiguity: ordered keyword matching
so "latency_improvement_pct" is correctly classified as higher-is-better
- Update stale comments in conftest.py referencing wrong fixture
- Add pytest addopts to skip benchmark/slow/stress markers by default
Concentrates all drawers into a single wing+room to isolate the
embedding model's retrieval limit independent of palace filtering.
Confirms recall degrades to ~0.4-0.5 at 5K drawers per room even
with wing+room filters applied — the spatial structure helps by
keeping buckets small, but can't fix the underlying embedding ceiling.