The miner upserted one drawer per ChromaDB call, paying the tokenizer +
ONNX inference-call overhead once per chunk. Embedding also ran on CPU
only, because no EmbeddingFunction was ever wired through the backend.
Two changes, each a speedup on its own; stacked, they give ~9x
end-to-end on a medium corpus (20 files, 568 drawers):
1. Batched upsert. `process_file` and `_file_chunks_locked` now collect
   all chunks of a file into a single `collection.upsert(...)`, so the
   embedding function is called once per file (on all N chunks together)
   instead of N times.
2. Hardware-accelerated embedding function. New `mempalace/embedding.py`
   wraps `ONNXMiniLM_L6_V2` with configurable `preferred_providers`.
   `MEMPALACE_EMBEDDING_DEVICE` (or `embedding_device` in config.json)
   selects auto / cpu / cuda / coreml / dml. Unavailable accelerators
   log a warning and fall back to CPU.
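A minimal sketch of how the device setting could be turned into a `preferred_providers` list before it is handed to `ONNXMiniLM_L6_V2`. The provider strings are ONNX Runtime's own execution-provider names; the helper names, the `auto` preference order, and the env-var-over-config precedence are assumptions for illustration, not necessarily what `mempalace/embedding.py` does. In real code `available` would come from `onnxruntime.get_available_providers()`.

```python
import logging
import os

log = logging.getLogger(__name__)

# ONNX Runtime execution-provider names for each device setting.
PROVIDERS = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "dml": "DmlExecutionProvider",
}
# Hypothetical preference order used when the setting is "auto".
AUTO_ORDER = ("cuda", "coreml", "dml")


def resolve_providers(device: str, available: list[str]) -> list[str]:
    """Map a device setting to a preferred_providers list (hypothetical helper)."""
    device = device.strip().lower()
    cpu = PROVIDERS["cpu"]
    if device == "auto":
        # Take the first accelerator that this onnxruntime build exposes.
        for name in AUTO_ORDER:
            if PROVIDERS[name] in available:
                return [PROVIDERS[name], cpu]
        return [cpu]
    if device == "cpu":
        return [cpu]
    provider = PROVIDERS.get(device)
    if provider is None or provider not in available:
        log.warning("Embedding device %r unavailable; falling back to cpu", device)
        return [cpu]
    # CPU stays second in the list so ONNX Runtime can place ops the
    # accelerator does not support on the CPU provider.
    return [provider, cpu]


def configured_device(config: dict) -> str:
    """Env var first, then config.json, then "auto" (precedence is assumed)."""
    return (
        os.environ.get("MEMPALACE_EMBEDDING_DEVICE")
        or config.get("embedding_device")
        or "auto"
    )
```

Checking availability up front keeps the fallback in one place: a requested accelerator that is missing degrades to CPU with a warning instead of surfacing as an error from the embedding function.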
The factory subclasses `ONNXMiniLM_L6_V2` and overrides `name()` to
return `"default"`, so the persisted EF identity matches existing
palaces created with ChromaDB's bare `DefaultEmbeddingFunction`. The
model and the 384-dim vectors are the same, so turning on the GPU needs
no rebuild.
`ChromaBackend.get_collection` / `create_collection` now pass the
resolved EF on every call so miner writes and searcher reads agree.
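The batched upsert from change 1 can be sketched as follows, assuming a file's chunks arrive as a list of strings. `RecordingCollection` is a stand-in so the sketch runs without ChromaDB; a real collection's `upsert` accepts the same `ids`, `documents`, and `metadatas` keyword arguments. The helper name, the `file::index` ID scheme, and the metadata keys are hypothetical, not mempalace's actual code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RecordingCollection:
    """Stand-in for a ChromaDB collection that only records upsert calls."""
    calls: list[dict[str, Any]] = field(default_factory=list)

    def upsert(self, ids, documents, metadatas):
        self.calls.append({"ids": ids, "documents": documents, "metadatas": metadatas})


def upsert_file_batched(collection, source_file: str, chunks: list[str]) -> int:
    """Collect every chunk of one file, then issue a single upsert (hypothetical)."""
    if not chunks:
        return 0  # nothing to write; skip the call entirely
    ids = [f"{source_file}::{i}" for i in range(len(chunks))]
    metadatas = [{"source_file": source_file, "chunk_index": i} for i in range(len(chunks))]
    # One call per file: the embedding function sees all N documents at once
    # instead of being invoked once per chunk.
    collection.upsert(ids=ids, documents=chunks, metadatas=metadatas)
    return len(chunks)
```

Deterministic per-file IDs keep re-mining idempotent: upserting the same file again overwrites its drawers rather than duplicating them.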
Benchmarks (i9-12900KF + RTX 3090, medium scenario, 568 drawers):
  mode              time     throughput   speedup
  per-chunk + CPU   19.77s    29 drw/s    1.0x (baseline)
  batched + CPU      8.07s    70 drw/s    2.4x
  batched + CUDA     2.15s   264 drw/s    9.2x
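The throughput and speedup columns follow directly from the wall-clock times; a quick recomputation from the numbers in the table (the `summarize` helper is just for illustration):

```python
# Wall-clock times from the medium scenario (568 drawers).
DRAWERS = 568
RUNS = {
    "per-chunk + CPU": 19.77,
    "batched + CPU": 8.07,
    "batched + CUDA": 2.15,
}


def summarize(times: dict[str, float], drawers: int, baseline: str) -> dict[str, tuple[int, float]]:
    """Return (drawers per second, speedup vs. baseline) for each run."""
    base = times[baseline]
    return {
        label: (round(drawers / seconds), round(base / seconds, 1))
        for label, seconds in times.items()
    }


for label, (rate, speedup) in summarize(RUNS, DRAWERS, "per-chunk + CPU").items():
    print(f"{label:<16} {rate:>4} drw/s  {speedup}x")
```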
Reproducible via `benchmarks/mine_bench.py`.
Install paths (quote the extras so shells such as zsh do not glob the brackets):
  pip install "mempalace[gpu]"      # NVIDIA CUDA
  pip install "mempalace[dml]"      # DirectML (Windows)
  pip install "mempalace[coreml]"   # macOS Neural Engine
The mine header now prints `Device: cpu|cuda|...` so users can confirm
that the accelerator engaged.