perf(mining): batch per-chunk upserts and add optional GPU acceleration

The miner upserted one drawer per ChromaDB call, paying tokenizer +
ONNX session setup per chunk. The embedding device was CPU-only because
no EmbeddingFunction was ever wired through the backend.

Two changes, each a speedup in its own right; stacked they give ~10x
end-to-end on a medium corpus (20 files, 568 drawers):

1. Batched upsert. `process_file` and `_file_chunks_locked` now collect
   all chunks of a file into a single `collection.upsert(...)` so the
   embedding model runs one forward pass per file instead of N.

2. Hardware-accelerated embedding function. New `mempalace/embedding.py`
   wraps `ONNXMiniLM_L6_V2` with configurable `preferred_providers`.
   `MEMPALACE_EMBEDDING_DEVICE` (or `embedding_device` in config.json)
   selects auto / cpu / cuda / coreml / dml. Unavailable accelerators
   log a warning and fall back to CPU.

   The factory subclasses `ONNXMiniLM_L6_V2` and spoofs its `name()` to
   `"default"` so the persisted EF identity matches existing palaces
   created with ChromaDB's bare `DefaultEmbeddingFunction` -- same
   model, same 384-dim vectors, no rebuild needed when turning GPU on.

   `ChromaBackend.get_collection` / `create_collection` now pass the
   resolved EF on every call so miner writes and searcher reads agree.

Benchmarks (i9-12900KF + RTX 3090, medium scenario, 568 drawers):

  per-chunk + CPU   19.77s ·  29 drw/s   (baseline)
  batched   + CPU    8.07s ·  70 drw/s   (2.4x)
  batched   + CUDA   2.15s · 264 drw/s   (9.2x)

Reproducible via `benchmarks/mine_bench.py`.

Install paths:
  pip install mempalace[gpu]       # NVIDIA CUDA
  pip install mempalace[dml]       # DirectML (Windows)
  pip install mempalace[coreml]    # macOS Neural Engine

Mine header now prints `Device: cpu|cuda|...` so users can confirm the
accelerator engaged.
This commit is contained in:
Igor Lins e Silva
2026-04-24 19:42:35 -03:00
parent 7a757916b3
commit a4868a3589
8 changed files with 784 additions and 61 deletions
+8
View File
@@ -53,6 +53,14 @@ chroma = "mempalace.backends.chroma:ChromaBackend"
[project.optional-dependencies]
dev = ["pytest>=7.0", "pytest-cov>=4.0", "ruff>=0.4.0", "psutil>=5.9"]
spellcheck = ["autocorrect>=2.0"]
# Hardware acceleration for the ONNX embedding model. Install exactly one:
# pip install mempalace[gpu] — NVIDIA CUDA
# pip install mempalace[dml] — DirectML (Windows AMD/Intel/NVIDIA)
# pip install mempalace[coreml] — macOS Neural Engine
# After install, set MEMPALACE_EMBEDDING_DEVICE=cuda|dml|coreml (or "auto").
gpu = ["onnxruntime-gpu>=1.16"]
dml = ["onnxruntime-directml>=1.16"]
coreml = ["onnxruntime>=1.16"]
[dependency-groups]
dev = ["pytest>=7.0", "pytest-cov>=4.0", "ruff>=0.4.0", "psutil>=5.9"]