fix(miner): harden Windows mine against ONNX bad_alloc + silent partial exits

Three small changes that together address the failure modes in #1296:

1. Add pnpm-lock.yaml and yarn.lock to SKIP_FILENAMES, mirroring the
   existing package-lock.json rule. A 24K-line pnpm-lock.yaml produced
   ~1124 chunks in one batch and tripped onnxruntime bad_alloc on
   Windows; pnpm/yarn lockfiles are no more useful to mine than npm's.

2. Skip any file that produces more than MAX_CHUNKS_PER_FILE (500)
   chunks, with a clear log line. Catches the broader class (generated
   CSV/JSON, build artifacts, etc.) that the named-file SKIP list will
   never fully cover. The cap is conservative (500 chunks * 800 chars
   ≈ 400 KB of source) so legitimate hand-written content still mines.

3. Print a partial-progress summary on any exception in _mine_impl,
   not just KeyboardInterrupt, then re-raise. Without this, an
   arbitrary exception (ONNX bad_alloc, chromadb HNSW error, OS fault)
   propagates silently: the operator sees only the last progress line
   and assumes the mine succeeded. The new path mirrors the
   KeyboardInterrupt summary (files_processed, drawers_filed,
   last_file) plus the exception type and message, then re-raises so
   the original traceback surfaces and the exit code is non-zero.

Tests cover: SKIP_FILENAMES contents, the chunk-cap path returning
(0, room) with no upserts, and the new mine-aborted summary surfacing
both the partial counters and the exception class.
@@ -66,6 +66,8 @@ SKIP_FILENAMES = {
     "mempal.yml",
     ".gitignore",
     "package-lock.json",
+    "pnpm-lock.yaml",
+    "yarn.lock",
 }
 
 CHUNK_SIZE = 800 # chars per drawer
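
Illustrative note (not part of the commit): the hunk above only shows the set itself, not how the miner consults it. A minimal sketch, assuming the check is an exact basename match; `should_skip` is a hypothetical helper name, and the set below is only the tail visible in the hunk:

```python
from pathlib import Path

# Only the entries visible in the hunk; the real set has more above these.
SKIP_FILENAMES = {
    "mempal.yml",
    ".gitignore",
    "package-lock.json",
    "pnpm-lock.yaml",
    "yarn.lock",
}


def should_skip(path: Path) -> bool:
    """Hypothetical helper: skip on exact basename match, in any directory."""
    return path.name in SKIP_FILENAMES


print(should_skip(Path("web/app/pnpm-lock.yaml")))  # True
print(should_skip(Path("docs/yarn.lock.md")))       # False
```

Under this assumption a lockfile is skipped wherever it sits in the tree, while a file that merely contains the name is still mined.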
@@ -73,6 +75,13 @@ CHUNK_OVERLAP = 100 # overlap between chunks
 MIN_CHUNK_SIZE = 50 # skip tiny chunks
 DRAWER_UPSERT_BATCH_SIZE = 1000
 MAX_FILE_SIZE = 500 * 1024 * 1024 # 500 MB — skip files larger than this.
+# A single file producing more chunks than this is almost always a generated
+# artifact (CSV/JSON dump, lockfile not in SKIP_FILENAMES, etc.). Embedding
+# thousands of chunks from one file in one batch has triggered ONNX runtime
+# `bad allocation` errors on Windows (#1296). The cap is conservative: a
+# 500-chunk file at CHUNK_SIZE=800 is ~400 KB of source, which covers most
+# legitimate hand-written content while bounding the worst-case batch.
+MAX_CHUNKS_PER_FILE = 500
 # Long Claude Code sessions and large transcript exports routinely exceed
 # 10 MB. The cap exists as a defensive rail against pathological binary
 # files, not as a limit on legitimate text. Per-drawer size is bounded
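
Illustrative note (not part of the commit): the "~400 KB" figure in the comment is the product 500 * 800, which counts overlapping characters twice. A rough check of the arithmetic, assuming a plain fixed-size sliding window with the constants shown in the hunk; the real chunk_text may split differently, and `chars_covered` is a hypothetical helper:

```python
CHUNK_SIZE = 800           # chars per drawer (from the diff)
CHUNK_OVERLAP = 100        # overlap between chunks (from the diff)
MAX_CHUNKS_PER_FILE = 500  # new cap (from the diff)


def chars_covered(n_chunks: int, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> int:
    """Distinct source characters spanned by n_chunks sliding windows."""
    if n_chunks <= 0:
        return 0
    stride = size - overlap  # each chunk after the first adds this many new chars
    return size + (n_chunks - 1) * stride


upper_bound = MAX_CHUNKS_PER_FILE * CHUNK_SIZE
print(upper_bound)                         # 400000
print(chars_covered(MAX_CHUNKS_PER_FILE))  # 350100
```

Under the sliding-window assumption the cap starts skipping files somewhere between roughly 350 KB and 400 KB of text, so the comment's figure is best read as an upper bound.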
@@ -825,6 +834,13 @@ def process_file(
     room = detect_room(filepath, content, rooms, project_path)
     chunks = chunk_text(content, source_file)
 
+    if len(chunks) > MAX_CHUNKS_PER_FILE:
+        print(
+            f" ! [skip] {filepath.name[:50]:50} produced {len(chunks)} chunks "
+            f"(> {MAX_CHUNKS_PER_FILE}); add to SKIP_FILENAMES or .gitignore"
+        )
+        return 0, room
+
     if dry_run:
         print(f" [DRY RUN] {filepath.name} -> room:{room} ({len(chunks)} drawers)")
         return len(chunks), room
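
Illustrative note (not part of the commit): the guard uses a strict `>`, so a file with exactly 500 chunks is still filed and 501 is the first count that is skipped. A self-contained sketch of that boundary; `file_or_skip` and the `upsert` callback are hypothetical stand-ins for the surrounding process_file logic, which the diff does not show in full:

```python
MAX_CHUNKS_PER_FILE = 500


def file_or_skip(name, chunks, room, upsert):
    """Hypothetical stand-in for the guarded section of process_file."""
    if len(chunks) > MAX_CHUNKS_PER_FILE:
        print(
            f" ! [skip] {name[:50]:50} produced {len(chunks)} chunks "
            f"(> {MAX_CHUNKS_PER_FILE}); add to SKIP_FILENAMES or .gitignore"
        )
        return 0, room  # nothing is upserted for a skipped file
    upsert(chunks)
    return len(chunks), room


calls = []  # records every batch handed to the fake upsert
skipped = file_or_skip("dump.csv", ["x"] * 501, "data", calls.append)
kept = file_or_skip("notes.md", ["x"] * 500, "docs", calls.append)

print(skipped)     # (0, 'data')
print(kept)        # (500, 'docs')
print(len(calls))  # 1
```

Because the guard in the diff sits before the `dry_run` branch, a dry run should report the same skips as a real mine.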
@@ -1167,6 +1183,24 @@ def _mine_impl(
             "already-filed drawers are\n upserted idempotently and will not duplicate.\n"
         )
         sys.exit(130)
+    except Exception as exc:
+        # Without this, an arbitrary exception (ONNX bad_alloc, chromadb HNSW
+        # error, OS fault) propagates and the process exits with no completion
+        # banner — the operator sees only the final progress line and assumes
+        # the mine succeeded (#1296). Print the partial-progress summary the
+        # way we do for KeyboardInterrupt, then re-raise so the original
+        # traceback still surfaces and the exit code is non-zero.
+        print("\n\n Mine aborted by exception.")
+        print(f" files_processed: {files_processed}/{len(files)}")
+        print(f" drawers_filed: {total_drawers}")
+        print(f" last_file: {last_file or '<none>'}")
+        print(f" error: {type(exc).__name__}: {exc}")
+        print(
+            f"\n Re-run `mempalace mine {shlex.quote(project_dir)}` after addressing "
+            "the cause — already-filed\n drawers are upserted idempotently and will "
+            "not duplicate.\n"
+        )
+        raise
     finally:
         # Clean up the hooks-side PID lock if it points at us. Stale
         # entries already pass _pid_alive() == False on POSIX, but
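
Illustrative note (not part of the commit): the summarize-then-re-raise pattern from the hunk above, reduced to a runnable sketch. The counter names follow the diff; `abort_summary`, `mine_all`, `process_one`, and the place where `last_file` is updated are assumptions made for the example:

```python
def abort_summary(files_processed, total_files, total_drawers, last_file, exc):
    """Build the partial-progress text; field names follow the diff."""
    return "\n".join([
        " Mine aborted by exception.",
        f" files_processed: {files_processed}/{total_files}",
        f" drawers_filed: {total_drawers}",
        f" last_file: {last_file or '<none>'}",
        f" error: {type(exc).__name__}: {exc}",
    ])


def mine_all(files, process_one):
    """Hypothetical driver loop; process_one(path) returns drawers filed."""
    files_processed = 0
    total_drawers = 0
    last_file = None
    try:
        for path in files:
            last_file = path  # assumed to be set before processing starts
            total_drawers += process_one(path)
            files_processed += 1
    except Exception as exc:
        print(abort_summary(files_processed, len(files), total_drawers, last_file, exc))
        raise  # bare raise keeps the original exception and traceback
    return files_processed, total_drawers


def flaky(path):
    """Fake per-file worker that fails on the second file."""
    if path == "b.json":
        raise RuntimeError("bad allocation")
    return 3


caught = None
try:
    mine_all(["a.py", "b.json", "c.py"], flaky)
except RuntimeError as exc:
    caught = exc  # the caller still sees the original exception
```

`except Exception` does not intercept `KeyboardInterrupt` or `SystemExit`, since both derive from `BaseException`, so a handler like this can sit next to an existing Ctrl-C handler without changing its behavior.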