Addresses Issue #333: AI agents prepending system prompts to search queries
causes embedding retrieval to collapse (89.8% → 1.0% R@10).
Mitigation approach (減災):
- New query_sanitizer.py with 4-stage pipeline:
Step 1: passthrough for short queries (≤200 chars)
Step 2: question extraction (finds ? sentences) → ~85-89% recovery
Step 3: tail sentence extraction → ~80-89% recovery
Step 4: tail truncation fallback → ~70-80% recovery
Worst case without sanitizer: 1.0% (catastrophic)
Worst case with sanitizer: ~70-80% (survivable)
- mcp_server.py: tool_search applies sanitizer before ChromaDB query
- MCP schema: query description warns agents not to include prompts
- New 'context' parameter separates background info from search intent
- Sanitizer metadata included in response when triggered
22 new tests covering all pipeline stages and real-world scenarios.
Made-with: Cursor