Commit Graph

1 Commits

Author SHA1 Message Date
matrix9neonebuchadnezzar2199-sketch 7509a72502 fix: mitigate system prompt contamination in search queries (#333)
Addresses Issue #333: AI agents prepending system prompts to search queries
causes embedding retrieval to collapse (89.8% → 1.0% R@10).

Mitigation approach (減災):
- New query_sanitizer.py with 4-stage pipeline:
  Step 1: passthrough for short queries (≤200 chars)
  Step 2: question extraction (finds ? sentences) → ~85-89% recovery
  Step 3: tail sentence extraction → ~80-89% recovery
  Step 4: tail truncation fallback → ~70-80% recovery
  Worst case without sanitizer: 1.0% (catastrophic)
  Worst case with sanitizer: ~70-80% (survivable)

- mcp_server.py: tool_search applies sanitizer before ChromaDB query
- MCP schema: query description warns agents not to include prompts
- New 'context' parameter separates background info from search intent
- Sanitizer metadata included in response when triggered

22 new tests covering all pipeline stages and real-world scenarios.

Made-with: Cursor
2026-04-09 23:28:59 +09:00