Project · 2026

Searchprobe

An adversarial benchmarking framework for neural search engines — Exa, Tavily, Brave, and SerpAPI — built to surface where production search actually fails, with the statistical rigor to back the claims.

Source ↗

Python · Streamlit · async · Pydantic · pytest · LLM-as-judge

13: embedding-grounded attack categories
70+: adversarial test pairs
4: providers benchmarked

Why

There was no rigorous, systematic way to compare neural search providers or to understand where they break. Averages hide failure modes; Searchprobe was built to expose them.

Architecture

Thirteen attack categories, each grounded in embedding theory, generate adversarial test pairs that are run against every provider. Responses are scored by an LLM-as-judge, and the scores are then analyzed with bootstrap confidence intervals and Benjamini-Hochberg correction rather than eyeballed.

  attack categories (13)
        │
        ▼
  test pairs (70+) ──▶ ┌ Exa     ┐
                       │ Tavily  │
                       │ Brave   ├─▶ responses ─▶ LLM-as-judge ─▶ scores
                       └ SerpAPI ┘                                  │
                                                                    ▼
                                       bootstrap CIs · Benjamini-Hochberg
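
The fan-out in the diagram can be sketched roughly as below. This is a minimal illustration, not Searchprobe's actual code: `AdversarialPair`, `stub_provider`, `stub_judge`, `run_pair`, and `run_benchmark` are hypothetical names, the provider call is a canned stand-in for a real API client, and the judge is a substring check standing in for the LLM-as-judge.

```python
import asyncio
from dataclasses import dataclass

# Provider names come from the project description; everything else is illustrative.
PROVIDERS = ["Exa", "Tavily", "Brave", "SerpAPI"]


@dataclass(frozen=True)
class AdversarialPair:
    """Hypothetical shape of one adversarial test pair."""
    category: str  # attack category, e.g. "negation"
    query: str     # query sent to every provider
    expected: str  # phrase a relevant result is assumed to contain


async def stub_provider(name: str, query: str) -> list[str]:
    """Stand-in for a real provider client; returns canned result snippets."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return [f"{name} result for: {query}"]


def stub_judge(pair: AdversarialPair, results: list[str]) -> float:
    """Stand-in for the LLM-as-judge: 1.0 if any result mentions the expected phrase."""
    hit = any(pair.expected.lower() in r.lower() for r in results)
    return 1.0 if hit else 0.0


async def run_pair(pair: AdversarialPair) -> dict[str, float]:
    """Send one test pair to all providers concurrently and score each response."""
    responses = await asyncio.gather(
        *(stub_provider(p, pair.query) for p in PROVIDERS)
    )
    return {p: stub_judge(pair, r) for p, r in zip(PROVIDERS, responses)}


async def run_benchmark(pairs: list[AdversarialPair]) -> dict[str, list[float]]:
    """Collect per-provider score lists, ordered like the input pairs."""
    per_pair = await asyncio.gather(*(run_pair(p) for p in pairs))
    scores: dict[str, list[float]] = {p: [] for p in PROVIDERS}
    for row in per_pair:
        for provider, score in row.items():
            scores[provider].append(score)
    return scores


if __name__ == "__main__":
    demo = [AdversarialPair("negation", "laptops without touchscreens", "laptops")]
    print(asyncio.run(run_benchmark(demo)))
```

Keeping one score list per provider, aligned by test pair, makes the later statistics straightforward, since each list can be resampled or compared pairwise.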

What it found

With LLM-as-judge scoring, bootstrap confidence intervals, and Benjamini-Hochberg correction, the results go beyond pass/fail averages: each comparison comes with an uncertainty estimate and a correction for multiple comparisons, which makes the claims defensible. The recurring finding: most providers fail hard on negation and numeric queries.
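
For readers unfamiliar with the two statistical tools named above, here is a minimal standard-library sketch. It assumes a percentile bootstrap over per-query judge scores and the standard Benjamini-Hochberg step-up procedure. The function names, defaults, and sample numbers are illustrative, and the source does not say which bootstrap variant or which set of p-values Searchprobe actually uses.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:  # reject every hypothesis ranked at or below the cutoff
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Made-up judge scores for one provider on one attack category.
    judge_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
    print(bootstrap_ci(judge_scores))
    # Made-up p-values, e.g. one per attack category comparison.
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```

The interval conveys how much a provider's mean score could move under resampling of the test pairs. The correction matters because testing many categories across several providers produces many comparisons, and some would look significant by chance alone.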
