I benchmarked code retrieval for AI coding agents on 60 tasks
A tuned grep beats my hybrid-retrieval MCP server on F1. Sverklo wins by 62× on token economy and 7-12× on tool-call count. Both numbers are real, both are in the same report, and the second one matters more for the people I'm trying to help.
Why this benchmark exists
I've spent the last six months building sverklo, a local-first MCP server that gives AI coding agents (Claude Code, Cursor, Windsurf) a real symbol graph instead of grep-based pattern matching. The product positioning has always been "stops the agent from hallucinating function names that don't exist in your codebase."
That positioning is hand-wavy without numbers. Six months in, I had no public benchmark. Whatever speed-of-iteration story I had, I was only telling it to myself.
So I built one: 60 hand-verified retrieval tasks across two real OSS codebases (expressjs/express and the sverklo repo itself), three baselines (naive grep, smart grep, sverklo), and metrics that measure both retrieval quality (F1, recall, precision) and the thing AI agents actually pay for (input tokens, tool calls, wall time).
Results live at sverklo.com/bench. Raw JSONL outputs are in the repo at benchmark/results/<timestamp>/. The harness runs in one npm command. Disagreements with my numbers are useful — file an issue with your machine spec.
The headline
| baseline | F1 | tokens | tool calls |
|---|---|---|---|
| naive-grep | 0.35 | 15,814 | 7.6 |
| smart-grep (tuned) | 0.67 | 731 | 11.8 |
| sverklo | 0.58 | 255 | 1.0 |
A tuned grep beats sverklo on F1 by 9 points. That's not what I expected when I started building this. If you can write a clean ripgrep invocation with language filters and definition-shaped patterns, you get higher F1 than my hybrid retrieval stack returns.
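To make "definition-shaped patterns" concrete, here is roughly the kind of query the smart-grep baseline has to get right. This is a sketch, not the harness's exact baseline; the pattern, the symbol, and the flags are illustrative.

```typescript
// Sketch of a "smart grep" definition lookup: language filters plus a
// definition-shaped pattern, executed through ripgrep from Node.
import { spawnSync } from "node:child_process";

function findDefinitions(symbol: string, repoRoot: string): string {
  // Match declarations of `symbol` (function/class/assignment), not every mention.
  const pattern =
    `\\b(function|class)\\s+${symbol}\\b` +
    `|\\b(const|let|var)\\s+${symbol}\\s*=`;
  const result = spawnSync(
    "rg",
    ["--type", "js", "--type", "ts", "-n", "-e", pattern, repoRoot],
    { encoding: "utf8" },
  );
  // rg exits with status 1 when nothing matches; stdout is simply empty then.
  return result.stdout;
}

console.log(findDefinitions("createApplication", "."));
```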
What sverklo wins on:
- 62× fewer tokens than naive grep (255 vs 15,814)
- 2.9× fewer tokens than smart grep (255 vs 731)
- 1 tool call vs grep's 7-12 per task
- ~1ms wall time after a 3.7-second cold start (the index build)
Why "tokens per correct answer" is the load-bearing metric
If you're standing at a terminal with rg, F1 is what matters. You read the matches yourself; nobody is paying per token for them.
If you're an AI agent with a 200K-token context window, every token has an opportunity cost. Burning 15,000 tokens on grep noise to find one function leaves that much less room for the actual change. The agent that gets the answer in 255 tokens keeps roughly 14,750 more tokens to spend on doing the work.
The metric that actually matters is tokens per correct answer: input tokens divided by recall. The bench reports this for both gated (F1 ≥ 0.8) and ungated runs. For sverklo on the gated subset, it's 203 tokens per correct answer. For naive grep, 3,557. For smart grep, 165 — smart grep is genuinely competitive on per-correct-answer cost when its F1 lands.
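The computation itself is trivial; a minimal sketch, with field names that are mine rather than the harness schema:

```typescript
// Tokens per correct answer: total input tokens divided by total recall,
// optionally gated on F1 >= 0.8 so near-misses don't earn credit.
// TaskResult is an illustrative shape, not the benchmark's actual schema.
interface TaskResult {
  inputTokens: number;
  recall: number; // fraction of expected items retrieved, 0..1
  f1: number;
}

function tokensPerCorrectAnswer(results: TaskResult[], gateF1?: number): number {
  const kept = gateF1 === undefined ? results : results.filter((r) => r.f1 >= gateF1);
  const tokens = kept.reduce((sum, r) => sum + r.inputTokens, 0);
  const correct = kept.reduce((sum, r) => sum + r.recall, 0);
  return tokens / correct;
}

const runs: TaskResult[] = [
  { inputTokens: 255, recall: 0.9, f1: 0.85 },
  { inputTokens: 310, recall: 0.5, f1: 0.6 },
];
console.log(tokensPerCorrectAnswer(runs));      // ungated
console.log(tokensPerCorrectAnswer(runs, 0.8)); // gated subset only
```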
The mistake I almost made: optimising for F1. The thing AI coding agents actually need is the cheapest correct retrieval, not the highest-precision retrieval that takes 12 tool calls to assemble.
Per-category: where each baseline shines
| Category | Best F1 | Best token economy |
|---|---|---|
| P1 — Definition lookup (n=20) | sverklo (0.75) | smart-grep (196 tok) |
| P2 — Reference finding (n=20) | smart-grep (0.81) | sverklo (157 tok) |
| P4 — File dependencies (n=10) | sverklo (0.86) | sverklo (74 tok) |
| P5 — Dead code (n=10) | smart-grep (0.55) | sverklo (579 tok, but F1 = 0.02) |
The pattern: sverklo wins on the slices where structural retrieval (the symbol graph, the import graph) directly answers the question. Definition lookup (P1) and file dependencies (P4) are exactly that. Reference finding (P2) turns out to be a regex problem grep handles well, because the reference patterns in JS/TS are syntactically uniform enough that \bsymbol\b works most of the time.
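A toy illustration of why the word-boundary trick carries P2 (the symbol and snippet are invented):

```typescript
// Reference finding as a regex problem: a word-boundary match around the
// symbol catches calls and aliases while skipping longer identifiers that
// merely contain it. Toy example, not the benchmark's smart-grep baseline.
const symbol = "sendFile";
const refPattern = new RegExp(String.raw`\b${symbol}\b`, "g");

const snippet = `
  res.sendFile(filePath);    // direct call       -> counted
  const send = res.sendFile; // aliased reference -> counted
  sendFileAsync();           // longer identifier -> ignored by the boundary
`;

console.log(snippet.match(refPattern)?.length); // 2
```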
Where sverklo fails: the P5 dead-code slice
P5 is the embarrassing one. F1 = 0.02. sverklo_refs looks at the static call graph. It doesn't see dynamic invocations (this[methodName]()), it doesn't see deserialization-driven calls (JSON.parse + eval patterns), and it doesn't see calls through ORM proxies that spell themselves with template-string method names.
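A made-up example of the pattern that breaks it: the callee name only exists at runtime, so a static reference index sees zero call sites.

```typescript
// Hypothetical dynamic-dispatch example. `handleCreate` is executed at
// runtime, but no call site ever spells its name, so a static reference
// lookup finds nothing and the method looks dead.
class CommandBus {
  handleCreate(payload: unknown) { console.log("create", payload); }
  handleDelete(payload: unknown) { console.log("delete", payload); }

  dispatch(action: string, payload: unknown) {
    // The method name is assembled from data at runtime.
    const method = `handle${action[0].toUpperCase()}${action.slice(1)}`;
    (this as any)[method](payload);
  }
}

new CommandBus().dispatch("create", { id: 1 });
```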
Smart-grep gets 0.55 on the same slice by aggressively reading whole files and matching loose patterns. The "loose" matters: it picks up a lot of false positives, but on dead-code detection a false positive is "this function is alive" — which is the safer error.
P5 is the next thing I'm fixing. The plan is to extend the reference graph with a runtime-trace mode (instrument the test suite, log actual call sites, merge into the static graph). I'll publish that as a new bench slice when it lands.
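Roughly the shape runtime-trace mode would take (a sketch of the stated plan, not shipped code; the wrapper and names here are hypothetical):

```typescript
// Hypothetical instrumentation pass: wrap functions while the test suite
// runs, record which ones are actually invoked, and merge those names into
// the static reference graph afterwards.
const observedCalls = new Set<string>();

function traced<T extends (...args: any[]) => any>(name: string, fn: T): T {
  return ((...args: Parameters<T>) => {
    observedCalls.add(name); // a real call observed at runtime
    return fn(...args);
  }) as T;
}

// During the instrumented test run:
const maybeDead = traced("maybeDead", (x: number) => x * 2);
maybeDead(21);

// After the run, anything in observedCalls is provably alive, even if the
// static call graph found zero references to it.
console.log([...observedCalls]); // ["maybeDead"]
```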
What this benchmark does NOT measure
The bench:primitives 60-task slice is small. It's two codebases, both TypeScript/JavaScript. It doesn't measure cross-repo retrieval. It doesn't measure end-to-end agent task completion. It doesn't measure the cost of indexing relative to retrieval (the cold start is reported as a separate column rather than amortised in).
The next bench dataset I'm shipping is bench:swe — 65 SWE-Bench-style questions × 5 OSS repos (Express, NestJS, Vite, Prisma, FastAPI). First results are here. That measures end-to-end retrieval recall on grounded questions, not just primitive lookups.
Architecture: channelized RRF
The novel piece in sverklo's retrieval is channelized Reciprocal Rank Fusion. Most hybrid retrievers run RRF once over fts ∪ vector. Sverklo runs RRF per channel — FTS, vector, doc-section, path, symbol-name — and fuses the per-channel ranks with channel-specific weights. The path channel is weighted 1.5× because filename matches are precision-skewed: when a query's keywords match a filename, it's signal worth boosting.
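A minimal sketch of the fusion step. The channel names mirror the list above; the 1.5× path boost is from the text, but the other weights and the k constant are illustrative defaults rather than sverklo's actual values.

```typescript
// Channelized Reciprocal Rank Fusion: run RRF per retrieval channel, then
// sum the reciprocal-rank contributions with channel-specific weights.
type Ranking = string[]; // result ids ordered best-first for one channel

const channelWeights: Record<string, number> = {
  fts: 1.0,
  vector: 1.0,
  docSection: 1.0,
  path: 1.5, // filename matches are precision-skewed, so boost them
  symbolName: 1.0,
};

function channelizedRRF(rankings: Record<string, Ranking>, k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const [channel, ranking] of Object.entries(rankings)) {
    const weight = channelWeights[channel] ?? 1.0;
    ranking.forEach((id, rank) => {
      // Standard RRF term 1/(k + rank), scaled by this channel's weight.
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  }
  // Highest fused score first.
  return new Map([...scores].sort((a, b) => b[1] - a[1]));
}

// Usage: fuse per-channel rankings into one ordered candidate list.
const fused = channelizedRRF({
  fts: ["src/router.ts", "src/index.ts"],
  vector: ["docs/routing.md", "src/router.ts"],
  path: ["src/router.ts"],
});
console.log([...fused.keys()]); // "src/router.ts" ranks first
```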
The full architecture rationale is in "RRF is doing 80% of the work" if you want the deep dive on why per-channel weighting matters more than the embedding-model choice.
Reproducing this
```bash
git clone https://github.com/sverklo/sverklo
cd sverklo
npm install
npm run build
npm run bench:primitives
```
Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors the bench page tables. If your numbers differ, please file an issue with your machine spec and the run timestamp — I want the disagreements.
What's the takeaway
If you're choosing between grep and an MCP code-intelligence server for your AI coding agent today:
- If your codebase is small (~30 files), use rg. The MCP server overhead doesn't pay back.
- If you're standing at the terminal yourself doing exploration, learn smart-grep flags. The F1 lands you in the right place.
- If you're running an AI coding agent on a larger codebase and the agent invents function names that don't exist in your repo, the retrieval-token-economy gap is real and material. Sverklo's 1-tool-call retrieval is what unlocks that.
Try it
Sverklo is MIT-licensed, runs entirely on your laptop with embedded SQLite + a local ONNX model. No API keys. No cloud. No telemetry by default.
```bash
npm install -g sverklo
cd your-project
sverklo init
```
Or read the full bench report first — including the slice where sverklo loses.
Cite this
```bibtex
@misc{sverklo_bench_primitives_2026,
  title  = {Sverklo bench:primitives — a 60-task retrieval evaluation for AI coding agents},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/bench/}
}
```