bench:primitives — 60-task code retrieval evaluation
Three retrieval baselines, two real OSS codebases, 60 hand-verified tasks. Reproducible.
Headline results on expressjs/express and sverklo/sverklo:

- sverklo: F1 0.58 with 255 average input tokens and 1.0 tool calls.
- smart-grep (a tuned grep with language filters and definition-shaped patterns): F1 0.67 with 731 tokens and 11.8 tool calls.
- naive grep (the floor: `grep -rn <sym> .`, then read the top 10 files): F1 0.35 with 15,814 tokens and 7.6 tool calls.
All baselines
| baseline | n | F1 | recall | prec | tokens | tools | wall (ms) | cold start (ms) | tok/correct (gated, F1 ≥ 0.8) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 60 | 0.35 | 0.56 | 0.29 | 15814 | 7.6 | 1302 | 0 | 3557 (n=10) |
| smart-grep | 60 | 0.67 | 0.81 | 0.62 | 731 | 11.8 | 215 | 0 | 165 (n=28) |
| sverklo | 60 | 0.58 | 0.73 | 0.57 | 255 | 1.0 | 1 | 3690 | 203 (n=25) |
Read this carefully: smart-grep is a strong baseline, with higher F1 (0.67 vs 0.58) on this 60-task slice. Sverklo wins on token economy and tool-call count by a large margin: 62× fewer tokens than naive grep, 2.9× fewer than smart-grep, and a single tool call vs 7-12. For an AI agent with a 200K token context window, that's the load-bearing axis. For a human standing at a terminal with `rg`, smart-grep is fine.
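To make the context-window point concrete, here's the arithmetic (a back-of-the-envelope sketch using the per-task averages from the table above; real agent sessions interleave retrieval with everything else):

```ts
// Back-of-the-envelope: how many retrieval tasks fit in one context window,
// using the average-input-token figures from the "All baselines" table.
const contextWindow = 200_000;

const avgTokensPerTask: Record<string, number> = {
  "naive-grep": 15_814,
  "smart-grep": 731,
  "sverklo": 255,
};

for (const [baseline, tokens] of Object.entries(avgTokensPerTask)) {
  console.log(`${baseline}: ~${Math.floor(contextWindow / tokens)} tasks/window`);
}
// naive-grep: ~12, smart-grep: ~273, sverklo: ~784
```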
Per-category breakdown
P1 — Definition lookup (n=20)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.15 | 23337 | 339 | 8.1 |
| smart-grep | 0.60 | 196 | 51 | 2.0 |
| sverklo | 0.75 | 283 | 0 | 1.0 |
Sverklo wins. Single tool call (sverklo_lookup) vs 8 grep iterations.
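For orientation, a single sverklo_lookup call over MCP is one JSON-RPC `tools/call` message on stdio. The sketch below is illustrative: `tools/call` is the standard MCP method, but the argument shape for `sverklo_lookup` (the `symbol` field and the example symbol) is an assumption, not the documented schema.

```ts
// Hypothetical MCP request for a single definition lookup over stdio.
// The `arguments` shape is assumed for illustration; consult the tool
// schema the sverklo stdio server actually advertises.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "sverklo_lookup",
    arguments: { symbol: "createApplication" }, // hypothetical argument name
  },
};
process.stdout.write(JSON.stringify(request) + "\n"); // newline-delimited JSON
```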
P2 — Reference finding (n=20)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.39 | 21925 | 345 | 7.0 |
| smart-grep | 0.81 | 224 | 17 | 1.0 |
| sverklo | 0.56 | 157 | 0 | 1.0 |
Smart-grep wins. Reference finding on Express/sverklo turns out to be a regex problem grep handles well; sverklo's symbol-graph helps less than we'd hoped on this slice. Token economy still favours sverklo.
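For a feel of why regex does well here: "definition-shaped patterns" are templates like the ones below, and reference finding simply drops them for a bare word-boundary match, which is exactly the query grep is built for. A minimal sketch of the idea, not smart-grep's actual pattern set:

```ts
// Minimal sketch of "definition-shaped" patterns for JS/TS symbols.
// Not smart-grep's actual patterns; just the shape of the idea.
function definitionPatterns(symbol: string): string[] {
  const s = symbol.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // escape regex metachars
  return [
    `function\\s+${s}\\s*\\(`,      // function declarations
    `(const|let|var)\\s+${s}\\s*=`, // variable-bound values and arrow functions
    `class\\s+${s}\\b`,             // class declarations
    `${s}\\s*\\([^)]*\\)\\s*\\{`,   // method definitions
  ];
}

// Reference finding is just the bare symbol with word boundaries:
const referencePattern = (symbol: string) => `\\b${symbol}\\b`;
```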
P4 — File dependencies (n=10)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.51 | 2918 | 280 | 2.0 |
| smart-grep | 0.63 | 1058 | 16 | 2.0 |
| sverklo | 0.86 | 74 | 0 | 1.0 |
Sverklo wins decisively. sverklo_deps against the indexed import graph is what graph-based retrieval is supposed to do.
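For readers new to the approach: once an import graph is indexed, a file-dependency question is a map lookup rather than a repo scan, which is where the 74-token, single-call numbers come from. A minimal sketch of such a graph (illustrative only, not sverklo's implementation):

```ts
// Minimal import-graph sketch: parse imports once, answer dependency
// queries from memory. Illustrative only, not sverklo's implementation.
import { readFileSync } from "node:fs";

// Matches `import ... from "x"` and `require("x")` specifiers.
const importRe = /(?:import\s[^'"]*?from\s*|require\s*\(\s*)['"]([^'"]+)['"]/g;

function buildImportGraph(files: string[]): Map<string, string[]> {
  const graph = new Map<string, string[]>();
  for (const file of files) {
    const src = readFileSync(file, "utf8");
    graph.set(file, [...src.matchAll(importRe)].map((m) => m[1]));
  }
  return graph;
}

// Query time is an O(1) lookup instead of a per-question tree scan.
const graph = buildImportGraph(["lib/express.js", "lib/application.js"]);
console.log(graph.get("lib/express.js"));
```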
P5 — Dead code (n=10)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.50 | 1442 | 6164 | 13.5 |
| smart-grep | 0.55 | 2488 | 1138 | 63.0 |
| sverklo | 0.02 | 579 | 3 | 1.0 |
Sverklo loses badly here. The current sverklo_refs doesn't catch dynamic invocations and deserialization-driven calls that smart-grep finds via aggressive whole-file reads. P5 is the next slice we plan to fix.
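Concretely, the pattern class that defeats a static reference graph looks like this (an illustrative example, not code from either benchmarked repo):

```ts
// Illustrative example (not from either benchmarked repo) of calls that a
// static reference graph misses but an aggressive whole-file read can catch.
const handlers: Record<string, () => void> = {
  flush: () => console.log("flushing"),
  compact: () => console.log("compacting"),
};

const action = process.env.ACTION ?? "compact"; // name arrives as data, not code
handlers[action]?.(); // dynamic invocation: no static edge to `compact`

// `compact` has zero statically resolvable references, so a reference graph
// flags it as dead even though it is reachable at runtime.
```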
Where sverklo wins (full list)
| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok |
|---|---|---|---|---|---|
| express/ex-p1-02 | P1 | 1.00 | 0.00 | 769 | 10615 |
| express/ex-p1-03 | P1 | 1.00 | 0.00 | 692 | 6844 |
| express/ex-p1-09 | P1 | 1.00 | 0.00 | 128 | 5920 |
| sverklo/sv-p4-05 | P4 | 1.00 | 0.50 | 50 | 874 |
| express/ex-p4-04 | P4 | 1.00 | 0.50 | 36 | 3781 |
| sverklo/sv-p4-04 | P4 | 1.00 | 0.67 | 42 | 928 |
| express/ex-p4-05 | P4 | 1.00 | 0.68 | 41 | 1316 |
| express/ex-p4-02 | P4 | 0.90 | 0.68 | 79 | 1345 |
| sverklo/sv-p4-02 | P4 | 0.86 | 0.71 | 40 | 334 |
| sverklo/sv-p4-03 | P4 | 0.86 | 0.75 | 59 | 754 |
| sverklo/sv-p4-01 | P4 | 0.80 | 0.69 | 232 | 1373 |
Where sverklo loses (the honesty section)
If you quote this benchmark without this section, you're cherry-picking. We're not.
| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok | note |
|---|---|---|---|---|---|---|
| express/ex-p5-01 | P5 | 0.00 | 1.00 | 535 | 0 | missed |
| express/ex-p5-02 | P5 | 0.00 | 1.00 | 535 | 0 | missed |
| express/ex-p5-03 | P5 | 0.00 | 1.00 | 535 | 0 | missed |
| express/ex-p2-04 | P2 | 0.00 | 1.00 | 30 | 49 | missed |
| sverklo/sv-p2-04 | P2 | 0.50 | 1.00 | 58 | 67 | vs smart-grep |
| sverklo/sv-p2-06 | P2 | 0.40 | 0.83 | 137 | 205 | vs smart-grep |
| express/ex-p2-01 | P2 | 0.27 | 0.63 | 442 | 701 | vs smart-grep |
The dead-code (P5) miss is structural — sverklo's reference graph doesn't catch dynamic invocations and deserialization-driven calls. The reference-finding (P2) gap is closer; smart-grep's regex variants happen to match a few cases sverklo's symbol resolution doesn't.
What this benchmark does NOT measure
- Real coding-task latency. This is 60 retrieval primitives, not 60 end-to-end agent runs. We expect the token economy to translate to real-world savings, but a follow-up bench:swe (65 SWE-bench-style questions × 5 OSS repos) is on the public roadmap to measure that.
- Anything beyond Express + sverklo. The current slice is two TS/JS codebases. Coverage on Go, Python, Rust is the next dataset extension.
- Cross-repo retrieval. The bench is single-repo; sverklo's workspace and cross-repo features aren't exercised.
- Cost of indexing. The 3690ms cold-start figure is the index build for sverklo. We list it as a separate column rather than amortizing it into wall time, so you can decide whether your usage pattern justifies it.
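On that last point, the wall-time break-even is easy to compute from the overall table (a sketch; your repo's index time and query mix will differ):

```ts
// Wall-time break-even for the index build, using the averages from the
// "All baselines" table. A sketch; real repos and query mixes will differ.
const coldStartMs = 3690; // sverklo index build
const perQueryMs = { sverklo: 1, smartGrep: 215, naiveGrep: 1302 };

const breakEven = (rivalMs: number) =>
  Math.ceil(coldStartMs / (rivalMs - perQueryMs.sverklo));

console.log(breakEven(perQueryMs.smartGrep)); // ~18 queries vs smart-grep
console.log(breakEven(perQueryMs.naiveGrep)); // ~3 queries vs naive grep
```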
Methodology
- Tasks: 60 hand-verified retrieval primitives across expressjs/express and sverklo/sverklo, distributed across P1 (definition lookup, n=20), P2 (reference finding, n=20), P4 (file dependencies, n=10), P5 (dead-code detection, n=10).
- Metrics: F1, recall, precision, total input tokens, tool-call count, wall time. `tokens_per_correct_answer = input_tokens / max(recall, 0.01)`; lower is better. The gated column averages only over runs where F1 ≥ 0.8, refusing to reward "found nothing cheaply". (See the sketch after this list.)
- Tolerances: P1 uses a ±3-line tolerance, P2 uses ±2 lines, P4/P5 use set membership.
- naive-grep floor: `grep -rn <sym> .`, then read the top 10 files in full.
- smart-grep: language filters, ±10-line context reads, definition-shaped patterns. The strong baseline.
- sverklo: spawns the MCP stdio server once per dataset; cold-start is the index build.
- Run environment: Apple Silicon laptop, Node 25, sverklo running with `--expose-gc`, ONNX all-MiniLM-L6-v2 (~90 MB on disk).
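For clarity, the scoring described above reduces to the following (a sketch; the harness in the repo is the authoritative implementation):

```ts
// Sketch of the scoring logic described above; the benchmark harness in the
// repo is the authoritative implementation.
type Run = { f1: number; recall: number; inputTokens: number };

function tokensPerCorrect(run: Run): number {
  // Lower is better; the floor keeps zero-recall runs from dividing by zero.
  return run.inputTokens / Math.max(run.recall, 0.01);
}

function gatedTokensPerCorrect(runs: Run[]): { mean: number; n: number } {
  // Average only over runs with F1 >= 0.8: "found nothing cheaply" scores nothing.
  const passed = runs.filter((r) => r.f1 >= 0.8);
  const total = passed.reduce((sum, r) => sum + tokensPerCorrect(r), 0);
  return { mean: passed.length ? total / passed.length : NaN, n: passed.length };
}
```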
Reproducing this
```bash
git clone https://github.com/sverklo/sverklo && cd sverklo
npm install
npm run build
npm run bench:primitives
```
Raw outputs (`raw.jsonl`, `summary.json`, `report.md`) land in `benchmark/results/<timestamp>/`. The `report.md` mirrors this page's tables. Disagreements with our numbers are useful: file an issue with your machine spec and the run timestamp.
Performance benchmarks (separate)
This page is the retrieval benchmark. We also publish performance numbers (cold index time, search latency, impact-analysis time) for five real OSS codebases at `/benchmarks/`. Both are reproducible from `BENCHMARKS.md`.
Cite this
If you reference this benchmark in academic work or comparison material:
```bibtex
@misc{sverklo_bench_primitives_2026,
  title  = {Sverklo bench:primitives — a 60-task retrieval evaluation for AI coding agents},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/bench/}
}
```
Get started
If the token-economy numbers look interesting:
```bash
npm install -g sverklo
cd your-project
sverklo init
```
`sverklo init` auto-detects which AI coding agents you have installed (Claude Code, Cursor, Windsurf, Zed, Antigravity) and writes the right MCP config files.