bench:primitives — 60-task code retrieval evaluation

Three retrieval baselines, two real OSS codebases, 60 hand-verified tasks. Reproducible.

Run: 2026-04-07T23:07:14Z · sverklo v0.2.11 at run-time · 60 tasks × 3 baselines = 180 runs · harness on GitHub

Headline. On 60 verified tasks across expressjs/express and sverklo/sverklo: sverklo achieves F1 0.58 with 255 average input tokens and 1.0 tool calls; smart-grep (a tuned grep with language filters and definition-shaped patterns) achieves F1 0.67 with 731 tokens and 11.8 tool calls; naive grep (the floor — grep -rn <sym> . then read top 10 files) achieves F1 0.35 with 15,814 tokens and 7.6 tool calls.

62× fewer tokens than naive grep (255 vs 15,814)

2.9× fewer tokens than tuned grep (255 vs 731)

1.0 tool call per task (grep needs 7.6 to 11.8)

All baselines

baseline	n	F1	recall	prec	tokens	tools	wall (ms)	cold (ms)	gated tok/correct
naive-grep	60	0.35	0.56	0.29	15814	7.6	1302	0	3557 (n=10)
smart-grep	60	0.67	0.81	0.62	731	11.8	215	0	165 (n=28)
sverklo	60	0.58	0.73	0.57	255	1.0	1	3690	203 (n=25)

Read this carefully: smart-grep is a strong baseline. A tuned grep with language filters and definition-shaped patterns has higher F1 (0.67 vs 0.58) on this 60-task slice. Sverklo wins on token economy and tool-call count by a large margin: 62× fewer tokens than naive grep, 2.9× fewer than smart-grep, and a single tool call vs an average of 7.6 to 11.8. For an AI agent with a 200K-token context window, that's the load-bearing axis. For a human standing at a terminal with `rg`, smart-grep is fine.
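The headline ratios follow directly from the average-token column of the All baselines table. A minimal Python check, with the three averages copied from that table (the variable and function names are ours for illustration, not part of the harness):

```python
# Average input tokens per task, copied from the "All baselines" table.
avg_tokens = {"naive-grep": 15814, "smart-grep": 731, "sverklo": 255}


def token_ratio(baseline: str, reference: str = "sverklo") -> float:
    """How many times more tokens `baseline` uses than `reference`."""
    return avg_tokens[baseline] / avg_tokens[reference]


ratios = {name: round(token_ratio(name), 1) for name in ("naive-grep", "smart-grep")}
print(ratios)  # {'naive-grep': 62.0, 'smart-grep': 2.9}
```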

Per-category breakdown

P1 — Definition lookup (n=20)

baseline	F1	tokens	wall (ms)	tools
naive-grep	0.15	23337	339	8.1
smart-grep	0.60	196	51	2.0
sverklo	0.75	283	0	1.0

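Sverklo wins on F1 (0.75 vs 0.60 for smart-grep). A single tool call (sverklo_lookup) replaces the 8.1 calls naive grep averages; smart-grep needs 2.0 calls and uses somewhat fewer tokens (196 vs 283).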

P2 — Reference finding (n=20)

baseline	F1	tokens	wall (ms)	tools
naive-grep	0.39	21925	345	7.0
smart-grep	0.81	224	17	1.0
sverklo	0.56	157	0	1.0

Smart-grep wins. Reference finding on Express/sverklo turns out to be a regex problem grep handles well; sverklo's symbol-graph helps less than we'd hoped on this slice. Token economy still favours sverklo.

P4 — File dependencies (n=10)

baseline	F1	tokens	wall (ms)	tools
naive-grep	0.51	2918	280	2.0
smart-grep	0.63	1058	16	2.0
sverklo	0.86	74	0	1.0

Sverklo wins decisively. sverklo_deps against the indexed import graph is what graph-based retrieval is supposed to do.

P5 — Dead code (n=10)

baseline	F1	tokens	wall (ms)	tools
naive-grep	0.50	1442	6164	13.5
smart-grep	0.55	2488	1138	63.0
sverklo	0.02	579	3	1.0

Sverklo loses badly here. The current sverklo_refs doesn't catch dynamic invocations and deserialization-driven calls that smart-grep finds via aggressive whole-file reads. P5 is the next slice we plan to fix.
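The per-category tables can be cross-checked against the All baselines table: with category sizes of 20, 20, 10 and 10 tasks, each headline figure should be the task-count-weighted mean of the four category figures. The sketch below does that arithmetic on values copied from the tables above. It is an independent check written for this page, not harness code, and the names are illustrative.

```python
# (n, F1, tokens, tools) per category in the order P1, P2, P4, P5,
# copied from the per-category tables.
per_category = {
    "naive-grep": [(20, 0.15, 23337, 8.1), (20, 0.39, 21925, 7.0),
                   (10, 0.51, 2918, 2.0), (10, 0.50, 1442, 13.5)],
    "smart-grep": [(20, 0.60, 196, 2.0), (20, 0.81, 224, 1.0),
                   (10, 0.63, 1058, 2.0), (10, 0.55, 2488, 63.0)],
    "sverklo":    [(20, 0.75, 283, 1.0), (20, 0.56, 157, 1.0),
                   (10, 0.86, 74, 1.0), (10, 0.02, 579, 1.0)],
}


def weighted_mean(rows, column):
    """Task-count-weighted mean of one column across the four categories."""
    total_n = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total_n


overall = {
    name: {
        "f1": round(weighted_mean(rows, 1), 2),
        "tokens": weighted_mean(rows, 2),
        "tools": round(weighted_mean(rows, 3), 1),
    }
    for name, rows in per_category.items()
}
for name, stats in overall.items():
    print(name, stats)
```

The recomputed F1 and tool-call figures match the headline table. Sverklo's token mean comes out at 255.5 against the reported 255, which is consistent with the per-category values themselves being rounded.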

Where sverklo wins (full list)

Task	Category	sverklo F1	best grep F1	sverklo tok	best grep tok
`express/ex-p1-02`	P1	1.00	0.00	769	10615
`express/ex-p1-03`	P1	1.00	0.00	692	6844
`express/ex-p1-09`	P1	1.00	0.00	128	5920
`sverklo/sv-p4-05`	P4	1.00	0.50	50	874
`express/ex-p4-04`	P4	1.00	0.50	36	3781
`sverklo/sv-p4-04`	P4	1.00	0.67	42	928
`express/ex-p4-05`	P4	1.00	0.68	41	1316
`express/ex-p4-02`	P4	0.90	0.68	79	1345
`sverklo/sv-p4-02`	P4	0.86	0.71	40	334
`sverklo/sv-p4-03`	P4	0.86	0.75	59	754
`sverklo/sv-p4-01`	P4	0.80	0.69	232	1373

Where sverklo loses (the honesty section)

If you skip this section, you're doing benchmark cherry-picking. We're not.

Task	Category	sverklo F1	best grep F1	sverklo tok	best grep tok	note
`express/ex-p5-01`	P5	0.00	1.00	535	0	missed
`express/ex-p5-02`	P5	0.00	1.00	535	0	missed
`express/ex-p5-03`	P5	0.00	1.00	535	0	missed
`express/ex-p2-04`	P2	0.00	1.00	30	49	missed
`sverklo/sv-p2-04`	P2	0.50	1.00	58	67	vs smart-grep
`sverklo/sv-p2-06`	P2	0.40	0.83	137	205	vs smart-grep
`express/ex-p2-01`	P2	0.27	0.63	442	701	vs smart-grep

The dead-code (P5) miss is structural — sverklo's reference graph doesn't catch dynamic invocations and deserialization-driven calls. The reference-finding (P2) gap is closer; smart-grep's regex variants happen to match a few cases sverklo's symbol resolution doesn't.

What this benchmark does NOT measure

Real coding-task latency. These are 60 retrieval primitives, not 60 end-to-end agent runs. The token economy should translate to real-world savings, but this page does not measure that; a follow-up bench:swe (65 SWE-bench-style questions × 5 OSS repos) is on the public roadmap to do so.
Anything beyond Express + sverklo. The current slice is two TS/JS codebases. Coverage of Go, Python, and Rust is the next planned dataset extension.
Cross-repo retrieval. The bench is single-repo; sverklo's workspace and cross-repo features aren't exercised.
Cost of indexing. The 3690 ms cold-start figure is the one-off index build for sverklo. We list it as a separate column rather than amortizing it into wall time, so you can decide whether your usage pattern justifies it.
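One way to decide whether the index build is justified is a wall-time break-even count: how many queries it takes before the cold start plus indexed queries beats an unindexed baseline. The sketch below uses the averages from the All baselines table. It assumes the index is built once and reused, and it ignores token cost and re-indexing after edits; the function name is ours.

```python
import math

# Per-task averages from the "All baselines" table, in milliseconds.
SVERKLO_COLD_MS = 3690    # one-off index build
SVERKLO_WALL_MS = 1       # per query, once the index exists
SMART_GREP_WALL_MS = 215  # per query, no index
NAIVE_GREP_WALL_MS = 1302 # per query, no index


def break_even_queries(cold_ms: float, indexed_ms: float, unindexed_ms: float) -> int:
    """Smallest query count N with cold_ms + N * indexed_ms < N * unindexed_ms."""
    saving_per_query = unindexed_ms - indexed_ms
    if saving_per_query <= 0:
        raise ValueError("indexed queries are not faster; the index never pays off")
    return math.floor(cold_ms / saving_per_query) + 1


vs_smart = break_even_queries(SVERKLO_COLD_MS, SVERKLO_WALL_MS, SMART_GREP_WALL_MS)
vs_naive = break_even_queries(SVERKLO_COLD_MS, SVERKLO_WALL_MS, NAIVE_GREP_WALL_MS)
print(vs_smart, vs_naive)  # 18 3
```

On these averages the index pays for itself in wall time after 18 queries against smart-grep and after 3 against naive grep. The per-query figures are rounded means, so treat the counts as rough.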

Methodology

Tasks: 60 hand-verified retrieval primitives across expressjs/express and sverklo/sverklo, distributed across P1 (definition lookup, n=20), P2 (reference finding, n=20), P4 (file dependencies, n=10), P5 (dead-code detection, n=10).
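Metrics: F1, recall, precision, total input tokens, tool-call count, wall time. tokens_per_correct_answer = input_tokens / max(recall, 0.01); lower is better, and the 0.01 floor keeps the ratio finite when recall is zero. The gated tok/correct column averages this only over runs where F1 ≥ 0.8 (the n in parentheses is the number of such runs), so a baseline is not rewarded for "found nothing cheaply".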
Tolerances: P1 uses ±3-line tolerance, P2 uses ±2 lines, P4/P5 use set membership.
naive-grep floor: grep -rn <sym> . then read top 10 files in full.
smart-grep: language filters, ±10-line context reads, definition-shaped patterns. The strong baseline.
sverklo: spawns the MCP stdio server once per dataset; cold-start is the index build.
Run environment: Apple Silicon laptop, Node 25, sverklo running with --expose-gc, ONNX all-MiniLM-L6-v2 (~90 MB on disk).
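The scoring rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness's code: the function names are hypothetical, a located result is assumed to be a (file, line) pair, and a line match is assumed to require the same file with a line number within the tolerance.

```python
from __future__ import annotations

from statistics import mean


def prf(predicted: set, expected: set) -> tuple[float, float, float]:
    """Set-membership precision, recall and F1 (the P4/P5 style of scoring)."""
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def line_match(pred: tuple[str, int], gold: tuple[str, int], tol: int) -> bool:
    """Assumed rule: same file, line within +/- tol (3 for P1, 2 for P2)."""
    return pred[0] == gold[0] and abs(pred[1] - gold[1]) <= tol


def tokens_per_correct(input_tokens: int, recall: float) -> float:
    """input_tokens / max(recall, 0.01), as defined under Metrics."""
    return input_tokens / max(recall, 0.01)


def gated_tokens_per_correct(runs: list[dict], threshold: float = 0.8):
    """Mean tokens-per-correct over runs with F1 >= threshold, plus the run count."""
    kept = [tokens_per_correct(r["tokens"], r["recall"])
            for r in runs if r["f1"] >= threshold]
    return (mean(kept), len(kept)) if kept else (None, 0)


# Made-up runs: one perfect and cheap, one that found nothing even more cheaply.
runs = [
    {"tokens": 50, "recall": 1.0, "f1": 1.0},
    {"tokens": 30, "recall": 0.0, "f1": 0.0},
]
print(gated_tokens_per_correct(runs))  # (50.0, 1)
```

The second run is cheaper in raw tokens but is excluded by the F1 gate, which is the "found nothing cheaply" case the gated column is designed to ignore.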

Reproducing this

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install
npm run build
npm run bench:primitives

Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors this page's tables. Disagreements with our numbers are useful — file an issue with your machine spec and the run timestamp.

Performance benchmarks (separate)

This page is the retrieval benchmark. We also publish performance numbers (cold index time, search latency, impact-analysis time) on five real OSS codebases at /benchmarks/. Both are reproducible from BENCHMARKS.md.

Cite this

If you reference this benchmark in academic work or comparison material:

@misc{sverklo_bench_primitives_2026,
  title  = {Sverklo bench:primitives — a 60-task retrieval evaluation for AI coding agents},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/bench/}
}

Get started

If the token-economy numbers look interesting:

npm install -g sverklo
cd your-project
sverklo init

sverklo init auto-detects which AI coding agents you have installed (Claude Code, Cursor, Windsurf, Zed, Antigravity) and writes the right MCP config files.