bench:primitives — 60-task code retrieval evaluation

Three retrieval baselines, two real OSS codebases, 60 hand-verified tasks. Reproducible.

Run: 2026-04-07T23:07:14Z · sverklo v0.2.11 at time of run · 60 tasks × 3 baselines = 180 runs · harness on GitHub
Headline. On 60 verified tasks across expressjs/express and sverklo/sverklo: sverklo achieves F1 0.58 with 255 average input tokens and 1.0 tool calls per task; smart-grep (a tuned grep with language filters and definition-shaped patterns) achieves F1 0.67 with 731 tokens and 11.8 tool calls; naive grep (the floor: `grep -rn <sym> .`, then read the top 10 files) achieves F1 0.35 with 15,814 tokens and 7.6 tool calls.
- 62× fewer tokens than naive grep (255 vs 15,814)
- 2.9× fewer tokens than tuned grep (255 vs 731)
- 1.0 tool call per task (grep needs 7.6 to 11.8)

All baselines

| baseline | n | F1 | recall | precision | tokens | tools | wall (ms) | cold (ms) | gated tok/correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| naive-grep | 60 | 0.35 | 0.56 | 0.29 | 15,814 | 7.6 | 1,302 | 0 | 3,557 (n=10) |
| smart-grep | 60 | 0.67 | 0.81 | 0.62 | 731 | 11.8 | 215 | 0 | 165 (n=28) |
| sverklo | 60 | 0.58 | 0.73 | 0.57 | 255 | 1.0 | 1 | 3,690 | 203 (n=25) |

Read this carefully: smart-grep is a strong baseline. A tuned grep with language filters and definition-shaped patterns has higher F1 (0.67 vs 0.58) on this 60-task slice. Sverklo wins on token economy and tool-call count by a large margin (62× fewer tokens than naive grep, 2.9× fewer than smart-grep, single tool call vs 7-12). For an AI agent with a 200K token context window, that's the load-bearing axis. For a human standing at a terminal with `rg`, smart-grep is fine.
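For readers who want the shape of the baselines without digging into the harness, here is a rough sketch in TypeScript. The harness's real flags, patterns, and scoring rules live in the benchmark repo; the function names and regexes below are illustrative assumptions, not the harness code.

```ts
// Rough shape of the two grep baselines (illustrative, not the harness code).
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";
import { join } from "node:path";

// naive-grep: one broad recursive search, then read the top 10 matching files
// whole. Reading whole files is where the ~15k average input tokens come from.
// (No-match handling is omitted: grep exits non-zero and execFileSync throws.)
function naiveGrep(symbol: string, repo: string): string {
  const out = execFileSync("grep", ["-rn", symbol, "."], { cwd: repo, encoding: "utf8" });
  // take the files behind the matches, de-duplicated, and read the top 10 whole
  const files = [...new Set(out.trim().split("\n").map((l) => l.split(":")[0]))].slice(0, 10);
  return files.map((f) => readFileSync(join(repo, f), "utf8")).join("\n");
}

// smart-grep: language filters plus definition-shaped patterns, returning only
// the matching lines with a little context: a few hundred tokens per task, but
// typically several calls per task as the pattern gets refined.
function smartGrep(symbol: string, repo: string): string {
  const pattern =
    `(function\\s+${symbol}\\b|class\\s+${symbol}\\b|` +
    `(const|let|var)\\s+${symbol}\\s*=|exports\\.${symbol}\\s*=)`;
  return execFileSync(
    "grep",
    ["-rnE", "--include=*.js", "--include=*.ts", "-C", "2", pattern, "."],
    { cwd: repo, encoding: "utf8" },
  );
}
```

The difference in what gets returned (whole files vs matching lines) is where the 15,814-vs-731 token gap comes from.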

Per-category breakdown

P1 — Definition lookup (n=20)

| baseline | F1 | tokens | wall (ms) | tools |
| --- | --- | --- | --- | --- |
| naive-grep | 0.15 | 23,337 | 339 | 8.1 |
| smart-grep | 0.60 | 196 | 51 | 2.0 |
| sverklo | 0.75 | 283 | 0 | 1.0 |

Sverklo wins. Single tool call (sverklo_lookup) vs 8 grep iterations.
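For context, this is roughly what the single tool call looks like from the agent side over MCP. The @modelcontextprotocol/sdk calls are real, but how the sverklo server is launched and the argument shape for sverklo_lookup are assumptions here (the MCP config written by sverklo init is authoritative), and the symbol is just an illustrative Express identifier.

```ts
// Agent-side sketch of the one-call flow. Launch command, subcommand, and the
// sverklo_lookup argument shape are assumptions, not sverklo's documented API.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "sverklo",   // assumption: the CLI can run as an MCP server
  args: ["mcp"],        // assumption: via an `mcp` subcommand
});
const client = new Client({ name: "bench-harness", version: "0.0.0" });
await client.connect(transport);

const result = await client.callTool({
  name: "sverklo_lookup",
  arguments: { symbol: "createApplication" }, // assumed argument shape
});
// a single round trip; the result is what lands in the agent's context
// (~283 tokens on average for P1 tasks in the table above)
console.log(result.content);
```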

P2 — Reference finding (n=20)

| baseline | F1 | tokens | wall (ms) | tools |
| --- | --- | --- | --- | --- |
| naive-grep | 0.39 | 21,925 | 345 | 7.0 |
| smart-grep | 0.81 | 224 | 17 | 1.0 |
| sverklo | 0.56 | 157 | 0 | 1.0 |

Smart-grep wins. Reference finding on Express/sverklo turns out to be a regex problem grep handles well; sverklo's symbol-graph helps less than we'd hoped on this slice. Token economy still favours sverklo.

P4 — File dependencies (n=10)

| baseline | F1 | tokens | wall (ms) | tools |
| --- | --- | --- | --- | --- |
| naive-grep | 0.51 | 2,918 | 280 | 2.0 |
| smart-grep | 0.63 | 1,058 | 16 | 2.0 |
| sverklo | 0.86 | 74 | 0 | 1.0 |

Sverklo wins decisively. sverklo_deps against the indexed import graph is what graph-based retrieval is supposed to do.
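To make the graph idea concrete, here is a minimal sketch of an import index: scan files once, record import edges, and answer a file-dependency question as a traversal instead of repeated greps. This is not sverklo's implementation; the regex scan and the crude module resolution are simplifying assumptions.

```ts
// Minimal import-graph index (sketch, not sverklo's implementation).
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join, dirname, resolve } from "node:path";

type Graph = Map<string, Set<string>>; // absolute file path -> files it imports

function listFiles(dir: string, out: string[] = []): string[] {
  for (const name of readdirSync(dir)) {
    if (name === "node_modules" || name === ".git") continue;
    const p = join(dir, name);
    if (statSync(p).isDirectory()) listFiles(p, out);
    else if (/\.(ts|js|mjs|cjs)$/.test(p)) out.push(p);
  }
  return out;
}

// Index step: done once, paid as cold-start time.
function buildGraph(root: string): Graph {
  const files = listFiles(resolve(root));
  const known = new Set(files);
  const graph: Graph = new Map();
  const importRe =
    /from\s+['"](\.{1,2}\/[^'"]+)['"]|require\(\s*['"](\.{1,2}\/[^'"]+)['"]\s*\)/g;
  for (const file of files) {
    const edges = new Set<string>();
    for (const m of readFileSync(file, "utf8").matchAll(importRe)) {
      const spec = resolve(dirname(file), (m[1] ?? m[2])!);
      // crude resolution: try the bare specifier plus common extensions
      for (const ext of ["", ".ts", ".js", "/index.ts", "/index.js"]) {
        if (known.has(spec + ext)) { edges.add(spec + ext); break; }
      }
    }
    graph.set(file, edges);
  }
  return graph;
}

// Query step: "what does this file depend on, transitively?" is one cheap BFS.
function dependenciesOf(graph: Graph, entry: string): string[] {
  const seen = new Set<string>([entry]);
  const queue = [entry];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const dep of graph.get(current) ?? []) {
      if (!seen.has(dep)) { seen.add(dep); queue.push(dep); }
    }
  }
  seen.delete(entry);
  return [...seen];
}
```

The expensive scan happens once at index time; each query afterwards touches only the graph, which is the economics the P4 numbers reflect.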

P5 — Dead code (n=10)

| baseline | F1 | tokens | wall (ms) | tools |
| --- | --- | --- | --- | --- |
| naive-grep | 0.50 | 1,442 | 6,164 | 13.5 |
| smart-grep | 0.55 | 2,488 | 1,138 | 63.0 |
| sverklo | 0.02 | 579 | 3 | 1.0 |

Sverklo loses badly here. The current sverklo_refs doesn't catch dynamic invocations and deserialization-driven calls that smart-grep finds via aggressive whole-file reads. P5 is the next slice we plan to fix.
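To illustrate the failure mode with a hypothetical example (not a task from the suite): both call patterns below invoke purgeCache, but neither contains a resolvable reference to the identifier, so a static reference graph reports the function as unused.

```ts
// Hypothetical illustration of call sites a static reference graph misses.
const jobs: Record<string, () => void> = {
  purgeCache: () => { /* ... */ },
};

// 1. Dynamic invocation: the callee name is data, not an identifier.
function runJob(name: string) {
  jobs[name](); // no direct reference to `purgeCache` here
}

// 2. Deserialization-driven call: the name arrives from outside the program.
const payload = JSON.parse('{"job": "purgeCache"}');
runJob(payload.job);
```

A whole-file read still contains the string purgeCache near the registration site and in the serialized payload, which is roughly why the read-everything grep strategies recover these cases.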

Where sverklo wins (full list)

| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok |
| --- | --- | --- | --- | --- | --- |
| express/ex-p1-02 | P1 | 1.00 | 0.00 | 769 | 10,615 |
| express/ex-p1-03 | P1 | 1.00 | 0.00 | 692 | 6,844 |
| express/ex-p1-09 | P1 | 1.00 | 0.00 | 128 | 5,920 |
| sverklo/sv-p4-05 | P4 | 1.00 | 0.50 | 50 | 874 |
| express/ex-p4-04 | P4 | 1.00 | 0.50 | 36 | 3,781 |
| sverklo/sv-p4-04 | P4 | 1.00 | 0.67 | 42 | 928 |
| express/ex-p4-05 | P4 | 1.00 | 0.68 | 41 | 1,316 |
| express/ex-p4-02 | P4 | 0.90 | 0.68 | 79 | 1,345 |
| sverklo/sv-p4-02 | P4 | 0.86 | 0.71 | 40 | 334 |
| sverklo/sv-p4-03 | P4 | 0.86 | 0.75 | 59 | 754 |
| sverklo/sv-p4-01 | P4 | 0.80 | 0.69 | 232 | 1,373 |

Where sverklo loses (the honesty section)

If you skip this section, you're doing benchmark cherry-picking. We're not.

| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok | note |
| --- | --- | --- | --- | --- | --- | --- |
| express/ex-p5-01 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p5-02 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p5-03 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p2-04 | P2 | 0.00 | 1.00 | 30 | 49 | missed |
| sverklo/sv-p2-04 | P2 | 0.50 | 1.00 | 58 | 67 | vs smart-grep |
| sverklo/sv-p2-06 | P2 | 0.40 | 0.83 | 137 | 205 | vs smart-grep |
| express/ex-p2-01 | P2 | 0.27 | 0.63 | 442 | 701 | vs smart-grep |

The dead-code (P5) miss is structural — sverklo's reference graph doesn't catch dynamic invocations and deserialization-driven calls. The reference-finding (P2) gap is closer; smart-grep's regex variants happen to match a few cases sverklo's symbol resolution doesn't.

What this benchmark does NOT measure

Methodology

Reproducing this

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install
npm run build
npm run bench:primitives

Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors this page's tables. Disagreements with our numbers are useful — file an issue with your machine spec and the run timestamp.
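If you want to recompute the per-category aggregates from a run yourself, a script along these lines works; the field names (baseline, category, f1, input_tokens) are assumptions about raw.jsonl's schema, so check a line of your own output and adjust.

```ts
// Recompute per-baseline, per-category averages from raw.jsonl (field names assumed).
import { readFileSync } from "node:fs";

const file = process.argv[2]!; // e.g. benchmark/results/<timestamp>/raw.jsonl
const rows = readFileSync(file, "utf8").trim().split("\n").map((l) => JSON.parse(l));

const agg = new Map<string, { f1: number; tokens: number; n: number }>();
for (const r of rows) {
  const key = `${r.baseline}/${r.category}`;
  const a = agg.get(key) ?? { f1: 0, tokens: 0, n: 0 };
  a.f1 += r.f1;
  a.tokens += r.input_tokens;
  a.n += 1;
  agg.set(key, a);
}

for (const [key, a] of agg) {
  console.log(key, "F1", (a.f1 / a.n).toFixed(2), "tokens", Math.round(a.tokens / a.n));
}
```

The numbers printed this way should match the per-category tables above; if they don't, that is exactly the kind of disagreement worth filing.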

Performance benchmarks (separate)

This page is the retrieval benchmark. We also publish performance numbers (cold index time, search latency, impact-analysis time) on five real OSS codebases at /benchmarks/. Both are reproducible from BENCHMARKS.md.

Cite this

If you reference this benchmark in academic work or comparison material:

@misc{sverklo_bench_primitives_2026,
  title  = {Sverklo bench:primitives — a 60-task retrieval evaluation for AI coding agents},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/bench/}
}

Get started

If the token-economy numbers look interesting:

npm install -g sverklo
cd your-project
sverklo init

`sverklo init` auto-detects which AI coding agents you have installed (Claude Code, Cursor, Windsurf, Zed, Antigravity) and writes the right MCP config files.