bench:primitives — 180-task code retrieval evaluation

Five retrieval baselines, six real OSS codebases (express, lodash, sverklo, requests, flask, fastapi), 180 hand-verified tasks. Reproducible. The naive baseline is what most agents do today — blind grep, ~22K tokens of unranked regex hits per task. See also: every open loss we still track — published because a bench that only shows wins is marketing, not measurement.

April 2026 run: 2026-04-07T23:07:14Z · 3 baselines (naive-grep, smart-grep, sverklo)
May 2026 update: 2026-05-02T23:35:52Z · +2 baselines (jcodemunch-mcp, GitNexus) per #25 · harness on GitHub · methodology + ground truth: sverklo/sverklo-bench
How to read this with the homepage comparison: the same-task story is not "Sverklo makes the agent smarter." It is "same agent, same task, better repo evidence before the edit." This benchmark measures retrieval primitives behind that evidence: definitions, references, file dependencies, and dead-code signals. It does not measure full end-to-end coding quality, and the loss section below stays part of the proof.
Historical May 2/3 bench-loop updates
May 2026 update — issue #25

Tom Hale (@HaleTom) asked if sverklo would benchmark itself against jcodemunch-mcp and GitNexus — two direct competitors in the local-first MCP code-intel space. The answer: yes, both are on the bench now.

Real findings, not the marketing version:
  • smart-grep ties sverklo on overall F1. A tuned grep with language filters is a stronger baseline than the literature in this space usually admits. The differentiation is on token economy and tool-call count, not raw F1.
  • jcodemunch wins P1 (definition lookup) at 0.65 vs sverklo's 0.45. Their tree-sitter symbol indexing is sharp; we should learn from it.
  • Both jcodemunch and gitnexus return ~0 on P2 — by design. Both expose import-graph references rather than call-site references. That's a legitimate design choice, not a bug — flagged in a public exchange with @jgravelle on issue #25. Our P2 task as defined ("find every caller of X") assumes call-site semantics, which is the load-bearing axis for refactor blast-radius. If your workflow is "every file that imports module Y," import-graph-only is the right substrate and these tools win that subtask. Different retrieval models, different jobs.
  • Sverklo dominates P4 (file dependencies) at 0.84 vs next-best smart-grep at 0.40. Symbol graph + PageRank pays off here — a 44-point gap on the bench's most graph-shaped category.
  • Sverklo's token economy holds: 386 average input tokens vs jcodemunch's 5,351 and naive-grep's 17,169.
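The call-site versus import-graph distinction in the P2 bullet above can be made concrete with a toy example. This is a minimal sketch under stated assumptions: the three-file repo and both helper functions are hypothetical and say nothing about how sverklo, jcodemunch, or GitNexus resolve references internally.

```typescript
// Toy illustration only: file names, contents, and helpers are hypothetical.
const toyRepo: Record<string, string> = {
  "a.js": "const { parse } = require('./util');\nmodule.exports = (s) => parse(s);",
  "b.js": "const util = require('./util');\nmodule.exports = () => 'no call here';",
  "c.js": "const { parse } = require('./util');\nconst x = parse('1');\nconst y = parse('2');",
};

// Import-graph references: every file that requires the module at all.
function importGraphRefs(repo: Record<string, string>, modulePath: string): string[] {
  return Object.keys(repo)
    .filter((file) => (repo[file] ?? "").indexOf(`require('${modulePath}')`) !== -1)
    .sort();
}

// Call-site references: every "file:line" where the symbol is actually called.
function callSiteRefs(repo: Record<string, string>, symbol: string): string[] {
  const callPattern = new RegExp(`\\b${symbol}\\s*\\(`);
  const hits: string[] = [];
  Object.keys(repo).sort().forEach((file) => {
    (repo[file] ?? "").split("\n").forEach((line, index) => {
      if (callPattern.test(line)) hits.push(`${file}:${index + 1}`);
    });
  });
  return hits;
}

console.log(importGraphRefs(toyRepo, "./util")); // files that import the module
console.log(callSiteRefs(toyRepo, "parse"));     // lines that call the function
```

In this toy repo b.js counts as a reference under import-graph semantics but contributes nothing under call-site semantics, which is the gap a "find every caller of X" task scores.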
May 3, 2026 follow-on — bench-as-feedback-loop

Within hours of this update going live, @jgravelle shipped jcodemunch-mcp v1.80.7, then v1.80.8, then v1.80.9 — three releases addressing specific findings from this bench.

Confirmed on rerun (v1.80.8):
  • Avg input tokens: 5,351 → 1,388 (−74%). The token-bloat fix landed cleanly.
  • Express P5 recall: 0.00 → 1.00. The CommonJS module.exports re-export blind spot for createApplication is closed.
  • P1 unchanged at 0.65 (still leading on definition lookup). P2 unchanged (acknowledged design choice — see above).
Methodology gaps now resolved:
  • #27 — sv-p5 expected set refined: 6 confirmed-dead exported functions, methodology documented (commit 407359a).
  • #26 — Lodash 4.17.21 added as third dataset, 30 new tasks (10 P1 + 10 P2 + 5 P4 + 5 P5), commit 5fba805.
Updated 3-dataset numbers are below. The April-only table is preserved further down for historical reference.
May 13, 2026 — full 180-task rerun (sverklo v0.20.21, jcodemunch v1.81.1)

Two bench-loop iterations landed since the last published table. Full 180-task numbers (5 baselines × 6 datasets) are in the table below. Raw run: 2026-05-13T18-32-20-478Z.
May 13, 2026 — headline (180-task, 5 baselines, 6 datasets)

baseline            n    F1    P1    P2    P4    P5    avg tokens  tools/task  warm (ms)
naive-grep          180  0.25  0.07  0.11  0.35  0.83  22,704      6.3         2,456
smart-grep          180  0.34  0.20  0.20  0.40  0.83  714         3.2         1,130
sverklo v0.20.21    180  0.58  0.63  0.27  0.84  0.83  652         1.0         64
jcodemunch v1.81.1  180  0.29  0.52  0.01  0.33  0.34  1,907       1.2         18
gitnexus            180  0.30  0.35  0.00  0.27  0.83  630         1.2         718

jcodemunch P5 recall = 1.00, precision = 0.34, tokens = 10,172. The F1 is precision-bound (full recall + verbose list of candidates). With max_results=100 the prior baseline scored lower; the refresh per sverklo-bench#3 reveals jcodemunch's actual P5 ceiling.

May 4, 2026 PM — sverklo v0.20.2: lodash P1 recovered

Adding lodash to the bench (#26) exposed a blind spot in sverklo's own parser: findBraceEnd was naive character counting, so a { inside a string literal at lodash.js:6301 made every subsequent function declaration get absorbed into one ~11K-line chunk. Public methods (map, filter, reduce, etc.) never got their own chunks.

v0.20.2 (deedec2) ships two fixes: a string/regex/comment-aware brace counter, and exact-match priority in the lookup tool. Result on the same 90-task bench: sverklo's P1 moves from 0.30 to 0.73 and its overall F1 from 0.45 to 0.56 (see the pre-fix and post-fix tables below). Both jcodemunch and sverklo shipped lodash P1 fixes within 36 hours of the original benchmark publication. That's what a public peer-reviewable benchmark is supposed to do.
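The failure mode described above is easy to reproduce in miniature. The following is a hypothetical sketch, not sverklo's actual findBraceEnd: the function names and the sample source are made up for illustration, and it handles only string literals and comments (regex literals and `${...}` nesting inside template literals are deliberately left out).

```typescript
// String- and comment-aware brace matching. Hypothetical sketch only.
function findMatchingBrace(source: string, openIndex: number): number {
  let depth = 0;
  let i = openIndex;
  while (i < source.length) {
    const ch = source.charAt(i);
    const next = source.charAt(i + 1);
    if (ch === "/" && next === "/") {
      // Line comment: skip to the end of the line.
      const eol = source.indexOf("\n", i);
      if (eol === -1) return -1;
      i = eol + 1;
      continue;
    }
    if (ch === "/" && next === "*") {
      // Block comment: skip past the closing */.
      const end = source.indexOf("*/", i + 2);
      if (end === -1) return -1;
      i = end + 2;
      continue;
    }
    if (ch === '"' || ch === "'" || ch === "`") {
      // String literal: skip to the matching unescaped quote.
      i += 1;
      while (i < source.length && source.charAt(i) !== ch) {
        if (source.charAt(i) === "\\") i += 1; // skip the escaped character
        i += 1;
      }
      i += 1; // step past the closing quote
      continue;
    }
    if (ch === "{") depth += 1;
    if (ch === "}") {
      depth -= 1;
      if (depth === 0) return i;
    }
    i += 1;
  }
  return -1;
}

// Naive character counting, for contrast.
function naiveBraceEnd(source: string, openIndex: number): number {
  let depth = 0;
  for (let i = openIndex; i < source.length; i += 1) {
    const ch = source.charAt(i);
    if (ch === "{") depth += 1;
    if (ch === "}") {
      depth -= 1;
      if (depth === 0) return i;
    }
  }
  return -1;
}

const sampleSource = 'function a() { return "{"; }\nfunction b() { return 1; }';
const openAt = sampleSource.indexOf("{");
console.log(findMatchingBrace(sampleSource, openAt)); // index of the brace closing a()
console.log(naiveBraceEnd(sampleSource, openAt));     // -1: the count never returns to zero
```

On this sample the naive counter treats the quoted `{` as an opening brace, so the depth never returns to zero and function b is swallowed into the same chunk, which mirrors the lodash symptom on a small scale.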

May 4, 2026 PM — post-fix table (sverklo v0.20.2)

baseline            n   F1    P1    P2    P4    P5    avg tokens  tools/task  tok/correct (gated)
naive-grep          90  0.29  0.10  0.18  0.53  0.67  20,278      6.5         2,403
smart-grep          90  0.49  0.43  0.40  0.59  0.67  1,220       4.9         219
sverklo v0.20.2     90  0.56  0.73  0.25  0.71  0.67  469         1.0         449
jcodemunch v1.80.9  90  0.32  0.73  0.00  0.46  0.00  1,267       1.2         625
gitnexus            90  0.25  0.27  0.00  0.30  0.67  372         1.2         207

Note: sverklo numbers are from the post-v0.20.2 single-baseline rerun (2026-05-04T19-38-11-592Z); other baselines are from the morning 5-baseline run (2026-05-04T14-13-23-716Z) and didn't change between runs (no commits affecting them). Reproducible from a fresh clone with npm run bench:quick; full numbers should match within run-to-run variance.

What changed:

How to reproduce: npm install -g sverklo@0.20.2 && cd /path/to/repo && npm run bench:quick. Bench harness lives at github.com/sverklo/sverklo/tree/main/benchmark. Issues #26, #27, #28 document the methodology iterations that produced this run.

May 4, 2026 AM — pre-fix table (sverklo v0.20.1, kept for diff)

baseline            n   F1    P1    P2    P4    P5    avg tokens  tools/task  tok/correct (gated)
naive-grep          90  0.29  0.10  0.18  0.53  0.67  20,278      6.5         2,403
smart-grep          90  0.49  0.43  0.40  0.59  0.67  1,220       4.9         219
sverklo v0.20.1     90  0.45  0.30  0.34  0.76  0.67  449         1.0         337
jcodemunch v1.80.9  90  0.32  0.73  0.00  0.46  0.00  1,267       1.2         625
gitnexus            90  0.25  0.27  0.00  0.30  0.67  372         1.2         207

Pre-v0.20.2 numbers preserved here so the diff between the two runs is auditable. Raw data at benchmark/results/2026-05-04T14-13-23-716Z/.

May 2026 — All 5 baselines

baseline    n   F1     P1    P2    P4    P5    avg tokens  cold (ms)  warm (ms)
naive-grep  60  0.290  0.15  0.26  0.43  0.50  17,169      0          4,779
smart-grep  60  0.450  0.40  0.46  0.49  0.50  1,216       0          2,258
sverklo     60  0.449  0.45  0.27  0.75  0.50  386         1,159      38
jcodemunch  60  0.281  0.65  0.00  0.38  0.00  5,351       718        13
gitnexus    60  0.260  0.40  0.01  0.25  0.50  543         452        584

How to read this: No single baseline dominates. Different tools win different categories. The story isn't "sverklo beats everything" — it's "different retrieval substrates have different strengths, and the load-bearing axis depends on what you're optimizing for." Token economy + P4 are sverklo's clearest wins; P1 goes to jcodemunch; P2 to smart-grep.
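The "different tools win different categories" reading can be checked mechanically. As a small sketch, the rows below copy the F1 and per-category scores from the May 2026 60-task table; the type and function names are illustrative, not part of the harness.

```typescript
type Category = "f1" | "p1" | "p2" | "p4" | "p5";
type ScoreRow = { baseline: string } & Record<Category, number>;

// Scores copied from the May 2026 60-task table.
const may60Rows: ScoreRow[] = [
  { baseline: "naive-grep", f1: 0.29, p1: 0.15, p2: 0.26, p4: 0.43, p5: 0.5 },
  { baseline: "smart-grep", f1: 0.45, p1: 0.4, p2: 0.46, p4: 0.49, p5: 0.5 },
  { baseline: "sverklo", f1: 0.449, p1: 0.45, p2: 0.27, p4: 0.75, p5: 0.5 },
  { baseline: "jcodemunch", f1: 0.281, p1: 0.65, p2: 0.0, p4: 0.38, p5: 0.0 },
  { baseline: "gitnexus", f1: 0.26, p1: 0.4, p2: 0.01, p4: 0.25, p5: 0.5 },
];

// Return every baseline that reaches the top score in a column (ties included).
function categoryWinners(rows: ScoreRow[], category: Category): string[] {
  const best = Math.max(...rows.map((row) => row[category]));
  return rows.filter((row) => row[category] === best).map((row) => row.baseline);
}

const scoreColumns: Category[] = ["f1", "p1", "p2", "p4", "p5"];
scoreColumns.forEach((category) => {
  console.log(category, categoryWinners(may60Rows, category).join(", "));
});
```

On these numbers P1 goes to jcodemunch, P2 to smart-grep, P4 to sverklo, and P5 is a four-way tie at 0.50, with overall F1 separated by only 0.001 between smart-grep and sverklo.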

Honest false positives we filed back to upstream:

Reproducer (requires uvx for jcodemunch and npm i -g gitnexus on PATH):

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build

# All 5 baselines
npm run bench:quick

# Single baseline
BASELINES=jcodemunch npm run bench:quick
BASELINES=gitnexus   npm run bench:quick
BASELINES=sverklo    npm run bench:quick

Original April 2026 run: 60 tasks, 3 baselines (naive-grep, smart-grep, sverklo).

Original numbers from the first public bench run (2026-04-07). The April run used a slightly older harness version; the May 4 PM run above is the canonical current data. Both are kept on this page so the reader can audit drift.

Headline (April 2026). On 60 verified tasks across expressjs/express and sverklo/sverklo: sverklo achieves F1 0.58 with 255 average input tokens and 1.0 tool calls; smart-grep (a tuned grep with language filters and definition-shaped patterns) achieves F1 0.67 with 731 tokens and 11.8 tool calls; naive grep (the floor — grep -rn <sym> . then read top 10 files) achieves F1 0.35 with 15,814 tokens and 7.6 tool calls.
15,814 → 255 tokens per task (−98% vs naive grep)
731 → 255 tokens per task (−65% vs tuned grep)
7-12 → 1 tool calls per task (−87% on average)
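The token percentages in these callouts follow directly from the headline averages. A minimal sketch of the arithmetic (the helper names are illustrative):

```typescript
// Percent reduction from `before` to `after`, rounded to a whole percent.
function reductionPercent(before: number, after: number): number {
  return Math.round((1 - after / before) * 100);
}

// How many times smaller `after` is than `before`, rounded to one decimal.
function foldSmaller(before: number, after: number): number {
  return Math.round((before / after) * 10) / 10;
}

console.log(reductionPercent(15814, 255)); // 98 (naive grep -> sverklo)
console.log(reductionPercent(731, 255));   // 65 (smart-grep -> sverklo)
console.log(foldSmaller(15814, 255));      // 62
console.log(foldSmaller(731, 255));        // 2.9
```

The same inputs give both the percentage form used in the callouts and the 62× and 2.9× fold form used in the reading note further down.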

All baselines

baseline    n   F1    recall  prec  tokens  tools  wall (ms)  cold (ms)  gated tok/correct
naive-grep  60  0.35  0.56    0.29  15814   7.6    1302       0          3557 (n=10)
smart-grep  60  0.67  0.81    0.62  731     11.8   215        0          165 (n=28)
sverklo     60  0.58  0.73    0.57  255     1.0    1          3690       203 (n=25)

Read this carefully: smart-grep is a strong baseline. A tuned grep with language filters and definition-shaped patterns has higher F1 (0.67 vs 0.58) on this 60-task slice. Sverklo wins on token economy and tool-call count by a large margin (62× fewer tokens than naive grep, 2.9× fewer than smart-grep, single tool call vs 7-12). For an AI agent with a 200K token context window, that's the load-bearing axis. For a human standing at a terminal with `rg`, smart-grep is fine.
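One detail worth knowing when reading the F1, recall, and prec columns together: the F1 values are not the harmonic mean of the averaged recall and precision (for smart-grep, 2·0.81·0.62/(0.81+0.62) ≈ 0.70, not 0.67). That is what you would expect if F1 is scored per task over sets of expected and returned items and then averaged. The sketch below assumes that reading; the task data and function names are hypothetical, and the harness source is the reference for the actual scoring.

```typescript
type TaskScore = { precision: number; recall: number; f1: number };

// Set-based precision, recall, and F1 for one task (assumed scoring scheme).
function scoreTask(expected: string[], returned: string[]): TaskScore {
  const expectedSet = new Set(expected);
  const returnedSet = new Set(returned);
  let truePositives = 0;
  returnedSet.forEach((item) => {
    if (expectedSet.has(item)) truePositives += 1;
  });
  const precision = returnedSet.size === 0 ? 0 : truePositives / returnedSet.size;
  const recall = expectedSet.size === 0 ? 0 : truePositives / expectedSet.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

function averageOf(values: number[]): number {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Two hypothetical tasks: one perfect answer, one noisy partial answer.
const taskScores: TaskScore[] = [
  scoreTask(["a.js:10", "b.js:4"], ["a.js:10", "b.js:4"]),
  scoreTask(["c.js:7", "d.js:2"], ["c.js:7", "x.js:1", "y.js:9", "z.js:3"]),
];

const macroF1 = averageOf(taskScores.map((score) => score.f1));
const meanPrecision = averageOf(taskScores.map((score) => score.precision));
const meanRecall = averageOf(taskScores.map((score) => score.recall));
const harmonicOfMeans = (2 * meanPrecision * meanRecall) / (meanPrecision + meanRecall);

console.log(macroF1.toFixed(3));         // mean of per-task F1
console.log(harmonicOfMeans.toFixed(3)); // F1 of the averaged columns, a different number
```

The two aggregates diverge whenever per-task precision and recall vary, so comparing a reported F1 against a hand-computed harmonic mean of the other two columns will not match exactly.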

Per-category breakdown

P1 — Definition lookup (n=20)

baseline    F1    tokens  wall (ms)  tools
naive-grep  0.15  23337   339        8.1
smart-grep  0.60  196     51         2.0
sverklo     0.75  283     0          1.0

Sverklo wins. Single tool call (sverklo_lookup) vs 8 grep iterations.
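For context on the smart-grep row: that baseline is described as grep with language filters and definition-shaped patterns. The page does not list the patterns themselves, so the following is only a guess at what "definition-shaped" could look like for JavaScript and Python. Every pattern and helper name here is an assumption, not the harness's actual rule set.

```typescript
// Escape a symbol so it can be embedded in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical definition-shaped patterns for one symbol.
function definitionPatterns(symbol: string): RegExp[] {
  const s = escapeRegExp(symbol);
  return [
    new RegExp(`\\bfunction\\s+${s}\\s*\\(`),         // JS: function foo(
    new RegExp(`\\b(?:const|let|var)\\s+${s}\\s*=`),  // JS: const foo =
    new RegExp(`\\bclass\\s+${s}\\b`),                // JS or Python: class Foo
    new RegExp(`^\\s*def\\s+${s}\\s*\\(`),            // Python: def foo(
    new RegExp(`\\bexports\\.${s}\\s*=`),             // CommonJS: exports.foo =
  ];
}

// True when the line looks like a definition of the symbol, not a use of it.
function isDefinitionLine(line: string, symbol: string): boolean {
  return definitionPatterns(symbol).some((pattern) => pattern.test(line));
}

console.log(isDefinitionLine("function createApplication() {", "createApplication")); // true
console.log(isDefinitionLine("var app = createApplication();", "createApplication")); // false
console.log(isDefinitionLine("def get(url, params=None):", "get"));                   // true
```

Patterns of this kind cut most call sites out of the result set, which is one plausible reason a tuned grep needs far fewer tokens than the naive baseline on definition lookup.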

P2 — Reference finding (n=20)

baseline    F1    tokens  wall (ms)  tools
naive-grep  0.39  21925   345        7.0
smart-grep  0.81  224     17         1.0
sverklo     0.56  157     0          1.0

Smart-grep wins. Reference finding on Express/sverklo turns out to be a regex problem grep handles well; sverklo's symbol-graph helps less than we'd hoped on this slice. Token economy still favours sverklo.

P4 — File dependencies (n=10)

baseline    F1    tokens  wall (ms)  tools
naive-grep  0.51  2918    280        2.0
smart-grep  0.63  1058    16         2.0
sverklo     0.86  74      0          1.0

Sverklo wins decisively. sverklo_deps against the indexed import graph is what graph-based retrieval is supposed to do.
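To show why an indexed import graph suits this category, here is a minimal sketch of dependency queries over a prebuilt graph. The graph contents and function names are hypothetical; this is not the sverklo_deps implementation, only the general shape of the lookup.

```typescript
// Hypothetical import graph: file -> files it imports directly.
const toyImportGraph: Record<string, string[]> = {
  "main.js": ["app.js"],
  "app.js": ["router.js", "utils.js"],
  "router.js": ["utils.js"],
  "utils.js": [],
};

// Files that `file` imports directly.
function directImports(graph: Record<string, string[]>, file: string): string[] {
  return (graph[file] ?? []).slice().sort();
}

// Files that import `file` directly (reverse edge lookup).
function importersOf(graph: Record<string, string[]>, file: string): string[] {
  return Object.keys(graph)
    .filter((candidate) => (graph[candidate] ?? []).indexOf(file) !== -1)
    .sort();
}

// Everything reachable from `file` by following imports (breadth-first).
function transitiveImports(graph: Record<string, string[]>, file: string): string[] {
  const seen = new Set<string>();
  const queue: string[] = [file];
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === undefined) break;
    (graph[current] ?? []).forEach((dep) => {
      if (!seen.has(dep)) {
        seen.add(dep);
        queue.push(dep);
      }
    });
  }
  return Array.from(seen).sort();
}

console.log(directImports(toyImportGraph, "app.js"));      // direct dependencies
console.log(importersOf(toyImportGraph, "utils.js"));      // reverse dependencies
console.log(transitiveImports(toyImportGraph, "main.js")); // full downstream closure
```

Once the graph exists, each query is a dictionary lookup or a short traversal that returns only file names, which is consistent with the small token counts in the P4 table.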

P5 — Dead code (n=10)

baseline    F1    tokens  wall (ms)  tools
naive-grep  0.50  1442    6164       13.5
smart-grep  0.55  2488    1138       63.0
sverklo     0.02  579     3          1.0

Sverklo loses badly here. The current sverklo_refs doesn't catch dynamic invocations and deserialization-driven calls that smart-grep finds via aggressive whole-file reads. P5 is the next slice we plan to fix.
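The dynamic-invocation blind spot is easy to see on a toy module. In the sketch below, a function reached only through a string key looks dead to a matcher that recognises call syntax, while a plain text search still sees the name. The module lines and both detectors are hypothetical simplifications, not the logic of sverklo_refs or smart-grep.

```typescript
// Hypothetical module: saveUser is only ever invoked through a string key.
const toyModuleLines: string[] = [
  "function loadUser(id) { return id; }",
  "function saveUser(user) { return user; }",
  "function legacyExport() { return null; }",
  "loadUser(1);",
  "globalThis['saveUser']({});",
];
const declaredFunctions: string[] = ["loadUser", "saveUser", "legacyExport"];

function isDefinitionOf(line: string, fn: string): boolean {
  return line.startsWith(`function ${fn}(`);
}

// Dead = no `name(` call outside the definition line.
function deadByCallSyntax(lines: string[], names: string[]): string[] {
  return names.filter((fn) => {
    const callPattern = new RegExp(`\\b${fn}\\s*\\(`);
    return !lines.some((line) => !isDefinitionOf(line, fn) && callPattern.test(line));
  });
}

// Dead = the name never appears as text outside the definition line.
function deadByTextSearch(lines: string[], names: string[]): string[] {
  return names.filter(
    (fn) => !lines.some((line) => !isDefinitionOf(line, fn) && line.indexOf(fn) !== -1),
  );
}

console.log(deadByCallSyntax(toyModuleLines, declaredFunctions)); // flags saveUser too
console.log(deadByTextSearch(toyModuleLines, declaredFunctions)); // only legacyExport
```

The call-syntax detector wrongly reports saveUser as dead, which is the kind of error that costs precision on P5. Text search avoids it here, at the price of reading whole files.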

Where sverklo wins (full list)

Task              Category  sverklo F1  best grep F1  sverklo tok  best grep tok
express/ex-p1-02  P1        1.00        0.00          769          10615
express/ex-p1-03  P1        1.00        0.00          692          6844
express/ex-p1-09  P1        1.00        0.00          128          5920
sverklo/sv-p4-05  P4        1.00        0.50          50           874
express/ex-p4-04  P4        1.00        0.50          36           3781
sverklo/sv-p4-04  P4        1.00        0.67          42           928
express/ex-p4-05  P4        1.00        0.68          41           1316
express/ex-p4-02  P4        0.90        0.68          79           1345
sverklo/sv-p4-02  P4        0.86        0.71          40           334
sverklo/sv-p4-03  P4        0.86        0.75          59           754
sverklo/sv-p4-01  P4        0.80        0.69          232          1373

Where sverklo loses (the honesty section)

If you skip this section, you're doing benchmark cherry-picking. We're not.

Task              Category  sverklo F1  best grep F1  sverklo tok  best grep tok  note
express/ex-p5-01  P5        0.00        1.00          53           50             missed
express/ex-p5-02  P5        0.00        1.00          53           50             missed
express/ex-p5-03  P5        0.00        1.00          53           50             missed
express/ex-p2-04  P2        0.00        1.00          30           49             missed
sverklo/sv-p2-04  P2        0.50        1.00          58           67             vs smart-grep
sverklo/sv-p2-06  P2        0.40        0.83          137          205            vs smart-grep
express/ex-p2-01  P2        0.27        0.63          442          701            vs smart-grep

The dead-code (P5) miss is structural — sverklo's reference graph doesn't catch dynamic invocations and deserialization-driven calls. The reference-finding (P2) gap is closer; smart-grep's regex variants happen to match a few cases sverklo's symbol resolution doesn't.

What this benchmark does NOT measure

Methodology

Reproducing this

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install
npm run build
npm run bench:quick                           # all baselines, all datasets
BASELINES=sverklo,jcodemunch npm run bench:quick   # baseline filter (comma-separated)
DATASETS=express npm run bench:quick               # single dataset filter

Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors this page's tables. Disagreements with our numbers are useful — file an issue with your machine spec and the run timestamp.

Submitting a new baseline

If you maintain a code-search tool, code-intelligence MCP server, or retrieval system, you can have it benchmarked here on the same task suite. Open a PR to sverklo/sverklo adding benchmark/src/baselines/<your-tool>.ts — auto-bench CI runs on the PR within ~10 minutes against the express dataset (~30 tasks) and posts a results-table comment back. You don't need to run the harness locally first; CI does it. Methodology repo: github.com/sverklo/sverklo-bench. Workflow source: .github/workflows/auto-bench.yml. Tracking: sverklo-bench#4.

Performance benchmarks (separate)

This page is the retrieval benchmark. We also publish performance numbers (cold index time, search latency, impact-analysis time) on five real OSS codebases at /benchmarks/. Both are reproducible from BENCHMARKS.md.

Cite this

If you reference this benchmark in academic work or comparison material:

@misc{sverklo_bench_primitives_2026,
  title  = {Sverklo bench:primitives — a 90-task retrieval evaluation for AI coding agents},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/bench/}
}

A few details we sweated

Hand-verified ground truth. Every one of the 180 task answers was inspected by hand at the fixed commit. Auto-generated ground truth from existing tooling correlates with whatever generated it; we wanted the harness to be cleanly testable against any retrieval system, including future ones we haven't built.

The losing slice gets the same prominence as the winning slice. The dead-code (P5) F1 = 0.02 number lives in the same table as the wins, two scrolls apart. A bench that only releases when the maintainer wins is marketing; a bench that releases when the maintainer loses is a bench. The contribution is the harness, not the leaderboard.

Tokens-per-correct-answer is the primary axis, F1 is secondary. Most retrieval evaluations report F1 first because they were designed for human-facing search. AI agents inside bounded context windows pay for every token returned; that opportunity cost compounds across an editing session. We report both axes and explain the tradeoff in plain terms; the agent-relevant axis is the one that earned the headline callouts above.

Naive grep is the floor, not the strawman. The naive baseline runs grep -rn <sym> . then reads the top 10 matching files in full, the same thing a Claude Code agent does in its first 5 minutes against a fresh codebase. If your bench's naive baseline scores 5% F1, you're probably measuring against a strawman; ours scores 0.35, which matches what real agents actually achieve on real tasks.

Cold-start is a separate column, not amortized. Sverklo's index build is 3,690 ms on this corpus. We list it as its own column rather than averaging it into wall time so you can decide whether your usage pattern justifies the upfront cost. For a 10-task session it dominates; for a multi-hour session it disappears.
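The "dominates for short sessions, disappears for long ones" claim can be put in numbers. The sketch below uses the April table's figures (sverklo: 3,690 ms cold, 1 ms wall per task; smart-grep: 215 ms wall per task, no index) and counts wall time only, ignoring tokens. The function names are illustrative.

```typescript
// Average wall time per task once the one-off index build is spread over a session.
function amortizedMsPerTask(coldMs: number, warmMs: number, tasks: number): number {
  return coldMs / tasks + warmMs;
}

// Smallest session length at which the indexed tool is no slower per task
// than an index-free rival. Assumes rivalWarmMs > warmMs.
function breakEvenTasks(coldMs: number, warmMs: number, rivalWarmMs: number): number {
  return Math.ceil(coldMs / (rivalWarmMs - warmMs));
}

console.log(amortizedMsPerTask(3690, 1, 10));   // 370 ms per task: the index build dominates
console.log(amortizedMsPerTask(3690, 1, 1000)); // 4.69 ms per task: the build cost fades
console.log(breakEvenTasks(3690, 1, 215));      // 18 tasks to match smart-grep on wall time
```

On wall time alone the index pays for itself after roughly 18 lookups against smart-grep. Token cost is a separate axis and is not part of this calculation.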

Raw JSONL output, not just aggregates. benchmark/results/<timestamp>/raw.jsonl has every task's input, the system's output, and the per-task scoring breakdown. Disagreements with our aggregates are useful — file an issue with your machine spec and run timestamp and we'll triage.

Six codebases including three Python frameworks is still small in absolute terms. The "What this benchmark does NOT measure" section above isn't decorative. The current 180-task slice (express, lodash, sverklo, requests, flask, fastapi) extended Python coverage in May 2026; Go / Rust / Java remain open dataset extensions. The limitations are explicitly listed.

Get started

If the token-economy numbers look interesting:

npm install -g sverklo
cd your-project
sverklo init

sverklo init auto-detects which AI coding agents you have installed (Claude Code, Cursor, Windsurf, Zed, Antigravity) and writes the right MCP config files.