bench:primitives — 90-task code retrieval evaluation

Five retrieval baselines, three real OSS codebases (express, lodash, sverklo), 90 hand-verified tasks. Reproducible. The naive baseline is what most agents do today — blind grep, 20K tokens of unranked regex hits per task.

April 2026 run: 2026-04-07T23:07:14Z · 3 baselines (naive-grep, smart-grep, sverklo)
May 2026 update: 2026-05-02T23:35:52Z · +2 baselines (jcodemunch-mcp, GitNexus) per #25 · harness on GitHub · methodology + ground truth: sverklo/sverklo-bench
May 2026 update — issue #25

Tom Hale (@HaleTom) asked if sverklo would benchmark itself against jcodemunch-mcp and GitNexus — two direct competitors in the local-first MCP code-intel space. The answer: yes, both are on the bench now.

Real findings, not the marketing version:
May 3, 2026 follow-on — bench-as-feedback-loop

Within hours of this update going live, @jgravelle shipped jcodemunch-mcp v1.80.7, then v1.80.8, then v1.80.9 — three releases addressing specific findings from this bench.

Confirmed on rerun against v1.80.8: the methodology gaps are now resolved. Updated 3-dataset numbers are below; the April-only table is preserved further down for historical reference.
May 4, 2026 PM — sverklo v0.20.2: lodash P1 recovered

Adding lodash to the bench (#26) exposed a blind spot in sverklo's own parser: findBraceEnd used naive character counting, so a { inside a string literal at lodash.js:6301 caused every subsequent function declaration to be absorbed into one ~11K-line chunk. Public methods (map, filter, reduce, etc.) never got their own chunks.

v0.20.2 (deedec2) ships two fixes: a string/regex/comment-aware brace counter, and exact-match priority in the lookup tool. The result on the same 90-task bench is in the table below. Both jcodemunch and sverklo shipped lodash P1 fixes inside 36 hours of the original benchmark publication; that is what a public, peer-reviewable benchmark is supposed to do.
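
For readers who want the shape of the first fix: below is a minimal sketch of a string- and comment-aware brace counter. It is illustrative only, not sverklo's actual findBraceEnd; among other simplifications it treats template literals as opaque strings and ignores regex literals, which a real implementation has to handle.

// Sketch: index of the brace that closes the one at openIdx.
// Skips braces inside strings and comments. Illustrative only:
// template literals are treated as opaque and regex literals ignored.
function findBraceEnd(src: string, openIdx: number): number {
  let depth = 0;
  let i = openIdx;
  while (i < src.length) {
    const c = src[i];
    if (c === '"' || c === "'" || c === '`') {
      const quote = c;
      i++;
      while (i < src.length && src[i] !== quote) {
        if (src[i] === '\\') i++; // jump over the escaped character
        i++;
      }
    } else if (c === '/' && src[i + 1] === '/') {
      while (i < src.length && src[i] !== '\n') i++; // line comment
    } else if (c === '/' && src[i + 1] === '*') {
      i += 2; // block comment: scan to the closing */
      while (i < src.length && !(src[i] === '*' && src[i + 1] === '/')) i++;
      i++;
    } else if (c === '{') {
      depth++;
    } else if (c === '}') {
      depth--;
      if (depth === 0) return i;
    }
    i++;
  }
  return -1; // unbalanced input
}

The April bug was the bare version of this loop: counting every { and } regardless of context, so the stray brace at lodash.js:6301 permanently skewed the depth.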

May 4, 2026 PM — post-fix table (sverklo v0.20.2)

| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | tools/task | tok/correct (gated) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 90 | 0.29 | 0.10 | 0.18 | 0.53 | 0.67 | 20,278 | 6.5 | 2,403 |
| smart-grep | 90 | 0.49 | 0.43 | 0.40 | 0.59 | 0.67 | 1,220 | 4.9 | 219 |
| sverklo v0.20.2 | 90 | 0.56 | 0.73 | 0.25 | 0.71 | 0.67 | 469 | 1.0 | 449 |
| jcodemunch v1.80.9 | 90 | 0.32 | 0.73 | 0.00 | 0.46 | 0.00 | 1,267 | 1.2 | 625 |
| gitnexus | 90 | 0.25 | 0.27 | 0.00 | 0.30 | 0.67 | 372 | 1.2 | 207 |

Note: sverklo numbers are from the post-v0.20.2 single-baseline rerun (2026-05-04T19-38-11-592Z); other baselines are from the morning 5-baseline run (2026-05-04T14-13-23-716Z) and didn't change between runs (no commits affecting them). Reproducible from a fresh clone with npm run bench:quick; full numbers should match within run-to-run variance.

What changed: overall F1 0.45 → 0.56, driven almost entirely by P1 0.30 → 0.73 now that the lodash public methods resolve to their own chunks. P2 dipped 0.34 → 0.25 and P4 0.76 → 0.71; average tokens rose slightly, 449 → 469.

How to reproduce: npm install -g sverklo@0.20.2 && cd /path/to/repo && npm run bench:quick. Bench harness lives at github.com/sverklo/sverklo/tree/main/benchmark. Issues #26, #27, #28 document the methodology iterations that produced this run.

May 4, 2026 AM — pre-fix table (sverklo v0.20.1, kept for diff)

| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | tools/task | tok/correct (gated) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 90 | 0.29 | 0.10 | 0.18 | 0.53 | 0.67 | 20,278 | 6.5 | 2,403 |
| smart-grep | 90 | 0.49 | 0.43 | 0.40 | 0.59 | 0.67 | 1,220 | 4.9 | 219 |
| sverklo v0.20.1 | 90 | 0.45 | 0.30 | 0.34 | 0.76 | 0.67 | 449 | 1.0 | 337 |
| jcodemunch v1.80.9 | 90 | 0.32 | 0.73 | 0.00 | 0.46 | 0.00 | 1,267 | 1.2 | 625 |
| gitnexus | 90 | 0.25 | 0.27 | 0.00 | 0.30 | 0.67 | 372 | 1.2 | 207 |

Pre-v0.20.2 numbers preserved here so the diff between the two runs is auditable. Raw data at benchmark/results/2026-05-04T14-13-23-716Z/.

May 2026 — All 5 baselines (original 60-task suite)

| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | cold (ms) | warm (ms) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 60 | 0.290 | 0.15 | 0.26 | 0.43 | 0.50 | 17,169 | 0 | 4,779 |
| smart-grep | 60 | 0.450 | 0.40 | 0.46 | 0.49 | 0.50 | 1,216 | 0 | 2,258 |
| sverklo | 60 | 0.449 | 0.45 | 0.27 | 0.75 | 0.50 | 386 | 1,159 | 38 |
| jcodemunch | 60 | 0.281 | 0.65 | 0.00 | 0.38 | 0.00 | 5,351 | 718 | 13 |
| gitnexus | 60 | 0.260 | 0.40 | 0.01 | 0.25 | 0.50 | 543 | 452 | 584 |

How to read this: No single baseline dominates. Different tools win different categories. The story isn't "sverklo beats everything" — it's "different retrieval substrates have different strengths, and the load-bearing axis depends on what you're optimizing for." Token economy + P4 are sverklo's clearest wins; P1 goes to jcodemunch; P2 to smart-grep.

Honest false positives we filed back to upstream:

Reproducer (requires uvx for jcodemunch and npm i -g gitnexus on PATH):

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build

# All 5 baselines
npm run bench:quick

# Single baseline
BASELINES=jcodemunch npm run bench:quick
BASELINES=gitnexus   npm run bench:quick
BASELINES=sverklo    npm run bench:quick

Original April 2026 run — 60 tasks, 3 baselines (naive-grep, smart-grep, sverklo).

Original numbers from the first public bench run (2026-04-07). The April run used a slightly older harness version; the May 4 PM run above is the canonical current data. Both are kept on this page so the reader can audit drift.

Headline (April 2026). On 60 verified tasks across expressjs/express and sverklo/sverklo: sverklo achieves F1 0.58 with 255 average input tokens and 1.0 tool calls; smart-grep (a tuned grep with language filters and definition-shaped patterns) achieves F1 0.67 with 731 tokens and 11.8 tool calls; naive grep (the floor — grep -rn <sym> . then read top 10 files) achieves F1 0.35 with 15,814 tokens and 7.6 tool calls.
15,814 → 255 tokens per task (−98% vs naive grep)
731 → 255 tokens per task (−65% vs tuned grep)
7-12 → 1 tool calls per task (−87% on average)

All baselines

| baseline | n | F1 | recall | prec | tokens | tools | wall (ms) | cold (ms) | gated tok/correct |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 60 | 0.35 | 0.56 | 0.29 | 15,814 | 7.6 | 1,302 | 0 | 3,557 (n=10) |
| smart-grep | 60 | 0.67 | 0.81 | 0.62 | 731 | 11.8 | 215 | 0 | 165 (n=28) |
| sverklo | 60 | 0.58 | 0.73 | 0.57 | 255 | 1.0 | 1 | 3,690 | 203 (n=25) |

Read this carefully: smart-grep is a strong baseline. A tuned grep with language filters and definition-shaped patterns has higher F1 (0.67 vs 0.58) on this 60-task slice. Sverklo wins on token economy and tool-call count by a large margin (62× fewer tokens than naive grep, 2.9× fewer than smart-grep, single tool call vs 7-12). For an AI agent with a 200K token context window, that's the load-bearing axis. For a human standing at a terminal with `rg`, smart-grep is fine.

Per-category breakdown

P1 — Definition lookup (n=20)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.15 | 23,337 | 339 | 8.1 |
| smart-grep | 0.60 | 196 | 51 | 2.0 |
| sverklo | 0.75 | 283 | 0 | 1.0 |

Sverklo wins. Single tool call (sverklo_lookup) vs 8 grep iterations.

P2 — Reference finding (n=20)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.39 | 21,925 | 345 | 7.0 |
| smart-grep | 0.81 | 224 | 17 | 1.0 |
| sverklo | 0.56 | 157 | 0 | 1.0 |

Smart-grep wins. Reference finding on Express/sverklo turns out to be a regex problem grep handles well; sverklo's symbol-graph helps less than we'd hoped on this slice. Token economy still favours sverklo.

P4 — File dependencies (n=10)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.51 | 2,918 | 280 | 2.0 |
| smart-grep | 0.63 | 1,058 | 16 | 2.0 |
| sverklo | 0.86 | 74 | 0 | 1.0 |

Sverklo wins decisively. sverklo_deps against the indexed import graph is what graph-based retrieval is supposed to do.
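
For intuition on why this category favours an index, here is a toy version of the substrate; it is not sverklo's code, just the shape of the data a deps query runs against. All the grep-equivalent work happens once at index time, and each P4 task becomes a single map lookup.

import { readFileSync } from 'node:fs';

// Toy import graph: file -> modules it imports. Not sverklo's
// implementation; the regex is a deliberately crude stand-in
// for real import/require extraction.
const IMPORT_RE = /(?:import\s[^'"]*|require\()\s*['"]([^'"]+)['"]/g;

function buildImportGraph(files: string[]): Map<string, string[]> {
  const graph = new Map<string, string[]>();
  for (const file of files) {
    const src = readFileSync(file, 'utf8');
    graph.set(file, [...src.matchAll(IMPORT_RE)].map((m) => m[1]));
  }
  return graph;
}

// Index once, then each P4 task is one lookup:
//   const graph = buildImportGraph(allSourceFiles);
//   graph.get('lib/router.js'); // -> list of modules it imports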

P5 — Dead code (n=10)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.50 | 1,442 | 6,164 | 13.5 |
| smart-grep | 0.55 | 2,488 | 1,138 | 63.0 |
| sverklo | 0.02 | 579 | 3 | 1.0 |

Sverklo loses badly here. The current sverklo_refs doesn't catch dynamic invocations and deserialization-driven calls that smart-grep finds via aggressive whole-file reads. P5 is the next slice we plan to fix.
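
Concretely, the failure mode is a call site with no static reference to the symbol. A contrived illustration, not taken from the bench corpus:

// Contrived illustration of the P5 miss; not from the bench corpus.
// No call expression names auditLog, so a static reference graph
// reports zero call sites and flags it as dead code.
const hooks: Record<string, () => void> = {
  auditLog: () => console.log('audited'),
};

const hookName: string = JSON.parse('"auditLog"'); // name arrives as data
hooks[hookName](); // dynamic, deserialization-driven invocation

This is why whole-file reads can catch it and symbol resolution does not.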

Where sverklo wins (full list)

| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok |
|---|---|---|---|---|---|
| express/ex-p1-02 | P1 | 1.00 | 0.00 | 769 | 10,615 |
| express/ex-p1-03 | P1 | 1.00 | 0.00 | 69 | 26,844 |
| express/ex-p1-09 | P1 | 1.00 | 0.00 | 128 | 5,920 |
| sverklo/sv-p4-05 | P4 | 1.00 | 0.50 | 50 | 874 |
| express/ex-p4-04 | P4 | 1.00 | 0.50 | 36 | 3,781 |
| sverklo/sv-p4-04 | P4 | 1.00 | 0.67 | 42 | 928 |
| express/ex-p4-05 | P4 | 1.00 | 0.68 | 41 | 1,316 |
| express/ex-p4-02 | P4 | 0.90 | 0.68 | 79 | 1,345 |
| sverklo/sv-p4-02 | P4 | 0.86 | 0.71 | 40 | 334 |
| sverklo/sv-p4-03 | P4 | 0.86 | 0.75 | 59 | 754 |
| sverklo/sv-p4-01 | P4 | 0.80 | 0.69 | 232 | 1,373 |

Where sverklo loses (the honesty section)

If you skip this section, you're doing benchmark cherry-picking. We're not.

| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok | note |
|---|---|---|---|---|---|---|
| express/ex-p5-01 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p5-02 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p5-03 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p2-04 | P2 | 0.00 | 1.00 | 30 | 49 | missed |
| sverklo/sv-p2-04 | P2 | 0.50 | 1.00 | 58 | 67 | vs smart-grep |
| sverklo/sv-p2-06 | P2 | 0.40 | 0.83 | 137 | 205 | vs smart-grep |
| express/ex-p2-01 | P2 | 0.27 | 0.63 | 442 | 701 | vs smart-grep |

The dead-code (P5) miss is structural — sverklo's reference graph doesn't catch dynamic invocations and deserialization-driven calls. The reference-finding (P2) gap is closer; smart-grep's regex variants happen to match a few cases sverklo's symbol resolution doesn't.

What this benchmark does NOT measure

Methodology

Reproducing this

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install
npm run build
npm run bench:quick                           # all baselines, all datasets
BASELINES=sverklo,jcodemunch npm run bench:quick   # baseline filter (comma-separated)
DATASETS=express npm run bench:quick               # single dataset filter

Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors this page's tables. Disagreements with our numbers are useful — file an issue with your machine spec and the run timestamp.

Submitting a new baseline

If you maintain a code-search tool, code-intelligence MCP server, or retrieval system, you can have it benchmarked here on the same task suite. Open a PR to sverklo/sverklo adding benchmark/src/baselines/<your-tool>.ts — auto-bench CI runs on the PR within ~10 minutes against the express dataset (~30 tasks) and posts a results-table comment back. You don't need to run the harness locally first; CI does it. Methodology repo: github.com/sverklo/sverklo-bench. Workflow source: .github/workflows/auto-bench.yml. Tracking: sverklo-bench#4.
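
If it helps to see the shape before opening the repo: a baseline adapter is roughly a task-in, ranked-hits-out function. The field names below are hypothetical; copy the real contract from an existing adapter in benchmark/src/baselines/ before submitting.

// Hypothetical sketch of a baseline adapter. Field names are
// illustrative; the real contract lives in benchmark/src/baselines/.
interface BenchTask {
  id: string;        // e.g. "express/ex-p1-02"
  category: string;  // P1 | P2 | P4 | P5
  query: string;     // the retrieval question
  repoPath: string;  // checkout pinned to the bench commit
}

interface BenchResult {
  hits: string[];      // ranked file (or file:line) answers
  tokensUsed: number;  // input tokens an agent would pay to consume them
  toolCalls: number;
}

export async function run(task: BenchTask): Promise<BenchResult> {
  // Invoke your tool here (CLI, library, or MCP call) and map its output.
  return { hits: [], tokensUsed: 0, toolCalls: 1 };
}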

Performance benchmarks (separate)

This page is the retrieval benchmark. We also publish performance numbers (cold index time, search latency, impact-analysis time) on five real OSS codebases at /benchmarks/. Both are reproducible from BENCHMARKS.md.

Cite this

If you reference this benchmark in academic work or comparison material:

@misc{sverklo_bench_primitives_2026,
  title  = {Sverklo bench:primitives — a 90-task retrieval evaluation for AI coding agents},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/bench/}
}

A few details we sweated

Hand-verified ground truth. Every one of the 90 task answers was inspected by hand at the fixed commit. Auto-generated ground truth from existing tooling correlates with whatever generated it; we wanted the harness to be cleanly testable against any retrieval system, including future ones we haven't built.

The losing slice gets the same prominence as the winning slice. The dead-code (P5) F1 = 0.02 number lives in the same table as the wins, two scrolls apart. A bench that only releases when the maintainer wins is marketing; a bench that releases when the maintainer loses is a bench. The contribution is the harness, not the leaderboard.

Tokens-per-correct-answer is the primary axis, F1 is secondary. Most retrieval evaluations report F1 first because they were designed for human-facing search. AI agents inside bounded context windows pay for every token returned; that opportunity cost compounds across an editing session. We report both axes and explain the tradeoff in plain terms; the agent-relevant axis is the one that earned the headline callouts above.

Naive grep is the floor, not the strawman. The naive baseline runs grep -rn <sym> . then reads the top 10 matching files in full — the same thing a Claude Code agent does on its first 5 minutes against a fresh codebase. If your bench's naive baseline scores 5% F1, you're probably measuring against a strawman; ours scores 0.35, which matches what real agents actually achieve on real tasks.
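
Mechanically, the floor baseline is a few lines. A sketch, with -l standing in for collecting filenames from the -rn output (the harness's real version lives in benchmark/src/baselines/):

import { execFileSync } from 'node:child_process';
import { readFileSync } from 'node:fs';
import { join } from 'node:path';

// Sketch of the naive baseline's mechanics: list files matching the
// symbol, read the top 10 in full, pay input tokens for every byte.
// Assumes at least one match (grep exits nonzero otherwise).
function naiveGrep(symbol: string, repo: string): string[] {
  const out = execFileSync('grep', ['-rl', symbol, '.'], {
    cwd: repo,
    encoding: 'utf8',
  });
  return out
    .trim()
    .split('\n')
    .slice(0, 10)
    .map((f) => readFileSync(join(repo, f), 'utf8'));
}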

Cold-start is a separate column, not amortized. Sverklo's index build is 3,690 ms on this corpus. We list it as its own column rather than averaging it into wall time so you can decide whether your usage pattern justifies the upfront cost. For a 10-task session it dominates; for a multi-hour session it disappears.

Raw JSONL output, not just aggregates. benchmark/results/<timestamp>/raw.jsonl has every task's input, the system's output, and the per-task scoring breakdown. Disagreements with our aggregates are useful — file an issue with your machine spec and run timestamp and we'll triage.
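
Re-deriving an aggregate from the raw file takes a few lines. The field names and the correctness gate below are assumptions for illustration; inspect your own raw.jsonl for the actual schema.

import { readFileSync } from 'node:fs';

// Re-derive tokens-per-correct from a run's raw.jsonl. Field names
// (f1, tokens) and the 0.5 gate are assumed, not the harness's schema.
type Row = { task: string; f1: number; tokens: number };

const raw = readFileSync('benchmark/results/<timestamp>/raw.jsonl', 'utf8');
const rows: Row[] = raw.trim().split('\n').map((l) => JSON.parse(l));

const correct = rows.filter((r) => r.f1 >= 0.5); // assumed gate
const totalTokens = rows.reduce((sum, r) => sum + r.tokens, 0);
console.log({
  n: rows.length,
  correct: correct.length,
  tokPerCorrect: Math.round(totalTokens / correct.length),
});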

Three codebases is still small. We say so. The "What this benchmark does NOT measure" section above isn't decorative. The next dataset extension is Go / Python / Rust on 5+ codebases. The current 90-task slice is what we have today, with the limitations explicitly listed.

Get started

If the token-economy numbers look interesting:

npm install -g sverklo
cd your-project
sverklo init

sverklo init auto-detects which AI coding agents you have installed (Claude Code, Cursor, Windsurf, Zed, Antigravity) and writes the right MCP config files.