bench:primitives — 90-task code retrieval evaluation

Five retrieval baselines, three real OSS codebases (express, lodash, sverklo), 90 hand-verified tasks. Reproducible. The naive baseline is what most agents do today — blind grep, 20K tokens of unranked regex hits per task.

April 2026 run: 2026-04-07T23:07:14Z · 3 baselines (naive-grep, smart-grep, sverklo)
May 2026 update: 2026-05-02T23:35:52Z · +2 baselines (jcodemunch-mcp, GitNexus) per #25 · harness on GitHub · methodology + ground truth: sverklo/sverklo-bench
May 2026 update — issue #25

Tom Hale (@HaleTom) asked if sverklo would benchmark itself against jcodemunch-mcp and GitNexus — two direct competitors in the local-first MCP code-intel space. The answer: yes, both are on the bench now.

Real findings, not the marketing version:
May 3, 2026 follow-on — bench-as-feedback-loop

Within hours of this update going live, @jgravelle shipped jcodemunch-mcp v1.80.7, then v1.80.8, then v1.80.9 — three releases addressing specific findings from this bench.

Confirmed on rerun against v1.80.8: the methodology gaps are now resolved. Updated 3-dataset numbers are below; the April-only table is preserved further down for historical reference.
May 4, 2026 PM — sverklo v0.20.2: lodash P1 recovered

Adding lodash to the bench (#26) exposed a blind spot in sverklo's own parser: findBraceEnd used naive character counting, so a { inside a string literal at lodash.js:6301 caused every subsequent function declaration to be absorbed into one ~11K-line chunk. Public methods (map, filter, reduce, etc.) never got their own chunks.

v0.20.2 (deedec2) ships two fixes: a string/regex/comment-aware brace counter, and exact-match priority in the lookup tool. The result on the same 90-task bench is in the table below. Both jcodemunch and sverklo shipped lodash P1 fixes inside 36 hours of the original benchmark publication; that is what a public, peer-reviewable benchmark is supposed to do.
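
For readers who want the shape of the first fix: below is a minimal sketch of a string- and comment-aware brace counter. It is illustrative only, not sverklo's actual findBraceEnd; among other simplifications it treats template literals as opaque strings and ignores regex literals, which a real implementation has to handle.

// Sketch: index of the brace that closes the one at openIdx.
// Skips braces inside strings and comments. Illustrative only:
// template literals are treated as opaque and regex literals ignored.
function findBraceEnd(src: string, openIdx: number): number {
  let depth = 0;
  let i = openIdx;
  while (i < src.length) {
    const c = src[i];
    if (c === '"' || c === "'" || c === '`') {
      const quote = c;
      i++;
      while (i < src.length && src[i] !== quote) {
        if (src[i] === '\\') i++; // jump over the escaped character
        i++;
      }
    } else if (c === '/' && src[i + 1] === '/') {
      while (i < src.length && src[i] !== '\n') i++; // line comment
    } else if (c === '/' && src[i + 1] === '*') {
      i += 2; // block comment: scan to the closing */
      while (i < src.length && !(src[i] === '*' && src[i + 1] === '/')) i++;
      i++;
    } else if (c === '{') {
      depth++;
    } else if (c === '}') {
      depth--;
      if (depth === 0) return i;
    }
    i++;
  }
  return -1; // unbalanced input
}

The April bug was the bare version of this loop: counting every { and } regardless of context, so the stray brace at lodash.js:6301 permanently skewed the depth.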

May 4, 2026 PM — post-fix table (sverklo v0.20.2)

| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | tools/task | tok/correct (gated) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 90 | 0.29 | 0.10 | 0.18 | 0.53 | 0.67 | 20,278 | 6.5 | 2,403 |
| smart-grep | 90 | 0.49 | 0.43 | 0.40 | 0.59 | 0.67 | 1,220 | 4.9 | 219 |
| sverklo v0.20.2 | 90 | 0.56 | 0.73 | 0.25 | 0.71 | 0.67 | 469 | 1.0 | 449 |
| jcodemunch v1.80.9 | 90 | 0.32 | 0.73 | 0.00 | 0.46 | 0.00 | 1,267 | 1.2 | 625 |
| gitnexus | 90 | 0.25 | 0.27 | 0.00 | 0.30 | 0.67 | 372 | 1.2 | 207 |

Note: sverklo numbers are from the post-v0.20.2 single-baseline rerun (2026-05-04T19-38-11-592Z); other baselines are from the morning 5-baseline run (2026-05-04T14-13-23-716Z) and didn't change between runs (no commits affecting them). Reproducible from a fresh clone with npm run bench:quick; full numbers should match within run-to-run variance.

What changed: overall F1 0.45 → 0.56, driven almost entirely by P1 0.30 → 0.73 now that the lodash public methods resolve to their own chunks. P2 dipped 0.34 → 0.25 and P4 0.76 → 0.71; average tokens rose slightly, 449 → 469.

How to reproduce: npm install -g sverklo@0.20.2 && cd /path/to/repo && npm run bench:quick. Bench harness lives at github.com/sverklo/sverklo/tree/main/benchmark. Issues #26, #27, #28 document the methodology iterations that produced this run.

May 4, 2026 AM — pre-fix table (sverklo v0.20.1, kept for diff)

| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | tools/task | tok/correct (gated) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 90 | 0.29 | 0.10 | 0.18 | 0.53 | 0.67 | 20,278 | 6.5 | 2,403 |
| smart-grep | 90 | 0.49 | 0.43 | 0.40 | 0.59 | 0.67 | 1,220 | 4.9 | 219 |
| sverklo v0.20.1 | 90 | 0.45 | 0.30 | 0.34 | 0.76 | 0.67 | 449 | 1.0 | 337 |
| jcodemunch v1.80.9 | 90 | 0.32 | 0.73 | 0.00 | 0.46 | 0.00 | 1,267 | 1.2 | 625 |
| gitnexus | 90 | 0.25 | 0.27 | 0.00 | 0.30 | 0.67 | 372 | 1.2 | 207 |

Pre-v0.20.2 numbers preserved here so the diff between the two runs is auditable. Raw data at benchmark/results/2026-05-04T14-13-23-716Z/.

May 2026 — All 5 baselines (original 60-task suite)

| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | cold (ms) | warm (ms) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 60 | 0.290 | 0.15 | 0.26 | 0.43 | 0.50 | 17,169 | 0 | 4,779 |
| smart-grep | 60 | 0.450 | 0.40 | 0.46 | 0.49 | 0.50 | 1,216 | 0 | 2,258 |
| sverklo | 60 | 0.449 | 0.45 | 0.27 | 0.75 | 0.50 | 386 | 1,159 | 38 |
| jcodemunch | 60 | 0.281 | 0.65 | 0.00 | 0.38 | 0.00 | 5,351 | 718 | 13 |
| gitnexus | 60 | 0.260 | 0.40 | 0.01 | 0.25 | 0.50 | 543 | 452 | 584 |

How to read this: No single baseline dominates. Different tools win different categories. The story isn't "sverklo beats everything" — it's "different retrieval substrates have different strengths, and the load-bearing axis depends on what you're optimizing for." Token economy + P4 are sverklo's clearest wins; P1 goes to jcodemunch; P2 to smart-grep.

Honest false positives we filed back to upstream:

Reproducer (requires uvx for jcodemunch and npm i -g gitnexus on PATH):

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build

# All 5 baselines
npm run bench:quick

# Single baseline
BASELINES=jcodemunch npm run bench:quick
BASELINES=gitnexus   npm run bench:quick
BASELINES=sverklo    npm run bench:quick

Original April 2026 run — 60 tasks, 3 baselines (naive-grep, smart-grep, sverklo).

Original numbers from the first public bench run (2026-04-07). The April run used a slightly older harness version; the May 4 PM run above is the canonical current data. Both are kept on this page so the reader can audit drift.

Headline (April 2026). On 60 verified tasks across expressjs/express and sverklo/sverklo: sverklo achieves F1 0.58 with 255 average input tokens and 1.0 tool calls; smart-grep (a tuned grep with language filters and definition-shaped patterns) achieves F1 0.67 with 731 tokens and 11.8 tool calls; naive grep (the floor — grep -rn <sym> . then read top 10 files) achieves F1 0.35 with 15,814 tokens and 7.6 tool calls.
15,814 → 255 tokens per task (−98% vs naive grep)
731 → 255 tokens per task (−65% vs tuned grep)
7-12 → 1 tool calls per task (−87% on average)

All baselines

| baseline | n | F1 | recall | prec | tokens | tools | wall (ms) | cold (ms) | gated tok/correct |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 60 | 0.35 | 0.56 | 0.29 | 15,814 | 7.6 | 1,302 | 0 | 3,557 (n=10) |
| smart-grep | 60 | 0.67 | 0.81 | 0.62 | 731 | 11.8 | 215 | 0 | 165 (n=28) |
| sverklo | 60 | 0.58 | 0.73 | 0.57 | 255 | 1.0 | 1 | 3,690 | 203 (n=25) |

Read this carefully: smart-grep is a strong baseline. A tuned grep with language filters and definition-shaped patterns has higher F1 (0.67 vs 0.58) on this 60-task slice. Sverklo wins on token economy and tool-call count by a large margin (62× fewer tokens than naive grep, 2.9× fewer than smart-grep, single tool call vs 7-12). For an AI agent with a 200K token context window, that's the load-bearing axis. For a human standing at a terminal with `rg`, smart-grep is fine.

Per-category breakdown

P1 — Definition lookup (n=20)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.15 | 23,337 | 339 | 8.1 |
| smart-grep | 0.60 | 196 | 51 | 2.0 |
| sverklo | 0.75 | 283 | 0 | 1.0 |

Sverklo wins. Single tool call (sverklo_lookup) vs 8 grep iterations.

P2 — Reference finding (n=20)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.39 | 21,925 | 345 | 7.0 |
| smart-grep | 0.81 | 224 | 17 | 1.0 |
| sverklo | 0.56 | 157 | 0 | 1.0 |

Smart-grep wins. Reference finding on Express/sverklo turns out to be a regex problem grep handles well; sverklo's symbol-graph helps less than we'd hoped on this slice. Token economy still favours sverklo.

P4 — File dependencies (n=10)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.51 | 2,918 | 280 | 2.0 |
| smart-grep | 0.63 | 1,058 | 16 | 2.0 |
| sverklo | 0.86 | 74 | 0 | 1.0 |

Sverklo wins decisively. sverklo_deps against the indexed import graph is what graph-based retrieval is supposed to do.
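
For intuition on why this category favours an index, here is a toy version of the substrate; it is not sverklo's code, just the shape of the data a deps query runs against. All the grep-equivalent work happens once at index time, and each P4 task becomes a single map lookup.

import { readFileSync } from 'node:fs';

// Toy import graph: file -> modules it imports. Not sverklo's
// implementation; the regex is a deliberately crude stand-in
// for real import/require extraction.
const IMPORT_RE = /(?:import\s[^'"]*|require\()\s*['"]([^'"]+)['"]/g;

function buildImportGraph(files: string[]): Map<string, string[]> {
  const graph = new Map<string, string[]>();
  for (const file of files) {
    const src = readFileSync(file, 'utf8');
    graph.set(file, [...src.matchAll(IMPORT_RE)].map((m) => m[1]));
  }
  return graph;
}

// Index once, then each P4 task is one lookup:
//   const graph = buildImportGraph(allSourceFiles);
//   graph.get('lib/router.js'); // -> list of modules it imports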

P5 — Dead code (n=10)

| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.50 | 1,442 | 6,164 | 13.5 |
| smart-grep | 0.55 | 2,488 | 1,138 | 63.0 |
| sverklo | 0.02 | 579 | 3 | 1.0 |

Sverklo loses badly here. The current sverklo_refs doesn't catch dynamic invocations and deserialization-driven calls that smart-grep finds via aggressive whole-file reads. P5 is the next slice we plan to fix.
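
Concretely, the failure mode is a call site with no static reference to the symbol. A contrived illustration, not taken from the bench corpus:

// Contrived illustration of the P5 miss; not from the bench corpus.
// No call expression names auditLog, so a static reference graph
// reports zero call sites and flags it as dead code.
const hooks: Record<string, () => void> = {
  auditLog: () => console.log('audited'),
};

const hookName: string = JSON.parse('"auditLog"'); // name arrives as data
hooks[hookName](); // dynamic, deserialization-driven invocation

This is why whole-file reads can catch it and symbol resolution does not.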

Where sverklo wins (full list)

| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok |
|---|---|---|---|---|---|
| express/ex-p1-02 | P1 | 1.00 | 0.00 | 769 | 10,615 |
| express/ex-p1-03 | P1 | 1.00 | 0.00 | 69 | 26,844 |
| express/ex-p1-09 | P1 | 1.00 | 0.00 | 128 | 5,920 |
| sverklo/sv-p4-05 | P4 | 1.00 | 0.50 | 50 | 874 |
| express/ex-p4-04 | P4 | 1.00 | 0.50 | 36 | 3,781 |
| sverklo/sv-p4-04 | P4 | 1.00 | 0.67 | 42 | 928 |
| express/ex-p4-05 | P4 | 1.00 | 0.68 | 41 | 1,316 |
| express/ex-p4-02 | P4 | 0.90 | 0.68 | 79 | 1,345 |
| sverklo/sv-p4-02 | P4 | 0.86 | 0.71 | 40 | 334 |
| sverklo/sv-p4-03 | P4 | 0.86 | 0.75 | 59 | 754 |
| sverklo/sv-p4-01 | P4 | 0.80 | 0.69 | 232 | 1,373 |

Where sverklo loses (the honesty section)

If you skip this section, you're doing benchmark cherry-picking. We're not.

| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok | note |
|---|---|---|---|---|---|---|
| express/ex-p5-01 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p5-02 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p5-03 | P5 | 0.00 | 1.00 | 53 | 50 | missed |
| express/ex-p2-04 | P2 | 0.00 | 1.00 | 30 | 49 | missed |
| sverklo/sv-p2-04 | P2 | 0.50 | 1.00 | 58 | 67 | vs smart-grep |
| sverklo/sv-p2-06 | P2 | 0.40 | 0.83 | 137 | 205 | vs smart-grep |
| express/ex-p2-01 | P2 | 0.27 | 0.63 | 442 | 701 | vs smart-grep |

The dead-code (P5) miss is structural — sverklo's reference graph doesn't catch dynamic invocations and deserialization-driven calls. The reference-finding (P2) gap is closer; smart-grep's regex variants happen to match a few cases sverklo's symbol resolution doesn't.

What this benchmark does NOT measure

Methodology

Reproducing this

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install
npm run build
npm run bench:quick                           # all baselines, all datasets
BASELINES=sverklo,jcodemunch npm run bench:quick   # baseline filter (comma-separated)
DATASETS=express npm run bench:quick               # single dataset filter

Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors this page's tables. Disagreements with our numbers are useful — file an issue with your machine spec and the run timestamp.

Submitting a new baseline

If you maintain a code-search tool, code-intelligence MCP server, or retrieval system, you can have it benchmarked here on the same task suite. Open a PR to sverklo/sverklo adding benchmark/src/baselines/<your-tool>.ts — auto-bench CI runs on the PR within ~10 minutes against the express dataset (~30 tasks) and posts a results-table comment back. You don't need to run the harness locally first; CI does it. Methodology repo: github.com/sverklo/sverklo-bench. Workflow source: .github/workflows/auto-bench.yml. Tracking: sverklo-bench#4.
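
If it helps to see the shape before opening the repo: a baseline adapter is roughly a task-in, ranked-hits-out function. The field names below are hypothetical; copy the real contract from an existing adapter in benchmark/src/baselines/ before submitting.

// Hypothetical sketch of a baseline adapter. Field names are
// illustrative; the real contract lives in benchmark/src/baselines/.
interface BenchTask {
  id: string;        // e.g. "express/ex-p1-02"
  category: string;  // P1 | P2 | P4 | P5
  query: string;     // the retrieval question
  repoPath: string;  // checkout pinned to the bench commit
}

interface BenchResult {
  hits: string[];      // ranked file (or file:line) answers
  tokensUsed: number;  // input tokens an agent would pay to consume them
  toolCalls: number;
}

export async function run(task: BenchTask): Promise<BenchResult> {
  // Invoke your tool here (CLI, library, or MCP call) and map its output.
  return { hits: [], tokensUsed: 0, toolCalls: 1 };
}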

Performance benchmarks (separate)

This page is the retrieval benchmark. We also publish performance numbers (cold index time, search latency, impact-analysis time) on five real OSS codebases at /benchmarks/. Both are reproducible from BENCHMARKS.md.

Cite this

If you reference this benchmark in academic work or comparison material:

@misc{sverklo_bench_primitives_2026,
  title  = {Sverklo bench:primitives — a 90-task retrieval evaluation for AI coding agents},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/bench/}
}

A few details we sweated

Hand-verified ground truth. Every one of the 90 task answers was inspected by hand at the fixed commit. Auto-generated ground truth from existing tooling correlates with whatever generated it; we wanted the harness to be cleanly testable against any retrieval system, including future ones we haven't built.

The losing slice gets the same prominence as the winning slice. The dead-code (P5) F1 = 0.02 number lives in the same table as the wins, two scrolls apart. A bench that only releases when the maintainer wins is marketing; a bench that releases when the maintainer loses is a bench. The contribution is the harness, not the leaderboard.

Tokens-per-correct-answer is the primary axis, F1 is secondary. Most retrieval evaluations report F1 first because they were designed for human-facing search. AI agents inside bounded context windows pay for every token returned; that opportunity cost compounds across an editing session. We report both axes and explain the tradeoff in plain terms; the agent-relevant axis is the one that earned the headline callouts above.

Naive grep is the floor, not the strawman. The naive baseline runs grep -rn <sym> . then reads the top 10 matching files in full — the same thing a Claude Code agent does on its first 5 minutes against a fresh codebase. If your bench's naive baseline scores 5% F1, you're probably measuring against a strawman; ours scores 0.35, which matches what real agents actually achieve on real tasks.
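
Mechanically, the floor baseline is a few lines. A sketch, with -l standing in for collecting filenames from the -rn output (the harness's real version lives in benchmark/src/baselines/):

import { execFileSync } from 'node:child_process';
import { readFileSync } from 'node:fs';
import { join } from 'node:path';

// Sketch of the naive baseline's mechanics: list files matching the
// symbol, read the top 10 in full, pay input tokens for every byte.
// Assumes at least one match (grep exits nonzero otherwise).
function naiveGrep(symbol: string, repo: string): string[] {
  const out = execFileSync('grep', ['-rl', symbol, '.'], {
    cwd: repo,
    encoding: 'utf8',
  });
  return out
    .trim()
    .split('\n')
    .slice(0, 10)
    .map((f) => readFileSync(join(repo, f), 'utf8'));
}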

Cold-start is a separate column, not amortized. Sverklo's index build is 3,690 ms on this corpus. We list it as its own column rather than averaging it into wall time so you can decide whether your usage pattern justifies the upfront cost. For a 10-task session it dominates; for a multi-hour session it disappears.

Raw JSONL output, not just aggregates. benchmark/results/<timestamp>/raw.jsonl has every task's input, the system's output, and the per-task scoring breakdown. Disagreements with our aggregates are useful — file an issue with your machine spec and run timestamp and we'll triage.
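
Re-deriving an aggregate from the raw file takes a few lines. The field names and the correctness gate below are assumptions for illustration; inspect your own raw.jsonl for the actual schema.

import { readFileSync } from 'node:fs';

// Re-derive tokens-per-correct from a run's raw.jsonl. Field names
// (f1, tokens) and the 0.5 gate are assumed, not the harness's schema.
type Row = { task: string; f1: number; tokens: number };

const raw = readFileSync('benchmark/results/<timestamp>/raw.jsonl', 'utf8');
const rows: Row[] = raw.trim().split('\n').map((l) => JSON.parse(l));

const correct = rows.filter((r) => r.f1 >= 0.5); // assumed gate
const totalTokens = rows.reduce((sum, r) => sum + r.tokens, 0);
console.log({
  n: rows.length,
  correct: correct.length,
  tokPerCorrect: Math.round(totalTokens / correct.length),
});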

Three codebases is still small. We say so. The "What this benchmark does NOT measure" section above isn't decorative. The next dataset extension is Go / Python / Rust on 5+ codebases. The current 90-task slice is what we have today, with the limitations explicitly listed.

Get started

If the token-economy numbers look interesting:

npm install -g sverklo
cd your-project
sverklo init

sverklo init auto-detects which AI coding agents you have installed (Claude Code, Cursor, Windsurf, Zed, Antigravity) and writes the right MCP config files.