bench:primitives — 90-task code retrieval evaluation
Five retrieval baselines, three real OSS codebases (express, lodash, sverklo), 90 hand-verified tasks. Reproducible. The naive baseline is what most agents do today — blind grep, 20K tokens of unranked regex hits per task.
Tom Hale (@HaleTom) asked if sverklo would benchmark itself against jcodemunch-mcp and GitNexus — two direct competitors in the local-first MCP code-intel space. The answer: yes, both are on the bench now.
Real findings, not the marketing version:
- smart-grep ties sverklo on overall F1. A tuned grep with language filters is a stronger baseline than the literature in this space usually admits. The differentiation is on token economy and tool-call count, not raw F1.
- jcodemunch wins P1 (definition lookup) at 0.65 vs sverklo's 0.45. Their tree-sitter symbol indexing is sharp; we should learn from it.
- Both jcodemunch and gitnexus return ~0 on P2 — by design. Both expose import-graph references rather than call-site references. That's a legitimate design choice, not a bug — flagged in a public exchange with @jgravelle on issue #25. Our P2 task as defined ("find every caller of X") assumes call-site semantics, which is the load-bearing axis for refactor blast-radius. If your workflow is "every file that imports module Y," import-graph-only is the right substrate and these tools win that subtask. Different retrieval models, different jobs.
- Sverklo wins P4 (file dependencies) at 0.75 vs others 0.25–0.49. Symbol graph + PageRank pays off here.
- Sverklo's token economy holds: 386 average input tokens vs jcodemunch's 5,351 and naive-grep's 17,169.
Within hours of this update going live, @jgravelle shipped jcodemunch-mcp v1.80.7, then v1.80.8, then v1.80.9 — three releases addressing specific findings from this bench.
Confirmed on rerun (v1.80.8):
- Avg input tokens: 5,351 → 1,388 (−74%). The token-bloat fix landed cleanly.
- Express P5 recall: 0.00 → 1.00. The CommonJS `module.exports` re-export blind spot for `createApplication` is closed.
- P1 unchanged at 0.65 (still leading on definition lookup). P2 unchanged (acknowledged design choice — see above).
Adding lodash to the bench (#26) exposed a blind spot in sverklo's own parser:
`findBraceEnd` did naive character counting, so a `{` inside a string literal at lodash.js:6301 caused every subsequent function declaration to be absorbed into one ~11K-line chunk. Public methods (`map`, `filter`, `reduce`, etc.) never got their own chunks.
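To make the failure mode concrete, here's a minimal sketch of a string- and comment-aware brace scanner — illustrative only, not sverklo's actual `findBraceEnd` (regex literals, which need tokenizer context to tell apart from division, are omitted):

```ts
// Sketch: return the index just past the '}' matching the '{' at `start`.
// Skips string/template literals and comments so a '{' inside "..." (the
// lodash.js:6301 case) can't shift the depth counter.
function findBraceEnd(src: string, start: number): number {
  let depth = 0;
  let i = start; // `start` points at the opening '{'
  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];
    if (ch === '"' || ch === "'" || ch === '`') {
      // Skip the whole string/template literal, honouring backslash escapes.
      const quote = ch;
      i++;
      while (i < src.length && src[i] !== quote) {
        if (src[i] === '\\') i++; // jump over the escaped character
        i++;
      }
    } else if (ch === '/' && next === '/') {
      while (i < src.length && src[i] !== '\n') i++; // line comment
    } else if (ch === '/' && next === '*') {
      i += 2;
      while (i + 1 < src.length && !(src[i] === '*' && src[i + 1] === '/')) i++;
      i++; // land on the closing '/'
    } else if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth--;
      if (depth === 0) return i + 1; // index just past the matching '}'
    }
    i++;
  }
  return -1; // unbalanced — caller falls back to chunking heuristics
}
```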
v0.20.2 (deedec2) ships two fixes: a string/regex/comment-aware brace counter, and exact-match priority in the lookup tool. Result on the same 90-task bench:
- Sverklo lodash P1: 0/10 → 9/10 (the only miss is `merge`, which is `var merge = createAssigner(...)` — a function-call assignment the regex parser still can't trace).
- Sverklo overall F1: 0.45 → 0.56 — sverklo is now the F1 leader, edging out smart-grep at 0.49.
- Sverklo P1: 0.30 → 0.73 — ties jcodemunch.
- Small P2/P4 regressions tracked in #28.
May 4, 2026 PM — post-fix table (sverklo v0.20.2)
| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | tools/task | tok/correct (gated) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 90 | 0.29 | 0.10 | 0.18 | 0.53 | 0.67 | 20,278 | 6.5 | 2,403 |
| smart-grep | 90 | 0.49 | 0.43 | 0.40 | 0.59 | 0.67 | 1,220 | 4.9 | 219 |
| sverklo v0.20.2 | 90 | 0.56 | 0.73 | 0.25 | 0.71 | 0.67 | 469 | 1.0 | 449 |
| jcodemunch v1.80.9 | 90 | 0.32 | 0.73 | 0.00 | 0.46 | 0.00 | 1,267 | 1.2 | 625 |
| gitnexus | 90 | 0.25 | 0.27 | 0.00 | 0.30 | 0.67 | 372 | 1.2 | 207 |
Note: sverklo numbers are from the post-v0.20.2 single-baseline rerun (2026-05-04T19-38-11-592Z); other baselines are from the morning 5-baseline run (2026-05-04T14-13-23-716Z) and didn't change between runs (no commits affecting them). Reproducible from a fresh clone with npm run bench:quick; full numbers should match within run-to-run variance.
What changed:
- Sverklo wins overall F1 (0.56), edging smart-grep (0.49). Pre-fix it was 0.45 (smart-grep led).
- P1 is now a tie at 0.73 between sverklo and jcodemunch. Pre-fix sverklo was at 0.30 — both projects had to ship parser fixes (sverklo for IIFE brace-counting, jcodemunch for the 500KB file cap) to land here.
- Sverklo still wins P4 (0.71), slightly down from 0.76 pre-fix. Tracked in #28.
- Token economy: sverklo at 469 avg input tokens, roughly 2.6–2.7× lower than smart-grep (1,220) and jcodemunch (1,267). Single tool call per task vs grep's 4–7.
- Smart-grep still wins P2 (0.40): tuned ripgrep on call-site references is genuinely competitive — sverklo at 0.25 here regressed slightly (#28).
How to reproduce: npm install -g sverklo@0.20.2 && cd /path/to/repo && npm run bench:quick. Bench harness lives at github.com/sverklo/sverklo/tree/main/benchmark. Issues #26, #27, #28 document the methodology iterations that produced this run.
May 4, 2026 AM — pre-fix table (sverklo v0.20.1, kept for diff)
| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | tools/task | tok/correct (gated) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 90 | 0.29 | 0.10 | 0.18 | 0.53 | 0.67 | 20,278 | 6.5 | 2,403 |
| smart-grep | 90 | 0.49 | 0.43 | 0.40 | 0.59 | 0.67 | 1,220 | 4.9 | 219 |
| sverklo v0.20.1 | 90 | 0.45 | 0.30 | 0.34 | 0.76 | 0.67 | 449 | 1.0 | 337 |
| jcodemunch v1.80.9 | 90 | 0.32 | 0.73 | 0.00 | 0.46 | 0.00 | 1,267 | 1.2 | 625 |
| gitnexus | 90 | 0.25 | 0.27 | 0.00 | 0.30 | 0.67 | 372 | 1.2 | 207 |
Pre-v0.20.2 numbers preserved here so the diff between the two runs is auditable. Raw data at benchmark/results/2026-05-04T14-13-23-716Z/.
May 2026 — all 5 baselines (60-task run: express + sverklo, before lodash was added)
| baseline | n | F1 | P1 | P2 | P4 | P5 | avg tokens | cold (ms) | warm (ms) |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 60 | 0.290 | 0.15 | 0.26 | 0.43 | 0.50 | 17,169 | 0 | 4,779 |
| smart-grep | 60 | 0.450 | 0.40 | 0.46 | 0.49 | 0.50 | 1,216 | 0 | 2,258 |
| sverklo | 60 | 0.449 | 0.45 | 0.27 | 0.75 | 0.50 | 386 | 1,159 | 38 |
| jcodemunch | 60 | 0.281 | 0.65 | 0.00 | 0.38 | 0.00 | 5,351 | 718 | 13 |
| gitnexus | 60 | 0.260 | 0.40 | 0.01 | 0.25 | 0.50 | 543 | 452 | 584 |
How to read this: No single baseline dominates. Different tools win different categories. The story isn't "sverklo beats everything" — it's "different retrieval substrates have different strengths, and the load-bearing axis depends on what you're optimizing for." Token economy + P4 are sverklo's clearest wins; P1 goes to jcodemunch; P2 to smart-grep.
Honest false positives we filed back to upstream:
- jcodemunch flags Express's `createApplication` as dead code — CommonJS `module.exports = X` isn't modeled as a use site, so the only export of the entire module appears to have no callers.
- gitnexus's `impact` analysis returns 0 affected modules for `createApplication` — same blind spot.
- Both tools' P2 implementations are import-graph-only; that should be advertised more prominently in their docs. (The snippet below illustrates the distinction.)
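For readers who haven't hit the distinction, a minimal sketch of both retrieval models on the Express-shaped case — file names and code are illustrative, not the actual Express source:

```ts
// Illustrative, simplified. Minimal CommonJS ambient declarations so the
// sketch type-checks standalone.
declare const module: { exports: unknown };
declare function require(id: string): any;

// ---- lib/express.js (the library side) ----
function createApplication() { /* build and return the app */ }
module.exports = createApplication; // the module's ONLY export; if this line
                                    // isn't modeled as a use site, the function
                                    // looks dead (the P5 false positive above)

// ---- consumer.js (the caller side) ----
const express = require('express'); // import-graph edge: consumer -> express
const app = express();              // call-site reference: what P2's
                                    // "find every caller of X" asks for
app.listen(3000);
```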
Reproducer (requires uvx for jcodemunch and npm i -g gitnexus on PATH):
git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build
# All 5 baselines
npm run bench:quick
# Single baseline
BASELINES=jcodemunch npm run bench:quick
BASELINES=gitnexus npm run bench:quick
BASELINES=sverklo npm run bench:quick
Original April 2026 run — 60 tasks, 3 baselines (naive-grep, smart-grep, sverklo).
Original numbers from the first public bench run (2026-04-07). The April run used a slightly older harness version; the May 4 PM run above is the canonical current data. Both are kept on this page so the reader can audit drift.
On expressjs/express and sverklo/sverklo:
- sverklo: F1 0.58, 255 average input tokens, 1.0 tool calls per task;
- smart-grep (a tuned grep with language filters and definition-shaped patterns): F1 0.67, 731 tokens, 11.8 tool calls;
- naive grep (the floor — `grep -rn <sym> .`, then read the top 10 files): F1 0.35, 15,814 tokens, 7.6 tool calls.
Sverklo's token spend: −98% vs naive grep, −65% vs tuned grep, −87% on average.
All baselines
| baseline | n | F1 | recall | prec | tokens | tools | wall (ms) | cold (ms) | gated tok/correct |
|---|---|---|---|---|---|---|---|---|---|
| naive-grep | 60 | 0.35 | 0.56 | 0.29 | 15814 | 7.6 | 1302 | 0 | 3557 (n=10) |
| smart-grep | 60 | 0.67 | 0.81 | 0.62 | 731 | 11.8 | 215 | 0 | 165 (n=28) |
| sverklo | 60 | 0.58 | 0.73 | 0.57 | 255 | 1.0 | 1 | 3690 | 203 (n=25) |
Read this carefully: smart-grep is a strong baseline. A tuned grep with language filters and definition-shaped patterns has higher F1 (0.67 vs 0.58) on this 60-task slice. Sverklo wins on token economy and tool-call count by a large margin (62× fewer tokens than naive grep, 2.9× fewer than smart-grep, single tool call vs 7-12). For an AI agent with a 200K token context window, that's the load-bearing axis. For a human standing at a terminal with `rg`, smart-grep is fine.
Per-category breakdown
P1 — Definition lookup (n=20)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.15 | 23337 | 339 | 8.1 |
| smart-grep | 0.60 | 196 | 51 | 2.0 |
| sverklo | 0.75 | 283 | 0 | 1.0 |
Sverklo wins. Single tool call (sverklo_lookup) vs 8 grep iterations.
P2 — Reference finding (n=20)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.39 | 21925 | 345 | 7.0 |
| smart-grep | 0.81 | 224 | 17 | 1.0 |
| sverklo | 0.56 | 157 | 0 | 1.0 |
Smart-grep wins. Reference finding on Express/sverklo turns out to be a regex problem grep handles well; sverklo's symbol-graph helps less than we'd hoped on this slice. Token economy still favours sverklo.
P4 — File dependencies (n=10)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.51 | 2918 | 280 | 2.0 |
| smart-grep | 0.63 | 1058 | 16 | 2.0 |
| sverklo | 0.86 | 74 | 0 | 1.0 |
Sverklo wins decisively. sverklo_deps against the indexed import graph is what graph-based retrieval is supposed to do.
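A minimal sketch of why this query is cheap once the graph exists — assuming a plain adjacency map and illustrative file names, not sverklo's actual data model:

```ts
// File-dependency lookup (P4) against a prebuilt import graph.
type ImportGraph = Map<string, string[]>; // file -> files it imports directly

function transitiveDeps(graph: ImportGraph, entry: string): Set<string> {
  const seen = new Set<string>();
  const stack = [entry];
  while (stack.length > 0) {
    const file = stack.pop()!;
    for (const dep of graph.get(file) ?? []) {
      if (!seen.has(dep)) {
        seen.add(dep);
        stack.push(dep); // depth-first walk; order doesn't matter for membership
      }
    }
  }
  return seen;
}

// One lookup plus a graph walk — no file reads — which is why the token cost
// lands near 74 instead of 1,000+.
const graph: ImportGraph = new Map([
  ['lib/express.js', ['lib/application.js', 'lib/router/index.js']],
  ['lib/application.js', ['lib/router/index.js', 'lib/view.js']],
]);
console.log(transitiveDeps(graph, 'lib/express.js'));
// -> Set { 'lib/application.js', 'lib/router/index.js', 'lib/view.js' }
```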
P5 — Dead code (n=10)
| baseline | F1 | tokens | wall (ms) | tools |
|---|---|---|---|---|
| naive-grep | 0.50 | 1442 | 6164 | 13.5 |
| smart-grep | 0.55 | 2488 | 1138 | 63.0 |
| sverklo | 0.02 | 579 | 3 | 1.0 |
Sverklo loses badly here. The current sverklo_refs doesn't catch dynamic invocations and deserialization-driven calls that smart-grep finds via aggressive whole-file reads. P5 is the next slice we plan to fix.
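The kind of call a symbol-graph reference lookup misses — a hypothetical example, not one of the bench tasks:

```ts
// P5 failure mode: exportUsers is never called by identifier, only via a
// string that arrives in deserialized input, so a refs lookup sees zero
// call sites and flags it as dead.
const jobs: Record<string, () => void> = {
  exportUsers: () => { /* ... */ },
  purgeSessions: () => { /* ... */ },
};

function runJob(payload: string): void {
  const { job } = JSON.parse(payload) as { job: string }; // e.g. '{"job":"exportUsers"}'
  jobs[job]?.(); // deserialization-driven call: invisible to the reference graph,
                 // but an aggressive whole-file read can still spot the pattern.
}
```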
Where sverklo wins (full list)
| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok |
|---|---|---|---|---|---|
| express/ex-p1-02 | P1 | 1.00 | 0.00 | 769 | 10615 |
| express/ex-p1-03 | P1 | 1.00 | 0.00 | 692 | 6844 |
| express/ex-p1-09 | P1 | 1.00 | 0.00 | 128 | 5920 |
| sverklo/sv-p4-05 | P4 | 1.00 | 0.50 | 50 | 874 |
| express/ex-p4-04 | P4 | 1.00 | 0.50 | 36 | 3781 |
| sverklo/sv-p4-04 | P4 | 1.00 | 0.67 | 42 | 928 |
| express/ex-p4-05 | P4 | 1.00 | 0.68 | 41 | 1316 |
| express/ex-p4-02 | P4 | 0.90 | 0.68 | 79 | 1345 |
| sverklo/sv-p4-02 | P4 | 0.86 | 0.71 | 40 | 334 |
| sverklo/sv-p4-03 | P4 | 0.86 | 0.75 | 59 | 754 |
| sverklo/sv-p4-01 | P4 | 0.80 | 0.69 | 232 | 1373 |
Where sverklo loses (the honesty section)
If you skip this section, you're doing benchmark cherry-picking. We're not.
| Task | Category | sverklo F1 | best grep F1 | sverklo tok | best grep tok | note |
|---|---|---|---|---|---|---|
| express/ex-p5-01 | P5 | 0.00 | 1.00 | 535 | 0 | missed |
| express/ex-p5-02 | P5 | 0.00 | 1.00 | 535 | 0 | missed |
| express/ex-p5-03 | P5 | 0.00 | 1.00 | 535 | 0 | missed |
| express/ex-p2-04 | P2 | 0.00 | 1.00 | 30 | 49 | missed |
| sverklo/sv-p2-04 | P2 | 0.50 | 1.00 | 58 | 67 | vs smart-grep |
| sverklo/sv-p2-06 | P2 | 0.40 | 0.83 | 137 | 205 | vs smart-grep |
| express/ex-p2-01 | P2 | 0.27 | 0.63 | 442 | 701 | vs smart-grep |
The dead-code (P5) miss is structural — sverklo's reference graph doesn't catch dynamic invocations and deserialization-driven calls. The reference-finding (P2) gap is closer; smart-grep's regex variants happen to match a few cases sverklo's symbol resolution doesn't.
What this benchmark does NOT measure
- Real coding-task latency. 90 retrieval primitives, not 90 end-to-end agent runs. The token economy translates to real-world savings, but a follow-up bench:swe (65 SWE-bench-style questions × 5 OSS repos) measures that and is on the public roadmap.
- Anything beyond Express + lodash + sverklo. The current slice is three JS/TS codebases. Coverage on Go, Python, Rust is the next dataset extension.
- Cross-repo retrieval. The bench is single-repo; sverklo's workspace and cross-repo features aren't exercised.
- Cost of indexing. The 3690ms cold-start figure is the index build for sverklo. We list it as a separate column rather than amortizing it into wall time, so you can decide whether your usage pattern justifies it.
Methodology
- Tasks: 90 hand-verified retrieval primitives across expressjs/express, lodash/lodash, and sverklo/sverklo. Each dataset contributes 30 tasks distributed across P1 (definition lookup, 10), P2 (reference finding, 10), P4 (file dependencies, 5), P5 (dead-code detection, 5).
- Metrics: F1, recall, precision, total input tokens, tool-call count, wall time. tokens_per_correct_answer = input_tokens / max(recall, 0.01) — lower is better. The gated column averages only over runs where F1 ≥ 0.8 — refusing to reward "found nothing cheaply". (Sketched in code after this list.)
- Tolerances: P1 uses a ±3-line tolerance, P2 uses ±2 lines, P4/P5 use set membership.
- naive-grep floor: `grep -rn <sym> .`, then read the top 10 files in full.
- smart-grep: language filters, ±10-line context reads, definition-shaped patterns. The strong baseline.
- sverklo: spawns the MCP stdio server once per dataset; cold-start is the index build.
- Run environment: Apple Silicon laptop, Node 25, sverklo running with `--expose-gc`, ONNX all-MiniLM-L6-v2 (~90 MB on disk).
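A small sketch of the scoring arithmetic above, under one plausible reading of the gated column — names are illustrative, not the harness's actual API (the real code lives in benchmark/):

```ts
// Per-task scoring as described in the methodology.
interface TaskResult {
  f1: number;          // against hand-verified ground truth, with the per-category
  recall: number;      // line tolerances applied (±3 for P1, ±2 for P2)
  inputTokens: number; // total tokens the baseline fed the agent
}

// tokens_per_correct_answer = input_tokens / max(recall, 0.01): cheap-but-wrong
// answers are penalised rather than rewarded; the 0.01 floor avoids dividing by 0.
function tokensPerCorrect(r: TaskResult): number {
  return r.inputTokens / Math.max(r.recall, 0.01);
}

// Gated column: average only over tasks the baseline essentially solved
// (F1 >= 0.8), refusing to reward "found nothing cheaply".
function gatedTokensPerCorrect(results: TaskResult[]): number | null {
  const solved = results.filter((r) => r.f1 >= 0.8);
  if (solved.length === 0) return null;
  return solved.reduce((sum, r) => sum + tokensPerCorrect(r), 0) / solved.length;
}
```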
Reproducing this
git clone https://github.com/sverklo/sverklo && cd sverklo
npm install
npm run build
npm run bench:quick # all baselines, all datasets
BASELINES=sverklo,jcodemunch npm run bench:quick # single baseline filter
DATASETS=express npm run bench:quick # single dataset filter
Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors this page's tables. Disagreements with our numbers are useful — file an issue with your machine spec and the run timestamp.
Submitting a new baseline
If you maintain a code-search tool, code-intelligence MCP server, or retrieval system, you can have it benchmarked here on the same task suite. Open a PR to sverklo/sverklo adding benchmark/src/baselines/<your-tool>.ts — auto-bench CI runs on the PR within ~10 minutes against the express dataset (~30 tasks) and posts a results-table comment back. You don't need to run the harness locally first; CI does it. Methodology repo: github.com/sverklo/sverklo-bench. Workflow source: .github/workflows/auto-bench.yml. Tracking: sverklo-bench#4.
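The adapter contract is defined by the harness, not here; the rough shape is something like the following — hypothetical names, so read an existing file under benchmark/src/baselines/ before opening the PR:

```ts
// benchmark/src/baselines/your-tool.ts — a HYPOTHETICAL adapter shape, not the
// real contract; the existing baselines in the repo are the authoritative example.
export interface RetrievalTask {
  id: string;                          // e.g. "express/ex-p1-02"
  category: 'P1' | 'P2' | 'P4' | 'P5';
  repoPath: string;                    // checkout of the dataset at the pinned commit
  query: string;                       // e.g. "find the definition of createApplication"
}

export interface RetrievalAnswer {
  locations: { file: string; line: number }[]; // scored against ground truth
  inputTokens: number;                 // what the harness charges your tool for
  toolCalls: number;
}

export async function runTask(task: RetrievalTask): Promise<RetrievalAnswer> {
  // Shell out to your CLI, talk to your MCP server, hit an HTTP endpoint — anything,
  // as long as you report every location you'd hand the agent and what it cost.
  throw new Error(`not implemented for ${task.id}`);
}
```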
Performance benchmarks (separate)
This page is the retrieval benchmark. We also publish performance numbers (cold index time, search latency, impact-analysis time) on five real OSS codebases at /benchmarks/. Both are reproducible from BENCHMARKS.md.
Cite this
If you reference this benchmark in academic work or comparison material:
@misc{sverklo_bench_primitives_2026,
title = {Sverklo bench:primitives — a 90-task retrieval evaluation for AI coding agents},
author = {Groshin, Nikita},
year = {2026},
doi = {10.5281/zenodo.19802051},
url = {https://sverklo.com/bench/}
}
A few details we sweated
Hand-verified ground truth. Every one of the 90 task answers was inspected by hand at the fixed commit. Auto-generated ground truth from existing tooling correlates with whatever generated it; we wanted the harness to be cleanly testable against any retrieval system, including future ones we haven't built.
The losing slice gets the same prominence as the winning slice. The dead-code (P5) F1 = 0.02 number lives in the same table as the wins, two scrolls apart. A bench that only releases when the maintainer wins is marketing; a bench that releases when the maintainer loses is a bench. The contribution is the harness, not the leaderboard.
Tokens-per-correct-answer is the primary axis, F1 is secondary. Most retrieval evaluations report F1 first because they were designed for human-facing search. AI agents inside bounded context windows pay for every token returned; that opportunity cost compounds across an editing session. We report both axes and explain the tradeoff in plain terms; the agent-relevant axis is the one that earned the headline callouts above.
Naive grep is the floor, not the strawman. The naive baseline runs grep -rn <sym> . then reads the top 10 matching files in full — the same thing a Claude Code agent does on its first 5 minutes against a fresh codebase. If your bench's naive baseline scores 5% F1, you're probably measuring against a strawman; ours scores 0.35, which matches what real agents actually achieve on real tasks.
Cold-start is a separate column, not amortized. Sverklo's index build is 3,690 ms on this corpus. We list it as its own column rather than averaging it into wall time so you can decide whether your usage pattern justifies the upfront cost. For a 10-task session it dominates; for a multi-hour session it disappears.
Raw JSONL output, not just aggregates. benchmark/results/<timestamp>/raw.jsonl has every task's input, the system's output, and the per-task scoring breakdown. Disagreements with our aggregates are useful — file an issue with your machine spec and run timestamp and we'll triage.
Three codebases is still small. We say so. The "What this benchmark does NOT measure" section above isn't decorative. The next dataset extension is Go / Python / Rust on 5+ codebases. The current 90-task slice is what we have today, with the limitations explicitly listed.
Get started
If the token-economy numbers look interesting:
npm install -g sverklo
cd your-project
sverklo init
sverklo init auto-detects which AI coding agents you have installed (Claude Code, Cursor, Windsurf, Zed, Antigravity) and writes the right MCP config files.