
I added two competitors to my own benchmark. One of them beat me at P1.

2026-05-02 · ~7 min read · by Nikita Groshin

A user named Tom Hale opened issue #25 on the sverklo repo last week, asking if I'd benchmark sverklo against two named competitors: jcodemunch-mcp and GitNexus. I committed to running them within 48 hours, and did. Three things I didn't expect:

smart-grep ties sverklo on overall F1. jcodemunch wins P1 outright. And both of the new competitors return ~0 on P2 because they only track import sites, not call sites.

The setup

Sverklo's bench:primitives is a 60-task hand-verified retrieval evaluation. Four task categories: P1 (definition lookup), P2 (reference finding), P4 (file dependencies), P5 (dead code). Two real OSS codebases (Express and sverklo itself). Three baselines: naive-grep (the floor), smart-grep (a tuned grep with language filters and definition-shaped patterns), and sverklo.
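For a sense of what one of those tasks looks like, here's a rough sketch of the fixture shape; field names are illustrative, not the repo's actual schema:

// Illustrative only; the real fixture schema may differ.
interface BenchTask {
  id: string;                              // e.g. "express-p1-03"
  category: "P1" | "P2" | "P4" | "P5";     // definitions / references / file deps / dead code
  dataset: "express" | "sverklo";          // the two OSS codebases
  query: string;                           // e.g. "Where is createApplication defined?"
  expected: string[];                      // hand-verified answer locations the scorer checks against
}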

When Tom asked me to add jcodemunch-mcp and GitNexus — both direct competitors in the local-first MCP code-intel space — the right answer was obvious. The whole point of publishing a bench is to invite the comparison. Refusing to bench against named competitors looks like exactly what it is.

I shipped the two new baselines over a Saturday afternoon. The harness's Baseline interface is small (setupForDataset() + run(task) returning structured prediction + raw payload + tool-call count + timing), and both competitors expose enough surface to map cleanly to the four task categories:

P1 (definitions)  → search_symbols / context
P2 (references)   → find_references / impact
P4 (file deps)    → find_importers / cypher graph queries
P5 (dead code)    → get_dead_code_v2 / cypher
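In TypeScript terms, the adapter surface is roughly this; the method names come from the description above, and the result fields are paraphrased rather than the exact types in the repo:

// Paraphrase of the harness's Baseline interface; exact types live in the repo.
interface BaselineResult {
  prediction: string[];    // structured answer the scorer compares to ground truth
  rawPayload: string;      // whatever the tool actually returned, kept for debugging
  toolCalls: number;       // number of tool/CLI invocations the task cost
  durationMs: number;      // wall-clock time for the task
}

interface Baseline {
  setupForDataset(dataset: string): Promise<void>;   // index / warm up once per dataset
  run(task: BenchTask): Promise<BaselineResult>;     // one call per bench task (BenchTask as sketched above)
}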

Both competitors run as subprocesses — jcodemunch as a persistent stdio MCP server (one process per dataset, query via JSON-RPC), GitNexus as one-CLI-spawn-per-query. Cold-start indexing happens once per dataset; warm queries amortize the cost.
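The per-query path is the simpler of the two; a minimal sketch, with placeholder CLI arguments rather than GitNexus's real flags:

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// GitNexus adapter: one CLI spawn per query, run from the indexed repo's directory.
// The arguments here are placeholders; the real invocation is in the harness.
async function runGitnexusQuery(repoDir: string, args: string[]): Promise<string> {
  const { stdout } = await exec("gitnexus", args, { cwd: repoDir });
  return stdout;
}

// The jcodemunch adapter instead spawns one long-lived stdio MCP server per dataset
// and sends a JSON-RPC request per task, so the indexing cost is paid once.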

The 5-baseline matrix

Run on Express's 4.21.1 tag, 30 tasks (the Express slice of the 60-task bench):

Baseline      F1      P1      P2      P4      Tokens
naive-grep    0.290   0.15    0.26    0.43    17,169
smart-grep    0.450   0.40    0.46    0.49     1,216
sverklo       0.449   0.45    0.27    0.75       386
jcodemunch    0.281   0.65    0.00    0.38     5,351
gitnexus      0.260   0.40    0.01    0.25       543

The headline narrative many people would expect — "the open-source MCP code-intel server beats grep by N×" — is not what these numbers say. What they say:

Surprise 1: smart-grep ties sverklo on overall F1

0.450 vs 0.449. A gap of 0.001 is well inside the noise on a bench this size; for practical purposes it's a tie. A clever grep wrapper (language filters + definition-shaped patterns + ±10-line context reads) gets the same average accuracy on these tasks as a full local-first MCP code-intel server with a tree-sitter index, an ONNX embedding model, a PageRank-ranked import graph, and four years of papers behind the design.

This was the first instinct-violating result. I knew smart-grep was a strong baseline — that's why it's in the bench instead of just naive-grep — but I expected sverklo's structured retrieval to pull ahead on aggregate. It didn't.
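For a sense of what that strong baseline actually does, a "definition-shaped pattern" boils down to something like this sketch; the patterns in the bench are more thorough and cover more languages:

// Sketch of the smart-grep idea, not the bench's actual implementation:
// a language-filtered ripgrep query using a regex shaped like a definition.
function definitionPattern(symbol: string): string {
  // Matches common JS/TS definition forms: function, class, and const/let/var assignment.
  return `(function\\s+${symbol}\\b|class\\s+${symbol}\\b|(const|let|var)\\s+${symbol}\\s*=)`;
}

// The underlying query then has the shape:
//   rg --type js -n -C 10 "<definitionPattern(symbol)>"
// i.e. language filter, line numbers, and ±10 lines of context per hit.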

What did pull ahead is token economy. Smart-grep averages 1,216 tokens per task; sverklo averages 386. That's a 3× difference, and it stacks across an agent's full context window. For a human standing at a terminal with rg, smart-grep is genuinely fine. For an AI coding agent with bounded context, the differential is the load-bearing axis.

The honest version of "we beat grep" is: we beat naive grep cleanly (44× fewer tokens on this slice), we tie smart grep on accuracy and beat it on token cost (3× fewer), and we lose to smart grep on simplicity. Anyone publishing benchmarks against grep should be reporting smart-grep numbers, not just naive-grep.

Surprise 2: jcodemunch wins P1 (definition lookup)

0.65 F1 vs sverklo's 0.45 on the same 10 P1 tasks. That's a real gap, not noise.

P1 is the most basic retrieval primitive: "where is this symbol defined?" Sverklo's answer goes through hybrid search (BM25 + vector + PageRank). jcodemunch's tree-sitter symbol table is direct and apparently sharper on this slice.

What's interesting is that this is the slice where everyone "should" be tied — definition lookup is the easiest task type, and tree-sitter parsers across tools should all surface the same symbols. The gap suggests sverklo's ranking is over-weighting hits that aren't the canonical definition (e.g. a re-export or test fixture), and jcodemunch's lighter index is more decisive on the "first match wins" case.
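To make that hypothesis concrete, here's a toy illustration of the suspected failure mode; the weights and scores are invented, not sverklo's actual ranking code:

// Hypothetical numbers to illustrate the suspected over-weighting, not real scorer code.
interface Hit { file: string; bm25: number; vector: number; pagerank: number }

function score(h: Hit): number {
  // If the graph-centrality term carries too much weight, a widely-imported
  // re-export can outrank the file that actually defines the symbol.
  return 0.4 * h.bm25 + 0.4 * h.vector + 0.2 * h.pagerank;
}

const canonicalDef: Hit = { file: "lib/user-service.ts", bm25: 0.9, vector: 0.8, pagerank: 0.1 }; // score 0.70
const reExport: Hit     = { file: "index.ts",            bm25: 0.7, vector: 0.7, pagerank: 0.9 }; // score 0.74
// The re-export wins the "first match wins" slot even though it isn't the definition.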

This is on the next-up list to investigate. Honest credit where it's due: jcodemunch's symbol indexing is doing real work that sverklo isn't.

Surprise 3: jcodemunch and gitnexus return ~0 on P2

P2 is reference finding: "who calls this symbol?" Sverklo gets 0.27 on Express's 10 P2 tasks. Smart-grep gets 0.46. jcodemunch gets 0.00. GitNexus gets 0.01.

This isn't a bug. Both tools' reference-tracking is import-graph-only by design. They tell you which files import a symbol; they don't tell you where the symbol is called. For most agent workflows that's a real gap — an agent asking "what would break if I rename UserService.validate?" cares about call sites, not just import declarations.

The trace through Express makes this concrete: find_references(identifier: "createApplication") on jcodemunch returns 0 results. createApplication is the main export of the entire module — it's literally how anyone uses Express (const app = express()) — but no source file does import { createApplication }, because Express uses CommonJS module.exports = createApplication. jcodemunch's import-only tracking misses the entire usage surface.
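A minimal version of the mismatch, simplified rather than Express's exact source:

// How Express exposes it (CommonJS, simplified):
//   module.exports = createApplication;
//
// How every consumer uses it: a call site, never a named import.
import express from "express";   // default import of the CommonJS export object
const app = express();           // the call site that import-only tracking never sees
// No file anywhere writes `import { createApplication } from "express"`,
// so an import-graph-only find_references has nothing to return.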

I filed this back upstream as something worth surfacing in their docs, not a bug per se — the tool does what it says — but the limitation is currently invisible until you run it on a CommonJS codebase and notice every "find references" returns empty.

What's not in this writeup

Three things this benchmark doesn't measure that are worth saying out loud:

The reproducible setup

Anyone can run this:

# Prereqs:
#   uvx     (https://github.com/astral-sh/uv) — for jcodemunch
#   gitnexus — npm i -g gitnexus

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build

# All 5 baselines
npm run bench:quick

# Single baseline
BASELINES=jcodemunch npm run bench:quick
BASELINES=gitnexus   npm run bench:quick

Output lands in benchmark/results/<timestamp>/. Disagreements with these numbers are useful — open an issue with your machine spec and run timestamp.

Why publish numbers that aren't a clean win

The competitive default in the AI-coding-tools space right now is to publish the slice where you win and stay quiet on the slice where you don't. Greptile tweets the prompt-injection benchmark they win; Cursor links the agent-mode demo where their RAG works; Anthropic publishes the Claude vs GPT chart their model leads. It's not lying — it's selection.

The reason to do the opposite is that selection bias collapses on contact with users running their own tasks on their own codebases. Once an evaluator runs a third tool on the same fixtures, your selective publication becomes the thing that makes you look bad.

Tom asking the question forced the comparison. Running it forced the surprises. Publishing it forces the next round of work: figure out why sverklo's P1 ranking underperforms jcodemunch's tree-sitter resolution, and decide whether to make P2 reference tracking call-graph-aware (sverklo already has the data) or accept that smart-grep covers that slice well enough.

The bench page (sverklo.com/bench) now shows all 5 baselines side-by-side with both the wins and losses. The losses are the section that makes the wins meaningful.

Try it

Sverklo is MIT-licensed and runs on your laptop. The bench is open and reproducible.

npm install -g sverklo
cd your-project
sverklo init

GitHub: sverklo/sverklo · Full 5-baseline bench page · Issue #25 — original ask + raw numbers
