
MCP code-intel index

A reproducible ranking of MCP code-intelligence servers on retrieval quality, token economy, and tool-call count. Sverklo on its own board — including the slices it loses.

Smithery tells you it installs. Sverklo tells you if the code is rotting.

Latest results

baseline       F1    P1    P2    P4    P5    tokens   tools/task  audit
sverklo (us)   0.58  0.70  0.29  0.78  0.75     498   1.0         B
smart-grep     0.41  0.33  0.30  0.46  0.75     963   4.1
jcodemunch     0.32  0.78  0.00  0.34  0.01   1,178   1.2         C
naive-grep     0.27  0.07  0.14  0.42  0.75  24,194   6.1
gitnexus       0.24  0.23  0.00  0.25  0.75     333   1.2         F

120 tasks across 4 datasets (express, lodash, sverklo, requests). Run 2026-05-07T16-51-40-288Z at sverklo c6a50e5. Published 2026-05-07 17:19 UTC.

Reproduce these numbers locally
git clone https://github.com/sverklo/sverklo && cd sverklo && git checkout c6a50e54fa5f36615f2f20c62cbd1f538494e060 && npm install && npm run build && npm run bench:quick

The harness writes summary.json, raw.jsonl, and report.md to benchmark/results/<timestamp>/. Diff against this page; file an issue at sverklo-bench/issues if your numbers differ.
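If you'd rather diff programmatically than read report.md by hand, a sketch like the one below works. Note that summary.json's schema isn't documented on this page, so the baselines array and its name/f1 fields are assumptions to adjust against your run's actual output; the published F1 values are copied straight from the table above.

// diff-run.ts: compare a local run's F1 column against the published numbers above.
// Assumption: summary.json exposes a `baselines` array with `name` and `f1` fields; rename to match the real schema.
import { readFileSync } from "node:fs";

const published: Record<string, number> = {
  sverklo: 0.58,
  "smart-grep": 0.41,
  jcodemunch: 0.32,
  "naive-grep": 0.27,
  gitnexus: 0.24,
};

// Pass the timestamped results file, e.g. benchmark/results/2026-05-07T16-51-40-288Z/summary.json
const summary = JSON.parse(readFileSync(process.argv[2], "utf8"));

for (const row of summary.baselines as { name: string; f1: number }[]) {
  const ref = published[row.name];
  if (ref === undefined) continue;
  const delta = row.f1 - ref;
  console.log(`${row.name}: local ${row.f1.toFixed(2)} vs published ${ref.toFixed(2)} (delta ${delta.toFixed(2)})`);
}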

Raw artifact for this run: sverklo/sverklo/tree/c6a50e54fa5f36615f2f20c62cbd1f538494e060/benchmark/results/2026-05-07T16-51-40-288Z

What this measures

F1 (overall + per category)

Hand-verified retrieval tasks scored on F1. Per-category breakdown so wins on definition lookup don't paper over losses on reference finding. P1: definition lookup. P2: reference finding. P4: file dependencies. P5: dead-code detection.
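For concreteness, this is the standard F1 formula applied to one task's retrieved set versus its hand-verified expected set. It's the textbook definition, not the harness's actual scorer, and the function name is illustrative.

// f1.ts: textbook precision/recall/F1 over retrieved vs. expected sets (illustrative, not the harness's scorer).
function f1Score(retrieved: Set<string>, expected: Set<string>): number {
  const hits = [...retrieved].filter((item) => expected.has(item)).length;
  if (hits === 0) return 0;
  const precision = hits / retrieved.size;
  const recall = hits / expected.size;
  return (2 * precision * recall) / (precision + recall);
}

// Example: 3 files returned, 2 of them in a 4-file ground truth.
// precision = 2/3, recall = 2/4, F1 ≈ 0.57
console.log(f1Score(new Set(["a.ts", "b.ts", "x.ts"]), new Set(["a.ts", "b.ts", "c.ts", "d.ts"])));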

Tokens / task

Average input tokens the agent ingests per task. The load-bearing axis for AI agents inside bounded context windows. Naive grep returns ~24K tokens per task; sverklo ~500. Lower is better.

Tools / task

Average tool calls per task. A baseline that wins F1 by making 12 calls is a different product than one that wins F1 in 1 call. Lower = less round-trip latency.
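Both averages fall out of raw.jsonl with a few lines. The per-record field names below (baseline, inputTokens, toolCalls) are assumptions about its shape rather than a documented schema, so treat this as a sketch of the arithmetic, not the harness code.

// averages.ts: recompute tokens/task and tools/task per baseline from a run's raw.jsonl.
// Assumption: each JSONL record carries `baseline`, `inputTokens`, and `toolCalls`; rename to match the real fields.
import { readFileSync } from "node:fs";

interface TaskRecord {
  baseline: string;
  inputTokens: number;
  toolCalls: number;
}

const records = readFileSync(process.argv[2], "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line) as TaskRecord);

const agg = new Map<string, { tokens: number; tools: number; n: number }>();
for (const rec of records) {
  const a = agg.get(rec.baseline) ?? { tokens: 0, tools: 0, n: 0 };
  a.tokens += rec.inputTokens;
  a.tools += rec.toolCalls;
  a.n += 1;
  agg.set(rec.baseline, a);
}

for (const [name, { tokens, tools, n }] of agg) {
  console.log(`${name}: ${Math.round(tokens / n)} tokens/task, ${(tools / n).toFixed(1)} tools/task`);
}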

What's deliberately NOT a column

No composite score. No A-F letter grade. No "verdict." Each axis stays independent so the methodology survives critique. RFC #5 documents the metric set.

How to read this

Different baselines win different categories. Smart-grep beats sverklo on P2 reference finding (a tuned ripgrep is genuinely competitive on call-site lookups). Jcodemunch-mcp ties or beats sverklo on P1 definition lookup (their tree-sitter symbol indexing is sharp). Sverklo wins P4 file dependencies decisively (the symbol graph + PageRank is what graph-based retrieval is supposed to do). Naive grep is the floor.

The story isn't "sverklo beats everything." It's that different retrieval substrates have different strengths, and the load-bearing axis depends on what you're optimizing for: for agents inside bounded context windows it's token economy; for human-facing search, F1 wins. The table above lets you sort either way.

Methodology

How a maintainer adds their tool

  1. Read CONTRIBUTING.md in the methodology repo.
  2. Open a PR against sverklo/sverklo adding benchmark/src/baselines/<your-tool>.ts implementing the Baseline interface (a rough sketch of what that file might look like follows this list).
  3. The auto-bench CI workflow runs on the PR (express dataset, ~10min), posts a results table, and uploads the raw artifact. You don't need to run anything locally.
  4. If the implementation is faithful to your tool's intended use, we merge. Then the next quarterly refresh picks it up here.
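To make step 2 concrete, here is a minimal sketch of what such a file might contain. The real Baseline interface lives in the repo; every name below (Baseline, RetrievalTask, RetrievalResult, run, and the helper stub) is an illustrative assumption, not the actual contract.

// benchmark/src/baselines/my-tool.ts (illustrative sketch; check the repo for the real Baseline contract)
interface RetrievalTask {
  query: string;    // e.g. "where is parseConfig defined?"
  repoPath: string; // local checkout of the dataset repo
}

interface RetrievalResult {
  files: string[];      // retrieved file paths, scored against hand-verified ground truth
  inputTokens: number;  // tokens the agent would ingest from your tool's output
  toolCalls: number;    // round trips used to answer the task
}

interface Baseline {
  name: string;
  run(task: RetrievalTask): Promise<RetrievalResult>;
}

// Stand-in for your MCP server's real client; replace with an actual call.
async function searchWithMyTool(repoPath: string, query: string): Promise<string[]> {
  return [];
}

export const myTool: Baseline = {
  name: "my-tool",
  async run(task) {
    const files = await searchWithMyTool(task.repoPath, task.query);
    return { files, inputTokens: 0, toolCalls: 1 }; // fill in real token accounting
  },
};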

Disagreements with the methodology, the metric set, or specific task scoring: file an issue against sverklo-bench/issues. Open invitation. We've already shipped fixes to our own parser in response to bench findings (the bench-loop post documents the pattern).

Embed sverklo audit in your CI

If you maintain an MCP server (or any code-intel project), you can add the sverklo audit to your own CI with a few lines of workflow YAML. The audit runs on the GitHub Actions runner; your code never leaves the workflow:

- uses: sverklo/sverklo@main
  with:
    fail_on: ""        # or "F" to block merges on F-grade
    comment: "true"    # post idempotent PR comment

The Action posts a markdown comment with the overall grade plus a per-dimension table (dead code, circular deps, coupling, security). Idempotent — re-runs update in place. Methodology link in every comment so disagreements have a place to land. Source: sverklo/sverklo/action.yml.

Or request a one-time audit posted publicly at sverklo.com/report/<owner>/<repo>/: file the audit-request issue.

The wedge

Other surfaces in the MCP-server space score on different axes. Glama is a directory with letter grades on metadata quality. MseeP scores npm-audit-shaped security. PulseMCP curates editorially. The official Registry is neutral substrate, no opinion.

None of them measure whether the MCP server actually retrieves the right code. That's the axis above. If your team picks an MCP server based on README polish or install count, the failure mode shows up in production: the agent hallucinates symbol names because retrieval missed the relevant chunk. The bench above measures the failure mode directly.

Cite this

@misc{sverklo_mcp_index_2026,
  title  = {Sverklo MCP code-intel index — comparative evaluation of MCP retrieval servers},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/mcp/}
}

Sverklo is itself one of the baselines on this page. The numbers above include sverklo's losing slices; that's the point. Methodology + raw artifacts at github.com/sverklo/sverklo-bench. The reproducer command is listed above under "Reproduce these numbers locally".