# MCP code-intel index
A reproducible ranking of MCP code-intelligence servers on retrieval quality, token economy, and tool-call count. Sverklo is scored on its own board — including the slices it loses.
Smithery tells you it installs. Sverklo tells you if the code is rotting.
## Latest results
| baseline | F1 | P1 | P2 | P4 | P5 | tokens | tools/task | audit |
|---|---|---|---|---|---|---|---|---|
| sverklo (us) | 0.58 | 0.70 | 0.29 | 0.78 | 0.75 | 498 | 1.0 | B |
| smart-grep | 0.41 | 0.33 | 0.30 | 0.46 | 0.75 | 963 | 4.1 | — |
| jcodemunch | 0.32 | 0.78 | 0.00 | 0.34 | 0.01 | 1,178 | 1.2 | C |
| naive-grep | 0.27 | 0.07 | 0.14 | 0.42 | 0.75 | 24,194 | 6.1 | — |
| gitnexus | 0.24 | 0.23 | 0.00 | 0.25 | 0.75 | 333 | 1.2 | F |
## Reproduce these numbers locally
```sh
git clone https://github.com/sverklo/sverklo && cd sverklo \
  && git checkout c6a50e54fa5f36615f2f20c62cbd1f538494e060 \
  && npm install && npm run build && npm run bench:quick
```
The harness writes `summary.json`, `raw.jsonl`, and `report.md` to `benchmark/results/<timestamp>/`. Diff against this page; file an issue at sverklo-bench/issues if your numbers differ.
Raw artifact for this run: sverklo/sverklo/tree/c6a50e54fa5f36615f2f20c62cbd1f538494e060/benchmark/results/2026-05-07T16-51-40-288Z
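When you diff a local run against this page, a small script beats eyeballing the table. The sketch below compares a local summary against the published headline numbers; the field names (`f1`, `tokensPerTask`, `toolsPerTask`) are assumptions for illustration, so check the real `summary.json` schema in the artifact before relying on them.

```typescript
// Sketch: flag drift between a local bench run and the published table.
// Row shape is an assumption; inspect the actual summary.json for the schema.
interface BaselineRow {
  f1: number;
  tokensPerTask: number;
  toolsPerTask: number;
}

// Published headline numbers (subset of the table above).
const published: Record<string, BaselineRow> = {
  "sverklo": { f1: 0.58, tokensPerTask: 498, toolsPerTask: 1.0 },
  "smart-grep": { f1: 0.41, tokensPerTask: 963, toolsPerTask: 4.1 },
};

// Report any baseline whose local F1 differs by more than `tol`.
function drift(local: Record<string, BaselineRow>, tol = 0.02): string[] {
  const out: string[] = [];
  for (const [name, pub] of Object.entries(published)) {
    const loc = local[name];
    if (!loc) {
      out.push(`${name}: missing from local results`);
      continue;
    }
    if (Math.abs(loc.f1 - pub.f1) > tol) {
      out.push(`${name}: local F1 ${loc.f1} vs published ${pub.f1}`);
    }
  }
  return out;
}
```

An empty result means your run reproduced the published numbers within tolerance; anything else is worth an issue.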
## What this measures
### F1 (overall + per category)
Hand-verified retrieval tasks scored on F1. Per-category breakdown so wins on definition lookup don't paper over losses on reference finding. P1: definition lookup. P2: reference finding. P4: file dependencies. P5: dead-code detection.
### Tokens / task
Average input tokens the agent ingests per task. The load-bearing axis for AI agents inside bounded context windows. Naive grep returns ~20K; sverklo ~500. Lower is better.
### Tools / task
Average tool calls per task. A baseline that wins F1 by making 12 calls is a different product than one that wins F1 in 1 call. Lower = less round-trip latency.
### What's deliberately NOT a column
No composite score. No A-F letter grade. No "verdict." Each axis stays independent so the methodology survives critique. RFC #5 documents the metric set.
## How to read this
Different baselines win different categories. Smart-grep beats sverklo on P2 reference finding (a tuned ripgrep is genuinely competitive on call-site lookups). Jcodemunch-mcp ties or beats sverklo on P1 definition lookup (their tree-sitter symbol indexing is sharp). Sverklo wins P4 file dependencies decisively (the symbol graph + PageRank is what graph-based retrieval is supposed to do). Naive grep is the floor.
The story isn't "sverklo beats everything." It's different retrieval substrates have different strengths, and the load-bearing axis depends on what you're optimizing for. For agents inside bounded context windows, the token economy is the load-bearing axis; for human-facing search, F1 wins. The page above lets you sort either way.
## Methodology
- 120 hand-verified tasks across four OSS codebases (express 4.21.1, lodash 4.17.21, sverklo, requests v2.32.3). 30 tasks per dataset distributed across P1 (10), P2 (10), P4 (5), P5 (5).
- Five baselines: naive-grep (the floor), smart-grep (tuned ripgrep), sverklo, jcodemunch-mcp, GitNexus.
- Tolerances: P1 ±3 lines, P2 ±2 lines, P4/P5 set membership. Documented in sverklo/sverklo-bench.
- Refresh cadence: quarterly, maintainer-triggered via the `bench-refresh` workflow on the sverklo repo. No on-push auto-publish — the ranking can only change through discrete, auditable maintainer actions.
- Sverklo on its own board: every published refresh includes sverklo's losing slices. The /bench/ page is the long-form artifact with the per-task breakdown of where sverklo loses.
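The tolerance rules in the list above can be sketched as matching predicates. The `LineHit` shape and function names here are illustrative assumptions, not the bench repo's actual types.

```typescript
// Hypothetical location type for line-based tasks.
type LineHit = { file: string; line: number };

// P1 (±3 lines) and P2 (±2 lines): a prediction matches the gold answer
// if it lands in the same file within the tolerance window.
function lineMatch(pred: LineHit, gold: LineHit, tol: number): boolean {
  return pred.file === gold.file && Math.abs(pred.line - gold.line) <= tol;
}

// P4/P5: scored on set membership, so order and line numbers are ignored.
function setMatch(pred: string[], gold: string[]): boolean {
  const goldSet = new Set(gold);
  return pred.length === goldSet.size && pred.every((p) => goldSet.has(p));
}
```

The line tolerances exist so a baseline isn't penalized for pointing at, say, a decorator or doc comment immediately above a definition.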
## How a maintainer adds their tool
- Read CONTRIBUTING.md in the methodology repo.
- Open a PR against sverklo/sverklo adding `benchmark/src/baselines/<your-tool>.ts` implementing the `Baseline` interface.
- The auto-bench CI workflow runs on the PR (express dataset, ~10 min), posts a results table, and uploads the raw artifact. You don't need to run anything locally.
- If the implementation is faithful to your tool's intended use, we merge. Then the next quarterly refresh picks it up here.
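As a rough orientation, a baseline file might look like the sketch below. The `Baseline` and `BaselineResult` shapes are guesses from context; copy the real interface from `benchmark/src/baselines/` before opening a PR.

```typescript
// Hypothetical result shape: what the harness plausibly needs per task.
interface BaselineResult {
  answers: string[];   // retrieved locations/symbols, format per task type
  inputTokens: number; // tokens the agent ingested for this task
  toolCalls: number;   // round trips made to answer the task
}

// Hypothetical Baseline interface, assumed from the PR instructions above.
interface Baseline {
  name: string;
  run(taskPrompt: string, repoPath: string): Promise<BaselineResult>;
}

// Minimal stub wiring an imaginary tool into the interface.
const myTool: Baseline = {
  name: "my-tool",
  async run(_taskPrompt, _repoPath) {
    // A real implementation would query your MCP server here;
    // this stub just returns an empty result.
    return { answers: [], inputTokens: 0, toolCalls: 0 };
  },
};
```

"Faithful to your tool's intended use" is the merge bar, so the `run` implementation should call your server the way a real agent would, not a tuned special case for the bench tasks.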
Disagreements with the methodology, the metric set, or specific task scoring: file an issue against sverklo-bench/issues. Open invitation. We've already shipped fixes to our own parser in response to bench findings (the bench-loop post documents the pattern).
## Embed sverklo audit in your CI

If you maintain an MCP server (or any code-intel project), you can add the sverklo audit to your own CI with a few lines of workflow YAML. The audit runs on the GitHub Actions runner — your code never leaves the workflow:
```yaml
- uses: sverklo/sverklo@main
  with:
    fail_on: ""     # or "F" to block merges on F-grade
    comment: "true" # post an idempotent PR comment
```
The Action posts a markdown comment with the overall grade plus a per-dimension table (dead code, circular deps, coupling, security). Idempotent — re-runs update in place. Methodology link in every comment so disagreements have a place to land. Source: sverklo/sverklo/action.yml.
Or request a one-time audit posted publicly at sverklo.com/report/<owner>/<repo>/: file the audit-request issue.
## The wedge
Other surfaces in the MCP-server space score on different axes. Glama is a directory with letter grades on metadata quality. MseeP scores npm-audit-shaped security. PulseMCP curates editorially. The official Registry is neutral substrate, no opinion.
None of them measure whether the MCP server actually retrieves the right code. That's the axis above. If your team picks an MCP server based on README polish or install count, the failure mode shows up in production: the agent hallucinates symbol names because retrieval missed the relevant chunk. The bench above measures the failure mode directly.
## Cite this
```bibtex
@misc{sverklo_mcp_index_2026,
  title  = {Sverklo MCP code-intel index — comparative evaluation of MCP retrieval servers},
  author = {Groshin, Nikita},
  year   = {2026},
  doi    = {10.5281/zenodo.19802051},
  url    = {https://sverklo.com/mcp/}
}
```
Sverklo is itself one of the baselines on this page. The numbers above include sverklo's losing slices — that's the point. Methodology + raw artifacts at github.com/sverklo/sverklo-bench. The reproducer command is one click in the table above.