Engineering · Sverklo

bench:swe first results: where local-first code intelligence still misses

First complete cross-repository run of bench:swe across Express, NestJS, Vite, Prisma, and FastAPI. 38 of 65 questions hit perfect recall; mean recall is 66.2%. The interesting part isn’t the headline number. It’s the failure pattern (deeply-nested core files), the v0.18 fix we shipped to address it, and the experiment that disproved the strong form of our diagnosis (+0.003 aggregate MRR, well within noise). This post covers what we got wrong and what it teaches.

What bench:swe actually measures

bench:swe is the cross-repository half of the benchmark suite released alongside the sverklo paper. It exists because the other internal benchmark sverklo runs — bench:research, scored against sverklo’s own codebase — sits at 97% perfect recall, and 97% is exactly what you should expect from an evaluation written by the same team that wrote the retriever’s synonym list. bench:research is a regression baseline. bench:swe is the actual evaluation.

The setup: 65 hand-written research-style questions across five popular open-source projects (Express 5.0.1, NestJS 10.4.7, Vite 6.0.7, Prisma 6.1.0, FastAPI 0.115.6), 13 questions per repo, all pinned to specific commits. Each question has required-evidence files as ground truth — the canonical files an engineer would need to read to answer it correctly. A retriever passes if its top-50 results contain every required file. We score avg_recall (mean fraction of evidence found per question) and perfect_recall (count of questions for which all evidence was found).
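To make the two metrics concrete, here is a minimal sketch of how they could be computed from ranked results. The `TaskResult` shape and the function names are illustrative assumptions, not the harness's actual API; only the top-50 cutoff and the metric definitions come from the description above.

```typescript
// Hypothetical shape: the real harness's types are not shown in this post.
interface TaskResult {
  requiredFiles: string[];  // ground-truth evidence files for one question
  retrievedFiles: string[]; // ranked retriever output, best first
}

const TOP_K = 50; // cutoff stated in the post

// Fraction of required files that appear in the top-k results.
function taskRecall(task: TaskResult, k: number = TOP_K): number {
  if (task.requiredFiles.length === 0) return 1; // assumption: no evidence required counts as a pass
  const topK = new Set(task.retrievedFiles.slice(0, k));
  const found = task.requiredFiles.filter((f) => topK.has(f)).length;
  return found / task.requiredFiles.length;
}

// avgRecall: mean per-question recall. perfectRecall: count of questions with recall 1.
function scoreTasks(tasks: TaskResult[], k: number = TOP_K) {
  const recalls = tasks.map((t) => taskRecall(t, k));
  const avgRecall =
    recalls.length === 0 ? 0 : recalls.reduce((a, b) => a + b, 0) / recalls.length;
  const perfectRecall = recalls.filter((r) => r === 1).length;
  return { avgRecall, perfectRecall };
}

// Toy data, not from the benchmark: one full hit and one half hit.
const demoTasks: TaskResult[] = [
  { requiredFiles: ["a.ts", "b.ts"], retrievedFiles: ["a.ts", "x.ts", "b.ts"] },
  { requiredFiles: ["c.ts", "d.ts"], retrievedFiles: ["c.ts", "y.ts"] },
];
const demoScore = scoreTasks(demoTasks);
console.log(demoScore); // avgRecall 0.75, perfectRecall 1
```

Note that the two numbers can diverge: a question that finds half its evidence contributes 0.5 to avg_recall and nothing to perfect_recall.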

Two design constraints matter. First, none of these repositories are sverklo’s own code — the retriever has never been tuned against any of them. This is by construction: questions hand-written against your own repo will pass; questions written against code you’ve never seen are the only honest test of generalization. Second, the harness clones each repo fresh on every run from a pinned tag, so the numbers below are reproducible by anyone with npm run bench:swe.

The headline numbers

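| Repository | Tasks | Avg recall | Perfect recall |
| --- | --- | --- | --- |
| Vite 6.0.7 | 13 | 88.5% | 11/13 |
| FastAPI 0.115.6 | 13 | 69.2% | 8/13 |
| Express 5.0.1 | 13 | 65.4% | 7/13 |
| NestJS 10.4.7 | 13 | 57.7% | 6/13 |
| Prisma 6.1.0 | 13 | 50.0% | 6/13 |
| Aggregate | 65 | 66.2% | 38/65 (58.5%) |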

Some quick observations before we get to the interesting part. Vite is the easiest target by a long way. It’s the most modern of the five, has the cleanest src/ tree, and most of its features have a dedicated file with a name that matches the feature directly (config.ts, preview.ts, watch.ts). The hybrid retriever’s symbol-name and path-token channels both light up immediately on Vite questions. Prisma is the hardest: a multi-package monorepo whose runtime logic lives at packages/client/src/runtime/core/…/, four nesting levels deep and behind several layers of indirection.

The aggregate of 38/65 perfect recall is a number I both wanted higher and am glad to have. Higher would be nicer for the launch announcement. But the gap between sverklo’s 97% on its own codebase and 58.5% out-of-distribution is exactly the gap that bench:swe was built to expose — and the failures, when you look at which questions miss, point cleanly at one specific weakness in the retrieval design that I now know how to fix.

The interesting part: deeply-nested core files

I went through every miss by hand. The pattern is so consistent it’s almost embarrassing. The questions that fail are not the ones whose answers live in obscure or rarely-touched files. They’re the questions whose canonical answer lives in a central, heavily-imported, structurally-important file that nearly every other component in the repository references — but that no individual question lexically matches.

Concrete examples from the run:

Notice what these have in common. The miss isn’t random. It’s a specific structural property of the file: every component imports it, but no individual feature-query lexically matches it. The retriever’s filename-as-signal channel and its symbol-name channel are too good at finding the feature-named files; they push the central god-file off the top-50 list because the feature files always rank higher on per-query relevance.

Why hybrid retrieval struggles here, exactly

Sverklo’s search runs five channels in parallel: FTS (BM25), code-chunk vectors, doc-chunk vectors, symbol-name match, and path-token match. Channelized Reciprocal Rank Fusion combines them. PageRank over the file-import graph is then applied as a tiebreaker.
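A sketch of the general technique may help here. This is plain Reciprocal Rank Fusion with a PageRank tiebreak, not sverklo's channelized variant, whose details aren't shown in this post. The constant 60 is the commonly used RRF default and is an assumption, as are the function name, the `module.ts` filename, and the PageRank values in the toy data.

```typescript
type Ranking = string[]; // file paths from one channel, best first

const RRF_K = 60; // common RRF default; sverklo's actual constant is not stated

// Fuse per-channel rankings: each appearance contributes 1 / (RRF_K + rank).
// PageRank is consulted only when two fused scores are exactly equal.
function fuseRankings(channels: Ranking[], pageRank: Map<string, number>): string[] {
  const scores = new Map<string, number>();
  for (const ranking of channels) {
    ranking.forEach((file, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(file, (scores.get(file) ?? 0) + 1 / (RRF_K + rank));
    });
  }
  return [...scores.keys()].sort((a, b) => {
    const diff = scores.get(b)! - scores.get(a)!;
    if (diff !== 0) return diff; // fused score decides first
    return (pageRank.get(b) ?? 0) - (pageRank.get(a) ?? 0); // tiebreak only
  });
}

// Toy data: the feature-named file tops both channels; the two files at
// rank 2 tie on fused score, so PageRank orders them.
const fusedDemo = fuseRankings(
  [
    ["request-scope.ts", "injector.ts"],
    ["request-scope.ts", "module.ts"],
  ],
  new Map([
    ["injector.ts", 0.09],
    ["module.ts", 0.01],
  ]),
);
console.log(fusedDemo); // ["request-scope.ts", "injector.ts", "module.ts"]
```

In this toy run PageRank lifts injector.ts over module.ts, but it can never lift it over request-scope.ts, because the tiebreak is never reached when fused scores differ.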

This is the disconnect. PageRank correctly identifies injector.ts as one of the most structurally important files in NestJS. But because the per-channel rankings put feature-named files (circular-dependency-exception.ts, request-scope.ts) at the top of every individual query’s candidate list, RRF fusion gives those feature files an unbeatable lead before PageRank even gets a chance to weigh in. The tiebreaker only matters when there are ties to break, and there aren’t.

What’s missing is a structural channel that follows imports backward from feature files to their centralizing implementations. If the per-channel candidates for “NestJS circular dependency resolution” surface circular-dependency-exception.ts, the retriever should automatically expand that candidate’s upstream graph — the files that import it, recursively — and add the heavily-cross-imported parents (injector.ts) to the candidate pool. PageRank knows which files those are; the per-query channels are too narrow to find them.

This is concretely the same gap that the existing filename-as-signal design pattern already addresses in the downstream direction (when a filename matches a query, all definitions in that file are added to the pool, even if their bodies don’t lexically match). The fix is to add a symmetric upstream traversal: when a feature file is surfaced, its high-PageRank parents in the import graph are too.

The fix we shipped. If a feature file F is surfaced by any per-channel retriever, and there exists a parent file P such that P imports F (transitively, depth ≤ 2) and P’s PageRank is in the top decile of the repository, add P to the candidate pool. Existing RRF fusion ranks it. Six lines of TypeScript, behind --expand-upstream in v0.18.
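The rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped six lines: the graph representation, the function names, and the way the top-decile threshold is computed are all hypothetical, and the toy filenames and PageRank values are invented for the example.

```typescript
// importsOf.get(P) lists the files that P imports directly (assumed representation).
type ImportGraph = Map<string, string[]>;

// Invert the graph so that importersOf.get(F) lists the files importing F.
function invertGraph(importsOf: ImportGraph): Map<string, string[]> {
  const importersOf = new Map<string, string[]>();
  for (const [parent, children] of importsOf) {
    for (const child of children) {
      const list = importersOf.get(child) ?? [];
      list.push(parent);
      importersOf.set(child, list);
    }
  }
  return importersOf;
}

// One possible definition of "top decile": the lowest PageRank among the top 10% of files.
function topDecileThreshold(pageRank: Map<string, number>): number {
  const sorted = [...pageRank.values()].sort((a, b) => b - a);
  const cutoffIndex = Math.max(0, Math.ceil(sorted.length * 0.1) - 1);
  return sorted[cutoffIndex] ?? Infinity; // empty graph: nothing qualifies
}

// Walk importers breadth-first up to maxDepth and add high-PageRank parents to the pool.
function expandUpstream(
  candidates: string[],
  importsOf: ImportGraph,
  pageRank: Map<string, number>,
  maxDepth: number = 2,
): string[] {
  const importersOf = invertGraph(importsOf);
  const threshold = topDecileThreshold(pageRank);
  const pool = new Set(candidates);
  const visited = new Set(candidates);
  let frontier = candidates;
  for (let depth = 1; depth <= maxDepth; depth++) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const parent of importersOf.get(file) ?? []) {
        if (visited.has(parent)) continue;
        visited.add(parent);
        next.push(parent); // keep walking through low-PageRank parents too
        if ((pageRank.get(parent) ?? 0) >= threshold) pool.add(parent);
      }
    }
    frontier = next;
  }
  return [...pool];
}

// Toy graph: far.ts imports hub.ts, which imports mid.ts, which imports feature.ts.
const demoImports: ImportGraph = new Map([
  ["mid.ts", ["feature.ts"]],
  ["hub.ts", ["mid.ts"]],
  ["far.ts", ["hub.ts"]],
]);
const demoPageRank = new Map([
  ["feature.ts", 0.01],
  ["mid.ts", 0.05],
  ["hub.ts", 0.3],
  ["far.ts", 0.2],
]);
console.log(expandUpstream(["feature.ts"], demoImports, demoPageRank)); // ["feature.ts", "hub.ts"]
```

The expanded pool is handed to the existing fusion step unchanged, so the new candidates still have to earn their rank there.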

What happened when we shipped it

Then we ran bench:swe with the new channel enabled and compared against the baseline. To make rank improvements visible — the binary in-top-50 score is too coarse for that — we also added Mean Reciprocal Rank as a second metric.

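| Repo | Recall (baseline) | Recall (+upstream) | MRR (baseline) | MRR (+upstream) | Δ MRR |
| --- | --- | --- | --- | --- | --- |
| express | 65.4% | 65.4% | 0.380 | 0.388 | +0.008 |
| nestjs | 57.7% | 57.7% | 0.288 | 0.287 | −0.001 |
| vite | 88.5% | 88.5% | 0.622 | 0.629 | +0.007 |
| prisma | 50.0% | 50.0% | 0.110 | 0.110 | 0 |
| fastapi | 69.2% | 69.2% | 0.275 | 0.275 | 0 |
| aggregate | 66.2% | 66.2% | 0.335 | 0.338 | +0.003 |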

Aggregate MRR improved by +0.003, which is well within bootstrap noise on n=65. Per-repo, two are slightly up (Express, Vite), two are flat (Prisma, FastAPI), and one (NestJS) is slightly down. The original prediction (“moves the needle 10 points”) was wrong by an order of magnitude.

Two things explain it, and both are more interesting than the original hypothesis.

The channel works on individual queries; the metric doesn’t reward it. On the NestJS nest-circular-deps task, the upstream channel moves packages/core/injector/injector.ts from rank 29 to rank 5, roughly a 6× reciprocal-rank improvement (1/29 to 1/5) on that one file. But binary in-top-50 recall doesn’t change (the file was already inside the cap), and adding new candidates to RRF inevitably shuffles other rankings, which mostly cancels out the per-file MRR win. Ranking interventions like this one probably need a steeper rank-discount metric (NDCG with a sharp falloff) before their effect shows up clearly.
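The arithmetic behind that rank move can be checked with a few lines. This is a sketch of the per-file reciprocal rank only; the helper names and the filler entries in the ranked lists are made up for illustration, and how the harness aggregates reciprocal ranks into MRR isn't shown here.

```typescript
// Reciprocal rank of one target file in a ranked list (ranks are 1-based); 0 if absent.
function reciprocalRank(ranked: string[], targetFile: string): number {
  const index = ranked.indexOf(targetFile);
  return index === -1 ? 0 : 1 / (index + 1);
}

// Binary hit: is the target file inside the top-k cutoff?
function inTopK(ranked: string[], targetFile: string, k: number = 50): boolean {
  const index = ranked.indexOf(targetFile);
  return index !== -1 && index < k;
}

// Build a ranked list of filler entries with the target placed at a given rank.
function listWithTargetAt(rank: number, targetFile: string, length: number = 60): string[] {
  const list = Array.from({ length }, (_, i) => `other-${i}.ts`);
  list[rank - 1] = targetFile;
  return list;
}

const injectorPath = "packages/core/injector/injector.ts";
const rankedBefore = listWithTargetAt(29, injectorPath);
const rankedAfter = listWithTargetAt(5, injectorPath);

const rrBefore = reciprocalRank(rankedBefore, injectorPath); // 1/29
const rrAfter = reciprocalRank(rankedAfter, injectorPath);   // 1/5

console.log((rrAfter / rrBefore).toFixed(1)); // "5.8"
console.log(inTopK(rankedBefore, injectorPath), inTopK(rankedAfter, injectorPath)); // true true
```

The reciprocal rank improves almost sixfold while the top-50 hit stays true in both cases, which is why the recall columns do not move.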

Some “missed” ground-truth files aren’t actually god-files. The clearest example: packages/core/injector/instance-loader.ts is missed in 4 NestJS tasks. PageRank: 0.0433. Importers: 6. The top-decile threshold for NestJS is 0.0884. So the channel correctly excludes it — it doesn’t pass the operational definition of structural centrality, even though my human reading flagged it as part of the canonical answer. That’s either evidence the ground truth is over-eager, or evidence that “god-file” is two distinct properties that I conflated. Either way, the original failure-pattern diagnosis from the manual audit was overstated.

This is the part of the result I find most useful. A neat improvement would have been a launch artifact; a refuted hypothesis is a research finding. The diagnosis was honest about being preliminary, and the experiment took it seriously enough to disprove it.

What this is not evidence of

Two things this finding is not:

It’s not evidence that hybrid retrieval doesn’t work. 38 of 65 questions hit perfect recall and another 14 hit ≥50%. On Vite, where the codebase is well-organized at the file-naming level, sverklo finds 11 of 13 answers perfectly. The headline number doesn’t represent a fundamental retrieval ceiling; it represents one specific class of question (the “follow imports upstream” class) that the current design doesn’t address.

It’s not a benchmark of grep against sverklo. bench:swe currently scores only sverklo. To make it a cross-tool benchmark, the smart-grep and naive-grep baselines used in bench:primitives need to be wired up against the same questions. That’s on the v0.18 list. I expect grep to do badly here — these are research questions whose answers don’t live in single substrings — but the comparison should exist as evidence rather than as a hand-wave.

Why I’m publishing this

The honest aggregate of 58.5% perfect recall is uncomfortable to put on the homepage of a code-intelligence product. It is — obviously — not a number designed to sell installs.

But that’s the point. A benchmark you only release when you win is marketing; a benchmark you release when you lose is a benchmark. If a competitor system runs against the same harness next quarter and gets 70%, that’s a real result and a thing I should be measured against. If I quietly hold back the numbers until the next release closes the gap, the harness becomes vapor and the “reproducible benchmark” framing rings hollow.

The other thing transparency buys you is a clear research backlog. Going through the misses by hand — not just looking at the aggregate — produced a diagnosis (god-files are the failure mode), a hypothesis-driven fix (upstream-import-graph traversal channel), an experiment that disproved the strong form of the hypothesis (+0.003 aggregate MRR), and a sharper question to ask next (is the ground truth itself misaligned with structural centrality?). I would not have found any of that if I had just stared at “58.5%” for an hour.

Reproduce

The full result above takes about 15–20 minutes on a laptop, end to end:

npm install -g sverklo
git clone https://github.com/sverklo/sverklo
cd sverklo
npm run bench:swe

The harness clones each pinned repo into benchmark/.cache/swe/ on first run and reuses the cache afterward. Per-repo Markdown reports and a top-level aggregate land on stdout; structured JSON for further analysis lives at benchmark/results/<timestamp>/.

If you want to add a new repo, drop a JSONL file into benchmark/src/swe/datasets/ and a corresponding entry in repos.json. PRs welcome; the goal is for this benchmark to grow into something that can’t be gamed by tuning against any one of its constituents.

Read the full paper

The bench:swe headline numbers are in §V.D of the paper; the v0.18 upstream-channel experiment and the negative-result analysis are §V.E. CC BY 4.0, on Zenodo with a permanent DOI:

doi.org/10.5281/zenodo.19802051

What’s next

If you want to follow along, the repository is at github.com/sverklo/sverklo, the harness lives in benchmark/src/swe/, and the next paper revision will land on sverklo.com/research.