Engineering · Sverklo

bench:swe first results: where local-first code intelligence still misses

First complete cross-repository run of bench:swe across Express, NestJS, Vite, Prisma, and FastAPI. 38 of 65 questions hit perfect recall; aggregate mean recall is 66.2%. The headline number isn’t the interesting part. The interesting part is that the failures cluster, almost cleanly, in a single structural pattern that hybrid retrieval as currently implemented does not handle. Here’s what we found and what we’re doing about it.

What bench:swe actually measures

bench:swe is the cross-repository half of the benchmark suite released alongside the sverklo paper. It exists because the other internal benchmark sverklo runs — bench:research, scored against sverklo’s own codebase — sits at 97% perfect recall, and 97% is exactly what you should expect from an evaluation written by the same team that wrote the retriever’s synonym list. bench:research is a regression baseline. bench:swe is the actual evaluation.

The setup: 65 hand-written research-style questions across five popular open-source projects (Express 5.0.1, NestJS 10.4.7, Vite 6.0.7, Prisma 6.1.0, FastAPI 0.115.6), 13 questions per repo, all pinned to specific commits. Each question has required-evidence files as ground truth — the canonical files an engineer would need to read to answer it correctly. A retriever passes if its top-50 results contain every required file. We score avg_recall (mean fraction of evidence found per question) and perfect_recall (count of questions for which all evidence was found).
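
In scoring terms, a minimal sketch of the two metrics (the type and field names here are illustrative, not the harness’s actual interfaces):

// avg_recall: mean fraction of required evidence found per question.
// perfect_recall: number of questions where every required file appears in the top 50.
interface QuestionResult {
  requiredEvidence: string[]; // ground-truth files an engineer would need to read
  retrievedTop50: string[];   // the retriever's top-50 file paths for the question
}

function score(results: QuestionResult[]) {
  const recalls = results.map((r) => {
    const found = r.requiredEvidence.filter((f) => r.retrievedTop50.includes(f));
    return found.length / r.requiredEvidence.length;
  });
  const avgRecall = recalls.reduce((a, b) => a + b, 0) / recalls.length;
  const perfectRecall = recalls.filter((x) => x === 1).length;
  return { avgRecall, perfectRecall };
}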

Two design constraints matter. First, none of these repositories are sverklo’s own code — the retriever has never been tuned against any of them. This is by construction: questions hand-written against your own repo will pass; questions written against code you’ve never seen are the only honest test of generalization. Second, the harness clones each repo fresh on every run from a pinned tag, so the numbers below are reproducible by anyone with npm run bench:swe.

The headline numbers

Repository         Tasks   Avg recall   Perfect recall
Vite 6.0.7           13      88.5%        11/13
FastAPI 0.115.6      13      69.2%         8/13
Express 5.0.1        13      65.4%         7/13
NestJS 10.4.7        13      57.7%         6/13
Prisma 6.1.0         13      50.0%         6/13
Aggregate            65      66.2%        38/65 (58.5%)

Some quick observations before we get to the interesting part. Vite is the easiest target by a long way. It’s the most modern of the five, has the cleanest src/ tree, and most of its features have a dedicated file with a name that matches the feature directly (config.ts, preview.ts, watch.ts). The hybrid retriever’s symbol-name and path-token channels both light up immediately on Vite questions. Prisma is the hardest: a multi-package monorepo whose runtime logic lives at packages/client/src/runtime/core/…/, four nesting levels deep and behind several layers of indirection.

The aggregate of 38/65 perfect recall is a number I both wanted higher and am glad to have. Higher would be nicer for the launch announcement. But the gap between sverklo’s 97% on its own codebase and 58.5% out-of-distribution is exactly the gap that bench:swe was built to expose — and the failures, when you look at which questions miss, point cleanly at one specific weakness in the retrieval design that I now know how to fix.

The interesting part: deeply-nested core files

I went through every miss by hand. The pattern is so consistent it’s almost embarrassing. The questions that fail are not the ones whose answers live in obscure or rarely-touched files. They’re the questions whose canonical answer lives in a central, heavily-imported, structurally-important file that nearly every other component in the repository references — but that no individual question lexically matches.

Concrete examples from the run:

Notice what these have in common. The miss isn’t random. It’s a specific structural property of the file: every component imports it, but no individual feature-query lexically matches it. The retriever’s filename-as-signal channel and its symbol-name channel are too good at finding the feature-named files; they push the central god-file off the top-50 list because the feature files always rank higher on per-query relevance.

Why hybrid retrieval struggles here, exactly

Sverklo’s search runs five channels in parallel: FTS (BM25), code-chunk vectors, doc-chunk vectors, symbol-name match, and path-token match. Channelized Reciprocal Rank Fusion combines them. PageRank over the file-import graph is then applied as a tiebreaker.
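
For readers who haven’t met it, reciprocal rank fusion looks roughly like this (a generic sketch, not sverklo’s implementation; the k constant and the flat channel weighting are illustrative):

// Each channel contributes 1 / (k + rank) for every candidate it returns;
// candidates are ordered by their summed score across all channels.
function rrfFuse(channelRankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of channelRankings) {
    ranking.forEach((file, index) => {
      scores.set(file, (scores.get(file) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([file]) => file);
}

The important property for what follows: a file that no per-query channel ranks highly accumulates little or no fused score, so a tiebreaker applied after fusion has nothing left to rescue.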

This is the disconnect. PageRank correctly identifies injector.ts as one of the most structurally important files in NestJS. But because the per-channel rankings put feature-named files (circular-dependency-exception.ts, request-scope.ts) at the top of every individual query’s candidate list, RRF fusion gives those feature files an unbeatable lead before PageRank even gets a chance to weigh in. The tiebreaker only matters when there are ties to break, and there aren’t.

What’s missing is a structural channel that follows imports backward from feature files to their centralizing implementations. If the per-channel candidates for “NestJS circular dependency resolution” surface circular-dependency-exception.ts, the retriever should automatically expand that candidate’s upstream graph — the files that import it, recursively — and add the heavily-cross-imported parents (injector.ts) to the candidate pool. PageRank knows which files those are; the per-query channels are too narrow to find them.

The existing filename-as-signal design pattern already does this in the downstream direction (when a filename matches a query, all definitions in that file are added to the pool, even if their bodies don’t lexically match). The fix is the symmetric upstream traversal: when a feature file is surfaced, its high-PageRank parents in the import graph are surfaced too.

The fix is small and probably moves the needle 10 points: if feature file F is surfaced by any per-channel retriever, and there exists a parent file P such that P imports F (transitively, depth ≤ 2) and P’s PageRank is in the top decile of the repository, add P to the candidate pool and let the existing RRF fusion rank it. Don’t over-engineer it; this is six lines of TypeScript.
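
A sketch of what that rule could look like (the importedBy map, pageRank scores, and decile threshold are assumed inputs; the real channel interfaces in sverklo will differ):

// Walk the import graph upstream (who imports this file?) from every surfaced
// candidate, up to depth 2, and add parents whose PageRank sits in the
// repository's top decile. The expanded pool then goes through RRF as before.
function expandUpstream(
  candidates: Set<string>,
  importedBy: Map<string, string[]>, // file -> files that import it
  pageRank: Map<string, number>,
  topDecileThreshold: number,
  maxDepth = 2,
): Set<string> {
  const expanded = new Set(candidates);
  for (const file of candidates) {
    let frontier = [file];
    for (let depth = 0; depth < maxDepth; depth++) {
      const parents = frontier.flatMap((f) => importedBy.get(f) ?? []);
      for (const parent of parents) {
        if ((pageRank.get(parent) ?? 0) >= topDecileThreshold) {
          expanded.add(parent);
        }
      }
      frontier = parents;
    }
  }
  return expanded;
}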

What this is not evidence of

Two things this finding is not:

It’s not evidence that hybrid retrieval doesn’t work. 38 of 65 questions hit perfect recall and another 14 hit ≥50%. On Vite, where the codebase is well-organized at the file-naming level, sverklo finds 11 of 13 answers perfectly. The headline number doesn’t represent a fundamental retrieval ceiling; it reflects one specific class of question (the “follow imports upstream” class) that the current design doesn’t address.

It’s not a benchmark of grep against sverklo. bench:swe currently scores only sverklo. To make it a cross-tool benchmark, the smart-grep and naive-grep baselines used in bench:primitives need to be wired up against the same questions. That’s on the v0.18 list. I expect grep to do badly here — these are research questions whose answers don’t live in single substrings — but the comparison should exist as evidence rather than as a hand-wave.

Why I’m publishing this

The honest aggregate of 58.5% perfect recall is uncomfortable to put on the homepage of a code-intelligence product. It is — obviously — not a number designed to sell installs.

But that’s the point. A benchmark you only release when you win is marketing; a benchmark you release when you lose is a benchmark. If a competitor system runs against the same harness next quarter and gets 70%, that’s a real result and a thing I should be measured against. If I quietly hold back the numbers until the next release closes the gap, the harness becomes vapor and the “reproducible benchmark” framing rings hollow.

The other thing transparency buys you is a clear research backlog. Going through the misses by hand — not just looking at the aggregate — produced an actionable diagnosis (god-files are the failure mode) and an actionable fix (upstream-import-graph traversal channel). I would not have found either if I had just stared at “58.5%” for an hour.

Reproduce

The full result above takes about 15–20 minutes on a laptop, end to end:

npm install -g sverklo
git clone https://github.com/sverklo/sverklo
cd sverklo
npm install
npm run bench:swe

The harness clones each pinned repo into benchmark/.cache/swe/ on first run and reuses the cache afterward. Per-repo Markdown reports and a top-level aggregate land on stdout; structured JSON for further analysis lives at benchmark/results/<timestamp>/.

If you want to add a new repo, drop a JSONL file into benchmark/src/swe/datasets/ and a corresponding entry in repos.json. PRs welcome; the goal is for this benchmark to grow into something that can’t be gamed by tuning against any one of its constituents.
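
For orientation, and with the caveat that the authoritative schema is whatever the existing files in benchmark/src/swe/datasets/ use, a dataset line conceptually pairs a question with its required-evidence paths, and the repos.json entry pins the repository to a tag, something like:

{"question": "How does the dev server decide when to trigger a full page reload?", "evidence": ["src/…/file-a.ts", "src/…/file-b.ts"]}
{"name": "example-repo", "url": "https://github.com/example/example-repo", "tag": "v1.2.3"}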

Read the full paper

The bench:swe results above are documented in §V.D of the v1.1.0 paper, with a fuller failure analysis and a discussion of methodology limitations. CC BY 4.0, on Zenodo with a permanent DOI:

doi.org/10.5281/zenodo.19802051

What’s next

If you want to follow along, the repository is at github.com/sverklo/sverklo, the harness lives in benchmark/src/swe/, and the next paper revision will land on sverklo.com/research.