bench:swe first results: where local-first code intelligence still misses
First complete cross-repository run of bench:swe across Express, NestJS, Vite, Prisma, and FastAPI. 38 of 65 questions hit perfect recall; aggregate mean recall is 66.2%. The headline number isn’t the interesting part. The interesting part is that the failures cluster, almost cleanly, in a single structural pattern that hybrid retrieval as currently implemented does not handle. Here’s what we found and what we’re doing about it.
What bench:swe actually measures
bench:swe is the cross-repository half of the benchmark suite released alongside the sverklo paper. It exists because the other internal benchmark sverklo runs — bench:research, scored against sverklo’s own codebase — sits at 97% perfect recall, and 97% is exactly what you should expect from an evaluation written by the same team that wrote the retriever’s synonym list. bench:research is a regression baseline. bench:swe is the actual evaluation.
The setup: 65 hand-written research-style questions across five popular open-source projects (Express 5.0.1, NestJS 10.4.7, Vite 6.0.7, Prisma 6.1.0, FastAPI 0.115.6), 13 questions per repo, all pinned to specific commits. Each question has required-evidence files as ground truth — the canonical files an engineer would need to read to answer it correctly. A retriever passes if its top-50 results contain every required file. We score avg_recall (mean fraction of evidence found per question) and perfect_recall (count of questions for which all evidence was found).
Two design constraints matter. First, none of these repositories are sverklo’s own code — the retriever has never been tuned against any of them. This is by construction: questions hand-written against your own repo will pass; questions written against code you’ve never seen are the only honest test of generalization. Second, the harness clones each repo fresh on every run from a pinned tag, so the numbers below are reproducible by anyone with npm run bench:swe.
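To pin the two metrics down exactly: a question's recall is the fraction of its required-evidence files that appear anywhere in the retriever's top-50, and a question counts as perfect only when that fraction is 1. The sketch below shows that arithmetic; the interfaces and field names are illustrative placeholders, not the harness's actual types.

```typescript
// Minimal sketch of the bench:swe scoring described above.
// SweTask / TaskScore and their fields are assumptions for illustration.
interface SweTask {
  id: string;
  question: string;
  requiredEvidence: string[]; // canonical files an engineer must read
}

interface TaskScore {
  recall: number;   // fraction of required files present in the top-50
  perfect: boolean; // true only if every required file was found
}

function scoreTask(task: SweTask, top50: string[]): TaskScore {
  const found = task.requiredEvidence.filter((f) => top50.includes(f));
  return {
    recall: found.length / task.requiredEvidence.length,
    perfect: found.length === task.requiredEvidence.length,
  };
}

function aggregate(scores: TaskScore[]): { avgRecall: number; perfectRecall: number } {
  const avgRecall = scores.reduce((sum, s) => sum + s.recall, 0) / scores.length;
  const perfectRecall = scores.filter((s) => s.perfect).length;
  return { avgRecall, perfectRecall };
}
```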
The headline numbers
| Repository | Tasks | Avg recall | Perfect recall |
|---|---|---|---|
| Vite 6.0.7 | 13 | 88.5% | 11/13 |
| FastAPI 0.115.6 | 13 | 69.2% | 8/13 |
| Express 5.0.1 | 13 | 65.4% | 7/13 |
| NestJS 10.4.7 | 13 | 57.7% | 6/13 |
| Prisma 6.1.0 | 13 | 50.0% | 6/13 |
| Aggregate | 65 | 66.2% | 38/65 (58.5%) |
Some quick observations before we get to the interesting part. Vite is the easiest target by a long way. It’s the most modern of the five, has the cleanest src/ tree, and most of its features have a dedicated file with a name that matches the feature directly (config.ts, preview.ts, watch.ts). The hybrid retriever’s symbol-name and path-token channels both light up immediately on Vite questions. Prisma is the hardest: a multi-package monorepo whose runtime logic lives at packages/client/src/runtime/core/…/, four nesting levels deep and behind several layers of indirection.
The aggregate of 38/65 perfect recall is a number I both wanted higher and am glad to have. Higher would be nicer for the launch announcement. But the gap between sverklo’s 97% on its own codebase and 58.5% out-of-distribution is exactly the gap that bench:swe was built to expose — and the failures, when you look at which questions miss, point cleanly at one specific weakness in the retrieval design that I now know how to fix.
The interesting part: deeply-nested core files
I went through every miss by hand. The pattern is so consistent it’s almost embarrassing. The questions that fail are not the ones whose answers live in obscure or rarely-touched files. They’re the questions whose canonical answer lives in a central, heavily-imported, structurally-important file that nearly every other component in the repository references — but that no individual question lexically matches.
Concrete examples from the run:
- Express — lib/router/index.js. Missed in 5 of 13 Express tasks: router-mount, error-middleware, app-mount, stream-disconnect, router-method-chain. This single file holds the routing dispatch, the middleware error contract, the sub-app mounting logic, and the method-chain registration — all four feature areas the questions probe. But none of the questions name it directly; the queries describe the behavior, not the file.
- NestJS — packages/core/injector/injector.ts and instance-loader.ts. Missed in 4 tasks across DI resolution, circular dependency handling, request-scope, and lifecycle hooks. These two files implement the dependency-injection runtime that everything in NestJS goes through. They’re imported by every module. They’re also impossible to find by lexical matching against a question like “how does NestJS resolve circular dependencies?”
- Prisma — packages/client/src/runtime/core/engines/library/LibraryEngine.ts. Missed in queries about query-engine IPC and connection pooling. This is the file that actually implements both, but its directory path is so generic (“core/engines/library”) and its filename so unspecific that neither BM25 nor dense embeddings rank it above feature-named files like connection-pool.ts elsewhere in the tree.
- FastAPI — fastapi/applications.py and fastapi/routing.py. Missed in CORS and startup-shutdown queries. The FastAPI class itself wires up CORS middleware via app.add_middleware() and registers startup handlers via @app.on_event(). But the question text describes “CORS middleware,” and the retriever finds files literally named for CORS in the dependency tree of FastAPI’s tests rather than the file that implements the registration.
Notice what these have in common. The miss isn’t random. It’s a specific structural property of the file: every component imports it, but no individual feature-query lexically matches it. The retriever’s filename-as-signal channel and its symbol-name channel are too good at finding the feature-named files; they push the central god-file off the top-50 list because the feature files always rank higher on per-query relevance.
Why hybrid retrieval struggles here, exactly
Sverklo’s search runs five channels in parallel: FTS (BM25), code-chunk vectors, doc-chunk vectors, symbol-name match, and path-token match. Channelized Reciprocal Rank Fusion combines them. PageRank over the file-import graph is then applied as a tiebreaker.
This is the disconnect. PageRank correctly identifies injector.ts as one of the most structurally important files in NestJS. But because the per-channel rankings put feature-named files (circular-dependency-error.ts, request-scope.ts) at the top of every individual query’s candidate list, RRF fusion gives those feature files an unbeatable lead before PageRank even gets a chance to weigh in. The tiebreaker only matters when there are ties to break, and there aren’t.
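To make the “tiebreaker never fires” point concrete, here is a minimal sketch of channelized RRF with a PageRank tiebreak, assuming each channel returns a best-first list of file paths. The constant and the exact tiebreak rule are assumptions for illustration, not sverklo’s implementation. The structural point is visible in the code: a file that no per-query channel surfaces never enters the fused map at all, so its PageRank is never consulted.

```typescript
const K = 60; // conventional RRF damping constant (assumed, not sverklo's value)

// Fuse per-channel rankings with Reciprocal Rank Fusion, breaking exact
// score ties by import-graph PageRank. A file absent from every channel
// never appears in `rrf`, so PageRank cannot rescue it.
function fuse(
  channelRankings: string[][],    // one ranked file list per channel, best first
  pageRank: Map<string, number>,  // import-graph centrality per file
  topN = 50,
): string[] {
  const rrf = new Map<string, number>();
  for (const ranking of channelRankings) {
    ranking.forEach((file, i) => {
      rrf.set(file, (rrf.get(file) ?? 0) + 1 / (K + i + 1));
    });
  }
  return [...rrf.entries()]
    .sort((a, b) =>
      b[1] !== a[1]
        ? b[1] - a[1]                                             // RRF score dominates
        : (pageRank.get(b[0]) ?? 0) - (pageRank.get(a[0]) ?? 0),  // PageRank only breaks ties
    )
    .slice(0, topN)
    .map(([file]) => file);
}
```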
What’s missing is a structural channel that follows imports backward from feature files to their centralizing implementations. If the per-channel candidates for “NestJS circular dependency resolution” surface circular-dependency-exception.ts, the retriever should automatically expand that candidate’s upstream graph — the files that import it, recursively — and add the heavily-cross-imported parents (injector.ts) to the candidate pool. PageRank knows which files those are; the per-query channels are too narrow to find them.
This is concretely the same gap that the existing filename-as-signal design pattern already addresses in the downstream direction (when a filename matches a query, all definitions in that file are added to the pool, even if their bodies don’t lexically match). The fix is to add a symmetric upstream traversal: when a feature file is surfaced, its high-PageRank parents in the import graph are too.
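Here is a sketch of what that upstream traversal could look like, assuming a reverse import graph and precomputed PageRank scores are already available from indexing. The function name, depth cap, and centrality threshold are all placeholders, not the planned v0.18 interface.

```typescript
// Proposed upstream-expansion step (sketch): for each per-channel candidate,
// walk the reverse import graph and admit high-centrality parents into the
// candidate pool before fusion. Thresholds and depth are illustrative.
function expandUpstream(
  candidates: string[],
  importedBy: Map<string, string[]>, // file -> files that import it
  pageRank: Map<string, number>,
  centralityThreshold: number,       // e.g. top decile of PageRank scores
  maxDepth = 2,
): string[] {
  const pool = new Set(candidates);
  let frontier = [...candidates];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const parent of importedBy.get(file) ?? []) {
        // Only admit parents the import graph marks as central: this is what
        // pulls a god-file like injector.ts in when a feature-named file is
        // the lexical hit.
        if (!pool.has(parent) && (pageRank.get(parent) ?? 0) >= centralityThreshold) {
          pool.add(parent);
          next.push(parent);
        }
      }
    }
    frontier = next;
  }
  return [...pool];
}
```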
What this is not evidence of
Two things this finding is not:
It’s not evidence that hybrid retrieval doesn’t work. 38 of 65 questions hit perfect recall and another 14 hit ≥50%. On Vite, where the codebase is well-organized at the file-naming level, sverklo finds 11 of 13 answers perfectly. The headline number doesn’t represent a fundamental retrieval ceiling; it reflects one specific class of question (the “follow imports upstream” class) that the current design doesn’t address.
It’s not a benchmark of grep against sverklo. bench:swe currently scores only sverklo. To make it a cross-tool benchmark, the smart-grep and naive-grep baselines used in bench:primitives need to be wired up against the same questions. That’s on the v0.18 list. I expect grep to do badly here — these are research questions whose answers don’t live in single substrings — but the comparison should exist as evidence rather than as a hand-wave.
Why I’m publishing this
The honest aggregate of 58.5% perfect recall is uncomfortable to put on the homepage of a code-intelligence product. It is — obviously — not a number designed to sell installs.
But that’s the point. A benchmark you only release when you win is marketing; a benchmark you release when you lose is a benchmark. If a competitor system runs against the same harness next quarter and gets 70%, that’s a real result and a thing I should be measured against. If I quietly hold back the numbers until the next release closes the gap, the harness becomes vapor and the “reproducible benchmark” framing rings hollow.
The other thing transparency buys you is a clear research backlog. Going through the misses by hand — not just looking at the aggregate — produced an actionable diagnosis (god-files are the failure mode) and an actionable fix (upstream-import-graph traversal channel). I would not have found either if I had just stared at “58.5%” for an hour.
Reproduce
The full result above takes about 15–20 minutes on a laptop, end to end:
```bash
npm install -g sverklo
git clone https://github.com/sverklo/sverklo
cd sverklo
npm run bench:swe
```
The harness clones each pinned repo into benchmark/.cache/swe/ on first run and reuses the cache afterward. Per-repo Markdown reports and a top-level aggregate land on stdout; structured JSON for further analysis lives at benchmark/results/<timestamp>/.
If you want to add a new repo, drop a JSONL file into benchmark/src/swe/datasets/ and a corresponding entry in repos.json. PRs welcome; the goal is for this benchmark to grow into something that can’t be gamed by tuning against any one of its constituents.
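As a rough illustration only, a dataset line would presumably carry the question text and its required-evidence paths, along the lines of the invented example below. Every field name here is a guess; copy the shape of an existing file in benchmark/src/swe/datasets/ rather than this sketch, and mirror whatever pinning fields (clone URL, tag) the existing repos.json entries use.

```jsonl
{"id": "example-task", "question": "How does the framework register middleware on a mounted sub-application?", "required_evidence": ["lib/application.js", "lib/router/index.js"]}
```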
Read the full paper
The bench:swe results above are documented in §V.D of the v1.1.0 paper, with a fuller failure analysis and a discussion of methodology limitations. CC BY 4.0, on Zenodo with a permanent DOI:
doi.org/10.5281/zenodo.19802051
What’s next
- v0.18 retrieval change: upstream-import-graph traversal channel as described above. The hypothesis is that this lifts NestJS, Prisma, and Express by ~10–15 points each. If it doesn’t, that’s also a publishable result.
- Smart-grep and naive-grep baselines on bench:swe — required to make the suite a real cross-tool comparison rather than a sverklo-only regression harness.
- Independent annotators for at least one of the five repos’ ground-truth files. Right now I authored every question and every required-evidence list, which is a known weakness; getting an external engineer to verify even one repo’s ground truth would meaningfully strengthen the construct validity claim.
- Add Linux, Chromium, or another truly large codebase to test scaling. The current five repos are all in the 1k–10k file range. The interesting question is whether the failure pattern persists at 100k+.
If you want to follow along, the repository is at github.com/sverklo/sverklo, the harness lives in benchmark/src/swe/, and the next paper revision will land on sverklo.com/research.