/* analysis · 2026-05-15 · the audit gap */

1 million lines, 10,400 unsafe blocks, 6 days. The audit gap is the bottleneck.

2026-05-15 · ~7 min read · part of: bench-loop posts

Yesterday PR #30412 merged. The Bun runtime — about a million lines, 6,755 commits, ~2,188 files — was rewritten from Zig to Rust. The work was executed by Claude Code agents under Anthropic, which now owns the Bun team. The HN thread hit 685 points in a day. The PR itself sits at 1,382 thumbs-up and 1,186 thumbs-down. That's not a normal merge.

"We haven't been typing code ourselves for many months now." — Jarred Sumner, on the Bun rewrite

If you read the comments, you'll notice something. The loudest objections are not "Rust was the wrong call" or "Zig deserved better." Those threads exist, but they're not the ones with the upvotes. The upvoted criticism is narrower and harder to dismiss:

  1. The rewrite ships with 10,400 unsafe blocks across 736 files. That undercuts the memory-safety pitch that partly justified the rewrite.
  2. Agents wrote it. Sumner says he and his team haven't been typing the code for months.
  3. So who actually audits a million lines an agent shipped?

The third question is the one that matters. The first two are inputs to it.

The question is settled. The next one isn't.

It's tempting to write a post about whether AI agents should be writing production runtimes. Don't bother. The merge button got pressed. The benchmarks Bun publishes put the Rust version roughly at parity with the Zig version, sometimes ahead. The PR is in main. Anthropic, which builds the agent, owns the team that shipped the agent's output. The question of whether is over.

The question of how to evaluate what shipped is wide open.

That's the gap. When a human writes a 500-line PR, the reviewer reads it. When an agent writes a million lines across two thousand files, "read the diff" stops being a coherent strategy. You need a measurement layer — something that can answer "what changed structurally, what's the blast radius, where's the risk concentrated" without depending on a second LLM to summarize the first one.

I work on sverklo, a local-first MCP code-intelligence server. Sverklo exists for exactly this question. Not as a replacement for review — humans still have to make the call — but as the instrument you point at a codebase before you decide where to spend your reviewing attention.

Here's what that looks like in practice.

What "audit" means when you can't read the diff

Two of sverklo's tools are directly relevant to the Bun situation: sverklo_audit and sverklo_impact.

sverklo_audit returns a structural health report — god nodes, hub files, dead code, security-pattern hits, circular dependencies — graded A through F. When the agent-led dogfood run pointed it at sverklo's own codebase last week, it returned a Grade A, up from the Grade B it scored before the v0.20.27 vendored-path filter shipped. The grade is heuristic, not magic. But it surfaces the things you'd want a human reviewer to start with: which files have outsized fan-in, which symbols touch too much of the graph, where the unsafe-pattern density spikes.

For a rewrite the size of Bun's, what you'd actually want to know is not "is this codebase good" — it's "where is the risk concentrated." Grade A vs F isn't the interesting number. The interesting numbers are the per-file unsafe density, the hub-file list, and the circular-dep count. Those are the addresses for review attention.
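If you want to see what invoking that looks like from an MCP client, here is a minimal sketch against the MCP TypeScript SDK. The serve subcommand, the path argument, and the result fields are my assumptions for illustration; only the SDK calls themselves are real.

// Hypothetical driver for sverklo_audit over stdio. The subcommand,
// argument name, and result shape are assumed, not documented API.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({ command: "sverklo", args: ["serve"] });
const client = new Client({ name: "audit-driver", version: "0.0.1" });
await client.connect(transport);

// Request the structural health report for the repo under review.
const report = await client.callTool({
  name: "sverklo_audit",
  arguments: { path: "." }, // assumed parameter name
});

// The content carries the grade plus the addresses for review attention:
// hub files, god nodes, unsafe-pattern hits, circular dependencies.
console.log(report.content);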

sverklo_impact answers the second question: blast radius. Ask it about a symbol, and it returns every reference, grouped by file and call-site type. On our dogfood run this week, sverklo_impact "parseFile" returned 19 references across 4 files. That's the kind of answer you want before touching a hot path. If an agent proposes to change parseFile, knowing the 19 downstream callers up front is the difference between a routine refactor and a Friday afternoon you regret.
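What you do with that answer is mechanical: group the references by file, sort by weight, review the heaviest file first. A minimal sketch, assuming a reference shape sverklo does not actually document:

// Hypothetical shape of one sverklo_impact reference (assumed for illustration).
type Reference = { file: string; line: number; callSiteType: string };

// Group references by file and sort descending, so review attention goes
// to the files with the most downstream call sites first.
function reviewWorklist(refs: Reference[]): [string, number][] {
  const byFile = new Map<string, number>();
  for (const ref of refs) byFile.set(ref.file, (byFile.get(ref.file) ?? 0) + 1);
  return [...byFile.entries()].sort((a, b) => b[1] - a[1]);
}

// 19 references across 4 files becomes a 4-entry worklist, heaviest first.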

Neither tool tells you whether the agent's code is good. They tell you where to look. That's the gap between "the agent shipped" and "we know what shipped."

We just shipped ten releases this week. With agents. Read this part carefully.

I want to be specific about why I think we have standing to write this post.

Between Monday and today, sverklo shipped versions v0.20.22 through v0.20.31 — ten releases in a week. Most of that work was driven by Claude Code agents running against sverklo's own repo. The agents found bugs in sverklo by running sverklo against itself, then fixing what they found, then re-running.

That cycle is not clean. In v0.20.25 I made a path-lookup change. The dogfood audit agent, on its next pass, flagged a HIGH-severity security regression in that exact change — a SQL wildcard-injection opening I hadn't noticed (GLOB metachars in bound parameters). I shipped the fix in v0.20.29. The agent found the bug I introduced. That's the loop working. It is also exactly the kind of thing that would have made it to main if I'd been the only reviewer.
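For the curious, here is the shape of that bug class in miniature. This is an illustration, not sverklo's actual code: the table, the function names, and the better-sqlite3 wiring are all assumed.

// Parameterized queries stop SQL injection, not pattern injection: GLOB
// interprets the *bound string* as a pattern, so user-supplied "*", "?",
// and "[" still act as wildcards.
import Database from "better-sqlite3";

const db = new Database("index.db"); // hypothetical index schema

function findByPathUnsafe(userPath: string) {
  // BUG: metacharacters in userPath widen the match arbitrarily.
  return db.prepare("SELECT path FROM files WHERE path GLOB ?").all(userPath + "*");
}

// FIX: neutralize GLOB metacharacters by wrapping each in a one-character
// class before binding: "[*]" matches a literal "*", "[[]" a literal "[".
function escapeGlob(s: string): string {
  return s.replace(/[*?\[]/g, (c) => `[${c}]`);
}

function findByPathSafe(userPath: string) {
  return db.prepare("SELECT path FROM files WHERE path GLOB ?").all(escapeGlob(userPath) + "*");
}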

I'm not telling that story to brag about the catch. I'm telling it because it's the smaller-scale version of what just happened with Bun. An agent shipped code. A different agent, running a different tool, found a problem in it. The human in the loop — me — was a coordinator, not the auditor. The audit was tooling.

If you can't tell that story honestly at small scale, you have no business making claims about it at million-line scale. So: that's our scale, that's what the agents caught, and that's what we shipped.

What this is, and what it isn't

A few things sverklo is not:

  1. It is not a replacement for review. Humans still have to make the call.
  2. It is not a second LLM summarizing the first one's output.
  3. It is not a verdict on whether the code is good. It tells you where to look, not what to conclude.

What it is: a measurement layer. The kind of thing you point at a codebase — agent-written, human-written, or hybrid — when you need to know where to spend your attention.

The Bun rewrite makes the case for that measurement layer the way no marketing copy could. A million lines, ten thousand unsafe blocks, six days, agents at the keyboard, and now in production. If that's the new normal, "I'll read the diff" is not a strategy anyone can defend with a straight face.

On honesty about the tool itself

The reason this post can be written is that we don't sell sverklo on belief. We sell it on numbers you can verify.

The reason I point at verifiable numbers (the bench's F1 and P@1 scores, the latency you can measure on your own repo) before asking you to install anything is that the appropriate response to "an agent shipped a million lines, trust the measurements" is to look at the measurements. Same standard. We publish where we lose so the wins are checkable.

If the audit gap is the bottleneck

The Bun rewrite is the canary, not the crisis. The crisis is that the next ten rewrites are coming, the agents are getting faster, and the review surface is getting wider. Some of those rewrites will go fine. Some of them will ship unsafe blocks no human ever read. The differentiator between those two outcomes is going to be the tooling around the agent, not the agent itself.

If you want to point a measurement layer at your own repo:

Try the measurement layer

npm install -g sverklo
sverklo init          # writes .mcp.json + CLAUDE.md
sverklo bench self    # measure cold-start + warm-call latency on YOUR repo

If you want to verify the retrieval claims before installing, the bench is the right place to start. F1, P@1, baselines, losses — all of it.
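If the metric names are unfamiliar, the definitions fit in a few lines. A sketch of the two headline numbers, using textbook definitions rather than the bench's exact implementation:

// One benchmark query: what the tool returned, ranked, and what was relevant.
type Query = { retrieved: string[]; relevant: Set<string> };

// P@1: the fraction of queries whose top-ranked result is relevant.
function precisionAt1(queries: Query[]): number {
  const hits = queries.filter((q) => q.relevant.has(q.retrieved[0])).length;
  return hits / queries.length;
}

// F1: harmonic mean of precision and recall for one query's result set.
function f1(q: Query): number {
  const tp = q.retrieved.filter((r) => q.relevant.has(r)).length;
  if (tp === 0) return 0; // also guards the 0/0 case
  const precision = tp / q.retrieved.length;
  const recall = tp / q.relevant.size;
  return (2 * precision * recall) / (precision + recall);
}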

A million lines is a lot of code to take on faith. Don't.
