/* engineering · 2026-05-07 */

The bench is a feedback loop on two axes

2026-05-07 · Nikita Groshin · ~7 min read

Two iterations of the same pattern in one week, on different axes. Last weekend the bench compressed two maintainers' iteration on a parser bug into 36 hours. This week it compressed a Reddit suggestion into shipped CI in 24. Same loop, two surfaces. The mechanism is the part worth writing down.

Axis 1 — bugs (the lodash arc, 36 hours)

I added jcodemunch-mcp and GitNexus to sverklo's 60-task retrieval bench on May 2, then published the writeup with the parts where sverklo lost. Jake Gravelle (jcodemunch's maintainer) read the post within hours and shipped three back-to-back releases: v1.80.7 (CommonJS module.exports re-export chains), v1.80.8 (a 500 KB per-file size cap, because lodash is 548 KB), v1.80.9 (a monolithic-IIFE call-graph fallback). His lodash P1 went from 0/10 to 9/10 against the same task suite.

I then added lodash to my bench as a third dataset. That immediately exposed the symmetric bug on my side: the regex-based brace counter mis-counted braces inside string literals. Line 6301 of lodash.js has the literal '{\n/* [wrapped with ', and the unbalanced { caused every function declaration after that line to be absorbed into one ~11K-line chunk. Public methods (map, filter, reduce) never got their own chunks. Fix: a string-aware brace counter plus exact-match priority in the lookup tool. Sverklo v0.20.2 shipped on May 4. P1 went 0.30 → 0.73; overall F1 went 0.45 → 0.56.
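For readers who want the failure mode pinned down, here is a minimal Python sketch of the difference a string-aware counter makes. This is not sverklo's parser, just the shape of the fix: track whether the scanner is inside a string literal or comment before letting a brace change the depth. Regex literals and ${} interpolation inside template literals are deliberately out of scope.

```python
def brace_delta(line: str, in_block_comment: bool = False) -> tuple[int, bool]:
    """Net { / } depth change for one line of JS source, ignoring braces that
    appear inside string literals, template literals, or comments.
    Sketch only: regex literals and ${} interpolation are not handled."""
    depth = 0
    quote = None                      # current string delimiter: ', ", or `
    i = 0
    while i < len(line):
        ch = line[i]
        if in_block_comment:
            if line.startswith("*/", i):
                in_block_comment = False
                i += 1                # consume the '*'; the '/' is consumed below
        elif quote:
            if ch == "\\":
                i += 1                # skip the escaped character
            elif ch == quote:
                quote = None
        else:
            if line.startswith("//", i):
                break                 # rest of the line is a comment
            if line.startswith("/*", i):
                in_block_comment = True
                i += 1
            elif ch in ("'", '"', "`"):
                quote = ch
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
        i += 1
    return depth, in_block_comment

# A line like lodash's (illustrative, not the verbatim source): the '{' lives
# inside a string, so a string-aware scan reports zero net depth change,
# where a naive counter would report +1 and never recover.
assert brace_delta("var wrapped = '{\\n/* [wrapped with ';")[0] == 0
```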

Both projects landed lodash P1 fixes inside 36 hours of the original bench publication. Different parsers, different bugs, same effect. The bench made each side's blind spot visible to the other in a way no internal eval would have.

Axis 2 — friction (the auto-bench CI arc, 24 hours)

Three days later I posted a follow-up to r/mcp describing the loop. The post got modest engagement — score 1, 3 comments, no front-page traffic. But one of the comments came from u/d3vilzwrld, who buried a useful question in the last paragraph:

Have you considered a "run on submit" CI action where anyone submitting an MCP server gets a PR with benchmark results? That would make the eval self-repairing — new entrants validate their own numbers before claiming them.

This is the contribution-friction analog of the lodash arc. The lodash bug was about code drift; this was about process drift. Up to that point, contributing a baseline to sverklo-bench meant the maintainer (me) had to run the harness locally before merging. That doesn't scale past five baselines, and it forces every contributor to wait for maintainer ceremony before they can verify their own numbers. The friction was real and I hadn't seen it from inside.

I filed sverklo-bench#4 the same day and shipped .github/workflows/auto-bench.yml in the main sverklo repo the next morning. The workflow detects which baseline files a PR touches via git diff, runs the harness on the express dataset (~30 tasks; lodash and sverklo datasets are too slow for free-tier GitHub Actions), and posts a results-table comment back to the PR within ~10 minutes. Idempotent — re-running the workflow updates the same comment in place. Full results upload as a GH Actions artifact.
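The workflow file itself is YAML, but the two load-bearing pieces are small enough to sketch. The Python below shows the shape I mean, not the shipped workflow: the baselines/ path prefix, the PR_NUMBER variable, and the comment marker are illustrative assumptions, while the GitHub REST calls (list issue comments, create one, patch one) are the standard endpoints.

```python
import os
import subprocess
import requests

MARKER = "<!-- auto-bench results -->"   # hidden tag so re-runs find and edit the same comment
API = "https://api.github.com"

def changed_baselines(base_ref: str) -> list[str]:
    """Baseline files this PR touches, via git diff against the base branch.
    The baselines/ prefix is an assumption about repo layout, not sverklo's actual path."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"origin/{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [p for p in out.splitlines() if p.startswith("baselines/")]

def upsert_comment(repo: str, pr_number: int, body: str, token: str) -> None:
    """Post the results table once; on re-runs, update the same comment in place."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"}
    comments = requests.get(f"{API}/repos/{repo}/issues/{pr_number}/comments", headers=headers).json()
    existing = next((c for c in comments if MARKER in c.get("body", "")), None)
    payload = {"body": f"{MARKER}\n{body}"}
    if existing:
        r = requests.patch(f"{API}/repos/{repo}/issues/comments/{existing['id']}", headers=headers, json=payload)
    else:
        r = requests.post(f"{API}/repos/{repo}/issues/{pr_number}/comments", headers=headers, json=payload)
    r.raise_for_status()

if __name__ == "__main__":
    repo = os.environ["GITHUB_REPOSITORY"]            # provided by Actions, e.g. "owner/repo"
    pr_number = int(os.environ["PR_NUMBER"])          # assumed to be passed in by the workflow
    token = os.environ["GITHUB_TOKEN"]                # assumed: secrets.GITHUB_TOKEN exposed as env
    touched = changed_baselines(os.environ.get("GITHUB_BASE_REF", "main"))
    # Running the harness on the express dataset is omitted; build the table from its output instead.
    table = f"auto-bench ran for: {', '.join(touched) or 'no baseline changes detected'}"
    upsert_comment(repo, pr_number, table, token)
```

The marker-based upsert is what makes the re-run idempotent: the workflow never stacks a second comment on the PR.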

Total elapsed: ~24 hours from public ask to merged PR. Same arc shape: public surface → external maintainer (or contributor, in this case) responds → ship.

The pattern, written down

A public eval surface compresses iteration time across whatever axis you publish on. Bugs become trivial to file once the failure is reproducible. Friction becomes trivial to fix once a contributor names it.

The version with more careful boundaries:

  1. Public surface. Not "we benchmarked it internally." A URL, a methodology document, a reproducer command. Both axes need this: the bug-fix axis because Jake had to read the failure mode and trace it to his own parser; the friction axis because d3vilzwrld had to see the contribution flow before he could critique it.
  2. Honest losses. If the surface only publishes wins, the loop doesn't start. The lodash arc didn't fire because sverklo published its own wins; it fired because sverklo published the slice where jcodemunch beat it (P1 0.65 vs sverklo 0.45 in the original run). Jake had a reason to read the writeup. The auto-bench arc fired because the writeup explicitly named contribution friction (the May 4 follow-up post said "the harness shape is documented; submitting a baseline takes one PR"); d3vilzwrld noticed that "one PR" still required maintainer-side ceremony and named the gap.
  3. Reproducibility. Both axes need a way to verify the response. Jake's three releases were merge-able only because anyone could re-run the bench against v1.80.9 and confirm the P5 recall went 0.00 → 1.00. Auto-bench is merge-able only because anyone can open a baseline-touching PR and watch the workflow comment back with their numbers.
  4. Low-friction merge surface. If responding to bench findings requires the responder to submit a separate paper, write a separate eval, or wait for the maintainer to schedule a re-run, the loop stalls. Jake shipped three back-to-back releases because his project's CI was already fast and he had merge rights. Auto-bench shipped fast because GitHub Actions templates are well-documented and the bench harness already exposed BASELINES= and DATASETS= filters (sketched after this list).
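To make that last point concrete, here is roughly what an env-var filter like the one auto-bench leans on tends to look like; the baseline and dataset names below are placeholders, not the bench's real configuration.

```python
import os

def parse_filter(env_var: str, available: list[str]) -> list[str]:
    """Select a subset of runs from a comma-separated env var,
    e.g. BASELINES=jcodemunch-mcp,gitnexus. Unset or empty means run everything."""
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return available
    requested = {name.strip() for name in raw.split(",") if name.strip()}
    unknown = requested - set(available)
    if unknown:
        raise SystemExit(f"{env_var}: unknown entries {sorted(unknown)}; available: {available}")
    return [name for name in available if name in requested]

# Illustrative wiring; these lists are placeholders, not the real bench config.
baselines = parse_filter("BASELINES", ["sverklo", "jcodemunch-mcp", "gitnexus"])
datasets = parse_filter("DATASETS", ["express", "lodash", "sverklo"])
```

Filters like this are what let the CI job run only the express dataset on free-tier runners while local runs keep the full matrix.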

When this doesn't work

Don't generalize from two data points without naming the failure modes. The loop needs all four conditions above: if the eval surface is private, if it only publishes wins, if nobody outside the project can reproduce the numbers, or if responding requires maintainer ceremony, the arc doesn't fire. Both arcs in this post also landed with responsive counterparties; a category where the other maintainers never read the writeup, or can't merge quickly, won't compress anything.

Where the bench goes next

Jake's r/mcp comment included an aside that's been on my mind since:

The bigger opportunity here is the potential genesis of an "MCP Server Arena" on par with what the leading AI/LLM/Chatbot arenas provide.

This is the competing maintainer — the one who got beaten on P1 in the original run — publicly suggesting that sverklo's bench evolve into category-wide infrastructure. sverklo-bench is now spun out of the main repo as its own audit surface; auto-bench CI runs on every baseline-touching PR; the contribution flow is documented in CONTRIBUTING.md. The category-arena framing — per-tool brackets (code-intel, browser-automation, data-extraction, shell-execution) with their own metric sets — is genuinely possible from where we are now, and the responders Jake is gesturing at are the same people the existing bench-loop has already pulled to the table.

None of this is sverklo-the-product. The bench is becoming category infrastructure that doesn't depend on any one project succeeding. That's the right shape for an eval surface, and it's why publishing your losses is load-bearing rather than embarrassing.

If you maintain a tool in a category with no shared eval

Two pieces of free advice and one offer.

The first piece of advice: publish methodology before results. The bench-loop fires on the methodology, not the leaderboard. Numbers age fast; methodology compounds. Sverklo's bench page leads with "5 baselines, 90 hand-verified tasks, ±3-line tolerance on P1, set membership on P4/P5" before any F1 numbers appear, because the methodology is the part that survives the next release.
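For readers who want those two scoring rules pinned down, here is a small sketch. Reading P1 as a line-anchored lookup and P4/P5 as predicted-versus-expected sets is my interpretation of the bench page, so treat the function names and semantics as assumptions rather than the harness's actual code.

```python
def p1_hit(predicted_line: int, expected_line: int, tolerance: int = 3) -> bool:
    """P1-style scoring: a prediction counts if it lands within ±tolerance
    lines of the hand-verified answer."""
    return abs(predicted_line - expected_line) <= tolerance

def set_f1(predicted: set[str], expected: set[str]) -> float:
    """P4/P5-style scoring: set membership, reported as F1 over the
    predicted vs. expected sets. Empty sets score 0 in this sketch."""
    if not predicted or not expected:
        return 0.0
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```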

The second: publish losses on the same page as wins. If you can't write a paragraph titled "Where this benchmark says we lose," the bench isn't going to be load-bearing. Pull the slice where a baseline beats you and put it adjacent to the slice where you beat the baseline. That's what makes the eval credible enough to drive responses.

The offer: if you maintain a tool in a category that doesn't have a shared eval, I'm happy to share the harness shape that runs sverklo's bench. The methodology repo is at github.com/sverklo/sverklo-bench; auto-bench CI is at .github/workflows/auto-bench.yml. Both are MIT-licensed and the design choices are documented. Open an issue if you want help wiring it up to your category.

The artifacts

The concrete artifacts, for anyone retracing the loop: the methodology repo at github.com/sverklo/sverklo-bench, the auto-bench workflow at .github/workflows/auto-bench.yml in the main sverklo repo, the contribution flow in CONTRIBUTING.md, and the original writeup plus the r/mcp follow-up that triggered each arc. If you're running an MCP server in a category without a shared bench, the harness shape is the part that compounds. Numbers come and go; the methodology stays.