/* bench losses · 180-task / 6-codebase run · 2026-05-12 */

Sverklo loses 42 of 180 bench tasks. Here are all of them.

A loss is any task where sverklo's F1 was lower than at least one of naive-grep, smart-grep, jcodemunch-mcp, or gitnexus on the same task. Every loss is listed, sorted by F1 delta. No selection, no aggregation, no spin. If a benchmark only publishes wins, treat it as marketing, not measurement.

42                  tasks lost (of 180)
23%                 loss rate
17 / 11 / 10 / 4    P2 / P1 / P5 / P4 losses
5                   fixed in v0.20.19 (fastapi P5)

By category — what the pattern says

P2 reference finding — 17 losses, mostly to smart-grep

This is sverklo's largest loss surface and the most fixable. Smart-grep wins on tasks where the symbol appears in a flat-namespace context (lodash, express) because literal string match with definition-pattern filtering is genuinely the right tool there. Sverklo's hybrid (BM25 + embeddings + PageRank) over-thinks these. Two mitigation paths are under investigation: route flat-symbol P2 queries to a regex shortcut (sketched below), or weight the lexical channel higher on single-token queries.
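A minimal sketch of the regex-shortcut route, assuming a dispatch layer in front of the hybrid ranker. isFlatSymbol, findReferences, and hybridSearch are illustrative names, not sverklo's actual internals:

type Hit = { file: string; line: number; text: string };

// Assumed existing hybrid path (BM25 + embeddings + PageRank); declared
// here only so the sketch type-checks.
declare function hybridSearch(query: string, corpus: Map<string, string>): Hit[];

const IDENT = /^[A-Za-z_$][\w$]*$/;

// A single bare identifier with no dots, paths, or spaces: the
// flat-namespace case (lodash, express) where smart-grep currently wins.
function isFlatSymbol(query: string): boolean {
  return IDENT.test(query.trim());
}

function findReferences(query: string, corpus: Map<string, string>): Hit[] {
  if (!isFlatSymbol(query)) return hybridSearch(query, corpus);
  const word = new RegExp(`\\b${query.trim()}\\b`);
  const hits: Hit[] = [];
  for (const [file, src] of corpus) {
    src.split("\n").forEach((text, i) => {
      if (word.test(text)) hits.push({ file, line: i + 1, text });
    });
  }
  return hits;
}

The design question is only where to draw the routing line; everything past isFlatSymbol is what naive-grep already does.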

P1 definition lookup — 11 losses, split between fastapi (5) and lodash/sverklo/requests (6)

The fastapi P1 losses are correlated with the audit's Python-decorator gap (now fixed in v0.20.19, but P1 specifically is a separate retrieval path — these may persist). The other 6 are mostly multi-line class/function definitions where the chunker's line cap truncates the relevant span. The parser fix in v0.20.17 (paren-aware brace counting) recovered ~464 references on sverklo's own repo but doesn't address the chunk-truncation pattern these P1 tasks hit.
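For context, the shape of the v0.20.17 fix: balance braces to find the end of a definition, but ignore braces that appear inside an argument list (default-param arrow functions, inline object literals), so they can't close the outer block early. A simplified sketch, not the shipped parser; it also skips string and comment handling, which the real one cannot:

// Returns the index of the definition's closing brace, or -1.
function definitionEnd(src: string, start: number): number {
  let braces = 0;
  let parens = 0;
  let seenBody = false;
  for (let i = start; i < src.length; i++) {
    const ch = src[i];
    if (ch === "(") parens++;
    else if (ch === ")") parens = Math.max(0, parens - 1);
    else if (ch === "{" && parens === 0) { braces++; seenBody = true; }
    else if (ch === "}" && parens === 0 && seenBody) {
      braces--;
      if (braces === 0) return i;
    }
  }
  return -1; // unterminated: caller falls back to the line cap
}

// definitionEnd('function f(cb = () => { return 1 }) { body() }', 0)
// ignores the arrow-fn braces inside the parens and returns the final `}`.

The chunk-truncation pattern is orthogonal: even a correctly balanced span gets cut if it exceeds the chunker's line cap.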

P5 dead-code detection — 10 losses

Two patterns. The fastapi cluster (5 tasks, all scored 0.00) was the audit's DECORATOR_ENTRY_POINT regex being TS/NestJS-only — every FastAPI route method fell through as a false-positive orphan. Fixed in v0.20.19 with Python decorator coverage; the next bench rerun should move these from 0.00 to 1.00. The sverklo-self P5 losses (5 tasks, 0.00 vs jcodemunch's 0.04) are different: jcodemunch's barely-above-zero F1 wins only because sverklo's orphan list contains 10 items vs jcodemunch's empty. Both are wrong; the bench's empty-expected scoring favors the wronger one.
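Roughly the shape of the regex gap, with illustrative patterns; these are not the shipped DECORATOR_ENTRY_POINT rules:

// Illustrative only. A TS/NestJS-style rule matches `@Get('/path')` but
// never `@router.get("/path")`, so every FastAPI route handler looked
// unreferenced to the orphan pass.
const TS_ONLY = /@(Get|Post|Put|Delete|Patch)\s*\(/;         // NestJS-style
const PY_ROUTES = /@\w+\.(get|post|put|delete|patch)\s*\(/;  // FastAPI-style, the coverage added in v0.20.19

function isEntryPoint(line: string): boolean {
  return TS_ONLY.test(line) || PY_ROUTES.test(line);
}

// isEntryPoint('@router.get("/items/{id}")') -> true with the Python rule;
// without it, a false-positive orphan.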

P4 file dependencies — 4 losses

The smallest loss surface. Sverklo wins P4 overall by 44 points over the next baseline (0.84 vs smart-grep 0.40). The 4 losses are edge-case files with dynamic imports or test-only fixtures where the import graph is genuinely ambiguous. We don't have a near-term fix for these — they require runtime tracing, not static analysis.
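For illustration, the kind of dynamic import that defeats a static import graph (hypothetical file, not from the bench corpora):

// The specifier is computed at runtime, so static analysis cannot tell
// which files under ./plugins/ this module actually depends on.
async function loadPlugin(name: string) {
  const mod = await import(`./plugins/${name}.js`);
  return mod.default;
}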

Every loss, sorted by F1 delta

task                 cat  sv-F1  their-F1  baseline    Δ      sv-tok  their-tok  note
sverklo/sv-p1-01     P1   0.00   1.00      jcodemunch  +1.00  530     959
sverklo/sv-p1-03     P1   0.00   1.00      jcodemunch  +1.00  270     874
express/ex-p2-04     P2   0.00   1.00      smart-grep  +1.00  118     49
lodash/ld-p1-06      P1   0.00   1.00      smart-grep  +1.00  94      644
requests/rq-p1-08    P1   0.00   1.00      jcodemunch  +1.00  1,180   937
requests/rq-p1-09    P1   0.00   1.00      jcodemunch  +1.00  611     902
requests/rq-p1-10    P1   0.00   1.00      jcodemunch  +1.00  906     896
fastapi/fa-p1-01     P1   0.00   1.00      gitnexus    +1.00  1,036   1,244
fastapi/fa-p1-02     P1   0.00   1.00      gitnexus    +1.00  1,351   924
fastapi/fa-p1-06     P1   0.00   1.00      gitnexus    +1.00  1,319   1,000
fastapi/fa-p1-07     P1   0.00   1.00      gitnexus    +1.00  1,280   2,550
fastapi/fa-p1-08     P1   0.00   1.00      gitnexus    +1.00  943     298
fastapi/fa-p5-01     P5   0.00   1.00      naive-grep  +1.00  80      10         fixed in v0.20.19
fastapi/fa-p5-02     P5   0.00   1.00      naive-grep  +1.00  80      10         fixed in v0.20.19
fastapi/fa-p5-03     P5   0.00   1.00      naive-grep  +1.00  80      10         fixed in v0.20.19
fastapi/fa-p5-04     P5   0.00   1.00      naive-grep  +1.00  84      10         fixed in v0.20.19
fastapi/fa-p5-05     P5   0.00   1.00      naive-grep  +1.00  84      10         fixed in v0.20.19
sverklo/sv-p2-04     P2   0.00   0.67      jcodemunch  +0.67  145     71
express/ex-p2-09     P2   0.49   1.00      smart-grep  +0.51  1,193   886
lodash/ld-p2-08      P2   0.30   0.77      smart-grep  +0.47  1,132   2,053
express/ex-p2-01     P2   0.27   0.63      smart-grep  +0.36  530     701
express/ex-p2-06     P2   0.67   1.00      smart-grep  +0.33  152     74
express/ex-p2-08     P2   0.67   1.00      smart-grep  +0.33  123     27
express/ex-p4-03     P4   0.67   1.00      jcodemunch  +0.33  100     232
sverklo/sv-p4-04     P4   0.18   0.50      gitnexus    +0.32  162     39
express/ex-p2-10     P2   0.58   0.89      smart-grep  +0.31  666     472
lodash/ld-p2-07      P2   0.00   0.27      smart-grep  +0.27  395     1,058
lodash/ld-p4-05      P4   0.60   0.86      smart-grep  +0.26  122     1,281
lodash/ld-p2-10      P2   0.24   0.47      smart-grep  +0.23  670     995
lodash/ld-p2-03      P2   0.11   0.34      smart-grep  +0.23  773     6,782
lodash/ld-p4-04      P4   0.50   0.71      smart-grep  +0.21  99      121
lodash/ld-p2-04      P2   0.15   0.34      smart-grep  +0.19  689     5,791
sverklo/sv-p2-06     P2   0.00   0.18      jcodemunch  +0.18  284     240
lodash/ld-p2-05      P2   0.09   0.25      smart-grep  +0.16  671     6,310
express/ex-p2-03     P2   0.42   0.58      smart-grep  +0.16  405     450
express/ex-p2-02     P2   0.90   1.00      smart-grep  +0.10  217     263
fastapi/fa-p2-07     P2   0.00   0.05      gitnexus    +0.05  1,918   2,898
sverklo/sv-p5-01     P5   0.00   0.04      jcodemunch  +0.04  638     6,779
sverklo/sv-p5-02     P5   0.00   0.04      jcodemunch  +0.04  638     6,779
sverklo/sv-p5-03     P5   0.00   0.04      jcodemunch  +0.04  638     6,779
sverklo/sv-p5-04     P5   0.00   0.04      jcodemunch  +0.04  678     6,779
sverklo/sv-p5-05     P5   0.00   0.04      jcodemunch  +0.04  678     6,779

Source: benchmark/results/2026-05-12T17-11-45-853Z/raw.jsonl in github.com/sverklo/sverklo. The 138 tasks not listed here are tasks sverklo either won outright or tied. The bench harness writes raw, summary, and report files on every run — reproducible with npm run bench:quick from a clean clone.
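To rebuild this page's list from the raw artifact, a minimal filter sketch. The field names (task, category, f1, per-tool keys) are assumptions about raw.jsonl's schema, not the documented format:

// Sketch: recompute the loss list from raw.jsonl (schema assumed).
import { readFileSync } from "node:fs";

type Row = { task: string; category: string; f1: Record<string, number> };
const BASELINES = ["naive-grep", "smart-grep", "jcodemunch-mcp", "gitnexus"];

const rows: Row[] = readFileSync(
  "benchmark/results/2026-05-12T17-11-45-853Z/raw.jsonl", "utf8",
).split("\n").filter(Boolean).map((line) => JSON.parse(line));

const losses = rows
  .map((r) => {
    const best = Math.max(...BASELINES.map((b) => r.f1[b] ?? 0));
    return { ...r, delta: best - (r.f1["sverklo"] ?? 0) };
  })
  .filter((r) => r.delta > 0)          // strictly worse than some baseline; ties excluded
  .sort((a, b) => b.delta - a.delta);  // biggest delta first

console.log(`${losses.length} losses of ${rows.length}`);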

Why publish this

Most public benchmarks for AI tooling show wins only. The bench's wins live on /bench/ and /mcp/. This page is the half that doesn't get published anywhere else: where we're worse, by how much, and what we know about it.

Three reasons it has to exist:

1. Honesty. A wins-only benchmark is marketing, not measurement; publishing the losses is what makes the wins credible.
2. Prioritization. The loss table doubles as a roadmap: the P2 and P1 clusters above point directly at the next fixes.
3. Feedback. Published losses get bugs found and fixed, in sverklo and in the baselines alike.

Found a failure mode we should fix? Open an issue. Bring the task ID + expected behavior + what sverklo returned. The bench-as-feedback-loop pattern has worked twice already: once with jcodemunch-mcp shipping lodash fixes within 36 hours of the bench publication, and once with sverklo's own Python parser bug fixed within the same week the requests dataset landed.

Reproduce

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build
npm run bench:quick  # ~30-45 min, 180 tasks × 5 baselines
# Outputs: benchmark/results/<timestamp>/{raw.jsonl, summary.json, report.md}

Raw artifact for this page: 2026-05-12T17-11-45-853Z. Bench methodology: METHODOLOGY.md.