/* bench losses · 180-task / 6-codebase run · May 2026 · open loss ledger */

Open benchmark loss ledger for the 180-task public run.

This ledger lists every task where sverklo's F1 was lower than at least one of naive-grep, smart-grep, jcodemunch-mcp, or gitnexus on the same task. Every loss is shown, sorted by F1 delta. No selection, no aggregation, no spin. If a benchmark only publishes wins, treat it as marketing, not measurement.

2026-05-13 rerun update

The fastapi P5 cluster (5 tasks) and the sverklo-self P5 cluster (5 tasks) are closed: both now tie the field at F1 0.83. v0.20.19 shipped the Python-decorator audit fix, and the rerun moved those 10 tasks from "sverklo loses" to "sverklo ties or wins." Loss count on the May 13 run is ~22, not 42. The ~22 still-open losses are concentrated in P2 (smart-grep wins) and P1 (mixed). Headline overall F1 moved from 0.56 to 0.58. The full table on this page reflects the May 12 run (kept as historical record); the May 13 report is in benchmark/results/2026-05-13T18-32-20-478Z/.

42: historical May 12 losses (of 180)
10: closed in May 13 rerun (P5 cluster)
~22: still-open losses (P2 + P1)
17 / 11 / 10 / 4: P2 / P1 / P5 / P4 (May 12)

By category — what the pattern says

P2 reference finding — 17 losses, mostly to smart-grep

This is sverklo's largest loss surface and the most fixable. Smart-grep wins on tasks where the symbol appears in a flat-namespace context (lodash, express) because literal string match with definition-pattern filtering is genuinely the right tool. Sverklo's hybrid (BM25 + embeddings + PageRank) overthinks these. Mitigation paths under investigation: route flat-symbol P2 queries to a regex shortcut, or weight the lexical channel higher on single-token queries.
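As a rough picture of those two mitigation paths, the sketch below boosts the lexical channel when the query is a single bare identifier and builds a definition-pattern regex of the kind a grep-style shortcut could use. It is an illustration under assumptions, not sverklo's implementation: the channel weights, the boost factor, and the names `channel_weights` and `definition_pattern` are hypothetical, and the regex covers only a few JavaScript definition forms.

```python
import re

# Hypothetical channel weights; the real values used by sverklo are not given here.
DEFAULT_WEIGHTS = {"lexical": 1.0, "embedding": 1.0, "pagerank": 1.0}
LEXICAL_BOOST = 3.0  # assumed boost factor, chosen only for illustration

# A bare identifier: one token, no whitespace, dots, or path separators.
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def is_single_token_symbol(query):
    """True when the whole query is one identifier-like token."""
    return IDENTIFIER.fullmatch(query.strip()) is not None


def channel_weights(query):
    """Per-channel weights, with the lexical channel boosted for bare symbols."""
    weights = dict(DEFAULT_WEIGHTS)
    if is_single_token_symbol(query):
        weights["lexical"] *= LEXICAL_BOOST
    return weights


def definition_pattern(symbol):
    """Regex shortcut matching a few common JavaScript definition forms."""
    name = re.escape(symbol)
    return re.compile(
        rf"(?:function\s+{name}\b"            # function debounce(...)
        rf"|(?:const|let|var)\s+{name}\s*="   # var debounce = ...
        rf"|\b{name}\s*[:=]\s*function\b)"    # debounce: function (...)
    )
```

A query such as `debounce` would take the boosted path, while a multi-word question keeps the default weights and goes through the full hybrid ranking.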

P1 definition lookup — 11 losses, split between fastapi (5) and lodash/sverklo/requests (6)

The fastapi P1 losses are correlated with the audit's Python-decorator gap (now fixed in v0.20.19, but P1 specifically is a separate retrieval path — these may persist). The other 6 are mostly multi-line class/function definitions where the chunker's line cap truncates the relevant span. The parser fix in v0.20.17 (paren-aware brace counting) recovered ~464 references on sverklo's own repo but doesn't address the chunk-truncation pattern these P1 tasks hit.
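The chunk-truncation pattern can be illustrated with a toy chunker. This is a sketch under stated assumptions, not sverklo's chunker: `definition_end`, `chunk`, `is_truncated`, and the cap values are hypothetical, and the brace counting deliberately ignores strings and comments.

```python
def definition_end(lines, start):
    """Index of the last line of a brace-delimited definition starting at `start`.

    Naive on purpose: braces inside strings or comments are counted too.
    """
    depth = 0
    seen_open = False
    for i in range(start, len(lines)):
        for ch in lines[i]:
            if ch == "{":
                depth += 1
                seen_open = True
            elif ch == "}":
                depth -= 1
        if seen_open and depth == 0:
            return i
    return len(lines) - 1


def chunk(lines, start, line_cap):
    """Return at most `line_cap` lines, even when the definition is longer."""
    end = min(definition_end(lines, start), start + line_cap - 1)
    return lines[start:end + 1]


def is_truncated(lines, start, line_cap):
    """True when the definition spans more lines than the cap allows."""
    return definition_end(lines, start) - start + 1 > line_cap
```

When `is_truncated` is true, the closing part of the definition never reaches the index, so a definition-lookup query can miss the span it is scored against.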

P5 dead-code detection — 10 losses (now closed on 2026-05-13)

Two patterns, both now resolved. The fastapi cluster (5 tasks, all scored 0.00 on May 12) was the audit's DECORATOR_ENTRY_POINT regex being TS/NestJS-only — every FastAPI route method fell through as a false-positive orphan. v0.20.19 added Python decorator coverage; the 2026-05-13 rerun confirms these tasks now score 1.00. The sverklo-self P5 losses (5 tasks where sverklo scored 0.00 vs jcodemunch's 0.04) are also closed — sverklo now ties the field on P5 at 0.83 overall. The interesting P5 story is now jcodemunch: after the sverklo-bench#3 baseline refresh removed the max_results=100 cap, jcodemunch's P5 recall jumped to 1.00 across all 30 tasks (precision 0.34, F1 0.34, tokens 10,172). It's now the only baseline with zero false negatives on dead-code.
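The fastapi failure mode is easy to reproduce in miniature. In the sketch below, a function with no references is reported as an orphan unless one of its decorators marks it as an entry point. The two patterns are hypothetical stand-ins, since the actual DECORATOR_ENTRY_POINT regex is not shown on this page, and the `(name, decorators, reference_count)` tuple shape is an assumption made for the example.

```python
import re

# Hypothetical stand-ins for the audit's entry-point patterns.
NEST_STYLE = re.compile(r"^\s*@(?:Get|Post|Put|Patch|Delete|Controller)\s*\(")
PYTHON_ROUTE_STYLE = re.compile(
    r"^\s*@\w+(?:\.\w+)*\.(?:get|post|put|patch|delete|websocket)\s*\("
)


def is_entry_point(decorator_line, include_python=True):
    """True when a decorator line marks the function as externally reachable."""
    if NEST_STYLE.match(decorator_line):
        return True
    return include_python and PYTHON_ROUTE_STYLE.match(decorator_line) is not None


def orphan_candidates(functions, include_python=True):
    """Names of unreferenced functions that are not decorator entry points.

    `functions` is a list of (name, decorator_lines, reference_count) tuples.
    """
    return [
        name
        for name, decorators, refs in functions
        if refs == 0
        and not any(is_entry_point(d, include_python) for d in decorators)
    ]
```

With `include_python=False`, a route handler decorated with `@app.get("/items")` is flagged as dead code, which mirrors the false-positive orphans described above. With Python coverage enabled, only genuinely unreferenced helpers remain.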

P4 file dependencies — 4 losses

The smallest loss surface. Sverklo wins P4 overall by 44 points over the next baseline (0.84 vs smart-grep 0.40). The 4 losses are edge-case files with dynamic imports or test-only fixtures where the import graph is genuinely ambiguous. We don't have a near-term fix for these — they require runtime tracing, not static analysis.

Every loss, sorted by F1 delta

task               cat  sv-F1  their-F1  baseline    Δ      sv-tok  their-tok  note
sverklo/sv-p1-01   P1   0.00   1.00      jcodemunch  +1.00     530        959
sverklo/sv-p1-03   P1   0.00   1.00      jcodemunch  +1.00     270        874
express/ex-p2-04   P2   0.00   1.00      smart-grep  +1.00        11849
lodash/ld-p1-06    P1   0.00   1.00      smart-grep  +1.00        94644
requests/rq-p1-08  P1   0.00   1.00      jcodemunch  +1.00   1,180        937
requests/rq-p1-09  P1   0.00   1.00      jcodemunch  +1.00     611        902
requests/rq-p1-10  P1   0.00   1.00      jcodemunch  +1.00     906        896
fastapi/fa-p1-01   P1   0.00   1.00      gitnexus    +1.00   1,036      1,244
fastapi/fa-p1-02   P1   0.00   1.00      gitnexus    +1.00   1,351        924
fastapi/fa-p1-06   P1   0.00   1.00      gitnexus    +1.00   1,319      1,000
fastapi/fa-p1-07   P1   0.00   1.00      gitnexus    +1.00   1,280      2,550
fastapi/fa-p1-08   P1   0.00   1.00      gitnexus    +1.00     943        298
fastapi/fa-p5-01   P5   0.00   1.00      naive-grep  +1.00        8010         fixed in v0.20.19
fastapi/fa-p5-02   P5   0.00   1.00      naive-grep  +1.00        8010         fixed in v0.20.19
fastapi/fa-p5-03   P5   0.00   1.00      naive-grep  +1.00        8010         fixed in v0.20.19
fastapi/fa-p5-04   P5   0.00   1.00      naive-grep  +1.00        8410         fixed in v0.20.19
fastapi/fa-p5-05   P5   0.00   1.00      naive-grep  +1.00        8410         fixed in v0.20.19
sverklo/sv-p2-04   P2   0.00   0.67      jcodemunch  +0.67        14571
express/ex-p2-09   P2   0.49   1.00      smart-grep  +0.51   1,193        886
lodash/ld-p2-08    P2   0.30   0.77      smart-grep  +0.47   1,132      2,053
express/ex-p2-01   P2   0.27   0.63      smart-grep  +0.36     530        701
express/ex-p2-06   P2   0.67   1.00      smart-grep  +0.33        15274
express/ex-p2-08   P2   0.67   1.00      smart-grep  +0.33        12327
express/ex-p4-03   P4   0.67   1.00      jcodemunch  +0.33     100        232
sverklo/sv-p4-04   P4   0.18   0.50      gitnexus    +0.32        16239
express/ex-p2-10   P2   0.58   0.89      smart-grep  +0.31     666        472
lodash/ld-p2-07    P2   0.00   0.27      smart-grep  +0.27     395      1,058
lodash/ld-p4-05    P4   0.60   0.86      smart-grep  +0.26     122      1,281
lodash/ld-p2-10    P2   0.24   0.47      smart-grep  +0.23     670        995
lodash/ld-p2-03    P2   0.11   0.34      smart-grep  +0.23     773      6,782
lodash/ld-p4-04    P4   0.50   0.71      smart-grep  +0.21        99121
lodash/ld-p2-04    P2   0.15   0.34      smart-grep  +0.19     689      5,791
sverklo/sv-p2-06   P2   0.00   0.18      jcodemunch  +0.18     284        240
lodash/ld-p2-05    P2   0.09   0.25      smart-grep  +0.16     671      6,310
express/ex-p2-03   P2   0.42   0.58      smart-grep  +0.16     405        450
express/ex-p2-02   P2   0.90   1.00      smart-grep  +0.10     217        263
fastapi/fa-p2-07   P2   0.00   0.05      gitnexus    +0.05   1,918      2,898
sverklo/sv-p5-01   P5   0.00   0.04      jcodemunch  +0.04     638      6,779
sverklo/sv-p5-02   P5   0.00   0.04      jcodemunch  +0.04     638      6,779
sverklo/sv-p5-03   P5   0.00   0.04      jcodemunch  +0.04     638      6,779
sverklo/sv-p5-04   P5   0.00   0.04      jcodemunch  +0.04     678      6,779
sverklo/sv-p5-05   P5   0.00   0.04      jcodemunch  +0.04     678      6,779

Source: benchmark/results/2026-05-12T17-11-45-853Z/raw.jsonl in github.com/sverklo/sverklo. The 138 tasks not listed here are tasks sverklo either won outright or tied. The bench harness writes raw, summary, and report files on every run — reproducible with npm run bench:quick from a clean clone.
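For readers who want to rebuild the ledger from the raw file, the loss rule used on this page (sverklo's F1 below at least one baseline on the same task, sorted by F1 delta) can be sketched as follows. The record fields `task`, `baseline`, and `f1` are assumptions made for illustration, since the raw.jsonl schema is not described here; adjust them to whatever the file actually contains.

```python
import json


def loss_ledger(jsonl_lines, subject="sverklo"):
    """Build loss rows from JSON Lines records.

    Assumes one record per (task, baseline) pair with hypothetical
    fields "task", "baseline", and "f1".
    """
    by_task = {}
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        by_task.setdefault(row["task"], {})[row["baseline"]] = row["f1"]

    losses = []
    for task, scores in by_task.items():
        if subject not in scores:
            continue
        rivals = {name: f1 for name, f1 in scores.items() if name != subject}
        if not rivals:
            continue
        best = max(rivals, key=rivals.get)  # strongest baseline on this task
        delta = round(rivals[best] - scores[subject], 2)
        if delta > 0:  # ties and wins are not losses
            losses.append({
                "task": task,
                "sv_f1": scores[subject],
                "their_f1": rivals[best],
                "baseline": best,
                "delta": delta,
            })

    # Largest gap first, matching the ordering of the table above.
    losses.sort(key=lambda r: r["delta"], reverse=True)
    return losses
```

Passing an open file handle works because the function only iterates over lines, for example `loss_ledger(open(path, encoding="utf-8"))`.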

Why publish this

Most public benchmarks for AI tooling show wins only. The bench's wins live on /bench/ and /mcp/. This page is the half that doesn't get published anywhere else: where we're worse, by how much, and what we know about it.

Three reasons it has to exist:

Found a failure mode we should fix? Open an issue. Bring the task ID + expected behavior + what sverklo returned. The bench-as-feedback-loop pattern has worked twice already — once with jcodemunch-mcp shipping lodash fixes inside 36h of the bench publication, once with sverklo's own Python parser bug fixed within the same week the requests dataset landed.

Reproduce

git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build
npm run bench:quick  # ~30-45 min, 180 tasks × 5 baselines
# Outputs: benchmark/results/<timestamp>/{raw.jsonl, summary.json, report.md}

Raw artifact for this page: 2026-05-12T17-11-45-853Z. Bench methodology: METHODOLOGY.md.