By category — what the pattern says

P2 reference finding — 17 losses, mostly to smart-grep

This is sverklo's largest loss surface and the most fixable. Smart-grep wins on tasks where the symbol appears in a flat-namespace context (lodash, express) because a literal string match with definition-pattern filtering is genuinely the right tool. Sverklo's hybrid ranking (BM25 + embeddings + PageRank) over-thinks these. Mitigation paths under investigation: route flat-symbol P2 queries to a regex shortcut, or weight the lexical channel higher on single-token queries.
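As a rough illustration of the two mitigation paths, here is a minimal Python sketch that detects single-symbol queries and either hands them a whole-word regex or boosts the lexical channel. Everything in it is an assumption made for illustration: the channel names, the weight values, and the functions `is_single_symbol`, `reference_pattern`, and `plan_query` are hypothetical and are not sverklo's actual API or configuration.

```python
import re

# Hypothetical channel weights; the real values (if any) are not published on this page.
DEFAULT_WEIGHTS = {"lexical": 1.0, "embedding": 1.0, "pagerank": 1.0}
SINGLE_TOKEN_WEIGHTS = {"lexical": 3.0, "embedding": 0.5, "pagerank": 0.5}

# A bare identifier such as `debounce` or `Router`, optionally dotted (`_.map`, `app.use`).
IDENTIFIER = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*")


def is_single_symbol(query: str) -> bool:
    """True when the whole query is one identifier with no natural-language words."""
    return IDENTIFIER.fullmatch(query.strip()) is not None


def reference_pattern(symbol: str) -> re.Pattern:
    """Whole-word literal match for the symbol, the core of a grep-style shortcut."""
    return re.compile(r"(?<![\w$])" + re.escape(symbol) + r"(?![\w$])")


def plan_query(query: str) -> dict:
    """Pick a retrieval plan: a regex shortcut for bare symbols, hybrid otherwise."""
    if is_single_symbol(query):
        return {
            "route": "regex-shortcut",
            "pattern": reference_pattern(query.strip()),
            "weights": SINGLE_TOKEN_WEIGHTS,
        }
    return {"route": "hybrid", "pattern": None, "weights": DEFAULT_WEIGHTS}
```

In this sketch `plan_query("debounce")` takes the shortcut route, while `plan_query("where is debounce used")` stays on the hybrid path. The two mitigations are combined in one plan only for brevity; they could just as well be tried separately.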

P1 definition lookup — 11 losses, split between fastapi (5) and lodash/sverklo/requests (6)

The fastapi P1 losses correlate with the audit's Python-decorator gap. That gap is fixed in v0.20.19, but P1 uses a separate retrieval path, so these losses may persist. The other 6 are mostly multi-line class/function definitions where the chunker's line cap truncates the relevant span. The parser fix in v0.20.17 (paren-aware brace counting) recovered ~464 references on sverklo's own repo but doesn't address the chunk-truncation pattern these P1 tasks hit.
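The truncation pattern is easy to reproduce in miniature. The sketch below is not sverklo's chunker: the line cap of 4 is made up, and Python's standard `ast` module stands in for whatever parser a real indexer would use. It only shows how a fixed line cap can cut a multi-line definition before its body, while a span taken from the syntax tree covers the whole definition.

```python
import ast

SOURCE = '''\
def build_response(
    request,
    status,
    headers,
):
    body = render(request)
    return Response(body, status, headers)
'''


def capped_chunk(lines, start, line_cap):
    """Fixed-size chunk: at most `line_cap` lines from `start` (0-based)."""
    return lines[start:start + line_cap]


def definition_chunk(source, name):
    """Chunk covering a whole def/class, using the AST node's end_lineno (Python 3.8+)."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, kinds) and node.name == name:
            return source.splitlines()[node.lineno - 1:node.end_lineno]
    return []


lines = SOURCE.splitlines()
truncated = capped_chunk(lines, 0, line_cap=4)     # stops inside the signature
full = definition_chunk(SOURCE, "build_response")  # signature plus body
```

Here `truncated` ends inside the parameter list and never reaches the `return`, which is the kind of span a definition-lookup task would need.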

P5 dead-code detection — 10 losses

Two patterns. The fastapi cluster (5 tasks, all scored 0.00) was caused by the audit's DECORATOR_ENTRY_POINT regex being TS/NestJS-only, so every FastAPI route method fell through as a false-positive orphan. Fixed in v0.20.19 with Python decorator coverage; the next bench rerun should move these from 0.00 to 1.00. The sverklo-self P5 losses (5 tasks, 0.00 vs jcodemunch's 0.04) are different: jcodemunch's barely-above-zero F1 wins only because sverklo's orphan list contains 10 items vs jcodemunch's empty list. Both are wrong; the bench's empty-expected scoring favors the wronger one.
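The fastapi failure mode can be sketched in a few lines of Python. Both regexes below are hypothetical stand-ins written for this example (the actual DECORATOR_ENTRY_POINT pattern and the v0.20.19 change are not reproduced here), and `orphans` is a toy orphan detector that only looks at caller counts and one decorator line per function.

```python
import re

# Hypothetical stand-in for a TS/NestJS-only pattern: `@Get(...)`, `@Post(...)`, etc.
NEST_ENTRY_POINT = re.compile(r"^\s*@(Get|Post|Put|Patch|Delete|Controller)\(")

# Assumed Python coverage: `@app.get(...)`, `@router.post(...)` and similar route decorators.
PY_ENTRY_POINT = re.compile(
    r"^\s*@\w+(\.\w+)*\.(get|post|put|patch|delete|head|options|websocket|api_route)\("
)


def is_entry_point(decorator_line, patterns):
    """True if any entry-point pattern matches the decorator line."""
    return any(p.search(decorator_line) for p in patterns)


def orphans(functions, patterns):
    """functions: list of (name, decorator_line_or_None, caller_count)."""
    return [
        name
        for name, decorator, callers in functions
        if callers == 0 and not (decorator and is_entry_point(decorator, patterns))
    ]


FUNCTIONS = [
    ("read_item", '@app.get("/items/{item_id}")', 0),  # route, invoked by the framework
    ("create_user", '@router.post("/users")', 0),      # route on a router object
    ("unused_helper", None, 0),                        # genuinely unreferenced
    ("format_name", None, 3),                          # has callers
]

before = orphans(FUNCTIONS, [NEST_ENTRY_POINT])
after = orphans(FUNCTIONS, [NEST_ENTRY_POINT, PY_ENTRY_POINT])
```

With only the NestJS-style pattern, `before` reports both route handlers as orphans alongside `unused_helper`; with Python decorator coverage added, `after` contains only `unused_helper`.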

P4 file dependencies — 4 losses

The smallest loss surface. Sverklo wins P4 overall by 44 points over the next baseline (0.84 vs smart-grep 0.40). The 4 losses are edge-case files with dynamic imports or test-only fixtures where the import graph is genuinely ambiguous. We don't have a near-term fix for these — they require runtime tracing, not static analysis.
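To make the static-analysis limit concrete, here is a small sketch using Python's standard `ast` module. It is an analogy rather than sverklo's import-graph code (the P4 losses above are in JavaScript/TypeScript repos), and `classify_imports` is a hypothetical helper: it separates imports whose target is visible in the source from call sites whose target only exists at runtime.

```python
import ast

SOURCE = '''\
import os
from json import loads
import importlib

plugin = importlib.import_module(os.environ["PLUGIN"])
helper = __import__("helper_" + "v2")
'''


def classify_imports(source):
    """Return (statically resolvable module names, line numbers of dynamic imports)."""
    static, dynamic = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            static.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            static.append(node.module or "." * node.level)
        elif isinstance(node, ast.Call):
            func = node.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if name in ("import_module", "__import__"):
                arg = node.args[0] if node.args else None
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    static.append(arg.value)     # literal target: still resolvable
                else:
                    dynamic.append(node.lineno)  # target only known at runtime
    return sorted(static), sorted(dynamic)


static, dynamic = classify_imports(SOURCE)
```

The two calls on lines 5 and 6 of `SOURCE` end up in `dynamic`: a static pass can see that an import happens there, but not which module it resolves to.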

Every loss, sorted by F1 delta

task	cat	sv-F1	their-F1	baseline	Δ	sv-tok	their-tok	note
sverklo/sv-p1-01	P1	0.00	1.00	jcodemunch	+1.00	530	959
sverklo/sv-p1-03	P1	0.00	1.00	jcodemunch	+1.00	270	874
express/ex-p2-04	P2	0.00	1.00	smart-grep	+1.00	118	49
lodash/ld-p1-06	P1	0.00	1.00	smart-grep	+1.00	946	44
requests/rq-p1-08	P1	0.00	1.00	jcodemunch	+1.00	1,180	937
requests/rq-p1-09	P1	0.00	1.00	jcodemunch	+1.00	611	902
requests/rq-p1-10	P1	0.00	1.00	jcodemunch	+1.00	906	896
fastapi/fa-p1-01	P1	0.00	1.00	gitnexus	+1.00	1,036	1,244
fastapi/fa-p1-02	P1	0.00	1.00	gitnexus	+1.00	1,351	924
fastapi/fa-p1-06	P1	0.00	1.00	gitnexus	+1.00	1,319	1,000
fastapi/fa-p1-07	P1	0.00	1.00	gitnexus	+1.00	1,280	2,550
fastapi/fa-p1-08	P1	0.00	1.00	gitnexus	+1.00	943	298
fastapi/fa-p5-01	P5	0.00	1.00	naive-grep	+1.00	801	0	fixed in v0.20.19
fastapi/fa-p5-02	P5	0.00	1.00	naive-grep	+1.00	801	0	fixed in v0.20.19
fastapi/fa-p5-03	P5	0.00	1.00	naive-grep	+1.00	801	0	fixed in v0.20.19
fastapi/fa-p5-04	P5	0.00	1.00	naive-grep	+1.00	841	0	fixed in v0.20.19
fastapi/fa-p5-05	P5	0.00	1.00	naive-grep	+1.00	841	0	fixed in v0.20.19
sverklo/sv-p2-04	P2	0.00	0.67	jcodemunch	+0.67	145	71
express/ex-p2-09	P2	0.49	1.00	smart-grep	+0.51	1,193	886
lodash/ld-p2-08	P2	0.30	0.77	smart-grep	+0.47	1,132	2,053
express/ex-p2-01	P2	0.27	0.63	smart-grep	+0.36	530	701
express/ex-p2-06	P2	0.67	1.00	smart-grep	+0.33	152	74
express/ex-p2-08	P2	0.67	1.00	smart-grep	+0.33	123	27
express/ex-p4-03	P4	0.67	1.00	jcodemunch	+0.33	100	232
sverklo/sv-p4-04	P4	0.18	0.50	gitnexus	+0.32	162	39
express/ex-p2-10	P2	0.58	0.89	smart-grep	+0.31	666	472
lodash/ld-p2-07	P2	0.00	0.27	smart-grep	+0.27	395	1,058
lodash/ld-p4-05	P4	0.60	0.86	smart-grep	+0.26	122	1,281
lodash/ld-p2-10	P2	0.24	0.47	smart-grep	+0.23	670	995
lodash/ld-p2-03	P2	0.11	0.34	smart-grep	+0.23	773	6,782
lodash/ld-p4-04	P4	0.50	0.71	smart-grep	+0.21	99	121
lodash/ld-p2-04	P2	0.15	0.34	smart-grep	+0.19	689	5,791
sverklo/sv-p2-06	P2	0.00	0.18	jcodemunch	+0.18	284	240
lodash/ld-p2-05	P2	0.09	0.25	smart-grep	+0.16	671	6,310
express/ex-p2-03	P2	0.42	0.58	smart-grep	+0.16	405	450
express/ex-p2-02	P2	0.90	1.00	smart-grep	+0.10	217	263
fastapi/fa-p2-07	P2	0.00	0.05	gitnexus	+0.05	1,918	2,898
sverklo/sv-p5-01	P5	0.00	0.04	jcodemunch	+0.04	638	6,779
sverklo/sv-p5-02	P5	0.00	0.04	jcodemunch	+0.04	638	6,779
sverklo/sv-p5-03	P5	0.00	0.04	jcodemunch	+0.04	638	6,779
sverklo/sv-p5-04	P5	0.00	0.04	jcodemunch	+0.04	678	6,779
sverklo/sv-p5-05	P5	0.00	0.04	jcodemunch	+0.04	678	6,779

Source: benchmark/results/2026-05-12T17-11-45-853Z/raw.jsonl in github.com/sverklo/sverklo. The 138 tasks not listed here are tasks sverklo either won outright or tied. The bench harness writes raw, summary, and report files on every run — reproducible with npm run bench:quick from a clean clone.
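The per-category counts can be cross-checked against the table itself. The sketch below parses a handful of rows copied from the table (not the full set, and not raw.jsonl, whose schema is not described on this page), recomputes each delta as their-F1 minus sv-F1, and tallies losses by category and by winning baseline. The helper name `load_losses` is made up for this example.

```python
import csv
import io
from collections import Counter

# Five rows copied from the table above, tab-separated, same column order.
ROWS = (
    "task\tcat\tsv-F1\ttheir-F1\tbaseline\tΔ\tsv-tok\ttheir-tok\tnote\n"
    "express/ex-p2-04\tP2\t0.00\t1.00\tsmart-grep\t+1.00\t118\t49\n"
    "requests/rq-p1-08\tP1\t0.00\t1.00\tjcodemunch\t+1.00\t1,180\t937\n"
    "fastapi/fa-p5-01\tP5\t0.00\t1.00\tnaive-grep\t+1.00\t801\t0\tfixed in v0.20.19\n"
    "lodash/ld-p4-05\tP4\t0.60\t0.86\tsmart-grep\t+0.26\t122\t1,281\n"
    "express/ex-p2-09\tP2\t0.49\t1.00\tsmart-grep\t+0.51\t1,193\t886\n"
)


def load_losses(text):
    """Parse table rows; token counts use thousands separators, so strip commas."""
    losses = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        sv, their = float(row["sv-F1"]), float(row["their-F1"])
        losses.append({
            "task": row["task"],
            "cat": row["cat"],
            "baseline": row["baseline"],
            "delta": round(their - sv, 2),  # recomputed, not read from the Δ column
            "sv_tok": int(row["sv-tok"].replace(",", "")),
            "their_tok": int(row["their-tok"].replace(",", "")),
        })
    return losses


losses = load_losses(ROWS)
by_category = Counter(loss["cat"] for loss in losses)
by_baseline = Counter(loss["baseline"] for loss in losses)
```

Pasting all 42 rows into `ROWS` should reproduce the section headings above: 17 P2, 11 P1, 10 P5, and 4 P4 losses.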

Why publish this

Most public benchmarks for AI tooling show wins only. The bench's wins live on /bench/ and /mcp/. This page is the half that doesn't get published anywhere else: where we're worse, by how much, and what we know about it.

Three reasons it has to exist:

It tells you what sverklo is bad at before you install. If your work is mostly grep-shaped, flat-namespace, single-symbol P2 reference queries, sverklo loses 17 of those 60 tasks, 14 of them to smart-grep. That's a real signal: pick the right tool.
It separates fixed-by-shipping from fundamental. The 5 fastapi P5 tasks scored 0.00 because of a Python-decorator gap in the audit's orphan detection. v0.20.19 ships the fix, and the next bench run should move those to 1.00. P2 smart-grep losses are different: they're architectural, not bugs.
It is the only thing keeping the wins credible. "F1 0.56 leader on a 180-task bench" reads as marketing without this page. With this page, the wins are reviewable.

Found a failure mode we should fix? Open an issue. Bring the task ID, the expected behavior, and what sverklo returned. The bench-as-feedback-loop pattern has worked twice already: once with jcodemunch-mcp shipping lodash fixes within 36 hours of the bench's publication, and once with sverklo's own Python parser bug fixed in the same week the requests dataset landed.

Sverklo loses 42 of 180 bench tasks. Here are all of them.