Every task where sverklo's F1 was lower than at least one of naive-grep, smart-grep, jcodemunch-mcp, or gitnexus. Every loss, sorted by F1 delta, largest first. No selection, no aggregation, no spin. If a benchmark only publishes wins, treat it as marketing, not measurement.
P2 is sverklo's largest loss surface and the most fixable. Smart-grep wins on tasks where the symbol lives in a flat namespace (lodash, express), because literal string matching with definition-pattern filtering is genuinely the right tool there. Sverklo's hybrid retrieval (BM25 + embeddings + PageRank) overthinks these queries. Two mitigation paths are under investigation: route flat-symbol P2 queries to a regex shortcut, or weight the lexical channel higher on single-token queries. A sketch of the first path follows.
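A minimal sketch of what that routing could look like, assuming a hypothetical router in front of the hybrid pipeline; the names (`isFlatSymbolQuery`, `definitionPattern`, `route`) are illustrative, not sverklo's actual API:

```typescript
// Hypothetical pre-retrieval router (illustrative, not sverklo's API):
// single-token, identifier-shaped queries skip the BM25 + embeddings +
// PageRank pipeline and go straight to a definition-pattern regex.

const IDENTIFIER = /^[A-Za-z_$][\w$]*$/;

function isFlatSymbolQuery(query: string): boolean {
  // One token that looks like a code identifier: literal match likely wins.
  const tokens = query.trim().split(/\s+/);
  return tokens.length === 1 && IDENTIFIER.test(tokens[0]);
}

function definitionPattern(symbol: string): RegExp {
  // Definition-site patterns for JS-style code; escape the symbol first.
  const s = symbol.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(
    `\\b(?:function\\s+${s}\\b|class\\s+${s}\\b|(?:const|let|var)\\s+${s}\\s*=)`
  );
}

function route(query: string): "regex-shortcut" | "hybrid" {
  // The shortcut path would then scan files with definitionPattern(query).
  return isFlatSymbolQuery(query) ? "regex-shortcut" : "hybrid";
}
```

The second path (up-weighting the lexical channel on single-token queries) would live in the score-fusion step instead of in front of it.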
The fastapi P1 losses correlate with the audit's Python-decorator gap (fixed in v0.20.19, but P1 specifically runs on a separate retrieval path, so these may persist). The other six are mostly multi-line class/function definitions where the chunker's line cap truncates the relevant span. The parser fix in v0.20.17 (paren-aware brace counting, sketched below) recovered ~464 references on sverklo's own repo but doesn't address the chunk-truncation pattern these P1 tasks hit.
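For readers unfamiliar with the term, here is the general shape of paren-aware brace counting, as a sketch of the technique rather than sverklo's parser. A naive brace counter treats the `{ }` inside a default parameter as the function body and closes the definition in the middle of the signature; tracking paren depth and counting braces only at depth zero avoids that:

```typescript
// Sketch of paren-aware brace counting (the general technique, not
// sverklo's implementation). A naive counter breaks on signatures like
//   function f(opts = { retries: 3 }) { ... }
// because the braces in the default parameter look like the body.

function definitionEnd(src: string, start: number): number {
  let parenDepth = 0; // ( ) nesting; braces inside don't open the body
  let braceDepth = 0; // { } nesting, counted only at paren depth zero
  let sawBody = false;

  for (let i = start; i < src.length; i++) {
    const ch = src[i];
    if (ch === "(") parenDepth++;
    else if (ch === ")") parenDepth = Math.max(0, parenDepth - 1);
    else if (parenDepth === 0 && ch === "{") { braceDepth++; sawBody = true; }
    else if (parenDepth === 0 && ch === "}") {
      braceDepth--;
      if (sawBody && braceDepth === 0) return i; // real body closed here
    }
  }
  return -1; // unterminated definition
}
```

A production version also has to skip string literals and comments, which the sketch omits; and as noted above, no counter helps when the chunker's line cap cuts the span before the closing brace is ever seen.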
The P5 losses split into two patterns. The fastapi cluster (5 tasks, all scored 0.00) traces to the audit's DECORATOR_ENTRY_POINT regex being TS/NestJS-only: every FastAPI route method fell through as a false-positive orphan (illustrated below). v0.20.19 adds Python decorator coverage, so the next bench rerun should move these from 0.00 to 1.00. The sverklo-self P5 losses (5 tasks, 0.00 vs jcodemunch's 0.04) are different: jcodemunch's barely-above-zero F1 wins only because sverklo's orphan list contains 10 items against jcodemunch's empty one. Both are wrong; the bench's empty-expected scoring just favors the wronger one.
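The failure shape, illustrated with hypothetical patterns (the actual DECORATOR_ENTRY_POINT regex isn't reproduced here):

```typescript
// Hypothetical before/after, not the actual DECORATOR_ENTRY_POINT regex.
// A TS/NestJS-anchored pattern never matches Python decorator syntax,
// so every FastAPI route method looked like an orphan.

// Before: NestJS-style only -- matches `@Get('/users')`, `@Controller()`.
const tsOnly = /@(?:Get|Post|Put|Delete|Patch|Controller)\s*\(/;

// After: also accept Python method-style decorators such as
// `@app.get("/users")` or `@router.post("/items")`.
const withPython =
  /@(?:Get|Post|Put|Delete|Patch|Controller)\s*\(|@\w+\.(?:get|post|put|delete|patch)\s*\(/;

console.log(tsOnly.test(`@app.get("/users")`));     // false -> false orphan
console.log(withPython.test(`@app.get("/users")`)); // true  -> entry point
```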
P4 is the smallest loss surface. Sverklo wins the category overall by 44 points over the next baseline (0.84 vs smart-grep's 0.40). The 4 losses are edge-case files with dynamic imports or test-only fixtures where the import graph is genuinely ambiguous (example below). There is no near-term fix for these: they require runtime tracing, not static analysis.
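An illustrative example (not taken from the bench repos) of the kind of import these four tasks hit:

```typescript
// Illustrative dynamic import that a static import graph can't resolve:
// the specifier is computed at runtime, so which file under ./plugins/
// is reachable depends on data, not on anything visible in the syntax.

async function loadPlugin(name: string) {
  const mod = await import(`./plugins/${name}.js`);
  return mod.default;
}
```

Static analysis can at best enumerate candidates under `./plugins/`; only runtime tracing observes the concrete specifier.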
| task | cat | sv-F1 | winner-F1 | winner | Δ-F1 | sv-tok | winner-tok | note |
|---|---|---|---|---|---|---|---|---|
| sverklo/sv-p1-01 | P1 | 0.00 | 1.00 | jcodemunch | +1.00 | 530 | 959 | |
| sverklo/sv-p1-03 | P1 | 0.00 | 1.00 | jcodemunch | +1.00 | 270 | 874 | |
| express/ex-p2-04 | P2 | 0.00 | 1.00 | smart-grep | +1.00 | 118 | 49 | |
| lodash/ld-p1-06 | P1 | 0.00 | 1.00 | smart-grep | +1.00 | 946 | 44 | |
| requests/rq-p1-08 | P1 | 0.00 | 1.00 | jcodemunch | +1.00 | 1,180 | 937 | |
| requests/rq-p1-09 | P1 | 0.00 | 1.00 | jcodemunch | +1.00 | 611 | 902 | |
| requests/rq-p1-10 | P1 | 0.00 | 1.00 | jcodemunch | +1.00 | 906 | 896 | |
| fastapi/fa-p1-01 | P1 | 0.00 | 1.00 | gitnexus | +1.00 | 1,036 | 1,244 | |
| fastapi/fa-p1-02 | P1 | 0.00 | 1.00 | gitnexus | +1.00 | 1,351 | 924 | |
| fastapi/fa-p1-06 | P1 | 0.00 | 1.00 | gitnexus | +1.00 | 1,319 | 1,000 | |
| fastapi/fa-p1-07 | P1 | 0.00 | 1.00 | gitnexus | +1.00 | 1,280 | 2,550 | |
| fastapi/fa-p1-08 | P1 | 0.00 | 1.00 | gitnexus | +1.00 | 943 | 298 | |
| fastapi/fa-p5-01 | P5 | 0.00 | 1.00 | naive-grep | +1.00 | 801 | 0 | fixed in v0.20.19 |
| fastapi/fa-p5-02 | P5 | 0.00 | 1.00 | naive-grep | +1.00 | 801 | 0 | fixed in v0.20.19 |
| fastapi/fa-p5-03 | P5 | 0.00 | 1.00 | naive-grep | +1.00 | 801 | 0 | fixed in v0.20.19 |
| fastapi/fa-p5-04 | P5 | 0.00 | 1.00 | naive-grep | +1.00 | 841 | 0 | fixed in v0.20.19 |
| fastapi/fa-p5-05 | P5 | 0.00 | 1.00 | naive-grep | +1.00 | 841 | 0 | fixed in v0.20.19 |
| sverklo/sv-p2-04 | P2 | 0.00 | 0.67 | jcodemunch | +0.67 | 145 | 71 | |
| express/ex-p2-09 | P2 | 0.49 | 1.00 | smart-grep | +0.51 | 1,193 | 886 | |
| lodash/ld-p2-08 | P2 | 0.30 | 0.77 | smart-grep | +0.47 | 1,132 | 2,053 | |
| express/ex-p2-01 | P2 | 0.27 | 0.63 | smart-grep | +0.36 | 530 | 701 | |
| express/ex-p2-06 | P2 | 0.67 | 1.00 | smart-grep | +0.33 | 152 | 74 | |
| express/ex-p2-08 | P2 | 0.67 | 1.00 | smart-grep | +0.33 | 123 | 27 | |
| express/ex-p4-03 | P4 | 0.67 | 1.00 | jcodemunch | +0.33 | 100 | 232 | |
| sverklo/sv-p4-04 | P4 | 0.18 | 0.50 | gitnexus | +0.32 | 162 | 39 | |
| express/ex-p2-10 | P2 | 0.58 | 0.89 | smart-grep | +0.31 | 666 | 472 | |
| lodash/ld-p2-07 | P2 | 0.00 | 0.27 | smart-grep | +0.27 | 395 | 1,058 | |
| lodash/ld-p4-05 | P4 | 0.60 | 0.86 | smart-grep | +0.26 | 122 | 1,281 | |
| lodash/ld-p2-10 | P2 | 0.24 | 0.47 | smart-grep | +0.23 | 670 | 995 | |
| lodash/ld-p2-03 | P2 | 0.11 | 0.34 | smart-grep | +0.23 | 773 | 6,782 | |
| lodash/ld-p4-04 | P4 | 0.50 | 0.71 | smart-grep | +0.21 | 99 | 121 | |
| lodash/ld-p2-04 | P2 | 0.15 | 0.34 | smart-grep | +0.19 | 689 | 5,791 | |
| sverklo/sv-p2-06 | P2 | 0.00 | 0.18 | jcodemunch | +0.18 | 284 | 240 | |
| lodash/ld-p2-05 | P2 | 0.09 | 0.25 | smart-grep | +0.16 | 671 | 6,310 | |
| express/ex-p2-03 | P2 | 0.42 | 0.58 | smart-grep | +0.16 | 405 | 450 | |
| express/ex-p2-02 | P2 | 0.90 | 1.00 | smart-grep | +0.10 | 217 | 263 | |
| fastapi/fa-p2-07 | P2 | 0.00 | 0.05 | gitnexus | +0.05 | 1,918 | 2,898 | |
| sverklo/sv-p5-01 | P5 | 0.00 | 0.04 | jcodemunch | +0.04 | 638 | 6,779 | |
| sverklo/sv-p5-02 | P5 | 0.00 | 0.04 | jcodemunch | +0.04 | 638 | 6,779 | |
| sverklo/sv-p5-03 | P5 | 0.00 | 0.04 | jcodemunch | +0.04 | 638 | 6,779 | |
| sverklo/sv-p5-04 | P5 | 0.00 | 0.04 | jcodemunch | +0.04 | 678 | 6,779 | |
| sverklo/sv-p5-05 | P5 | 0.00 | 0.04 | jcodemunch | +0.04 | 678 | 6,779 | |
Source: benchmark/results/2026-05-12T17-11-45-853Z/raw.jsonl in github.com/sverklo/sverklo. The 138 tasks not listed here are ones sverklo won outright or tied. The bench harness writes raw, summary, and report files on every run, and everything is reproducible with npm run bench:quick from a clean clone.
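A sketch of pulling this table back out of raw.jsonl; the field names (`task`, `tool`, `f1`) are assumptions about the raw schema, not confirmed against the harness:

```typescript
// Rebuild the losses table from raw.jsonl. Field names are assumed,
// not confirmed against the bench harness's actual schema.
import { readFileSync } from "node:fs";

interface Row { task: string; tool: string; f1: number; }

const raw = readFileSync(
  "benchmark/results/2026-05-12T17-11-45-853Z/raw.jsonl", "utf8"
);
const rows: Row[] = raw.trim().split("\n").map((l) => JSON.parse(l));

// Group result rows per task.
const byTask = new Map<string, Row[]>();
for (const r of rows) {
  const group = byTask.get(r.task) ?? [];
  group.push(r);
  byTask.set(r.task, group);
}

// A loss: some baseline's F1 beats sverklo's on the same task.
const losses: { task: string; winner: string; delta: number }[] = [];
for (const [task, group] of byTask) {
  const sv = group.find((r) => r.tool === "sverklo");
  const others = group.filter((r) => r.tool !== "sverklo");
  if (!sv || others.length === 0) continue;
  const best = others.reduce((a, b) => (b.f1 > a.f1 ? b : a));
  if (best.f1 > sv.f1) {
    losses.push({ task, winner: best.tool, delta: best.f1 - sv.f1 });
  }
}

losses.sort((a, b) => b.delta - a.delta); // largest delta first
console.table(losses);
```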
Most public benchmarks for AI tooling publish wins only. The bench's wins live on /bench/ and /mcp/. This page is the other half, the part that doesn't get published anywhere else: where we're worse, by how much, and what we know about it.
Three reasons it has to exist: a benchmark that only publishes wins is marketing, not measurement; published losses are the feedback loop that turns bug reports into fixes; and anyone rerunning the bench from a clean clone would find these numbers anyway, so hiding them buys nothing.
Found a failure mode we should fix? Open an issue with the task ID, the expected behavior, and what sverklo actually returned. The bench-as-feedback-loop pattern has worked twice already: jcodemunch-mcp shipped lodash fixes within 36 hours of the bench's publication, and sverklo's own Python parser bug was fixed in the same week the requests dataset landed.
```bash
git clone https://github.com/sverklo/sverklo && cd sverklo
npm install && npm run build
npm run bench:quick   # ~30-45 min, 180 tasks × 5 baselines
# Outputs: benchmark/results/<timestamp>/{raw.jsonl, summary.json, report.md}
```
Raw artifact for this page: 2026-05-12T17-11-45-853Z. Bench methodology: METHODOLOGY.md.