Late-interaction rerank made our F1 worse, not better
We wired a poor-man's late-interaction reranker through sverklo's lookup and refs tools — the two primitives our public bench actually exercises — ran the full 4-dataset 120-task evaluation three times deterministically, and F1 dropped from 0.5847 to 0.5551. Negative result. Issue #29 tracks the experiment; this post is the close-out.
get." Token-level semantic similarity dilutes the exact-match signal instead of sharpening it.What we wired, exactly
"Late-interaction rerank" usually means ColBERT v2: a model trained end-to-end to produce token vectors for queries and documents, scored at retrieval time by MaxSim — for each query token, take the max cosine similarity against any document token, then sum. The shipped systems that lift retrieval F1 by 5–15% (PLAID, ColBERT v2, modern variants) all use a model trained for this objective.
"Poor-man" means: take the same MiniLM-L6-v2 we already use for sentence embeddings, run the model with token-level outputs (the layer we currently mean-pool away), and use those token vectors as a stand-in for ColBERT's trained representations. Same MaxSim scoring, free model, no fine-tune. The hope was to get a non-trivial fraction of the published lift for almost no engineering cost. The cheapest possible thing to try.
The wiring landed in two call sites (the lookup side is sketched after the list):
- sverklo_lookup — pull a wider candidate pool from SQL (40 instead of the default 20), MaxSim-rerank against the symbol query, return top-K to the formatter.
- sverklo_refs — group references by file, take the highest-ranked chunk per file as the rerank input, reorder files by MaxSim score instead of PageRank.
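Under the same assumptions, the lookup side looks roughly like this. sqlCandidates and embedTokens are hypothetical stand-ins for sverklo's internals, and maxSim is the function from the sketch above; this is a sketch of the shape of the change, not the shipped code.

```typescript
// Hedged sketch of the lookup-side rerank wiring. sqlCandidates() and
// embedTokens() are hypothetical stand-ins for sverklo's internals.
interface Candidate { symbol: string; file: string; chunkText: string }

declare function sqlCandidates(query: string, limit: number): Promise<Candidate[]>;
declare function embedTokens(text: string): Promise<number[][]>;

async function rerankedLookup(query: string, topK = 20): Promise<Candidate[]> {
  const pool = await sqlCandidates(query, 40);      // widened pool: 40 instead of the default 20
  const queryVecs = await embedTokens(query);       // MiniLM token vectors, no mean-pooling
  const scored = [];
  for (const c of pool) {
    scored.push({ c, score: maxSim(queryVecs, await embedTokens(c.chunkText)) });
  }
  scored.sort((a, b) => b.score - a.score);         // MaxSim order replaces SQL match order
  return scored.slice(0, topK).map(s => s.c);
}
```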
Both are the call paths that the bench's P1 (definition lookup) and P2 (reference finding) tasks exercise. Earlier rerank wiring lived only in hybrid-search.ts, which the bench primitives don't go through — that's why three earlier A/B runs had returned identical numbers (F1 0.7483 = 0.7483 = 0.7483, a structural finding, not noise). The wiring this round was the actual experiment.
The numbers
Setup: BASELINES=sverklo,sverklo-rerank, full 4-dataset run (express + lodash + sverklo + requests), 120 hand-verified tasks, three runs back-to-back with deterministic seeds. The numbers were stable across runs to four decimal places, so what's below is one run; the three-run mean is identical.
| Configuration | F1 | P1 def | P2 refs | P4 deps | Tokens |
|---|---|---|---|---|---|
| sverklo (baseline) | 0.5847 | 0.700 | 0.290 | 0.780 | 498 |
| sverklo-rerank (poor-man) | 0.5551 | 0.625 | 0.290 | 0.780 | 498 |
| Δ (rerank − base) | −0.0296 | −0.075 | 0.000 | 0.000 | 0 |
P2 and P4 are unchanged because the rerank reordering doesn't change the binary "did the right file appear at all?" answer that those tasks score against — it only reorders the file list, and both the pre- and post-rerank lists contain the right files. Token cost is unchanged because the rerank pass touches ranking, not output.
The whole regression is concentrated in P1, the slice we expected to win the most. Definition lookup goes from 0.700 to 0.625 — a clean 7.5-point drop. The rerank made the tool actively worse at the most basic retrieval task.
Why it broke instead of helped
P1 tasks have a specific shape. The query is always a literal symbol name: get, Session, handleRequest, UserService. The right answer is the canonical definition site. Sverklo's existing pipeline ranks by SQL match-quality first (exact symbol-name match > prefix match > substring match), then breaks ties by PageRank. That ordering is already exactly what the task wants: literal name match wins, structural importance breaks ties.
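A sketch of that ordering as a two-key sort, with illustrative field names rather than sverklo's actual schema:

```typescript
// Illustrative sketch of the baseline ordering: literal match quality first,
// PageRank only as a tiebreaker. Field names are assumptions, not sverklo's schema.
interface Hit { symbol: string; pageRank: number }

function matchQuality(query: string, symbol: string): number {
  if (symbol === query) return 3;          // exact symbol-name match
  if (symbol.startsWith(query)) return 2;  // prefix match
  if (symbol.includes(query)) return 1;    // substring match
  return 0;
}

function baselineOrder(query: string, hits: Hit[]): Hit[] {
  return [...hits].sort((a, b) =>
    matchQuality(query, b.symbol) - matchQuality(query, a.symbol)
    || b.pageRank - a.pageRank);
}
```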
What MaxSim scores instead is token-level semantic alignment between the query string and the chunk text. For a one-token query like get, this is almost meaningless — every JS file in lodash contains 50 tokens semantically near "get" (fetch, retrieve, read, load) and the chunk that defines get doesn't necessarily score highest under that metric. For a multi-token query like UserService the model decomposes it into user + service sub-tokens and aligns against any chunk with high "user" and "service" density — a test fixture mentioning user.service.test('something') can outrank the actual class definition.
This is the bias that ColBERT v2's training fixes. The model learns that literal token match outweighs semantic similarity for code queries. MiniLM, which was trained as a sentence-embedding model on natural-language pairs, has the opposite bias by default — it pulls similar-meaning tokens close together, which is the wrong signal here.
Our mistake was assuming that "we already have the model, MaxSim is a cheap algorithmic addition" made this the cheap experiment. The actual cheap experiment would have been to skip MiniLM and try ColBERT v2 directly. The model is the load-bearing component; MaxSim on top of an off-the-shelf encoder doesn't reproduce the published results because the published results aren't about MaxSim.
What this teaches about reranker scope
Two things worth saving from the wreckage:
1. Rerank helps where the underlying ranker is uncertain, not where it's already confident
The retrieval cases where rerank reliably wins in the literature are concept queries: "where does the auth token get refreshed?" — natural-language descriptions where BM25 matches lots of weakly-related files and the right answer needs semantic disambiguation. Those queries are the natural home for a reranker.
P1 def-lookup is the opposite shape. The query is the symbol name. SQL match-quality already discriminates exact > prefix > substring with perfect precision. There's no uncertainty for a reranker to resolve — there's only certainty for a reranker to pollute. We should have known this before wiring it.
2. Always A/B against the bench-exercising call path, not the easy-to-instrument one
The first three rerank A/B runs we ever did were inert because the rerank wiring lived in hybrid-search.ts and the bench's primitives go through lookup / refs / deps / audit directly. Three identical F1 readouts (0.7483 each time) looked like noise but were actually a structural finding: the bench was indifferent to the experiment. The fix in this round was to wire rerank into the actual bench-exercised paths — at which point the experiment did register, and registered as a regression. Negative result, but the harness at least stopped lying to us.
The lesson generalizes: if your benchmark says "no change" three runs in a row when you're sure your change should move it, the harness is the suspect, not the change. Bisect the call path before you bisect the algorithm.
What's next on the rerank track
The poor-man experiment is done. Issue #29 stays open because the original question — does real late-interaction rerank lift sverklo's F1? — is still unanswered. The next experiment is ColBERT v2 with a strong promotion gate (a mechanical sketch of the gate follows the list):
- F1 lift ≥ +0.05 (one full bench point above poor-man's regression, two above noise) with bootstrap CI lower bound > 0.
- P1 doesn't regress. If the model breaks the symbol-name slice the way poor-man did, we don't ship it even if it lifts P2.
- Latency ≤ 50ms on M-series ANE. Sverklo's whole pitch is local-first with sub-second responses; a reranker that pushes lookup past a second isn't useful regardless of accuracy.
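A sketch of how the gate would be applied mechanically; the field names are assumptions about the bench's summary output, not an existing sverklo API.

```typescript
// Hedged sketch of the promotion gate; field names are assumptions, not sverklo's API.
interface RerankRunSummary {
  f1Lift: number;        // rerank F1 minus baseline F1
  f1LiftCiLower: number; // bootstrap CI lower bound on the lift
  p1Delta: number;       // change on the P1 definition-lookup slice
  latencyMs: number;     // added per-lookup latency
}

function passesPromotionGate(run: RerankRunSummary): boolean {
  return run.f1Lift >= 0.05
    && run.f1LiftCiLower > 0   // lift must be statistically above zero
    && run.p1Delta >= 0        // never regress the symbol-name slice
    && run.latencyMs <= 50;    // local-first: keep lookup well under a second
}
```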
The gates are deliberately strict because the cost of shipping a reranker is permanent — every future query pays the latency tax, and rolling it back later breaks bench reproducibility. We'd rather ship nothing than ship the wrong thing.
Why publish a negative result
The honest answer is that I'd rather show the reproducer for "we tried X and it didn't work" than ship a quietly-removed feature flag and hope no one notices. The 36-hour bench loop that produced this result is on sverklo.com/mcp as one of two negative-result callouts (the other is a Python parser bug surfaced by adding the requests dataset, which did get fixed and lifted P4 from 0.10 to 1.00). Both are receipts for the same posture: the bench is the authority and the bench tells the truth.
The competitive default in the AI-tools-for-coding space is to ship the win and stay quiet on the loss. The default I'd rather establish for sverklo is: every experiment generates a number, every number gets published, and the experiments that don't pan out are as much of the story as the ones that do. If sverklo were any other product I'd be tempted to shelve this and only mention rerank once we had a positive result. As a tool that exists to tell agents whether the code is rotting, the brand really has no other choice.
Reproduce locally
git clone https://github.com/sverklo/sverklo
cd sverklo && npm install
SVERKLO_RERANK=poor-man BASELINES=sverklo,sverklo-rerank npm run bench:quick
Output lands in benchmark/results/<timestamp>/. The run takes ~3 minutes on an M-series Mac. Disagreements with these numbers are useful — open an issue with your machine spec and run timestamp.
Issue #29 — full per-task breakdown and rerank wiring commits · sverklo.com/mcp — leaderboard with both negative-result callouts
References
- Issue tracking the experiment: github.com/sverklo/sverklo#29
- Rerank wiring (lookup): src/server/tools/lookup.ts
- Rerank wiring (refs): src/server/tools/find-references.ts
- Original ColBERT paper (Khattab & Zaharia, 2020): arxiv.org/abs/2004.12832
- ColBERT v2 (Santhanam et al., 2022): arxiv.org/abs/2112.01488
See also
- Reciprocal Rank Fusion is doing 80% of the work in our hybrid search — the combination function that the rerank was supposed to be on top of
- I added two competitors to my own benchmark. One of them beat me at P1. — the bench expansion that this experiment ran against
- Bench as feedback loop — why we run the same bench every time we change anything