Engineering · Sverklo

Reciprocal Rank Fusion is doing 80% of the work in our hybrid search

I tried half a dozen scoring schemes for combining BM25, vector similarity, and PageRank into a single ranked code-search result. Most required tuning weights, calibrating scores, and explaining the result to skeptical reviewers. Reciprocal Rank Fusion is three lines of math, has no tunable parameters, and beats every alternative I tested. Here's what it does, why it works, and why it should be your default.

The setup

Sverklo's job is to take a natural-language query like "where does the auth token get refreshed?" and return a ranked list of code chunks that an AI agent can read to answer the question. The agent then reads the top-N results, builds a mental model, and replies. The quality of the answer is bounded above by the quality of the retrieval — if the right code chunk isn't in the top results, the agent has no chance.

Almost everyone building this kind of tool reaches for a hybrid setup: multiple retrievers, each good at different things, combined into one final ranking. Sverklo combines three retrievers:

  1. BM25 full-text search, which is precise when the query contains literal symbols or identifiers
  2. Vector similarity over embeddings, which handles queries that describe behavior instead of naming it
  3. PageRank over the code graph, which supplies a structural-importance signal for files

Three retrievers, three rankings, one final result. The interesting question — the one this post is about — isn't which retriever to pick. It's how to combine them.

The obvious combinations don't work

The first thing you try is a weighted sum:

final_score = w1 * bm25_score + w2 * cosine_similarity + w3 * pagerank

This fails for one specific, structural reason: the three signals are on incompatible scales.

Signal            | Range   | Distribution
BM25              | [0, ∞)  | Long-tail; top result might score 12.4 while the 50th scores 0.3
Cosine similarity | [-1, 1] | Bunched near 0.6–0.9 for any plausible match
PageRank          | [0, 1]  | Tiny probabilities; top file in a 4k-file repo might score 0.003

You can't add these. Adding a BM25 score of 12.4 and a cosine similarity of 0.85 gives you 13.25, which is dominated by the BM25 term entirely. The cosine similarity might as well not exist.

The fix everyone tries next is to normalize each signal to [0, 1] first — min-max scaling, z-score, sigmoid, whatever. This works in theory and breaks in practice. Min-max scaling is unstable when one query has very few results (the min and max are too close together). Z-score requires you to know the population distribution in advance, which you don't. Sigmoid squashing throws away information at the extremes. Every option has a failure mode and every failure mode requires a tunable parameter to paper over.
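To make the min-max failure concrete, here's a sketch (the function name and guard threshold are illustrative, not sverklo's code):

```typescript
// Sketch: min-max scaling degenerates when a query returns few,
// tightly clustered scores. Illustrative code, not sverklo's.
function minMaxNormalize(scores: number[]): number[] {
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  // When all scores are nearly equal, range is ~0: without this guard
  // you divide by zero; with it, the whole signal collapses to zeros.
  if (range < 1e-9) return scores.map(() => 0);
  return scores.map((s) => (s - min) / range);
}

// A healthy result set normalizes sensibly:
minMaxNormalize([12.4, 5.1, 0.3]); // → [1, ~0.40, 0]

// A sparse one is exaggerated: two near-identical BM25 scores
// become 0 and 1, inflating a meaningless difference.
minMaxNormalize([2.0001, 2.0]); // → [1, 0]
```

Either branch of the guard is a bug from the ranking's point of view: the query with two near-tied results is exactly the query where you need the other signals to break the tie, and min-max either erases the signal or blows it up.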

The second thing you try is a learned ranking function. Train a model to combine the three scores. This works extremely well for web search, where Google has a billion queries a day to train on. It's a non-starter for sverklo, where the entire training set is "the user typed something three minutes ago."

The third thing you try is to pick a single signal and use that. Most embedding-only "code search" tools end up here. You lose the precision of BM25 and you lose the structural importance signal from PageRank, and your top result for UserService.authenticate is a function called AuthHelper.verify in a file your agent has never seen. This is what 90% of public RAG-over-code projects shipped in 2023–2025 look like.

The fourth thing you try, and the thing that actually works, is Reciprocal Rank Fusion.

What Reciprocal Rank Fusion is

RRF was published by Cormack, Clarke, and Büttcher in 2009 as a way to combine rankings from completely different retrieval systems without normalizing scores. The trick is to throw away the scores entirely and combine the ranks instead.

Each retriever produces an ordered list. For each item that appears in any list, you compute:

RRF(item) = Σ (1 / (k + rank_i(item)))   for each retriever i where the item appears

where:
  rank_i(item) = the 1-indexed position of item in retriever i's ranked list
  k = a constant, conventionally 60 (more on this in a moment)

Then you sort items by their RRF score, descending. That's it. That's the entire algorithm.
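As a standalone sketch (a generic helper, not sverklo's actual code), the whole algorithm fits in one small function that takes each retriever's ordered list of item ids:

```typescript
// Generic RRF: fuse any number of ranked id lists into one ranking.
// Sketch only; names are illustrative.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      const rank = i + 1; // the formula uses 1-indexed ranks
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "c" (ranked 3rd and 2nd) ends up above "b" (ranked 2nd in one list
// only): consensus across lists beats a single strong placement.
rrfFuse([["a", "b", "c"], ["a", "c", "d"]]); // → ["a", "c", "b", "d"]
```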

Three things are worth noticing immediately:

  1. There are no scale issues. Ranks are integers from 1 to N. The transformation 1/(k+rank) maps every retriever's output to the same range, regardless of whether the underlying scores were unbounded BM25 or [-1,1] cosine similarities.
  2. There are no tunable weights per retriever. Every signal contributes equally. You can't accidentally over-trust embeddings or over-trust grep — they each get the same vote-by-rank.
  3. The k constant matters less than you'd think. 60 is the value from the original paper and it's what the literature has converged on. Sverklo uses 60. I tried 30 and 100 and the results were almost identical — near typical ranks the score changes so slowly with k that doubling it barely moves the relative ordering.

The implementation in sverklo is small enough to fit on screen:

// src/server/tools/recall.ts (similar code in src/search/)
const RRF_K = 60;

const rrfScores = new Map<number, number>();

// Signal A: BM25 / FTS
for (let rank = 0; rank < bm25Results.length; rank++) {
  const id = bm25Results[rank].id;
  rrfScores.set(id, (rrfScores.get(id) || 0) + 1 / (RRF_K + rank + 1));
}

// Signal B: vector similarity
for (let rank = 0; rank < vectorResults.length; rank++) {
  const id = vectorResults[rank].id;
  rrfScores.set(id, (rrfScores.get(id) || 0) + 1 / (RRF_K + rank + 1));
}

// Signal C: PageRank — used as a tiebreaker boost on the fused score
// (full implementation in graph-builder.ts; here we re-rank by PR after RRF)
const candidates = [...rrfScores.entries()].map(([id, rrfScore]) => {
  const file = chunkStore.fileFor(id);
  const finalScore = rrfScore * (1 + 0.1 * file.pagerank);
  return { id, score: finalScore };
});

candidates.sort((a, b) => b.score - a.score);

That's the entire fusion stage. Eight lines of substantive code. No weights to tune, no scores to calibrate, no model to train, no failure mode I've found in nine months of dogfooding.

Why does this work?

The intuition is simple and the proof is annoyingly slippery, so let me give you the intuition first.

RRF rewards consensus across retrievers. An item that ranks 1st in BM25 and 1st in vector search has the highest possible RRF score. An item that ranks 1st in BM25 but is missing entirely from vector search has only the BM25 contribution — much lower. Items that show up in the top 5 of both retrievers float above items that show up in the top 1 of just one.

This is exactly the property you want for hybrid code search. The case where BM25 is right is the case where the user typed a literal symbol; the case where vector search is right is the case where the user described what they wanted. The case where they're both right is the case where both retrievers agree, and that's the case RRF rewards most strongly.

The 1/(k+rank) shape matters. The score drops off fast at the top (rank 1 gives 1/61 ≈ 0.0164, rank 2 gives 1/62 ≈ 0.0161, rank 10 gives 1/70 ≈ 0.0143) and slowly at the tail (rank 50 gives 1/110 ≈ 0.0091, rank 100 gives 1/160 ≈ 0.0063). The result: top results matter a lot, mid results contribute meaningfully, tail results contribute almost nothing. Exactly the curve you'd hand-design if you were trying to write a "good combination" function from scratch.
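The head-versus-tail drop-off is easy to verify in a couple of lines (a throwaway snippet, not sverklo code):

```typescript
// The contribution of a single retriever's vote at a 1-indexed rank.
const RRF_K = 60;
const contribution = (rank: number) => 1 / (RRF_K + rank);

// Adjacent ranks at the head differ far more than adjacent ranks in
// the tail, so top placements dominate and the tail fades out.
const headStep = contribution(1) - contribution(2);   // ≈ 0.00026
const tailStep = contribution(50) - contribution(51); // ≈ 0.00008
```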

The deeper reason it works — and the reason the original paper mathematically justifies it — is that ranks contain less information than scores, and less information is exactly what you want when you don't trust any single retriever to be calibrated. By throwing away the scores you also throw away the calibration problem. You can't be wrong about a calibration you never relied on.

Concrete example from a real query

Take the query "hybrid search ranking function" against sverklo's own codebase. Here's what each retriever returns, top 5 only:

Rank | BM25                       | Vector
1    | src/search/hybrid.ts       | src/search/hybrid.ts
2    | src/search/bm25.ts         | src/server/tools/recall.ts
3    | src/search/scoring.ts      | src/search/scoring.ts
4    | benchmark/src/types.ts     | src/search/hybrid-fusion.ts
5    | src/server/tools/search.ts | src/search/bm25.ts

Computing RRF (k=60) and adding the PageRank tiebreaker, the fused result is:

Rank | File                        | RRF score | Why it ranks here
1    | src/search/hybrid.ts        | 0.0328    | 1st in both retrievers — strongest possible signal
2    | src/search/scoring.ts       | 0.0317    | 3rd in both retrievers — strong consensus
3    | src/search/bm25.ts          | 0.0316    | 2nd in BM25, 5th in vector — strong literal match, weaker concept
4    | src/server/tools/recall.ts  | 0.0161    | 2nd in vector only — concept match, no literal hit
5    | src/search/hybrid-fusion.ts | 0.0156    | 4th in vector only — same pattern

Notice the failure mode RRF avoids: BM25 had benchmark/src/types.ts at rank 4 (because the file contains the strings "hybrid", "search", "ranking", and "function" in type definitions) but it's not in the vector top-5 because semantically it's a type-definitions file, not a ranking implementation. RRF correctly demotes it out of the top results because it has no consensus support. A naive weighted-sum would have included it.

And notice the win RRF gives you: src/search/scoring.ts ranks #2 in the fused list even though it ranked #3 in both retrievers. The combination of "consistently in the top 3 of both signals" is more valuable than being #1 in just one. RRF surfaces consistent mid-rankers above lopsided top-rankers, and that's exactly the right behavior when you're combining noisy retrievers that each have their own failure modes.
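You can reproduce the fused table mechanically from the two top-5 lists above. A sketch (the PageRank boost is omitted here, so the two single-vote rank-4 entries tie at the tail instead of being separated by it):

```typescript
const RRF_K = 60;

// Top-5 lists copied from the per-retriever table above.
const bm25Top5 = [
  "src/search/hybrid.ts",
  "src/search/bm25.ts",
  "src/search/scoring.ts",
  "benchmark/src/types.ts",
  "src/server/tools/search.ts",
];
const vectorTop5 = [
  "src/search/hybrid.ts",
  "src/server/tools/recall.ts",
  "src/search/scoring.ts",
  "src/search/hybrid-fusion.ts",
  "src/search/bm25.ts",
];

const scores = new Map<string, number>();
for (const list of [bm25Top5, vectorTop5]) {
  list.forEach((file, i) => {
    scores.set(file, (scores.get(file) ?? 0) + 1 / (RRF_K + i + 1));
  });
}
const fused = [...scores.entries()].sort((a, b) => b[1] - a[1]);
// fused[0] is ["src/search/hybrid.ts", ~0.0328]; the consensus files
// (scoring.ts, bm25.ts) follow, and benchmark/src/types.ts sinks.
```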

What RRF doesn't do

I'd be lying if I said RRF was free. Three things it doesn't solve:

1. Calibrated probabilities

RRF gives you a ranking, not a probability that any given result is correct. If you want to threshold ("only return results with confidence > 0.9"), you can't do it with RRF scores directly. The score depends on the size of the candidate pool and the rank distribution, not on any absolute notion of relevance.

Sverklo doesn't need calibrated probabilities — the agent reads the top N regardless. But if you're building something that does (a search UI that hides low-confidence results, say), RRF gives you an ordering and not much else.

2. Recall when one retriever is silent

If BM25 returns zero results (the user's query has no exact-token matches at all) and vector search returns ten, RRF reduces to "use vector search". That's the right behavior, but it's not magic — you're as good as the better retriever in that case, no better.

The case where this hurts is queries that fall in the gap between both retrievers — concept queries where the embedding isn't quite right AND there are no literal-string anchors. For those queries, the only fix is a third or fourth retriever (a specialized code embedding model, a graph-walk over symbol references, etc.). Sverklo's sverklo_impact tool exists exactly for this case — it's a different way of asking "what's structurally related to X?" that doesn't depend on either BM25 or vector search.

3. Diversity

RRF doesn't penalize duplicates. If your top 5 BM25 hits are all in the same file at different chunks, RRF will happily put all 5 in the top of the fused result. For a code-search tool returning chunks, this is sometimes what you want and sometimes not.

Sverklo handles this with a post-fusion deduplication pass that caps each file's contribution at 3 chunks. It's a hack, but it's a 5-line hack and it covers the failure mode.
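The cap can be sketched as a filter over the fused list (the Candidate shape and function name are hypothetical, not sverklo's real types):

```typescript
// Post-fusion diversity pass: keep at most `maxPerFile` chunks from
// any single file, preserving the fused order otherwise.
interface Candidate { id: number; file: string; score: number; }

function capPerFile(ranked: Candidate[], maxPerFile = 3): Candidate[] {
  const kept = new Map<string, number>();
  return ranked.filter((c) => {
    const n = kept.get(c.file) ?? 0;
    if (n >= maxPerFile) return false;
    kept.set(c.file, n + 1);
    return true;
  });
}
```

Because it runs after fusion, the cap never changes relative order between files; it only frees slots for the next-best file once one file has used its three.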

Why I'm writing this

I built sverklo expecting the embedding model to do most of the heavy lifting. I assumed I'd spend most of my time tuning the model, picking better embeddings, maybe fine-tuning on code. The model picker is one of the most common decision points in any RAG project, and there's a whole sub-industry of "best embedding model 2026" benchmarks selling that decision as the important one.

Then I built the hybrid stack and discovered that the combination function was more important than any individual signal. Better BM25, better embeddings, better PageRank — none of them moved the quality needle as much as switching from weighted-sum to RRF. The day I ripped out my weighted-sum-with-min-max-normalization code and replaced it with the eight lines above was the day sverklo's search results stopped feeling random.

I see a lot of "code search" tools shipping in 2026 that pick a single embedding model, normalize it however they normalize it, and return the top vector hits. Most of them would be visibly better — measurably better, in user-visible time-to-correct-answer — if they spent ten minutes adding a BM25 fallback and an RRF combiner. That work is so cheap and the upside is so large that I'm a little embarrassed it took me as long as it did to get there.

The TL;DR if you're building a hybrid retrieval system in 2026 and you're tempted to start with a weighted sum of normalized scores: don't. Use RRF. It's eight lines, has no tunable parameters, and outperforms every alternative I've tried. The only reason it isn't the default in every retrieval library is that "Reciprocal Rank Fusion" sounds more complicated than "weighted average," and people who haven't read the 2009 paper assume it must be exotic. It isn't. The formula is a one-liner.

Further reading. The original paper is Cormack, Clarke, and Büttcher (2009): "Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods." It's 2 pages, no math beyond what's in this post, and worth your time if any part of the above felt hand-wavy.

Try it on your own codebase

npm install -g sverklo
cd your-project && sverklo init

Then ask your agent a question that grep is bad at — "what replaced the deleted X class?", "what handles auth across this codebase?", "what calls Y indirectly?" — and see what comes back. The fusion code above is what's running.

github.com/sverklo/sverklo
