Why Claude Code Burns So Many Tokens — A Field Study (14,200 Tokens to Find One Function)
I logged every tool call my Claude Code agent made for a week, across repos from 200 to 4,000 files. Grep alone cost me $47. Here's the data, and what I changed.
The setup
Most engineers using Claude Code or Cursor have a vague sense that AI agents are expensive on large repos. Few of us have actually measured it. I instrumented one week of normal work — 47 sessions, 312 tasks, all on private codebases between 200 and 4,000 files — and parsed every tool call out of the session logs.
The results are not subtle.
Where the tokens go
Across 312 tasks, the average input-token spend per task was 22,840 tokens. That includes the system prompt, conversation history, and tool-call results. The split:
| Source | % of input tokens | Notes |
|---|---|---|
| Grep results | 41% | Returned full lines, often hundreds per call |
| File reads | 28% | Re-reading files the agent had already touched |
| Conversation history | 18% | The agent's own prior outputs |
| Glob results | 7% | Listing files, often multiple times per session |
| System + tool definitions | 6% | Fixed cost |
Grep alone — a single tool call type — accounted for 41% of the entire token spend. On the median session, the agent ran 9 grep calls. The most expensive single grep returned 14,200 tokens of output to locate a single function.
The 14,200-token grep
Here's the actual call, paraphrased to scrub identifying details:
Tool: grep
Query: "logRequest|logResponse|requestId"
Files matched: 312
Lines returned: 1,847
Output tokens: 14,184
The agent was trying to find the canonical request-logging function in a 4,000-file repo. The query was reasonable — three identifiers that might match the answer. The output was 1,847 lines of context-free regex hits, of which exactly 3 were actually useful.
The agent then spent another 8,200 tokens reading two files to disambiguate, and ultimately edited the wrong one. The task took 4 grep calls + 6 file reads. Total: 47,300 input tokens. At Claude Sonnet's $3/M input rate, that's $0.14 — for one task.
I do roughly 50 of these a day. Pure grep cost on a normal week: $47.10.
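To sanity-check the arithmetic, here is the cost math written out. A minimal sketch in TypeScript using the per-task figures above and the quoted $3-per-million-input-tokens rate; the helper name is mine.

// Dollars for a task, given its input-token count and the model's input rate.
const SONNET_INPUT_RATE = 3.0; // $ per million input tokens, as quoted above

function inputCost(inputTokens: number, ratePerMillion: number): number {
  return (inputTokens / 1_000_000) * ratePerMillion;
}

// The worst task in the log: 4 greps + 6 file reads, 47,300 input tokens.
console.log(inputCost(47_300, SONNET_INPUT_RATE).toFixed(2)); // "0.14"

// The average task: 22,840 input tokens.
console.log(inputCost(22_840, SONNET_INPUT_RATE).toFixed(3)); // "0.069"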
The compounding problem
This isn't just expensive. It cascades.
When grep returns 1,847 lines, the agent has to read all of it as input. Its working context is now polluted with hundreds of irrelevant matches. The next tool call has worse signal-to-noise. The model's prior — that function names like logResponseTime exist in most codebases — wins over the actual evidence in your repo, which has scrolled out of attention.
This is the load-bearing failure mode of AI coding agents on large repos: expensive search → noisy context → model falls back to training-data priors → hallucinated function names → wrong edit.
You can see the chain in the data. Sessions with grep results over 8,000 tokens had a hallucination rate of 31%; sessions with grep results under 2,000 tokens, 4%. Across all sessions, the correlation between grep output volume and hallucination is r = 0.74.
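For the curious, that number is nothing fancier than Pearson's r over per-session records. A sketch of the computation, assuming one record per session with its total grep output tokens and a hand-labeled hallucination flag; the record shape and names are mine, not the actual analysis script.

// One record per session: grep output volume and whether the session
// produced at least one hallucinated function name (labeled by hand).
interface SessionRecord {
  grepTokens: number;
  hallucinated: boolean;
}

// Plain Pearson correlation between two equal-length series.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - mx) * (ys[i] - my);
    vx += (xs[i] - mx) ** 2;
    vy += (ys[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

function grepHallucinationCorrelation(sessions: SessionRecord[]): number {
  return pearson(
    sessions.map(s => s.grepTokens),
    sessions.map(s => (s.hallucinated ? 1 : 0)),
  );
}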
Why grep isn't the right tool
Grep matches identifiers lexically. It does three things wrong on code:
- No ranking. A grep on "request" returns 312 matches with no signal about which is load-bearing — which functions are central to the call graph and which are utility code. The agent reads the first three results and stops, which on a 4,000-file repo is almost always wrong.
- No semantic recall. Asking grep "what handles request timing in this repo?" doesn't work. The string "request timing" probably doesn't appear; the actual function is called recordLatency.
- No structure. Grep can't tell you which functions transitively call logRequest. For refactor tasks the agent needs the call graph, not the textual matches (see the sketch after this list).
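To make that last point concrete, here is a minimal sketch of the query grep structurally cannot answer: walking a call graph to find everything that transitively calls a function. The CallGraph shape and the example names are illustrative, not any particular tool's internals.

// Maps a function name to the functions that call it directly.
type CallGraph = Map<string, string[]>;

// Breadth-first walk: every function that reaches `target` through some chain of calls.
function transitiveCallers(graph: CallGraph, target: string): Set<string> {
  const seen = new Set<string>();
  const queue = [...(graph.get(target) ?? [])];
  while (queue.length > 0) {
    const caller = queue.shift()!;
    if (seen.has(caller)) continue;
    seen.add(caller);
    queue.push(...(graph.get(caller) ?? []));
  }
  return seen;
}

// Hypothetical example: main -> handleRequest -> logRequest
const graph: CallGraph = new Map([
  ["logRequest", ["handleRequest"]],
  ["handleRequest", ["main"]],
]);
console.log(transitiveCallers(graph, "logRequest")); // Set { 'handleRequest', 'main' }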
The honest fix is hybrid retrieval: BM25 for exact identifiers, embeddings for concepts, PageRank on the call graph for ranking, all combined and exposed as ranked results. None of that is exotic — it's the standard retrieval stack from search engines, applied to code. But no AI coding agent ships with it built in. You have to add it.
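For a sense of what "combined" means in practice, here is a minimal score-fusion sketch. It assumes you already have per-symbol BM25 scores, embedding similarities, and PageRank values from an index; the weights and every name in it are placeholders of mine, not sverklo's actual internals.

// One search candidate with the three signals described above.
interface Candidate {
  symbol: string;     // e.g. a function name or file path
  bm25: number;       // lexical score: exact identifier matches
  embedding: number;  // cosine similarity to the query: concepts
  pagerank: number;   // centrality in the call graph: what's load-bearing
}

// Scale each signal to [0, 1] so the weights below are comparable.
function normalize(values: number[]): number[] {
  const max = Math.max(...values, 1e-9);
  return values.map(v => v / max);
}

// Weighted fusion of the three signals; the weights are arbitrary starting points to tune.
function rank(candidates: Candidate[], k = 10): Candidate[] {
  const bm25 = normalize(candidates.map(c => c.bm25));
  const emb = normalize(candidates.map(c => c.embedding));
  const pr = normalize(candidates.map(c => c.pagerank));
  return candidates
    .map((c, i) => ({ c, score: 0.4 * bm25[i] + 0.4 * emb[i] + 0.2 * pr[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.c);
}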
What I changed
I installed Sverklo, a local-first MCP server that gives Claude Code 37 extra retrieval tools. (Disclosure: I wrote it. The data above is from real private repos; you can reproduce the measurement on your own.)
The exact change in my Claude Code config:
{
  "mcpServers": {
    "sverklo": {
      "command": "npx",
      "args": ["-y", "sverklo"]
    }
  }
}
Then cd your-project && sverklo init. Indexing took 47 seconds for the 4,000-file repo.
I re-ran the same 312 tasks against the indexed repo. Same prompts, same models, same machine. New numbers:
| Metric | Before | After | Delta |
|---|---|---|---|
| Avg input tokens / task | 22,840 | 6,210 | −73% |
| Avg tool calls / task | 9.2 | 1.8 | −80% |
| Hallucinated function names | 31% of large-grep sessions | 2% | −94% |
| Weekly token cost | $47.10 | $12.83 | −73% |
Most of the gain is from sverklo_search returning roughly 300 tokens of ranked results instead of 1,847 lines of raw matches, plus sverklo_lookup answering "where is X defined?" in a single call. The agent doesn't need to grep its way around anymore.
Where it still doesn't help
I want to be honest about the slice where this changed nothing.
- Repos under ~5,000 LOC. The whole repo fits in context. Grep is fine. Don't bother indexing.
- Reference-finding tasks. A well-tuned ripgrep ties sverklo on the "find every caller of X" benchmark task (P2 in the public bench). The semantic graph adds nothing for purely textual queries.
- Definition lookup. jcodemunch-mcp beats sverklo on definition lookup (P1) at 0.65 F1 vs 0.45. Their tree-sitter indexing is sharper than mine. I have something to learn from them.
If your workflow is dominated by P1/P2, your token savings will be real but smaller than mine. If it's dominated by exploration ("what does this repo do?", "what handles X?", "what calls Y transitively?"), the ratio above is roughly what you should expect.
How to measure your own session
I shipped the instrumentation as a sverklo subcommand. It's in the latest npm release (v0.20.1):
npm install -g sverklo
sverklo receipt
It parses your last week of Claude Code session logs (~/.claude/projects/**/*.jsonl) and prints the same breakdown as above, but for your own data. Output looks like this:
sverklo receipt
──────────────────────────────────────────────────────────
Last 7 days · 134 sessions · 10,317 tool calls
Token spend
Input (new): 310,032
Cache reads (cheap): 5,344,464,042
Cache writes (full price): 162,448,306
Output: 13,045,579
Estimated cost
Sonnet rates: $2,287.30
Opus rates: $11,436.49
Projected yearly (Sonnet): $119,266.25
Top tool consumers
Bash 4836 calls
Edit 1938 calls
Read 1435 calls
Grep 228 calls
…
The receipt is the cheapest experiment I can suggest. If your repo is small or your workflow doesn't include much exploration, the receipt will tell you so. If it shows the opposite, the install is one line and the uninstall is npm uninstall -g sverklo.
Use --since 30d to widen the window, or --format json if you want to pipe it somewhere.
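If you'd rather not install anything, the underlying analysis is simple enough to sketch yourself. A minimal TypeScript version, assuming each JSONL line carries an API-style message.usage object with input, output, and cache token counts; the exact field names in your logs may differ, so treat these as placeholders.

import { readFileSync, readdirSync, statSync } from "fs";
import { join } from "path";
import { homedir } from "os";

// Recursively collect every .jsonl session log under ~/.claude/projects.
function jsonlFiles(dir: string): string[] {
  return readdirSync(dir).flatMap(name => {
    const full = join(dir, name);
    if (statSync(full).isDirectory()) return jsonlFiles(full);
    return full.endsWith(".jsonl") ? [full] : [];
  });
}

// Sum token usage across all logged messages.
// Field names assume the API-style usage object; adjust to what your logs actually contain.
function tally() {
  const totals = { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 };
  for (const file of jsonlFiles(join(homedir(), ".claude", "projects"))) {
    for (const line of readFileSync(file, "utf8").split("\n")) {
      if (!line.trim()) continue;
      let entry: any;
      try { entry = JSON.parse(line); } catch { continue; }
      const u = entry?.message?.usage;
      if (!u) continue;
      totals.input += u.input_tokens ?? 0;
      totals.output += u.output_tokens ?? 0;
      totals.cacheRead += u.cache_read_input_tokens ?? 0;
      totals.cacheWrite += u.cache_creation_input_tokens ?? 0;
    }
  }
  return totals;
}

console.log(tally());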
The deeper point
The cost of running an AI coding agent is not the model's per-token rate. It's the search inefficiency baked into the agent's tool surface. When the agent's only retrieval primitive is grep, every task pays a 5–10× tax for noisy context. The model's prior fills the gaps with confident-sounding fabrication. Engineers feel this as hallucination, slowness, and a $50/week bill they can't fully account for.
The fix isn't a smarter model. It's giving the agent a retrieval stack that's roughly what humans have been using on codebases for the last twenty years — ranked search, symbol lookup, call-graph traversal — exposed as cheap MCP tools.
That's the whole post. Run sverklo receipt on your own week and tell me if the numbers match.
The comparison numbers come from npm run bench in the repo: 60 tasks, 5 baselines, raw data at sverklo.com/bench. If you find a number wrong, open an issue.