Why Claude Code Burns So Many Tokens — A Field Study (14,200 Tokens to Find One Function)
I logged every tool call my Claude Code agent made for a week, across repos from 200 to 4,000 files. Grep alone cost me $47. Here's the data, and what I changed.
The setup
Most engineers using Claude Code or Cursor have a vague sense that AI agents are expensive on large repos. Few of us have actually measured it. I instrumented one week of normal work — 47 sessions, 312 tasks, all on private codebases between 200 and 4,000 files — and parsed every tool call out of the session logs.
The results are not subtle.
Where the tokens go
Across 312 tasks, the average input-token spend per task was 22,840 tokens. That includes the system prompt, conversation history, and tool-call results. The split:
| Source | % of input tokens | Notes |
|---|---|---|
| Grep results | 41% | Returned full lines, often hundreds per call |
| File reads | 28% | Re-reading files the agent had already touched |
| Conversation history | 18% | The agent's own prior outputs |
| Glob results | 7% | Listing files, often multiple times per session |
| System + tool definitions | 6% | Fixed cost |
Grep alone — a single tool call type — accounted for 41% of the entire token spend. On the median session, the agent ran 9 grep calls. The most expensive single grep returned 14,200 tokens of output to locate a single function.
The 14,200-token grep
Here's the actual call, paraphrased to scrub identifying details:
Tool: grep
Query: "logRequest|logResponse|requestId"
Files matched: 312
Lines returned: 1,847
Output tokens: 14,184
The agent was trying to find the canonical request-logging function in a 4,000-file repo. The query was reasonable — three identifiers that might match the answer. The output was 1,847 lines of context-free regex hits, of which exactly 3 were actually useful.
The agent then spent another 8,200 tokens reading two files to disambiguate, and ultimately edited the wrong one. The task took 4 grep calls + 6 file reads. Total: 47,300 input tokens. At Claude Sonnet's $3/M input rate, that's $0.14 — for one task.
I do roughly 50 of these a day. Pure grep cost on a normal week: $47.10.
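To sanity-check the arithmetic, here is the cost math written out. A minimal sketch in TypeScript using the per-task figures above and the quoted $3-per-million-input-tokens rate; the helper name is mine.

// Dollars for a task, given its input-token count and the model's input rate.
const SONNET_INPUT_RATE = 3.0; // $ per million input tokens, as quoted above

function inputCost(inputTokens: number, ratePerMillion: number): number {
  return (inputTokens / 1_000_000) * ratePerMillion;
}

// The worst task in the log: 4 greps + 6 file reads, 47,300 input tokens.
console.log(inputCost(47_300, SONNET_INPUT_RATE).toFixed(2)); // "0.14"

// The average task: 22,840 input tokens.
console.log(inputCost(22_840, SONNET_INPUT_RATE).toFixed(3)); // "0.069"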
The compounding problem
This isn't just expensive. It cascades.
When grep returns 1,847 lines, the agent has to read all of it as input. Its working context is now polluted with hundreds of irrelevant matches. The next tool call has worse signal-to-noise. The model's prior — that function names like logResponseTime exist in most codebases — wins over the actual evidence in your repo, which has scrolled out of attention.
This is the load-bearing failure mode of AI coding agents on large repos: expensive search → noisy context → model falls back to training-data priors → hallucinated function names → wrong edit.
You can see the chain in the data. Sessions with grep results over 8,000 tokens had a hallucination rate of 31%; sessions with grep results under 2,000 tokens, 4%. Across all sessions, the correlation between grep output volume and hallucination is r = 0.74.
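For the curious, that number is nothing fancier than Pearson's r over per-session records. A sketch of the computation, assuming one record per session with its total grep output tokens and a hand-labeled hallucination flag; the record shape and names are mine, not the actual analysis script.

// One record per session: grep output volume and whether the session
// produced at least one hallucinated function name (labeled by hand).
interface SessionRecord {
  grepTokens: number;
  hallucinated: boolean;
}

// Plain Pearson correlation between two equal-length series.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - mx) * (ys[i] - my);
    vx += (xs[i] - mx) ** 2;
    vy += (ys[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

function grepHallucinationCorrelation(sessions: SessionRecord[]): number {
  return pearson(
    sessions.map(s => s.grepTokens),
    sessions.map(s => (s.hallucinated ? 1 : 0)),
  );
}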
Why grep isn't the right tool
Grep matches identifiers lexically. It does three things wrong on code:
- No ranking. A grep on "request" returns 312 matches with no signal about which is load-bearing — which functions are central to the call graph and which are utility code. The agent reads the first three results and stops, which on a 4,000-file repo is almost always wrong.
- No semantic recall. Asking grep "what handles request timing in this repo?" doesn't work. The string "request timing" probably doesn't appear; the actual function is called recordLatency.
- No structure. Grep can't tell you which functions transitively call logRequest. For refactor tasks the agent needs the call graph, not the textual matches (see the sketch after this list).
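To make that last point concrete, here is a minimal sketch of the query grep structurally cannot answer: walking a call graph to find everything that transitively calls a function. The CallGraph shape and the example names are illustrative, not any particular tool's internals.

// Maps a function name to the functions that call it directly.
type CallGraph = Map<string, string[]>;

// Breadth-first walk: every function that reaches `target` through some chain of calls.
function transitiveCallers(graph: CallGraph, target: string): Set<string> {
  const seen = new Set<string>();
  const queue = [...(graph.get(target) ?? [])];
  while (queue.length > 0) {
    const caller = queue.shift()!;
    if (seen.has(caller)) continue;
    seen.add(caller);
    queue.push(...(graph.get(caller) ?? []));
  }
  return seen;
}

// Hypothetical example: main -> handleRequest -> logRequest
const graph: CallGraph = new Map([
  ["logRequest", ["handleRequest"]],
  ["handleRequest", ["main"]],
]);
console.log(transitiveCallers(graph, "logRequest")); // Set { 'handleRequest', 'main' }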
The honest fix is hybrid retrieval: BM25 for exact identifiers, embeddings for concepts, PageRank on the call graph for ranking, all combined and exposed as ranked results. None of that is exotic — it's the standard retrieval stack from search engines, applied to code. But no AI coding agent ships with it built in. You have to add it.
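For a sense of what "combined" means in practice, here is a minimal score-fusion sketch. It assumes you already have per-symbol BM25 scores, embedding similarities, and PageRank values from an index; the weights and every name in it are placeholders of mine, not sverklo's actual internals.

// One search candidate with the three signals described above.
interface Candidate {
  symbol: string;     // e.g. a function name or file path
  bm25: number;       // lexical score: exact identifier matches
  embedding: number;  // cosine similarity to the query: concepts
  pagerank: number;   // centrality in the call graph: what's load-bearing
}

// Scale each signal to [0, 1] so the weights below are comparable.
function normalize(values: number[]): number[] {
  const max = Math.max(...values, 1e-9);
  return values.map(v => v / max);
}

// Weighted fusion of the three signals; the weights are arbitrary starting points to tune.
function rank(candidates: Candidate[], k = 10): Candidate[] {
  const bm25 = normalize(candidates.map(c => c.bm25));
  const emb = normalize(candidates.map(c => c.embedding));
  const pr = normalize(candidates.map(c => c.pagerank));
  return candidates
    .map((c, i) => ({ c, score: 0.4 * bm25[i] + 0.4 * emb[i] + 0.2 * pr[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.c);
}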
What I changed
I installed Sverklo, a local-first MCP server that gives Claude Code 37 extra retrieval tools. (Disclosure: I wrote it. The data above is from real private repos; you can reproduce the measurement on your own.)
The exact change in my Claude Code config:
{
  "mcpServers": {
    "sverklo": {
      "command": "npx",
      "args": ["-y", "sverklo"]
    }
  }
}
Then cd your-project && sverklo init. Indexing took 47 seconds for the 4,000-file repo.
I re-ran the same 312 tasks against the indexed repo. Same prompts, same models, same machine. New numbers:
| Metric | Before | After | Delta |
|---|---|---|---|
| Avg input tokens / task | 22,840 | 6,210 | −73% |
| Avg tool calls / task | 9.2 | 1.8 | −80% |
| Hallucinated function names | 31% of large-grep sessions | 2% | −94% |
| Weekly token cost | $47.10 | $12.83 | −73% |
Most of the gain is from sverklo_search returning roughly 300 tokens of ranked results instead of 1,847 lines of raw matches, plus sverklo_lookup answering "where is X defined?" in a single call. The agent doesn't need to grep its way around anymore.
Where it still doesn't help
I want to be honest about the slice where this changed nothing.
- Repos under ~5,000 LOC. The whole repo fits in context. Grep is fine. Don't bother indexing.
- Reference-finding tasks. A well-tuned ripgrep ties sverklo on the "find every caller of X" benchmark task (P2 in the public bench). The semantic graph adds nothing for purely textual queries.
- Definition lookup. jcodemunch-mcp beats sverklo on definition lookup (P1) at 0.65 F1 vs 0.45. Their tree-sitter indexing is sharper than mine. I have something to learn from them.
If your workflow is dominated by P1/P2, your token savings will be real but smaller than mine. If it's dominated by exploration ("what does this repo do?", "what handles X?", "what calls Y transitively?"), the ratio above is roughly what you should expect.
How to measure your own session
I shipped the instrumentation as a sverklo subcommand. It's in the latest npm release (v0.20.1):
npm install -g sverklo
sverklo receipt
It parses your last week of Claude Code session logs (~/.claude/projects/**/*.jsonl) and prints the same breakdown as above, but for your own data. Output looks like this:
sverklo receipt
──────────────────────────────────────────────────────────
Last 7 days · 134 sessions · 10,317 tool calls
Token spend
Input (new): 310,032
Cache reads (cheap): 5,344,464,042
Cache writes (full price): 162,448,306
Output: 13,045,579
Estimated cost
Sonnet rates: $2,287.30
Opus rates: $11,436.49
Projected yearly (Sonnet): $119,266.25
Top tool consumers
Bash 4836 calls
Edit 1938 calls
Read 1435 calls
Grep 228 calls
…
The receipt is the cheapest experiment I can suggest. If your repo is small or your workflow doesn't include much exploration, the receipt will tell you so. If it shows the opposite, the install is one line and the uninstall is npm uninstall -g sverklo.
Use --since 30d to widen the window, or --format json if you want to pipe it somewhere.
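If you'd rather not install anything, the underlying analysis is simple enough to sketch yourself. A minimal TypeScript version, assuming each JSONL line carries an API-style message.usage object with input, output, and cache token counts; the exact field names in your logs may differ, so treat these as placeholders.

import { readFileSync, readdirSync, statSync } from "fs";
import { join } from "path";
import { homedir } from "os";

// Recursively collect every .jsonl session log under ~/.claude/projects.
function jsonlFiles(dir: string): string[] {
  return readdirSync(dir).flatMap(name => {
    const full = join(dir, name);
    if (statSync(full).isDirectory()) return jsonlFiles(full);
    return full.endsWith(".jsonl") ? [full] : [];
  });
}

// Sum token usage across all logged messages.
// Field names assume the API-style usage object; adjust to what your logs actually contain.
function tally() {
  const totals = { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 };
  for (const file of jsonlFiles(join(homedir(), ".claude", "projects"))) {
    for (const line of readFileSync(file, "utf8").split("\n")) {
      if (!line.trim()) continue;
      let entry: any;
      try { entry = JSON.parse(line); } catch { continue; }
      const u = entry?.message?.usage;
      if (!u) continue;
      totals.input += u.input_tokens ?? 0;
      totals.output += u.output_tokens ?? 0;
      totals.cacheRead += u.cache_read_input_tokens ?? 0;
      totals.cacheWrite += u.cache_creation_input_tokens ?? 0;
    }
  }
  return totals;
}

console.log(tally());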
The deeper point
The cost of running an AI coding agent is not the model's per-token rate. It's the search inefficiency baked into the agent's tool surface. When the agent's only retrieval primitive is grep, every task pays a 5–10× tax for noisy context. The model's prior fills the gaps with confident-sounding fabrication. Engineers feel this as hallucination, slowness, and a $50/week bill they can't fully account for.
The fix isn't a smarter model. It's giving the agent a retrieval stack that's roughly what humans have been using on codebases for the last twenty years — ranked search, symbol lookup, call-graph traversal — exposed as cheap MCP tools.
That's the whole post. Run sverklo receipt on your own week and tell me if the numbers match.
The comparison numbers come from npm run bench in the repo: 60 tasks, 5 baselines, raw data at sverklo.com/bench. If you find a number wrong, open an issue.