
Claude Code Troubleshooting on Large Repos — 6 Failure Modes and Fixes

2026-05-09 ~12 min read by Nikita Groshin

Claude Code is the strongest agentic coding tool I've used. On a 30-file project it's near-magical. On a 4,000-file repo it falls into specific, repeatable failure modes — files it can't find, function names it invents, tool calls it cascades, decisions it forgets, context windows it exhausts. Each failure has a name, a cause, and a fix. Six of them, in order of how often I've watched them happen, with the data behind each and links to the deeper writeups where they exist.

If you only read one section: grep accounts for 41% of Claude Code's input-token spend. The single highest-leverage fix is replacing grep cascades with a typed symbol lookup. Everything else is downstream.
In this guide
  1. Claude Code stops finding files
  2. Claude Code hallucinates function names
  3. Claude Code burns tokens on grep cascades
  4. Claude Code forgets yesterday's decisions
  5. Claude Code keeps repeating the same grep
  6. Claude Code's context window fills up

1. Claude Code stops finding files

"the file you're describing doesn't exist in this repository"

You added src/auth/refresh.ts ten minutes ago. The agent can see you typing about it. But when you ask it to "open the auth refresh file" it returns the equivalent of file not found.

Two causes, often combined:

Stale mental model. Claude Code internally maintains a soft model of your repo's structure based on the last directory listing it observed. New files added after that listing are invisible until something forces a re-read. The agent doesn't know it doesn't know.

Implicit path exclusions. The agent's earlier searches likely excluded test/, dist/, vendor/, build/, and similar directories as a noise-reduction heuristic. Once excluded, those paths stay invisible for the rest of the session — even if the file you're asking about lives there.

The fix: a live symbol index

Don't rely on the agent's cached directory state. Expose a query surface that always reflects current disk state — a file listing and a symbol lookup that hit the filesystem on every call instead of trusting a remembered snapshot.

The structural fix is exposing these as MCP tools so the agent can re-query disk on demand. Deep dive on the hallucination side of this failure: Why Claude Code Hallucinates Function Names That Don't Exist In Your Codebase.
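
A minimal sketch of what such a live lookup might look like — the function name and exclusion list are illustrative, not sverklo's actual implementation; the point is that every call re-reads the directory tree rather than a cached listing:

// Hypothetical live file lookup — re-reads the directory tree on every call,
// so a file created a minute ago is immediately visible to the agent.
// (Illustrative only; not sverklo's actual tool implementation.)
import { readdirSync } from "node:fs";
import { join } from "node:path";

const SKIP = new Set(["node_modules", ".git", "dist", "build", "vendor"]);

export function findFiles(root: string, query: string): string[] {
  const hits: string[] = [];
  const walk = (dir: string): void => {
    for (const entry of readdirSync(dir, { withFileTypes: true })) {
      if (SKIP.has(entry.name)) continue; // exclusions are explicit and re-applied per call, not sticky
      const full = join(dir, entry.name);
      if (entry.isDirectory()) walk(full);
      else if (full.toLowerCase().includes(query.toLowerCase())) hits.push(full);
    }
  };
  walk(root);
  return hits; // always reflects current disk state
}

// findFiles(".", "refresh")  ->  ["src/auth/refresh.ts", ...]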

2. Claude Code hallucinates function names

"getUserByEmail() is defined in src/users.ts" — except your codebase calls it findByEmail

The agent confidently writes code that calls getUserByEmail(). Your codebase has findByEmail(). Tests pass because the test mocks the dependency. The PR ships. Production breaks.

This is a generation-from-priors failure. Claude saw thousands of getUserBy* functions in training data; your codebase's actual conventions are a single context-window away. After compaction, the priors win.

The fix: ground generations in your real symbol graph

Three layers, in order of effectiveness:

  1. Install a code-intelligence MCP server. Sverklo, jcodemunch-mcp, serena, Claude-Context — pick one. The agent gets a lookup tool that verifies identifiers against a tree-sitter-parsed symbol table before writing them (sketched after this list). Measurable effect: 37% fewer hallucinated imports on our bench.
  2. Keep your CLAUDE.md concrete. Real paths, real identifiers, real conventions — not abstractions. Rules survive compaction better than examples; examples survive better than abstract guidance.
  3. Cap tool results at 2K tokens. Sessions with grep results over 8K tokens hallucinate 31% of the time vs 4% under 2K. The noise itself causes wrong answers downstream.
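
A minimal sketch of the verification step from item 1 — look the proposed name up in the indexed symbol table before emitting code, and surface near-misses when it isn't there. The types and matching heuristic here are illustrative, not any particular server's API:

// Check a proposed identifier against the repo's real symbol table before writing it.
// (Illustrative shape — a real server would back this with a tree-sitter-parsed index.)
interface SymbolEntry { name: string; file: string; kind: "function" | "class" | "type"; }

const words = (id: string) => id.split(/(?=[A-Z])/).map((w) => w.toLowerCase());

export function verifyIdentifier(name: string, symbols: SymbolEntry[]) {
  const hit = symbols.find((s) => s.name === name);
  if (hit) return { ok: true as const, hit };
  // No exact match: suggest symbols sharing camelCase words, instead of letting the prior win.
  const query = new Set(words(name));
  const near = symbols.filter((s) => words(s.name).some((w) => query.has(w))).map((s) => s.name);
  return { ok: false as const, didYouMean: near.slice(0, 5) };
}

// verifyIdentifier("getUserByEmail", index)
//   -> { ok: false, didYouMean: ["findByEmail"] }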

Deep dive: How I stopped Claude Code from hallucinating function names on a 4,000-file repo.

3. Claude Code burns tokens on grep cascades

14,200 input tokens consumed to locate one function

You ask Claude Code to find where parseConfig is defined. It runs grep -r "parseConfig". The grep returns 80 lines (definitions, calls, comments, test mocks). The agent runs three more greps to disambiguate. Each result re-feeds into context. By the time it produces an answer it has consumed 14,200 input tokens — about $0.04 on Claude Sonnet at current pricing. Multiply by 200 invocations a day across an engineering team and the math gets loud.
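
Back-of-envelope, assuming Sonnet-class input pricing of roughly $3 per million input tokens (the assumption behind the $0.04 figure above) and 22 working days a month:

// One grep cascade, priced as input tokens only.
const tokensPerCascade = 14_200;
const dollarsPerMillionInputTokens = 3;            // assumed Sonnet-class input rate

const perCascade = (tokensPerCascade / 1_000_000) * dollarsPerMillionInputTokens; // ≈ $0.043
const perDay = perCascade * 200;                   // 200 invocations/day ≈ $8.52
const perMonth = perDay * 22;                      // ≈ $187/month, before output tokens or retries

console.log({ perCascade, perDay, perMonth });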

This is the single most common failure mode and the one with the largest measurable impact:

| Sample | Tokens | Hallucination rate |
| --- | --- | --- |
| Grep results <2K tokens | ≈1,800 | 4% |
| Grep results 2K–8K tokens | ≈4,400 | 14% |
| Grep results >8K tokens | ≈11,200 | 31% |

Sample: 312 Claude Code tasks across one week, 200-file TypeScript repo. Full methodology + raw data in the field study.

The fix: typed retrieval instead of grep cascades

Replace grep with a structured symbol lookup that returns ranked results scoped by symbol type. Sverklo's hybrid retrieval (BM25 over chunk content + cosine similarity over ONNX embeddings + PageRank-weighted file ranking) returns the canonical definition with ~95% fewer tokens than grep.
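
For intuition, here's how the three signals might be combined into one ranking — the weights and normalization are illustrative, not sverklo's actual scoring function:

// Hybrid ranking: lexical (BM25) + semantic (embedding cosine) + structural (PageRank).
// Weights are illustrative.
interface Candidate {
  chunk: string;     // the code chunk that would be returned to the agent
  bm25: number;      // lexical score over chunk content
  cosine: number;    // query-vs-chunk embedding similarity, in [0, 1]
  pagerank: number;  // import-graph centrality of the containing file, in [0, 1]
}

export function rank(candidates: Candidate[]): Candidate[] {
  const maxBm25 = Math.max(...candidates.map((c) => c.bm25), 1e-9); // normalize BM25 into [0, 1]
  return candidates
    .map((c) => ({ c, score: 0.5 * (c.bm25 / maxBm25) + 0.35 * c.cosine + 0.15 * c.pagerank }))
    .sort((a, b) => b.score - a.score)
    .map(({ c }) => c);
}
// Only the top-ranked chunk or two ever reaches the agent — ~150 tokens instead of 80 grep lines.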

The single tool call that replaces a grep cascade:

# Before — grep cascade, 14,200 tokens
grep -r "parseConfig" src/
grep -r "parseConfig" --include="*.ts" src/
grep -r "parseConfig" -A 5 src/ | head -100

# After — one typed call, ~150 tokens
sverklo_lookup({ symbol: "parseConfig" })

On the public bench, sverklo's tools-per-task is 1.0; naive grep is 6.1. Same task, ~6× fewer tool calls.

Deep dive: Why Claude Code Burns So Many Tokens — A Field Study.

4. Claude Code forgets yesterday's design decisions

"Why are we using Prisma?" — when you decided that three weeks ago

Yesterday you and Claude Code spent an hour on a design decision: Prisma over Drizzle, with reasons. Today you ask a question downstream of that decision and the agent suggests the opposite. Compaction ate the rationale.

This is structural to how Claude Code's context window works. When conversation length approaches the limit, older turns get summarized into a compressed representation. Code-specific decisions — exact identifiers, file paths, type signatures, library trade-offs — are the first to get lossy because they look noisy to the compactor relative to active conversational state.

The fix: bi-temporal memory pinned to git SHAs

Persist decisions in a queryable layer that survives compaction. Sverklo's memory layer uses bi-temporal columns: every memory carries valid_from_sha + valid_until_sha + superseded_by. Updating a decision doesn't overwrite — it inserts a new row, sets valid_until_sha on the old one, and links them via superseded_by. Recall queries can ask "what's true now?" or "what was true at commit abc?" with equal precision.

# After a design decision — explicitly remember:
sverklo_remember "We chose Prisma over Drizzle for the typed-ORM surface"

# Six months later, after migrations:
sverklo_recall "ORM choice"  # returns current decision
sverklo_recall "ORM choice" --at-sha abc123  # what we believed at commit abc
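
Under the hood, the non-destructive update might look like this sketch — the column names follow the description above, but the thin Db wrapper and the SQL itself are illustrative, not sverklo's actual schema:

// Bi-temporal update: close out the old belief and link it forward; never delete or overwrite.
interface Db {
  run(sql: string, ...params: unknown[]): void;
  get<T>(sql: string, ...params: unknown[]): T | undefined;
}

export function remember(db: Db, key: string, content: string, currentSha: string): void {
  const open = db.get<{ id: number }>(
    "SELECT id FROM memories WHERE key = ? AND valid_until_sha IS NULL", key);

  // The new belief is valid from the current commit onward.
  db.run("INSERT INTO memories (key, content, valid_from_sha) VALUES (?, ?, ?)",
    key, content, currentSha);

  if (open) {
    // The old belief stays queryable ("what did we think at commit abc?") but is marked superseded.
    db.run("UPDATE memories SET valid_until_sha = ?, superseded_by = last_insert_rowid() WHERE id = ?",
      currentSha, open.id);
  }
}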

The pattern dates to relational databases in the 1990s. Applied to agent memory, it makes context compaction recoverable instead of destructive. Deep dive: Bi-temporal memory for AI coding agents and We Already Shipped Git-for-Agent-Memory — Bi-Temporal Beats Branch-Snapshot.

5. Claude Code keeps repeating the same grep

"let me search for that" — five times in a row

Watch a Claude Code session on a large repo and count how often it announces a search and then runs essentially the same grep with slightly different flags. Three to five repetitions per task is normal. Each one consumes tokens.

This is a tool-selection problem, not a search problem. The agent doesn't know which tool to reach for, so it falls back to the most general one (grep) and varies its parameters until something works. The deeper cause: too many MCP tools confuse selection (the agent freezes on which to use), too few force fallback to grep.

The fix: a slim, opinionated tool surface

Five tools cover roughly 80% of code-intel sessions.

Sverklo ships them as a named profile: SVERKLO_PROFILE=core exposes only those five, dropping the system-prompt tool-list size by 81% (8,016 → 1,522 tokens). The remaining 31 specialized tools stay hidden until you opt up.
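
The mechanism is simple enough to sketch — a named allow-list decides which tool definitions ever reach the system prompt. The profile contents here are illustrative; sverklo's actual core set may differ:

// Profile filtering: only a named subset of tool definitions is advertised to the model.
interface ToolDef { name: string; description: string; }

const PROFILES: Record<string, Set<string>> = {
  core: new Set(["overview", "lookup", "references", "remember", "recall"]), // illustrative names
};

export function visibleTools(all: ToolDef[], profile = process.env.SVERKLO_PROFILE): ToolDef[] {
  const allowed = profile ? PROFILES[profile] : undefined;
  if (!allowed) return all;                        // no profile: all 36 tools in the system prompt
  return all.filter((t) => allowed.has(t.name));   // core profile: 5 tools, ~1,522 tokens per turn
}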

Deep dive on the measurement: We Already Shipped MCP Code Mode — Sverklo's Tool Surface, Measured. Recipe page on combining profile-filtering with Anthropic's host-side defer_loading: Sverklo + Tool Search lazy-loading.

6. Claude Code's context window fills up

"To continue, please start a new conversation"

You've been working on a feature for two hours. Claude Code throws the soft-limit warning. You either start a new session and lose all the working context, or push past it and watch quality degrade as compaction lossily summarizes your past hour.

Three sources of context bloat, in descending order:

  1. Tool-call results — especially noisy grep output. See failure mode 3 above.
  2. Accumulated conversation history — every prior turn is re-sent with each request, so the prompt grows as the session runs.
  3. System prompt's tool definitions — every MCP server adds 1K–10K tokens of tool descriptions to every turn's input.
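
To see why the third source compounds, a quick back-of-envelope — the per-turn sizes are the measured ones from section 5; the 60-turn session length is an assumption:

// Tool definitions are re-sent on every turn, so their cost scales with session length.
const turns = 60;                  // assumed: a long feature session
const fullToolList = 8_016;        // tokens per turn with all 36 tools exposed
const coreProfile = 1_522;         // tokens per turn with SVERKLO_PROFILE=core

console.log(turns * fullToolList); // ≈ 481,000 tokens of tool descriptions alone
console.log(turns * coreProfile);  // ≈ 91,000 tokens for the same session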

The fix stack

Each of the previous five sections addresses a different bucket of context bloat. The cumulative effect is what matters:

| Source | Default cost | After fix |
| --- | --- | --- |
| Grep cascades (per task) | ~14,200 tokens | ~500 tokens (typed retrieval) |
| Tool-list system prompt (per turn) | ~8,016 tokens | ~1,522 tokens (SVERKLO_PROFILE=core) |
| Memory across sessions | Lost on compaction | Recoverable via bi-temporal recall |
| Hallucinated identifiers (per task) | 31% rate above 8K | 4% rate under 2K |

Combined, a typical session that used to hit the context-window soft limit around hour two now runs past hour five on the same context budget. The cost is one MCP server install and a profile env var.

What this guide is and isn't

This is a troubleshooting pillar — six of the most common failure modes Claude Code hits on real repos, with the data behind each and concrete fixes. The deep-dives are linked from each section; each one has its own measurement methodology and reproducer.

This is not a sverklo pitch. The fixes work with any code-intelligence MCP server (jcodemunch-mcp, serena, GitNexus, Claude-Context, sverklo). Sverklo is the one I maintain and have the most numbers for, so the examples lean that way; the patterns generalize. The public 5-baseline benchmark shows where each tool wins and loses, including the slices where sverklo loses to others.

Try the fix stack

npm install -g sverklo
SVERKLO_PROFILE=core sverklo init
# Then in your AI agent:
# Run sverklo_overview to see the codebase structure
# Run sverklo_lookup symbol:"parseConfig" instead of grep
# Run sverklo_remember to persist decisions across compactions

One install, one env var. Public bench · Recipe: profile + defer_loading · github.com/sverklo/sverklo
