Local-first code intelligence — engineering notes
Receipts, methodology, losses. We publish the benchmark, the competitors patch, we patch back. Hybrid code search, PageRank for source code, bi-temporal memory, and the parser bugs we found by running the bench on real codebases.
The fix that wasn't
Day 1 of our growth campaign was supposed to be Saturday. Instead we shipped six npm versions in five days, closing user-reported bugs. The most useful moment was shipping a fix that wasn't real. What changed afterward: a new constitutional principle, validation against the built binary rather than the source, and a CI parse-check that closes the entire "silent SyntaxError in shipped JS" class. The eight-day silent dashboard regression and the 12-hour false positive on #59 are both in the log.
1 million lines, 10,400 unsafe blocks, 6 days. The audit gap is the bottleneck.
The Bun → Rust PR (#30412) merged yesterday — full runtime rewrite, ~1M lines, AI agents at the keyboard. The viral nerve isn't "Rust vs Zig," it's the 10,400 unsafe blocks plus Sumner's "we haven't been typing code ourselves for many months now." When agents write a million lines in a week, the bottleneck moves from generation to verification. Sverklo's role in that loop, the measurement layer angle, and the dogfood admission of a HIGH-severity regression caught by our own agents.
Claude Code Troubleshooting on Large Repos — 6 Failure Modes and Fixes
Claude Code stops finding files, hallucinates function names, burns tokens on grep cascades (41% of input-token spend), forgets yesterday's decisions after compaction, repeats the same searches, and exhausts context. Pillar guide covering all six failure modes with the data behind each (31% hallucination rate at >8K tokens vs 4% under 2K) and concrete fixes (typed retrieval, profile-filtered tool surface, bi-temporal memory). Internal links to every existing deep-dive.
We Already Shipped Git-for-Agent-Memory — Bi-Temporal Beats Branch-Snapshot
A 1,768-view tweet pitched Memoir as Git-style memory for AI agents earlier today. Sverklo's bi-temporal SHA-pinned memory layer ships the deeper version: valid_from_sha + valid_until_sha + superseded_by can answer "what was true at commit abc?" for any commit in history. Memoir's per-branch HEAD pointers can't (their own source comment confirms commit-level checkout is a placeholder). Honest comparison, the one ergonomic win Memoir has, and the borrowable idea worth implementing.
We Already Shipped MCP Code Mode — Sverklo's Tool Surface, Measured
Q2 2026's MCP discourse landed on tool-list bloat: Cloudflare cut a 1.17M-token spec to ~1K with Code Mode, Anthropic shipped MCP Tool Search lazy-load, Maxim wrote about cutting 92% at 500+ tools. Sverklo has shipped the same idea for months under SVERKLO_PROFILE. Today I measured it: 8,016 → 1,522 tokens (81% reduction) with one env var. Per-profile table, what each profile contains, when each works and when it breaks.
The bench is a feedback loop on two axes
Two iterations of the same pattern in one week. Lodash P1 fixes shipped on both sides of the bench in 36 hours. Auto-bench CI shipped 24 hours after a Reddit comment named the contribution-friction gap. Same loop, two axes — bugs and friction. The mechanism is the part worth writing down: public surface, honest losses, reproducibility, low-friction merge. When it doesn't work and how to apply it to your own tool category.
Late-interaction rerank made our F1 worse, not better
We wired a poor-man's late-interaction reranker into sverklo's lookup and refs tools, ran the full 4-dataset 120-task bench three times deterministically, and F1 dropped from 0.5847 to 0.5551, with a 7.5-point regression on P1 specifically. SQL match-quality is already optimal for "find the symbol named X"; semantic token alignment dilutes the signal. Negative result, full diagnosis, what we'd try next.
Claude Code burned 14,200 tokens to find one function
A field study of one week of instrumented Claude Code sessions across 312 tasks. Grep accounts for 41% of input tokens. Sessions with grep results over 8K tokens hallucinate 31% of the time vs 4% under 2K (r = 0.74). Includes the new `sverklo receipt` command (v0.20.1) that runs the same analysis on your own session logs.
How I stopped Claude Code from hallucinating function names on a 4,000-file repo (with a local MCP server)
My agent kept inventing function names that looked plausible but didn't exist (logResponseTime, trackRequestDuration — all fake). Three runs, three different invented names. The fix wasn't a smarter model — it was a real symbol graph exposed as MCP tools. Bench numbers, four failure modes, the cases where it doesn't help.
A Practical Guide to MCP Servers for Code Intelligence (May 2026)
Definitive landscape of 12 MCP servers for code intelligence. Honest comparison matrix on license, hosting, language coverage, tool count, and retrieval substrate. Includes a decision tree by team profile, a security PSA, a glossary, and a section that names the cases where Sverklo (the project that wrote the guide) loses.
I added two competitors to my own benchmark. One of them beat me at P1.
A user opened an issue asking sverklo to benchmark against jcodemunch-mcp and GitNexus. Two surprises: smart-grep ties sverklo on overall F1 (0.450 vs 0.449), and jcodemunch wins P1 (definition lookup) outright at 0.65 vs 0.45. Plus the structural quirk where both new competitors return ~0 on P2 because they only track import sites. Honest 5-baseline writeup with reproducible numbers.
MCP STDIO command injection: the class Anthropic won't patch, and the 30-second audit any maintainer can run
OX Security disclosed a class of CWE-78 RCEs affecting 7,000+ MCP servers. Anthropic declined to patch — "by design." Here's what the class actually is, the four-rule defense, how sverklo applies it in ~50 lines of validation, and a 30-second audit any user can run on any MCP server they're about to install. With grep one-liners.
Git for AI agent memory — version control for what your AI knows about your codebase
Git made code safe to change. Sverklo makes the agent's understanding of code safe to change. The four operations Git made boring — snapshot, branch, rollback, merge — applied to AI agent memory pinned to git SHAs. Plus the SQLite schema and the queries that justify the design.
Bi-temporal memory for AI coding agents — the 1990s database pattern that fixes "my agent forgot what we decided yesterday"
Most "memory" features for AI coding agents are flat key-value stores: when you update a memory, the old value is gone. That's the wrong abstraction for a codebase, where the team's beliefs change and the question "what did we think about auth at commit abc123?" is a real one. Sverklo borrows a 30-year-old database pattern — bi-temporal memory — and pins it to git SHAs. With the SQLite schema.
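To make the pattern concrete, here is a minimal sketch of SHA-pinned memory in SQLite. It is not Sverklo's actual schema. The column names valid_from_sha, valid_until_sha and superseded_by are the ones mentioned elsewhere on this page; the commits table, its seq column, the recorded_at column, and the remember and belief_at helpers are assumptions made for illustration. The sketch also assumes a linear history, so commit order can be stored as a plain integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical helper table: maps each SHA to its position in a linear history.
CREATE TABLE commits (
    sha TEXT PRIMARY KEY,
    seq INTEGER NOT NULL UNIQUE
);
CREATE TABLE memories (
    id              INTEGER PRIMARY KEY,
    topic           TEXT NOT NULL,
    body            TEXT NOT NULL,
    valid_from_sha  TEXT NOT NULL REFERENCES commits(sha),
    valid_until_sha TEXT REFERENCES commits(sha),       -- NULL = still believed
    superseded_by   INTEGER REFERENCES memories(id),
    recorded_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- when we wrote it down
);
""")
conn.executemany(
    "INSERT INTO commits (sha, seq) VALUES (?, ?)",
    [("abc123", 1), ("def456", 2), ("fed789", 3)],
)


def remember(conn, topic, body, sha):
    """Record a new belief and close out the one it replaces, without deleting it."""
    new_id = conn.execute(
        "INSERT INTO memories (topic, body, valid_from_sha) VALUES (?, ?, ?)",
        (topic, body, sha),
    ).lastrowid
    conn.execute(
        "UPDATE memories SET valid_until_sha = ?, superseded_by = ? "
        "WHERE topic = ? AND valid_until_sha IS NULL AND id != ?",
        (sha, new_id, topic, new_id),
    )
    return new_id


def belief_at(conn, topic, sha):
    """Return what was believed about `topic` at commit `sha`, or None."""
    row = conn.execute(
        """
        SELECT m.body
        FROM memories m
        JOIN commits c_from ON c_from.sha = m.valid_from_sha
        LEFT JOIN commits c_until ON c_until.sha = m.valid_until_sha
        JOIN commits c_at ON c_at.sha = ?
        WHERE m.topic = ?
          AND c_from.seq <= c_at.seq
          AND (c_until.seq IS NULL OR c_at.seq < c_until.seq)
        """,
        (sha, topic),
    ).fetchone()
    return row[0] if row else None


remember(conn, "auth", "Use session cookies", "abc123")
remember(conn, "auth", "Use JWT with short-lived tokens", "def456")

print(belief_at(conn, "auth", "abc123"))  # the belief held at the first commit
print(belief_at(conn, "auth", "fed789"))  # the belief that superseded it
```

The point of the design is that an update never overwrites: the old row stays, gains an end SHA and a pointer to its successor, and remains queryable. A real repository with branches would need an ancestry check in place of the integer seq comparison.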
I benchmarked code retrieval for AI coding agents on 60 tasks
A tuned grep beats sverklo on F1 by 9 points. Sverklo wins by 62× on token economy and 7-12× on tool-call count. Both numbers are real, both are in the same report, and the second one matters more for AI agents with bounded context windows. The bench, the slice where sverklo loses to grep on dead-code detection, and why "tokens per correct answer" is the load-bearing metric.
bench:swe first results: where local-first code intelligence still misses
First complete cross-repository run of bench:swe across Express, NestJS, Vite, Prisma, and FastAPI. 38 of 65 perfect recall, 66.2% average. The headline number isn't the interesting part — the failure pattern is, and it points at one specific gap in hybrid retrieval that is fixable in v0.18.
Claude Code keeps losing context after compaction — here's how to fix it
Every long Claude Code session hits compaction. When it does, the file contents and search results go first — your agent forgets which files matter, what was decided, and what's risky to change. Persistent, git-aware code intelligence survives the context reset. Here's how.
Reciprocal Rank Fusion is doing 80% of the work in our hybrid search
I tried half a dozen scoring schemes for combining BM25, vector similarity, and PageRank. Most of them required tuning weights, calibrating scores, and explaining the result to skeptical reviewers. RRF is three lines of math, has no tunable parameters, and beats every alternative I tested. Here's what it does, why it works, and why it should be your default.
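For readers who have not met it, a minimal sketch of Reciprocal Rank Fusion follows. It is a generic illustration, not Sverklo's implementation: the file names and the three input rankings are made up, and k = 60 is the constant conventionally used with RRF, normally left at that value rather than tuned.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical per-signal rankings for one query (best match first).
bm25_ranking = ["a.ts", "b.ts", "c.ts"]
vector_ranking = ["b.ts", "c.ts", "a.ts"]
pagerank_ranking = ["b.ts", "a.ts", "d.ts"]

fused = rrf([bm25_ranking, vector_ranking, pagerank_ranking])
print(fused)  # ['b.ts', 'a.ts', 'c.ts', 'd.ts']
```

Only rank positions enter the formula, never the raw BM25, cosine, or PageRank scores. That is why no score calibration or per-signal weights are needed.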
PageRank for source code: a 2026 revival
Embeddings tell you what code is similar; PageRank tells you what code is load-bearing. Neither alone is enough; together they're the difference between an LLM reading the test fixture and an LLM reading the production file. A short tour of an old idea applied to a new problem.
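As a rough illustration of the idea, here is a small power-iteration PageRank over an import graph. The graph, the file names, and the parameter values are assumptions chosen for the example; how Sverklo builds and weights its own graph is not described on this page.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each file to the files it imports; returns a rank per file."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets the baseline "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # A file passes its rank to the files it imports.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A file that imports nothing spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical import graph: importer -> imported files.
imports = {
    "routes.ts": ["db.ts", "auth.ts"],
    "auth.ts": ["db.ts"],
    "db.test.ts": ["db.ts"],
    "db.ts": [],
}

ranks = pagerank(imports)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{name}: {ranks[name]:.3f}")
```

In this toy graph db.ts comes out on top because everything depends on it, while db.test.ts, which nothing imports, sits at the bottom. That is the "load-bearing" signal that similarity alone does not provide.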
Bi-temporal memory for AI coding agents
When Claude Code compacts context, the decision you made about retry semantics last Tuesday is gone. Sverklo stores it against the git SHA it was made under and tells you whether the code it referred to still exists. This is why your agent should be git-aware.