
Karpathy's Token Optimization Guide: How 90% of AI Coding Bills Are Wasted

May 13, 2026 · 7 min read

Karpathy Says You Are Burning 90% of Your Token Budget

Andrej Karpathy, the former Tesla AI director and OpenAI researcher, recently outlined a stark claim: roughly 90% of what developers spend on AI coding assistance is wasted. Not wasted on bad outputs — wasted on unnecessary token consumption. Bloated context windows, wrong model selection for the task, agents that resend entire codebases on every iteration, and missed caching opportunities silently inflate bills by an order of magnitude.

This is not a theoretical claim. It maps directly to the five most common waste patterns we see in real developer workflows. Each one is fixable, and the combined savings can reduce an AI coding budget from hundreds of dollars per month to tens. Let's break them down with concrete numbers.

Waste Pattern 1: Over-Loading Files into Context

The most common waste pattern is dumping entire files — or entire directories — into the model's context when only a few functions or classes are relevant. A developer debugging a React component might send the entire 2,000-line file plus all its imports, consuming 8,000-10,000 input tokens when 500 tokens of the relevant function and its types would suffice.

At Claude Opus 4.7 pricing ($5.00 per million input tokens), sending 10,000 tokens costs $0.05 per request. Sending 500 tokens costs $0.0025. Over a day of 50 such requests, that is $2.50 vs $0.125 — a 20x difference on input alone. Scale that to a month and you are looking at $50 in unnecessary input costs from this single pattern.
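The arithmetic above is easy to sanity-check. A minimal sketch (the helper function is ours; the prices and token counts are the figures quoted in this post):

```python
# Input cost of a single request, using this post's quoted pricing.
def input_cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

OPUS_INPUT = 5.00  # $ per 1M input tokens

full_file = input_cost(10_000, OPUS_INPUT)  # whole file plus imports
surgical = input_cost(500, OPUS_INPUT)      # just the function and its types

print(f"${full_file:.4f} vs ${surgical:.4f} per request")
print(f"${50 * full_file:.2f} vs ${50 * surgical:.3f} over 50 requests "
      f"({full_file / surgical:.0f}x difference)")
```

The 20x ratio holds at any request volume; only the absolute dollar amounts scale.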

The fix is surgical context selection. Send only the function under review, its type signatures, and the specific error or test output. Tools like Claude Code's @file references and Cursor's context pinning help, but the discipline has to come from the developer. Ask yourself before every prompt: does the model actually need this file to answer my question?

Waste Pattern 2: Using the Wrong Model for the Task

This is the single biggest dollar-for-dollar waste. Developers default to their favorite model for every task, regardless of complexity. Running Claude Opus 4.7 ($5/$25) to generate a boilerplate Express route handler is like hiring a senior architect to paint a wall. The output is identical to what Haiku 4.5 or DeepSeek V4 Flash would produce, but at 10-35x the price.

| Task Type | Recommended Model | Input / Output (per 1M) | Savings vs Opus 4.7 |
| --- | --- | --- | --- |
| Boilerplate / scaffolding | DeepSeek V4 Flash | $0.14 / $0.28 | 97% cheaper |
| Unit test generation | GPT-5.4 Mini | $0.75 / $4.50 | 82% cheaper |
| Code documentation | Gemini 2.5 Flash | $0.30 / $2.50 | 90% cheaper |
| General coding tasks | Claude Sonnet 4.6 | $3.00 / $15.00 | 40% cheaper |
| Complex architecture | Claude Opus 4.7 | $5.00 / $25.00 | Baseline |
| Deep research / reasoning | GPT-5.5 | $5.00 / $30.00 | Use only when needed |

If 70% of your tasks are boilerplate and general coding, and only 10% require frontier reasoning, proper model routing cuts your weighted average cost by 60-75%. A developer spending $200/month on Opus for everything could spend $50-80 by routing appropriately.
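The blended cost falls out of a simple weighted average. A sketch, assuming a hypothetical 70/20/10 task split (the split is an assumption; the per-1M input prices are from the table above):

```python
# Weighted average input cost for an assumed task mix.
mix = [
    (0.70, 0.14),  # boilerplate / simple tasks -> DeepSeek V4 Flash
    (0.20, 3.00),  # general coding             -> Claude Sonnet 4.6
    (0.10, 5.00),  # complex architecture       -> Claude Opus 4.7
]
blended = sum(share * price for share, price in mix)
all_opus = 5.00

print(f"blended: ${blended:.2f}/M vs all-Opus: ${all_opus:.2f}/M "
      f"({1 - blended / all_opus:.0%} cheaper on input)")
```

Output prices shift the exact percentage, which is why a realistic routing setup lands in the 60-75% range rather than at one fixed number.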

Waste Pattern 3: Agents Resending the Entire Codebase

AI coding agents like Claude Code, Cursor Agent, and Copilot Workspace maintain context across multi-step tasks. The problem is that many agent implementations resend the full conversation history — including all previously loaded files — on every iteration. A 10-step debugging session that starts with 5,000 input tokens can balloon to 50,000+ tokens by the final step as context accumulates.

At Opus 4.7's $5/M input pricing, a 10-step session with escalating context costs roughly $1.38 in input tokens alone (average 27,500 tokens x 10 steps = 275K tokens). With smart context management — summarizing previous steps, dropping irrelevant files, resetting context between phases — the same session might consume 80K tokens total, costing $0.40. That is a 3.5x reduction from a single optimization.

The fix: break long agent sessions into focused sub-tasks. Summarize findings between phases. Remove files from context once the agent has moved past them. Some tools handle this automatically (Claude Code's context management is increasingly sophisticated), but developers who manually manage context in tools that do not will see immediate savings.
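For tools that do not manage context automatically, the summarize-and-trim step can be done by hand. A minimal sketch, assuming a chat-style message list; `summarize` is a hypothetical stand-in you would implement yourself, for example with a cheap model:

```python
# Compact a chat history: keep the system prompt, replace older turns
# with a one-line summary, and keep only the last few turns verbatim.
def compact_history(messages, keep_last=4, summarize=lambda msgs: "..."):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return system + rest
    old, recent = rest[:-keep_last], rest[-keep_last:]
    summary = {"role": "user",
               "content": f"Summary of earlier steps: {summarize(old)}"}
    return system + [summary] + recent
```

Every turn dropped from the history is a set of tokens you stop paying for on all subsequent iterations, which is where the escalating-context cost comes from.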

Waste Pattern 4: Not Using Prompt Caching

Prompt caching is the single most impactful cost optimization that most developers ignore. When you send the same system prompt, file contents, or documentation across multiple requests, caching lets the provider skip re-processing those tokens. Anthropic offers prompt caching that reduces input costs by up to 90% on cached tokens.

Consider a development session where you send a 3,000-token system prompt and a 5,000-token file on every request. Over 30 requests, that is 240,000 redundant input tokens — $1.20 at Opus 4.7 pricing. With caching, the first request pays full price but the remaining 29 pay a fraction, reducing the cost of those repeated tokens to approximately $0.12. Across a month of daily sessions, caching can save $25-50 on input costs alone.
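The session math above can be sketched as follows. This is a rough model, assuming a flat 90% discount on cached input tokens and ignoring the small cache-write surcharge providers typically charge on the first request:

```python
# Session input cost with and without prompt caching (simplified:
# flat discount on cached reads, cache-write surcharge ignored).
def session_cost(requests, tokens_per_request, price_per_million,
                 cached=False, cache_discount=0.90):
    full = tokens_per_request / 1e6 * price_per_million
    if not cached:
        return requests * full
    # First request pays full price; the rest read from cache.
    return full + (requests - 1) * full * (1 - cache_discount)

uncached = session_cost(30, 8_000, 5.00)
with_cache = session_cost(30, 8_000, 5.00, cached=True)
print(f"uncached: ${uncached:.2f}, cached: ${with_cache:.2f}")
```

The cached total comes out slightly above the $0.12 repeated-token figure because the first request still pays full price.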

OpenAI, Anthropic, and Google all support forms of prompt caching. If your tool or custom integration is not using it, you are paying full price for the same tokens repeatedly. Check your API integration and enable caching for static or slow-changing context.

Waste Pattern 5: No Model Routing Strategy

The final waste pattern ties the others together: treating model selection as a one-time decision rather than a per-task routing decision. Karpathy's point is that the optimal model changes with every prompt. A single coding session might involve tasks best served by four different models at four different price points.

Tools like OpenRouter's Pareto routing and custom model routing pipelines can automate this. The idea is simple: classify each request by complexity, then route to the cheapest model that can handle it. A quick type signature lookup goes to DeepSeek V4 Flash ($0.14/$0.28). A standard function implementation goes to Sonnet 4.6 ($3/$15) or Kimi K2.6 ($0.75/$3.50). A complex system design question goes to Opus 4.7 ($5/$25). The router makes the decision in milliseconds based on prompt characteristics.
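A toy version of that classify-then-route step looks like this. The keyword heuristics are purely illustrative, not how OpenRouter's Pareto routing actually classifies requests, and the model names and prices are the ones quoted in this post:

```python
# Toy complexity router: route each prompt to the cheapest model
# that can plausibly handle it (heuristics are illustrative only).
def route(prompt: str) -> str:
    text = prompt.lower()
    if any(k in text for k in ("architecture", "design a system", "trade-off")):
        return "claude-opus-4.7"    # $5 / $25: frontier reasoning
    if any(k in text for k in ("implement", "refactor", "fix this bug")):
        return "claude-sonnet-4.6"  # $3 / $15: standard coding
    return "deepseek-v4-flash"      # $0.14 / $0.28: lookups, boilerplate

print(route("What is the type signature of useReducer?"))
print(route("Implement a retry wrapper with backoff"))
print(route("Design a system for multi-region failover"))
```

A production router would classify on more than keywords (prompt length, file count, past task outcomes), but the cost structure is the same: most requests fall through to the cheap default.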

Without routing, developers default to one model for everything. With routing, the effective blended cost drops to the weighted average of all models used. For a typical coding workload, that blended cost is $1-3 per million tokens instead of $5-25.

The Combined Savings: From $300/Month to $30/Month

Let's put all five optimizations together for a developer spending $300/month on AI coding with Claude Opus 4.7 as their default:

| Optimization | Est. Reduction | Remaining Cost |
| --- | --- | --- |
| Starting monthly cost | — | $300 |
| 1. Surgical context selection | -30% | $210 |
| 2. Model routing by task | -55% | $95 |
| 3. Agent context management | -30% | $66 |
| 4. Prompt caching | -40% | $40 |
| 5. Ongoing routing refinement | -25% | $30 |

These numbers are estimates and compound differently depending on your workflow, but the directional math is clear. A developer who applies all five optimizations can realistically reduce their AI coding spend by 85-90% while maintaining the same — or better — output quality. Karpathy's 90% waste figure is not hyperbole. It is a reflection of how most developers use AI models today: defaulting to the most expensive option, sending everything as context, and never looking at what they are actually paying for.
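Note that the reductions in the table compound multiplicatively: each percentage applies to what is left after the previous step, not to the original $300. A quick check:

```python
# Apply each estimated reduction to the remaining cost, in order.
cost = 300.0
for reduction in (0.30, 0.55, 0.30, 0.40, 0.25):
    cost *= 1 - reduction

print(f"remaining: ${cost:.2f}/month")  # lands near $30
print(f"total reduction: {1 - cost / 300.0:.0%}")
```

That total sits right at the top of the 85-90% range, which is why the table's rounded rows end at $30.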

The first step is visibility. Use the AI Cost Estimator to understand what your current workflow costs, then identify which of these five waste patterns is your biggest source of overspending. Most developers will find that model routing alone cuts their bill in half.

Want to calculate exact costs for your project?

Estimate Your AI Coding Costs →