Prompt Caching Explained: How to Cut Your AI Coding Costs by Up to 90%
April 18, 2026 · 5 min read
The Single Biggest Cost Saver for AI Coding
If you're using AI coding agents and not thinking about prompt caching, you're leaving money on the table — a lot of it. Prompt caching can reduce your input token costs by up to 90%, and for coding workflows where the same context gets re-sent every turn, the savings are enormous.
Let's break down what prompt caching is, how it works, and how to make the most of it in your AI coding workflow.
What Is Prompt Caching?
Prompt caching is a feature that stores the processed representation of input tokens so they don't need to be reprocessed on subsequent requests. When you send the same input prefix across multiple API calls, the model can skip the expensive prefill computation and jump straight to generating output.
Think of it like this: every time your AI coding agent makes a request, it sends a huge amount of repeated context — the system prompt, your codebase structure, and the conversation history from earlier turns. Without caching, the model re-processes all of this from scratch every single time. With caching, it processes it once and reuses the result.
How It Works: The Mechanics
When you make an API call with prompt caching enabled, here's what happens:
- First request: The model processes your full input normally — system prompt, codebase context, conversation history, and your new instruction. This costs the standard input price. The processed representation is stored in a cache.
- Subsequent requests: When you send a new request with the same input prefix (same system prompt, same codebase files), the model checks the cache. If it finds a match, it skips reprocessing those tokens and only computes the new, uncached portion.
- Cache hit pricing: Cached input tokens cost 90% less than standard input tokens. On Anthropic's API, cached tokens cost $0.30 per million for Claude Sonnet 4.6 (vs $3.00 standard) and $0.10 per million for Claude Haiku 4.5 (vs $1.00 standard). Note that the first write to populate the cache incurs a 25% surcharge ($3.75/M for Sonnet vs $3.00/M standard), but this is quickly offset by the 90% savings on subsequent cache reads.
The key insight: the cache matches on prefixes, not exact matches. As long as the beginning of your request stays the same — which it typically does in a coding session where you're adding to the conversation, not replacing it — the cache keeps hitting.
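Here's roughly what that looks like with the Anthropic Python SDK. Treat it as a minimal sketch: the model ID, prompt text, and file path are placeholders, and the `cache_control` breakpoint is the piece that opts the static prefix into caching.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a coding agent working in this repository."   # placeholder
CODEBASE_CONTEXT = open("context/codebase_summary.md").read()          # placeholder path

def run_turn(conversation, new_instruction):
    """Send one turn, marking the static prefix (system prompt + codebase) as cacheable."""
    response = client.messages.create(
        model="claude-sonnet-4-6",  # illustrative model ID; use whichever model you're on
        max_tokens=1024,
        system=[
            # Static prefix: keep these blocks byte-identical across turns.
            {"type": "text", "text": SYSTEM_PROMPT},
            {
                "type": "text",
                "text": CODEBASE_CONTEXT,
                # Cache breakpoint: everything up to and including this block gets cached.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Dynamic content (growing conversation + new instruction) goes after the prefix.
        messages=conversation + [{"role": "user", "content": new_instruction}],
    )
    # First call: usage.cache_creation_input_tokens > 0 (the 25%-surcharge write).
    # Later calls: usage.cache_read_input_tokens > 0 (the 90%-discounted reads).
    print(response.usage)
    return response
```

One caveat: Anthropic only caches prefixes above a minimum length (on the order of a thousand tokens for Sonnet-class models), so a very short system prompt on its own won't benefit.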
Real Savings: A 100-Turn CLI Session
Let's make this concrete. Imagine a 100-turn CLI coding session where you're building a feature. Each turn, your agent sends:
- System prompt: ~2,000 tokens (constant across all turns)
- Codebase context: ~10,000 tokens (changes slowly — maybe 5% new per turn)
- Conversation history: grows each turn (turn 1 = 0, turn 50 = ~40,000 tokens, turn 100 = ~80,000 tokens)
- New instruction: ~500 tokens (unique each turn)
Without caching, every token is billed at the full input rate. With caching, roughly 70% of input tokens across the session hit the cache. Here's the difference:
| Model | Total Input Tokens | Input Cost (No Cache) | Input Cost (With Cache) | Savings |
|---|---|---|---|---|
| Claude Sonnet 4.6 | ~8M | $24.00 | $8.88 | 63% |
| Claude Haiku 4.5 | ~8M | $8.00 | $2.96 | 63% |
Now scale that to a full project build. A medium project (~5K LOC) generates about 24.1M input tokens across 367 turns. Without caching, that's $72.30 in input costs on Sonnet. With a 70% cache hit rate, it drops to roughly $26.81 — saving you over $45 on a single project.
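If you want to sanity-check those figures yourself, the arithmetic is simple enough to script. The sketch below reproduces the table using the per-million prices quoted above and a flat 70% hit rate; it ignores the one-time 25% cache-write surcharge, which is negligible over a long session.

```python
def session_input_cost(total_tokens, price_per_m, cached_price_per_m, hit_rate):
    """Blended input cost: cached tokens at the discounted rate, the rest at full price."""
    cached = total_tokens * hit_rate
    uncached = total_tokens - cached
    return (cached * cached_price_per_m + uncached * price_per_m) / 1_000_000

TOKENS = 8_000_000  # ~100-turn session from the example above

sonnet_no_cache = session_input_cost(TOKENS, 3.00, 3.00, 0.0)   # $24.00
sonnet_cached   = session_input_cost(TOKENS, 3.00, 0.30, 0.7)   # $8.88
haiku_no_cache  = session_input_cost(TOKENS, 1.00, 1.00, 0.0)   # $8.00
haiku_cached    = session_input_cost(TOKENS, 1.00, 0.10, 0.7)   # $2.96

print(f"Sonnet: ${sonnet_no_cache:.2f} -> ${sonnet_cached:.2f}")
print(f"Haiku:  ${haiku_no_cache:.2f} -> ${haiku_cached:.2f}")
```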
Provider Support for Prompt Caching
Not all providers handle caching the same way. Here's the current landscape:
- Anthropic (Claude): Full prompt caching support. You mark cache breakpoints in your API calls, and cached tokens are billed at 90% off. Cache has a 5-minute TTL (refreshed on each hit). This is the gold standard for coding agent workflows.
- OpenAI (GPT): Partial caching support. Automatic prompt caching is available on some models, but you can't control cache breakpoints manually. Savings are typically 50% on cached input rather than 90%. The cache TTL and hit rates are less predictable.
- Google (Gemini): Partial caching through "cached contents" API. You can create explicit caches, but the workflow is more complex than Anthropic's inline caching. Context caching is available on Gemini 2.5 Pro but has specific requirements for minimum token counts.
- DeepSeek: Limited caching. Some infrastructure-level caching may occur, but there's no user-facing cache control or discounted pricing for cached tokens.
Practical Tips for Maximizing Cache Hits
Prompt caching isn't automatic — you need to structure your API calls to maximize cache hits. Here's how:
- Keep your system prompt stable. Don't modify the system prompt between turns: because caching matches on prefixes, a change at the very start invalidates everything after it. Put your dynamic instructions at the end of the prompt, after all the static context.
- Order context consistently. Always send codebase files in the same order. If you're including file A, B, and C, don't reorder them to A, C, B on the next turn — that breaks the prefix match (see the sketch after this list).
- Don't break conversations unnecessarily. Starting a new chat resets the cache. Continue in the same session when possible, even if you're switching tasks.
- Use Claude for long sessions. Anthropic's 90% discount on cached tokens is unmatched. If you're running a 200+ turn coding session, the savings compound dramatically. A full medium project build on Sonnet drops from ~$332 to about $130 with good cache utilization.
- Minimize unnecessary context changes. Don't include timestamps, random IDs, or other volatile data in your prompt prefix. These change every request and prevent cache hits.
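To make the first two tips concrete, here's a hypothetical helper that builds the static prefix the same way on every turn: system prompt first, codebase files in a deterministic order, and the cache breakpoint at the end of the static block so the dynamic instruction never disturbs the prefix. The prompt text and file paths are placeholders.

```python
SYSTEM_PROMPT = "You are a coding agent working in this repository."  # keep byte-identical

def build_static_prefix(file_paths):
    """Assemble the cacheable prefix deterministically so it matches on every turn."""
    blocks = [{"type": "text", "text": SYSTEM_PROMPT}]
    # Sort the paths: files A, B, C must never arrive as A, C, B on the next turn.
    for path in sorted(file_paths):
        with open(path) as f:
            blocks.append({"type": "text", "text": f"=== {path} ===\n{f.read()}"})
    # Cache breakpoint at the very end of the static content; anything volatile
    # (timestamps, the new instruction) belongs in the messages array instead.
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return blocks

# Pass the result as `system=` in the messages.create() call shown earlier, then
# confirm hits by checking response.usage.cache_read_input_tokens on each turn.
```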
The Bottom Line
Prompt caching is not a nice-to-have — it's a must-use feature for anyone running AI coding agents at scale. The difference between cached and uncached costs can be 2–3x on a single project, and the savings grow with project size and session length.
If you're using Claude Code or any Anthropic-based agent, make sure prompt caching is enabled (it's on by default in Claude Code). If you're using other providers, check their caching documentation and structure your prompts accordingly.
Want to see the impact on your specific project? Run your scope through the AI Cost Estimator and compare costs with and without caching across all supported models.
Want to calculate exact costs for your project?
Estimate Your AI Coding Costs →