Prompt Caching Across Claude, GPT, and Gemini: A 2026 Cost-Saving Playbook for Coding Agents
June 27, 2026 · 10 min read
Why Caching Dominates the Cost Equation
For any AI coding agent that reads the same files across multiple turns — which is almost all of them — prompt caching is not an optimization, it is the cost equation. A naïve agent that re-sends 25K of file context on every turn pays full input price each time. The same agent with proper caching pays 10% of the input cost on cache hits. Over a 20-turn session, that's an 85-90% reduction in input spending.
Every major provider — Anthropic, OpenAI, Google — now offers prompt caching, but the contracts differ in ways that matter for your agent design. This is the 2026 reference playbook.
Anthropic Claude: Explicit Breakpoints
Mechanism: You manually mark cache breakpoints in your prompt structure. Anything before a breakpoint can be cached and reused; anything after must be re-sent each turn.
- Cache write cost: 1.25x normal input rate (one-time, when the cache is first written).
- Cache read cost: 0.1x normal input rate (90% discount).
- Cache TTL: 5 minutes by default, configurable up to 1 hour.
- Where it fits in your prompt: system prompt, tool definitions, long codebase context — anything stable across a session.
The explicit-breakpoint model is the most powerful when you control the prompt structure. You can guarantee that the model re-uses the cache, and you can verify cache hits via the response metadata. The downside: it requires careful prompt engineering. Misplaced breakpoints (e.g. before content that actually changes between turns) silently break caching and you pay the full price.
OpenAI GPT-5.6: New 30-Minute Contract
OpenAI's GPT-5.6 family introduces a meaningfully better caching contract than earlier GPT models:
- Cache write cost: 1.25x normal input rate.
- Cache read cost: 0.1x normal input rate (90% discount).
- Cache TTL: 30-minute minimum (up from 5 minutes in earlier GPT models).
- Cache breakpoints: explicit support, similar to Anthropic's model.
The 30-minute minimum is the headline change. Long agent sessions that previously had to re-warm the cache every 5 minutes now keep it alive across more of the session. The economics align with the realistic coding-agent workflow: 30 minutes covers most "fix one bug end-to-end" task durations.
For teams that already structured their prompts for Anthropic-style breakpoints, the GPT-5.6 caching contract is a near-drop-in port. For teams relying on OpenAI's older implicit prefix caching, the migration to explicit breakpoints requires light prompt restructuring.
Google Gemini: Implicit Prefix Caching
Gemini's caching is less aggressive than Anthropic or OpenAI's, but easier to use:
- Mechanism: automatic — Google detects when subsequent requests share a long prefix and discounts the shared portion.
- Discount: 25% on the cached prefix (smaller than the 90% from Anthropic/OpenAI).
- Configuration: none required.
- TTL: not documented publicly, varies.
The trade-off is clear: easier setup, smaller savings. For high-volume teams the gap between 25% and 90% caching discount adds up to a meaningful cost difference at scale. For teams that are not optimizing aggressively or whose prompts change frequently, the implicit caching just works.
Cost Math: The Same Workload, Three Providers
Coding agent with 25K input / 5K output per turn, 20 turns per session, 70% of input tokens cacheable:
- Claude Sonnet 4.6 ($3/$15) with explicit caching: ~$0.45/session (vs $1.05 uncached). 57% saving.
- GPT-5.6 Terra ($2.50/$15) with 30-min cache: ~$0.42/session. 57% saving.
- Gemini 3.5 Flash ($1.50/$9) with implicit caching: ~$0.52/session. 22% saving.
With aggressive caching turned on, Terra and Sonnet 4.6 are roughly equivalent on total session cost despite different list prices. Gemini Flash, which looked cheapest on the raw price card, becomes more expensive per session because its caching is less aggressive.
Decision Rules for Your Workflow
Rule 1: Cache the stable parts of your prompt. System prompt, tool definitions, retrieved codebase context — anything that doesn't change turn-to-turn. Put cache breakpoints right before the parts that do change.
Rule 2: Match session length to cache TTL. If your agent sessions are typically 20-40 minutes, you want a provider with 30+ minute cache TTL — Anthropic with the 1-hour TTL extension, or GPT-5.6 by default. Gemini's variable TTL works against you on long sessions.
Rule 3: Verify cache hits in production. Both Anthropic and OpenAI return cache-hit metadata in the response. Log it. If your cache hit rate is below 50%, your breakpoints are misplaced and you're paying for caching that doesn't help.
Rule 4: Don't change models mid-session. Caches are provider-specific. The OpenRouter MCP routing pattern (switching models on the fly) defeats caching for the new model on its first turn. Pin your model for the duration of a session, then re-route between sessions.
Migration Trade-Offs
For teams currently on Anthropic Claude and considering a move to GPT-5.6 Terra for the price advantage:
- Your existing explicit-breakpoint prompts port mostly as-is. Both providers use the same model.
- The 30-minute minimum cache life on Terra is longer than Claude's default 5-min TTL but shorter than Claude's extended 1-hour option — for very long sessions, Claude actually has the better contract.
- Cache-hit metadata fields differ slightly. Update your logging.
Bottom Line
Prompt caching is the largest cost lever after model tier selection in 2026. Anthropic and OpenAI GPT-5.6 offer the most aggressive contracts (90% read discount), Gemini offers a smaller automatic discount (25%) with zero configuration. For workloads where you control the prompt structure and care about cost, explicit-breakpoint caching on Claude or Terra/Sol is the right default. Implicit prefix caching on Gemini is fine for high-volume, low-effort setups. The biggest mistake teams make is structuring their prompts for caching but never verifying that hits actually land — log the metadata, watch your hit rate, and treat anything below 50% as a fixable bug.
Frequently Asked Questions
How much can prompt caching actually cut my AI coding bill?
On typical agent sessions where 70%+ of input tokens are cacheable across multiple turns, Anthropic and OpenAI 5.6-family caching saves 55-70% of input cost. Combined with output staying at full price, total session cost typically drops 40-55%. Gemini's implicit caching delivers smaller savings (15-25% total).
Do I need to restructure my prompts for explicit-breakpoint caching?
Yes, but lightly. The pattern is: put your stable content (system prompt, tool definitions, retrieved code context) before the dynamic content (current user message, latest tool output), then place a cache breakpoint right before the dynamic section. Most agent frameworks already structure prompts this way — you may just need to flip the breakpoint flag in the API call.
Which provider has the best caching contract in 2026?
Anthropic, narrowly. The 90% read discount matches OpenAI's GPT-5.6, but Anthropic's optional 1-hour cache TTL covers longer sessions than OpenAI's 30-minute minimum. For agent sessions that frequently exceed 30 minutes, Anthropic wins. For shorter sessions, OpenAI 5.6 is equivalent.
Should I use Gemini's implicit caching or Claude's explicit caching?
If your team can spend an hour restructuring prompts, use Claude's explicit caching — the 90% discount vs Gemini's 25% is too big to leave on the table. If your prompts are constantly evolving or you have no time to restructure, Gemini's automatic caching is a reasonable fallback. The right answer for most production teams is explicit caching.
Will my agent's quality drop if I cache aggressively?
No. Caching is transparent to the model — it just reads the same content from a faster path. Quality, output behavior, and tool-call accuracy are identical between cached and uncached requests. The only risk is misconfigured caching (breakpoints in the wrong place) that silently breaks and makes you pay full price without realizing it. Always log the cache-hit metadata and monitor your hit rate.
Want to calculate exact costs for your project?
Related Articles
GPT-5.6 Terra vs Claude Sonnet 4.6 vs Gemini 3.5 Flash: The New Mid-Tier Coding Cost Math
GPT-5.6 Terra arrives at $2.50/$15 per million tokens — slightly cheaper than Claude Sonnet 4.6 on input, same on output, and meaningfully more expensive than Gemini 3.5 Flash. We work through the actual cost-per-task numbers for a 25K-context bug fix, where each model wins, and which one to make the default after June 27, 2026.
Open-Source AI Coding Agents 2026: MiMo Code vs Claude Code vs Aider Cost Comparison
Compare open-source AI coding agents: MiMo Code (free MIT, uses MiMo-V2.5), Claude Code (Opus 4.8, ~$100-300/mo), and Aider (free, BYO API). Features, SWE-Bench scores, and total cost of ownership.
Gemini macOS App vs Claude Desktop: AI Coding Assistant Cost on Mac in 2026
Compare Gemini macOS app, Claude Desktop, and Cursor for AI coding on Mac. Subscription costs, API pricing, features, and which is cheapest for different developer workflows.