Why Are Input Tokens Cheaper Than Output Tokens? The Economics of LLM Pricing
April 26, 2026 · 5 min read
The Pricing Asymmetry
If you've looked at LLM pricing pages, you've probably noticed a consistent pattern: output tokens cost several times more than input tokens, typically 3-5x and as much as 8x for some models. For GPT-4o, input tokens cost $2.50 per million while output tokens cost $10.00, a 4x difference. For Claude Sonnet 4.6, it's $3.00 vs $15.00 per million, a 5x difference. Even budget models follow this pattern: DeepSeek V3.2 charges $0.26 for input vs $0.42 for output.
This isn't arbitrary pricing. The gap reflects a fundamental difference in how LLMs process input versus how they generate output — and understanding it is key to optimizing your AI coding costs.
Prefill vs Decode: Two Very Different Operations
LLM inference happens in two distinct phases: prefill and decode.
During prefill, the model processes all input tokens at once. Think of it like reading a document — you can scan it quickly because you're absorbing information in parallel. Modern GPUs excel at this kind of parallel processing, making it computationally efficient per token.
During decode, the model generates output tokens one at a time. Each new token depends on all previous tokens, so generation must happen sequentially. Think of it like writing — you can only write one word at a time, and each word depends on everything written before it.
This sequential dependency means output generation can't be parallelized within a single response: the GPU must finish token N before it can start token N+1. Providers batch many concurrent requests to recover some throughput, but per-sequence GPU utilization still drops sharply during decode, making each output token more expensive to produce. You're paying for the GPU time it takes to generate tokens one after another.
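To see the difference in miniature, here's a toy Python sketch. The `fake_forward` function is just a stand-in for one forward pass with fixed latency; nothing about a real model is assumed. The point is purely structural: N input tokens cost one batched pass, while M output tokens cost M sequential passes.

```python
import time

def fake_forward(tokens):
    """Stand-in for one forward pass; a real GPU does this work in parallel."""
    time.sleep(0.001)          # pretend this is one pass of GPU time
    return len(tokens)         # pretend next-token id

prompt = list(range(1_000))    # 1,000 input tokens

# Prefill: the whole prompt in ONE parallel pass.
start = time.perf_counter()
fake_forward(prompt)
prefill_time = time.perf_counter() - start

# Decode: 100 output tokens means 100 sequential passes,
# because each token depends on everything generated before it.
start = time.perf_counter()
generated = []
for _ in range(100):
    generated.append(fake_forward(prompt + generated))
decode_time = time.perf_counter() - start

print(f"prefill: {prefill_time * 1000:.0f} ms for 1,000 tokens")
print(f"decode:  {decode_time * 1000:.0f} ms for 100 tokens")
```

Even in this toy, 100 output tokens take roughly 100x the wall-clock time of the entire 1,000-token prompt.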
The KV Cache: Why Input Gets a Discount
To avoid reprocessing the entire input for every output token, LLMs use a Key-Value (KV) cache. This stores the intermediate computations from the prefill phase, so each new output token only needs to compute its own contribution while referencing the cached input representations.
The KV cache is the reason input tokens are "cheaper":
- Input tokens are processed once in parallel during prefill — very efficient
- The results are cached and reused for every subsequent output token
- Each output token still requires a full forward pass through the model, executed sequentially, even though attention over the input reuses the cache (see the sketch after this list)
- The KV cache itself consumes GPU memory, which is a cost factor for the provider
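Here's a minimal single-head NumPy sketch of the caching idea. The projection matrices and embeddings are random toys, and feeding the attention output back in as the next token's embedding is a deliberate simplification (a real model has many layers, sampling, and embedding lookups), but the cache mechanics are faithful: input keys and values are computed once, and each decode step appends exactly one new row.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension

# Toy projections; a real model learns these per layer and head.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-head attention: one query against all cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

# Prefill: compute K and V for ALL input tokens once, in parallel.
x_input = rng.standard_normal((5, d))            # 5 input token embeddings
K_cache, V_cache = x_input @ W_k, x_input @ W_v

# Decode: each step computes K/V for ONE new token and appends it.
# The five input rows are never recomputed.
x = rng.standard_normal(d)                       # first generated token
for _ in range(3):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, [k]])
    V_cache = np.vstack([V_cache, [v]])
    x = attend(q, K_cache, V_cache)              # simplified next embedding

print("cache rows after 3 decode steps:", K_cache.shape[0])  # 5 + 3 = 8
```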
Anthropic's prompt caching takes this further: if you send the same input prefix across multiple API calls (like re-reading the same codebase), the prefill computation can be skipped entirely, reducing input costs by up to 90%. This is particularly valuable for AI coding agents that re-read the same files repeatedly.
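Here's a hedged sketch of what that looks like with the Anthropic Python SDK. The model id and file path are illustrative placeholders (check the current docs for exact model names and cache lifetimes); the key piece is `cache_control` on the large, stable prefix:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

codebase = open("repo_snapshot.txt").read()  # large prefix reused every call

response = client.messages.create(
    model="claude-sonnet-4-6",          # illustrative id; check current docs
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"You are a coding agent. Codebase:\n{codebase}",
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "Refactor the auth module."}],
)

# On repeat calls, usage.cache_read_input_tokens shows how much prefill
# was served from cache (billed at a steep discount) instead of recomputed.
print(response.usage)
```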
Input vs Output Pricing Across Providers
| Model | Input (per 1M) | Output (per 1M) | Ratio |
|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | 5x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5x |
| GPT-5.4 | $2.50 | $15.00 | 6x |
| GPT-4o | $2.50 | $10.00 | 4x |
| Gemini 2.5 Pro | $1.25 | $10.00 | 8x |
| DeepSeek V3.2 | $0.26 | $0.42 | 1.6x |
| Llama 4 Scout | $0.08 | $0.30 | 3.8x |
Notice that DeepSeek V3.2 has the smallest ratio (1.6x), while Gemini 2.5 Pro has the largest (8x). This means DeepSeek is relatively more cost-efficient for output-heavy workflows, while Gemini penalizes output generation more severely.
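A quick way to see what the ratio means in practice: for a workload that's 20% output tokens by volume, compute what share of the bill goes to output (input price normalized to 1, so only the ratio matters).

```python
def output_share(out_frac, ratio):
    """Share of spend going to output, with input price normalized to 1."""
    in_cost = 1.0 * (1.0 - out_frac)
    out_cost = ratio * out_frac
    return out_cost / (in_cost + out_cost)

print(f"Gemini 2.5 Pro (8x):  {output_share(0.20, 8.0):.0%}")   # 67%
print(f"DeepSeek V3.2 (1.6x): {output_share(0.20, 1.6):.0%}")   # 29%
```

At an 8x ratio, a modest 20% output mix already puts two-thirds of your spend on output tokens; at 1.6x, input still dominates the bill.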
Why This Matters for AI Coding Agents
AI coding agents are particularly affected by this pricing asymmetry because they operate in a unique pattern:
- They read large codebases (high input tokens, relatively cheap per token)
- They generate code modifications (high output tokens, expensive per token)
- They iterate autonomously, multiplying both costs across many turns
- Each turn accumulates more context, so cumulative input usage grows roughly quadratically with turn count (every turn resends the entire conversation so far)
While output tokens are more expensive per unit, the sheer volume of input tokens in AI coding means input costs often dominate the total bill. A 100-turn session on a medium project might consume 10M input tokens and 500K output tokens. Even at a 5x output premium, input costs ($30 for Claude Sonnet) far exceed output costs ($7.50).
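That arithmetic is worth keeping as a one-liner; here's the back-of-the-envelope version using the per-million rates from the table above:

```python
def session_cost(input_tokens, output_tokens, in_price, out_price):
    """Total cost in dollars, given per-1M-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# The 100-turn example above on Claude Sonnet 4.6 ($3.00 in / $15.00 out):
total = session_cost(10_000_000, 500_000, 3.00, 15.00)
print(f"${total:.2f}")  # $37.50 total: $30.00 input + $7.50 output
```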
Strategies to Optimize Your Token Spend
- Enable prompt caching. Anthropic offers up to 90% savings on cached input prefixes. If your agent re-reads the same codebase, this is the single biggest cost reduction available.
- Choose models with favorable ratios. DeepSeek V3.2's 1.6x output premium is much friendlier than Gemini 2.5 Pro's 8x for output-heavy coding tasks.
- Keep sessions short. Start a fresh conversation rather than letting context grow unbounded; a new session resets the input token counter (quantified in the sketch after this list).
- Use cheaper models for exploration. Run initial drafts with budget models, then refine with premium models only for the final output.
- Estimate before you build. Use our AI Cost Estimator to project costs across 44 models for your specific project scope.
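To put a number on the "keep sessions short" advice, here's a rough sketch of cumulative billed input when every turn resends the full history, versus resetting the conversation every 20 turns. The 5K-tokens-per-turn figure is an assumption made up for illustration:

```python
PER_TURN = 5_000  # assumed fresh context (files, diffs, tool output) per turn

def cumulative_input(turns, reset_every=None):
    """Total input tokens billed across a session."""
    total, context = 0, 0
    for t in range(turns):
        if reset_every and t % reset_every == 0:
            context = 0            # fresh conversation
        context += PER_TURN
        total += context           # the whole context is billed each turn
    return total

print(f"{cumulative_input(100):,}")                  # 25,250,000 tokens
print(f"{cumulative_input(100, reset_every=20):,}")  # 5,250,000 tokens
```

Same amount of new material, but roughly a 5x difference in billed input, which is exactly the quadratic growth described above.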
Want to calculate exact costs for your project?
Estimate Your AI Coding Costs →