
API vs Open Weights: Cost Breakdown

April 15, 2026 · 6 min read

Two Ways to Run an LLM

When you use an LLM in your application, you have two fundamentally different options: pay per token via an API, or run an open-weight model on your own GPU. The first is convenient. The second can be cheaper at scale. But "at scale" is doing a lot of work in that sentence.

Let's break down the real costs of each approach so you can make an informed decision.

The API Route: Pay Per Token

With API-hosted models, you pay for exactly what you use — input tokens and output tokens. No setup, no infrastructure, no maintenance. You send a request, you get a response, you get a bill.

| Provider | Model | Input (per 1M) | Output (per 1M) |
|---|---|---|---|
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 |
| OpenAI | GPT-5.4 | $2.50 | $15.00 |
| OpenAI | GPT-4.1 | $2.00 | $8.00 |
| OpenAI | GPT-4.1 nano | $0.10 | $0.40 |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 |
| Google | Gemini 2.5 Flash | $0.30 | $2.50 |
| DeepSeek | DeepSeek V3.2 | $0.26 | $0.42 |

The API route is simple and predictable. Your cost scales linearly with usage. If you process 10M input tokens with Claude Sonnet, you pay $30. If you process 100M, you pay $300. No surprises.
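
To make the per-token math concrete, here is a minimal cost-calculator sketch in Python. The pricing dictionary is hand-copied from the table above and the keys are shorthand labels rather than official API model identifiers, so treat the numbers as illustrative:

```python
# Rough API cost estimator. Prices are USD per 1M tokens, copied from the
# table above; they will drift as providers update their rates.
API_PRICING = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gemini-2.5-flash": {"input": 0.30, "output": 2.50},
    "deepseek-v3.2": {"input": 0.26, "output": 0.42},
}

def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given token volume on a given model."""
    price = API_PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# 10M input tokens plus 2M output tokens on Claude Sonnet 4.6:
print(f"${api_cost('claude-sonnet-4.6', 10_000_000, 2_000_000):.2f}")  # $60.00
```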

The Open-Weight Route: Pay for GPU

Open-weight models like Llama 4, DeepSeek, and Qwen are free to download but expensive to run. You're paying for GPU compute time, regardless of whether you're actively processing tokens or the GPU is sitting idle.

GPU rental costs vary by provider and GPU type:

| GPU | Hourly Cost | Good For | Throughput* |
|---|---|---|---|
| A100 (80GB) | $1.50-$2.50/hr | 8B-70B models | ~800-1,200 tok/sec |
| H100 (80GB) | $3.00-$4.00/hr | 70B+ models, faster inference | ~1,500-2,500 tok/sec |
| 2x A100 | $3.00-$5.00/hr | 70B models (comfortable fit) | ~1,000-1,500 tok/sec |
| 4x A100 | $6.00-$10.00/hr | Larger models (405B+) | ~800-1,200 tok/sec |

*Throughput figures are approximate for a dense 70B model on the given GPU configuration. The 4x A100 row is sized for 405B+ models; throughput for those models will be lower. MoE models like Llama 4 Scout (109B total, 17B active) may have different throughput characteristics.

The critical difference: with APIs, you pay per token. With GPUs, you pay per hour, whether you use the full capacity or not. A GPU that can process 1,000 tokens/second but is only serving 100 tokens/second is 90% wasted — and you're still paying full price.
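
A useful way to compare the two pricing models is to convert the hourly GPU rate into an effective cost per million tokens at a given utilization. Here's a minimal sketch, using the rental and throughput figures above as assumptions:

```python
def gpu_cost_per_million(hourly_rate: float, tokens_per_sec: float,
                         utilization: float = 1.0) -> float:
    """Effective USD per 1M tokens on a rented GPU.

    hourly_rate    -- rental price in USD/hr
    tokens_per_sec -- sustained throughput of your serving stack
    utilization    -- fraction of capacity actually used (0 to 1)
    """
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000

# 2x A100 at $4/hr serving ~1,000 tok/sec:
print(gpu_cost_per_million(4.00, 1000))        # ~$1.11 per 1M at full utilization
print(gpu_cost_per_million(4.00, 1000, 0.10))  # ~$11.11 per 1M at 10% utilization
```

Note how quickly the effective rate climbs as utilization drops; that 10x gap at 10% utilization is the underutilization problem in numbers.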

Break-Even Analysis: When Does Self-Hosting Save Money?

Let's do the math. Say you're considering running Llama 4 Scout on 2x A100 GPUs ($4/hr) instead of using DeepSeek V3.2 via API ($0.26/1M input, $0.42/1M output). At what token volume does self-hosting become cheaper?

With 2x A100 GPUs running Llama 4 Scout at ~1,000 tokens/second, you can process roughly 3.6M tokens per hour. At $4/hr, that's about $1.11 per 1M tokens (blended rate). DeepSeek V3.2 costs $0.26-$0.42 per 1M tokens depending on input vs output.

In this comparison, the API is cheaper. DeepSeek's per-token pricing is below your GPU cost per token even at full utilization. The API route wins here because DeepSeek is extraordinarily cheap.

But what about comparing against more expensive APIs? Let's compare self-hosting against Claude Sonnet 4.6:

| Scenario | API Cost (Sonnet) | GPU Cost (2x A100) | Winner |
|---|---|---|---|
| 10M tokens/day | $30-150/day | $96/day | Depends on output ratio |
| 50M tokens/day | $150-750/day | $96/day | GPU wins |
| 100M tokens/day | $300-1,500/day | $96/day | GPU wins by a lot |

Against premium APIs like Claude Sonnet, self-hosting wins once you're processing roughly 30M+ tokens per day with a decent output-to-input ratio. One caveat: at ~1,000 tokens/second, 2x A100 tops out around 86M tokens per day, so the 100M row assumes the upper end of the throughput range (or a second instance, which doubles the GPU cost). Against budget APIs like DeepSeek or Gemini Flash, the API is almost always cheaper unless you're at extreme scale.
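
To run the break-even check for your own workload, compare the flat daily GPU cost against the API bill at the same volume. Here's a rough sketch, assuming a fixed output share (swap in your own prices and ratios):

```python
def daily_api_cost(tokens_per_day: float, output_share: float,
                   input_price: float, output_price: float) -> float:
    """USD per day for an API, given total daily tokens and the share that is output."""
    output_tokens = tokens_per_day * output_share
    input_tokens = tokens_per_day - output_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

GPU_DAILY = 4.00 * 24  # 2x A100 around the clock = $96/day

# Claude Sonnet 4.6 prices, assuming 20% of tokens are output:
for volume in (10e6, 30e6, 50e6, 100e6):
    api = daily_api_cost(volume, 0.20, 3.00, 15.00)
    winner = "GPU" if GPU_DAILY < api else "API"
    print(f"{volume / 1e6:>5.0f}M tokens/day: API ${api:,.0f} vs GPU ${GPU_DAILY:.0f} -> {winner}")
```

With a 20% output share the crossover lands under 20M tokens per day; with mostly input traffic it moves up toward the 30M figure quoted above.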

The Hidden Costs of Self-Hosting

The GPU rental cost is just the beginning. Self-hosting open-weight models comes with several costs that don't show up on the invoice:

  • Setup time — Getting a model running properly takes hours to days. You need to set up inference servers (vLLM, TGI, or similar), configure GPU drivers, handle model weights, and optimize for your workload. This is engineering time that could be spent building your product.
  • Maintenance — GPUs fail. Drivers need updating. Model weights get updated. Your inference server needs patching. All of this is time you spend instead of building features.
  • Downtime — When your GPU instance goes down, your product goes down. API providers have 99.9%+ uptime with global redundancy. Your single GPU instance does not.
  • Underutilization — If your traffic isn't consistent 24/7, your GPU sits idle during low-traffic periods. You're paying for capacity you're not using. API pricing naturally handles traffic spikes and lulls.
  • Model quality gap — The best open-weight models are good, but they generally don't match Claude Opus or GPT-5.4 on complex reasoning tasks. You're trading some capability for cost savings. Note also that Llama 4 Scout uses a Mixture-of-Experts architecture (109B total parameters, 17B active), so its GPU throughput may differ from a dense 70B model; adjust your break-even calculations accordingly.

Which Route Should You Choose?

  • Use APIs if: You're processing less than 30M tokens/day, you value simplicity, or you need the best model quality. This covers 95% of developers and startups.
  • Self-host if: You're processing 50M+ tokens/day with premium APIs, you have DevOps capacity, and you can tolerate some downtime. The savings at high volume are substantial.
  • Hybrid approach: Use cheap APIs (DeepSeek, Gemini Flash) for high-volume routine tasks, and premium APIs for complex reasoning. This gets you most of the cost savings without the infrastructure burden; a simple routing sketch follows below.
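
A hybrid setup can be as simple as a routing function that assigns each request to a model tier. A hypothetical sketch; the tier labels and the difficulty heuristic are placeholders for whatever classification fits your workload:

```python
# Hypothetical model router: cheap API for routine work, premium API for hard reasoning.
# The model names are shorthand labels, not official API identifiers.
CHEAP_MODEL = "deepseek-v3.2"        # ~$0.26 / $0.42 per 1M tokens
PREMIUM_MODEL = "claude-sonnet-4.6"  # ~$3.00 / $15.00 per 1M tokens

ROUTINE_TASKS = {"summarize", "classify", "extract", "autocomplete"}

def pick_model(task_type: str, estimated_difficulty: float) -> str:
    """Route a request to a model tier based on task type and a 0-1 difficulty score."""
    if task_type in ROUTINE_TASKS and estimated_difficulty < 0.7:
        return CHEAP_MODEL
    return PREMIUM_MODEL

print(pick_model("classify", 0.2))   # deepseek-v3.2
print(pick_model("refactor", 0.9))   # claude-sonnet-4.6
```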

The right answer depends entirely on your volume, your engineering resources, and your tolerance for operational complexity. Run the numbers for your specific use case — and remember to account for the hidden costs, not just the hourly GPU rate.
