AMD MI355X Beats NVIDIA B200 on DeepSeek Inference Cost: What It Means for API Prices

By Eric Bush · May 29, 2026 · 5 min read


AMD Undercuts NVIDIA on AI Inference Cost

A benchmark published through the SGLang and AMD collaboration has produced a notable result: AMD MI355X hardware serves DeepSeek-R1 at $0.169 per million tokens while sustaining 129 tokens per second per user. That is 5% cheaper than NVIDIA B200 running Dynamo TRT-LLM, and up to 40% cheaper than B200 running SGLang in a specific 48-GPU configuration.

The throughput comparison is equally striking: 24 AMD MI355X GPUs achieve 2,436 tokens per second per GPU, 1.25x the per-GPU throughput of a 48-GPU NVIDIA B200 setup. Fewer AMD chips are needed to serve the same workload, which directly lowers infrastructure cost for cloud providers running DeepSeek models at scale.
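The "fewer chips for the same workload" point can be checked with simple arithmetic. The sketch below is illustrative only: it assumes throughput scales linearly with GPU count, uses the approximate B200 per-GPU figure from the comparison table in this article, and the helper name `gpus_needed` is made up for the example.

```python
import math

# Figures quoted in the article; the B200 value is approximate.
MI355X_TOK_S_PER_GPU = 2436
B200_TOK_S_PER_GPU = 1950
MI355X_GPU_COUNT = 24


def gpus_needed(target_tok_s: float, tok_s_per_gpu: float) -> int:
    """Smallest whole number of GPUs whose combined throughput meets the target.

    Assumes throughput scales linearly with GPU count (a simplification).
    """
    return math.ceil(target_tok_s / tok_s_per_gpu)


# Treat the aggregate throughput of the 24-GPU MI355X deployment as the workload.
workload_tok_s = MI355X_GPU_COUNT * MI355X_TOK_S_PER_GPU

mi355x_gpus = gpus_needed(workload_tok_s, MI355X_TOK_S_PER_GPU)
b200_gpus = gpus_needed(workload_tok_s, B200_TOK_S_PER_GPU)
throughput_ratio = MI355X_TOK_S_PER_GPU / B200_TOK_S_PER_GPU

print(f"Workload: {workload_tok_s:,} tok/s")
print(f"MI355X GPUs needed: {mi355x_gpus}")
print(f"B200 GPUs needed:   {b200_gpus}")
print(f"Per-GPU throughput ratio: {throughput_ratio:.2f}x")
```

Under these assumptions, the same aggregate throughput would take 30 B200 GPUs instead of 24 MI355X GPUs. Real deployments rarely scale perfectly linearly, so treat the result as a rough estimate.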

Why Hardware Competition Matters for API Prices

AI API pricing is not set in a vacuum. Every time you call DeepSeek V4 Flash at $0.14 per million input tokens or Claude Haiku at $0.80 per million, the provider is covering inference hardware costs, power, networking, and margin. When hardware gets cheaper, downward pressure on API prices increases, either through direct cost pass-through or through competition from providers who adopt cheaper hardware first.

NVIDIA has held near-monopoly pricing power on AI accelerators since the transformer era began. AMD's MI355X closing the performance gap, and pulling ahead on specific workloads, introduces a credible alternative that providers can now realistically deploy. That competition benefits developers as end consumers of inference capacity.

DeepSeek Inference Cost Comparison

Hardware Config	Cost per 1M Tokens	GPU Count	Tok/s per GPU
AMD MI355X (SGLang)	$0.169	24	2,436
NVIDIA B200 (Dynamo TRT-LLM)	~$0.178	24	~1,950
NVIDIA B200 (SGLang, 48-GPU)	~$0.237	48	~1,950

These are infrastructure-level costs: what it actually costs a provider to serve DeepSeek-R1 at an interactivity level of 129 tok/s per user. Commercial API prices are set above this floor to cover overhead and margin. But the floor is the ultimate lower bound on price; as it drops, API prices tend to follow over 6-18 months.
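Percentage comparisons like these depend on which figure is used as the baseline, so it helps to compute them both ways. The sketch below works from the table values (the B200 rows are approximate) and also backs out the GPU-hour cost implied by the MI355X row. That last step assumes cost per token is simply GPU time divided by tokens produced, which is a simplification and not necessarily the benchmark's own methodology. All function names are illustrative.

```python
# Cost per 1M tokens in USD, copied from the table above.
# The two B200 values are marked approximate in the article.
COST_PER_M = {
    "MI355X (SGLang)": 0.169,
    "B200 (Dynamo TRT-LLM)": 0.178,
    "B200 (SGLang, 48-GPU)": 0.237,
}
MI355X_TOK_S_PER_GPU = 2436


def savings_vs(baseline_cost: float, candidate_cost: float) -> float:
    """Fraction by which the candidate is cheaper than the baseline."""
    return (baseline_cost - candidate_cost) / baseline_cost


def premium_over(reference_cost: float, other_cost: float) -> float:
    """Fraction by which the other option is more expensive than the reference."""
    return other_cost / reference_cost - 1


def implied_gpu_hour_cost(cost_per_m_tokens: float, tok_s_per_gpu: float) -> float:
    """GPU-hour cost implied if token cost is purely GPU time / tokens produced."""
    tokens_per_hour = tok_s_per_gpu * 3600
    return cost_per_m_tokens * tokens_per_hour / 1_000_000


amd_cost = COST_PER_M["MI355X (SGLang)"]
for name, cost in COST_PER_M.items():
    if name == "MI355X (SGLang)":
        continue
    print(
        f"{name}: MI355X is {savings_vs(cost, amd_cost):.1%} cheaper; "
        f"{name} is {premium_over(amd_cost, cost):.1%} more expensive"
    )

gpu_hour = implied_gpu_hour_cost(amd_cost, MI355X_TOK_S_PER_GPU)
print(f"Implied MI355X cost: ${gpu_hour:.2f} per GPU-hour")
```

With the table's approximate figures, the 48-GPU B200 SGLang configuration comes out about 40% more expensive than the MI355X, which is the same gap as the MI355X being about 29% cheaper. The implied MI355X cost works out to roughly $1.48 per GPU-hour under the stated assumption.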

Which Models Benefit First?

The AMD MI355X benchmark was run specifically on DeepSeek-R1, a large mixture-of-experts model. AMD's advantage is most pronounced on large MoE models due to the memory bandwidth characteristics of the MI355X architecture. This means open-weight frontier models such as DeepSeek V4, Llama, and others with similar architectures are most likely to see price benefits first.

Proprietary models from Anthropic and OpenAI run on the vendors' own infrastructure, so AMD's public benchmark results do not directly reveal their internal inference costs. However, as AMD gains traction with third-party inference providers and cloud platforms, the competitive pressure on NVIDIA chip pricing will eventually reduce costs across the board, including for providers serving Claude and GPT models.

What to Expect for API Pricing in 2026

Hardware cost benchmarks like this AMD result typically take 6-18 months to flow through to retail API prices. Cloud providers need to procure, deploy, and optimize the new hardware before passing savings to customers. But the directional signal is clear: inference infrastructure is getting cheaper faster than API prices are dropping, which means provider margins are expanding even as end-user prices hold steady.

For developers budgeting AI coding costs today, the practical implication is that models currently priced at $0.14-0.50 per million tokens for open-weight inference are likely to get cheaper over the next 12-18 months. Locking into long-term per-token commitments at current prices may not be the best strategy if you have flexibility. Use the AI Cost Estimator to stress-test your budget against different price scenarios and identify how much variance your project can absorb if prices shift.
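A budget stress test of this kind can be as simple as multiplying expected token volume by a few candidate prices. The following is a minimal sketch, not the AI Cost Estimator itself: the monthly token volume, the scenario labels, and the percentage changes are all hypothetical inputs chosen for illustration, and $0.14 per million tokens is used as a single blended starting price.

```python
def monthly_cost(tokens_per_month: float, price_per_m_tokens: float) -> float:
    """Monthly spend in USD for a given token volume and per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_m_tokens


def scenario_prices(base_price: float, changes: dict[str, float]) -> dict[str, float]:
    """Apply fractional price changes (e.g. -0.20 for a 20% drop) to a base price."""
    return {label: base_price * (1 + change) for label, change in changes.items()}


# Hypothetical inputs: adjust to your own project.
TOKENS_PER_MONTH = 500_000_000   # assumed usage, not a figure from the article
BASE_PRICE_PER_M = 0.14          # USD per 1M tokens, low end of the quoted range
PRICE_CHANGES = {
    "prices hold": 0.0,
    "20% drop": -0.20,
    "40% drop": -0.40,
    "20% rise": 0.20,
}

budget = {
    label: monthly_cost(TOKENS_PER_MONTH, price)
    for label, price in scenario_prices(BASE_PRICE_PER_M, PRICE_CHANGES).items()
}

for label, cost in budget.items():
    print(f"{label:<12} ${cost:,.2f} per month")

# Spread between the most and least expensive scenario.
variance_usd = max(budget.values()) - min(budget.values())
print(f"Range across scenarios: ${variance_usd:,.2f} per month")
```

The range across scenarios is the number to compare against a long-term commitment: if a fixed per-token contract locks in the "prices hold" figure, the gap to the lower scenarios is what that commitment could cost you if prices fall.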
