Xiaomi MiMo UltraSpeed: 1T Model at 1000 Tokens/s Changes the Inference Cost Equation

By Eric Bush · June 9, 2026 · 7 min read


1000 Tokens Per Second on a Trillion Parameters

Xiaomi's MiMo-V2.5-Pro-UltraSpeed has demonstrated something that would have been considered impossible two years ago: generating 1,000 tokens per second from a 1-trillion-parameter model. The technique combines aggressive FP4 quantization (4-bit floating point) with multi-draft speculative decoding, reaching about 10x the throughput of standard inference on the same model.

The pricing reflects this: UltraSpeed mode costs 3x the base MiMo-V2.5-Pro rate. The economics are counterintuitive, though. Paying 3x per token while generating 10x faster can make each completed task cheaper overall once waiting time and retries are counted, because each request occupies server resources for one tenth of the time.

The Speed-Cost Tradeoff Math

Traditional inference pricing charges per token regardless of speed. But real-world cost includes more than token fees — there's latency cost (developer waiting time), server occupancy (connection held open), and retry probability (timeouts on slow responses). UltraSpeed changes all three variables simultaneously.

Metric	Standard MiMo-V2.5-Pro	UltraSpeed Mode	Delta
Speed	~100 tok/s	1,000 tok/s	10x faster
Price per token	1x base	3x base	3x more expensive
Time for 5K output tokens	~50 seconds	~5 seconds	45s saved
Timeout retry rate	~8%	~0.5%	16x fewer retries
Effective cost/task (adjusted)	1x	~2.5x	Not 3x after retry savings

The effective cost multiplier drops from 3x to ~2.5x when you factor in eliminated retries and timeout waste. For latency-sensitive applications like interactive coding agents, the calculus shifts further — developer idle time has a cost too.
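The retry part of that adjustment can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every timed-out attempt is billed in full and that attempts fail independently at the rates in the table; the function names are illustrative, not part of any MiMo API.

```python
def expected_attempts(retry_rate: float) -> float:
    """Expected attempts until one succeeds, if each attempt
    independently times out with probability retry_rate."""
    return 1.0 / (1.0 - retry_rate)


def effective_cost(price_multiplier: float, retry_rate: float) -> float:
    """Expected token spend per completed task, in units of the base price.
    Assumption: a timed-out attempt costs as much as a successful one."""
    return price_multiplier * expected_attempts(retry_rate)


standard = effective_cost(1.0, 0.08)      # about 1.087x base
ultraspeed = effective_cost(3.0, 0.005)   # about 3.015x base
ratio = ultraspeed / standard             # about 2.77

print(f"standard:   {standard:.3f}x base")
print(f"ultraspeed: {ultraspeed:.3f}x base")
print(f"ratio:      {ratio:.2f}x")
```

Under these assumptions, retries alone bring the multiplier from 3x to roughly 2.8x. Getting to the ~2.5x in the table would require the additional timeout waste the article mentions, which this sketch does not model.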

How FP4 + Speculative Decoding Achieves This

Two techniques combine to hit 1,000 tok/s. FP4 quantization compresses the model's weights from 16-bit to 4-bit floating point, cutting the memory bandwidth needed to read them by 4x. Because token generation is largely limited by how fast weights can be read from memory, the smaller weights fit on fewer GPUs and stream faster per parameter. Quality loss is minimal: MiMo reports less than 2% degradation on coding benchmarks.
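The 4x figure follows directly from the bit widths. The sketch below counts weight storage only; it ignores per-block scaling factors that FP4 formats usually carry, as well as the KV cache and activations, so real footprints will be somewhat larger.

```python
PARAMS = 1_000_000_000_000  # 1T parameters, as stated in the article


def weight_bytes(num_params: int, bits_per_param: int) -> float:
    """Raw storage for the weights alone, in bytes."""
    return num_params * bits_per_param / 8


fp16_tb = weight_bytes(PARAMS, 16) / 1e12  # 2.0 TB
fp4_tb = weight_bytes(PARAMS, 4) / 1e12    # 0.5 TB
reduction = fp16_tb / fp4_tb               # 4.0

print(f"FP16 weights: {fp16_tb} TB")
print(f"FP4 weights:  {fp4_tb} TB")
print(f"reduction:    {reduction}x")
```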

Multi-draft speculative decoding uses a small draft model to propose several candidate continuations, which the large model then verifies in a single forward pass. Instead of generating one token per forward pass, UltraSpeed verifies 8-12 tokens per pass with a ~85% acceptance rate. Combined, these techniques achieve the 10x throughput improvement.
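To get a feel for what those numbers imply, the standard single-draft estimate for speculative decoding can be applied. This is a sketch under stated assumptions: the 85% is read as a per-token acceptance probability, acceptances are treated as independent, 8-12 is read as the draft length, and the large model contributes one extra token per pass. Multi-draft schemes should do better than this single-draft estimate, and the article does not say how UltraSpeed's figure is measured.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass for one draft sequence.

    Uses the geometric-series estimate (1 - a^(k+1)) / (1 - a), where a is
    the per-token acceptance probability and k is the draft length.
    """
    a = acceptance_rate
    if a >= 1.0:
        # Every drafted token is accepted, plus one token from the verifier.
        return float(draft_len + 1)
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)


for k in (8, 12):
    tokens = expected_tokens_per_pass(0.85, k)
    print(f"draft length {k}: about {tokens:.2f} tokens per pass")
```

With an 85% acceptance rate this comes out to about 5.1 tokens per pass for a draft length of 8 and about 5.9 for a draft length of 12, so the estimate attributes roughly a 5-6x gain to speculation alone before any draft-model overhead.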

Comparison with Other Speed-Optimized Offerings

MiMo UltraSpeed isn't the only speed play in the market, but it operates at a different scale:

Provider	Model Size	Speed	Price Premium
MiMo UltraSpeed	1T params	1,000 tok/s	3x base
Gemini 2.5 Flash	~200B MoE	~400 tok/s	1x (native speed)
DeepSeek V4 Flash	~600B MoE	~300 tok/s	1x ($0.14/$0.28)
Claude Haiku 4.5	Undisclosed	~200 tok/s	1x ($1.00/$5.00)

When UltraSpeed Makes Economic Sense

The 3x premium is justified in specific scenarios:

Interactive coding agents where developer wait time costs more than the token premium. A developer earning $80/hour loses about $1.33 per minute of waiting, so 45 seconds saved per response at 20 responses/hour recovers 15 minutes, or $20/hour in productivity.
Batch processing with tight deadlines. If you need to process 1,000 code reviews overnight, 10x speed means a job that would take 10 hours can finish in 1, with fewer concurrent connections and less infrastructure complexity.
Agentic loops with many sequential steps. A coding agent that makes 50 sequential calls per task turns 50-second responses into 5-second responses — the task completes in 4 minutes instead of 42 minutes.
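The arithmetic behind the first and third scenarios is simple enough to script for your own numbers. A minimal sketch, using the article's example inputs; the function names are illustrative.

```python
def wait_cost_recovered_per_hour(hourly_rate: float,
                                 seconds_saved_per_response: float,
                                 responses_per_hour: float) -> float:
    """Dollar value of developer waiting time saved in one working hour."""
    hours_saved = seconds_saved_per_response * responses_per_hour / 3600
    return hourly_rate * hours_saved


def sequential_task_minutes(num_calls: int, seconds_per_call: float) -> float:
    """Wall-clock minutes for an agent that makes its calls one after another."""
    return num_calls * seconds_per_call / 60


# $80/hour developer, 45 seconds saved per response, 20 responses per hour
recovered = wait_cost_recovered_per_hour(80, 45, 20)

# 50 sequential calls at 50 seconds each versus 5 seconds each
standard_minutes = sequential_task_minutes(50, 50)
ultraspeed_minutes = sequential_task_minutes(50, 5)

print(f"recovered per hour: ${recovered:.2f}")
print(f"standard task:      {standard_minutes:.1f} min")
print(f"ultraspeed task:    {ultraspeed_minutes:.1f} min")
```

With these inputs the script reports $20.00 recovered per hour, and about 41.7 minutes versus 4.2 minutes for the 50-call task. Whether that recovered time outweighs the 3x token premium depends on how many tokens each response consumes and on the base price, neither of which is given here.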

The Bigger Picture: Speed as a Cost Lever

MiMo UltraSpeed signals a market shift. Previously, model providers competed on two axes: quality and price-per-token. Speed is emerging as a third axis with its own economic logic. When inference is fast enough, it enables new architectures — you can afford more agentic loops, more verification passes, more speculative attempts — because each one completes in seconds rather than minutes.

For developers budgeting AI coding costs, the takeaway is clear: don't evaluate models on price-per-token alone. A model at 3x the per-token cost but 10x the speed might deliver lower cost-per-completed-task when you factor in developer time, retry elimination, and architectural simplification. Use the AI Cost Estimator to model these tradeoffs for your specific workflow patterns.
