Xiaomi MiMo UltraSpeed: 1T Model at 1000 Tokens/s Changes the Inference Cost Equation
June 9, 2026 · 7 min read
1000 Tokens Per Second on a Trillion Parameters
Xiaomi's MiMo-V2.5-Pro-UltraSpeed just demonstrated something that would have been considered impossible two years ago: generating 1,000 tokens per second from a 1-trillion-parameter model. The technique combines aggressive FP4 quantization (4-bit floating point) with multi-draft speculative decoding, achieving 10x the throughput of standard inference on the same model.
The pricing reflects this: UltraSpeed mode costs 3x the base MiMo-V2.5-Pro rate. The economics are counterintuitive, though. The token bill still rises, but paying 3x per token while generating 10x faster can make each completed task cheaper overall, because every request holds a connection (and a waiting developer) for one-tenth of the time.
The Speed-Cost Tradeoff Math
Traditional inference pricing charges per token regardless of speed. But real-world cost includes more than token fees — there's latency cost (developer waiting time), server occupancy (connection held open), and retry probability (timeouts on slow responses). UltraSpeed changes all three variables simultaneously.
| Metric | Standard MiMo-V2.5-Pro | UltraSpeed Mode | Delta |
|---|---|---|---|
| Speed | ~100 tok/s | 1,000 tok/s | 10x faster |
| Price per token | 1x base | 3x base | 3x more expensive |
| Time for 5K output tokens | ~50 seconds | ~5 seconds | 45s saved |
| Timeout retry rate | ~8% | ~0.5% | 16x fewer retries |
| Effective cost/task (adjusted) | 1x | ~2.5x | Not 3x after retry savings |
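The effective cost multiplier drops from 3x to ~2.5x when you factor in eliminated retries and timeout waste. Retries alone account for only part of that: if every timed-out request is retried until it succeeds, the multiplier falls to about 2.8x (3 × 1.005 / 1.087), so the remaining gap has to come from timeout waste beyond token fees, such as the time and connections burned before a slow request fails. For latency-sensitive applications like interactive coding agents, the calculus shifts further, because developer idle time has a cost too.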
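The retry portion of that adjustment can be reproduced with a few lines of arithmetic. The sketch below is a simplified model, not Xiaomi's pricing formula: it assumes a timed-out attempt is billed in full and retried until it succeeds, and that timeouts are independent. The function names are illustrative.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a response at a steady decode speed."""
    return output_tokens / tokens_per_second


def expected_attempts(timeout_rate: float) -> float:
    """Mean attempts per completed task when every timeout is retried
    until success (geometric distribution). Assumption: independent timeouts."""
    return 1.0 / (1.0 - timeout_rate)


def effective_cost_multiplier(price_multiplier: float,
                              fast_timeout_rate: float,
                              baseline_timeout_rate: float) -> float:
    """Token cost per completed task in fast mode relative to the baseline.
    Assumption: a timed-out attempt is billed like a successful one."""
    fast = price_multiplier * expected_attempts(fast_timeout_rate)
    baseline = 1.0 * expected_attempts(baseline_timeout_rate)
    return fast / baseline


# Figures taken from the table above.
standard_time = generation_seconds(5_000, 100)    # standard mode
ultra_time = generation_seconds(5_000, 1_000)     # UltraSpeed mode
multiplier = effective_cost_multiplier(3.0, 0.005, 0.08)

print(f"standard: {standard_time:.0f}s, UltraSpeed: {ultra_time:.0f}s")
print(f"retry-adjusted cost multiplier: {multiplier:.2f}x")
```

Under these assumptions the retry-adjusted multiplier comes out near 2.8x, so retries explain only part of the move from 3x toward the table's ~2.5x estimate.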
How FP4 + Speculative Decoding Achieves This
Two techniques combine to hit 1,000 tok/s. FP4 quantization compresses the model's weights from 16-bit to 4-bit floating point, cutting the bytes that must be read per weight by 4x. Because token-by-token decoding is typically limited by memory bandwidth, this lets the model fit on fewer GPUs and stream its weights faster on each forward pass. Quality loss is minimal: MiMo reports less than 2% degradation on coding benchmarks.
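To see what the 4x reduction means at this scale, here is a back-of-the-envelope sketch of weight storage. It counts weights only and ignores the per-block scale factors that real FP4 formats carry, as well as the KV cache and activations. The 80 GB accelerator size is a hypothetical used for illustration, not a detail of Xiaomi's deployment.

```python
import math

PARAMS = 1_000_000_000_000          # 1T parameters, as stated in the article
GPU_MEMORY_BYTES = 80 * 10**9       # hypothetical 80 GB accelerator (assumption)


def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Raw storage for the weights alone, ignoring quantization scales."""
    return num_params * bits_per_weight // 8


def min_gpus_for_weights(total_bytes: int, gpu_bytes: int = GPU_MEMORY_BYTES) -> int:
    """Lower bound on devices needed just to hold the weights."""
    return math.ceil(total_bytes / gpu_bytes)


fp16_bytes = weight_bytes(PARAMS, 16)
fp4_bytes = weight_bytes(PARAMS, 4)

print(f"FP16 weights: {fp16_bytes / 1e12:.1f} TB, "
      f"at least {min_gpus_for_weights(fp16_bytes)} GPUs")
print(f"FP4 weights:  {fp4_bytes / 1e12:.1f} TB, "
      f"at least {min_gpus_for_weights(fp4_bytes)} GPUs")
print(f"reduction: {fp16_bytes / fp4_bytes:.0f}x")
```

The same ratio applies to the bytes read per decoded token, which is where the bandwidth saving comes from. If the model is a mixture-of-experts design, only the active experts are read per token, so the absolute numbers would be smaller while the 4x ratio stays the same.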
Multi-draft speculative decoding uses a small draft model to propose several candidate continuations in parallel; the large model then checks the proposed tokens in a single forward pass and keeps the ones it agrees with. Instead of generating one token per forward pass, UltraSpeed verifies 8-12 tokens per pass with a ~85% acceptance rate. Combined, these techniques achieve the 10x throughput improvement.
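A rough way to relate those figures to throughput is the standard estimate for single-chain speculative decoding: if each drafted token is accepted independently with probability `a` and the draft is `k` tokens long, one verification pass yields `1 + a + a^2 + ... + a^k` tokens on average. The sketch below applies that formula. Reading "~85%" as a per-token acceptance probability and "8-12" as the draft length are assumptions, and a multi-draft scheme would likely do somewhat better than this single-chain estimate.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_length: int) -> float:
    """Mean tokens emitted per large-model forward pass for one draft chain.

    Token i of the draft survives only if all earlier tokens were accepted,
    which happens with probability acceptance_rate ** i. The i = 0 term is the
    token the large model always contributes itself.
    """
    return sum(acceptance_rate ** i for i in range(draft_length + 1))


ACCEPTANCE_RATE = 0.85   # article figure, read as per-token probability (assumption)
FP4_SPEEDUP_CEILING = 4  # best case if decoding is purely bandwidth-bound (assumption)

for draft_length in (8, 12):
    tokens = expected_tokens_per_pass(ACCEPTANCE_RATE, draft_length)
    ceiling = FP4_SPEEDUP_CEILING * tokens
    print(f"draft length {draft_length}: {tokens:.2f} tokens/pass, "
          f"combined ceiling about {ceiling:.0f}x")
```

Under these assumptions each pass yields roughly 5 to 6 tokens rather than the full 8-12, and the combined ceiling is a loose upper bound: draft-model cost, verification overhead, and compute limits all eat into it, which is compatible with the reported 10x.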
Comparison with Other Speed-Optimized Offerings
MiMo UltraSpeed isn't the only speed play in the market, but it operates at a different scale:
| Provider | Model Size | Speed | Price Premium |
|---|---|---|---|
| MiMo UltraSpeed | 1T params | 1,000 tok/s | 3x base |
| Gemini 2.5 Flash | ~200B MoE | ~400 tok/s | 1x (native speed) |
| DeepSeek V4 Flash | ~600B MoE | ~300 tok/s | 1x ($0.14/$0.28) |
| Claude Haiku 4.5 | Undisclosed | ~200 tok/s | 1x ($1.00/$5.00) |
When UltraSpeed Makes Economic Sense
The 3x premium is justified in specific scenarios:
- Interactive coding agents where developer wait time costs more than the token premium. A developer earning $80/hour loses about $1.33 per minute of waiting, so 45 seconds saved per response at 20 responses/hour recovers 15 minutes, or $20/hour, in productivity.
- Batch processing with tight deadlines. If you need to process 1,000 code reviews overnight, 10x speed means that, at the same concurrency, you can finish in 1 hour instead of 10, or hit the original deadline with fewer concurrent connections and less infrastructure complexity.
- Agentic loops with many sequential steps. A coding agent that makes 50 sequential calls per task sees each 50-second response shrink to 5 seconds, so the task completes in about 4 minutes instead of 42.
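The arithmetic behind these scenarios is easy to check. The sketch below is a minimal calculator using the figures quoted in the list ($80/hour, 45 seconds saved per response, 20 responses/hour, 50 sequential calls). The function names are illustrative, and it only accounts for waiting time, not token fees.

```python
def wait_cost_per_minute(hourly_rate: float) -> float:
    """Cost of one minute of developer idle time."""
    return hourly_rate / 60


def hourly_productivity_recovered(hourly_rate: float,
                                  seconds_saved_per_response: float,
                                  responses_per_hour: int) -> float:
    """Dollar value of the waiting time removed in one working hour."""
    hours_saved = seconds_saved_per_response * responses_per_hour / 3600
    return hourly_rate * hours_saved


def sequential_task_minutes(calls: int, seconds_per_call: float) -> float:
    """Total model latency for an agent that makes its calls one after another."""
    return calls * seconds_per_call / 60


per_minute = wait_cost_per_minute(80)
recovered = hourly_productivity_recovered(80, 45, 20)
slow_task = sequential_task_minutes(50, 50)
fast_task = sequential_task_minutes(50, 5)

print(f"idle cost: ${per_minute:.2f}/minute")
print(f"recovered: ${recovered:.2f}/hour")
print(f"agent task: {slow_task:.1f} min standard vs {fast_task:.1f} min UltraSpeed")
```

Comparing the recovered dollars per hour with the extra token spend per hour gives a quick break-even test for a given workflow. That extra spend depends on the base per-token rate, which the article does not state.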
The Bigger Picture: Speed as a Cost Lever
MiMo UltraSpeed signals a market shift. Previously, model providers competed on two axes: quality and price-per-token. Speed is emerging as a third axis with its own economic logic. When inference is fast enough, it enables new architectures — you can afford more agentic loops, more verification passes, more speculative attempts — because each one completes in seconds rather than minutes.
For developers budgeting AI coding costs, the takeaway is clear: don't evaluate models on price-per-token alone. A model at 3x the per-token cost but 10x the speed might deliver lower cost-per-completed-task when you factor in developer time, retry elimination, and architectural simplification. Use the AI Cost Estimator to model these tradeoffs for your specific workflow patterns.