Unsloth's MTP Speculative Decoding Hits 220 Tokens/s on Consumer GPUs — Local AI Inference Gets Dramatically Cheaper
May 14, 2026 · 5 min read
220 Tokens Per Second on Hardware You Already Own
UnslothAI just released MTP (Multi-Token Prediction) speculative decoding support for their GGUF models, and the performance numbers are remarkable. A Qwen3 30B-A3B model runs at 220 tokens per second on a single consumer GPU; a 27B dense model (heavier per token than the MoE's 3B active parameters) achieves 140 tokens/s. Both peak at draft tokens=2, delivering a 1.4x speedup with zero quality loss.
These are not datacenter numbers on A100 clusters. These are consumer GPU numbers, RTX 4090 and RTX 5090 class hardware that developers already have on their desks. When local inference hits 220 tok/s, the economics of self-hosted AI coding change fundamentally.
How MTP Speculative Decoding Works
Traditional autoregressive generation produces one token at a time. Speculative decoding uses a faster "draft" mechanism to predict multiple tokens ahead, then verifies them in parallel with the full model. MTP takes this further by training the model itself to predict multiple tokens simultaneously, eliminating the need for a separate draft model.
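To make the draft-and-verify step concrete, here is a minimal sketch of the greedy case. The `verify` and `draft` callables are toy stand-ins (in a real MTP model both outputs come from the same forward pass, which is why drafting is nearly free); this illustrates the technique, not Unsloth's actual code:

```python
from typing import Callable, List

def speculative_step(
    verify: Callable[[List[int]], int],            # full model: next token (greedy)
    draft: Callable[[List[int], int], List[int]],  # MTP head: k cheap guesses
    ctx: List[int],
    k: int = 2,                                    # Unsloth's reported sweet spot
) -> List[int]:
    """One draft-and-verify step; accepted tokens match standard decoding exactly."""
    guesses = draft(ctx, k)
    out: List[int] = []
    for g in guesses:
        t = verify(ctx + out)      # in practice all k positions verify in one batched pass
        out.append(t)
        if t != g:                 # first mismatch: keep the model's own token and stop
            return out
    out.append(verify(ctx + out))  # all drafts accepted: verification yields a bonus token
    return out

# Toy usage: a "model" whose next token is always last token + 1.
print(speculative_step(
    verify=lambda seq: seq[-1] + 1,
    draft=lambda seq, k: [seq[-1] + 1 + i for i in range(k)],
    ctx=[0],
))  # -> [1, 2, 3]: both drafts accepted, plus the bonus token
```

Because every emitted token is checked against the full model, the output is token-for-token identical to standard greedy decoding; the drafts only decide how many tokens each verification pass can confirm at once.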
The key findings from Unsloth's implementation:
- Draft tokens=2 is optimal (more drafts add overhead without proportional gains; the toy model below shows why)
- No quality degradation: accepted tokens are mathematically identical to standard generation
- 1.4x throughput improvement is consistent across different prompt types
- Works with quantized GGUF models, keeping VRAM requirements manageable
For coding tasks specifically, speculative decoding works exceptionally well because code has higher token predictability (boilerplate patterns, syntax structures, common idioms) than free-form natural language.
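That predictability is what sets the acceptance rate, and the acceptance rate is what makes a short draft optimal. Here is a back-of-the-envelope model; the acceptance rate and per-draft overhead are illustrative assumptions chosen to land near the reported 1.4x, not Unsloth's measurements:

```python
# Toy speedup model: each verification pass yields one guaranteed token
# plus any accepted drafts; every extra draft adds fixed overhead.
# a = per-token acceptance rate, c = relative cost of one draft token --
# both illustrative assumptions, not measured values.
def expected_speedup(k: int, a: float = 0.4, c: float = 0.05) -> float:
    tokens_per_pass = sum(a**i for i in range(k + 1))  # 1 + a + ... + a^k
    cost_per_pass = 1 + c * k                          # one verify pass + k drafts
    return tokens_per_pass / cost_per_pass

for k in range(1, 5):
    print(f"k={k}: {expected_speedup(k):.2f}x")
# k=1: 1.33x, k=2: 1.42x, k=3: 1.41x, k=4: 1.37x -- gains flatten past k=2
```

Higher acceptance rates (as on boilerplate-heavy code) shift the whole curve up, but the flattening past k=2 is structural: each additional draft token only pays off if every earlier draft was also accepted.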
TCO Comparison: Local Inference vs Cloud APIs
Let's calculate the effective cost per million output tokens for a self-hosted setup versus cloud API pricing. Assume a developer runs a 27B model at 140 tok/s on an RTX 4090:
| Cost Factor | Local (RTX 4090) | Notes |
|---|---|---|
| GPU hardware | $1,600 | Amortized over 3 years |
| Monthly electricity (8hr/day) | $13 | 450W x 240h = 108 kWh at $0.12/kWh |
| Monthly amortization | $44 | $1,600 / 36 months |
| Total monthly cost | $57 | Fixed regardless of usage |
| Tokens/month (8hr/day) | ~121M tokens | 140 tok/s x 28,800s/day x 30 days |
| Effective $/M tokens | $0.47 | Output tokens |
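The table's arithmetic, reproduced end to end; every input is one of the stated assumptions ($1,600 GPU, 450W draw, $0.12/kWh, 8 hours a day at 140 tok/s):

```python
# Reproduces the TCO table above from its stated assumptions.
gpu_cost = 1600.00                       # USD, amortized over 36 months
amortization = gpu_cost / 36             # ~$44/month
electricity = 0.450 * 8 * 30 * 0.12      # 450 W, 8 hr/day, $0.12/kWh -> ~$13/month
monthly_total = amortization + electricity

tokens_per_month = 140 * 8 * 3600 * 30   # 140 tok/s, 8 hr/day -> ~121M tokens
cost_per_m = monthly_total / (tokens_per_month / 1e6)
print(f"${monthly_total:.0f}/month, ${cost_per_m:.2f}/M tokens")  # ~$57, ~$0.47
```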
At $0.47 per million output tokens, local inference with MTP is:
- ~32x cheaper than Claude Sonnet 4.6 ($15/M output)
- ~53x cheaper than Claude Opus 4.7 ($25/M output)
- ~64x cheaper than GPT-5.5 ($30/M output)
- In the same ballpark as DeepSeek V4 Flash ($0.28/M output), but with complete privacy
Who Benefits Most from Local MTP Inference
Not everyone should rush to self-host. The economics favor specific profiles:
High-volume individual developers who generate 10M+ tokens/month on repetitive coding tasks (boilerplate, tests, documentation). At that volume, the $57/month fixed cost is far cheaper than any cloud API except DeepSeek.
Teams with privacy requirements who cannot send proprietary code to external APIs. Local inference eliminates data egress concerns entirely, and at 140-220 tok/s, it is fast enough for interactive coding.
Developers in regions with expensive or slow API access, where round trips to US/EU endpoints add 200-500ms per request. Local inference keeps latency overhead under 10ms.
Who should stick with cloud APIs: developers who need frontier reasoning (Opus 4.7 / GPT-5.5 class), those with variable or low usage (under 5M tokens/month), and teams that need the reliability of managed infrastructure.
The Quality-Cost Frontier Is Shifting
A 27B parameter model in 2026 is not the 27B of 2024. Modern training techniques, better data curation, and architectural improvements mean today's 27B models approach the coding quality of 70B models from 18 months ago. With MTP speculative decoding making them fast enough for interactive use, the quality you get at $0.47/M tokens is genuinely useful for everyday coding work.
You will not replace Claude Opus for complex architectural decisions or multi-thousand-line refactors. But for the 70-80% of coding tasks that are straightforward, a 27B model at 140 tok/s locally is fast, good enough, and effectively free after hardware payoff.
Calculate Your Break-Even Point
The decision between local and cloud depends on your monthly token volume, quality requirements, and whether you already have suitable hardware. A developer spending $150/month on Claude Sonnet API calls would break even on a dedicated GPU in roughly a year ($1,600 of hardware against ~$137/month in net savings after electricity), then run at near-zero marginal cost indefinitely.
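A minimal break-even sketch; the defaults mirror the numbers above, so swap in your own cloud spend and hardware price:

```python
def breakeven_months(hardware_cost: float,
                     monthly_cloud_spend: float,
                     monthly_electricity: float = 13.0) -> float:
    """Months until a local GPU pays for itself versus cloud API spend."""
    net_savings = monthly_cloud_spend - monthly_electricity
    if net_savings <= 0:
        return float("inf")  # usage too low: local never pays off
    return hardware_cost / net_savings

# $1,600 GPU vs $150/month of Claude Sonnet API calls:
print(f"{breakeven_months(1600, 150):.1f} months")  # ~11.7 months
```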
Use our AI Cost Estimator to calculate your current cloud API spending and determine whether local inference with MTP speculative decoding would save you money at your specific usage volume.
Want to calculate exact costs for your project?
Estimate Your AI Coding Costs →