MiniMax M3 Released: Open-Source Model Beats GPT-5.5 on Coding at 1/20 the Inference Cost

By Eric Bush · June 1, 2026 · 6 min read

A New Cost Leader in Open-Source Coding

MiniMax released M3 today — the first open-source model to simultaneously achieve frontier coding performance, million-token context, and native multimodal capabilities. The headline number: 59.0% on SWE-Bench Pro, surpassing GPT-5.5 (57.2%) and Gemini 3.1 Pro (56.8%). But the cost story is equally significant.

M3 introduces MSA (Mixed Sparse Attention), a new architecture that reduces per-token compute cost on long contexts to roughly 1/20 of MiniMax's previous generation. For developers running inference locally or through hosted endpoints, this translates directly into cheaper long-context inference.

Performance vs Cost: The Numbers

Model	SWE-Bench Pro	Max Context	Open Weights	Estimated Cost per 1M Tokens (USD)
MiniMax M3	59.0%	1M tokens	Yes	~$0.50-1.00 (hosted)
GPT-5.5	57.2%	200K tokens	No	$5.00 input / $30.00 output
Claude Opus 4.8	~62%*	200K tokens	No	$5.00 input / $25.00 output
Gemini 3.1 Pro	56.8%	2M tokens	No	$2.00 input / $12.00 output

*Claude Opus 4.8 SWE-Bench Pro score estimated from CursorBench results.

The key takeaway: M3 delivers GPT-5.5-level coding at a fraction of the cost, with the option to self-host and eliminate API bills entirely.

The MSA Architecture: Why It's Cheaper

Traditional transformer attention is quadratic in context length: doubling your context roughly quadruples attention compute. M3's MSA (Mixed Sparse Attention) uses learned sparsity patterns that scale sub-linearly. At 1M tokens of context, M3 processes each new token using only ~5% (about 1/20) of the full attention matrix. The result: each new token in a million-token codebase costs roughly the same attention compute as a token in a 50K-token context on a standard dense architecture.
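The scaling argument above can be checked with a few lines of arithmetic. This is a minimal sketch, not MiniMax's implementation: it assumes causal dense attention scores every earlier token, and it models sparsity as a flat kept fraction of the context. The function names are illustrative only.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by causal dense attention over n_tokens."""
    return n_tokens * (n_tokens + 1) // 2


def keys_per_new_token(context_len: int, kept_fraction: float = 1.0) -> int:
    """Keys one new token attends to if only kept_fraction of the context is kept.

    Assumption: sparsity is modeled as a flat fraction, which is a
    simplification of any learned sparsity pattern.
    """
    return round(context_len * kept_fraction)


if __name__ == "__main__":
    n = 100_000
    # Doubling the context roughly quadruples dense attention work.
    print(dense_attention_pairs(2 * n) / dense_attention_pairs(n))
    # 5% of a 1M-token context is the same key count as a dense 50K context.
    print(keys_per_new_token(1_000_000, 0.05))
    print(keys_per_new_token(50_000))
```

The last two values are equal, which is where the "1M tokens for the price of 50K" comparison comes from under this simplified model.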

This matters enormously for coding agents that need to ingest entire repositories. Where Claude Opus 4.8 or GPT-5.5 would require expensive context management (chunking, summarization, RAG), M3 can directly consume the full codebase in a single pass at minimal cost.

Self-Hosting Economics

With open weights, teams can run M3 on their own hardware. Based on the model's architecture requirements, a single NVIDIA A100 80GB can run M3 at approximately 30 tokens/second for standard contexts. For the full 1M context window, you'll need at least 4x A100s. The break-even point versus API billing depends on usage volume and on how many tokens each call consumes, but teams making more than ~500 API calls per day will likely save money self-hosting.
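A rough way to reason about that break-even point is to compare a day of GPU rental against the API bill for the same traffic. The sketch below is illustrative only: the hourly rate is taken from the $4-6/hour range quoted in the FAQ, the API price is the GPT-5.5 input price from the table, and the 50K tokens per call is an assumption. It also ignores whether the hardware can actually serve that volume.

```python
HOURS_PER_DAY = 24


def daily_gpu_cost(hourly_rate_usd: float) -> float:
    """Cost of renting the GPUs around the clock for one day."""
    return hourly_rate_usd * HOURS_PER_DAY


def break_even_tokens_per_day(hourly_rate_usd: float,
                              api_price_per_million_usd: float) -> float:
    """Daily token volume at which the API bill equals the GPU rental."""
    return daily_gpu_cost(hourly_rate_usd) / api_price_per_million_usd * 1_000_000


def break_even_calls_per_day(hourly_rate_usd: float,
                             api_price_per_million_usd: float,
                             tokens_per_call: int) -> float:
    """Daily call count at break-even, for an assumed average call size."""
    tokens = break_even_tokens_per_day(hourly_rate_usd, api_price_per_million_usd)
    return tokens / tokens_per_call


if __name__ == "__main__":
    # Assumed inputs: $5/hour rental, $5 per 1M input tokens, 50K tokens per call.
    print(break_even_calls_per_day(5.0, 5.0, 50_000))
```

With these assumed inputs the break-even lands at 480 calls per day, in the same range as the ~500 figure. Smaller calls or a cheaper API push the break-even volume up proportionally.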

What This Means for AI Coding Costs

M3 represents a new category: open-source models that genuinely compete with frontier closed models on coding tasks. The cost implications are immediate. Developers using GPT-5.5 for code generation can likely switch to M3 via hosted endpoints at 80-90% lower cost with minimal quality loss. Teams self-hosting can push their marginal per-token cost close to zero, leaving amortized hardware as the only expense.
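The 80-90% range can be reproduced from the table's prices. A minimal sketch, assuming the comparison is against GPT-5.5's input price and M3's estimated hosted range (the function name is illustrative):

```python
def savings_percent(baseline_price: float, alternative_price: float) -> float:
    """Percentage saved by paying alternative_price instead of baseline_price."""
    return round(100 * (1 - alternative_price / baseline_price), 1)


if __name__ == "__main__":
    gpt55_input = 5.00            # USD per 1M input tokens, from the table
    m3_low, m3_high = 0.50, 1.00  # USD per 1M tokens, estimated hosted range
    print(savings_percent(gpt55_input, m3_high))  # lower bound of the range
    print(savings_percent(gpt55_input, m3_low))   # upper bound of the range
```

Measured against GPT-5.5's output price of $30.00 per 1M tokens, the same calculation gives a larger saving, so the 80-90% figure is the conservative end for output-heavy workloads.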

The pressure this puts on closed-model pricing is significant. OpenAI and Anthropic now compete not just with each other, but with free alternatives that match their mid-tier offerings. Expect accelerated price cuts on GPT-4o and Claude Sonnet within weeks.

Frequently Asked Questions

Can MiniMax M3 replace Claude Opus 4.8 for coding tasks?

Not entirely. On SWE-Bench Pro, M3's 59.0% edges out GPT-5.5 (57.2%), which is enough for most standard coding tasks. However, Claude Opus 4.8 (estimated at ~62%) still leads on the most complex agentic workflows. M3 is best positioned as a replacement for GPT-5.5 or Gemini 3.1 Pro rather than the absolute frontier.

How much does it cost to self-host MiniMax M3?

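Running M3 on 4x A100 80GB GPUs costs approximately $4-6/hour on cloud providers. At a single-stream throughput of 30 tokens/second, that is about 108,000 tokens per hour, or roughly $0.04-0.06 per 1K tokens ($37-56 per million). That is more than the hosted rates in the table above, so self-hosting only undercuts hosted APIs when batching many concurrent requests raises aggregate throughput well beyond the single-stream figure.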
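The per-token figure follows from dividing the hourly rental cost by the tokens generated in an hour. The sketch below is a back-of-the-envelope check, assuming a single stream at the quoted 30 tokens/second with no batching; the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600


def cost_per_1k_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per 1K tokens for hardware rented by the hour."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_rate_usd / tokens_per_hour * 1000


def required_tokens_per_second(hourly_rate_usd: float,
                               target_cost_per_1k_usd: float) -> float:
    """Aggregate throughput needed to hit a target cost per 1K tokens."""
    return hourly_rate_usd / target_cost_per_1k_usd * 1000 / SECONDS_PER_HOUR


if __name__ == "__main__":
    for rate in (4.0, 6.0):
        print(rate, round(cost_per_1k_tokens(rate, 30), 4))
    # Throughput needed to match a hosted price of $1.00 per 1M tokens
    # ($0.001 per 1K) at $6/hour.
    print(round(required_tokens_per_second(6.0, 0.001), 1))
```

At 30 tokens/second this gives about $0.037-0.056 per 1K tokens. Batched serving with higher aggregate throughput lowers the figure proportionally, which is why utilization matters more than the hourly rate when comparing against hosted pricing.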

Does MiniMax M3 support function calling and tool use?

Yes. M3 supports native function calling, tool use, and multi-turn agent workflows. Its multimodal capabilities also include image and video input, making it suitable for browser-based coding agents that need screenshot understanding.

Where can I access MiniMax M3 as an API?

M3 is available through MiniMax's own API, and is expected to appear on OpenRouter, Together AI, and other hosted inference platforms within days of launch. Self-hosting is available immediately via the open weights on Hugging Face.
