MiniMax M3 Released: Open-Source Model Beats GPT-5.5 on Coding at 1/20 the Inference Cost

June 1, 2026 · 6 min read

A New Cost Leader in Open-Source Coding

MiniMax released M3 today — the first open-source model to simultaneously achieve frontier coding performance, million-token context, and native multimodal capabilities. The headline number: 59.0% on SWE-Bench Pro, surpassing GPT-5.5 (57.2%) and Gemini 3.1 Pro (56.8%). But the cost story is equally significant.

M3 introduces MSA (Mixed Sparse Attention), a new architecture that reduces per-token compute cost to roughly 1/20 of MiniMax's previous generation when processing long contexts. For developers running inference locally, this means less hardware time per token; for those using hosted endpoints, it translates directly into cheaper API calls.

Performance vs Cost: The Numbers

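| Model           | SWE-Bench Pro | Max Context | Open Weights | Estimated Cost per 1M Tokens (USD) |
|-----------------|---------------|-------------|--------------|------------------------------------|
| MiniMax M3      | 59.0%         | 1M tokens   | Yes          | ~$0.50-1.00 (hosted)               |
| GPT-5.5         | 57.2%         | 200K tokens | No           | $5.00 input / $30.00 output        |
| Claude Opus 4.8 | ~62%*         | 200K tokens | No           | $5.00 input / $25.00 output        |
| Gemini 3.1 Pro  | 56.8%         | 2M tokens   | No           | $2.00 input / $12.00 output        |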

*Claude Opus 4.8 SWE-Bench Pro score estimated from CursorBench results.

The key takeaway: M3 delivers GPT-5.5-level coding at a fraction of the cost, with the option to self-host and eliminate API bills entirely.
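To see what the per-million-token prices mean for a single request, here is a minimal Python sketch. The prices are copied from the table above; the `Pricing` class, the `request_cost` helper, and the 50K-input / 2K-output request size are illustrative assumptions, and M3's hosted price is a range, so the upper bound is used for both input and output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per million tokens."""
    input_per_m: float
    output_per_m: float


# Prices from the comparison table. M3 is quoted as a single hosted range,
# so its upper bound is applied to both input and output (an assumption).
PRICES = {
    "MiniMax M3 (hosted, upper bound)": Pricing(1.00, 1.00),
    "GPT-5.5": Pricing(5.00, 30.00),
    "Claude Opus 4.8": Pricing(5.00, 25.00),
    "Gemini 3.1 Pro": Pricing(2.00, 12.00),
}


def request_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request with the given token counts."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 50K tokens of code context in, 2K tokens out.
    for name, pricing in PRICES.items():
        print(f"{name}: ${request_cost(pricing, 50_000, 2_000):.4f}")
```

Because output tokens are priced several times higher than input tokens on the closed models, the gap widens for generation-heavy workloads and narrows for read-heavy ones.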

The MSA Architecture: Why It's Cheaper

Traditional transformer attention is quadratic in context length: doubling the context quadruples the attention compute. M3's MSA (Mixed Sparse Attention) uses learned sparsity patterns that scale sub-linearly, so each new token attends to only a subset of the context. At 1M tokens of context, M3 processes each new token using only ~5% of the full attention matrix. The result: the attention cost of processing a million-token codebase is roughly the same as processing 50K tokens on a standard architecture.
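The arithmetic behind those two claims can be checked with a short Python sketch. It is a deliberately simplified model: it counts query-key pairs only, treats the ~5% figure as a fixed kept percentage at 1M tokens, and says nothing about how MSA's learned patterns behave at other context lengths. Both function names are made up for illustration.

```python
def dense_attention_pairs(context_tokens: int) -> int:
    """Query-key pairs scored when every token attends to every token."""
    return context_tokens * context_tokens


def sparse_keys_per_token(context_tokens: int, kept_percent: int) -> int:
    """Keys one new token attends to if only kept_percent of them are kept."""
    return context_tokens * kept_percent // 100


if __name__ == "__main__":
    # Doubling the context quadruples dense attention work.
    ratio = dense_attention_pairs(200_000) / dense_attention_pairs(100_000)
    print(f"Dense cost ratio for 2x context: {ratio:.0f}x")

    # Keeping ~5% of keys at 1M tokens leaves as many keys per new token
    # as a dense model sees at 50K tokens.
    print(f"Keys per new token at 1M, 5% kept: {sparse_keys_per_token(1_000_000, 5):,}")
```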

This matters enormously for coding agents that need to ingest entire repositories. Where Claude Opus 4.8 or GPT-5.5 would require expensive context management (chunking, summarization, RAG), M3 can directly consume the full codebase in a single pass at minimal cost.

Self-Hosting Economics

With open weights, teams can run M3 on their own hardware. Based on the model's architecture requirements, a single NVIDIA A100 80GB can run M3 at approximately 30 tokens/second for standard contexts. For the full 1M context window, you'll need at least 4x A100s. The break-even point versus API billing depends on usage volume and on how many tokens each call consumes, but teams making more than ~500 API calls per day will likely save money self-hosting.
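A rough way to find your own break-even point is to compare a fixed daily GPU bill against what the same calls would cost on a metered API. The Python sketch below assumes a $5/hour rental for 4x A100 (the midpoint of the $4-6/hour range quoted in the FAQ), GPT-5.5 list prices from the table, and a hypothetical call of 20K input and 2K output tokens. It ignores engineering time, idle capacity, and whether the hardware can actually serve that call volume.

```python
def api_cost_per_call(
    input_tokens: int,
    output_tokens: int,
    input_per_m: float,
    output_per_m: float,
) -> float:
    """USD cost of one API call at per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


def break_even_calls_per_day(gpu_hourly_usd: float, cost_per_call_usd: float) -> float:
    """Calls per day at which a 24/7 GPU rental costs the same as the API."""
    return gpu_hourly_usd * 24 / cost_per_call_usd


if __name__ == "__main__":
    # Assumed workload: 20K tokens in, 2K tokens out, billed at GPT-5.5 prices.
    per_call = api_cost_per_call(20_000, 2_000, input_per_m=5.00, output_per_m=30.00)
    calls = break_even_calls_per_day(gpu_hourly_usd=5.00, cost_per_call_usd=per_call)
    print(f"API cost per call: ${per_call:.2f}")
    print(f"Break-even volume: {calls:.0f} calls/day")
```

Larger prompts or longer completions lower the break-even volume; comparing against a cheaper hosted model raises it.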

What This Means for AI Coding Costs

M3 represents a new category: open-source models that genuinely compete with frontier closed models on coding tasks. The cost implications are immediate. Developers using GPT-5.5 for code generation can likely switch to M3 via hosted endpoints at 80-90% lower cost with minimal quality loss. Teams self-hosting pay no per-token fee at all, so their marginal cost per token approaches zero once the hardware is amortized.

The pressure this puts on closed-model pricing is significant. OpenAI and Anthropic now compete not just with each other, but with free alternatives that match their mid-tier offerings. Expect accelerated price cuts on GPT-4o and Claude Sonnet within weeks.

Frequently Asked Questions

Can MiniMax M3 replace Claude Opus 4.8 for coding tasks?

Not entirely. On standard coding tasks, M3's 59.0% SWE-Bench Pro score edges out GPT-5.5 (57.2%) and Gemini 3.1 Pro (56.8%). However, Claude Opus 4.8, with an estimated ~62%, still leads on the most complex agentic workflows. M3 is best positioned as a replacement for GPT-5.5 or Gemini 3.1 Pro rather than the absolute frontier.

How much does it cost to self-host MiniMax M3?

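Running M3 on 4x A100 80GB GPUs costs approximately $4-6/hour on cloud providers. At a single-stream throughput of 30 tokens/second (108,000 tokens/hour), this works out to roughly $0.04-0.06 per 1K tokens. Serving many requests concurrently raises aggregate throughput and lowers that figure, so whether self-hosting undercuts a hosted API depends on how heavily the hardware is used.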
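Converting an hourly GPU price into a per-token price is simple arithmetic, and it is worth rerunning with your own measured throughput because the result is very sensitive to it. The Python sketch below uses the $4-6/hour and 30 tokens/second figures from this answer; the function names and the $0.01 target price are illustrative assumptions.

```python
def cost_per_1k_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per 1K generated tokens for hardware billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1000


def required_tokens_per_second(gpu_hourly_usd: float, target_per_1k_usd: float) -> float:
    """Aggregate throughput needed to reach a target price per 1K tokens."""
    return gpu_hourly_usd / target_per_1k_usd * 1000 / 3600


if __name__ == "__main__":
    for hourly in (4.0, 6.0):
        print(f"${hourly:.0f}/hour at 30 tok/s: "
              f"${cost_per_1k_tokens(hourly, 30.0):.3f} per 1K tokens")

    # Throughput needed (summed across concurrent requests) to reach a
    # hypothetical target of $0.01 per 1K tokens on a $6/hour node.
    print(f"Needed for $0.01 per 1K: {required_tokens_per_second(6.0, 0.01):.0f} tok/s")
```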

Does MiniMax M3 support function calling and tool use?

Yes. M3 supports native function calling, tool use, and multi-turn agent workflows. Its multimodal capabilities also include image and video input, making it suitable for browser-based coding agents that need screenshot understanding.

Where can I access MiniMax M3 as an API?

M3 is available through MiniMax's own API, and is expected to appear on OpenRouter, Together AI, and other hosted inference platforms within days of launch. Self-hosting is available immediately via the open weights on HuggingFace.
