Cohere North Mini Code: 80% SWE-Bench at 3B Active Parameters

By Eric Bush · June 10, 2026 · 7 min read


Near-Frontier Coding From a Tiny Model

Cohere has released North Mini Code, a 30B-parameter Mixture-of-Experts (MoE) model that activates only 3B parameters per forward pass. The headline number is 80.2% pass@10 on SWE-Bench, a benchmark that measures real-world software engineering capability rather than isolated function generation. The model is released under Apache 2.0, which permits free commercial use with no API dependency.

For context, the current SWE-Bench frontier sits around 85-90% for the largest proprietary models. Achieving 80% with just 3B active parameters means inference costs that are a fraction of what frontier APIs charge — while delivering genuinely useful coding assistance.

Understanding the MoE Advantage for Cost

Mixture-of-Experts models hold many parameters in total (30B in this case) but activate only a subset for each token. North Mini Code routes each token through the most relevant 3B parameters out of the 30B total. The result: quality closer to that of a large model at the per-token compute of a small one.
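The routing idea can be sketched in a few lines of Python. This is a generic top-k gating illustration, not Cohere's actual router: the number of experts, the value of top_k, and the function names are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(token, experts, router_scores, top_k):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_scores, top_k)
    return sum(weight * experts[i](token) for i, weight in gates.items())

if __name__ == "__main__":
    # Four toy "experts" (hypothetical); only two of them run for this token.
    toy_experts = [lambda x, k=k: x * k for k in range(4)]
    print(moe_layer(1.0, toy_experts, [0.1, 2.0, -1.0, 1.5], top_k=2))
```

The unselected experts are never called, which is why compute tracks the active parameter count while every expert's weights still have to be held in memory.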

In practical terms, inference compute scales with active parameters, not total parameters. Running North Mini Code costs roughly the same as running any 3B dense model — but produces output quality comparable to models 5-10x larger. This architectural efficiency is what makes the 80% SWE-Bench score so significant for cost optimization.

Memory requirements are higher than a true 3B model (you still need to load 30B parameters into memory), but inference FLOPs — the main driver of per-token cost on cloud hardware — are dramatically lower.
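A back-of-the-envelope sketch of that split between compute and memory follows. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token and counts weights only (no KV cache or runtime overhead); the function names are illustrative.

```python
TOTAL_PARAMS = 30e9   # parameters that must be resident in memory
ACTIVE_PARAMS = 3e9   # parameters used per token

def flops_per_token(active_params):
    """Rule of thumb: one multiply and one add per active weight per token."""
    return 2 * active_params

def weight_memory_gb(total_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9

if __name__ == "__main__":
    dense = flops_per_token(TOTAL_PARAMS)
    moe = flops_per_token(ACTIVE_PARAMS)
    print(f"compute vs a dense 30B model: {moe / dense:.0%}")
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Compute follows the 3B active figure, while the memory footprint follows the 30B total regardless of how few experts fire per token.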

SWE-Bench: Why This Benchmark Matters

SWE-Bench tests models on real GitHub issues from popular Python repositories. The model must understand the issue description, navigate a full codebase, identify the relevant files, and produce a correct patch. This is radically different from HumanEval (isolated function completion) — it measures the kind of work developers actually do.

80.2% pass@10 means that for 80.2% of tasks, at least one of 10 attempts produces a correct fix. For practical coding assistance, this means most real bugs can be fixed by this model if you allow several retries. The cost of those retries is small when inference is this cheap.
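For readers who want to compute pass@k on their own runs, here is a minimal sketch of the widely used unbiased estimator (from n sampled attempts per task, c of which pass), plus the simpler closed form that holds only if attempts are independent. The independence assumption and the 15% figure in the demo are purely illustrative, not reported results for any model.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_independent(p, k):
    """pass@k if each attempt succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k

if __name__ == "__main__":
    # Hypothetical task: 20 samples, 3 correct.
    print(pass_at_k(n=20, c=3, k=10))
    # Illustrative only: a modest single-attempt rate compounds quickly over 10 tries.
    print(pass_at_k_independent(p=0.15, k=10))
```

The benchmark-level number is the average of the per-task estimate over all tasks.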

Cost Comparison: North Mini Code vs API Models

Let's compare the economics. For API-hosted options, we'll use standard pricing. For North Mini Code, we'll estimate costs based on self-hosting on cloud GPU instances (A10G or L4) and local hardware:

Model	SWE-Bench	Est. Cost per 1M Tokens (input/output, USD)	License	Hosting
Claude Opus 4.8	~88%	$5/$25	Proprietary	API only
Claude Sonnet 4.6	~82%	$3/$15	Proprietary	API only
North Mini Code (cloud)	80.2%	~$0.30/$0.60	Apache 2.0	Self-host
North Mini Code (local)	80.2%	~$0.01/$0.02	Apache 2.0	Local GPU
DeepSeek Coder V3	~75%	$0.50/$1.00	MIT	API or self-host
Claude Haiku 4.5	~65%	$1/$5	Proprietary	API only

The numbers tell a striking story. North Mini Code achieves about 98% of Sonnet's SWE-Bench score (80.2% vs ~82%) at roughly 4-10% of the per-token price when self-hosted on cloud hardware ($0.30/$0.60 vs $3/$15). On local hardware with a capable GPU (RTX 4090 or similar), the per-token cost approaches zero.
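The ratios can be recomputed directly from the table. The sketch below hard-codes the table's figures; the dictionary layout and function names are just one convenient way to organize them.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "North Mini Code (cloud)": (0.30, 0.60),
}

# SWE-Bench scores from the same table (the Sonnet figure is approximate).
SCORES = {"Claude Sonnet 4.6": 82.0, "North Mini Code (cloud)": 80.2}

def price_ratio(cheap, expensive):
    """Return (input ratio, output ratio) of the cheap model's price to the expensive one's."""
    (cheap_in, cheap_out), (exp_in, exp_out) = PRICES[cheap], PRICES[expensive]
    return cheap_in / exp_in, cheap_out / exp_out

def score_ratio(a, b):
    """Score of model a as a fraction of model b's score."""
    return SCORES[a] / SCORES[b]

if __name__ == "__main__":
    in_r, out_r = price_ratio("North Mini Code (cloud)", "Claude Sonnet 4.6")
    print(f"input price ratio: {in_r:.0%}, output price ratio: {out_r:.0%}")
    print(f"score ratio: {score_ratio('North Mini Code (cloud)', 'Claude Sonnet 4.6'):.1%}")
```

Input and output prices give different ratios, so the effective saving depends on how output-heavy the workload is.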

Monthly Cost Scenarios

For a developer processing 200 coding tasks per day (roughly 500K tokens total daily throughput):

Setup	Monthly Cost	Quality Trade-off
100% Sonnet 4.6	~$139	Best overall quality
100% North Mini Code (cloud A10G)	~$45 (instance cost)	Slightly lower on hardest tasks
North Mini Code (local RTX 4090)	~$8 (electricity)	Same as cloud, slower throughput
Hybrid: 70% North Mini + 30% Sonnet	~$55-73	Near-Sonnet quality overall
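The scenario arithmetic can be sketched as follows. Two things here are assumptions rather than figures from the table: the share of tokens that are output (the article does not state it), and the idea that self-hosting cost scales with the share of work routed to it, as it would for an instance billed by hours used.

```python
DAYS_PER_MONTH = 30
DAILY_TOKENS = 500_000  # from the scenario above

def api_monthly_cost(daily_tokens, output_share, in_price, out_price, days=DAYS_PER_MONTH):
    """Monthly API bill in USD; prices are per 1M tokens, output_share is in [0, 1]."""
    monthly_millions = daily_tokens * days / 1e6
    blended_price = (1 - output_share) * in_price + output_share * out_price
    return monthly_millions * blended_price

def hybrid_monthly_cost(api_full_cost, selfhost_full_cost, api_share):
    """Blend two full-workload monthly costs by the share of tasks sent to the API."""
    return api_share * api_full_cost + (1 - api_share) * selfhost_full_cost

if __name__ == "__main__":
    # Bounds for 100% Sonnet at $3/$15: all-input vs all-output traffic.
    print(api_monthly_cost(DAILY_TOKENS, 0.0, 3, 15), api_monthly_cost(DAILY_TOKENS, 1.0, 3, 15))
    # 30% Sonnet (~$139 full) plus 70% cloud self-hosting (~$45 full), from the table.
    print(hybrid_monthly_cost(139, 45, api_share=0.30))
```

Swapping in your own token volume and input/output mix is the quickest way to see whether the hybrid setup pays off for a given workload.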

The Open-Source Coding Model Landscape

North Mini Code enters a crowded field but carves a unique position. Compared to alternatives:

vs DeepSeek Coder V3 (236B): North Mini Code achieves higher SWE-Bench scores with dramatically less compute. DeepSeek Coder V3 requires multiple high-end GPUs for self-hosting; North Mini Code runs on a single GPU.

vs Gemma 4 12B: Gemma fits in less memory (16GB vs ~20GB for North Mini) but scores significantly lower on real-world coding benchmarks. North Mini's MoE architecture provides better quality at a slight memory premium.

vs Qwen 3 Coder 32B: Similar quality tier but North Mini achieves it with 10x fewer active parameters, meaning faster inference and lower per-token compute costs.

Practical Limitations

Before you replace your Sonnet subscription entirely:

Memory requirements: 30B total parameters means roughly 60GB of weights in FP16 (2 bytes per parameter) or roughly 15GB in 4-bit quantization, before KV cache and runtime overhead. A 4-bit build fits on a single 24GB GPU; cards with less VRAM need CPU offloading, which slows inference.

The 80% vs 88% gap matters: For the hardest 20% of coding tasks — complex multi-step reasoning, subtle bug patterns, architecture-level decisions — frontier models still outperform. North Mini Code excels at the "middle 60%" of coding work: clear bug fixes, feature additions, test writing, and refactoring.

No multimodal: Unlike Gemma 4, North Mini Code is text/code only. If your workflow involves screenshotting UIs for implementation or processing diagrams, you'll still need a multimodal model.

The Bigger Picture: Inference Cost Collapse

North Mini Code represents a trend that's reshaping AI coding economics: the gap between open-source and proprietary model quality is closing faster than the pricing gap. When a free Apache 2.0 model hits 80% on SWE-Bench and the best proprietary model is at 88%, the question shifts from "can open-source models code?" to "are the remaining 8 percentage points worth roughly 17-42x the per-token price?" (Opus 4.8 at $5/$25 against cloud self-hosting at ~$0.30/$0.60).

For many teams, the answer is increasingly "no" — at least for the majority of their workload. The economically optimal strategy is moving toward a tiered approach: free/cheap open-source for routine tasks, proprietary frontier models reserved for genuinely hard problems where first-attempt accuracy justifies the premium.
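A tiered setup like this can be sketched as a small routing function. Everything here is hypothetical scaffolding: `cheap_model`, `frontier_model`, and `passes_tests` stand in for whatever model clients and test harness a team already has, and the retry limit of 3 is an arbitrary default.

```python
from typing import Callable, Tuple

def solve_tiered(
    task: str,
    cheap_model: Callable[[str], str],
    frontier_model: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    max_cheap_attempts: int = 3,
) -> Tuple[str, str]:
    """Try the cheap model a few times; escalate only if no attempt passes the tests.

    Returns (tier_used, patch). The frontier patch is returned unverified here;
    a real pipeline would check it as well.
    """
    for _ in range(max_cheap_attempts):
        patch = cheap_model(task)
        if passes_tests(patch):
            return "cheap", patch
    return "frontier", frontier_model(task)

if __name__ == "__main__":
    # Stub models for demonstration only.
    tier, patch = solve_tiered(
        "fix failing date parser",
        cheap_model=lambda t: "patch-from-open-model",
        frontier_model=lambda t: "patch-from-frontier-model",
        passes_tests=lambda p: p.startswith("patch-from-open"),
    )
    print(tier, patch)
```

The design relies on having an automatic check, such as the repository's test suite, to decide when a cheap attempt is good enough; without one, the escalation rule has nothing to key on.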

Bottom Line

Cohere North Mini Code at 80.2% SWE-Bench pass@10 with 3B active parameters is a landmark for AI coding cost optimization. It shows that near-frontier coding quality doesn't require frontier pricing. Self-hosted on a single cloud GPU, it offers a 10-25x per-token cost reduction versus Sonnet-class APIs while handling the majority of real-world coding tasks competently. Combined with a proprietary model for the hardest tasks, it enables monthly AI coding budgets of roughly $55-73 for the workload modeled above without sacrificing much practical capability.
