Ornith-1.0 Hits SWE-Bench Verified 82.4: What MIT-Licensed Agentic Coding at Frontier Level Costs You in 2026
June 26, 2026 · 10 min read
A New Open-Source Frontier in Coding
DeepReinforce dropped Ornith-1.0 on Hugging Face on June 25-26, 2026. The release is a family of four MIT-licensed models post-trained for agentic coding: 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE, built on top of Gemma 4 and Qwen 3.5 base weights with a self-improving training loop the team calls "self-scaffolding."
The reported benchmark numbers for the family flagship (the 397B-MoE configuration): SWE-Bench Verified 82.4, SWE-Bench Pro 62.2, Terminal-Bench 2.1 77. The 35B-MoE variant — the one most likely to fit on commodity inference hardware — reports SWE-Bench Verified 75.6, SWE-Bench Pro 50.4, and Terminal-Bench 2.1 of 64.2 (Terminus-2 harness) / 62.8 (Claude Code harness).
These numbers put the open-source coding frontier within striking distance of Claude Opus 4.8, GPT-5.5, and the rest of the proprietary tier. The interesting question is no longer "can open weights do agentic coding well?" It's "what does it cost to self-host versus paying API rates?" Let's run the math.
Self-Hosting Cost: 35B-MoE on Cloud GPU
The Ornith model card recommends an 8×80GB GPU node with tensor parallelism for the 35B-MoE variant — concretely, an 8× H100 80GB node or 8× A100 80GB node. Cloud pricing for an 8× H100 instance in mid-2026:
- AWS p5.48xlarge: ~$50-60/hour on-demand
- Lambda Labs 8× H100: ~$22-28/hour
- Together / RunPod / CoreWeave: ~$24-32/hour
At $28/hour, roughly the midpoint of the Lambda Labs and Together / RunPod / CoreWeave quotes (AWS on-demand runs about double), running one inference node continuously costs ~$20,000 per month. At a vLLM-typical aggregate throughput of ~3,000 output tokens/second on the MoE variant, that's ~7.8B output tokens per month if the node runs at 100% utilization.
Practical utilization is rarely 100%. A team running real workloads averages 25-50% utilization (peak hours dense, off-hours sparse). Call it 40% utilization: ~3.1B output tokens per month for $20K = ~$6.45 per million output tokens. That figure covers hardware rental only; it excludes inference engineer time.
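The arithmetic above is easy to rerun with your own quotes. Here is a minimal sketch, assuming a 30-day month and treating the hourly rate, throughput, and utilization as the only inputs; the function name and constants are illustrative, not from the Ornith model card.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, node rented around the clock


def self_host_cost_per_million(hourly_rate_usd, tokens_per_second, utilization):
    """Return (monthly_cost_usd, monthly_output_tokens, usd_per_million_output_tokens)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # The node is billed for every hour, whether or not it is busy.
    monthly_cost = hourly_rate_usd * HOURS_PER_MONTH
    # Only the utilized fraction of the month produces tokens.
    monthly_tokens = tokens_per_second * SECONDS_PER_HOUR * HOURS_PER_MONTH * utilization
    return monthly_cost, monthly_tokens, monthly_cost / (monthly_tokens / 1e6)


cost, tokens, per_million = self_host_cost_per_million(28.0, 3000, 0.40)
print(f"${cost:,.0f}/month, {tokens / 1e9:.2f}B tokens, ${per_million:.2f} per million")
# prints: $20,160/month, 3.11B tokens, $6.48 per million
```

The unrounded result is about $6.48 per million. The $6.45 figure comes from rounding the inputs to $20K and 3.1B tokens first. Utilization is the most sensitive input: at 25% the same node costs roughly $10 per million output tokens.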
API Rate Comparison
For the same SWE-Bench Verified ~75 tier, the relevant proprietary alternatives in mid-2026:
- Claude Sonnet 4.6: $3 input / $15 output per million tokens
- GPT-5.5: $2.50 input / $20 output per million tokens (roughly)
- DeepSeek V4 Pro: $0.20 input / $2 output per million tokens
- Qwen 3.7 Plus: $0.50 input / $4 output per million tokens
Self-hosted Ornith 35B-MoE at ~$6.45 per million output tokens is a bit under half the price of Claude Sonnet output, but more than DeepSeek V4 Pro and Qwen 3.7 Plus. The breakeven against Sonnet is around 1.3B output tokens per month per inference node: the ~$20K fixed monthly cost divided by $15 per million output tokens, ignoring input-token charges. Below that volume, paying API rates is cheaper. Above it, self-hosting wins.
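The breakeven volume is just the fixed node cost divided by the API's output price. Here is a minimal sketch using the prices listed above. It counts output tokens only (input-token spend would pull the breakeven lower), and the 100%-utilization ceiling reuses the assumed ~3,000 tokens/second throughput.

```python
def breakeven_output_tokens(monthly_node_cost_usd, api_usd_per_million_output):
    """Monthly output tokens at which a fixed-cost node matches API output spend."""
    return monthly_node_cost_usd / api_usd_per_million_output * 1e6


# USD per million output tokens, taken from the comparison list in this article.
API_OUTPUT_PRICES = {
    "Claude Sonnet 4.6": 15.0,
    "GPT-5.5": 20.0,
    "Qwen 3.7 Plus": 4.0,
    "DeepSeek V4 Pro": 2.0,
}
NODE_MONTHLY_COST_USD = 20_000
# Ceiling for one node: assumed 3,000 tokens/s, 30-day month, 100% utilization.
NODE_MAX_TOKENS = 3000 * 3600 * 24 * 30

for name, price in API_OUTPUT_PRICES.items():
    tokens = breakeven_output_tokens(NODE_MONTHLY_COST_USD, price)
    note = "within one node's capacity" if tokens <= NODE_MAX_TOKENS else "exceeds one node's capacity"
    print(f"{name}: {tokens / 1e9:.2f}B output tokens/month ({note})")
```

Under these assumptions the Sonnet breakeven lands at about 1.33B tokens. The DeepSeek V4 Pro breakeven (10B) is above what a single node can produce even at full utilization, so on price alone one node cannot beat that API.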
When Self-Hosting Ornith Makes Sense
The cost math says self-hosting wins above ~1.3B output tokens/month. There are at least three other reasons that change the calculus:
Data residency. If your code must stay within a specific jurisdiction or off third-party APIs entirely, self-hosting is the only path. A self-hosted Ornith node keeps code and prompts inside infrastructure you control, which can satisfy even the strictest data residency requirements.
Latency control. A dedicated inference node has no third-party queueing, no provider rate limits, no "model overloaded" errors. The only contention is your own traffic. For interactive coding workflows where seconds matter, this is a real productivity gain on top of cost considerations.
Fine-tuning. Self-hosted models can be fine-tuned on your codebase, internal style guides, or domain-specific patterns. API models generally can't, or only through restricted provider-managed programs. For teams with strong internal conventions, a fine-tuned Ornith may produce more on-style code than any frontier API model.
The Smaller Variants: 9B and 31B Dense
The 9B-Dense and 31B-Dense Ornith variants are more accessible. The 9B fits on a single consumer card with 24GB or more (RTX 4090 at 24GB, RTX 5090 at 32GB) with quantization. The 31B-Dense needs roughly 2× 40GB GPUs or a single 80GB A100/H100. SWE-Bench Verified scores drop accordingly (the 9B is in the mid-50s tier, the 31B-Dense is in the upper-60s tier) but the cost-per-token also drops substantially.
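A quick way to sanity-check these hardware claims is a weight-only memory estimate: parameter count times bytes per parameter. This is a rough sketch, assuming the parameter counts implied by the variant names and ignoring KV cache, activations, and runtime overhead, all of which need headroom on top.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, dtype):
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


# Parameter counts are assumptions read off the variant names (9B, 31B).
for label, params in (("9B-Dense", 9), ("31B-Dense", 31)):
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB" for dtype in BYTES_PER_PARAM
    )
    print(f"{label} -> {sizes}")
```

At bf16 the 9B weights alone come to 18 GB, which leaves little room for KV cache on a 24GB card, so quantization is what makes that fit comfortable. The 31B-Dense at bf16 is about 62 GB, consistent with a single 80GB card or two 40GB cards.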
For individual developers or small teams who don't need frontier benchmark performance, the 9B variant on a single RTX 5090 ($2,000-2,500 hardware investment plus electricity) can plausibly handle a large fraction of routine coding tasks at near-zero marginal cost per token. The hardware pays for itself against API rates somewhere in the 200-500M output tokens range, which a heavy individual user crosses in 12-24 months.
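The payback range follows from dividing the hardware price by the API rate you would otherwise pay. Here is a minimal sketch. The electricity term defaults to zero and the 20M output tokens/month usage figure is a hypothetical stand-in for a "heavy individual user", so treat both as assumptions to replace with your own numbers.

```python
def payback_tokens(hardware_usd, api_usd_per_million_output, electricity_usd_per_million=0.0):
    """Output tokens needed before avoided API spend covers the hardware purchase."""
    saving_per_million = api_usd_per_million_output - electricity_usd_per_million
    if saving_per_million <= 0:
        raise ValueError("local running cost meets or exceeds the API price; no payback")
    return hardware_usd / saving_per_million * 1e6


def payback_months(tokens_needed, tokens_per_month):
    """Months to reach the payback volume at a steady monthly usage."""
    return tokens_needed / tokens_per_month


ASSUMED_TOKENS_PER_MONTH = 20e6  # hypothetical heavy individual usage

for hardware, api_price in ((2000, 4.0), (2500, 12.5)):
    needed = payback_tokens(hardware, api_price)
    months = payback_months(needed, ASSUMED_TOKENS_PER_MONTH)
    print(f"${hardware} card vs ${api_price}/M API: {needed / 1e6:.0f}M tokens, ~{months:.0f} months")
```

With these inputs the two ends of the range come out at 500M tokens (a $2,000 card against a $4/M API) and 200M tokens (a $2,500 card against a $12.50/M API). Against cheaper APIs the payback volume grows quickly.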
The 397B-MoE: Frontier-Tier, Specialized Use Case
The 397B-MoE variant, the one that posts the headline 82.4 / 62.2 / 77 numbers, needs a much larger inference footprint. Practical deployment requires 16-32 H100 GPUs depending on quantization and batch size. Per-deployment cost: $50K-$100K monthly. Volume break-even against Claude Opus 4.8 ($15 input / $75 output), counting output tokens only: roughly 670M-1.3B output tokens per month per deployment ($50K-$100K divided by $75 per million), which is a lot but achievable for organizations running hundreds of agents in parallel.
For most teams, the 35B-MoE variant is the right entry point. The 397B-MoE matters mostly because its existence demonstrates that open weights have caught up to the proprietary frontier on coding tasks — which is a structural shift in the market regardless of whether you personally deploy it.
What This Means for Coding Cost Economics
Three structural implications of the Ornith release:
1. Pricing pressure on the proprietary tier. When open weights deliver Sonnet-tier coding performance at a self-hosted cost below Sonnet's output price, the pricing floor for "good enough" coding APIs drops. Expect Claude Sonnet, GPT-5.5, and other mid-tier APIs to face increased competitive pressure on their output token pricing over the next two quarters.
2. Hybrid routing becomes more attractive. Teams running OpenRouter or LiteLLM routing can now plausibly route easy tasks to Ornith (cheap), reserve Claude/GPT for hard tasks (expensive), and significantly cut their average cost per task. This is the model orchestration pattern Anthropic and OpenAI hate, but it works.
3. Self-hosted as a real option for mid-sized teams. Previously, self-hosting frontier-tier coding models was an enterprise-only conversation. With Ornith 35B-MoE running on a single 8×H100 node, a 50-developer engineering org can plausibly run a private inference deployment that covers most coding workloads.
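The hybrid routing point (item 2) can be made concrete with a blended-cost estimate. This is a minimal sketch that only models the cost side, not the router itself. It assumes the self-hosted node is already running at the 40% utilization used earlier (so ~$6.45 per million applies), that Claude Sonnet 4.6 output pricing is the premium tier, and that the share of "easy" traffic is something you measure yourself.

```python
def blended_cost_per_million(easy_share, cheap_usd_per_million, premium_usd_per_million):
    """Average output cost when easy_share of tokens go to the cheap model."""
    if not 0 <= easy_share <= 1:
        raise ValueError("easy_share must be in [0, 1]")
    return easy_share * cheap_usd_per_million + (1 - easy_share) * premium_usd_per_million


def savings_vs_premium(easy_share, cheap_usd_per_million, premium_usd_per_million):
    """Fractional saving compared with sending every token to the premium model."""
    blended = blended_cost_per_million(easy_share, cheap_usd_per_million, premium_usd_per_million)
    return 1 - blended / premium_usd_per_million


ORNITH_SELF_HOSTED = 6.45  # USD per million output tokens, estimate from this article
SONNET_OUTPUT = 15.0       # USD per million output tokens, from this article

for share in (0.5, 0.7, 0.9):  # hypothetical easy-task shares
    blended = blended_cost_per_million(share, ORNITH_SELF_HOSTED, SONNET_OUTPUT)
    saving = savings_vs_premium(share, ORNITH_SELF_HOSTED, SONNET_OUTPUT)
    print(f"{share:.0%} easy -> ${blended:.2f} per million ({saving:.0%} saving)")
```

Routing 70% of output tokens to the self-hosted model brings the blended rate to about $9 per million, roughly a 40% saving over sending everything to Sonnet. The saving shrinks if misrouted hard tasks have to be retried on the premium model, which this sketch does not model.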
Bottom Line
Ornith-1.0 is the strongest open-source coding family released to date. The 35B-MoE variant is the practical choice for teams with non-trivial usage volume and infrastructure tolerance, breaking even against Claude Sonnet at roughly 1.3B output tokens per month. The 9B variant is the practical choice for individual developers willing to invest in a consumer GPU. The 397B-MoE variant is the demonstration that open weights have reached the proprietary frontier on agentic coding, which is bigger structural news than any single deployment.
Frequently Asked Questions
What is Ornith-1.0?
Ornith-1.0 is an MIT-licensed family of open-source coding models released by DeepReinforce on June 25-26, 2026. The family includes 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE variants, all post-trained for agentic coding on top of Gemma 4 and Qwen 3.5 base weights using a self-scaffolding training loop.
What are Ornith-1.0's benchmark scores?
The 397B-MoE flagship reports SWE-Bench Verified 82.4, SWE-Bench Pro 62.2, and Terminal-Bench 2.1 77. The smaller 35B-MoE variant reports SWE-Bench Verified 75.6, SWE-Bench Pro 50.4, and Terminal-Bench 2.1 62.8-64.2 depending on the harness used. These numbers put open-source coding within striking distance of Claude Opus 4.8 and GPT-5.5.
How much does it cost to self-host Ornith 35B-MoE?
On an 8×H100 80GB cloud node at typical mid-2026 rental rates (~$28/hour, ~$20K/month), with 40% practical utilization, the hardware cost works out to roughly $6.45 per million output tokens, before engineering time. This breaks even against Claude Sonnet 4.6 ($15/M output) around 1.3B output tokens per month per node.
When does self-hosting Ornith beat paying API rates?
Above roughly 1.3B output tokens per month per inference node when compared to Claude Sonnet 4.6. Below that, paying API rates is cheaper because the fixed infrastructure cost doesn't amortize. Other reasons to self-host regardless of cost: data residency requirements, latency control, and the ability to fine-tune the model on your codebase.
Can I run Ornith-1.0 on a single consumer GPU?
The 9B-Dense variant fits on a single consumer card with 24GB or more (RTX 4090 at 24GB, RTX 5090 at 32GB) with quantization. SWE-Bench performance is in the mid-50s tier: not frontier, but sufficient for routine coding tasks. The 31B-Dense and larger variants need multi-GPU or datacenter hardware.