FrontierCode Benchmark Shows 87% of AI Code Gets Rejected: What This Means for Your Agent Budget
June 9, 2026
The Merge Rate Reality Check
Cognition's new FrontierCode benchmark asked real open-source maintainers to review AI-generated pull requests — not synthetic evaluations, not automated test suites, but actual humans deciding "would I merge this?" The results are sobering: Claude Opus 4.8, the top performer, achieved only a 13.4% merge rate. GPT-5.5 managed 6.3%. Most models fell below 5%.
In other words, roughly 87% of the pull requests from even the best model available were rejected by experienced maintainers. Not because the code fails to compile or pass tests, but because it falls short of production-codebase standards: style consistency, architectural fit, edge case handling, and maintainability.
The Cost Multiplier Nobody Talks About
Most AI coding cost estimates assume the generated code gets used. But if 87% of attempts are rejected or need rework, your effective cost per accepted feature is the raw token cost divided by the merge rate, which is several times higher than the per-attempt figure suggests.
| Model | Merge Rate | Raw Cost/Attempt | Effective Cost/Merged PR | Multiplier |
|---|---|---|---|---|
| Claude Opus 4.8 | 13.4% | $1.20 | $8.96 | 7.5x |
| GPT-5.5 | 6.3% | $0.90 | $14.29 | 15.9x |
| Claude Sonnet 4.6 | ~8% | $0.55 | $6.88 | 12.5x |
| DeepSeek V4 | ~3% | $0.04 | $1.33 | 33x |
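Raw cost per attempt assumes a typical feature implementation (~30K input tokens, ~8K output tokens). Effective cost is raw cost divided by merge rate: what you actually pay in tokens per successfully merged contribution. Claude Opus 4.8, at $5/$25 per million input/output tokens, is expensive per token but has the lowest multiplier (7.5x) because its higher merge rate wastes the fewest attempts. On token cost alone it is not the cheapest per merged PR, though: by the table's own numbers, DeepSeek V4 ($1.33) and Claude Sonnet 4.6 ($6.88) both come in lower. What the table leaves out is the maintainer time spent reviewing rejected attempts, which grows with the multiplier.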
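The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal illustration, not Cognition's methodology: the function name `effective_cost` is hypothetical, and the inputs are simply the raw cost and merge rate columns copied from the table above.

```python
def effective_cost(raw_cost_per_attempt: float, merge_rate: float) -> tuple[float, float]:
    """Return (token cost per merged PR, rejection multiplier).

    merge_rate is a fraction in (0, 1], e.g. 0.134 for 13.4%.
    """
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    multiplier = 1 / merge_rate  # expected attempts per merged PR
    return raw_cost_per_attempt * multiplier, multiplier


# (raw cost per attempt in USD, merge rate) as listed in the table
MODELS = {
    "Claude Opus 4.8": (1.20, 0.134),
    "GPT-5.5": (0.90, 0.063),
    "Claude Sonnet 4.6": (0.55, 0.08),
    "DeepSeek V4": (0.04, 0.03),
}

for name, (raw, rate) in MODELS.items():
    cost, mult = effective_cost(raw, rate)
    print(f"{name}: ${cost:.2f} per merged PR ({mult:.1f}x)")
```

Swapping in your own per-attempt cost and observed merge rate gives the same two figures for your project.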
Why Benchmark Scores Didn't Predict This
SWE-bench scores suggest these models solve 40-60% of issues. So why is the top merge rate only about 13%? The gap comes from what benchmarks don't measure:
- Code style mismatch. The code works but doesn't match project conventions — wrong abstraction patterns, inconsistent naming, or non-idiomatic solutions that create maintenance burden.
- Scope creep. AI models over-solve problems, touching files they shouldn't, adding unnecessary abstractions, or refactoring adjacent code unprompted.
- Missing context. Without full project history and team conventions, AI generates technically correct but contextually wrong solutions — using deprecated patterns, ignoring established migration paths, or conflicting with in-progress work.
- Test quality. Models often write tests that pass but don't actually validate the right behavior, or skip edge cases that maintainers know from experience are critical.
Strategies to Improve Merge Rate (and Cut Waste)
Teams achieving 30-40% merge rates with AI agents share common patterns:
- Rich context injection. Feed the agent your project's CONTRIBUTING.md, style guide, recent merged PRs as examples, and explicit constraints ("do not modify files outside /src/api/"). This alone can double merge rates.
- Iterative review loops. Instead of one-shot generation, run a review cycle: generate → automated lint/test → AI self-review → human spot-check → regenerate. Each loop costs tokens but dramatically improves acceptance.
- Scope limitation. Constrain the agent to single-file changes or specific functions rather than multi-file features. Smaller, focused PRs have 3-4x higher merge rates than large feature PRs.
- Model selection by task type. Use Opus 4.8 ($5/$25) for complex architectural changes where quality matters most. Use Sonnet 4.6 ($3/$15) for routine implementations. Use Haiku 4.5 ($0.80/$4) for test generation and documentation where merge standards are lower.
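The model-selection idea in the last bullet can be sketched as a small routing table. Everything here is illustrative: the task-type labels, the `pick_model` helper, and the fallback to Sonnet for unknown task types are assumptions rather than part of any vendor API. The prices are the per-million-token figures quoted above. The cost shown is for a single pass; an agent attempt that spans several turns will cost more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    name: str
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Token cost in USD for one pass with the given token counts."""
        return (input_tokens * self.input_per_mtok
                + output_tokens * self.output_per_mtok) / 1_000_000


OPUS = ModelPrice("Claude Opus 4.8", 5.00, 25.00)
SONNET = ModelPrice("Claude Sonnet 4.6", 3.00, 15.00)
HAIKU = ModelPrice("Claude Haiku 4.5", 0.80, 4.00)

# Hypothetical task-type labels mapped to the tiers suggested in the text.
ROUTES = {
    "architecture": OPUS,
    "implementation": SONNET,
    "tests": HAIKU,
    "docs": HAIKU,
}


def pick_model(task_type: str) -> ModelPrice:
    """Pick a model tier; unknown task types fall back to Sonnet (an assumption)."""
    return ROUTES.get(task_type, SONNET)


for task in ("architecture", "implementation", "tests"):
    model = pick_model(task)
    print(f"{task}: {model.name}, ${model.cost(30_000, 8_000):.3f} per single pass")
```

Keeping the routing in one table makes it easy to re-tier a task type once you have your own merge-rate data for it.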
Budget Planning with Rejection Rates
Realistic budgeting must account for the rejection multiplier. If your team targets 10 merged AI-generated PRs per week:
| Scenario | Attempts Needed | Weekly Token Cost |
|---|---|---|
| Opus 4.8 raw (13.4% merge) | ~75 attempts | ~$90 |
| Opus 4.8 + context (30% merge) | ~34 attempts | ~$48 |
| Opus 4.8 + context + review loop (45%) | ~23 attempts (+ review cost) | ~$55 |
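The review loop strategy costs more per attempt (about $2.40 versus $1.40, going by the table's totals) but needs far fewer attempts. Weekly spend lands close to the context-only scenario and well below the raw baseline, and maintainers review roughly a third as many PRs as in the raw scenario.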
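As a rough sketch of how the budget table is built: expected attempts are the target divided by the merge rate, rounded up, and weekly cost is attempts times cost per attempt. The `weekly_budget` helper is hypothetical. The article states only the $1.20 per-attempt cost for the raw scenario; the $1.41 and $2.39 figures below are backed out from the table's weekly totals and should be treated as assumptions. The attempt counts are expectations, not guarantees, so a real week may need more or fewer.

```python
import math


def weekly_budget(target_merged: int, merge_rate: float,
                  cost_per_attempt: float) -> tuple[int, float]:
    """Return (expected attempts, weekly token cost in USD)."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    attempts = math.ceil(target_merged / merge_rate)
    return attempts, attempts * cost_per_attempt


# (merge rate, cost per attempt in USD); the last two costs are
# implied by the table's weekly totals, not stated in the article.
SCENARIOS = {
    "Opus 4.8 raw": (0.134, 1.20),
    "Opus 4.8 + context": (0.30, 1.41),
    "Opus 4.8 + context + review loop": (0.45, 2.39),
}

for label, (rate, cost) in SCENARIOS.items():
    attempts, total = weekly_budget(10, rate, cost)
    print(f"{label}: ~{attempts} attempts, ~${total:.0f}/week")
```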
The Takeaway
FrontierCode exposes the gap between "AI can write code" and "AI can write code that ships." When budgeting for AI coding agents, multiply your raw token cost by 7-8x for a realistic estimate of cost-per-accepted-feature with the top model, and by more for models with lower merge rates. Then invest in context quality and review workflows to bring that multiplier down to 2-3x. The cheapest token is the one that produces code a maintainer actually merges.