FrontierCode Benchmark Shows 87% of AI Code Gets Rejected: What This Means for Your Agent Budget

By Eric Bush · June 9, 2026 · 8 min read

The Merge Rate Reality Check

Cognition's new FrontierCode benchmark asked real open-source maintainers to review AI-generated pull requests — not synthetic evaluations, not automated test suites, but actual humans deciding "would I merge this?" The results are sobering: Claude Opus 4.8, the top performer, achieved only a 13.4% merge rate. GPT-5.5 managed 6.3%. Most models fell below 5%.

This means roughly 87% of AI-generated pull requests, even from the best model available, get rejected by experienced maintainers. The code usually compiles and passes tests; it is rejected because it doesn't meet the standards of production codebases: style consistency, architectural fit, edge case handling, and maintainability.

The Cost Multiplier Nobody Talks About

Most AI coding cost estimates assume the generated code gets used. But if 87% of attempts are rejected or need rework, your effective cost per accepted feature is several times higher than raw token costs suggest.

Model	Merge Rate	Raw Cost/Attempt	Effective Cost/Merged PR	Multiplier
Claude Opus 4.8	13.4%	$1.20	$8.96	7.5x
GPT-5.5	6.3%	$0.90	$14.29	15.9x
Claude Sonnet 4.6	~8%	$0.55	$6.88	12.5x
DeepSeek V4	~3%	$0.04	$1.33	33x

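Raw cost per attempt assumes a typical feature implementation (~30K input tokens, ~8K output tokens). The effective cost divides the raw cost by the merge rate to show what you actually pay in tokens per successfully merged contribution. Claude Opus 4.8, at $5/$25 per million input/output tokens, is expensive per token but has the lowest multiplier (7.5x) because its higher merge rate reduces waste. On token cost alone, the table still puts DeepSeek V4 ($1.33) and Claude Sonnet 4.6 ($6.88) below Opus ($8.96) per merged PR; these figures count tokens only, not the maintainer time spent reviewing rejected attempts.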
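The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the benchmark itself: the `ModelStats` class and the function names are hypothetical, and the merge rates and raw costs are copied from the table above. Note that a single pass of 30K input and 8K output tokens at the quoted $5/$25 rates works out to about $0.35, so the $1.20 per-attempt figure is best read as covering several model calls per attempt (an assumption; the split is not itemized here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    merge_rate: float  # fraction of attempts a maintainer would merge
    raw_cost: float    # USD per attempt, copied from the table


def effective_cost(stats: ModelStats) -> float:
    """Token cost per merged PR: raw cost divided by merge rate."""
    return stats.raw_cost / stats.merge_rate


def multiplier(stats: ModelStats) -> float:
    """Average number of attempts paid for per merged PR."""
    return 1.0 / stats.merge_rate


def single_pass_cost(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost of one model call; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


TABLE = [
    ModelStats("Claude Opus 4.8", 0.134, 1.20),
    ModelStats("GPT-5.5", 0.063, 0.90),
    ModelStats("Claude Sonnet 4.6", 0.08, 0.55),  # merge rate is approximate
    ModelStats("DeepSeek V4", 0.03, 0.04),        # merge rate is approximate
]

for row in TABLE:
    print(f"{row.name}: ${effective_cost(row):.2f} per merged PR, "
          f"{multiplier(row):.1f}x")

# One 30K-input / 8K-output call at the quoted Opus 4.8 prices.
print(f"Single pass: ${single_pass_cost(30_000, 8_000, 5.0, 25.0):.2f}")
```

Swapping in your own merge rate and per-attempt cost gives the same two numbers for your project: what a merged PR costs in tokens, and how many attempts you pay for to get it.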

Why Benchmark Scores Didn't Predict This

SWE-bench scores suggest these models solve 40-60% of issues. So why is the merge rate only 13%? The gap comes from what benchmarks don't measure:

Code style mismatch. The code works but doesn't match project conventions — wrong abstraction patterns, inconsistent naming, or non-idiomatic solutions that create maintenance burden.
Scope creep. AI models over-solve problems, touching files they shouldn't, adding unnecessary abstractions, or refactoring adjacent code unprompted.
Missing context. Without full project history and team conventions, AI generates technically correct but contextually wrong solutions — using deprecated patterns, ignoring established migration paths, or conflicting with in-progress work.
Test quality. Models often write tests that pass but don't actually validate the right behavior, or skip edge cases that maintainers know from experience are critical.

Strategies to Improve Merge Rate (and Cut Waste)

Teams achieving 30-40% merge rates with AI agents share common patterns:

Rich context injection. Feed the agent your project's CONTRIBUTING.md, style guide, recent merged PRs as examples, and explicit constraints ("do not modify files outside /src/api/"). This alone can double merge rates.
Iterative review loops. Instead of one-shot generation, run a review cycle: generate → automated lint/test → AI self-review → human spot-check → regenerate. Each loop costs extra tokens but raises acceptance, because style and test failures are caught before a maintainer sees the PR.
Scope limitation. Constrain the agent to single-file changes or specific functions rather than multi-file features. Smaller, focused PRs have 3-4x higher merge rates than large feature PRs.
Model selection by task type. Use Opus 4.8 ($5/$25 per million input/output tokens) for complex architectural changes where quality matters most. Use Sonnet 4.6 ($3/$15) for routine implementations. Use Haiku 4.5 ($0.80/$4) for test generation and documentation, where merge standards are lower.
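Two of the patterns above, scope limitation and model selection by task type, are easy to enforce mechanically before an agent's change ever reaches a reviewer. The sketch below is one possible shape, under stated assumptions: the function names, the task-type labels, and the repo-relative `src/api` root are all hypothetical, and the prices are simply the ones quoted in this article.

```python
from pathlib import PurePosixPath

# Task type -> (model, input price, output price), USD per million tokens.
# Task-type labels are illustrative; prices are those quoted in the article.
MODEL_BY_TASK = {
    "architecture": ("Opus 4.8", 5.00, 25.00),
    "routine": ("Sonnet 4.6", 3.00, 15.00),
    "tests_docs": ("Haiku 4.5", 0.80, 4.00),
}


def pick_model(task_type: str) -> tuple[str, float, float]:
    """Return the model and prices assigned to a task type."""
    return MODEL_BY_TASK[task_type]


def scope_violations(changed_files: list[str],
                     allowed_root: str = "src/api",
                     max_files: int = 1) -> list[str]:
    """List reasons a change set breaks the agent's scope constraints.

    Paths are assumed to be repo-relative and POSIX-style.
    An empty list means the change is within scope.
    """
    root = PurePosixPath(allowed_root)
    problems = []
    if len(changed_files) > max_files:
        problems.append(
            f"{len(changed_files)} files changed, limit is {max_files}")
    for name in changed_files:
        path = PurePosixPath(name)
        # Reject ".." segments outright, then require the root as an ancestor.
        if ".." in path.parts or root not in path.parents:
            problems.append(f"{name} is outside {allowed_root}/")
    return problems


print(pick_model("routine"))
print(scope_violations(["src/api/users.py"]))
print(scope_violations(["src/api/users.py", "README.md"]))
```

A check like this can run right after generation, so an out-of-scope attempt is regenerated before it costs any human review time.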

Budget Planning with Rejection Rates

Realistic budgeting must account for the rejection multiplier. If your team targets 10 merged AI-generated PRs per week:

Scenario	Attempts Needed	Weekly Token Cost
Opus 4.8 raw (13.4% merge)	~75 attempts	~$90
Opus 4.8 + context (30% merge)	~34 attempts	~$48
Opus 4.8 + context + review loop (45%)	~23 attempts (+ review cost)	~$55

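The review loop strategy costs more per attempt (about $2.40 versus about $1.40 with context alone, going by the table) but requires fewer attempts. Weekly spend lands in a similar range, while maintainers review about a third fewer PRs than with context alone and less than a third as many as with raw generation.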
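The attempt counts in the table follow from dividing the weekly target by the merge rate and rounding up. Here is a minimal sketch of that calculation; the function names are hypothetical, and the per-attempt costs it prints for the second and third scenarios are derived from the table's weekly totals rather than measured directly.

```python
import math


def attempts_needed(target_merged: int, merge_rate: float) -> int:
    """Attempts required, on average, to land `target_merged` merged PRs."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    return math.ceil(target_merged / merge_rate)


def weekly_token_cost(target_merged: int, merge_rate: float,
                      cost_per_attempt: float) -> float:
    """Weekly token spend: attempts needed times cost per attempt (USD)."""
    return attempts_needed(target_merged, merge_rate) * cost_per_attempt


# (label, merge rate, weekly token cost in USD), all taken from the table.
SCENARIOS = [
    ("Opus 4.8 raw", 0.134, 90.0),
    ("Opus 4.8 + context", 0.30, 48.0),
    ("Opus 4.8 + context + review loop", 0.45, 55.0),
]

TARGET = 10  # merged PRs per week

for label, rate, weekly in SCENARIOS:
    n = attempts_needed(TARGET, rate)
    print(f"{label}: {n} attempts, about ${weekly / n:.2f} per attempt")
```

Because the result is an average, a real week can need more or fewer attempts than this; treating the figure as a planning baseline rather than a guarantee is the safer reading.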

The Takeaway

FrontierCode exposes the gap between "AI can write code" and "AI can write code that ships." When budgeting for AI coding agents, multiply your raw token cost by 7-8x for a realistic estimate of cost-per-accepted-feature. Then invest in context quality and review workflows to bring that multiplier down to 2-3x. The cheapest token is the one that produces code a maintainer actually merges.


