FrontierCode Benchmark Shows 87% of AI Code Gets Rejected: What This Means for Your Agent Budget

June 9, 2026 · 8 min read

The Merge Rate Reality Check

Cognition's new FrontierCode benchmark asked real open-source maintainers to review AI-generated pull requests — not synthetic evaluations, not automated test suites, but actual humans deciding "would I merge this?" The results are sobering: Claude Opus 4.8, the top performer, achieved only a 13.4% merge rate. GPT-5.5 managed 6.3%. Most models fell below 5%.

This means 87% of AI-generated code — even from the best model available — gets rejected by experienced maintainers. Not because it doesn't compile or pass tests, but because it doesn't meet the standards of production codebases: style consistency, architectural fit, edge case handling, and maintainability.

The Cost Multiplier Nobody Talks About

Most AI coding cost estimates implicitly assume the generated code gets used. But if roughly 87% of attempts are rejected, your effective cost per accepted feature is the raw token cost divided by the merge rate, which even for the best model is about 7.5 times the raw figure.

| Model | Merge Rate | Raw Cost/Attempt | Effective Cost/Merged PR | Multiplier |
| --- | --- | --- | --- | --- |
| Claude Opus 4.8 | 13.4% | $1.20 | $8.96 | 7.5x |
| GPT-5.5 | 6.3% | $0.90 | $14.29 | 15.9x |
| Claude Sonnet 4.6 | ~8% | $0.55 | $6.88 | 12.5x |
| DeepSeek V4 | ~3% | $0.04 | $1.33 | 33x |

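Raw cost per attempt assumes a typical feature implementation (~30K input tokens, ~8K output tokens). A single pass of that size at Opus 4.8 list prices ($5/$25 per million tokens) comes to about $0.35, so the $1.20 figure presumably covers several agent turns per attempt. The effective cost divides raw cost by merge rate to show what you actually pay in tokens per successfully merged contribution. Opus 4.8 is the most expensive per attempt, but its higher merge rate gives it the lowest multiplier. It is not the cheapest per merged PR on these figures: DeepSeek V4 ($1.33) and Claude Sonnet 4.6 ($6.88) both come in lower, because their cheap attempts outweigh their low merge rates. Token cost is not the whole picture, though. Every rejected attempt still consumes reviewer time, which favors the model with the lowest multiplier.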
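The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own method: the function names are hypothetical, and the inputs are simply the raw cost and merge rate columns copied from the table above.

```python
def effective_cost(raw_cost_per_attempt: float, merge_rate: float) -> float:
    """Token cost per merged PR: raw cost divided by the fraction of attempts merged."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    return raw_cost_per_attempt / merge_rate


def rejection_multiplier(merge_rate: float) -> float:
    """How many attempts are paid for, on average, per merged PR."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    return 1 / merge_rate


# (raw cost per attempt in USD, merge rate) as listed in the table above.
TABLE = {
    "Claude Opus 4.8": (1.20, 0.134),
    "GPT-5.5": (0.90, 0.063),
    "Claude Sonnet 4.6": (0.55, 0.08),
    "DeepSeek V4": (0.04, 0.03),
}

for model, (raw, rate) in TABLE.items():
    print(
        f"{model}: ${effective_cost(raw, rate):.2f} per merged PR, "
        f"{rejection_multiplier(rate):.1f}x multiplier"
    )
```

Sorting the same dictionary by `effective_cost` or by `rejection_multiplier` gives two different rankings, which is why it matters whether you are optimizing token spend or reviewer time.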

Why Benchmark Scores Didn't Predict This

SWE-bench scores suggest these models solve 40-60% of issues. So why is the best merge rate only about 13%? The gap comes from what benchmarks don't measure:

  • Code style mismatch. The code works but doesn't match project conventions — wrong abstraction patterns, inconsistent naming, or non-idiomatic solutions that create maintenance burden.
  • Scope creep. AI models over-solve problems, touching files they shouldn't, adding unnecessary abstractions, or refactoring adjacent code unprompted.
  • Missing context. Without full project history and team conventions, AI generates technically correct but contextually wrong solutions — using deprecated patterns, ignoring established migration paths, or conflicting with in-progress work.
  • Test quality. Models often write tests that pass but don't actually validate the right behavior, or skip edge cases that maintainers know from experience are critical.

Strategies to Improve Merge Rate (and Cut Waste)

Teams achieving 30-40% merge rates with AI agents share common patterns:

  • Rich context injection. Feed the agent your project's CONTRIBUTING.md, style guide, recent merged PRs as examples, and explicit constraints ("do not modify files outside /src/api/"). This alone can double merge rates.
  • Iterative review loops. Instead of one-shot generation, run a review cycle: generate → automated lint/test → AI self-review → human spot-check → regenerate. Each loop costs tokens but dramatically improves acceptance.
  • Scope limitation. Constrain the agent to single-file changes or specific functions rather than multi-file features. Smaller, focused PRs have 3-4x higher merge rates than large feature PRs.
  • Model selection by task type. Use Opus 4.8 ($5/$25) for complex architectural changes where quality matters most. Use Sonnet 4.6 ($3/$15) for routine implementations. Use Haiku 4.5 ($1/$5) for test generation and documentation where merge standards are lower.
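As a rough illustration of how scope limitation and an iterative review loop fit together, here is a minimal Python sketch. Everything in it is an assumption made for the example: `Attempt`, `scope_check`, and `review_loop` are hypothetical names, the generator is a stub standing in for a real agent call, and the human spot-check step is left out because it cannot be automated.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Optional


@dataclass
class Attempt:
    """One generated change set (hypothetical shape, for illustration only)."""
    changed_files: list[str]
    diff: str


# A check returns feedback messages; an empty list means the attempt passed.
Check = Callable[[Attempt], list[str]]
# A generator receives the previous round's feedback and returns a new attempt.
Generator = Callable[[list[str]], Attempt]


def scope_check(allowed_dir: str) -> Check:
    """Build a check that flags every changed file outside allowed_dir."""
    root = PurePosixPath(allowed_dir)

    def check(attempt: Attempt) -> list[str]:
        return [
            f"{path} is outside {root}"
            for path in attempt.changed_files
            if root not in PurePosixPath(path).parents
        ]

    return check


def review_loop(
    generate: Generator, checks: list[Check], max_rounds: int = 3
) -> tuple[Optional[Attempt], int]:
    """Regenerate until all checks pass or the round budget is spent."""
    feedback: list[str] = []
    for round_number in range(1, max_rounds + 1):
        attempt = generate(feedback)
        feedback = [msg for check in checks for msg in check(attempt)]
        if not feedback:
            return attempt, round_number  # ready for a human spot-check
    return None, max_rounds  # give up instead of spending more tokens


def stub_generate(feedback: list[str]) -> Attempt:
    """Stand-in for an agent call: over-reaches first, then respects feedback."""
    if not feedback:
        return Attempt(["src/api/users.py", "src/db/schema.py"], "...")
    return Attempt(["src/api/users.py"], "...")


accepted, rounds = review_loop(stub_generate, [scope_check("src/api")])
print(accepted, rounds)
```

In a real setup the `checks` list would also wrap your linter and test runner, and `max_rounds` caps how many tokens one task can consume before a person takes over.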

Budget Planning with Rejection Rates

Realistic budgeting must account for the rejection multiplier. If your team targets 10 merged AI-generated PRs per week:

| Scenario | Attempts Needed | Weekly Token Cost |
| --- | --- | --- |
| Opus 4.8 raw (13.4% merge) | ~75 attempts | ~$90 |
| Opus 4.8 + context (30% merge) | ~34 attempts | ~$48 |
| Opus 4.8 + context + review loop (45%) | ~23 attempts (+ review cost) | ~$55 |

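The review loop strategy costs noticeably more per attempt (about $2.40, versus about $1.40 with context alone and $1.20 raw, as implied by the table's totals) but requires far fewer attempts. It lands at a similar weekly spend to the context-only scenario ($55 versus $48) while leaving roughly a third fewer PRs for humans to review.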
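The budget table follows from two small formulas, sketched below in Python. The function names are hypothetical. The per-attempt costs are not stated in the article; they are back-calculated here from the table's weekly totals, so treat them as implied values.

```python
import math


def attempts_needed(target_merged: int, merge_rate: float) -> int:
    """Attempts required, on average, to land target_merged merged PRs."""
    if not 0 < merge_rate <= 1:
        raise ValueError("merge_rate must be in (0, 1]")
    # Small epsilon guards against floating-point noise pushing an exact
    # quotient (e.g. 20.000000000000004) up to the next integer.
    return math.ceil(target_merged / merge_rate - 1e-9)


def implied_cost_per_attempt(weekly_cost: float, attempts: int) -> float:
    """Back out the per-attempt cost from a weekly total."""
    return weekly_cost / attempts


TARGET_MERGED_PER_WEEK = 10

# (scenario, merge rate, weekly token cost in USD) as listed in the table above.
SCENARIOS = [
    ("Opus 4.8 raw", 0.134, 90.0),
    ("Opus 4.8 + context", 0.30, 48.0),
    ("Opus 4.8 + context + review loop", 0.45, 55.0),
]

for name, merge_rate, weekly_cost in SCENARIOS:
    attempts = attempts_needed(TARGET_MERGED_PER_WEEK, merge_rate)
    per_attempt = implied_cost_per_attempt(weekly_cost, attempts)
    per_merged = weekly_cost / TARGET_MERGED_PER_WEEK
    print(
        f"{name}: {attempts} attempts, ~${per_attempt:.2f}/attempt, "
        f"${per_merged:.2f}/merged PR"
    )
```

To adapt it to your own team, replace the merge rates with figures measured on your repositories and the weekly costs with your actual token bills.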

The Takeaway

FrontierCode exposes the gap between "AI can write code" and "AI can write code that ships." When budgeting for AI coding agents, multiply your raw token cost by 7-8x for a realistic estimate of cost per accepted feature. Then invest in context quality and review workflows: at 30-45% merge rates the number of attempts per merged PR falls to 2-3, although the extra context and review tokens keep the dollar multiplier closer to 4-5x on the budget figures above. The cheapest token is the one that produces code a maintainer actually merges.
