A 14-Point Benchmark Drop Quietly Costs Your Team $760/Month — Here's the Math

June 28, 2026 · 9 min read

The Translation Most Teams Skip

When Cursor published its SWE-Bench Pro audit in June 2026, the engineering takeaway was clear: Opus 4.8 Max drops from 87.1% to 73.0% under isolated conditions, and Composer 2.5 drops 20.7 points. The procurement takeaway was less clear. Most teams looked at the numbers, nodded, and moved on without doing the math.

The math is straightforward and the answer is uncomfortable. A 14-point benchmark drop is not a vendor inconvenience. It is a recurring monthly tax on every AI coding workload you run.

The Mechanism: Attempts Per Resolved Task

Frontier coding agents don't pass or fail in one shot. They iterate. If each attempt succeeds independently with probability p, the expected number of attempts per resolved task is 1/p: about 1.15 at an 87% one-shot resolution rate and about 1.37 at 73%. That difference, 0.22 extra attempts per task, is the entire cost gap.
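
The attempt counts above can be reproduced in a few lines. This is a minimal sketch that assumes every attempt is an independent try with the same success probability, which is a simplification; the function name is illustrative.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of attempts until the first success, assuming each
    attempt succeeds independently with probability `success_rate`."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


marketed = expected_attempts(0.87)   # about 1.15
audited = expected_attempts(0.73)    # about 1.37
extra_attempts = audited - marketed  # about 0.22
print(f"{marketed:.2f} {audited:.2f} {extra_attempts:.2f}")
```

Real retries are rarely independent (a task that fails once is more likely to fail again), so treat 0.22 as a lower-bound style estimate rather than a measurement.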

Per-attempt cost on a typical coding task: 30K input tokens (file context + system prompt + chat) and 4K output tokens. At Opus 4.8 pricing ($5/M input, $25/M output):

  • Per-attempt: 30K × $5/M + 4K × $25/M = $0.15 + $0.10 = $0.25
  • Extra attempts per task at 14-point drop: 0.22
  • Extra cost per task: 0.22 × $0.25 ≈ $0.055
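
The same arithmetic as a small sketch, so you can swap in your own token counts. The prices and token counts come from the text; the function name is illustrative, and retries are assumed independent.

```python
INPUT_PRICE_PER_M = 5.00    # USD per million input tokens (from the text)
OUTPUT_PRICE_PER_M = 25.00  # USD per million output tokens (from the text)


def attempt_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single attempt at the prices above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


per_attempt = attempt_cost(30_000, 4_000)           # 0.25
extra_attempts = 1 / 0.73 - 1 / 0.87                # about 0.22
extra_cost_per_task = per_attempt * extra_attempts  # about 0.055
print(f"${per_attempt:.2f} per attempt, ${extra_cost_per_task:.3f} extra per task")
```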

Team-Level Numbers

Now scale this to realistic team workloads. Assume an engineer runs 20 substantial coding tasks through the agent per working day (this matches Cursor's own usage metrics for active developers) and works 22 days per month.

| Team Size | Tasks/Month | Extra Cost/Month @ 14-pt Drop | Annualized |
| --- | --- | --- | --- |
| 5 engineers | 2,200 | $121 | $1,452 |
| 10 engineers | 4,400 | $242 | $2,904 |
| 20 engineers | 8,800 | $484 | $5,808 |
| 50 engineers | 22,000 | $1,210 | $14,520 |

And these are the conservative numbers. If your workload skews toward harder tasks (where the model's marketed-vs-real gap widens), or you use larger context windows, or you run more tasks per day, multiply accordingly.
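
To multiply accordingly for your own team, a sketch like the following reproduces the table rows. It assumes 22 working days per month (the figure implied by 2,200 tasks for 5 engineers at 20 tasks per day) and the $0.055 per-task gap from the previous section; the function name is illustrative.

```python
WORKING_DAYS_PER_MONTH = 22   # assumption implied by the table's task counts
EXTRA_COST_PER_TASK = 0.055   # USD, from the per-task calculation in the text


def monthly_gap_cost(engineers: int, tasks_per_day: int = 20) -> tuple[int, float]:
    """Return (tasks per month, extra monthly cost in USD) for a team."""
    tasks = engineers * tasks_per_day * WORKING_DAYS_PER_MONTH
    return tasks, tasks * EXTRA_COST_PER_TASK


for team_size in (5, 10, 20, 50):
    tasks, monthly = monthly_gap_cost(team_size)
    print(f"{team_size:>2} engineers: {tasks:>6,} tasks, "
          f"${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
```

Change `tasks_per_day` or `EXTRA_COST_PER_TASK` to model heavier usage or larger contexts.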

Where The $760/Month Headline Number Comes From

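A 14-engineer team running an average of 25 daily tasks per engineer handles about 7,700 tasks a month. At the $0.055 per-task gap from the 30K-token baseline, that is roughly $424/month. Reaching roughly $760/month in benchmark-gap tax requires a per-task gap near $0.099, which corresponds to a per-attempt cost of about $0.45 (for example, around 70K input tokens with the same 4K output). That is the "every meeting room you walked past" cost. It's not catastrophic alone. It compounds over a year, across multiple models in your stack, and into adjacent decision biases (over-budgeting for capability that doesn't exist, under-budgeting for retries).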

Three Cost-Recovery Strategies

You cannot patch the benchmark gap. You can offset its impact.

1. Right-size the model. If you priced your annual contract on 87% capability, renegotiate to 73%. A 16% relative drop in resolution rate (87% to 73%) deserves a 10-20% price reset. Vendors will engage.

2. Route low-stakes tasks to cheaper models. A 73%-vs-87% gap matters less when the alternative is a 65%-capable mid-tier model at 1/10 the price. The price-quality ratio shifts dramatically in favor of mid-tier when frontier is honest about its real numbers.
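
To see why routing can pay off, compare cost per resolved task rather than cost per attempt. The sketch below assumes independent retries (expected attempts of 1/p), identical token counts for both models, and a mid-tier per-attempt price of one tenth of the $0.25 frontier figure. All three are illustrative assumptions.

```python
def cost_per_resolved_task(success_rate: float, per_attempt_cost: float) -> float:
    """Expected spend to get one resolved task, assuming independent retries."""
    return per_attempt_cost / success_rate


frontier = cost_per_resolved_task(0.73, 0.25)    # audited frontier rate
mid_tier = cost_per_resolved_task(0.65, 0.025)   # hypothetical mid-tier model
print(f"frontier ${frontier:.3f}, mid-tier ${mid_tier:.3f}, "
      f"ratio {frontier / mid_tier:.1f}x")
```

This ignores the cost of tasks the mid-tier model never resolves at all, which is why the strategy is limited to low-stakes work.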

3. Cap retry budget. Set a token-budget ceiling per task. When the agent exceeds it, hand off to a human reviewer. This prevents the retry tax from unbounded growth on hard tasks.
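
A retry cap can be as simple as a loop that tracks tokens spent. This is a sketch only: `attempt` is a hypothetical callable standing in for one agent run, and the return labels are made up for illustration.

```python
from typing import Callable


def run_with_budget(
    attempt: Callable[[], tuple[bool, int]],
    token_budget: int,
) -> str:
    """Retry `attempt` until it succeeds or the token budget is spent.

    `attempt` is a hypothetical callable that runs the agent once and
    returns (succeeded, tokens_used).
    """
    spent = 0
    # The ceiling is checked before each attempt, so the final attempt
    # may overshoot the budget by up to one attempt's worth of tokens.
    while spent < token_budget:
        succeeded, tokens_used = attempt()
        spent += tokens_used
        if succeeded:
            return "resolved"
    return "handoff_to_human"


# Stub agent: fails twice, then succeeds, using 34K tokens per attempt.
outcomes = iter([False, False, True])


def stub_attempt() -> tuple[bool, int]:
    return next(outcomes), 34_000


print(run_with_budget(stub_attempt, token_budget=60_000))
```

With a 60K budget the stub is stopped after its second failed attempt and handed off; with a larger budget the third attempt would run and resolve the task.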

A Note On Forecasting

If you build a 12-month AI coding budget assuming 87% one-shot resolution and the audited reality is 73%, attempts per resolved task rise from 1.15 to 1.37, so you will overspend your annual budget by roughly 19% on this line item (1.37 / 1.15 ≈ 1.19). For a $50K annual contract that is $9,500. For a $500K contract it is $95K.
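
The overrun ratio can be checked directly. This sketch assumes the whole line item scales with attempts per resolved task and that retries are independent; the unrounded ratio comes out slightly above the rounded 19% used for the dollar figures above.

```python
def budget_overrun(assumed_rate: float, audited_rate: float) -> float:
    """Fractional overspend when spend scales as 1/p, the budget assumed
    `assumed_rate`, and the real rate is `audited_rate`."""
    return assumed_rate / audited_rate - 1.0


overrun = budget_overrun(0.87, 0.73)   # about 0.19
for contract in (50_000, 500_000):
    print(f"${contract:,} contract: about ${contract * overrun:,.0f} over")
```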

The remedy is not to be cynical about benchmarks. It is to forecast against audited numbers, not marketed numbers. Every public audit (Cursor's SWE-Bench, VitaBench 2.0, Civ VI tournament) is a free input to your model. Use them.

The Permanent Lesson

Benchmark inflation is structural, not malicious. Vendors optimize against benchmarks because procurement reads benchmarks. The numbers will always trend toward the maximum compatible with vendor claims. Your job is to translate those numbers into your workload, your codebase, and your retry economics — and to assume a 10-25% discount until proven otherwise.

Frequently Asked Questions

Where does the 1.15 vs 1.37 attempts-per-task number come from?

It is the mean of a geometric distribution. For a per-attempt success rate p, with attempts treated as independent, the expected number of attempts until success is 1/p. At 87% it's 1.15, at 73% it's 1.37. The 0.22 difference drives the cost gap.
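
For readers who want the step spelled out: let N be the number of attempts until the first success, with each attempt succeeding independently with probability p (an idealization, since real retries are correlated). Then:

```latex
\begin{aligned}
\mathbb{E}[N] &= \sum_{k=1}^{\infty} k\,p\,(1-p)^{k-1}
              = p \cdot \frac{1}{\bigl(1-(1-p)\bigr)^{2}}
              = \frac{1}{p}, \\
\frac{1}{0.73} - \frac{1}{0.87} &\approx 1.370 - 1.149 \approx 0.22 .
\end{aligned}
```

The middle step uses the identity \sum_{k \ge 1} k x^{k-1} = 1/(1-x)^2 with x = 1 - p.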

Doesn't caching offset the extra attempt cost?

Partially. Cache reads cost ~10% of uncached input. But retry attempts often re-issue partially different prompts that don't hit the cache cleanly. Expect cache to offset 30-50% of the gap, not 100%.
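
A rough way to bound the caching effect is to price a retry with some fraction of its input served from cache. The sketch uses the 30K/4K token split and the ~10% cache-read ratio from this article; the cached fractions are hypothetical inputs, not measurements.

```python
INPUT_COST = 0.15              # USD, 30K input tokens at $5/M (from the text)
OUTPUT_COST = 0.10             # USD, 4K output tokens at $25/M (from the text)
CACHE_READ_PRICE_RATIO = 0.10  # cache reads cost ~10% of uncached input


def retry_cost(cached_fraction: float) -> float:
    """Cost of one retry when `cached_fraction` of input tokens hit the cache."""
    cached = INPUT_COST * cached_fraction * CACHE_READ_PRICE_RATIO
    uncached = INPUT_COST * (1.0 - cached_fraction)
    return cached + uncached + OUTPUT_COST


full_price = retry_cost(0.0)
for fraction in (0.5, 0.8, 1.0):
    saving = 1.0 - retry_cost(fraction) / full_price
    print(f"{fraction:.0%} of input cached: retry costs {saving:.0%} less")
```

Output tokens are never cached, so under these assumptions even a perfect cache hit cannot remove the whole gap.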

What if my workload uses larger context windows?

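The input portion of the dollar gap scales linearly with context size. With output held at 4K tokens, a 100K-input attempt costs $0.60 versus $0.25 at 30K input, so the per-task gap is about 2.4x that of a 30K-input workload. It reaches 3.3x only if output tokens grow in the same proportion as input.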

How do I track this in our internal cost monitoring?

Add a 'retry tax' line to your monthly AI cost dashboard: for each task, (actual attempts - 1) × per-attempt cost, summed over the month. Many providers expose attempt counts via OTLP or their dashboards.
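
As a sketch, the dashboard line reduces to one function. The attempt counts below are hypothetical sample data, and a single flat per-attempt cost is assumed; with real logs you would use each attempt's actual token cost instead.

```python
def retry_tax(attempts_per_task: list[int], per_attempt_cost: float) -> float:
    """Sum of (attempts - 1) x per-attempt cost over all tasks in the period."""
    return sum(max(n - 1, 0) for n in attempts_per_task) * per_attempt_cost


# Hypothetical attempt counts for five tasks pulled from usage logs.
sample_attempts = [1, 1, 2, 1, 3]
print(f"${retry_tax(sample_attempts, 0.25):.2f}")  # three extra attempts at $0.25
```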