
Cursor Evals Now Shows Per-Model Cost: What the Data Reveals

June 10, 2026 · 7 min read


Cursor Makes Cost Transparent

Cursor's evals page at cursor.com/evals has quietly added a column that changes how developers should think about model selection: cost per eval run. Previously, the page showed quality scores — pass rates on coding tasks — but left developers guessing about the price-performance tradeoff. Now you can see exactly what each model costs to achieve its score.

This is a significant shift. When quality was the only visible metric, developers naturally gravitated toward the highest-scoring model. With cost visible alongside quality, the rational choice becomes clear: the cheapest model that passes your quality threshold is the right model.

The Cost-Quality Matrix: What Cursor's Data Shows

Based on current API pricing and typical token usage per coding task in Cursor's eval suite, here's how the major models stack up. These figures represent the cost to complete a standard coding eval task (averaging 2,000 input tokens and 1,500 output tokens per request, with multi-turn tasks requiring 3-5 requests).

| Model | API Pricing (input / output, USD per 1M tokens) | Est. Cost Per Eval Task | Quality Score | Cost Per Quality Point |
|---|---|---|---|---|
| Claude Opus 4.8 | $5.00 / $25.00 | $0.147 | 94% | $0.0016 |
| Claude Sonnet 4.6 | $3.00 / $15.00 | $0.089 | 89% | $0.0010 |
| Gemini 2.5 Pro | $1.25 / $10.00 | $0.058 | 87% | $0.0007 |
| Claude Haiku 4.5 | $1.00 / $5.00 | $0.030 | 76% | $0.0004 |
| DeepSeek V4 Flash | $0.14 / $0.28 | $0.001 | 72% | $0.00001 |
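The per-task estimates can be approximated from the stated token counts and prices. Here is a minimal sketch, assuming three requests per task (the low end of the 3-5 range; the article does not say which count the table uses, so the results will not match the table exactly). The `PRICES` dictionary and the `cost_per_task` function are illustrative names, not part of any tool.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def cost_per_task(model, input_tokens=2_000, output_tokens=1_500, requests=3):
    """Estimated USD cost of one multi-turn task.

    requests=3 is an assumption: the low end of the 3-5 range in the text.
    """
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests


for model in PRICES:
    print(f"{model}: ${cost_per_task(model):.4f}")
```

With three requests, Opus comes out at about $0.1425 per task against the table's $0.147, which suggests the table assumes slightly more than three requests on average for most models.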

The Key Insight: Cost Per Quality Point

The "Cost Per Quality Point" column reveals something counterintuitive. DeepSeek V4 Flash delivers the best value per quality point by a massive margin: computed from the table's cost and score columns, it costs roughly 110x less per quality point than Claude Opus 4.8 (dividing the rounded column values gives 160x). But that metric alone is misleading. If your task requires 90%+ accuracy (complex refactors, security-critical code), DeepSeek's 72% pass rate means you'll need retries, human review, or both, which erodes the cost advantage.
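The metric itself is just cost divided by score. A small sketch using the table's cost and pass-rate columns follows; the names are illustrative. Note that the unrounded ratio between Opus and DeepSeek comes out near 113x, while dividing the rounded column values ($0.0016 / $0.00001) gives 160x.

```python
# (estimated cost per eval task in USD, quality score in percent), from the table above.
MODELS = {
    "Claude Opus 4.8": (0.147, 94),
    "Claude Sonnet 4.6": (0.089, 89),
    "Gemini 2.5 Pro": (0.058, 87),
    "Claude Haiku 4.5": (0.030, 76),
    "DeepSeek V4 Flash": (0.001, 72),
}


def cost_per_quality_point(cost, score):
    """USD spent per percentage point of pass rate."""
    return cost / score


opus = cost_per_quality_point(*MODELS["Claude Opus 4.8"])
flash = cost_per_quality_point(*MODELS["DeepSeek V4 Flash"])
print(f"Opus: ${opus:.5f}  Flash: ${flash:.7f}  ratio: {opus / flash:.0f}x")
```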

The real question Cursor's data helps answer is: what's the minimum quality threshold for your specific use case? For autocomplete and simple boilerplate, 72% is fine — failures are caught instantly. For autonomous multi-file edits, you need 90%+ or the debugging cost exceeds the generation savings.

How This Changes Model Selection Strategy

Before cost transparency, the default strategy was simple: use the best model available. Now, rational model selection follows a decision tree:

Step 1: Define your quality floor. What's the minimum pass rate you need? For tab completion, 70% is acceptable. For agent-driven tasks, 85-90% is the minimum. For production code generation without human review, you want 93%+.

Step 2: Find the cheapest model above your floor. If your floor is 85%, Gemini 2.5 Pro at $0.058/task beats Claude Sonnet 4.6 at $0.089/task, a 35% savings with comparable quality (87% vs. 89%). If your floor is a strict 90%, only Opus 4.8 clears it in this table. Sonnet 4.6 sits just below at 89%, so check whether the floor can flex by a point: Opus costs 65% more for 5 percentage points of additional quality.

Step 3: Account for retry economics. If retries are independent, the expected number of attempts per success is 1 / pass rate: a model with an 89% pass rate needs ~1.12 attempts, and a model with 72% needs ~1.39. Factor this multiplier into your cost calculation. DeepSeek at $0.001 * 1.39 = $0.0014 is still far cheaper than Opus at $0.147 * 1.06 = $0.156, but the quality of failures matters too. A subtle bug that passes initial review costs far more than re-running a clearly failed generation.
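The three steps can be sketched in a few lines. This assumes that failures are detected and that retries are independent, so the expected attempts per success is 1 / pass rate; it does not model the cost of subtle failures that slip through review. Function names are illustrative.

```python
# (estimated cost per task in USD, pass rate as a fraction), from the table above.
MODELS = {
    "Claude Opus 4.8": (0.147, 0.94),
    "Claude Sonnet 4.6": (0.089, 0.89),
    "Gemini 2.5 Pro": (0.058, 0.87),
    "Claude Haiku 4.5": (0.030, 0.76),
    "DeepSeek V4 Flash": (0.001, 0.72),
}


def cheapest_above_floor(models, floor):
    """Steps 1-2: cheapest model whose pass rate meets the floor, or None."""
    eligible = {name: cost for name, (cost, rate) in models.items() if rate >= floor}
    return min(eligible, key=eligible.get) if eligible else None


def cost_per_success(cost, pass_rate):
    """Step 3: expected cost per successful task, assuming independent retries."""
    return cost / pass_rate


for floor in (0.70, 0.85, 0.90):
    name = cheapest_above_floor(MODELS, floor)
    cost, rate = MODELS[name]
    print(f"floor {floor:.0%}: {name}, ${cost_per_success(cost, rate):.4f} per success")
```

With a strict 90% floor, only Opus qualifies in this table, since Sonnet's 89% falls just below it.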

The Monthly Impact: Real Numbers for Teams

Consider a 5-person dev team making 200 AI-assisted coding requests per developer per day. That's 1,000 requests/day, or roughly 22,000 requests/month at 22 working days. Here's what model selection means for their bill, pricing each request at the per-task estimate from the table above:

| Model Choice | Cost/Request | Monthly (22K requests) | Annual |
|---|---|---|---|
| All Opus 4.8 | $0.147 | $3,234 | $38,808 |
| All Sonnet 4.6 | $0.089 | $1,958 | $23,496 |
| Mixed (Opus for complex, Haiku for simple) | ~$0.053 | $1,166 | $13,992 |
| All DeepSeek V4 Flash | $0.001 | $22 | $264 |

The difference between "always use the best" (all Opus, $38,808) and "use the cheapest that works" (the mixed approach, $13,992) is nearly $25,000 per year for a small team. The mixed approach, which routes complex tasks to Opus and simple tasks to Haiku, captures an estimated 90% of the quality at 36% of the all-Opus cost.
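The arithmetic behind the table is easy to reproduce for your own team size. The sketch below assumes 22 working days per month (implied by 22,000 requests at 1,000 per day). The article does not state the Opus/Haiku split behind the ~$0.053 blended figure; a 20/80 split is one assumption that lands close to it.

```python
def monthly_cost(cost_per_request, developers=5, requests_per_dev_per_day=200, workdays=22):
    """Monthly USD spend; workdays=22 is an assumption implied by the text."""
    return cost_per_request * developers * requests_per_dev_per_day * workdays


def blended_cost(opus_share, opus_cost=0.147, haiku_cost=0.030):
    """Average cost per request when opus_share of traffic goes to Opus, the rest to Haiku."""
    return opus_share * opus_cost + (1 - opus_share) * haiku_cost


all_opus = monthly_cost(0.147)
mixed = monthly_cost(blended_cost(0.20))  # 20% Opus is a hypothetical split
print(f"All Opus: ${all_opus:,.0f}/month, ${all_opus * 12:,.0f}/year")
print(f"Mixed:    ${mixed:,.0f}/month, ${mixed * 12:,.0f}/year")
```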

What Cursor's Transparency Signals for the Industry

Cursor publishing cost data alongside quality scores is a competitive move. It signals that they expect developers to optimize — and that Cursor's routing layer can help. Expect other AI coding tools to follow with similar transparency, because the alternative is losing cost-conscious users to tools that prove their value.

For developers, this data enables a conversation that was previously impossible: going to your engineering manager with concrete cost-per-quality tradeoffs rather than vague claims about AI productivity. "We can save $15K/year by using Sonnet instead of Opus, with only a 5-point drop in first-attempt success rate" is a sentence that gets budget approved.

Practical Recommendations

Run your own evals. Cursor's benchmarks test general coding tasks. Your codebase is specific. A model that scores 87% overall might score 95% on your TypeScript React code and 60% on your Rust systems code. Build a small eval suite (~20 representative tasks from your actual work) and test models against it.
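A home-grown eval suite can start very small. The sketch below is one possible shape, not Cursor's method: each task pairs a prompt with a checker function, and `fake_model` is a stand-in that you would replace with a real call to your provider's client.

```python
from typing import Callable

# Each task is a (prompt, checker) pair; the checker returns True if the output passes.
Task = tuple[str, Callable[[str], bool]]


def pass_rate(generate: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose generated output passes its checker."""
    if not tasks:
        raise ValueError("eval suite is empty")
    passed = sum(1 for prompt, check in tasks if check(generate(prompt)))
    return passed / len(tasks)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; swap in a real client here."""
    return "def add(a, b): return a + b" if "add" in prompt else ""


tasks: list[Task] = [
    ("write add", lambda out: "def add" in out),
    ("write sub", lambda out: "def sub" in out),
]
print(pass_rate(fake_model, tasks))  # 0.5
```

In practice the checkers would run your tests or a linter against the generated code rather than match substrings.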

Set up model routing. Don't use one model for everything. Use Opus or Sonnet for complex multi-file changes, architecture decisions, and bug diagnosis. Use Haiku or DeepSeek for autocomplete, test generation, and boilerplate. The quality difference on simple tasks is negligible; the cost difference ranges from about 3x (Sonnet vs. Haiku) to roughly 150x (Opus vs. DeepSeek) using the per-task estimates above.
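A routing layer can begin as a simple lookup by task type. The sketch below is a hypothetical example: the task-type labels and the model name strings are placeholders, not real API identifiers, and unknown task types fall back to the stronger model.

```python
# Task categories taken from the paragraph above; the labels themselves are illustrative.
COMPLEX_TASKS = {"multi_file_change", "architecture", "bug_diagnosis"}
SIMPLE_TASKS = {"autocomplete", "test_generation", "boilerplate"}


def route(task_type: str, complex_model: str = "strong-model", simple_model: str = "cheap-model") -> str:
    """Pick a model tier for a task type."""
    if task_type in SIMPLE_TASKS:
        return simple_model
    if task_type in COMPLEX_TASKS:
        return complex_model
    # Unknown task types default to the stronger model, trading cost for safety.
    return complex_model


print(route("autocomplete"))   # cheap-model
print(route("bug_diagnosis"))  # strong-model
```

Defaulting unknown work to the stronger tier is a design choice; a team with a tight budget might prefer the opposite default and escalate on failure.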

Track your actual spend per task type. Most teams have no visibility into which tasks consume the most tokens. Instrument your usage. You'll likely find that 20% of your tasks (agent-driven refactors, large context searches) consume 80% of your budget — and those are the tasks where model selection matters most.
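Instrumenting spend can be as simple as aggregating usage records by task type. This sketch assumes you already log one record per request with a task type, model, and token counts; the record fields and the short model keys are hypothetical, and the prices are the per-1M-token rates from the table earlier in this post.

```python
from collections import defaultdict

# USD per 1M tokens as (input, output); short keys are illustrative.
PRICES = {"opus": (5.00, 25.00), "haiku": (1.00, 5.00)}


def spend_by_task_type(records, prices):
    """Total USD spend per task type, sorted from most to least expensive."""
    totals = defaultdict(float)
    for r in records:
        in_price, out_price = prices[r["model"]]
        cost = (r["input_tokens"] * in_price + r["output_tokens"] * out_price) / 1_000_000
        totals[r["task_type"]] += cost
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


records = [
    {"task_type": "agent_refactor", "model": "opus", "input_tokens": 40_000, "output_tokens": 8_000},
    {"task_type": "autocomplete", "model": "haiku", "input_tokens": 2_000, "output_tokens": 100},
    {"task_type": "autocomplete", "model": "haiku", "input_tokens": 2_000, "output_tokens": 100},
]
spend = spend_by_task_type(records, PRICES)
for task_type, usd in spend.items():
    print(f"{task_type}: ${usd:.4f}")
```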

The era of "just use the most expensive model" is ending. Cursor's cost transparency is the beginning of a more mature market where developers make informed price-performance decisions — the same way teams already choose between EC2 instance types or database tiers. The cheapest model that passes your eval is the right choice. Now you have the data to prove it.
