Cursor Evals Now Shows Per-Model Cost: What the Data Reveals

By Eric Bush · June 10, 2026 · 7 min read

Cursor Makes Cost Transparent

Cursor's evals page at cursor.com/evals has quietly added a column that changes how developers should think about model selection: cost per eval run. Previously, the page showed quality scores — pass rates on coding tasks — but left developers guessing about the price-performance tradeoff. Now you can see exactly what each model costs to achieve its score.

This is a significant shift. When quality was the only visible metric, developers naturally gravitated toward the highest-scoring model. With cost visible alongside quality, the rational choice becomes clear: the cheapest model that passes your quality threshold is the right model.

The Cost-Quality Matrix: What Cursor's Data Shows

Based on current API pricing and typical token usage per coding task in Cursor's eval suite, here's how the major models stack up. These figures represent the cost to complete a standard coding eval task (averaging 2,000 input tokens and 1,500 output tokens per request, with multi-turn tasks requiring 3-5 requests).

Model	API Pricing (input / output, USD per 1M tokens)	Est. Cost Per Eval Task	Quality Score	Cost Per Quality Point
Claude Opus 4.8	$5.00 / $25.00	$0.147	94%	$0.0016
Claude Sonnet 4.6	$3.00 / $15.00	$0.089	89%	$0.0010
Gemini 2.5 Pro	$1.25 / $10.00	$0.058	87%	$0.0007
Claude Haiku 4.5	$1.00 / $5.00	$0.030	76%	$0.0004
DeepSeek V4 Flash	$0.14 / $0.28	$0.001	72%	$0.00001
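
To make the arithmetic behind the estimated per-task costs concrete, here is a minimal sketch that applies the stated averages (2,000 input and 1,500 output tokens per request, 3 to 5 requests per task) to the listed prices. The function names are illustrative, and the article does not state the exact request count or any caching discounts behind each row, so expect a range rather than an exact match.

```python
# (input, output) prices in USD per million tokens, copied from the table above.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def cost_per_request(in_price, out_price, in_tokens=2_000, out_tokens=1_500):
    """USD for one request at the article's average token counts."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def cost_per_task(in_price, out_price, requests_per_task):
    """USD for a multi-turn task made of several average-sized requests."""
    return cost_per_request(in_price, out_price) * requests_per_task


if __name__ == "__main__":
    for model, (p_in, p_out) in PRICES.items():
        low = cost_per_task(p_in, p_out, 3)   # 3 requests per task
        high = cost_per_task(p_in, p_out, 5)  # 5 requests per task
        print(f"{model}: ${low:.4f} to ${high:.4f} per task")
```

Under these assumptions, the 3-request figure lands close to the table's estimate for the Claude and Gemini rows. The DeepSeek row comes out at about $0.002 with this formula, so the table presumably rounds or uses different assumptions there.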

The Key Insight: Cost Per Quality Point

The "Cost Per Quality Point" column reveals something counterintuitive. DeepSeek V4 Flash delivers the best value per quality point by a massive margin: it costs more than 100x less per quality point than Claude Opus 4.8 (roughly 113x from the table's unrounded figures, 160x if you divide the rounded column entries). But that metric alone is misleading. If your task requires 90%+ accuracy (complex refactors, security-critical code), DeepSeek's 72% pass rate means you'll need retries, human review, or both, which can erase the cost advantage.

The real question Cursor's data helps answer is: what's the minimum quality threshold for your specific use case? For autocomplete and simple boilerplate, 72% is fine — failures are caught instantly. For autonomous multi-file edits, you need 90%+ or the debugging cost exceeds the generation savings.
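
The "Cost Per Quality Point" figure is just the estimated task cost divided by the quality score. A small sketch, using the table's numbers and an illustrative function name, shows how sensitive the headline ratio is to rounding:

```python
# (estimated cost per eval task in USD, quality score in percent), from the table.
MODELS = {
    "Claude Opus 4.8": (0.147, 94),
    "Claude Sonnet 4.6": (0.089, 89),
    "Gemini 2.5 Pro": (0.058, 87),
    "Claude Haiku 4.5": (0.030, 76),
    "DeepSeek V4 Flash": (0.001, 72),
}


def cost_per_quality_point(cost, score_pct):
    """USD spent per percentage point of pass rate."""
    return cost / score_pct


def value_ratio(expensive, cheap, models=MODELS):
    """How many times more the first model costs per quality point."""
    return cost_per_quality_point(*models[expensive]) / cost_per_quality_point(*models[cheap])


if __name__ == "__main__":
    for name, (cost, score) in MODELS.items():
        print(f"{name}: ${cost_per_quality_point(cost, score):.5f} per point")
    print(f"Opus vs DeepSeek: {value_ratio('Claude Opus 4.8', 'DeepSeek V4 Flash'):.0f}x")
```

Dividing the unrounded values gives a ratio of about 113x between Opus and DeepSeek; dividing the rounded column entries ($0.0016 and $0.00001) gives 160x. Either way the gap is two orders of magnitude, but the exact multiple should not be over-read.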

How This Changes Model Selection Strategy

Before cost transparency, the default strategy was simple: use the best model available. Now, rational model selection follows a decision tree:

Step 1: Define your quality floor. What's the minimum pass rate you need? For tab completion, 70% is acceptable. For agent-driven tasks, 85-90% is the minimum. For production code generation without human review, you want 93%+.

Step 2: Find the cheapest model above your floor. If your floor is 85%, Gemini 2.5 Pro at $0.058/task beats Claude Sonnet 4.6 at $0.089/task: a 35% savings with comparable quality (87% vs. 89%). If your floor is 90%, only Opus 4.8 (94%) clears it in this table. Sonnet 4.6 falls just short at 89%, so it is worth asking whether your floor is really 90% or closer to 89%, since Opus 4.8 costs 65% more for only 5 percentage points of additional quality.

Step 3: Account for retry economics. A model with 89% pass rate needs ~1.12 attempts per success. A model with 72% needs ~1.39 attempts. Factor this multiplier into your cost calculation. DeepSeek at $0.001 * 1.39 = $0.0014 is still far cheaper than Opus at $0.147 * 1.06 = $0.156 — but the quality of failures matters too. A subtle bug that passes initial review costs far more than re-running a clearly failed generation.
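
The three steps above can be sketched as a small selection function. This is an illustration under a simplifying assumption that the article does not spell out: retries are independent, so the expected number of attempts per success is 1 / pass rate. The function names are hypothetical, and the sketch ignores the "quality of failures" caveat entirely.

```python
from typing import Dict, Optional, Tuple

# (estimated cost per task in USD, pass rate as a fraction), from the table above.
MODELS: Dict[str, Tuple[float, float]] = {
    "Claude Opus 4.8": (0.147, 0.94),
    "Claude Sonnet 4.6": (0.089, 0.89),
    "Gemini 2.5 Pro": (0.058, 0.87),
    "Claude Haiku 4.5": (0.030, 0.76),
    "DeepSeek V4 Flash": (0.001, 0.72),
}


def expected_cost_per_success(cost: float, pass_rate: float) -> float:
    """Cost times expected attempts (1 / pass_rate), assuming independent retries."""
    return cost / pass_rate


def cheapest_above_floor(
    models: Dict[str, Tuple[float, float]], floor: float
) -> Optional[str]:
    """Return the model with the lowest retry-adjusted cost whose pass rate
    meets the floor, or None if no model qualifies."""
    eligible = {
        name: expected_cost_per_success(cost, rate)
        for name, (cost, rate) in models.items()
        if rate >= floor
    }
    return min(eligible, key=eligible.get) if eligible else None


if __name__ == "__main__":
    for floor in (0.70, 0.85):
        print(f"floor {floor:.0%}: {cheapest_above_floor(MODELS, floor)}")
```

With a 70% floor this picks DeepSeek V4 Flash, and with an 85% floor it picks Gemini 2.5 Pro. A floor that no model meets returns None, which is a useful signal that the task needs human review rather than a cheaper model.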

The Monthly Impact: Real Numbers for Teams

Consider a 5-person dev team making 200 AI-assisted coding requests per developer per day. That's 1,000 requests/day or roughly 22,000 requests/month. Here's what model selection means for their bill:

Model Choice	Cost/Request	Monthly (22K requests)	Annual
All Opus 4.8	$0.147	$3,234	$38,808
All Sonnet 4.6	$0.089	$1,958	$23,496
Mixed (Opus for complex, Haiku for simple)	~$0.053	$1,166	$13,992
All DeepSeek V4 Flash	$0.001	$22	$264

The difference between "always use the best" and the mixed approach is roughly $24,800 per year for a small team ($38,808 versus $13,992). The mixed approach, which routes complex tasks to Opus and simple tasks to Haiku, captures 90% of the quality at 36% of the all-Opus cost.
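
The monthly figures follow from simple multiplication, sketched below. Two values are assumptions inferred from the numbers rather than stated in the article: 22 working days per month (implied by 1,000 requests/day becoming 22,000/month) and a 20% Opus / 80% Haiku split, which happens to reproduce the ~$0.053 blended cost.

```python
TEAM_SIZE = 5
REQUESTS_PER_DEV_PER_DAY = 200
WORKDAYS_PER_MONTH = 22  # assumption: implied by 1,000/day -> 22,000/month


def monthly_requests(team=TEAM_SIZE, per_dev=REQUESTS_PER_DEV_PER_DAY,
                     workdays=WORKDAYS_PER_MONTH):
    """Total AI-assisted requests the team makes in a month."""
    return team * per_dev * workdays


def monthly_cost(cost_per_request, requests=None):
    """Monthly bill in USD at a flat cost per request."""
    if requests is None:
        requests = monthly_requests()
    return cost_per_request * requests


def blended_cost(share_expensive, cost_expensive, cost_cheap):
    """Average cost per request when a share of traffic goes to the pricier model."""
    return share_expensive * cost_expensive + (1 - share_expensive) * cost_cheap


if __name__ == "__main__":
    opus, haiku = 0.147, 0.030
    mixed = blended_cost(0.20, opus, haiku)  # assumed 20/80 split
    for label, cost in [("All Opus 4.8", opus), ("Mixed", mixed)]:
        month = monthly_cost(cost)
        print(f"{label}: ${cost:.4f}/request, ${month:,.0f}/month, ${month * 12:,.0f}/year")
```

Changing the assumed split is the quickest way to see how sensitive the savings are: every extra 10% of traffic routed to Opus adds about $0.0117 to the blended per-request cost.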

What Cursor's Transparency Signals for the Industry

Cursor publishing cost data alongside quality scores is a competitive move. It signals that they expect developers to optimize — and that Cursor's routing layer can help. Expect other AI coding tools to follow with similar transparency, because the alternative is losing cost-conscious users to tools that prove their value.

For developers, this data enables a conversation that was previously impossible: going to your engineering manager with concrete cost-per-quality tradeoffs rather than vague claims about AI productivity. "We can save $15K/year by using Sonnet instead of Opus, with only a 5-percentage-point drop in first-attempt success rate" is a sentence that gets budget approved.

Practical Recommendations

Run your own evals. Cursor's benchmarks test general coding tasks. Your codebase is specific. A model that scores 87% overall might score 95% on your TypeScript React code and 60% on your Rust systems code. Build a small eval suite (~20 representative tasks from your actual work) and test models against it.
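
A minimal sketch of what such an eval suite could look like follows. Everything here is hypothetical scaffolding: `EvalTask`, `pass_rate`, and the stub model are illustrative names, and the string checks stand in for what would normally be real test runs against generated code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalTask:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the model output passes


def pass_rate(generate: Callable[[str], str], tasks: Iterable[EvalTask]) -> float:
    """Fraction of tasks whose generated output passes its check."""
    tasks = list(tasks)
    if not tasks:
        raise ValueError("eval suite is empty")
    passed = sum(1 for task in tasks if task.check(generate(task.prompt)))
    return passed / len(tasks)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your provider's client."""
    return "return a + b" if "add" in prompt else ""


TASKS = [
    EvalTask("add", "Write add(a, b)", lambda out: "a + b" in out),
    EvalTask("reverse", "Write reverse(s)", lambda out: "[::-1]" in out),
]

if __name__ == "__main__":
    print(f"pass rate: {pass_rate(stub_model, TASKS):.0%}")
```

In practice each `check` would run your project's own tests against the generated code, and you would call `pass_rate` once per candidate model on the same task list so the scores are comparable.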

Set up model routing. Don't use one model for everything. Use Opus or Sonnet for complex multi-file changes, architecture decisions, and bug diagnosis. Use Haiku or DeepSeek for autocomplete, test generation, and boilerplate. The quality difference on simple tasks is negligible; the cost difference, going by the per-task estimates above, ranges from roughly 3-5x (Haiku) to 90-150x (DeepSeek).
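
As a rough illustration, routing can start as a lookup on task type. The task-type labels below are hypothetical, the model names are plain labels rather than API identifiers, and sending unknown task types to the stronger tier is a design assumption, not something the article prescribes.

```python
# Hypothetical task-type labels based on the examples in the paragraph above.
COMPLEX_TASKS = {"multi_file_change", "architecture", "bug_diagnosis"}
SIMPLE_TASKS = {"autocomplete", "test_generation", "boilerplate"}

STRONG_MODEL = "Claude Sonnet 4.6"  # label only, not an API model ID
CHEAP_MODEL = "Claude Haiku 4.5"    # label only, not an API model ID


def route(task_type: str) -> str:
    """Pick a model tier for a task type.

    Known simple tasks go to the cheap tier. Everything else, including
    task types we have not classified, goes to the strong tier.
    """
    if task_type in SIMPLE_TASKS:
        return CHEAP_MODEL
    return STRONG_MODEL


if __name__ == "__main__":
    for task in ("autocomplete", "bug_diagnosis", "unclassified_task"):
        print(f"{task} -> {route(task)}")
```

Defaulting unclassified work to the stronger model trades some cost for safety: a misrouted simple task wastes a few cents, while a misrouted complex task can waste debugging time.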

Track your actual spend per task type. Most teams have no visibility into which tasks consume the most tokens. Instrument your usage. You'll likely find that 20% of your tasks (agent-driven refactors, large context searches) consume 80% of your budget — and those are the tasks where model selection matters most.
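
Instrumentation can begin with a simple aggregation over usage records. The sketch below assumes you can export records as (task type, cost in USD) pairs; the sample data is made up purely to show the shape of the output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def spend_by_task_type(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum cost in USD per task type."""
    totals: Dict[str, float] = defaultdict(float)
    for task_type, cost in records:
        totals[task_type] += cost
    return dict(totals)


def spend_share(totals: Dict[str, float]) -> Dict[str, float]:
    """Each task type's fraction of total spend, largest first."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {task_type: cost / grand_total for task_type, cost in ranked}


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        ("agent_refactor", 2.10),
        ("autocomplete", 0.30),
        ("agent_refactor", 2.10),
        ("test_generation", 0.50),
    ]
    for task_type, share in spend_share(spend_by_task_type(sample)).items():
        print(f"{task_type}: {share:.0%} of spend")
```

Once the shares are visible, the task types at the top of the list are the ones where a model-selection change moves the bill the most.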

The era of "just use the most expensive model" is ending. Cursor's cost transparency is the beginning of a more mature market where developers make informed price-performance decisions — the same way teams already choose between EC2 instance types or database tiers. The cheapest model that passes your eval is the right choice. Now you have the data to prove it.
