Cursor Evals Now Shows Per-Model Cost: What the Data Reveals
June 10, 2026 · 7 min read
Cursor Makes Cost Transparent
Cursor's evals page at cursor.com/evals has quietly added a column that changes how developers should think about model selection: cost per eval run. Previously, the page showed quality scores — pass rates on coding tasks — but left developers guessing about the price-performance tradeoff. Now you can see exactly what each model costs to achieve its score.
This is a significant shift. When quality was the only visible metric, developers naturally gravitated toward the highest-scoring model. With cost visible alongside quality, the rational choice becomes clear: the cheapest model that passes your quality threshold is the right model.
The Cost-Quality Matrix: What Cursor's Data Shows
Based on current API pricing and typical token usage per coding task in Cursor's eval suite, here's how the major models stack up. These figures represent the cost to complete a standard coding eval task (averaging 2,000 input tokens and 1,500 output tokens per request, with multi-turn tasks requiring 3-5 requests).
| Model | API Pricing (input / output, USD per 1M tokens) | Est. Cost Per Eval Task | Quality Score (pass rate) | Cost Per Quality Point |
|---|---|---|---|---|
| Claude Opus 4.8 | $5.00 / $25.00 | $0.147 | 94% | $0.0016 |
| Claude Sonnet 4.6 | $3.00 / $15.00 | $0.089 | 89% | $0.0010 |
| Gemini 2.5 Pro | $1.25 / $10.00 | $0.058 | 87% | $0.0007 |
| Claude Haiku 4.5 | $1.00 / $5.00 | $0.030 | 76% | $0.0004 |
| DeepSeek V4 Flash | $0.14 / $0.28 | $0.001 | 72% | $0.00001 |
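The arithmetic behind the last three columns can be sketched as follows. This is a minimal illustration that assumes the stated averages (2,000 input and 1,500 output tokens per request, 3 to 5 requests per task); the `Model` class and function names are hypothetical, and the results will not match the table exactly because the article does not state the request count behind each estimate.

```python
from dataclasses import dataclass

# Average token counts per request, as stated in the article.
INPUT_TOKENS = 2_000
OUTPUT_TOKENS = 1_500


@dataclass(frozen=True)
class Model:
    name: str
    in_price: float   # USD per million input tokens
    out_price: float  # USD per million output tokens
    quality: float    # pass rate in percent, from the table


def cost_per_request(m: Model) -> float:
    """USD for one request at the assumed average token counts."""
    return (INPUT_TOKENS * m.in_price + OUTPUT_TOKENS * m.out_price) / 1_000_000


def cost_per_task(m: Model, requests: int) -> float:
    """USD for a task that needs `requests` round trips."""
    return cost_per_request(m) * requests


def cost_per_quality_point(task_cost: float, quality: float) -> float:
    """Task cost divided by the pass rate in percentage points."""
    return task_cost / quality


MODELS = [
    Model("Claude Opus 4.8", 5.00, 25.00, 94),
    Model("Claude Sonnet 4.6", 3.00, 15.00, 89),
    Model("Gemini 2.5 Pro", 1.25, 10.00, 87),
    Model("Claude Haiku 4.5", 1.00, 5.00, 76),
    Model("DeepSeek V4 Flash", 0.14, 0.28, 72),
]

if __name__ == "__main__":
    for m in MODELS:
        low, high = cost_per_task(m, 3), cost_per_task(m, 5)
        print(f"{m.name:<20} ${low:.4f} to ${high:.4f} per task")
```

For Claude Opus 4.8 this gives $0.1425 for a three-request task and $0.2375 for a five-request task, so the table's $0.147 sits near the low end of that range.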
The Key Insight: Cost Per Quality Point
The "Cost Per Quality Point" column reveals something counterintuitive. DeepSeek V4 Flash delivers the best value per quality point by a massive margin: going by the table, it costs roughly 160x less per quality point than Claude Opus 4.8. But that metric alone is misleading. If your task requires 90%+ accuracy (complex refactors, security-critical code), DeepSeek's 72% pass rate means you'll need retries, human review, or both. The retries themselves are cheap, but review and debugging time is not, and that is what can erase the cost advantage.
The real question Cursor's data helps answer is: what's the minimum quality threshold for your specific use case? For autocomplete and simple boilerplate, 72% is fine — failures are caught instantly. For autonomous multi-file edits, you need 90%+ or the debugging cost exceeds the generation savings.
How This Changes Model Selection Strategy
Before cost transparency, the default strategy was simple: use the best model available. Now, rational model selection follows three steps:
Step 1: Define your quality floor. What's the minimum pass rate you need? For tab completion, 70% is acceptable. For agent-driven tasks, 85-90% is the minimum. For production code generation without human review, you want 93%+.
Step 2: Find the cheapest model above your floor. If your floor is 85%, Gemini 2.5 Pro at $0.058/task beats Claude Sonnet at $0.089/task, a 35% savings with comparable quality. If your floor is a strict 90%, only Opus 4.8 (94%) clears it. Sonnet 4.6 falls just short at 89%, so it is worth asking whether you can relax the floor by a point: Opus costs 65% more for only 5 percentage points of additional quality.
Step 3: Account for retry economics. A model with 89% pass rate needs ~1.12 attempts per success. A model with 72% needs ~1.39 attempts. Factor this multiplier into your cost calculation. DeepSeek at $0.001 * 1.39 = $0.0014 is still far cheaper than Opus at $0.147 * 1.06 = $0.156 — but the quality of failures matters too. A subtle bug that passes initial review costs far more than re-running a clearly failed generation.
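The three steps above can be sketched as a small selection function. This is an illustration, not Cursor's method: it takes the per-task costs and pass rates from the table, assumes retries are independent and failures are always detected (so the mean number of attempts is 1 / pass rate), and ignores the cost of subtle bugs that slip through. The function names are hypothetical.

```python
from typing import Optional

# (per-task cost in USD, pass rate) copied from the table above.
MODELS = {
    "Claude Opus 4.8": (0.147, 0.94),
    "Claude Sonnet 4.6": (0.089, 0.89),
    "Gemini 2.5 Pro": (0.058, 0.87),
    "Claude Haiku 4.5": (0.030, 0.76),
    "DeepSeek V4 Flash": (0.001, 0.72),
}


def expected_attempts(pass_rate: float) -> float:
    """Mean attempts until the first success, assuming independent retries."""
    return 1.0 / pass_rate


def cost_per_success(cost: float, pass_rate: float) -> float:
    """Retry-adjusted cost: per-attempt cost times expected attempts."""
    return cost * expected_attempts(pass_rate)


def cheapest_above_floor(floor: float) -> Optional[str]:
    """Cheapest model (retry-adjusted) whose pass rate meets the floor."""
    eligible = {
        name: cost_per_success(cost, rate)
        for name, (cost, rate) in MODELS.items()
        if rate >= floor
    }
    if not eligible:
        return None  # no model meets the floor
    return min(eligible, key=eligible.get)


if __name__ == "__main__":
    for floor in (0.70, 0.85, 0.90):
        print(f"floor {floor:.0%}: {cheapest_above_floor(floor)}")
```

With a floor of 0.85 this picks Gemini 2.5 Pro. At 0.90 only Claude Opus 4.8 qualifies, because Sonnet 4.6's 89% falls just under the line.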
The Monthly Impact: Real Numbers for Teams
Consider a 5-person dev team making 200 AI-assisted coding requests per developer per day. That's 1,000 requests/day or, at about 22 working days, roughly 22,000 requests/month. Pricing each request at the per-task estimates above, here's what model selection means for their bill:
| Model Choice | Cost/Request | Monthly (22K requests) | Annual |
|---|---|---|---|
| All Opus 4.8 | $0.147 | $3,234 | $38,808 |
| All Sonnet 4.6 | $0.089 | $1,958 | $23,496 |
| Mixed (Opus for complex, Haiku for simple) | ~$0.053 | $1,166 | $13,992 |
| All DeepSeek V4 Flash | $0.001 | $22 | $264 |
The difference between "always use the best" and the mixed strategy is roughly $25,000 per year for a small team ($38,808 vs. $13,992). The mixed approach, which routes complex tasks to Opus and simple tasks to Haiku, captures an estimated 90% of the quality at 36% of the all-Opus cost.
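The bill arithmetic is easy to reproduce. The sketch below assumes 22 working days per month (implied by the 22,000 figure) and, for the mixed row, a hypothetical 20% Opus / 80% Haiku split. The article does not state the split; this one happens to land near the table's ~$0.053.

```python
TEAM_SIZE = 5
REQUESTS_PER_DEV_PER_DAY = 200
WORKING_DAYS_PER_MONTH = 22  # assumption implied by the 22,000/month figure


def monthly_requests() -> int:
    """Total team requests per month."""
    return TEAM_SIZE * REQUESTS_PER_DEV_PER_DAY * WORKING_DAYS_PER_MONTH


def blended_cost(costs_and_shares: list[tuple[float, float]]) -> float:
    """Weighted average cost per request; traffic shares must sum to 1."""
    total_share = sum(share for _, share in costs_and_shares)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(cost * share for cost, share in costs_and_shares)


def monthly_bill(cost_per_request: float) -> float:
    """Monthly spend in USD at a given cost per request."""
    return cost_per_request * monthly_requests()


if __name__ == "__main__":
    # Hypothetical split: 20% of requests to Opus, 80% to Haiku.
    mixed = blended_cost([(0.147, 0.20), (0.030, 0.80)])
    print(f"mixed cost/request: ${mixed:.4f}")
    print(f"all-Opus monthly:   ${monthly_bill(0.147):,.0f}")
    print(f"mixed monthly:      ${monthly_bill(mixed):,.0f}")
```

A 20/80 split gives $0.0534 per request, close to the table's ~$0.053. Shifting the share toward Opus raises the blended cost linearly, which makes the split itself the main lever on the bill.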
What Cursor's Transparency Signals for the Industry
Cursor publishing cost data alongside quality scores is a competitive move. It signals that they expect developers to optimize — and that Cursor's routing layer can help. Expect other AI coding tools to follow with similar transparency, because the alternative is losing cost-conscious users to tools that prove their value.
For developers, this data enables a conversation that was previously impossible: going to your engineering manager with concrete cost-per-quality tradeoffs rather than vague claims about AI productivity. "We can save $15K/year by using Sonnet instead of Opus, with only a 5-point drop in first-attempt success rate" is a sentence that gets budget approved.
Practical Recommendations
Run your own evals. Cursor's benchmarks test general coding tasks. Your codebase is specific. A model that scores 87% overall might score 95% on your TypeScript React code and 60% on your Rust systems code. Build a small eval suite (~20 representative tasks from your actual work) and test models against it.
Set up model routing. Don't use one model for everything. Use Opus or Sonnet for complex multi-file changes, architecture decisions, and bug diagnosis. Use Haiku or DeepSeek for autocomplete, test generation, and boilerplate. The quality difference on simple tasks is negligible; going by the per-task estimates above, the cost difference ranges from about 3x (Sonnet vs. Haiku) to roughly 150x (Opus vs. DeepSeek).
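A routing layer can start as a plain lookup table. The sketch below is one possible shape, not a description of Cursor's router: the task categories come from the paragraph above, while `TaskKind`, the tier labels, and `route` are hypothetical names rather than real API identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    MULTI_FILE_CHANGE = "multi_file_change"
    ARCHITECTURE = "architecture"
    BUG_DIAGNOSIS = "bug_diagnosis"
    AUTOCOMPLETE = "autocomplete"
    TEST_GENERATION = "test_generation"
    BOILERPLATE = "boilerplate"


# Hypothetical tier labels; map them to concrete models in your own config.
PREMIUM = "premium"  # e.g. Opus or Sonnet
BUDGET = "budget"    # e.g. Haiku or DeepSeek

ROUTES = {
    TaskKind.MULTI_FILE_CHANGE: PREMIUM,
    TaskKind.ARCHITECTURE: PREMIUM,
    TaskKind.BUG_DIAGNOSIS: PREMIUM,
    TaskKind.AUTOCOMPLETE: BUDGET,
    TaskKind.TEST_GENERATION: BUDGET,
    TaskKind.BOILERPLATE: BUDGET,
}


def route(kind: TaskKind, default: str = PREMIUM) -> str:
    """Return the tier for a task kind; unmapped kinds use `default`."""
    return ROUTES.get(kind, default)


if __name__ == "__main__":
    for kind in TaskKind:
        print(f"{kind.value:<18} -> {route(kind)}")
```

Defaulting unmapped tasks to the premium tier is a deliberate choice here: a misrouted simple task wastes a little money, while a misrouted complex task can waste debugging time.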
Track your actual spend per task type. Most teams have no visibility into which tasks consume the most tokens. Instrument your usage. You'll likely find that 20% of your tasks (agent-driven refactors, large context searches) consume 80% of your budget — and those are the tasks where model selection matters most.
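Instrumentation can be as simple as logging token counts per request and aggregating by task type. The following is a minimal sketch under stated assumptions: `UsageRecord` and its fields are hypothetical, and you would fill them from whatever usage data your provider or tooling actually exposes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    task_type: str     # e.g. "agent_refactor", "autocomplete"
    input_tokens: int
    output_tokens: int
    in_price: float    # USD per million input tokens
    out_price: float   # USD per million output tokens


def record_cost(r: UsageRecord) -> float:
    """USD cost of a single logged request."""
    return (r.input_tokens * r.in_price + r.output_tokens * r.out_price) / 1_000_000


def spend_by_task_type(records: list[UsageRecord]) -> dict[str, float]:
    """Total USD spend per task type."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.task_type] += record_cost(r)
    return dict(totals)


def spend_share(records: list[UsageRecord]) -> list[tuple[str, float]]:
    """(task_type, fraction of total spend), largest share first."""
    totals = spend_by_task_type(records)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    shares = [(task, cost / grand_total) for task, cost in totals.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    log = [
        UsageRecord("agent_refactor", 50_000, 10_000, 5.00, 25.00),
        UsageRecord("autocomplete", 1_000, 200, 1.00, 5.00),
        UsageRecord("autocomplete", 1_000, 200, 1.00, 5.00),
    ]
    for task, share in spend_share(log):
        print(f"{task:<16} {share:.1%}")
```

Sorting by share puts the task types that dominate the bill first, which is where a change of model pays off most.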
The era of "just use the most expensive model" is ending. Cursor's cost transparency is the beginning of a more mature market where developers make informed price-performance decisions — the same way teams already choose between EC2 instance types or database tiers. The cheapest model that passes your eval is the right choice. Now you have the data to prove it.