Senior SWE-Bench: Claude Opus 4.8 Leads at 24% — The Cost per Successful Task Math
By Eric Bush · July 4, 2026 · 9 min read
What Senior SWE-Bench Actually Measures
Senior SWE-Bench, released this week as an open-source benchmark, is designed to measure whether an AI coding agent can operate at the level of a senior engineer, not just a mid-level one. It splits into two task categories:
- Feature development. Instructions arrive as a natural-language message from a hypothetical PM, with no formal spec. A verifier agent, seeded with expert acceptance criteria, generates behavioral tests that grade the agent's output.
- Bug fixing. The agent gets a bug report plus runtime artifacts: logs, profiling traces, sometimes crash dumps. It must investigate, diagnose, and produce a fix. Reading the code alone is not enough.
Both are meaningfully harder than the classic SWE-Bench setup, which supplies a well-scoped natural-language description of the exact bug and its location. That gap is the point.
The Leaderboard
| Model + Harness | Pass rate | Effort tier |
|---|---|---|
| Claude Opus 4.8 + Mini-SWE-Agent | 24.0% | max effort |
| Claude Sonnet 5 + Mini-SWE-Agent | 19.4% | max effort |
| GPT-5.5 + Mini-SWE-Agent | 16.0% | max effort |
Every frontier model in this bench fails at least 75% of senior-level tasks. That is the number you should hold in your head when a vendor tells you their agent can replace a senior engineer.
What "max effort" Costs
The Mini-SWE-Agent max-effort configuration allows extensive tool calling, retries, and long trajectories. On Opus 4.8 that translates to real token spend per task:
- Median tokens per attempt: ~800K input + ~120K output.
- At Opus pricing (~$15/M input, ~$75/M output, with some input tokens cached): approximately $18-$22 per task attempt.
- Pass rate 24% means cost per successful task ≈ $75-$90.
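The arithmetic behind these three bullets can be checked with a minimal sketch. It uses the approximate token counts and list prices quoted above and ignores any caching discount, so it lands near the top of the stated range; the function names are illustrative, not part of any benchmark tooling.

```python
# Approximate figures taken from the bullets above.
INPUT_TOKENS = 800_000
OUTPUT_TOKENS = 120_000
INPUT_PRICE_PER_M = 15.0   # USD per million input tokens
OUTPUT_PRICE_PER_M = 75.0  # USD per million output tokens
PASS_RATE = 0.24


def cost_per_attempt(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """List-price cost of one attempt in USD, ignoring caching discounts."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


def cost_per_success(attempt_cost, pass_rate):
    """Expected spend per passing task if every attempt costs the same."""
    return attempt_cost / pass_rate


attempt = cost_per_attempt(
    INPUT_TOKENS, OUTPUT_TOKENS, INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M
)
print(f"cost per attempt: ${attempt:.2f}")
print(f"cost per success: ${cost_per_success(attempt, PASS_RATE):.2f}")
```

With these inputs the sketch gives $21.00 per attempt and $87.50 per success, both inside the ranges quoted above.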
For comparison, a senior engineer's fully loaded cost in the US runs $150-$220/hour, so a task that takes them 30-60 minutes costs roughly $75-$220. The AI is competitive on cost only when it succeeds, and expensive relative to a senior's hourly rate when it fails and requires human takeover.
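One way to make the human-takeover comparison concrete is a rough expected-cost sketch. It assumes, as a simplification not stated in the benchmark, that a failed attempt is followed by a human doing the whole task at full cost and that successful outputs need no review time. Under those assumptions, trying the agent first only pays off when the human cost of the task exceeds the attempt cost divided by the pass rate.

```python
def ai_first_expected_cost(attempt_cost, pass_rate, human_cost):
    """Expected cost of one AI attempt followed by human takeover on failure."""
    return attempt_cost + (1.0 - pass_rate) * human_cost


def break_even_human_cost(attempt_cost, pass_rate):
    """Human task cost above which trying the agent first is cheaper on average."""
    return attempt_cost / pass_rate


ATTEMPT_COST = 20.0  # USD, midpoint of the $18-$22 range
PASS_RATE = 0.24

# 30 minutes at $150/hour and 60 minutes at $220/hour bracket the senior-engineer range.
for human_cost in (75.0, 220.0):
    ai_first = ai_first_expected_cost(ATTEMPT_COST, PASS_RATE, human_cost)
    print(f"human-only ${human_cost:.0f} vs AI-first ${ai_first:.2f}")

print(f"break-even human cost: ${break_even_human_cost(ATTEMPT_COST, PASS_RATE):.2f}")
```

For the cheapest case (a $75 task) the AI-first path averages $77, slightly worse than going straight to the engineer. For the most expensive case ($220) it averages about $187. The break-even sits near $83 of human time per task.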
Sonnet vs Opus on This Bench
Sonnet 5 hits 19.4% at roughly 1/5 the input cost per attempt of Opus 4.8. That changes the cost-per-successful-task arithmetic significantly:
| Model | Cost/attempt | Pass rate | Cost/success |
|---|---|---|---|
| Opus 4.8 | ~$20 | 24.0% | ~$83 |
| Sonnet 5 | ~$4 | 19.4% | ~$21 |
| GPT-5.5 | ~$6 | 16.0% | ~$38 |
Sonnet 5 is the clear cost-per-outcome winner on this benchmark. Opus 4.8's higher pass rate (24.0% versus 19.4%) does not compensate for its roughly 5x cost per attempt. The one place Opus still wins is on tasks with expensive-to-detect regressions, where an incorrect Sonnet solution can cost more in downstream rework than the Opus premium.
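The cost-per-success column is simply cost per attempt divided by pass rate. A short sketch using the table's rounded figures reproduces the ranking and the size of the Opus premium; the dictionary layout and variable names are illustrative.

```python
# (cost per attempt in USD, pass rate), rounded figures from the table above.
MODELS = {
    "Opus 4.8": (20.0, 0.240),
    "Sonnet 5": (4.0, 0.194),
    "GPT-5.5": (6.0, 0.160),
}


def cost_per_success(cost_per_attempt, pass_rate):
    """Expected spend per passing task at a fixed cost per attempt."""
    return cost_per_attempt / pass_rate


# Cheapest cost per success first.
ranked = sorted(
    ((name, cost_per_success(cost, rate)) for name, (cost, rate) in MODELS.items()),
    key=lambda row: row[1],
)
for name, cps in ranked:
    print(f"{name:<10} ${cps:6.2f} per success")

by_name = dict(ranked)
premium = by_name["Opus 4.8"] - by_name["Sonnet 5"]
print(f"Opus premium per success: ${premium:.2f}")
```

Sorted this way, Sonnet 5 comes out near $21, GPT-5.5 near $38, and Opus 4.8 near $83, matching the table. The gap of roughly $63 per success is the premium that avoided rework would need to cover for Opus to come out ahead.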
The 75% Failure Tail — What Actually Fails
Reading the failed transcripts, three patterns dominate:
- Ambiguous acceptance criteria. The agent implements a plausible interpretation, but the verifier expected a different one. This mirrors what happens with a real PM ticket.
- Multi-hop diagnosis. Bug fixes that require correlating logs from three services, or reading a profiler flame graph, still lose the agent partway through the chain.
- Silent hidden dependencies. Changes that pass tests but break a downstream consumer not mentioned in the ticket. Senior engineers catch these from experience; agents do not.
Budget Implications
Three concrete adjustments to any team-level AI coding budget:
- Do not budget "one attempt = one done." At a 24% pass rate, independent retries average about four attempts per success (1 / 0.24 ≈ 4.2), even with Opus.
- Reserve Opus for the tasks Sonnet fails. Route the rest through Sonnet, keep human review on Opus outputs, and count human time in your total-cost calculation.
- Add a rework line item. If your team's AI coding output failure rate is 75%+, allocate downstream engineer time in the budget, not just API tokens.
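These adjustments can be folded into a rough budget estimate. The sketch below is one possible model, not a method from the benchmark: it assumes retries on the same task succeed independently at the published pass rate (real retries are likely correlated, so the unsolved share is probably higher), and the retry cap, rework hours, and task count are hypothetical inputs.

```python
from dataclasses import dataclass


@dataclass
class BudgetInputs:
    tasks: int                       # senior-level tasks in the planning period
    cost_per_attempt: float          # USD per agent attempt
    pass_rate: float                 # per-attempt success probability
    max_attempts: int                # retry cap before a human takes over
    rework_hours_per_failure: float  # engineer hours per unsolved task
    engineer_hourly_rate: float      # fully loaded USD per hour


def estimate_budget(b: BudgetInputs) -> dict:
    """Split the expected budget into API spend and human rework."""
    # Chance a task is still unsolved after every allowed attempt has failed.
    p_unsolved = (1.0 - b.pass_rate) ** b.max_attempts
    # Expected attempts actually used under the cap: (1 - (1 - p)^n) / p.
    expected_attempts = (1.0 - p_unsolved) / b.pass_rate
    api_cost = b.tasks * expected_attempts * b.cost_per_attempt
    rework_cost = (
        b.tasks * p_unsolved * b.rework_hours_per_failure * b.engineer_hourly_rate
    )
    return {
        "expected_attempts": expected_attempts,
        "unsolved_fraction": p_unsolved,
        "api_cost": api_cost,
        "rework_cost": rework_cost,
        "total_cost": api_cost + rework_cost,
    }


# Hypothetical plan: 100 tasks on Sonnet 5, up to 4 attempts each,
# one hour of senior rework at $150/hour for every task left unsolved.
plan = BudgetInputs(100, 4.0, 0.194, 4, 1.0, 150.0)
for key, value in estimate_budget(plan).items():
    print(f"{key}: {value:.2f}")
```

In this hypothetical plan, rework (about $6,300) outweighs API spend (about $1,200) by a wide margin, which is why a token-only budget understates the real cost.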
Bottom Line
Senior SWE-Bench is more expensive to run than classic SWE-Bench but produces more honest numbers. For anyone budgeting a coding agent to do senior-engineer work, treat 24% as your ceiling, not your average. Prefer Sonnet-first workflows for cost-per-outcome, and reserve Opus for the tail of hardest tasks.
Frequently Asked Questions
What is Senior SWE-Bench and how is it different?
It is a new open-source benchmark grading AI coding agents on senior-engineer-level tasks: feature development with hidden behavioral tests instead of formal specs, and bug fixing that requires investigation from logs and profiling traces. Both are harder than classic SWE-Bench because they lack a well-scoped natural-language problem statement.
What is Claude Opus 4.8's Senior SWE-Bench score?
24.0% with Mini-SWE-Agent at max effort, topping the leaderboard. Sonnet 5 scored 19.4% and GPT-5.5 scored 16.0% under the same harness. All frontier models fail at least 75% of tasks.
How much does one Opus 4.8 attempt cost on this benchmark?
Roughly $18-$22 per attempt at max effort (median ~800K input tokens + 120K output tokens per task at Opus pricing). At a 24% pass rate that works out to roughly $75-$90 per successful task.
Is Opus 4.8 or Sonnet 5 better cost-per-outcome on Senior SWE-Bench?
Sonnet 5 wins on cost-per-outcome at approximately $21 per successful task versus Opus 4.8 at $83. Opus is worth its premium only for tasks with expensive-to-detect regressions, where downstream rework from an incorrect Sonnet answer costs more than the Opus surcharge.
What kinds of tasks does even Opus 4.8 fail on?
Three dominant patterns: ambiguous acceptance criteria where the agent picks the wrong plausible interpretation, multi-hop diagnosis across services or profiler traces, and silent hidden dependencies where a passing patch breaks a downstream consumer not mentioned in the ticket.