Senior SWE-Bench: Claude Opus 4.8 Leads at 24% — The Cost per Successful Task Math

By Eric Bush · July 4, 2026 · 9 min read

What Senior SWE-Bench Actually Measures

Senior SWE-Bench, released this week as an open-source benchmark, is designed to measure whether an AI coding agent can operate at the level of a senior engineer, not just a mid-level one. It splits into two task categories:

Feature development. Instructions arrive in the shape of a natural-language message from a hypothetical PM. There is no formal spec. A verifier agent, seeded with expert acceptance criteria, generates behavioral tests that grade the agent's output.
Bug fixing. The agent gets a bug report plus runtime artifacts: logs, profiling traces, sometimes crash dumps. It must investigate, diagnose, and produce a fix. Reading the code alone is not enough.

Both are meaningfully harder than the classic SWE-Bench setup, which supplies a well-scoped natural-language description of the exact bug and its location. That gap is the point.

The Leaderboard

Model + Harness	Pass rate	Effort tier
Claude Opus 4.8 + Mini-SWE-Agent	24.0%	max effort
Claude Sonnet 5 + Mini-SWE-Agent	19.4%	max effort
GPT-5.5 + Mini-SWE-Agent	16.0%	max effort

Every frontier model on this benchmark fails at least 75% of senior-level tasks: 76% for Opus 4.8, 80.6% for Sonnet 5, and 84% for GPT-5.5. That is the number you should hold in your head when a vendor tells you their agent can replace a senior engineer.

What "max effort" Costs

The Mini-SWE-Agent max-effort configuration allows extensive tool calling, retries, and long trajectories. On Opus 4.8 that translates to real token spend per task:

Median tokens per attempt: ~800K input + ~120K output.
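At Opus pricing (~$15/M input, ~$75/M output, with caching): approximately $18-$22 per task attempt. At those rates the median attempt works out to 0.8M × $15 + 0.12M × $75 = $21.
At a 24% pass rate, cost per successful task is the attempt cost divided by the pass rate: $18/0.24 to $22/0.24, or roughly $75-$90.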
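The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the approximate figures quoted in this section; the function and constant names are illustrative, and the prices are the article's rough list rates with no caching discount applied.

```python
# Approximate figures quoted in the article; none of them are exact.
INPUT_TOKENS = 800_000       # median input tokens per attempt
OUTPUT_TOKENS = 120_000      # median output tokens per attempt
INPUT_PRICE_PER_M = 15.0     # USD per million input tokens
OUTPUT_PRICE_PER_M = 75.0    # USD per million output tokens
PASS_RATE = 0.24             # Opus 4.8 pass rate on the benchmark


def cost_per_attempt(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Token spend for a single attempt, in USD."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


def cost_per_success(attempt_cost, pass_rate):
    """Expected spend per passing task: attempt cost divided by pass rate."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return attempt_cost / pass_rate


attempt = cost_per_attempt(
    INPUT_TOKENS, OUTPUT_TOKENS, INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M
)
success = cost_per_success(attempt, PASS_RATE)

print(f"cost per attempt: ${attempt:.2f}")
print(f"cost per success: ${success:.2f}")
```

Dividing by the pass rate treats every failed attempt as pure overhead that the successful ones must absorb, which is why a modest per-attempt price grows roughly fourfold at a 24% pass rate.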

For comparison, a senior engineer's fully loaded cost in the US runs $150-$220/hour, so a task that takes them 30-60 minutes costs $75-$220. The AI is competitive on cost only when it succeeds, and it is expensive relative to a senior's hourly rate when it fails and requires human takeover.
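One way to make the takeover penalty concrete is to compare a human-only workflow with an AI-first one. The sketch below rests on simplifying assumptions that are not from the benchmark: the agent gets exactly one attempt, a failed attempt saves the engineer no time, and reviewing a successful attempt is free. Under those assumptions the AI-first route is cheaper only when the human cost of the task exceeds the attempt cost divided by the pass rate.

```python
ATTEMPT_COST = 20.0  # USD per Opus 4.8 attempt, midpoint of the article's $18-$22 range
PASS_RATE = 0.24     # Opus 4.8 pass rate on the benchmark


def human_task_cost(hourly_rate, hours):
    """Fully loaded cost of an engineer doing the task unaided."""
    return hourly_rate * hours


def ai_first_expected_cost(attempt_cost, pass_rate, human_cost):
    """One AI attempt, then a full human redo whenever the attempt fails (assumption)."""
    return attempt_cost + (1.0 - pass_rate) * human_cost


def break_even_human_cost(attempt_cost, pass_rate):
    """Human task cost above which the AI-first route is cheaper in expectation."""
    return attempt_cost / pass_rate


for rate in (150.0, 220.0):      # USD/hour, the article's range
    for hours in (0.5, 1.0):     # 30 and 60 minute tasks
        human = human_task_cost(rate, hours)
        ai_first = ai_first_expected_cost(ATTEMPT_COST, PASS_RATE, human)
        cheaper = "AI-first" if ai_first < human else "human-only"
        print(f"${rate:.0f}/h x {hours}h: human ${human:.2f}, AI-first ${ai_first:.2f} -> {cheaper}")

print(f"break-even human task cost: ${break_even_human_cost(ATTEMPT_COST, PASS_RATE):.2f}")
```

The break-even figure equals the cost per successful task, so under these assumptions short tasks at the low end of the salary range do not justify an Opus attempt at all. Adding review time or a partial-credit discount for failed attempts would shift the threshold.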

Sonnet vs Opus on This Bench

Sonnet 5 hits 19.4% at roughly one-fifth of Opus 4.8's cost per attempt. That changes the cost-per-successful-task arithmetic significantly:

Model	Cost/attempt	Pass rate	Cost/success
Opus 4.8	~$20	24.0%	~$83
Sonnet 5	~$4	19.4%	~$21
GPT-5.5	~$6	16.0%	~$38

Sonnet 5 is the clear cost-per-outcome winner on this benchmark. Opus 4.8 passes about 1.24x as many tasks as Sonnet 5 but costs 5x as much per attempt, so its higher pass rate does not compensate. The one place Opus still wins is on tasks with expensive-to-detect regressions, where an incorrect Sonnet solution can cost more in downstream rework than the Opus premium.
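The table's last column can be recomputed and ranked with a short script. This is a sketch built on the rounded table values; the `Entry` class and `pass_rate_to_match` helper are illustrative names, not part of any benchmark tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    cost_per_attempt: float  # USD, rounded value from the table
    pass_rate: float         # fraction of tasks passed

    @property
    def cost_per_success(self) -> float:
        return self.cost_per_attempt / self.pass_rate


ENTRIES = [
    Entry("Opus 4.8", 20.0, 0.240),
    Entry("Sonnet 5", 4.0, 0.194),
    Entry("GPT-5.5", 6.0, 0.160),
]


def pass_rate_to_match(entry: Entry, target_cost_per_success: float) -> float:
    """Pass rate `entry` would need, at its current attempt cost, to hit the target."""
    return entry.cost_per_attempt / target_cost_per_success


# Rank models from cheapest to most expensive per successful task.
ranked = sorted(ENTRIES, key=lambda e: e.cost_per_success)
for e in ranked:
    print(f"{e.model:<10} ${e.cost_per_success:6.2f} per success")

cheapest = ranked[0]
opus = ENTRIES[0]
needed = pass_rate_to_match(opus, cheapest.cost_per_success)
print(f"Opus 4.8 would need a {needed:.0%} pass rate to match {cheapest.model}")
```

At roughly $20 per attempt, Opus would need to pass almost every task to match Sonnet's cost per success, which is another way of stating how far a 4.6-point pass-rate lead is from offsetting a 5x price gap.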

The 75% Failure Tail — What Actually Fails

Reading the failed transcripts, three patterns dominate:

Ambiguous acceptance criteria. The agent implements a plausible interpretation, but the verifier expected a different one. This mirrors what happens with a real PM ticket.
Multi-hop diagnosis. Bug fixes that require correlating logs from three services, or reading a profiler flame graph, still lose the agent partway through the chain.
Silent hidden dependencies. Changes that pass tests but break a downstream consumer not mentioned in the ticket. Senior engineers catch these from experience; agents do not.

Budget Implications

Three concrete adjustments to any team-level AI coding budget:

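Do not budget "one attempt = one done." At a 24% pass rate, expect about four attempts per completed senior-level task even with Opus (1 / 0.24 ≈ 4.2, assuming attempts are independent), and about five with Sonnet.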
Run Sonnet first and reserve Opus for the tasks Sonnet fails, which is roughly 80% of this benchmark. Keep human review on Opus outputs, and count human time in your total-cost calculation.
Add a rework line item. If your team's AI coding output failure rate is 75%+, allocate downstream engineer time in the budget, not just API tokens.
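These adjustments can be folded into a per-task budget estimate. The sketch below assumes attempts are independent with a fixed pass rate, which real retries on the same task almost certainly are not, and it uses a placeholder rework cost of $150 that is purely hypothetical. It caps retries at `max_attempts` and charges the rework cost whenever every attempt fails.

```python
def expected_attempts(pass_rate, max_attempts):
    """Mean attempts used when retrying until success, capped at max_attempts.

    Assumes independent attempts: E[min(T, k)] = (1 - (1 - p)^k) / p.
    """
    fail = 1.0 - pass_rate
    return (1.0 - fail ** max_attempts) / pass_rate


def success_probability(pass_rate, max_attempts):
    """Chance that at least one of max_attempts independent attempts passes."""
    return 1.0 - (1.0 - pass_rate) ** max_attempts


def task_budget(attempt_cost, pass_rate, max_attempts, rework_cost):
    """Expected USD per task: token spend plus human rework when all attempts fail."""
    tokens = expected_attempts(pass_rate, max_attempts) * attempt_cost
    rework = (1.0 - success_probability(pass_rate, max_attempts)) * rework_cost
    return tokens + rework


REWORK_COST = 150.0  # hypothetical: one engineer-hour to take over a failed task

for model, attempt_cost, pass_rate in (("Opus 4.8", 20.0, 0.24), ("Sonnet 5", 4.0, 0.194)):
    for cap in (1, 4):
        budget = task_budget(attempt_cost, pass_rate, cap, REWORK_COST)
        solved = success_probability(pass_rate, cap)
        print(f"{model}, cap {cap}: ${budget:.2f} per task, {solved:.0%} solved by the agent")
```

Multiplying the per-task figure by the expected task count gives a budget line that includes both API tokens and downstream engineer time. Because correlated failures make retries less effective than this model assumes, the result is better read as a lower bound.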

Bottom Line

Senior SWE-Bench is more expensive to run than classic SWE-Bench but produces more honest numbers. For anyone budgeting a coding agent to do senior-engineer work, treat 24% as your ceiling, not your average. Prefer Sonnet-first workflows for cost-per-outcome, and reserve Opus for the tail of hardest tasks.

Frequently Asked Questions

What is Senior SWE-Bench and how is it different?

It is a new open-source benchmark grading AI coding agents on senior-engineer-level tasks: feature development with hidden behavioral tests instead of formal specs, and bug fixing that requires investigation from logs and profiling traces. Both are harder than classic SWE-Bench because they lack a well-scoped natural-language problem statement.

What is Claude Opus 4.8's Senior SWE-Bench score?

24.0% with Mini-SWE-Agent at max effort, topping the leaderboard. Sonnet 5 scored 19.4% and GPT-5.5 scored 16.0% under the same harness. All frontier models fail at least 75% of tasks.

How much does one Opus 4.8 attempt cost on this benchmark?

Roughly $18-$22 per attempt at max effort (median ~800K input tokens + 120K output tokens per task at Opus pricing). At a 24% pass rate that works out to roughly $75-$90 per successful task.

Is Opus 4.8 or Sonnet 5 better cost-per-outcome on Senior SWE-Bench?

Sonnet 5 wins on cost-per-outcome at approximately $21 per successful task versus approximately $83 for Opus 4.8. Opus is worth its premium only for tasks with expensive-to-detect regressions, where downstream rework from an incorrect Sonnet answer costs more than the Opus surcharge.

What kinds of tasks does even Opus 4.8 fail on?

Three dominant patterns: ambiguous acceptance criteria where the agent picks the wrong plausible interpretation, multi-hop diagnosis across services or profiler traces, and silent hidden dependencies where a passing patch breaks a downstream consumer not mentioned in the ticket.
