
Senior SWE-Bench: Claude Opus 4.8 Leads at 24% — The Cost per Successful Task Math

By Eric Bush · July 4, 2026 · 9 min read


What Senior SWE-Bench Actually Measures

Senior SWE-Bench, released this week as an open-source benchmark, is designed to measure whether an AI coding agent can operate at the level of a senior engineer, not just a mid-level one. It splits into two task categories:

  • Feature development. Instructions arrive in the shape of a natural-language message from a hypothetical PM. There is no formal spec. A verifier agent, seeded with expert acceptance criteria, generates behavioral tests that grade the agent's output.
  • Bug fixing. The agent gets a bug report plus runtime artifacts: logs, profiling traces, sometimes crash dumps. It must investigate, diagnose, and produce a fix. Reading the code alone is not enough.

Both are meaningfully harder than the classic SWE-Bench setup, which supplies a well-scoped natural-language description of the exact bug and its location. That gap is the point.

The Leaderboard

Model + Harness                    | Pass rate | Effort tier
Claude Opus 4.8 + Mini-SWE-Agent   | 24.0%     | max effort
Claude Sonnet 5 + Mini-SWE-Agent   | 19.4%     | max effort
GPT-5.5 + Mini-SWE-Agent           | 16.0%     | max effort

Every frontier model in this bench fails at least 75% of senior-level tasks. That is the number you should hold in your head when a vendor tells you their agent can replace a senior engineer.

What "max effort" Costs

The Mini-SWE-Agent max-effort configuration allows extensive tool calling, retries, and long trajectories. On Opus 4.8 that translates to real token spend per task:

  • Median tokens per attempt: ~800K input + ~120K output.
  • At Opus pricing (~$15/M input, ~$75/M output, with caching): approximately $18-$22 per task attempt.
  • At a 24% pass rate, cost per successful task (cost per attempt ÷ pass rate) ≈ $75-$90.
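The arithmetic behind those bullets can be sketched in a few lines. This is a minimal illustration using the list-price figures quoted above and ignoring any caching discount; the function and constant names are my own, not part of any benchmark tooling.

```python
# Figures quoted above; prices are approximate and ignore any caching discount.
MEDIAN_INPUT_TOKENS = 800_000
MEDIAN_OUTPUT_TOKENS = 120_000
INPUT_USD_PER_MTOK = 15.0
OUTPUT_USD_PER_MTOK = 75.0
PASS_RATE = 0.24


def cost_per_attempt(input_tokens: int, output_tokens: int,
                     input_usd_per_mtok: float, output_usd_per_mtok: float) -> float:
    """Token spend for one attempt, in USD."""
    return (input_tokens / 1_000_000 * input_usd_per_mtok
            + output_tokens / 1_000_000 * output_usd_per_mtok)


def cost_per_success(attempt_cost: float, pass_rate: float) -> float:
    """Expected spend per passing task, assuming independent retries."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return attempt_cost / pass_rate


attempt = cost_per_attempt(MEDIAN_INPUT_TOKENS, MEDIAN_OUTPUT_TOKENS,
                           INPUT_USD_PER_MTOK, OUTPUT_USD_PER_MTOK)
print(f"per attempt: ${attempt:.2f}")                                # per attempt: $21.00
print(f"per success: ${cost_per_success(attempt, PASS_RATE):.2f}")   # per success: $87.50
```

The uncached figure of $21 sits inside the $18-$22 range, and dividing either end of that range by 0.24 gives roughly $75-$92 per success.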

For comparison, a senior engineer's fully-loaded cost in the US runs $150-$220/hour, so a task that takes them 30-60 minutes costs roughly $75-$220. The AI is competitive on cost only when it succeeds, and expensive relative to a senior's hourly rate when it fails and requires human takeover.
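One way to make that comparison concrete is to ask when a single AI attempt followed by human takeover beats handing the task to the engineer directly. The sketch below is a simplification under stated assumptions: a failed attempt gives the human no head start, detecting the failure costs nothing, and review time is not counted. The $20 attempt cost is the midpoint of the range above, and the function names are illustrative.

```python
ATTEMPT_COST = 20.0   # midpoint of the $18-$22 range
PASS_RATE = 0.24


def ai_first_expected_cost(attempt_cost: float, pass_rate: float, human_cost: float) -> float:
    """One AI attempt; on failure a human does the whole task from scratch."""
    return attempt_cost + (1 - pass_rate) * human_cost


def break_even_human_cost(attempt_cost: float, pass_rate: float) -> float:
    """Human task cost above which trying the AI first is cheaper on average.

    Derived from attempt_cost + (1 - p) * H < H, i.e. H > attempt_cost / p.
    """
    return attempt_cost / pass_rate


# Corners of the range in the text: 30 min at $150/h and 60 min at $220/h.
for hours, rate in [(0.5, 150.0), (1.0, 220.0)]:
    human = hours * rate
    ai_first = ai_first_expected_cost(ATTEMPT_COST, PASS_RATE, human)
    print(f"human ${human:.0f} vs AI-first ${ai_first:.2f}")

print(f"break-even human cost: ${break_even_human_cost(ATTEMPT_COST, PASS_RATE):.2f}")
```

Under these assumptions the break-even point equals the cost per successful task, about $83: below that, going straight to the engineer is cheaper on average.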

Sonnet vs Opus on This Bench

Sonnet 5 hits 19.4% at roughly one-fifth of Opus 4.8's cost per attempt. That changes the cost-per-successful-task arithmetic significantly:

Model    | Cost/attempt | Pass rate | Cost/success
Opus 4.8 | ~$20         | 24.0%     | ~$83
Sonnet 5 | ~$4          | 19.4%     | ~$21
GPT-5.5  | ~$6          | 16.0%     | ~$38
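The last column is just cost per attempt divided by pass rate. As a rough sketch, the snippet below recomputes it from the table's rounded figures, ranks the models, and asks what pass rate Opus would need at $20 per attempt to tie Sonnet. The helper names are illustrative.

```python
ROWS = {
    "Opus 4.8": {"cost_per_attempt": 20.0, "pass_rate": 0.240},
    "Sonnet 5": {"cost_per_attempt": 4.0, "pass_rate": 0.194},
    "GPT-5.5": {"cost_per_attempt": 6.0, "pass_rate": 0.160},
}


def usd_per_success(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend per passing task."""
    return cost_per_attempt / pass_rate


def ranked(rows: dict) -> list:
    """Models sorted from cheapest to most expensive per success."""
    scored = [(name, usd_per_success(**row)) for name, row in rows.items()]
    return sorted(scored, key=lambda pair: pair[1])


def parity_pass_rate(cost_per_attempt: float, target_usd_per_success: float) -> float:
    """Pass rate needed to hit a target cost per success at a given attempt cost."""
    return cost_per_attempt / target_usd_per_success


for name, usd in ranked(ROWS):
    print(f"{name}: ~${usd:.0f} per success")

sonnet = usd_per_success(**ROWS["Sonnet 5"])
needed = parity_pass_rate(ROWS["Opus 4.8"]["cost_per_attempt"], sonnet)
print(f"Opus pass rate needed to tie Sonnet: {needed:.0%}")
```

With these inputs Opus would need a pass rate of about 97% to match Sonnet on cost per success, against the 24% it actually achieves.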

Sonnet 5 is the clear cost-per-outcome winner on this benchmark. Opus 4.8's higher pass rate does not compensate for its 5x pricing. The one place Opus still wins is on tasks with expensive-to-detect regressions, where an incorrect Sonnet solution can cost more in downstream rework than the Opus premium.
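To see where that exception kicks in, one can add an expected rework term to each model's cost per success. The benchmark does not report how often an accepted patch hides a regression, so the 10% and 5% rates below are purely hypothetical placeholders, as are the function names. Treat this as a sketch of the reasoning, not a measurement.

```python
SONNET_USD_PER_SUCCESS = 21.0
OPUS_USD_PER_SUCCESS = 83.0

# HYPOTHETICAL: share of accepted patches that hide a regression. Not benchmark data.
SONNET_REGRESSION_RATE = 0.10
OPUS_REGRESSION_RATE = 0.05


def total_cost(usd_per_success: float, regression_rate: float, rework_cost: float) -> float:
    """Cost per accepted patch plus expected downstream rework."""
    return usd_per_success + regression_rate * rework_cost


def break_even_rework_cost(cheap_usd: float, premium_usd: float,
                           cheap_rate: float, premium_rate: float) -> float:
    """Rework cost above which the premium model is cheaper overall."""
    gap = cheap_rate - premium_rate
    if gap <= 0:
        return float("inf")  # premium model never pays off on rework alone
    return (premium_usd - cheap_usd) / gap


threshold = break_even_rework_cost(SONNET_USD_PER_SUCCESS, OPUS_USD_PER_SUCCESS,
                                   SONNET_REGRESSION_RATE, OPUS_REGRESSION_RATE)
print(f"Opus pays off when a missed regression costs more than ${threshold:,.0f}")
```

With these placeholder rates the threshold is about $1,240 of rework per missed regression. Substituting your own observed rates moves it accordingly, and if the two models regress equally often the premium never pays off on this basis.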

The 75% Failure Tail — What Actually Fails

Reading the failed transcripts, three patterns dominate:

  1. Ambiguous acceptance criteria. The agent implements a plausible interpretation, but the verifier expected a different one. This mirrors what happens with a real PM ticket.
  2. Multi-hop diagnosis. Bug fixes that require correlating logs from three services, or reading a profiler flame graph, still lose the agent partway through the chain.
  3. Silent hidden dependencies. Changes that pass tests but break a downstream consumer not mentioned in the ticket. Senior engineers catch these from experience; agents do not.

Budget Implications

Three concrete adjustments to any team-level AI coding budget:

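  • Do not budget "one attempt = one done." At these pass rates, retrying until success takes 4-5 attempts on average per senior-level task (1/0.24 ≈ 4.2 for Opus, 1/0.194 ≈ 5.2 for Sonnet), assuming attempts are independent.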
  • Reserve Opus for tasks that Sonnet has already failed or that clearly sit in the hardest tail. Route the rest through Sonnet, keep human review on Opus outputs, and count human time in your total-cost calculation.
  • Add a rework line item. If your team's AI coding output failure rate is 75%+, allocate downstream engineer time in the budget, not just API tokens.
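These adjustments can be combined into a small per-task budget model. The sketch below assumes independent attempts, a reliable way to tell when an attempt has passed, and a fixed human fallback cost once a retry cap is reached. The cap of 4 and the $150 fallback are hypothetical inputs, and the function name is my own.

```python
def capped_retry_budget(cost_per_attempt: float, pass_rate: float,
                        max_attempts: int, human_fallback_cost: float) -> dict:
    """Expected per-task spend when retrying up to max_attempts, then handing off."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    fail_rate = 1 - pass_rate
    p_unsolved = fail_rate ** max_attempts
    # Attempt i runs only if the previous i-1 all failed: sum of fail_rate**i for i < cap.
    expected_attempts = (1 - p_unsolved) / pass_rate
    api_usd = cost_per_attempt * expected_attempts
    rework_usd = p_unsolved * human_fallback_cost
    return {
        "expected_attempts": expected_attempts,
        "p_unsolved": p_unsolved,
        "api_usd": api_usd,
        "rework_usd": rework_usd,
        "total_usd": api_usd + rework_usd,
    }


# Opus figures from the article; the cap and fallback cost are HYPOTHETICAL.
budget = capped_retry_budget(cost_per_attempt=20.0, pass_rate=0.24,
                             max_attempts=4, human_fallback_cost=150.0)
for key, value in budget.items():
    print(f"{key}: {value:.2f}")
```

With these inputs about a third of tasks are still unsolved after four attempts, so the rework line (roughly $50 per task) is nearly as large as the API line (roughly $56 per task).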

Bottom Line

Senior SWE-Bench is more expensive to run than classic SWE-Bench but produces more honest numbers. For anyone budgeting a coding agent to do senior-engineer work, treat 24% as your ceiling, not your average. And prefer Sonnet-first workflows for cost-per-outcome, with Opus reserved for the tail of hardest tasks.


Frequently Asked Questions

What is Senior SWE-Bench and how is it different?

It is a new open-source benchmark grading AI coding agents on senior-engineer-level tasks: feature development with hidden behavioral tests instead of formal specs, and bug fixing that requires investigation from logs and profiling traces. Both are harder than classic SWE-Bench because they lack a well-scoped natural-language problem statement.

What is Claude Opus 4.8's Senior SWE-Bench score?

24.0% with Mini-SWE-Agent at max effort, topping the leaderboard. Sonnet 5 scored 19.4% and GPT-5.5 scored 16.0% under the same harness. All frontier models fail at least 75% of tasks.

How much does one Opus 4.8 attempt cost on this benchmark?

Roughly $18-$22 per attempt at max effort (median ~800K input tokens + 120K output tokens per task at Opus pricing). At a 24% pass rate that works out to roughly $75-$90 per successful task.

Is Opus 4.8 or Sonnet 5 better cost-per-outcome on Senior SWE-Bench?

Sonnet 5 wins on cost-per-outcome at approximately $21 per successful task versus Opus 4.8 at $83. Opus is worth its premium only for tasks with expensive-to-detect regressions, where downstream rework from an incorrect Sonnet answer costs more than the Opus surcharge.

What kinds of tasks does even Opus 4.8 fail on?

Three dominant patterns: ambiguous acceptance criteria where the agent picks the wrong plausible interpretation, multi-hop diagnosis across services or profiler traces, and silent hidden dependencies where a passing patch breaks a downstream consumer not mentioned in the ticket.