Claude Opus 4.7 Leads ITBench-AA at 47%: What Enterprise IT Benchmarks Say About Coding Value
May 28, 2026 · 5 min read
The Benchmark That Tests Real Enterprise Work
Artificial Analysis and IBM released ITBench-AA this week, described as the first benchmark specifically designed for enterprise IT agent tasks. Unlike SWE-Bench, which focuses on open-source software bug fixing, ITBench-AA tests 59 tasks requiring agents to investigate Kubernetes event snapshots via shell commands and submit root cause diagnoses for production incidents.
The headline result: no frontier model scored above 50%. Claude Opus 4.7 led at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. For teams evaluating whether premium-priced frontier models justify their cost, these numbers deserve careful analysis rather than panic or dismissal.
Why All Models Scored Below 50%
The sub-50% scores reflect genuine difficulty, not a flaw in the benchmark's design. Kubernetes incident diagnosis requires integrating information across multiple system components, understanding infrastructure state from incomplete logs, and reasoning about failure modes that are often ambiguous even for experienced engineers. These are exactly the tasks where current models struggle: multi-step diagnosis with incomplete information and serious consequences for errors.
The benchmark also revealed a significant difference in how many tool calls different models needed. Models varied by nearly 3x in the number of tool calls required to reach a conclusion on the same task. This matters directly for cost: if each call consumes a similar number of tokens, a model that needs 30 tool calls to answer a question costs roughly three times as much as one that reaches the same answer in 10 calls, even if the final answer quality is identical.
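To make the tool-call arithmetic concrete, here is a minimal sketch. The token counts and the price are hypothetical placeholders, not figures from ITBench-AA, and the function names are illustrative. The second function assumes each call resends the growing transcript with no prompt caching, a case where the gap can be wider than the call count alone suggests.

```python
def flat_run_cost(tool_calls, tokens_per_call, price_per_million_tokens):
    """Cost of one agent run when every tool call bills the same token count."""
    total_tokens = tool_calls * tokens_per_call
    return total_tokens * price_per_million_tokens / 1_000_000


def growing_context_run_cost(tool_calls, base_tokens, tokens_added_per_call,
                             price_per_million_tokens):
    """Cost when call i resends the transcript so far (assumes no prompt caching)."""
    total_tokens = sum(base_tokens + i * tokens_added_per_call
                       for i in range(tool_calls))
    return total_tokens * price_per_million_tokens / 1_000_000


# Hypothetical inputs: 4,000 tokens per call at $10 per million tokens.
short_run = flat_run_cost(10, 4_000, 10.0)
long_run = flat_run_cost(30, 4_000, 10.0)
print(f"flat: {long_run / short_run:.1f}x")

# Hypothetical inputs: 2,000-token starting prompt, 500 tokens added per call.
short_ctx = growing_context_run_cost(10, 2_000, 500, 10.0)
long_ctx = growing_context_run_cost(30, 2_000, 500, 10.0)
print(f"growing context: {long_ctx / short_ctx:.1f}x")
```

Under the flat assumption the 30-call run costs 3.0x the 10-call run. With these growing-context inputs the ratio is about 6.5x, so the 3x figure is better read as a lower bound when transcripts accumulate.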
Cost-Per-Correct-Task: The Right Metric
Raw accuracy scores are a starting point, not the full picture for cost evaluation. The metric that matters for budget decisions is cost-per-correct-task: total API spend divided by the number of tasks completed correctly, or equivalently the average cost per task divided by accuracy.
| Model | Accuracy | Relative cost per task | Cost-per-correct-task index |
|---|---|---|---|
| Claude Opus 4.7 | 47% | Very high | High (premium performance, premium price) |
| GPT-5.5 | 46% | Very high | High (similar accuracy, similar cost) |
| Qwen3.7 Max | 42% | Mid | Moderate (5 points less accurate, significantly cheaper) |
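The difference between 47% and 42% accuracy is 5 percentage points, or about 11% fewer correct answers in relative terms. Depending on how much cheaper a mid-tier model is than Claude Opus 4.7, its cost-per-correct-task could actually be lower. If Qwen3.7 Max costs 80% less per task and gets about 11% fewer tasks right, its cost-per-correct-task is lower by more than three quarters. That holds only when the price gap survives at the task level: a lower per-token price helps less if the cheaper model needs more tool calls to finish.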
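The comparison can be sketched in a few lines. The accuracies come from the table above; the per-task prices ($1.00 and $0.20) are hypothetical stand-ins for "80% cheaper", since the article does not give actual per-task costs, and the function names are illustrative.

```python
def cost_per_correct_task(cost_per_task, accuracy):
    """Average spend needed to obtain one correct result."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task / accuracy


def break_even_cost_ratio(cheaper_accuracy, premium_accuracy):
    """The cheaper model wins on this metric when its per-task cost is
    below this fraction of the premium model's per-task cost."""
    return cheaper_accuracy / premium_accuracy


# Hypothetical per-task prices; accuracies are the ITBench-AA scores.
premium = cost_per_correct_task(1.00, 0.47)
cheaper = cost_per_correct_task(0.20, 0.42)
print(f"premium: ${premium:.2f} per correct task")
print(f"cheaper: ${cheaper:.2f} per correct task")
print(f"break-even cost ratio: {break_even_cost_ratio(0.42, 0.47):.2f}")
```

With these inputs the premium model comes to about $2.13 per correct task and the cheaper one to about $0.48. The break-even ratio of roughly 0.89 means the 42% model only needs to be about 11% cheaper per task to match the 47% model on this metric.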
When the 5-Point Accuracy Gap Actually Matters
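The calculus changes when the cost of an incorrect answer is high. For Kubernetes incident diagnosis in production, a wrong root cause analysis leads to an engineer spending hours on a red herring. Suppose the human cost of a wrong answer is $500 in engineering time and the model API cost is $5 per task. Each percentage point of accuracy is then worth $5 in expected engineering time, so the five-point gap between 47% and 42% is worth about $25 per task, more than the entire API bill. In that situation, optimizing for the cheapest model is counterproductive.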
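A rough way to check this is to add the expected cost of wrong answers to the API cost. The sketch below uses the $500 and $5 figures from the paragraph above; the $1 price for the cheaper model is a hypothetical "80% cheaper" value. It makes the simplifying assumption that every wrong answer costs the full $500 and that a correct answer incurs no further cost.

```python
def expected_cost_per_task(api_cost, accuracy, cost_of_wrong_answer):
    """API spend plus the expected human cost of acting on a wrong diagnosis."""
    return api_cost + (1 - accuracy) * cost_of_wrong_answer


# $5 per task at 47% accuracy versus a hypothetical $1 per task at 42%.
premium = expected_cost_per_task(5.00, 0.47, 500.0)
cheaper = expected_cost_per_task(1.00, 0.42, 500.0)
print(f"premium: ${premium:.2f} expected per task")
print(f"cheaper: ${cheaper:.2f} expected per task")
```

Here the premium model comes out at about $270 per task and the cheaper one at about $291, so the $4 saved on the API call is outweighed by the extra wrong diagnoses.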
A practical framework: use the expensive frontier model when the task is high-stakes, infrequent, and requires high accuracy. Route high-volume, lower-consequence tasks to cheaper models even at slightly lower accuracy. The ITBench-AA results provide a concrete data point for making that decision in enterprise IT operations.
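One way to turn that framework into a routing rule is to pick, per task, the model with the lowest expected total cost. This is a sketch under stated assumptions, not a production router: the model names, per-task prices, and wrong-answer costs are hypothetical, only the accuracies come from ITBench-AA, and task frequency is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    api_cost_per_task: float  # dollars, hypothetical
    accuracy: float           # fraction of tasks diagnosed correctly


def expected_cost(model, cost_of_wrong_answer):
    """API cost plus the expected cost of acting on a wrong answer."""
    return model.api_cost_per_task + (1 - model.accuracy) * cost_of_wrong_answer


def route(models, cost_of_wrong_answer):
    """Return the model with the lowest expected total cost for this task."""
    return min(models, key=lambda m: expected_cost(m, cost_of_wrong_answer))


MODELS = [
    ModelOption("premium", 5.00, 0.47),
    ModelOption("budget", 1.00, 0.42),
]

print(route(MODELS, 500.0).name)  # high-stakes production incident
print(route(MODELS, 20.0).name)   # low-consequence triage task
```

With these inputs the $500 incident routes to the premium model and the $20 task routes to the budget model. The crossover sits where a wrong answer costs about $80, which is the kind of threshold a team would need to estimate from its own incident history.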
What These Benchmarks Can't Tell You
ITBench-AA covers 59 tasks in Kubernetes incident diagnosis. Your actual enterprise IT environment is different. Before using these benchmark results to justify a model choice, test the specific models against a sample of your own real incidents. Benchmark rankings are a starting point for informed evaluation, not a substitute for it. Use the AI Cost Estimator to compare what the top-performing models would cost at your actual task volume before selecting a provider.
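When scoring models on a small sample of your own incidents, it helps to report an interval rather than a single accuracy figure, because small samples are noisy. The sketch below uses the standard Wilson score interval; the 14-of-30 result is a made-up example, not benchmark data.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total
                               + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical result: a model diagnosed 14 of 30 sampled incidents correctly.
low, high = wilson_interval(14, 30)
print(f"accuracy: {14 / 30:.0%}, plausible range: {low:.0%} to {high:.0%}")
```

For 14 correct out of 30, the interval runs from roughly 30% to 64%. A gap of a few points between two models on a sample that size is not enough on its own to separate them.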