Claude Opus 4.7 Leads ITBench-AA at 47%: What Enterprise IT Benchmarks Say About Coding Value

May 28, 2026 · 5 min read

The Benchmark That Tests Real Enterprise Work

Artificial Analysis and IBM released ITBench-AA this week, described as the first benchmark specifically designed for enterprise IT agent tasks. Unlike SWE-Bench, which focuses on open-source software bug fixing, ITBench-AA tests 59 tasks requiring agents to investigate Kubernetes event snapshots via shell commands and submit root cause diagnoses for production incidents.

The headline result: no frontier model scored above 50%. Claude Opus 4.7 led at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. For teams evaluating whether premium-priced frontier models justify their cost, these numbers deserve careful analysis rather than panic or dismissal.

Why All Models Scored Below 50%

The sub-50% scores reflect genuine difficulty, not benchmark design failure. Kubernetes incident diagnosis requires integrating information across multiple system components, understanding infrastructure state from incomplete logs, and reasoning about failure modes that are often ambiguous even for experienced engineers. These are exactly the tasks where current models struggle: multi-step diagnosis with incomplete information and serious consequences for errors.

The benchmark also revealed a significant difference in how many reasoning rounds different models required. Models varied by nearly 3x in the number of tool calls needed to reach a conclusion on the same task. This matters directly for cost: if each call consumes a similar number of tokens, a model that needs 30 tool calls to answer a question costs roughly 3x as much as one that reaches the same answer in 10 calls, even if the final answer quality is identical.
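The arithmetic behind that comparison can be sketched as follows. This is a minimal illustration, assuming every tool call consumes the same number of tokens; the function name, the 4,000 tokens per call, and the $15 per million tokens are hypothetical placeholders, not figures from ITBench-AA.

```python
def run_cost(tool_calls: int, tokens_per_call: int, price_per_million_tokens: float) -> float:
    """Approximate API cost of one agent run.

    Assumes every tool call uses the same number of tokens, which is a
    simplification made for this sketch.
    """
    return tool_calls * tokens_per_call * price_per_million_tokens / 1_000_000


# Hypothetical token counts and prices, chosen only to show the ratio.
short_run = run_cost(tool_calls=10, tokens_per_call=4_000, price_per_million_tokens=15.0)
long_run = run_cost(tool_calls=30, tokens_per_call=4_000, price_per_million_tokens=15.0)

# Under the flat-tokens assumption the cost ratio equals the tool-call ratio.
ratio = long_run / short_run
print(f"short run: ${short_run:.2f}, long run: ${long_run:.2f}, ratio: {ratio:.1f}x")
```

In many agent loops each call resends the growing conversation history, so the real gap between a 10-call and a 30-call run may be wider than the flat estimate suggests.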

Cost-Per-Correct-Task: The Right Metric

Raw accuracy scores are a starting point, not the full picture for cost evaluation. The metric that matters for budget decisions is cost-per-correct-task: the total API spend required to get one correctly completed task.

Model             Accuracy   Relative cost per task   Cost-per-correct-task index
Claude Opus 4.7   47%        Very high                High (premium performance, premium price)
GPT-5.5           46%        Very high                High (similar accuracy, similar cost)
Qwen3.7 Max       42%        Mid                      Moderate (5 points less accurate, significantly cheaper)

The difference between 47% and 42% accuracy is 5 percentage points, or roughly 11% in relative terms. Depending on how much cheaper a mid-tier model is versus Claude Opus 4.7, the cost-per-correct-task could actually be lower for the less accurate model. If Qwen3.7 Max costs 80% less per token, uses a comparable number of tokens per task, and gives up only about a tenth of the accuracy, its cost per correct task is far lower, provided wrong answers are cheap to catch and retry.
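As a rough worked example, cost-per-correct-task is the cost per attempted task divided by accuracy. The sketch below uses the ITBench-AA accuracies, but the dollar amounts are hypothetical: it assumes $5 per task for the premium model and $1 per task for a model that is 80% cheaper and uses the same number of tokens per task.

```python
def cost_per_correct_task(cost_per_task: float, accuracy: float) -> float:
    """Average API spend needed to obtain one correctly completed task."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in the range (0, 1]")
    return cost_per_task / accuracy


# Accuracies are the benchmark scores; the per-task prices are assumptions.
premium = cost_per_correct_task(cost_per_task=5.00, accuracy=0.47)
cheaper = cost_per_correct_task(cost_per_task=1.00, accuracy=0.42)

print(f"premium: ${premium:.2f} per correct task")
print(f"cheaper: ${cheaper:.2f} per correct task")
```

Under these assumptions the cheaper model comes out at roughly a quarter of the premium model's cost per correct task. The figure counts API spend only and ignores whatever a wrong answer costs downstream.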

When the 5-Point Accuracy Gap Actually Matters

The calculus changes when the cost of an incorrect answer is high. For Kubernetes incident diagnosis in production, a wrong root cause analysis leads to an engineer spending hours on a red herring. If the human cost of a wrong answer is $500 in engineering time and the model API cost is $5 per task, optimizing for the cheapest model is counterproductive.
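The $500 and $5 figures can be folded into an expected total cost per task. This is a minimal sketch, assuming every wrong diagnosis costs the full $500 in engineering time and, hypothetically, that the cheaper model costs $1 per task; the accuracies are the ITBench-AA scores.

```python
def expected_cost_per_task(api_cost: float, accuracy: float, error_cost: float) -> float:
    """API cost plus the expected human cost of a wrong answer."""
    if not 0 <= accuracy <= 1:
        raise ValueError("accuracy must be in the range [0, 1]")
    return api_cost + (1 - accuracy) * error_cost


ERROR_COST = 500.0  # engineering time lost to a wrong root cause, from the text

# The $1 price for the cheaper model is an assumption made for illustration.
premium_total = expected_cost_per_task(api_cost=5.0, accuracy=0.47, error_cost=ERROR_COST)
cheaper_total = expected_cost_per_task(api_cost=1.0, accuracy=0.42, error_cost=ERROR_COST)

print(f"premium: ${premium_total:.2f} expected per task")
print(f"cheaper: ${cheaper_total:.2f} expected per task")
```

With these inputs the premium model works out to about $270 per task against about $291 for the cheaper one: the $4 saved on API spend is outweighed by the extra wrong diagnoses.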

A practical framework: use the expensive frontier model when the task is high-stakes, infrequent, and requires high accuracy. Route high-volume, lower-consequence tasks to cheaper models even at slightly lower accuracy. The ITBench-AA results provide a concrete data point for making that decision in enterprise IT operations.
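One way to make this routing rule concrete is to compute the error cost at which the two options break even. The sketch below is an assumption-laden illustration rather than a prescribed method: the function names and the $5 and $1 per-task prices are hypothetical, and only the accuracies come from the benchmark.

```python
def break_even_error_cost(premium_cost: float, premium_acc: float,
                          cheap_cost: float, cheap_acc: float) -> float:
    """Error cost above which the premium model has the lower expected total cost.

    Derived by setting cost + (1 - accuracy) * error_cost equal for both models.
    """
    if premium_acc <= cheap_acc:
        raise ValueError("premium model must be more accurate for a break-even to exist")
    return (premium_cost - cheap_cost) / (premium_acc - cheap_acc)


def pick_model(error_cost: float, threshold: float) -> str:
    """Route to the premium model only when a wrong answer is costly enough."""
    return "premium" if error_cost > threshold else "cheaper"


# Hypothetical per-task prices; accuracies are the ITBench-AA scores.
threshold = break_even_error_cost(5.0, 0.47, 1.0, 0.42)

print(f"break-even error cost: ${threshold:.2f}")
print(pick_model(error_cost=500.0, threshold=threshold))  # production incident
print(pick_model(error_cost=20.0, threshold=threshold))   # low-stakes triage
```

Under these assumptions the break-even sits near $80 per wrong answer, so a $500 production incident routes to the premium model and a $20 low-stakes check routes to the cheaper one.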

What These Benchmarks Can't Tell You

ITBench-AA covers 59 tasks in Kubernetes incident diagnosis. Your actual enterprise IT environment is different. Before using these benchmark results to justify a model choice, test the specific models against a sample of your own real incidents. Benchmark rankings are a starting point for informed evaluation, not a substitute for it. Use the AI Cost Estimator to compare what the top-performing models would cost at your actual task volume before selecting a provider.
