Claude Opus 4.7 Leads ITBench-AA at 47%: What Enterprise IT Benchmarks Say About Coding Value

By Eric Bush · May 28, 2026 · 5 min read

The Benchmark That Tests Real Enterprise Work

Artificial Analysis and IBM released ITBench-AA this week, described as the first benchmark specifically designed for enterprise IT agent tasks. Unlike SWE-Bench, which focuses on fixing bugs in open-source software, ITBench-AA comprises 59 tasks in which an agent investigates Kubernetes event snapshots via shell commands and submits a root cause diagnosis for a production incident.

The headline result: no frontier model scored above 50%. Claude Opus 4.7 led at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. For teams evaluating whether premium-priced frontier models justify their cost, these numbers deserve careful analysis rather than panic or dismissal.

Why All Models Scored Below 50%

The sub-50% scores reflect genuine difficulty, not benchmark design failure. Kubernetes incident diagnosis requires integrating information across multiple system components, understanding infrastructure state from incomplete logs, and reasoning about failure modes that are often ambiguous even for experienced engineers. These are exactly the tasks where current models struggle: multi-step diagnosis with incomplete information and high consequences for errors.

The benchmark also revealed a significant difference in how many tool calls different models needed. Models varied by nearly 3x in the number of tool calls required to reach a conclusion on the same task. This matters directly for cost: assuming similar token usage per call and the same per-token price, a model that needs 30 tool calls to answer a question costs roughly 3x as much as one that reaches the same answer in 10 calls, even if the final answer quality is identical.
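
The scaling described above can be sketched in a few lines. This is a minimal illustration, assuming every tool call consumes the same number of tokens and both models share one per-token price; the `task_cost` helper and both constants are hypothetical and are not figures from ITBench-AA.

```python
def task_cost(tool_calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Approximate API cost of one task under a fixed tokens-per-call assumption."""
    return tool_calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# Hypothetical figures for illustration only; not taken from ITBench-AA.
TOKENS_PER_CALL = 4_000
USD_PER_MILLION_TOKENS = 15.0

cost_10_calls = task_cost(10, TOKENS_PER_CALL, USD_PER_MILLION_TOKENS)
cost_30_calls = task_cost(30, TOKENS_PER_CALL, USD_PER_MILLION_TOKENS)

print(f"10 calls: ${cost_10_calls:.2f}")
print(f"30 calls: ${cost_30_calls:.2f}")
print(f"ratio: {cost_30_calls / cost_10_calls:.1f}x")
```

The linear model is a simplification. Agent loops often resend earlier context on each call, so real costs may grow faster than the call count alone suggests.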

Cost-Per-Correct-Task: The Right Metric

Raw accuracy scores are a starting point, not the full picture for cost evaluation. The metric that matters for budget decisions is cost-per-correct-task: the total API spend required to get one correctly completed task.

Model	Accuracy	Relative cost per task	Cost-per-correct-task index
Claude Opus 4.7	47%	Very high	High (premium performance, premium price)
GPT-5.5	46%	Very high	High (similar accuracy, similar cost)
Qwen3.7 Max	42%	Mid	Moderate (5 percentage points less accurate, significantly cheaper)

The difference between 47% and 42% accuracy is 5 percentage points, or about 11% in relative terms. Depending on how much cheaper a mid-tier model is than Claude Opus 4.7, its cost-per-correct-task could actually be lower despite the lower accuracy. If Qwen3.7 Max costs 80% less per token, uses a similar number of tokens per task, and gives up only about a tenth of the accuracy, you come out ahead financially on most workloads.
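
To make the comparison concrete, here is a small sketch of the metric, computed as cost per task divided by accuracy. The accuracies come from the table above; the per-task costs are assumptions (the frontier model is normalized to 1.00 and the mid-tier model is taken to be 80% cheaper per task), and `cost_per_correct_task` is a hypothetical helper.

```python
def cost_per_correct_task(cost_per_task: float, accuracy: float) -> float:
    """Average spend needed to obtain one correctly completed task."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in the range (0, 1]")
    return cost_per_task / accuracy


# Accuracies are from ITBench-AA; the costs are assumed, in arbitrary units.
frontier = cost_per_correct_task(cost_per_task=1.00, accuracy=0.47)
mid_tier = cost_per_correct_task(cost_per_task=0.20, accuracy=0.42)

print(f"frontier: {frontier:.2f} units per correct task")
print(f"mid-tier: {mid_tier:.2f} units per correct task")
print(f"frontier costs {frontier / mid_tier:.1f}x as much per correct task")
```

Under these assumed prices the cheaper model wins by a wide margin on this metric alone. The metric does not account for what a wrong answer costs downstream.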

When the 5-Point Accuracy Gap Actually Matters

The calculus changes when the cost of an incorrect answer is high. For Kubernetes incident diagnosis in production, a wrong root cause analysis leads to an engineer spending hours on a red herring. If the human cost of a wrong answer is $500 in engineering time and the model API cost is $5 per task, a 5-point accuracy gain saves about $25 in expected engineering time per task, which is more than the entire API bill. In that situation, optimizing for the cheapest model is counterproductive.
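
The trade-off can be sketched as an expected-cost calculation: API cost plus the probability of a wrong answer times its human cost. The $500 and $5 figures are the article's example; the $1 API cost for the cheaper model is an assumption (80% below $5), and the model treats a correct answer as costing nothing beyond the API call.

```python
def expected_cost_per_task(api_cost_usd: float, accuracy: float,
                           wrong_answer_cost_usd: float) -> float:
    """API cost plus the expected human cost of acting on a wrong diagnosis."""
    return api_cost_usd + (1.0 - accuracy) * wrong_answer_cost_usd


WRONG_ANSWER_COST_USD = 500.0  # engineering time lost to a wrong root cause

# Frontier model: $5 per task at 47% accuracy.
frontier = expected_cost_per_task(5.0, 0.47, WRONG_ANSWER_COST_USD)
# Cheaper model: assumed $1 per task at 42% accuracy.
cheaper = expected_cost_per_task(1.0, 0.42, WRONG_ANSWER_COST_USD)

print(f"frontier: ${frontier:.2f} expected per task")
print(f"cheaper:  ${cheaper:.2f} expected per task")
```

With these inputs the frontier model comes out ahead in expectation even though its API cost is five times higher, because the error term dominates both totals.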

A practical framework: use the expensive frontier model when the task is high-stakes, infrequent, and requires high accuracy. Route high-volume, lower-consequence tasks to cheaper models even at slightly lower accuracy. The ITBench-AA results provide a concrete data point for making that decision in enterprise IT operations.
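
One way to express this framework in code is a simple threshold router. This is a sketch only: the `Task` fields, the `choose_tier` function, and both thresholds are hypothetical and would need tuning against your own incident data. The article's framework does not cover mixed cases such as high-stakes and high-volume tasks, so this sketch defaults those to the cheaper tier.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    error_cost_usd: float  # estimated human cost of acting on a wrong answer
    daily_volume: int      # how many such tasks arrive per day


def choose_tier(task: Task,
                error_cost_threshold_usd: float = 100.0,
                volume_threshold: int = 1_000) -> str:
    """Route high-stakes, infrequent tasks to the frontier model; everything else goes cheaper."""
    high_stakes = task.error_cost_usd >= error_cost_threshold_usd
    infrequent = task.daily_volume < volume_threshold
    return "frontier" if high_stakes and infrequent else "mid-tier"


incident = Task("production-incident-root-cause", error_cost_usd=500.0, daily_volume=20)
triage = Task("routine-log-triage", error_cost_usd=5.0, daily_volume=50_000)

print(incident.name, "->", choose_tier(incident))
print(triage.name, "->", choose_tier(triage))
```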

What These Benchmarks Can't Tell You

ITBench-AA covers 59 tasks in Kubernetes incident diagnosis. Your actual enterprise IT environment is different. Before using these benchmark results to justify a model choice, test the specific models against a sample of your own real incidents. Benchmark rankings are a starting point for informed evaluation, not a substitute for it. Use the AI Cost Estimator to compare what the top-performing models would cost at your actual task volume before selecting a provider.
