
Berkeley's 'Agents' Last Exam': Why 0% on the Hardest Tasks Is a Hidden Token-Cost Problem

June 16, 2026 · 6 min read

A Benchmark Built to Be Failed

Berkeley's Center for Responsible, Decentralized Intelligence (RDI) released Agents' Last Exam (ALE), a benchmark designed to measure how capable AI agents really are on hard, professional, multi-step tasks. The headline result was sobering: on the most difficult tasks, success rates hovered near 0%, even for frontier agents.

Most coverage framed this as a capability story—agents are not as ready as the hype suggests. True, but there is a second story that engineering teams feel directly in their bills: a failed agent run is not a cheap run.

Failure Burns Tokens Too

When an agent attempts a hard task and fails, it does not fail instantly. It plans, calls tools, reads files, retries, second-guesses, and often loops until it hits a step limit. Every one of those iterations consumes input and output tokens at full price. A task that ends in failure can easily cost more than one that succeeds, because the agent thrashes instead of converging.

This inverts the intuition behind most cost estimates. Teams budget for the tokens a successful task consumes and then multiply by task count. But suppose 30% of attempts fail after burning 1.5x the tokens of a success. Out of every 100 attempts, the 70 successes cost 70 units and the 30 failures cost 45 units, so you pay 115 units for 70 completed tasks: roughly 1.64x the naive cost per completed task.
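The gap between the two budgets can be sketched in a few lines of Python. This is a minimal illustration, not a billing model: the function names, the 1,000-attempt batch, and the $0.40 success cost are hypothetical, while the 30% failure rate and 1.5x multiplier come from the example above.

```python
def naive_bill(attempts: int, success_cost: float) -> float:
    """Budget that assumes every attempt costs what a success costs."""
    return attempts * success_cost


def actual_bill(attempts: int, success_cost: float,
                failure_rate: float, failure_multiplier: float) -> float:
    """Bill when failed attempts cost failure_multiplier times a success."""
    failed = attempts * failure_rate
    succeeded = attempts - failed
    failure_cost = success_cost * failure_multiplier
    return succeeded * success_cost + failed * failure_cost


# Hypothetical batch: 1,000 attempts at an assumed $0.40 per successful task.
attempts = 1000
success_cost = 0.40
failure_rate = 0.30        # from the example in the text
failure_multiplier = 1.5   # from the example in the text

naive = naive_bill(attempts, success_cost)
actual = actual_bill(attempts, success_cost, failure_rate, failure_multiplier)
completed = attempts * (1 - failure_rate)

print(f"naive bill:  ${naive:.2f}")
print(f"actual bill: ${actual:.2f}")
print(f"cost per completed task: ${actual / completed:.3f} "
      f"(naive: ${success_cost:.2f})")
```

Under these assumptions the total bill rises from about $400 to about $460, but the larger effect is on unit cost: only 700 tasks complete, so each one costs about $0.66 rather than $0.40.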

Modeling the True Cost

A more honest cost-per-completed-task formula accounts for the failure tax:

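Cost per success = (success rate × success cost + failure rate × failure cost) ÷ success rate

Equivalently, each completed task carries its own cost plus the cost of the failed attempts spent per success: success cost + (failure rate ÷ success rate) × failure cost.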

Task Difficulty              | Success Rate | Naive Cost | True Cost/Success
Easy (CRUD, boilerplate)     | 95%          | $0.10      | ~$0.12
Medium (feature work)        | 70%          | $0.40      | ~$0.75
Hard (cross-system refactor) | 25%          | $1.20      | ~$6.00
ALE-tier (research-grade)    | ~0%          | $3.00      | Effectively infinite

The bottom row is the warning ALE delivers: for tasks agents cannot reliably do, there is no cost per success, because there is no success. You are paying purely for attempts.
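A small Python sketch of this calculation follows. It totals expected spend per attempt (success cost weighted by the success rate, failure cost weighted by the failure rate) and divides by the success rate, returning infinity when nothing succeeds. The function name and the failure costs passed in are assumptions; the table does not state the failure cost behind each row, so the outputs here are not expected to reproduce its figures.

```python
import math


def cost_per_success(success_rate: float, success_cost: float,
                     failure_cost: float) -> float:
    """Expected spend per completed task, including the failure tax."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate == 0.0:
        # No successes: every dollar buys attempts, never a completed task.
        return math.inf
    failure_rate = 1.0 - success_rate
    expected_cost_per_attempt = (success_rate * success_cost
                                 + failure_rate * failure_cost)
    return expected_cost_per_attempt / success_rate


# Assumed inputs: failures cost 1.5x a success (a hypothetical multiplier).
hard = cost_per_success(0.25, success_cost=1.20, failure_cost=1.80)
ale_tier = cost_per_success(0.0, success_cost=3.00, failure_cost=3.00)

print(f"hard task:     ${hard:.2f} per success")
print(f"ALE-tier task: {ale_tier} per success")
```

Plugging in your own measured success rates and failed-run costs is the point; the multiplier matters most at low success rates, where each success has to pay for several failures.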

How to Budget Around It

  • Cap agent steps: hard iteration limits stop runaway thrash before it dominates your bill.
  • Triage by difficulty: route easy, high-success tasks to agents; keep research-grade tasks with humans until success rates improve.
  • Track failure cost separately: instrument token spend on failed runs so you see the real cost-per-completed-task, not the optimistic one.
  • Fail fast and cheap: use a cheap model for a first attempt; escalate to a premium model only when the task looks tractable.
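The first and third points above can be combined in one loop, sketched below in Python. Everything here is hypothetical scaffolding rather than any agent framework's API: `SpendLedger`, `run_with_step_cap`, and the `step` callback (which is assumed to return the tokens used and whether the task is done) are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical step callback: takes the step index and returns
# (tokens_used_this_step, task_is_done).
StepFn = Callable[[int], Tuple[int, bool]]


@dataclass
class SpendLedger:
    """Tracks token spend on successful and failed runs separately."""
    success_tokens: int = 0
    failure_tokens: int = 0
    successes: int = 0
    failures: int = 0

    def record(self, tokens: int, succeeded: bool) -> None:
        if succeeded:
            self.success_tokens += tokens
            self.successes += 1
        else:
            self.failure_tokens += tokens
            self.failures += 1

    def tokens_per_success(self) -> float:
        """All tokens spent, failures included, per completed task."""
        if self.successes == 0:
            return float("inf")
        return (self.success_tokens + self.failure_tokens) / self.successes


def run_with_step_cap(step: StepFn, max_steps: int,
                      ledger: SpendLedger) -> bool:
    """Run the agent for at most max_steps and log spend in the ledger."""
    tokens = 0
    for i in range(max_steps):
        used, done = step(i)
        tokens += used
        if done:
            ledger.record(tokens, succeeded=True)
            return True
    # Step cap reached: stop the thrash and book the spend as failure cost.
    ledger.record(tokens, succeeded=False)
    return False


ledger = SpendLedger()
# Stand-in agents: one converges on its third step, one never converges.
run_with_step_cap(lambda i: (1000, i == 2), max_steps=10, ledger=ledger)
run_with_step_cap(lambda i: (1000, False), max_steps=10, ledger=ledger)

print(ledger)
print("tokens per success:", ledger.tokens_per_success())
```

In this toy run the failed attempt spends more than three times the tokens of the successful one, which is exactly the spend a success-only dashboard would hide. Lowering `max_steps` shrinks that failure cost directly, at the risk of cutting off runs that would have converged late.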

Bottom Line

ALE is a reality check on agent capability, but it is also a cost lesson: tokens spent on failure are still tokens spent. Model your real success rates and step limits with our AI Cost Estimator before committing an agent to tasks it may not be able to finish.

Frequently Asked Questions

What is Agents' Last Exam?

Agents' Last Exam (ALE) is a benchmark from Berkeley's RDI lab that measures AI agent performance on hard, professional, multi-step tasks. On the most difficult tasks, success rates were near 0% even for frontier agents.

Why do failed agent runs cost so much?

A failing agent doesn't stop instantly—it plans, calls tools, retries, and loops until it hits a step limit, consuming full-price tokens the whole time. A failed run can cost more than a successful one.

How should I budget for agent failure rates?

Compute cost per completed task as (success rate × success cost + failure rate × failure cost) ÷ success rate. Cap agent steps, triage tasks by difficulty, and instrument token spend on failed runs to see the true cost.
