Berkeley's 'Agents' Last Exam': Why 0% on the Hardest Tasks Is a Hidden Token-Cost Problem
June 16, 2026 · 6 min read
A Benchmark Built to Be Failed
Berkeley's Center for Responsible, Decentralized Intelligence (RDI) released Agents' Last Exam (ALE), a benchmark designed to measure how capable AI agents really are on hard, professional, multi-step tasks. The headline result was sobering: on the most difficult tasks, success rates hovered near 0%, even for frontier agents.
Most coverage framed this as a capability story: agents are not as ready as the hype suggests. That is true, but there is a second story, one that engineering teams feel directly in their bills. A failed agent run is not a cheap run.
Failure Burns Tokens Too
When an agent attempts a hard task and fails, it does not fail instantly. It plans, calls tools, reads files, retries, second-guesses, and often loops until it hits a step limit. Every one of those iterations consumes input and output tokens at full price. A task that ends in failure can easily cost more than one that succeeds, because the agent thrashes instead of converging.
This inverts the intuition behind most cost estimates. Teams budget for the tokens a successful task consumes and then multiply by task count. But if 30% of attempts fail after burning 1.5x the tokens of a success, the average attempt costs 1.15x the naive figure (0.7 × 1 + 0.3 × 1.5), and each completed task costs about 1.64x (1.15 ÷ 0.7) once failed attempts are retried.
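The 30% failure, 1.5x example can be sanity-checked with a short simulation. This is a minimal sketch, not a billing model: it assumes every failed attempt is retried until one succeeds, that attempts are independent, and that costs are normalized so one successful run costs 1.0. The function name and parameters are illustrative.

```python
import random


def simulate_cost_per_completed_task(success_rate, success_cost, failure_cost,
                                     tasks=100_000, seed=0):
    """Average spend per completed task when each failed attempt is retried."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(tasks):
        # Pay for every failed attempt until one attempt succeeds.
        while rng.random() >= success_rate:
            total += failure_cost
        total += success_cost
    return total / tasks


naive = 1.0  # assumed: cost of one successful run, normalized
simulated = simulate_cost_per_completed_task(
    success_rate=0.70, success_cost=naive, failure_cost=1.5 * naive
)
print(f"naive: {naive:.2f}  simulated: {simulated:.2f}")
```

Under these assumptions the simulated figure should land near 1.64, which is 1.15 ÷ 0.70: the expected cost of one attempt divided by the success rate.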
Modeling the True Cost
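A more honest cost-per-completed-task formula accounts for the failure tax:
Cost per success = (success rate × success cost + failure rate × failure cost) ÷ success rate
The numerator is the expected cost of a single attempt. Dividing by the success rate gives the expected spend per completed task, assuming failed tasks are retried until one attempt succeeds.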
| Task Difficulty | Success Rate | Naive Cost | True Cost/Success |
|---|---|---|---|
| Easy (CRUD, boilerplate) | 95% | $0.10 | ~$0.12 |
| Medium (feature work) | 70% | $0.40 | ~$0.75 |
| Hard (cross-system refactor) | 25% | $1.20 | ~$6.00 |
| ALE-tier (research-grade) | ~0% | $3.00 | Effectively infinite |
The bottom row is the warning ALE delivers: for tasks agents cannot reliably do, there is no cost per success, because there is no success. You are paying purely for attempts.
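A minimal sketch of how cost per success behaves as the success rate falls. It computes the expected cost of one attempt divided by the success rate, and treats a 0% success rate as infinite cost. The inputs are hypothetical normalized units (success cost 1.0, failure cost 1.5), not the dollar figures from the table, and the function name is illustrative.

```python
import math


def cost_per_success(success_rate, success_cost, failure_cost):
    """Expected spend per completed task, assuming failed attempts are retried."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate == 0.0:
        # No attempt ever completes, so every token buys only attempts.
        return math.inf
    per_attempt = (success_rate * success_cost
                   + (1.0 - success_rate) * failure_cost)
    return per_attempt / success_rate


# Hypothetical sweep: same per-run costs, falling success rate.
for rate in (0.95, 0.70, 0.25, 0.05, 0.0):
    print(f"success rate {rate:.0%}: "
          f"{cost_per_success(rate, success_cost=1.0, failure_cost=1.5):.2f}")
```

The cost does not grow linearly. Because the success rate sits in the denominator, each drop in reliability costs more than the last, which is why the bottom tier has no meaningful price at all.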
How to Budget Around It
- Cap agent steps: hard iteration limits stop runaway thrash before it dominates your bill.
- Triage by difficulty: route easy, high-success tasks to agents; keep research-grade tasks with humans until success rates improve.
- Track failure cost separately: instrument token spend on failed runs so you see the real cost-per-completed-task, not the optimistic one.
- Fail fast and cheap: use a cheap model for a first attempt; escalate to a premium model only when the task looks tractable.
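The first and third points above can be sketched together: a hard step cap around the agent loop, plus a ledger that keeps token spend on failed runs separate from spend on successful ones. Everything here is hypothetical scaffolding, including `SpendLedger`, `run_with_step_cap`, and the stand-in step functions. A real agent framework would supply its own step call and token counts.

```python
from dataclasses import dataclass


@dataclass
class SpendLedger:
    """Token spend split by run outcome, so failed runs stay visible."""
    success_tokens: int = 0
    failure_tokens: int = 0
    successes: int = 0
    failures: int = 0

    def record(self, tokens, succeeded):
        if succeeded:
            self.success_tokens += tokens
            self.successes += 1
        else:
            self.failure_tokens += tokens
            self.failures += 1

    def tokens_per_completed_task(self):
        if self.successes == 0:
            return float("inf")  # spend with nothing completed
        return (self.success_tokens + self.failure_tokens) / self.successes


def run_with_step_cap(step_fn, max_steps):
    """Call step_fn(step_index) -> (tokens_used, done) until done or the cap.

    Returns (total_tokens, succeeded).
    """
    total = 0
    for i in range(max_steps):
        tokens, done = step_fn(i)
        total += tokens
        if done:
            return total, True
    return total, False  # cap reached: counted as a failed run


# Stand-in step functions (assumed 1,000 tokens per step).
def converging(i):
    return 1_000, i == 3  # finishes on its fourth step


def thrashing(i):
    return 1_000, False   # never finishes


ledger = SpendLedger()
for step_fn in (converging, thrashing):
    tokens, ok = run_with_step_cap(step_fn, max_steps=10)
    ledger.record(tokens, ok)

print(ledger.failure_tokens, ledger.tokens_per_completed_task())
```

In this toy run the capped failure still spends more than the success, but the cap bounds the loss, and the ledger reports it as its own line instead of hiding it in an average.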
Bottom Line
ALE is a reality check on agent capability, but it is also a cost lesson: tokens spent on failure are still tokens spent. Model your real success rates and step limits with our AI Cost Estimator before committing an agent to tasks it may not be able to finish.
Frequently Asked Questions
What is Agents' Last Exam?
Agents' Last Exam (ALE) is a benchmark from Berkeley's RDI lab that measures AI agent performance on hard, professional, multi-step tasks. On the most difficult tasks, success rates were near 0% even for frontier agents.
Why do failed agent runs cost so much?
A failing agent doesn't stop instantly. It plans, calls tools, retries, and loops until it hits a step limit, consuming full-price tokens the whole time. A failed run can cost more than a successful one.
How should I budget for agent failure rates?
Compute cost per completed task as (success rate × success cost + failure rate × failure cost) ÷ success rate. Cap agent steps, triage tasks by difficulty, and instrument token spend on failed runs to see the true cost.