Berkeley's 'Agents' Last Exam': Why 0% on the Hardest Tasks Is a Hidden Token-Cost Problem

By Eric Bush · June 16, 2026 · 6 min read

A Benchmark Built to Be Failed

Berkeley's Center for Responsible, Decentralized Intelligence (RDI) released Agents' Last Exam (ALE), a benchmark designed to measure how capable AI agents really are on hard, professional, multi-step tasks. The headline result was sobering: on the most difficult tasks, success rates hovered near 0%, even for frontier agents.

Most coverage framed this as a capability story—agents are not as ready as the hype suggests. True, but there is a second story that engineering teams feel directly in their bills: a failed agent run is not a cheap run.

Failure Burns Tokens Too

When an agent attempts a hard task and fails, it does not fail instantly. It plans, calls tools, reads files, retries, second-guesses, and often loops until it hits a step limit. Every one of those iterations consumes input and output tokens at full price. A task that ends in failure can easily cost more than one that succeeds, because the agent thrashes instead of converging.

This inverts the intuition behind most cost estimates. Teams budget for the tokens a successful task consumes and then multiply by task count. But if 30% of attempts fail after burning 1.5x the tokens of a success, the average attempt costs 1.15x the naive figure (0.7 × 1 + 0.3 × 1.5), and each completed task costs about 1.64x once failed attempts are retried (1.15 ÷ 0.7).

Modeling the True Cost

A more honest cost-per-completed-task formula accounts for the failure tax:

Cost per success = (success-rate × success cost + failure-rate × failure cost) ÷ success rate

Task Difficulty	Success Rate	Naive Cost/Task	True Cost/Success
Easy (CRUD, boilerplate)	95%	$0.10	~$0.12
Medium (feature work)	70%	$0.40	~$0.75
Hard (cross-system refactor)	25%	$1.20	~$6.00
ALE-tier (research-grade)	~0%	$3.00	Effectively infinite

The bottom row is the warning ALE delivers: for tasks agents cannot reliably do, there is no cost per success, because there is no success. You are paying purely for attempts.
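
The failure tax can be sketched in a few lines of Python. This is a minimal illustration, not a billing model: it assumes attempts are independent and retried until one succeeds, so each success carries (1 − p) ÷ p failed attempts on average. The function name and the dollar figures are hypothetical; the inputs reuse the 30% failure, 1.5x example rather than any row of the table.

```python
import math


def cost_per_success(success_rate: float, success_cost: float, failure_cost: float) -> float:
    """Expected spend to obtain one completed task, retrying until success.

    Assumes independent attempts with a constant success rate, so each
    success is accompanied by (1 - p) / p failed attempts on average.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate == 0.0:
        # No attempt ever succeeds, so there is no finite cost per success.
        return math.inf
    failures_per_success = (1.0 - success_rate) / success_rate
    return success_cost + failures_per_success * failure_cost


naive = 1.00  # hypothetical cost of one successful run, in dollars
true_cost = cost_per_success(success_rate=0.70, success_cost=naive, failure_cost=1.5 * naive)
print(f"naive ${naive:.2f} -> true ${true_cost:.2f} per completed task")
```

With these inputs the true cost comes out at about $1.64 per completed task. A success rate of zero returns infinity, which is the ALE-tier case: spend accumulates and nothing is completed.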

How to Budget Around It

Cap agent steps: hard iteration limits stop runaway thrash before it dominates your bill.
Triage by difficulty: route easy, high-success tasks to agents; keep research-grade tasks with humans until success rates improve.
Track failure cost separately: instrument token spend on failed runs so you see the real cost-per-completed-task, not the optimistic one.
Fail fast and cheap: use a cheap model for a first attempt; escalate to a premium model only when the task looks tractable.
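
Two of the items above, the step cap and the separate failure ledger, can be sketched together. This is an illustration under stated assumptions: `Step`, `SpendLedger`, and `run_capped` are hypothetical names, and the step callable stands in for whatever your agent framework actually exposes. Token counts are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Tuple

# One agent step returns (tokens_used, task_done). Hypothetical interface:
# real agent frameworks report step results differently.
Step = Callable[[int], Tuple[int, bool]]


@dataclass
class SpendLedger:
    """Token spend split by run outcome, so failed runs stay visible."""
    tokens: dict = field(default_factory=lambda: {"success": 0, "failure": 0})
    runs: dict = field(default_factory=lambda: {"success": 0, "failure": 0})

    def record(self, outcome: str, tokens_used: int) -> None:
        self.tokens[outcome] += tokens_used
        self.runs[outcome] += 1

    def tokens_per_completed_task(self) -> float:
        # All tokens, failed runs included, divided by completed tasks.
        total = self.tokens["success"] + self.tokens["failure"]
        done = self.runs["success"]
        return total / done if done else float("inf")


def run_capped(step: Step, ledger: SpendLedger, max_steps: int = 20) -> bool:
    """Run an agent until it finishes or hits the hard step limit."""
    used = 0
    for i in range(max_steps):
        tokens, done = step(i)
        used += tokens
        if done:
            ledger.record("success", used)
            return True
    # Step limit reached: stop the run and book the spend as failure cost.
    ledger.record("failure", used)
    return False


ledger = SpendLedger()
run_capped(lambda i: (1_000, i == 2), ledger)              # finishes on its third step
run_capped(lambda i: (1_000, False), ledger, max_steps=5)  # never converges
print(ledger.tokens, ledger.tokens_per_completed_task())
```

In this toy run the failed attempt burns 5,000 tokens against 3,000 for the success, so the ledger reports 8,000 tokens per completed task rather than the 3,000 a success-only estimate would show.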

Bottom Line

ALE is a reality check on agent capability, but it is also a cost lesson: tokens spent on failure are still tokens spent. Model your real success rates and step limits with our AI Cost Estimator before committing an agent to tasks it may not be able to finish.

Frequently Asked Questions

What is Agents' Last Exam?

Agents' Last Exam (ALE) is a benchmark from Berkeley's RDI lab that measures AI agent performance on hard, professional, multi-step tasks. On the most difficult tasks, success rates were near 0% even for frontier agents.

Why do failed agent runs cost so much?

A failing agent doesn't stop instantly—it plans, calls tools, retries, and loops until it hits a step limit, consuming full-price tokens the whole time. A failed run can cost more than a successful one.

How should I budget for agent failure rates?

Compute cost per completed task as (success-rate × success cost + failure-rate × failure cost) ÷ success rate. Cap agent steps, triage tasks by difficulty, and instrument token spend on failed runs to see the true cost.
