VitaBench 2.0 Calls the Bluff: Claude Opus 4.6 Barely Clears 0.5 on Long-Horizon Tasks
June 28, 2026 · 9 min read
A Benchmark That Actually Tests Persistence
On June 26, 2026, Meituan's LongCat team open-sourced VitaBench 2.0, the first AI agent benchmark designed around long-horizon, dynamic-user evaluation. The numbers are large enough to matter:
- 56 simulated users with persistent preferences
- 819 complex tasks averaging 2,093 interactions per user
- Average task span: 1,580 days of simulated time
- 66 executable tools and 2,000+ dynamic preference shifts
The best frontier model, Claude Opus 4.6 in open-book mode, averaged just above 0.5. Every other tested model came in lower. The message behind the headline is harsher: paying frontier prices does not buy you long-horizon competence.
Three Findings That Reset Pricing Decisions
The LongCat team published three results worth pasting into your model-selection doc.
1. Thinking mode is not free uplift on personalization. Enabling extended thinking improved some math/code subtasks but flat-lined or hurt personalization scores. If you have been paying for reasoning tokens because you assumed they help everywhere, this is your stop sign.
2. Active-questioning tasks collapse. Across every model tested, tasks that required the agent to ask the user a clarifying question scored substantially lower than tasks where the user volunteered every detail. Agents over-commit and under-ask. The token cost shows up later, when they have to redo the work.
3. The open-book vs closed-book gap is enormous. Closed-book scores (no access to user history) were materially lower than open-book scores. Most production "long context" deployments are open-book by definition, so the actual moat is preference retrieval quality, not context window size.
Translating 0.5 Into a Dollar Number
A 50% completion rate on multi-step tasks looks innocuous in a benchmark table. In a coding context it is brutal.
Suppose you run a coding agent against a real engineering backlog where each ticket takes 5 sub-tasks. If you treat the Opus 4.6 long-horizon score as a per-sub-task success rate (~0.5) and assume sub-tasks succeed or fail independently, the probability of completing a 5-step ticket without intervention is 0.5⁵ = 3.1%. Even at 0.7 per sub-task it is only 0.7⁵ = 16.8%. The other 83-97% of attempts incur tokens for partial work that has to be redone or human-corrected.
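The compounding arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name is illustrative, and the independence of sub-tasks is a simplifying assumption of the example, not something VitaBench 2.0 measures.

```python
def ticket_success_probability(per_step: float, steps: int) -> float:
    """Chance that every sub-task succeeds, assuming independent sub-tasks."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# The two per-sub-task rates discussed in the text, on a 5-step ticket.
for rate in (0.5, 0.7):
    prob = ticket_success_probability(rate, 5)
    print(f"{rate:.1f} per sub-task: {prob:.1%} chance of a clean 5-step ticket")
```

Plugging in your own measured per-step rate and ticket length shows how quickly the clean-completion odds fall as tickets get longer.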
Concretely: a 5-step ticket budgeted at 50K tokens of Opus 4.6 use ($0.25 input + $1.25 output = $1.50/ticket) actually consumes 3-5x that once the rework loops are counted. Real cost per completed ticket on long-horizon work: $4.50 to $7.50, not $1.50.
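The same numbers as a small calculation, so you can substitute your own budget. This is only a sketch of the article's arithmetic: the function name is hypothetical, and the 3x and 5x multipliers are the rework range quoted above, not measured values.

```python
def cost_per_completed_ticket(input_cost: float, output_cost: float,
                              rework_multiplier: float) -> float:
    """Budgeted ticket cost scaled by how many times the work is effectively paid for."""
    if rework_multiplier < 1.0:
        raise ValueError("rework_multiplier below 1 would mean negative rework")
    return (input_cost + output_cost) * rework_multiplier


# Budgeted figures from the text: $0.25 input + $1.25 output per ticket.
budgeted = cost_per_completed_ticket(0.25, 1.25, 1.0)
low = cost_per_completed_ticket(0.25, 1.25, 3.0)
high = cost_per_completed_ticket(0.25, 1.25, 5.0)
print(f"budgeted ${budgeted:.2f}, realistic ${low:.2f} to ${high:.2f}")
```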
Why This Matters Right Now
Vendors have been marketing 1M+ context windows as a long-horizon solution for two years. VitaBench 2.0 is the first public benchmark that decouples context length from long-horizon competence, and the results are not flattering. A 1M token window does not buy you preference recall across simulated months of usage; it buys you the ability to fit the data, not to act on it.
What to Do About It
Until model providers publish their own VitaBench 2.0 numbers (expect this within weeks), three workarounds keep your bill in check.
Decompose tickets. Treat any ticket beyond 3 sub-steps as N independent agent runs with hand-curated context per run. You pay slightly more per run, but a failed step no longer sinks the whole ticket, so failures stop compounding.
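One way to sketch the decomposition is to chunk a ticket's sub-steps into runs of at most three, each carrying its own context. Everything here is illustrative: `AgentRun`, `decompose_ticket`, and the example step names are hypothetical, and the cap of three per run is one reading of the threshold above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """One independent agent run: a few sub-steps plus its hand-curated context."""
    steps: list[str]
    context: list[str] = field(default_factory=list)


def decompose_ticket(steps: list[str], max_steps_per_run: int = 3) -> list[AgentRun]:
    """Split a ticket into consecutive runs of at most max_steps_per_run sub-steps."""
    if max_steps_per_run < 1:
        raise ValueError("max_steps_per_run must be at least 1")
    return [
        AgentRun(steps=steps[i:i + max_steps_per_run])
        for i in range(0, len(steps), max_steps_per_run)
    ]


# Hypothetical 5-step ticket, split into a 3-step run and a 2-step run.
ticket = ["reproduce bug", "write failing test", "patch handler",
          "update docs", "open pull request"]
runs = decompose_ticket(ticket)
runs[0].context.append("error log excerpt")   # context is curated per run
runs[1].context.append("diff from run 1")
for number, run in enumerate(runs, start=1):
    print(number, run.steps, run.context)
```

Under the independence assumption used earlier, each run now compounds over at most three steps instead of five, and a failed run can be retried without repeating the others.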
Force clarifying questions. Add a system-prompt directive: "Before executing, list any assumption you are making about user intent and ask for confirmation." This costs 100-300 input tokens per run and recovers most of the active-questioning gap.
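A minimal sketch of wiring that directive into an existing system prompt follows. The helper name and the example base prompt are hypothetical; only the directive text comes from the paragraph above, and how the resulting string is passed to a model depends on your provider's API.

```python
CLARIFY_DIRECTIVE = (
    "Before executing, list any assumption you are making about user intent "
    "and ask for confirmation."
)


def with_clarifying_directive(base_prompt: str) -> str:
    """Append the directive once; leave prompts that already contain it unchanged."""
    if CLARIFY_DIRECTIVE in base_prompt:
        return base_prompt
    return f"{base_prompt.rstrip()}\n\n{CLARIFY_DIRECTIVE}"


# Hypothetical base prompt for a coding agent.
system_prompt = with_clarifying_directive("You are a coding agent for our backlog.")
print(system_prompt)
```

Keeping the directive in one constant makes it easy to A/B test the wording and to measure the added input tokens per run.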
Cap reasoning budget on personalization tasks. If thinking mode flat-lines or hurts here, don't pay for it. Configure your agent to skip extended thinking when the task is preference-driven rather than logic-driven.
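The routing rule can be as simple as a lookup on task type. The sketch below is provider-agnostic: the task labels, the `thinking_settings` function, and the returned keys are all hypothetical and would need to be mapped onto whatever reasoning controls your model API actually exposes.

```python
# Hypothetical labels for tasks that depend on user preferences rather than logic.
PREFERENCE_DRIVEN = {"personalization", "preference_recall", "recommendation"}


def thinking_settings(task_kind: str, default_budget: int = 4000) -> dict:
    """Return illustrative reasoning settings: no extended thinking for preference tasks."""
    if task_kind in PREFERENCE_DRIVEN:
        return {"extended_thinking": False, "budget_tokens": 0}
    return {"extended_thinking": True, "budget_tokens": default_budget}


for kind in ("personalization", "code_refactor"):
    print(kind, thinking_settings(kind))
```

The default budget of 4000 tokens is a placeholder; the point is that the decision is made per task type rather than globally.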
The Bigger Story
VitaBench 2.0 joins Cursor's SWE-Bench audit and Wilkinson's Civ VI tournament in a new generation of benchmarks that surface failure modes the older ones missed. The pattern: every honest measurement of agent persistence lowers the apparent capability of frontier models, often by 20-40 points. The good news is that this is a fixable problem with prompting and workflow. The expensive news is that until the model layer catches up, you are paying frontier prices for mid-tier real-world performance.
Frequently Asked Questions
Is VitaBench 2.0 specific to one domain?
No. Its tasks are life scenarios (food ordering, scheduling, customer service), but the failure modes it exposes (preference recall, active questioning) carry over to coding agents too.
Why does open-book scoring matter for coding?
Most coding agents have read access to your repo and prior conversations. Open-book corresponds to that setup. Closed-book scores are still useful as a 'pure model' baseline.
Can fine-tuning fix the active-questioning gap?
Possibly for narrow domains, but a $20K fine-tune for a 0.05-0.10 lift is rarely worth the cost. A system-prompt directive plus structured tool use captures most of the gain.
Are there VitaBench-style benchmarks aimed at coding specifically?
Not yet at this depth. SWE-Atlas and NL2Repo are the closest, but they focus on single-task completion rather than multi-month preference persistence.