CEO-Bench: Only 3 of 14 AI Models Made a Profit in a 500-Day Startup Simulation

June 29, 2026 · 8 min read

The Experiment

Princeton researchers introduced CEO-Bench in June 2026: a simulated environment where AI agents run a subscription software company called NovaMind for 500 days, starting with $1 million in capital. The agents controlled pricing decisions, product development prioritization, marketing spend allocation, and hiring — autonomously, for the full duration.

Fourteen AI systems were tested. The success criterion was simple: end with more money than you started with. Only three cleared the bar in their best runs.

The Results

| Model | Best Run Profit | Outcome |
|---|---|---|
| Claude Fable 5 | $47.15M | Profitable |
| Claude Opus 4.8 | $27.80M | Profitable |
| GPT-5.5 | $21.30M | Profitable |
| Rules heuristic (no LLM) | $15.76M | Profitable (no AI) |
| All other 11 models | | Bankrupt |

The most striking row is the fourth: a simple rule-based heuristic with no LLM (fixed pricing, conservative quotas, targeted feature development) outperformed eleven of the fourteen AI systems. Most models could not sustain a coherent multi-week strategy: they burned through cash on poor decisions and went bankrupt before day 300.

What Failed

The paper identifies two failure modes that dominated. Strategic drift: models would adopt a plan, then reverse it within a dozen simulation steps without clear cause — discounting aggressively in month 2, then hiking prices in month 3, then discounting again. No consistent logic connected the decisions.
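One rough way to make strategic drift measurable is to count how often the direction of a decision flips. The sketch below does this for a price series. The function name `count_reversals` and the sample prices are illustrative assumptions, not something taken from the CEO-Bench paper.

```python
from typing import Sequence


def count_reversals(prices: Sequence[float]) -> int:
    """Count how often price changes flip direction (hike -> discount or back)."""
    reversals = 0
    last_direction = 0  # +1 after a hike, -1 after a discount, 0 before any change
    for previous, current in zip(prices, prices[1:]):
        if current == previous:
            continue  # unchanged price carries no direction
        direction = 1 if current > previous else -1
        if last_direction != 0 and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


# Hypothetical monthly prices: discount, hike, discount again.
drifting = [50.0, 35.0, 60.0, 40.0]
# Hypothetical steady plan: one deliberate price cut, then hold.
steady = [50.0, 50.0, 45.0, 45.0]

print(count_reversals(drifting), count_reversals(steady))
```

A high reversal count over a short window is only a symptom. Reversing course can be the right call when market conditions change, so a real check would also need to look at whether anything in the environment justified the flip.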

Short-horizon fixation: models optimized for the next 5–10 days rather than the full 500-day arc. Cash-burning growth plays that could only pay off many months later were repeatedly chosen, then abandoned before the payoff materialized.
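The cost of choosing a growth play and then abandoning it can be shown with toy arithmetic. The sketch below compares three hypothetical cash-flow plans over 500 days. Every dollar figure and the helper `net_cash` are assumptions made up for illustration; none come from the benchmark.

```python
from typing import Sequence


def net_cash(daily_cash_flow: Sequence[float], horizon_days: int) -> float:
    """Sum the daily cash flows over the first `horizon_days` days."""
    return sum(daily_cash_flow[:horizon_days])


# Hypothetical status quo: a flat $500/day for all 500 days.
status_quo = [500.0] * 500
# Hypothetical growth play: burn $2,000/day for 200 days, then earn $3,000/day.
growth_play = [-2000.0] * 200 + [3000.0] * 300
# The failure mode: start the growth play, give up at day 100, revert to status quo.
abandoned = growth_play[:100] + status_quo[100:]

for name, plan in [("status quo", status_quo),
                   ("growth play", growth_play),
                   ("abandoned", abandoned)]:
    print(f"{name}: 10-day view {net_cash(plan, 10):,.0f}, "
          f"500-day view {net_cash(plan, 500):,.0f}")
```

Under these made-up numbers the committed growth play ends at $500,000, the status quo at $250,000, and the abandoned play at $0: it pays the full early burn and collects none of the payoff. A 10-day view cannot distinguish the committed and abandoned plans at all.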

Why This Matters for AI Coding Agent ROI

CEO-Bench is not a coding benchmark. But the gap it reveals — between short-context task performance and long-horizon autonomous decision-making — maps directly to how AI coding agents fail at scale.

Most AI coding agents are evaluated on single-task benchmarks like SWE-bench (fix one bug, pass tests). These scores say nothing about whether an agent can manage a multi-week refactor, maintain consistent architecture decisions across 200+ files, or know when to stop and ask a human. CEO-Bench does not measure coding either, but it offers indirect evidence: the models that hold a plan together over a long horizon are the likeliest candidates to sustain coding quality over a long project.

The cost implication is direct. If you're running autonomous coding agents (Grok Build /goal mode, Claude Code auto, Devin) on work that spans more than a few hours, you are implicitly running a long-horizon agent. Budget models that score well on SWE-bench may still make expensive architectural mistakes that surface as rework costs 10 sessions later.

Mapping CEO-Bench to Coding Token Costs

The three profitable models — Claude Fable 5 ($10/$50), Claude Opus 4.8 ($5/$25), GPT-5.5 ($5/$30) — are all in the premium pricing tier. For a medium project (15,000 LOC, CLI agent, production quality), our estimator puts Claude Opus 4.8 at roughly $890 in raw tokens and GPT-5.5 at around $1,100.

By contrast, a budget model in the same scenario runs $40–120 in tokens. The CEO-Bench result suggests a real but hard-to-quantify cost on the other side: rework hours when the agent drifts or makes inconsistent decisions. Against the roughly $890 Opus 4.8 estimate, a $40 budget run stays cheaper until course-correction consumes about $850 of engineer time; beyond that point the total cost of ownership flips.
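To see where the balance tips, the break-even point can be computed directly from the token estimates quoted above. This is a minimal sketch that assumes the premium run needs no rework at all, which is a simplification; the function name `break_even_rework` is hypothetical.

```python
def break_even_rework(budget_tokens: float, premium_tokens: float) -> float:
    """Rework spend at which a budget run costs as much as a premium run.

    Assumes the premium run itself incurs no rework cost.
    """
    return premium_tokens - budget_tokens


# Token estimates quoted in the article (USD).
OPUS_4_8 = 890.0
GPT_5_5 = 1100.0
BUDGET_LOW, BUDGET_HIGH = 40.0, 120.0

for premium_name, premium_cost in [("Claude Opus 4.8", OPUS_4_8), ("GPT-5.5", GPT_5_5)]:
    low = break_even_rework(BUDGET_HIGH, premium_cost)
    high = break_even_rework(BUDGET_LOW, premium_cost)
    print(f"vs {premium_name}: budget wins until rework exceeds ${low:,.0f} to ${high:,.0f}")
```

With these figures, the budget run stays ahead of Opus 4.8 until rework passes $770 to $850, and ahead of GPT-5.5 until it passes $980 to $1,060. If the premium run also needs some rework, the threshold moves up by that amount.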

This is not an argument that premium models always win. For short, well-scoped tasks — a single function, a bug fix, a test file — budget models are demonstrably effective and the long-horizon failure mode simply doesn't have time to manifest. The CEO-Bench result is a warning label specifically for autonomous, multi-session work.

Practical Decision Framework

Based on the CEO-Bench findings and how they translate to coding workflows:

  • Single-session tasks under 2 hours: Budget models (DeepSeek V4 Flash, Qwen3 Coder Next, GPT-4.1) are cost-optimal. The planning horizon is short enough that drift doesn't accumulate.
  • Multi-session refactors or feature sprints: Claude Opus 4.8 or GPT-5.5 tier. The extra cost is insurance against compounding bad decisions.
  • Fully autonomous cloud agents (Devin-style): Only the top-tier models were profitable in CEO-Bench. The context is different but the lesson is the same — long-horizon autonomy is where model quality creates the largest ROI spread.
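The three rules above can be written down as a small lookup, which makes the gaps in the framework visible. This is a sketch, not a recommendation engine: the `CodingTask` fields, the tier labels, and the handling of long single sessions are all assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class CodingTask:
    sessions: int             # number of separate agent sessions expected
    hours_per_session: float  # rough wall-clock length of one session
    fully_autonomous: bool    # True for Devin-style agents with no human in the loop


def recommend_tier(task: CodingTask) -> str:
    """Map a task to a model tier using the three rules in the list above."""
    if task.fully_autonomous:
        return "top-tier"
    if task.sessions > 1:
        return "premium"
    if task.hours_per_session < 2:
        return "budget"
    # Assumption: the list does not cover long single sessions,
    # so this sketch errs on the side of the premium tier.
    return "premium"


print(recommend_tier(CodingTask(sessions=1, hours_per_session=1.5, fully_autonomous=False)))
print(recommend_tier(CodingTask(sessions=6, hours_per_session=3.0, fully_autonomous=False)))
print(recommend_tier(CodingTask(sessions=20, hours_per_session=8.0, fully_autonomous=True)))
```

The fallback branch is the judgment call here: a single session longer than two hours sits between the first two rules, and a team with good review habits might reasonably route it to a budget model instead.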

The rules heuristic beating eleven AI systems also suggests a general principle: for stable, well-understood workflows, simple automation often beats over-engineered AI agents on cost. Reach for the complex agent when the problem actually requires judgment, not just execution.

Frequently Asked Questions

What is CEO-Bench?

CEO-Bench is a Princeton University benchmark that places AI agents in charge of a simulated SaaS company for 500 simulated days, evaluating long-horizon strategic decision-making including pricing, hiring, and product development.

Which AI models were profitable in CEO-Bench?

Only three, each in its best run: Claude Fable 5 ($47.15M profit), Claude Opus 4.8 ($27.80M), and GPT-5.5 ($21.30M). A simple rule-based heuristic with no LLM also made $15.76M, outperforming 11 of the 14 AI models.

How does CEO-Bench relate to AI coding costs?

The long-horizon planning capability tested in CEO-Bench plausibly maps to multi-session autonomous coding work, although the benchmark itself does not measure coding. Models that drift on long-horizon tasks may also make inconsistent architectural decisions, creating rework costs that can exceed the token savings from cheaper models.

Should I always use premium models for coding?

No. For single-session, well-scoped tasks, budget models are cost-optimal. Premium models justify the cost primarily for multi-session autonomous work where strategic consistency matters.