The Civilization VI AI Tournament Found the Real Coding-Agent Bottleneck — And It Costs You Tokens

June 28, 2026 · 8 min read

Four Frontier Models, 76 Tools, 23 Games

On June 28, 2026, former UK Downing Street data scientist Liam Wilkinson published one of the more honest agent benchmarks we have seen this year. He wired up 76 MCP tools that let an LLM play Civilization VI, then ran Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and one open-source model through 23 full matches against each other.

The headline anecdote — Claude nuking Toulouse to stop France's cultural victory and still losing on diplomacy — got the press. The actual finding is more useful, and more expensive, than that. Two numbers explain why your coding bill is higher than your benchmark scores predicted.

The Two Numbers

1-2%: the fraction of turns where the model proactively checked global state before acting. The rest of the time it acted on stale, partial, or imagined context.
48-66%: the fraction of its own stated 10-turn plans the model actually executed within those 10 turns. Most plans were quietly abandoned before they paid off.

Wilkinson's conclusion: intelligence is not the bottleneck. Perception (knowing what state the world is in) and execution (following through on your own plan) are. Both gaps look identical inside a coding agent — and both burn tokens.

Mapping Civ VI Failures Onto Coding Agents

A Civilization turn and an AI coding turn share more structure than you might expect. Both involve a long-lived state (the map / the repo), a finite tool surface (76 MCP calls / file read, edit, run, grep), and a sequence of decisions where small misreads compound.

The "1-2% global state check" failure in Civ shows up in coding as the agent that edits user.service.ts without first running grep for other call sites. The plan looks fine. The first edit looks fine. Then the test suite reveals 11 broken callers and the agent burns another 40K tokens chasing the ripple.
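The missing perception step can be as small as a caller scan before the first edit. The following is a minimal sketch, assuming a TypeScript repo on local disk; `find_call_sites` and its parameters are hypothetical names for illustration, not part of any agent framework, and a plain word-boundary search stands in for `grep`.

```python
import re
from pathlib import Path


def find_call_sites(repo_root, symbol, suffixes=(".ts", ".tsx")):
    """Return (relative_path, line_number) for every line mentioning `symbol`.

    A plain-text scan, roughly what `grep -rnw` would report. It does not
    parse the code, so it also matches comments and strings.
    """
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    root = Path(repo_root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.relative_to(root).as_posix(), line_no))
    return hits
```

Feeding the resulting list into the agent's context before it touches `user.service.ts` costs a few hundred tokens, which is small next to a 40K-token recovery.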

The "48-66% follow-through" failure shows up as the agent that announces a four-step refactor plan in turn 1, executes step 1 cleanly, and then in turn 4 has forgotten step 3 entirely because its plan never re-entered the context window. You pay for the original plan, the partial execution, and the silent abandonment.

What Each Failure Costs in Dollars

Take a 30-turn Opus 4.8 coding session: 30K input tokens per turn, 4K output tokens per turn, and current Opus pricing of $5 per million input tokens and $25 per million output tokens.

Baseline cost: 30 × (30K × $5 + 4K × $25) / 1M = $7.50
Perception failure tax (no global state check): roughly +25% on tokens for the recovery cascade. Adds $1.88 per session.
Execution failure tax (abandoned plans): wasted output tokens on plans never followed, plus reissued plans in later turns. Adds roughly 12% — $0.90 per session.

At 200 coding sessions per month per engineer, that is 200 × ($1.88 + $0.90) = $556 in monthly perception/execution tax per engineer, before you have done anything wrong as a user.
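The arithmetic above can be reproduced in a few lines, which makes it easy to substitute your own turn counts and prices. This is a sketch: the token counts, prices, and tax percentages are the article's, while the function and variable names are illustrative.

```python
INPUT_PRICE_PER_M = 5.00    # USD per million input tokens (article's figure)
OUTPUT_PRICE_PER_M = 25.00  # USD per million output tokens (article's figure)


def session_cost(turns, input_tokens_per_turn, output_tokens_per_turn):
    """Cost in USD of a session with a fixed token count per turn."""
    per_turn = (
        input_tokens_per_turn * INPUT_PRICE_PER_M
        + output_tokens_per_turn * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return turns * per_turn


baseline = session_cost(turns=30, input_tokens_per_turn=30_000,
                        output_tokens_per_turn=4_000)

# The article's rough estimates, rounded to cents per session.
perception_tax = round(baseline * 0.25, 2)
execution_tax = round(baseline * 0.12, 2)

SESSIONS_PER_MONTH = 200
monthly_tax = SESSIONS_PER_MONTH * (perception_tax + execution_tax)

print(f"baseline ${baseline:.2f}, perception ${perception_tax:.2f}, "
      f"execution ${execution_tax:.2f}, monthly ${monthly_tax:.2f}")
```

Rounding each tax to cents before multiplying matches the $556 figure; carrying the unrounded 37% through gives $555.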

Three Cheap Fixes That Work

You cannot patch the model. You can patch the loop. Three interventions, all cheap, target the exact failure modes Wilkinson surfaced:

1. Forced state-check turn. Every 3-5 turns, prepend a "summarize current repo state, list known callers of files you intend to edit" pseudo-tool call. This is roughly 1.5K extra input tokens (a $0.0075 charge on Opus) and pays for itself the first time it prevents one ripple-effect cascade.

2. Plan re-injection. Append the original plan to the system prompt of each turn. The plan is small (~300 tokens), but its persistence in context drives follow-through rates closer to 80%.

3. Hard step boundary. Use a stop sequence or tool-call gate so the agent cannot proceed to step N+1 without an explicit "step N complete" tool emission. A skipped step then shows up as a blocked transition instead of being silently abandoned.
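The three fixes compose into one small piece of loop scaffolding. The sketch below is framework-agnostic and makes several assumptions: `GatedPlan`, `build_turn_prompt`, and the 4-turn interval are illustrative choices, the model call itself is left out, and the "step N complete" signal is assumed to arrive as a tool call that your loop routes to `complete()`.

```python
from dataclasses import dataclass

STATE_CHECK_EVERY = 4  # assumption: any value in the 3-5 turn range
STATE_CHECK_PROMPT = (
    "Summarize current repo state and list known callers of files "
    "you intend to edit."
)


@dataclass
class GatedPlan:
    steps: list[str]
    current: int = 0  # index of the only step the agent may work on

    def render(self) -> str:
        """Render the plan as a checklist for re-injection each turn."""
        return "\n".join(
            f"[{'x' if i < self.current else ' '}] {i + 1}. {step}"
            for i, step in enumerate(self.steps)
        )

    def complete(self, step_number: int) -> bool:
        """Hard step boundary: accept completion only for the current step."""
        if step_number != self.current + 1:
            return False
        self.current += 1
        return True

    @property
    def done(self) -> bool:
        return self.current >= len(self.steps)


def build_turn_prompt(base_system: str, plan: GatedPlan, turn: int) -> str:
    """Assemble the system prompt for one turn (turns are numbered from 1)."""
    parts = [base_system, "Original plan:\n" + plan.render()]  # fix 2
    if turn % STATE_CHECK_EVERY == 0:
        parts.append(STATE_CHECK_PROMPT)  # fix 1
    if not plan.done:
        parts.append(
            f"Work only on step {plan.current + 1}. "
            f"Emit 'step {plan.current + 1} complete' when finished."
        )  # fix 3
    return "\n\n".join(parts)
```

Keeping the gate in your own code rather than in the prompt is the design choice that matters here: the model can ignore an instruction, but it cannot advance `current` without the loop accepting the completion signal.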

Why This Matters for Benchmark Trust

SWE-Bench and Terminal-Bench tasks are short, usually one or two coherent edits. They don't capture multi-turn perception or 10-turn follow-through. A model can score 87% on SWE-Bench Pro and still execute only about half of its own plans on a real one-hour task. That gap is invisible until you pay for it.

Wilkinson's tournament is one of the first benchmarks that does measure agent persistence. Expect more like it — and expect them to expose more cost overruns hidden behind marketed scores. Until then, the three fixes above are the cheapest insurance.

Frequently Asked Questions

Are these failure rates specific to Civilization VI?

The benchmark is Civ, but the underlying failure modes — sparse global state checks and incomplete plan execution — are well documented in coding agent traces too. Wilkinson's contribution is quantifying them with clean data.

Which model performed best on perception and execution?

Claude Opus 4.6 led on plan execution (closer to 66%), Gemini 3.1 Pro on state-check frequency. No model approached human levels on either dimension.

Does extended thinking mode help?

Partially. Extended thinking improves single-turn reasoning but does not fix multi-turn plan persistence. The plan re-injection fix is cheaper and more reliable than throwing thinking budget at the problem.

How do I instrument plan follow-through in my own agent?

Tag every plan emission with an ID, then track whether each numbered step produces a corresponding completion event within N turns. Most agent frameworks (Claude Code, Cursor, Codex) expose this via OpenTelemetry hooks.
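The tracking logic itself is independent of any telemetry backend. Here is a minimal in-memory sketch, assuming your agent loop can call a function when a plan is emitted and when a step completes; `FollowThroughTracker` and its method names are hypothetical, and wiring it to real telemetry events is left to your framework.

```python
from dataclasses import dataclass, field


@dataclass
class PlanRecord:
    issued_turn: int
    total_steps: int
    # Maps step number -> turn on which it was first reported complete.
    completed: dict[int, int] = field(default_factory=dict)


class FollowThroughTracker:
    def __init__(self, window: int = 10):
        self.window = window  # N turns allowed for a step to count
        self.plans: dict[str, PlanRecord] = {}

    def plan_emitted(self, plan_id: str, turn: int, total_steps: int) -> None:
        self.plans[plan_id] = PlanRecord(turn, total_steps)

    def step_completed(self, plan_id: str, step: int, turn: int) -> None:
        record = self.plans.get(plan_id)
        if record is None or not 1 <= step <= record.total_steps:
            return  # ignore events for unknown plans or invalid steps
        record.completed.setdefault(step, turn)  # keep the first completion

    def follow_through(self, plan_id: str) -> float:
        """Fraction of planned steps completed within the window."""
        record = self.plans[plan_id]
        if record.total_steps == 0:
            return 0.0
        on_time = sum(
            1 for turn in record.completed.values()
            if turn - record.issued_turn <= self.window
        )
        return on_time / record.total_steps


tracker = FollowThroughTracker(window=10)
tracker.plan_emitted("plan-1", turn=1, total_steps=4)
tracker.step_completed("plan-1", step=1, turn=2)
tracker.step_completed("plan-1", step=2, turn=5)
tracker.step_completed("plan-1", step=3, turn=20)  # too late to count
rate = tracker.follow_through("plan-1")  # 2 of 4 steps landed in the window
```

A rate computed this way is directly comparable to the 48-66% range reported for the tournament, provided you use the same 10-turn window.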
