Simon Willison's Sonnet/Haiku Delegation Trick — Put It in Claude Memory to Cut Fable Tokens

By Eric Bush · July 4, 2026 · 8 min read


The Story

At AI Engineer World's Fair this week, Simon Willison spent time with the Claude Code team and came away with two closely related workflow tips. The first is philosophical: let Fable and Opus use judgment rather than following hard-coded rules. The second is operational, and it is where the real savings live: Jesse Vincent's technique of instructing Fable to delegate work to smaller models based on task shape.

Willison stored the resulting prompt in his Claude Code memory file and reported on his blog that Fable token burn dropped noticeably in real work. With Fable 5 pricing set to rise again, this is one of the cheapest optimizations a solo developer or a small team can adopt today. The change is a few lines of prompt text.

The Prompt Pattern

The core instruction sits in CLAUDE.md (or the equivalent memory file for your harness) and looks something like this:

When a task is:
- Mechanical (renames, regex substitutions, formatting fixes) → dispatch to Haiku
- Substantive implementation (writing new modules, refactors within one file) → dispatch to Sonnet
- Judgment, review, architectural decisions, and multi-file synthesis → keep in the main loop (Fable/Opus)

Only escalate back to Fable when you notice the smaller model is producing wrong or low-quality output.

The instruction does not enumerate specific rules. It gives the model a taxonomy and asks it to classify each incoming subtask. That is the philosophical tip in action: judgment is the thing you want the frontier model to spend tokens on, not micro-management.
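
To make the taxonomy concrete, here is a minimal Python sketch of the decision the prompt asks the model to make. It is only an illustration: in practice the frontier model does this classification itself from the memory-file text, and the shape labels ("mechanical", "implementation", "judgment"), the `Tier` names, and the `route` function are all hypothetical.

```python
from enum import Enum


class Tier(Enum):
    HAIKU = "haiku"
    SONNET = "sonnet"
    MAIN_LOOP = "fable/opus"


# Hypothetical task-shape labels mirroring the three buckets in the prompt.
ROUTING = {
    "mechanical": Tier.HAIKU,       # renames, regex substitutions, formatting fixes
    "implementation": Tier.SONNET,  # new modules, refactors within one file
    "judgment": Tier.MAIN_LOOP,     # review, architecture, multi-file synthesis
}


def route(task_shape: str, low_quality_output: bool = False) -> Tier:
    """Pick a tier for a subtask; unknown shapes stay in the main loop."""
    if low_quality_output:
        # Escalation rule from the prompt: bad sub-model output goes back up.
        return Tier.MAIN_LOOP
    return ROUTING.get(task_shape, Tier.MAIN_LOOP)


print(route("mechanical").value)
print(route("implementation", low_quality_output=True).value)
```

Defaulting unknown shapes to the main loop is a design choice in this sketch, not something the prompt states: it errs toward spending frontier tokens when the classification is unclear.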

Why the Math Works

Anthropic's pricing spread across the family is roughly:

Tier	Input $/M	Output $/M	Typical fit
Fable 5	$10	$50	Judgment, review, synthesis
Sonnet 5	$2	$10	Implementation of well-scoped tasks
Haiku 4.5	$1	$5	Mechanical rewrites, regex, single-file edits

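A typical coding-agent trajectory splits roughly 30% judgment / 50% implementation / 20% mechanical by token volume. Under the delegation pattern, only the 30% judgment portion pays Fable rates; the rest shifts to Sonnet at 0.20x the Fable price ($2 vs. $10 input, $10 vs. $50 output) and Haiku at 0.10x ($1 vs. $10 input, $5 vs. $50 output):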

Baseline (everything on Fable): 1.0x cost.
Delegated: 0.30 × 1.0 + 0.50 × 0.20 + 0.20 × 0.10 = 0.42x cost.

That is a ~58% reduction in the raw arithmetic. Real-world savings are closer to 40-50% because delegation itself costs some Fable tokens (the model has to read and reason about each subtask before deciding where to route), and because some sub-model outputs get rejected and reissued. Even so, cutting the bill by 40-50% by editing a memory file is unusual leverage.
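
The raw arithmetic is easy to rerun with your own task mix. The sketch below reproduces the 0.42x figure using the price ratios from the table; the `blended_cost` function and the dictionary keys are illustrative names, and it deliberately ignores routing overhead and rejected outputs, which is why real savings land lower.

```python
# Price relative to Fable, from the pricing table:
# Sonnet is $2 vs. $10 input (0.20x), Haiku is $1 vs. $10 input (0.10x);
# the output prices have the same ratios.
RELATIVE_PRICE = {"fable": 1.0, "sonnet": 0.20, "haiku": 0.10}


def blended_cost(shares: dict[str, float]) -> float:
    """Cost relative to an all-Fable baseline, given each tier's share of work."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * RELATIVE_PRICE[tier] for tier, share in shares.items())


delegated = blended_cost({"fable": 0.30, "sonnet": 0.50, "haiku": 0.20})
print(f"{delegated:.2f}x of baseline, {1 - delegated:.0%} reduction")
# prints "0.42x of baseline, 58% reduction"
```

Changing the shares shows how sensitive the result is to the mix: a judgment-heavy 60/30/10 split, for example, leaves far less to save.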

What Should Never Be Delegated

The technique breaks when delegation targets tasks that require whole-project context or cross-cutting judgment:

Debugging a real bug. Sonnet handles a self-contained reproduction well, but chasing a bug across five files with faded stack traces is a Fable job.
Adversarial review. When you ask "did this diff introduce a regression," you want the frontier model looking for subtle problems.
Architectural decisions. "Should this be a service or a library" is exactly the judgment call you are paying Fable rates for.
Novel API design. Sonnet tends to produce plausible-but-off idioms; Fable's judgment on shape and naming is more reliable.

Add an explicit "do not delegate" list to your memory file. Willison's own version calls out "test writing strategy" and "commit message tone" as owned by the main loop.

Setup: Four Steps

In Claude Code the entire setup is:

1. Open your project's CLAUDE.md.
2. Add the delegation instruction above at the top under a heading like "Model routing".
3. Optionally add a "do not delegate" list of task shapes you always want Fable to handle.
4. Confirm the harness supports sub-model dispatch (Claude Code and Cursor Composer both do; Aider and Continue do not natively, so you would need a router layer like OpenRouter).
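
If you maintain several repositories, the first two setup steps can be scripted. The following is a rough sketch, not an official tool: the `ensure_routing_section` helper is hypothetical, the markdown heading level (`##`) is an assumption, and the demo runs against a temporary file rather than a real CLAUDE.md.

```python
import tempfile
from pathlib import Path

HEADING = "## Model routing"  # heading name from the setup steps; "##" level is assumed

ROUTING_TEXT = f"""{HEADING}

When a task is:
- Mechanical (renames, regex substitutions, formatting fixes) → dispatch to Haiku
- Substantive implementation (writing new modules, refactors within one file) → dispatch to Sonnet
- Judgment, review, architectural decisions, and multi-file synthesis → keep in the main loop (Fable/Opus)

Only escalate back to Fable when you notice the smaller model is producing wrong or low-quality output.
"""


def ensure_routing_section(memory_file: Path) -> bool:
    """Prepend the routing section if it is missing. Returns True if the file changed."""
    existing = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    if HEADING.lower() in existing.lower():
        return False  # already present, leave the file alone
    memory_file.write_text(ROUTING_TEXT + "\n" + existing, encoding="utf-8")
    return True


# Demo on a throwaway file so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "CLAUDE.md"
    path.write_text("Project notes\n", encoding="utf-8")
    print(ensure_routing_section(path))  # first call adds the section
    print(ensure_routing_section(path))  # second call is a no-op
```

The check is idempotent on purpose, so rerunning the script never stacks duplicate routing sections at the top of the file.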

Measuring the Actual Savings

Do not trust vibes. Anthropic exposes per-model token counts in the API response; Claude Code shows per-session cost breakdowns. Before enabling the delegation prompt, log one week of baseline. After enabling, log one week under the same workload. Compare the ratio of Fable-only tokens to sub-model tokens, and the dollar total.
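
A before/after comparison can be as simple as the sketch below. It assumes you have already logged one record per model call with a tier label plus input and output token counts; the record layout, the `tier` field, and the sample numbers are hypothetical, and the prices are the per-million rates from the table above.

```python
from collections import defaultdict

# Dollars per million tokens (input, output), from the pricing table in this article.
PRICES = {"fable": (10.0, 50.0), "sonnet": (2.0, 10.0), "haiku": (1.0, 5.0)}


def summarize(records):
    """Total tokens and dollars per tier from logged usage records."""
    totals = defaultdict(lambda: {"tokens": 0, "dollars": 0.0})
    for rec in records:
        in_price, out_price = PRICES[rec["tier"]]
        entry = totals[rec["tier"]]
        entry["tokens"] += rec["input_tokens"] + rec["output_tokens"]
        entry["dollars"] += (
            rec["input_tokens"] * in_price + rec["output_tokens"] * out_price
        ) / 1_000_000
    return dict(totals)


def fable_share(summary):
    """Fraction of all tokens that were billed at Fable rates."""
    total = sum(entry["tokens"] for entry in summary.values())
    return summary.get("fable", {"tokens": 0})["tokens"] / total if total else 0.0


def total_dollars(summary):
    return sum(entry["dollars"] for entry in summary.values())


# Made-up sample weeks, for illustration only.
baseline = summarize([{"tier": "fable", "input_tokens": 1_000_000, "output_tokens": 200_000}])
delegated = summarize([
    {"tier": "fable", "input_tokens": 300_000, "output_tokens": 60_000},
    {"tier": "sonnet", "input_tokens": 500_000, "output_tokens": 100_000},
    {"tier": "haiku", "input_tokens": 200_000, "output_tokens": 40_000},
])

print(f"Fable share: {fable_share(baseline):.0%} -> {fable_share(delegated):.0%}")
print(f"Spend: ${total_dollars(baseline):.2f} -> ${total_dollars(delegated):.2f}")
```

The sample weeks are constructed to match the 30/50/20 split, so they show the idealized 0.42x outcome; real logs will include routing overhead and reissued subtasks.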

If your ratio does not shift meaningfully, one of three things is true:

Your work is judgment-heavy — the pattern will not save much.
Your prompt is too vague and Fable is not routing tasks; make the taxonomy more concrete.
Your harness does not actually dispatch — verify the sub-model calls in logs.

The Bigger Pattern

Model-family delegation is the second major cost trick to emerge in 2026, after prompt caching. It works because frontier models are now capable enough to route themselves, and because smaller models have caught up to yesterday's frontier on well-scoped tasks. It will keep working until either sub-models become expensive enough to erase the price gap, or providers start metering delegation with a routing fee.

For now, Willison's memory-file trick is one of the cheapest coding-cost wins available. Copy the pattern, measure the effect, and adjust the taxonomy to your codebase.


Frequently Asked Questions

What is Simon Willison's Fable delegation technique?

It is a prompt saved in Claude Code memory (CLAUDE.md) that instructs Fable to dispatch mechanical work to Haiku, substantive implementation to Sonnet, and keep only judgment, review, and multi-file synthesis in the Fable main loop. Willison reported noticeably slower Fable token burn after adopting it.

How much does the delegation pattern save on Claude coding costs?

Arithmetic says ~58% reduction for a typical 30/50/20 judgment/implementation/mechanical task split. Real-world savings land closer to 40-50% because delegation itself costs some Fable tokens and some sub-model outputs get rejected.

Which tasks should never be delegated to a smaller Claude model?

Multi-file debugging, adversarial code review, architectural decisions, and novel API design. These require whole-project context or subtle judgment that Sonnet or Haiku will get plausibly-wrong more often than Fable.

Do all AI coding harnesses support sub-model delegation?

Claude Code and Cursor Composer support it natively. Aider and Continue currently do not — you would need an external router like OpenRouter or a custom proxy to redirect specific subtasks to smaller models.

How do I verify the delegation prompt is actually saving money?

Log one week of baseline usage before enabling the prompt, then log one week after under similar workload. Compare the ratio of Fable-only tokens to sub-model tokens and the total dollar amount. If the mix has not shifted, the prompt is either too vague or the harness is not dispatching.
