Eval-Driven Prompt Debugging: How Anthropic Engineers Cut Production Costs With XML Tags and Tool-Use Math

By Eric Bush · June 30, 2026 · 8 min read

The Talk That Reframed Prompt Engineering as Maintenance Work

At Code with Claude, Anthropic Applied AI engineer Margot Van Laar shared a pattern that surprised many in the audience: she spends most of her time debugging and maintaining production prompts, not writing new ones. The implication is that the cost story of LLM-powered apps is dominated by the production lifecycle of prompts, not the green-field design phase.

Her core thesis, in one line: evaluation is the only rigorous way to know your prompt is working. Everything else — eyeballing outputs, "looks good to me," gut-check on a few examples — is guesswork. And guesswork in production prompts is what drives cost surprises.

Two Scenarios, Two Patterns

Van Laar walked through two patterns from real customer engagements:

Scenario 1: Customer-service bot maintenance. The existing prompt had accumulated cruft over multiple model upgrades: explicit "do not do X" instructions written for older Claude versions, which newer versions then over-fit to. Van Laar's fix was to replace the legacy prohibitions with XML-tagged structure that gives the model clean decision boundaries, using sections such as <tools>, <escalation_criteria>, and <handoff_format>. Each section is independently testable.
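
To make the XML-tagged structure concrete, here is a minimal sketch of assembling such a prompt from named sections. Only the three tag names come from the talk summary above; the `build_prompt` helper and the section contents are hypothetical, made up for illustration.

```python
def build_prompt(sections: dict[str, str]) -> str:
    """Wrap each named section in its own XML tag.

    Keeping sections in a dict means one section can be edited and
    re-evaluated without touching the others.
    """
    parts = []
    for tag, body in sections.items():
        parts.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n\n".join(parts)


# Hypothetical section contents (not from the talk).
SECTIONS = {
    "tools": "lookup_order(order_id): returns order status and tracking.",
    "escalation_criteria": "Hand off when the customer disputes a charge.",
    "handoff_format": "Two-sentence summary, then the steps already tried.",
}

prompt = build_prompt(SECTIONS)
print(prompt)
```

Because each section lives under its own key, an eval can swap in a revised `escalation_criteria` body and measure the effect in isolation.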

The cost angle: when the prompt spells out exactly when the model should call a tool and when it should hand off to a human, you stop paying for tool calls the model shouldn't make (output tokens plus tool execution) and for handoffs on cases the model could have closed (human time). For one customer this cut the bot's per-conversation cost by roughly 35%.

Scenario 2: Retail scheduling agent from scratch. Instead of one monolithic prompt, the task is decomposed into three simpler prompts (generate, evaluate, repair), each pinned to its own model and eval. A stronger reasoning model (Opus-class) handles the generation step; cheaper models handle evaluation and repair. The result is a pipeline where the most expensive model only runs on the hardest sub-task.
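
The generate / evaluate / repair split can be sketched as a small loop. This is an illustration under stated assumptions, not Van Laar's implementation: the model names are placeholders, the `call` function stands in for whatever client you use, and the "PASS" verdict convention and retry limit are invented for the example.

```python
from typing import Callable

# Any function taking (model_name, prompt) and returning the model's text.
ModelCall = Callable[[str, str], str]

# Placeholder tier names, not real model identifiers.
GENERATE_MODEL = "strong-reasoning-model"
CHEAP_MODEL = "cheap-model"


def schedule_pipeline(request: str, call: ModelCall, max_repairs: int = 2) -> str:
    """Generate once on the expensive model, then evaluate and repair cheaply."""
    draft = call(GENERATE_MODEL, f"<task>generate</task>\n<request>{request}</request>")
    for attempt in range(max_repairs + 1):
        verdict = call(CHEAP_MODEL, f"<task>evaluate</task>\n<schedule>{draft}</schedule>")
        # Assumed convention: the evaluator replies "PASS" or a list of issues.
        if verdict.strip().upper().startswith("PASS"):
            return draft
        if attempt < max_repairs:
            draft = call(
                CHEAP_MODEL,
                f"<task>repair</task>\n<schedule>{draft}</schedule>\n<issues>{verdict}</issues>",
            )
    # Still failing after max_repairs; the caller decides what to do next.
    return draft
```

Passing `call` in as a parameter keeps the pipeline testable with a stub, so each of the three prompts can have its own eval without hitting a live model.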

Why XML Tags Save Money

XML tags in prompts work because Claude and most other modern frontier models recognize them as structural boundaries. That has two cost effects:

1. Better tool-call precision. With <tool_descriptions> blocks, the model can refer back to specific tool specs without wading through unrelated context. Wrong-tool calls drop. Each wrong call costs output tokens + the wasted execution cycle.

2. Shorter conversations. When the model has clear handoff criteria in <escalation>, it doesn't churn through redundant clarification turns. Each saved turn is 2K-4K input tokens you don't pay for on the next call.
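
The saved-turn arithmetic is easy to run for your own traffic. The sketch below is a rough estimate only: the function name, the traffic volume, and the price per million input tokens are all assumptions chosen for illustration, so substitute your own numbers.

```python
def monthly_turn_savings(
    conversations_per_month: int,
    turns_saved_per_conversation: float,
    input_tokens_per_turn: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Estimate dollars saved per month from shorter conversations."""
    tokens_saved = (
        conversations_per_month
        * turns_saved_per_conversation
        * input_tokens_per_turn
    )
    return tokens_saved / 1_000_000 * usd_per_million_input_tokens


# Assumed inputs: 10,000 conversations, one turn saved in each,
# 3,000 input tokens per turn (mid-range of 2K-4K), $3 per million tokens.
savings = monthly_turn_savings(10_000, 1, 3_000, 3.0)
print(f"${savings:.2f} saved per month")
```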

The Cost of NOT Doing Evals

Without an eval suite, three failure modes are common in production prompts:

| Failure Mode | Typical Cost Impact | Detection Difficulty |
| --- | --- | --- |
| Over-cautious model (handoffs spike) | 10-25% conversation cost increase | Easy if you track handoff rate |
| Tool over-calling (loop on minor disagreement) | 2-5× output token use | Medium (needs per-turn metrics) |
| Silent regression after model upgrade | 15-50% quality drop, cost flat | Hard without an eval suite |
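
The first two failure modes show up in metrics you can compute from conversation logs. Here is a minimal sketch, assuming a hypothetical `Conversation` record with turn counts, tool-call counts, and a handoff flag; your logging schema will differ.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    turns: int          # number of model turns in the conversation
    tool_calls: int     # total tool calls made across those turns
    handed_off: bool    # True if the conversation ended in a human handoff


def handoff_rate(convs: list[Conversation]) -> float:
    """Fraction of conversations escalated to a human."""
    return sum(c.handed_off for c in convs) / len(convs)


def tool_calls_per_turn(convs: list[Conversation]) -> float:
    """Average tool calls per model turn; a jump suggests over-calling."""
    return sum(c.tool_calls for c in convs) / sum(c.turns for c in convs)


# Hypothetical log sample.
logs = [
    Conversation(turns=4, tool_calls=2, handed_off=False),
    Conversation(turns=3, tool_calls=1, handed_off=False),
    Conversation(turns=2, tool_calls=0, handed_off=True),
    Conversation(turns=3, tool_calls=3, handed_off=False),
]
print(handoff_rate(logs), tool_calls_per_turn(logs))
```

Tracking both numbers per prompt version gives you a baseline to compare against after each change. The third failure mode, silent regression, still needs an eval suite because neither metric moves.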

Building a Minimal Eval Suite

For a production prompt that handles meaningful traffic, the minimum viable eval setup:

1. A test set of 50-200 examples covering happy paths, edge cases, and known historical bugs. Each example labeled with expected outcome (which tool, which handoff, which response category).

2. Three metrics tracked per eval run: correctness rate (does the model produce the expected outcome?), token cost per example, and tool-call accuracy (did the model call the right tools?).

3. A pre-merge gate. Any change to the production prompt must pass the eval suite before deploying. If correctness drops more than 3% or cost rises more than 10%, the change is blocked.
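
The three steps above can be sketched as a summary function plus a gate. The thresholds are the 3% and 10% from the text; everything else (the record types, field names, and the choice to read "3%" as percentage points and "10%" as a relative increase) is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExampleResult:
    outcome_correct: bool   # model produced the labeled outcome
    tools_correct: bool     # model called the expected tools
    cost_usd: float         # token cost of running this example


@dataclass
class EvalSummary:
    correctness: float
    tool_accuracy: float
    cost_per_example: float


def summarize(results: list[ExampleResult]) -> EvalSummary:
    """Collapse per-example results into the three tracked metrics."""
    n = len(results)
    return EvalSummary(
        correctness=sum(r.outcome_correct for r in results) / n,
        tool_accuracy=sum(r.tools_correct for r in results) / n,
        cost_per_example=sum(r.cost_usd for r in results) / n,
    )


def gate(
    baseline: EvalSummary,
    candidate: EvalSummary,
    max_correctness_drop: float = 0.03,  # assumed: absolute percentage points
    max_cost_rise: float = 0.10,         # assumed: relative to baseline cost
) -> list[str]:
    """Return the reasons to block the change; an empty list means it passes."""
    reasons = []
    if baseline.correctness - candidate.correctness > max_correctness_drop:
        reasons.append("correctness dropped more than allowed")
    if candidate.cost_per_example > baseline.cost_per_example * (1 + max_cost_rise):
        reasons.append("cost per example rose more than allowed")
    return reasons
```

In CI, a non-empty list from `gate` would fail the check on any PR that touches the prompt.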

Cost of the Eval Suite Itself

For a 100-example eval set with average per-call cost of $0.05:

One full eval run costs $5. Running it on every PR that changes the prompt comes to maybe $50-200/month for an active team. Compare that with the alternative: a silent regression that adds 25% to your conversation cost on a $5,000/month bill wastes $1,250/month, so the eval suite pays for itself in days.
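
The break-even claim can be checked with a few lines of arithmetic. The dollar figures below are the ones quoted above; the 30-day month and the `break_even_days` helper are assumptions added for the calculation.

```python
EXAMPLES = 100
COST_PER_CALL_USD = 0.05
run_cost = EXAMPLES * COST_PER_CALL_USD          # one full eval run

MONTHLY_BILL_USD = 5_000
REGRESSION_OVERHEAD = 0.25
monthly_waste = MONTHLY_BILL_USD * REGRESSION_OVERHEAD


def break_even_days(monthly_eval_spend: float, monthly_waste: float,
                    days_per_month: int = 30) -> float:
    """Days of avoided regression cost needed to cover a month of eval runs."""
    return monthly_eval_spend / (monthly_waste / days_per_month)


low = break_even_days(50, monthly_waste)
high = break_even_days(200, monthly_waste)
print(f"run: ${run_cost:.2f}, waste: ${monthly_waste:.0f}/month")
print(f"break-even: {low:.1f} to {high:.1f} days")
```

At the quoted spend range this works out to roughly 1.2 to 4.8 days of avoided waste per month of eval runs.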

The Practical Playbook

For existing production prompts: Run a one-time audit. List every "don't do X" instruction. For each, ask: was this written for a model version we no longer use? If yes, remove it and re-evaluate. Most prompts shrink by 20-40% with this pass alone.
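
The first step of the audit, listing every "don't do X" instruction, can be partly automated. This is a rough sketch: the phrase list in the regex is an assumption and will miss prohibitions worded differently, so treat the output as a starting point for manual review.

```python
import re

# Assumed phrase list; extend it to match how your prompts are written.
PROHIBITION = re.compile(
    r"\b(do not|don['’]t|never|must not|avoid)\b", re.IGNORECASE
)


def find_prohibitions(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for each line that looks like a prohibition."""
    return [
        (number, line.strip())
        for number, line in enumerate(prompt.splitlines(), start=1)
        if PROHIBITION.search(line)
    ]


# Hypothetical legacy prompt.
legacy_prompt = """You are a support agent for an online store.
Do NOT mention competitors.
Answer in a friendly tone.
Never promise refunds."""

for number, line in find_prohibitions(legacy_prompt):
    print(f"line {number}: {line}")
```

Each flagged line then gets the question from the playbook: was it written for a model version you no longer use?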

For new prompts: Start with XML structure from day one. Build the eval set before you optimize the prompt. The eval set is the contract; the prompt is the implementation.

For model upgrades: Re-run the eval suite on the new model before flipping traffic. A model that scores 5% lower at half the cost is a win; a model that scores 2% higher at twice the cost might not be.
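
One way to put the upgrade trade-off on a single scale is cost per correct outcome. This metric is an assumption for illustration, not something the talk prescribes, and the baseline numbers are made up; it also ignores the downstream cost of a wrong answer, which may matter more for your workload.

```python
def cost_per_correct(cost_per_example_usd: float, correctness: float) -> float:
    """Dollars spent per example that produced the expected outcome."""
    return cost_per_example_usd / correctness


# Hypothetical baseline: 90% correct at $0.05 per example.
baseline = cost_per_correct(0.05, 0.90)
# 5% lower score (relative) at half the cost.
cheaper = cost_per_correct(0.025, 0.90 * 0.95)
# 2% higher score (relative) at twice the cost.
pricier = cost_per_correct(0.10, 0.90 * 1.02)

print(f"baseline ${baseline:.4f}, cheaper ${cheaper:.4f}, pricier ${pricier:.4f}")
```

Under these assumed numbers the cheaper model comes out ahead of the baseline and the pricier one behind it, which matches the rule of thumb above.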

The headline takeaway from Van Laar's talk: your prompt is a long-lived artifact, and the cost of running it dominates the cost of writing it. Treat it like production code with tests, version control, and regression gates. The teams that do this consistently spend 30-50% less on the same workload than teams that don't.

Frequently Asked Questions

Why XML tags specifically in Claude prompts?

Claude's training data is heavy with XML-tagged content, so the model reliably recognizes tags as structural boundaries. Other formats (inline JSON, Markdown headers) also work, but XML tags are the format Anthropic's published guidance recommends most consistently.

How big should an eval suite be?

Start with 50-100 examples covering happy paths, edge cases, and known historical bugs. Add examples each time you find a real production failure. For high-traffic prompts, scale to 200-500 examples over time.

What's the cost of running a full eval pass before each prompt change?

For a 100-example suite at $0.05 per example, one full eval run costs about $5. Running it before each PR that touches the prompt comes to $50-200/month for an active team. It pays for itself by catching a single regression.

Should I split one complex prompt into multiple smaller prompts?

Often yes. Van Laar's retail scheduling case decomposed into generate / evaluate / repair, with the expensive reasoning model only on the generation step. The pattern works because cheaper models can handle structured validation and repair tasks, isolating frontier-tier spend to the hardest sub-task.