AI Coding Agent Router Design: How Routing 70–80% of Traffic to Local Models Cuts AI Bill 90%

Q: What is an AI coding router?

An infrastructure layer that decides — for each incoming task — which model tier should handle it (local, async batch, or real-time cloud) and which specific model within that tier.

Q: Can I really cut 90% of my AI spend with routing?

Yes, at scale and with skill distillation. Without distillation, expect 40–60% savings. The 90% figure assumes 70–80% of traffic ends up on local or async lanes, which requires investment in fine-tuned local models.

Q: Do I need my own local models to build a router?

No — a router that only routes between DeepSeek V4-Flash (cheap hosted) and Claude Sonnet 5 (mid-tier hosted) can still save 50%+ vs an all-Sonnet baseline. Local models add another savings layer but aren't required.

Q: How do I decide what's cheap enough to be 'local-worthy'?

Instrument every request. If a skill category runs at high volume, has predictable shape, and the frontier version costs more than $500/month for that skill alone, it's a candidate for local model routing.

Q: What's the fastest way to start?

Add an OpenRouter or LiteLLM gateway in front of your current model. Even without any custom routing logic, you can configure a simple routing rule like 'if input length < 500 tokens, use DeepSeek; else use Claude'. That single rule often cuts costs 30% before you build anything sophisticated.

By Eric Bush · July 2, 2026 · 10 min read

The Core Insight

Tomasz Tunguz wrote in a July 2026 essay that when building AI agents, you should design the router before you pick the model. The claim: a well-designed router keeps 70–80% of production traffic on free local models or asynchronous inference, cutting AI spend by 90% or more without harming user experience.

Brian Armstrong (Coinbase CEO) has cited a version of this playbook: better defaults, better routing, and better caching cut Coinbase's total AI spend in half even as token usage kept growing. This isn't a theoretical pattern — it's how the most cost-efficient AI products at scale actually work.

Three Layers of a Router

A production-grade AI router has three sequential layers, each answering a different question:

Skill classifier. "What kind of task is this?" Categorizes requests into families like "code generation", "code explanation", "code review", "debugging", "refactoring", "test writing".
Router. "How complex and time-sensitive is this specific instance?" Decides synchronous vs asynchronous, and which tier of model.
Model selector. "Given the tier and skill, which specific model gives the best cost/quality?" Reads real-time latency and cost signals to pick.

Layer 1: The Skill Classifier

The skill classifier is where the biggest savings hide. Calling an LLM just to classify a request is usually a waste: the same job can be done by a 100M-parameter local model, or even by keyword rules at <1ms latency and $0.00 cost per call.

Implementation options, ranked by cost:

Keyword + regex — near-zero cost, sufficient for many well-scoped domains
Small local classifier — DistilBERT-scale, ~$0.0001/call including infrastructure amortization
Small hosted classifier — Cohere Classify, Hugging Face Inference, ~$0.001/call
Frontier LLM classifier — pattern to avoid; costs 100–1000x more than the alternatives
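
To make the cheapest option concrete, here is a minimal sketch of a keyword + regex classifier. The skill names are the families listed earlier; the patterns, their ordering, and the fallback to "code generation" are illustrative assumptions, not rules from any production system, and would need tuning against real traffic.

```python
import re

# Hypothetical rules: first match wins, so more specific skills come first.
SKILL_PATTERNS = [
    ("test writing", re.compile(r"\b(unit tests?|write tests?|test cases?|pytest|jest)\b", re.I)),
    ("debugging", re.compile(r"\b(traceback|stack trace|exception|error|crash|fails?|bug)\b", re.I)),
    ("refactoring", re.compile(r"\b(refactor|clean up|simplify|rename)\b", re.I)),
    ("code review", re.compile(r"\b(review|feedback on|critique)\b", re.I)),
    ("code explanation", re.compile(r"\b(explain|what does|how does|walk me through)\b", re.I)),
]

# Assumed fallback when no rule matches.
DEFAULT_SKILL = "code generation"


def classify_skill(prompt: str) -> str:
    """Return the first skill whose pattern matches the prompt, else the default."""
    for skill, pattern in SKILL_PATTERNS:
        if pattern.search(prompt):
            return skill
    return DEFAULT_SKILL


if __name__ == "__main__":
    for text in [
        "Please write unit tests for parse_date",
        "Why do I get this traceback when I call load()?",
        "Add a retry wrapper around fetch_user",
    ]:
        print(f"{classify_skill(text):<18} <- {text}")
```

Rules like these are brittle on open-ended prompts, which is the usual reason to graduate to a small local classifier once the instrumented data shows where they misfire.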

Layer 2: The Router

Given the skill category, the router splits traffic into three lanes:

Lane	When	Cost / Task
Local (self-hosted)	Straightforward, well-scoped, latency-tolerant	$0.00–$0.001
Async batch	Background, no user waiting	$0.001–$0.01
Real-time cloud	User is waiting, task is nontrivial	$0.01–$1.00

Async batch inference is systematically underused in AI coding products. Doc generation, changelog writing, initial code review passes, and test scaffolding rarely need to happen in the two seconds after a user click — they can happen overnight for a fifth to a tenth of the sync price.
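
The lane table can be read as a small decision function. The sketch below is one possible reading, not a reference implementation: the `Task` fields, the `LOCAL_SKILLS` set, and the order of the checks (cheapest lane first) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    LOCAL = "local"
    ASYNC_BATCH = "async_batch"
    REALTIME_CLOUD = "realtime_cloud"


@dataclass
class Task:
    skill: str          # output of the skill classifier
    user_waiting: bool  # is someone blocked on the answer?
    well_scoped: bool   # straightforward, predictable shape


# Hypothetical: skills a self-hosted model is assumed to handle well.
LOCAL_SKILLS = {"code explanation", "test writing"}


def choose_lane(task: Task) -> Lane:
    """Try the cheapest lane first, then fall through to more expensive ones."""
    if task.well_scoped and task.skill in LOCAL_SKILLS:
        return Lane.LOCAL
    if not task.user_waiting:
        return Lane.ASYNC_BATCH
    return Lane.REALTIME_CLOUD


if __name__ == "__main__":
    print(choose_lane(Task("test writing", user_waiting=True, well_scoped=True)))
    print(choose_lane(Task("code review", user_waiting=False, well_scoped=False)))
    print(choose_lane(Task("debugging", user_waiting=True, well_scoped=False)))
```

Checking the local lane first reflects the cost ordering in the table; a team that cares more about latency than cost might order the checks differently.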

Layer 3: The Model Selector

Within the real-time cloud lane there is one more decision: which specific model? A lightweight predictor that runs synchronously with the request flags high-complexity tasks; simple tasks go to a cheap model and complex ones go to a frontier model.

Complexity signals worth including:

Input length (longer inputs typically need bigger models)
Presence of code (adds difficulty)
Number of chained tool calls expected
User's session history (users with high-complexity patterns get bigger models by default)
Prior failure signal (if a cheap model failed, escalate)
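
One way to combine these signals is a weighted score with a threshold, plus a hard override for prior failures. Everything numeric below (the weights, the 8,000-token and 5-call caps, the 0.5 threshold) is a hypothetical starting point rather than a tuned value, and the model names "cheap" and "frontier" are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    input_tokens: int
    contains_code: bool
    expected_tool_calls: int
    user_complexity_rate: float  # share of this user's past tasks that needed the big model, 0..1
    cheap_model_failed: bool     # prior failure signal for this task


def complexity_score(s: Signals) -> float:
    """Weighted sum in [0, 1]; all weights and caps are illustrative assumptions."""
    score = 0.35 * min(s.input_tokens / 8000, 1.0)
    score += 0.15 if s.contains_code else 0.0
    score += 0.25 * min(s.expected_tool_calls / 5, 1.0)
    score += 0.25 * max(0.0, min(s.user_complexity_rate, 1.0))
    return score


def select_model(s: Signals, threshold: float = 0.5) -> str:
    """Escalate immediately after a cheap-model failure, otherwise use the score."""
    if s.cheap_model_failed:
        return "frontier"
    return "frontier" if complexity_score(s) >= threshold else "cheap"


if __name__ == "__main__":
    small = Signals(300, False, 0, 0.1, False)
    large = Signals(12000, True, 6, 0.8, False)
    print(select_model(small), round(complexity_score(small), 3))
    print(select_model(large), round(complexity_score(large), 3))
```

In practice the weights would be fitted from logged outcomes (for example, which cheap-model answers were rejected or retried) rather than set by hand.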

Cost Math: A Real Distribution

Assume 10,000 tasks/day in a coding product. Without routing (all frontier), the average cost is $0.20/task, which is $2,000/day or $60,000 over a 30-day month. With a three-layer router that pushes 75% of traffic to local/async:

50% local: 5,000 × $0.001 = $5/day
25% async batch: 2,500 × $0.005 = $12.50/day
20% real-time cheap: 2,000 × $0.02 = $40/day
5% real-time frontier: 500 × $0.20 = $100/day
Total: $157.50/day, ~$4,725/month

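92% reduction. The math is dramatic because, in this example, a frontier task costs 200x more than a local one ($0.20 vs $0.001), so every task moved off the frontier tier recovers nearly its full cost. Note that the 5% of traffic still on the frontier tier accounts for about 63% of the routed bill.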
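
The arithmetic is easy to reproduce and to rerun with your own traffic mix. This sketch uses only the shares and per-task costs from the example above, with a 30-day month as the one added assumption; the variable and function names are illustrative.

```python
TASKS_PER_DAY = 10_000
DAYS_PER_MONTH = 30          # assumption used to turn daily cost into monthly cost
FRONTIER_COST = 0.20         # dollars per task, all-frontier baseline

# lane -> (share of traffic, dollars per task), taken from the example above
MIX = {
    "local": (0.50, 0.001),
    "async batch": (0.25, 0.005),
    "real-time cheap": (0.20, 0.02),
    "real-time frontier": (0.05, 0.20),
}


def lane_daily_cost(share: float, cost_per_task: float, tasks_per_day: int) -> float:
    return share * tasks_per_day * cost_per_task


def daily_cost(mix: dict, tasks_per_day: int) -> float:
    return sum(lane_daily_cost(share, cost, tasks_per_day) for share, cost in mix.values())


baseline_monthly = TASKS_PER_DAY * FRONTIER_COST * DAYS_PER_MONTH
routed_daily = daily_cost(MIX, TASKS_PER_DAY)
routed_monthly = routed_daily * DAYS_PER_MONTH
reduction = 1 - routed_monthly / baseline_monthly
frontier_share_of_bill = lane_daily_cost(*MIX["real-time frontier"], TASKS_PER_DAY) / routed_daily

if __name__ == "__main__":
    print(f"baseline: ${baseline_monthly:,.2f}/month")
    print(f"routed:   ${routed_daily:,.2f}/day, ${routed_monthly:,.2f}/month")
    print(f"reduction: {reduction:.1%}")
    print(f"frontier lane share of routed bill: {frontier_share_of_bill:.1%}")
```

Changing the shares in `MIX` is a quick way to see how sensitive the result is to the frontier lane: it carries most of the remaining bill even at 5% of traffic.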

Skill Distillation

The 70–80% number Tunguz cites is aspirational for most teams — it assumes you've done skill distillation, i.e., fine-tuned a local model on the specific skills your product uses most. Without distillation, local models typically handle 30–40% of coding traffic well; with distillation, that rises to 70–80%.

Distillation cost: for a 7B–13B parameter local model, expect $5K–$25K of compute for a good distillation run plus ongoing evaluation and drift detection. That pays back within one month at scale.
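
A rough payback check can be done with the article's own figures. The sketch below assumes the gain attributable to distillation is the difference between the 92% reduction in the cost example and the 40–60% expected without distillation, applied to the $60,000/month baseline. It ignores the ongoing evaluation and drift-detection costs, so treat the result as optimistic; the function names are illustrative.

```python
BASELINE_MONTHLY = 60_000.0               # all-frontier bill from the cost example
REDUCTION_WITH_DISTILLATION = 0.92        # from the cost example
REDUCTION_WITHOUT = (0.40, 0.60)          # range quoted for routing without distillation
DISTILLATION_COST = (5_000.0, 25_000.0)   # one-time compute range quoted above


def incremental_monthly_savings(reduction_without: float) -> float:
    """Extra dollars saved per month that are attributed to distillation (an assumption)."""
    return (REDUCTION_WITH_DISTILLATION - reduction_without) * BASELINE_MONTHLY


def payback_months(one_time_cost: float, monthly_savings: float) -> float:
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return one_time_cost / monthly_savings


# Cheapest run with the largest gain, and most expensive run with the smallest gain.
best_case = payback_months(DISTILLATION_COST[0], incremental_monthly_savings(REDUCTION_WITHOUT[0]))
worst_case = payback_months(DISTILLATION_COST[1], incremental_monthly_savings(REDUCTION_WITHOUT[1]))

if __name__ == "__main__":
    print(f"best case:  {best_case:.2f} months")
    print(f"worst case: {worst_case:.2f} months")
```

Under these assumptions the payback lands between roughly 0.16 and 1.3 months, so the one-month figure holds except when the most expensive run meets the smallest incremental gain.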

Where Routers Fail

Bad training data for the router itself. If the router misclassifies tasks, users get cheap-model output on tasks that needed the frontier tier. Quality regressions look like bugs but are actually routing errors.
Latency inversion. Sometimes a local model is fast in isolation but slow behind a queue, while the hosted frontier is faster end-to-end. Include queue depth in the routing decision.
Silent drift. A skill you distilled six months ago may drift as your product evolves. Nightly evals against a hold-out set are non-negotiable.
Overfitting to cost. A router that saves 92% at 85% user satisfaction is worse than one that saves 70% at 95% satisfaction. Cost is not the only optimization target.
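
The latency-inversion failure in particular is cheap to guard against. Below is a minimal sketch that folds queue depth into a latency estimate before choosing a backend. The `Backend` fields and the simple "drain the queue, then serve" model are assumptions for illustration; a real system would use measured percentiles.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    service_seconds: float  # mean time to serve one request once it starts
    queue_depth: int        # requests already waiting
    workers: int            # requests that can be served in parallel


def estimated_latency(b: Backend) -> float:
    """Rough end-to-end estimate: time for the queue ahead to drain, plus own service time."""
    waves_ahead = b.queue_depth / max(b.workers, 1)
    return waves_ahead * b.service_seconds + b.service_seconds


def fastest(backends: list) -> Backend:
    return min(backends, key=estimated_latency)


if __name__ == "__main__":
    local = Backend("local", service_seconds=0.8, queue_depth=12, workers=2)
    hosted = Backend("hosted-frontier", service_seconds=2.5, queue_depth=0, workers=64)
    for b in (local, hosted):
        print(f"{b.name}: {estimated_latency(b):.1f}s estimated")
    print("fastest:", fastest([local, hosted]).name)
```

This only answers the latency question. A full routing decision would weigh the estimate against cost and quality, in line with the point about not overfitting to cost.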

Implementation Starting Point

For a team currently on all-frontier, a pragmatic three-week rollout:

Week 1: Instrument every call with skill category and input complexity. Build the classifier from that data.
Week 2: Wire in a router that splits traffic 90/10 between frontier and a cheap hosted alternative (Sonnet 5 or DeepSeek V4-Flash). Measure quality delta on cheap-model outputs.
Week 3: Add async batching for background tasks. Roll out local model for the top 2 skill categories where distillation is straightforward.

Even a naive router usually cuts spend 40–50% in month one. Full 90%+ savings takes 3–6 months of iteration.
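
For the Week 2 split, a deterministic hash of the request ID keeps the 90/10 assignment stable and reproducible, which makes the quality comparison easier to audit. This is a sketch under assumptions: the function names and the "cheap-hosted" and "frontier" labels are placeholders, and hashing by user ID instead would keep each user on one model.

```python
import hashlib


def in_cheap_cohort(request_id: str, cheap_percent: int = 10) -> bool:
    """Map the ID to a stable bucket in 0..99 and compare against the cheap share."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < cheap_percent


def pick_model(request_id: str, cheap_percent: int = 10) -> str:
    return "cheap-hosted" if in_cheap_cohort(request_id, cheap_percent) else "frontier"


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    cheap = sum(in_cheap_cohort(r) for r in ids)
    print(f"{cheap / len(ids):.1%} of requests routed to the cheap model")
```

Raising `cheap_percent` over time gives a gradual rollout: IDs already in the cheap cohort stay there, because their bucket does not change.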

Bottom Line

Router design is where AI coding cost engineering pays back the fastest. Every dollar spent on routing infrastructure typically saves 10–50 dollars in model API cost within twelve months. If your team is not thinking about routing today, that gap is where your competitors are pulling ahead.
