How to Reduce LLM Token Costs by 90% with Smart Model Routing
May 14, 2026 · 6 min read
The Core Insight: Not Every Token Needs a Frontier Model
The single biggest waste in AI coding budgets is sending every request to the same expensive model. When you use Claude Opus 4.7 at $5/$25 per million tokens to generate a docstring, you are paying up to 100x more than necessary. That same docstring could be written by GPT-5 Nano at $0.05/$0.40 with comparable quality.
Smart model routing is the practice of automatically classifying each request by complexity and sending it to the cheapest model capable of handling it well. The result: you maintain quality on hard tasks while paying pennies on easy ones. In practice, this reduces blended costs by 80-90% compared to always using a frontier model.
The Three-Tier Routing Architecture
A practical routing system divides tasks into three tiers based on cognitive complexity:
- Tier 1 - Mechanical tasks (60-70% of requests): Code comments, docstrings, variable renaming, simple formatting, boilerplate generation, test name generation, type annotations for obvious types. Route to: GPT-5 Nano ($0.05/$0.40) or Gemini 2.0 Flash ($0.10/$0.40).
- Tier 2 - Standard implementation (20-30% of requests): Writing functions from clear specs, implementing CRUD endpoints, standard refactoring, unit test implementation, bug fixes with obvious causes. Route to: DeepSeek V4 Flash ($0.14/$0.28) or GPT-4.1 mini ($0.40/$1.60).
- Tier 3 - Complex reasoning (5-10% of requests): Architecture decisions, complex debugging with subtle causes, performance optimization, security review, system design. Route to: Claude Opus 4.7 ($5/$25) or GPT-5.5 ($5/$30).
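The tier table above can be encoded directly. A minimal sketch: the model identifiers and the single-model-per-tier choice are illustrative assumptions, not fixed API names.

```python
# Illustrative tier table; model names are placeholders for this sketch.
# Prices are dollars per million tokens as (input, output).
TIERS = {
    1: ("gpt-5-nano", 0.05, 0.40),
    2: ("deepseek-v4-flash", 0.14, 0.28),
    3: ("claude-opus-4.7", 5.00, 25.00),
}

def pick_model(tier: int) -> str:
    """Return the model configured for a tier; unknown tiers escalate to Tier 3."""
    model, _in_price, _out_price = TIERS.get(tier, TIERS[3])
    return model
```

Defaulting unknown tiers upward reflects the "err on the side of escalation" principle: a misclassified request costs more money, not quality.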
Concrete Cost Calculation: A Real Project
Let's model a typical full-stack feature build: 200 total requests over a 2-day sprint, averaging 2,000 input tokens and 1,000 output tokens per request. Here is the cost comparison:
| Strategy | Model(s) | Total Cost | Savings vs All-Opus |
|---|---|---|---|
| All Frontier | Claude Opus 4.7 only | $7.00 | baseline |
| All Mid-Tier | DeepSeek V4 Flash only | $0.11 | 98% |
| Smart Routing | 3-tier mix | $0.79 | 89% |
The smart routing breakdown: 130 Tier 1 requests via GPT-5 Nano (cost: $0.07), 50 Tier 2 requests via DeepSeek V4 Flash (cost: $0.03), and 20 Tier 3 requests via Claude Opus 4.7 (cost: $0.70). Total: $0.79 instead of $7.00, a reduction of nearly 90% while preserving frontier quality where it matters most.
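The arithmetic is easy to verify in a few lines. Request counts, token counts, and prices all come from the scenario above:

```python
def cost(n_requests, in_tok, out_tok, in_per_m, out_per_m):
    """Dollar cost of n_requests at the given per-million-token prices."""
    return n_requests * (in_tok * in_per_m + out_tok * out_per_m) / 1_000_000

# Sprint assumptions from the text: 2,000 input / 1,000 output tokens per request.
tier1 = cost(130, 2000, 1000, 0.05, 0.40)    # GPT-5 Nano
tier2 = cost(50, 2000, 1000, 0.14, 0.28)     # DeepSeek V4 Flash
tier3 = cost(20, 2000, 1000, 5.00, 25.00)    # Claude Opus 4.7
all_opus = cost(200, 2000, 1000, 5.00, 25.00)

routed = tier1 + tier2 + tier3
print(f"${tier1:.3f} + ${tier2:.3f} + ${tier3:.2f} = ${routed:.2f}")
print(f"savings vs all-Opus (${all_opus:.2f}): {1 - routed / all_opus:.0%}")
# → $0.065 + $0.028 + $0.70 = $0.79
# → savings vs all-Opus ($7.00): 89%
```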
How to Classify Request Complexity
The routing decision itself needs to be fast and cheap. Here are practical classification approaches:
- Keyword-based rules: The simplest approach. Requests containing "add comment," "rename," "format," or "type annotation" go to Tier 1. Requests with "implement," "write test," or "refactor" go to Tier 2. Anything with "architect," "debug," "security," or "optimize performance" goes to Tier 3.
- Token-count heuristics: Shorter prompts (under 500 tokens) are often simple tasks. Prompts over 3,000 tokens with multiple files referenced tend to be complex. This correlates surprisingly well with actual difficulty.
- Lightweight classifier model: Use the cheapest available model (GPT-5 Nano at $0.05/M) to read the prompt and output a tier number. At roughly 200 tokens per classification, the overhead is negligible (about $0.00001 per request).
- Confidence-based escalation: Send everything to Tier 2 first. If the model's response includes uncertainty markers ("I'm not sure," low confidence patterns) or the output fails validation, automatically escalate to Tier 3.
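The keyword-based rules from the first bullet can be sketched as a small classifier. The exact patterns and the Tier 2 default below are assumptions to tune against your own request logs:

```python
import re

# Keyword rules from the tier descriptions; Tier 3 is checked first so that
# ambiguous prompts err toward escalation rather than under-routing.
TIER_RULES = [
    (3, re.compile(r"architect|debug|security|optimi[sz]e performance", re.I)),
    (2, re.compile(r"implement|write test|refactor", re.I)),
    (1, re.compile(r"add comment|rename|format|type annotation", re.I)),
]

def classify(prompt: str, default: int = 2) -> int:
    """Return the first matching tier; unmatched requests default to Tier 2."""
    for tier, pattern in TIER_RULES:
        if pattern.search(prompt):
            return tier
    return default
```

A token-count heuristic can be layered on top: bump any match to Tier 3 when the prompt exceeds ~3,000 tokens, since long multi-file prompts correlate with genuinely hard tasks.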
Tools That Already Implement Routing
You don't have to build routing from scratch. Several platforms offer this today:
- OpenRouter: Their "auto" routing option automatically selects models based on prompt complexity. You can also define custom routing rules based on model capabilities and pricing. Their Pareto-optimal endpoint specifically optimizes for the quality/cost frontier.
- Martian: Offers intelligent routing that analyzes prompts and routes to the best-performing model for each specific query type, balancing quality and cost automatically.
- Custom middleware: For maximum control, build a lightweight proxy that sits between your agent and the API. Classify incoming requests, route to the appropriate model, and log outcomes for continuous optimization.
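The custom-middleware option can be as small as a classify-dispatch-log wrapper. This sketch assumes you inject your own classifier and API client as callables; the model names in `MODEL_BY_TIER` are placeholders:

```python
MODEL_BY_TIER = {1: "gpt-5-nano", 2: "deepseek-v4-flash", 3: "claude-opus-4.7"}

def route(prompt, classify, call_model, log=print):
    """Classify the request, dispatch to that tier's model, log for tuning."""
    tier = classify(prompt)
    model = MODEL_BY_TIER[tier]
    reply = call_model(model, prompt)
    log(f"tier={tier} model={model} prompt_chars={len(prompt)}")
    return reply

# Usage with stub callables standing in for a real API client:
reply = route("rename this variable",
              classify=lambda p: 1,
              call_model=lambda model, p: f"[{model}] ok",
              log=lambda line: None)
print(reply)  # → [gpt-5-nano] ok
```

Logging tier, model, and outcome on every request is what enables the continuous optimization mentioned above: you can replay the log to see whether a cheaper tier would have sufficed.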
Pitfalls and How to Avoid Them
Smart routing is not without risks. Here are common failure modes:
- Under-routing: Sending a genuinely complex task to a cheap model produces bad code that takes more turns to fix, ultimately costing more. Solution: err on the side of escalation, and implement automatic fallback when quality drops.
- Context fragmentation: If different models handle different steps of the same task, you lose conversational continuity. Solution: keep complex multi-turn conversations on a single model; only route independent, stateless requests to cheap models.
- Latency variability: Cheap models may have different latency profiles. Solution: monitor p95 latency per tier and adjust routing if certain models introduce unacceptable delays.
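The automatic fallback suggested under the under-routing bullet pairs naturally with confidence-based escalation. A minimal sketch, assuming string outputs and caller-supplied model callables; the uncertainty markers are illustrative, not an exhaustive list:

```python
# Hypothetical uncertainty markers; extend from your own observed failures.
UNCERTAINTY_MARKERS = ("i'm not sure", "i am not sure", "cannot determine")

def answer_with_escalation(prompt, call_tier2, call_tier3,
                           validate=lambda text: True):
    """Try the mid-tier model first; escalate to Tier 3 on weak output."""
    draft = call_tier2(prompt)
    hedged = any(marker in draft.lower() for marker in UNCERTAINTY_MARKERS)
    if hedged or not validate(draft):
        return call_tier3(prompt)  # automatic fallback to the frontier model
    return draft
```

The `validate` hook is where objective checks belong: run the generated code through a linter, type checker, or test suite, and escalate on failure rather than paying for extra repair turns on the cheap model.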
Calculate Your Routing Savings
The exact savings depend on your task distribution. If most of your work is boilerplate-heavy (frontend components, CRUD APIs), you'll route 70%+ to Tier 1 and save over 90%. If your work is mostly architecture and debugging, the savings will be lower but still significant at 50-70%. The key is measuring your actual task distribution and modeling the cost difference. Use the AI Cost Estimator to compare costs across all available models and identify which tier each model falls into for your specific workflow.
Want to calculate exact costs for your project?
Estimate Your AI Coding Costs →