AI Coding Cost Observability: How to Track Tokens by Agent, Tool, and Workflow
May 24, 2026 · 6 min read
The AI Coding Bill Is Not One Number
AI coding cost observability means tracking where token usage comes from, not just how many tokens were used. A monthly bill might say the team spent $800 on model usage, but that number is not actionable. Was the spend caused by code generation, code review, tests, long context, screenshots, MCP tools, or subagents?
Without attribution, teams either ignore the bill or overreact by restricting all AI usage. The better approach is to measure cost by workflow and outcome. Some expensive workflows are worth it. Some cheap workflows are pure waste because they produce no accepted code.
The Minimum Metrics to Capture
A useful cost dashboard does not need to be complicated. Start with a small set of fields that connect model usage to engineering work.
| Metric | Why it matters |
|---|---|
| Model name | Different models have very different input and output prices. |
| Input and output tokens | Coding agents are usually input-heavy, but generated code can be output-heavy. |
| Agent or session ID | Links spend to the process that caused it. |
| Tool or MCP server | Shows whether external context sources are inflating prompts. |
| Repository and PR | Connects cost to shipped engineering work. |
| Outcome | Separates accepted work from discarded exploration. |
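As a rough illustration, the fields in the table could be captured as one record per model call. This is a minimal sketch, not a prescribed schema: the `UsageRecord` name, the model names, and the per-million-token prices are all hypothetical placeholders, so substitute your provider's actual rates.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices in USD per million tokens. Replace with real rates.
PRICES_PER_MTOK = {
    "premium-model": {"input": 10.00, "output": 30.00},
    "budget-model": {"input": 1.00, "output": 4.00},
}


@dataclass
class UsageRecord:
    """One model call, tagged with the fields from the metrics table."""
    model: str
    input_tokens: int
    output_tokens: int
    session_id: str          # agent or session that caused the call
    tool: Optional[str]      # tool or MCP server that supplied context, if any
    repo: str
    pr: Optional[int]        # pull request number, if one exists yet
    outcome: str             # e.g. "merged", "edited", "failed_tests", "abandoned"


def cost_usd(record: UsageRecord) -> float:
    """Price input and output tokens separately, since their rates differ."""
    price = PRICES_PER_MTOK[record.model]
    total = (record.input_tokens * price["input"]
             + record.output_tokens * price["output"])
    return total / 1_000_000


# Hypothetical example values for illustration only.
record = UsageRecord(
    model="premium-model",
    input_tokens=200_000,
    output_tokens=10_000,
    session_id="sess-001",
    tool="docs-mcp",
    repo="example/api",
    pr=None,
    outcome="merged",
)
print(f"{cost_usd(record):.2f}")  # prints 2.30
```

Under these made-up prices the input side accounts for $2.00 of the $2.30, which is the input-heavy pattern the table describes.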
Track Workflows, Not Just Prompts
Prompt-level tracking is useful, but workflow-level tracking is where cost decisions become clear. A single pull request might include planning, editing, test generation, failure analysis, code review, documentation, and a final summary. Each step has a different expected return.
- Planning: high reasoning value, usually modest output cost.
- Repository exploration: high input cost, often repetitive.
- Code generation: output-heavy and easy to measure by accepted diff.
- Testing and debugging: can loop on repeated failures, so it needs explicit caps on retries or spend.
- Review: often worth premium models because mistakes are expensive.
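The workflow steps above can be rolled up per pull request and checked against a cap. The sketch below assumes each model call has already been labeled with a step name and a dollar cost; the step labels, the $3.00 cap, and the sample costs are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-step budget caps in USD for a single pull request.
STEP_CAPS_USD = {"testing_debugging": 3.00}


def spend_by_step(events):
    """Sum cost per workflow step. Each event is a (step, cost_usd) pair."""
    totals = defaultdict(float)
    for step, cost in events:
        totals[step] += cost
    return dict(totals)


def steps_over_cap(totals, caps=STEP_CAPS_USD):
    """Return the capped steps whose total spend exceeds their cap."""
    return sorted(
        step for step, total in totals.items()
        if step in caps and total > caps[step]
    )


# Hypothetical events for one pull request.
pr_events = [
    ("planning", 0.40),
    ("repo_exploration", 1.75),
    ("code_generation", 0.90),
    ("testing_debugging", 1.25),
    ("testing_debugging", 1.25),
    ("testing_debugging", 1.25),
    ("review", 0.60),
]

totals = spend_by_step(pr_events)
print(steps_over_cap(totals))  # prints ['testing_debugging']
```

Here three debugging iterations add up to $3.75, so the loop is flagged even though no single call looks expensive.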
The Best Unit: Cost per Accepted Change
The most useful metric is not cost per token. It is cost per accepted engineering outcome. For coding agents, that might be cost per merged pull request, cost per bug fixed, cost per test added, or cost per feature shipped. A $12 agent run that merges a production fix is cheap. A $1 agent run that produces discarded code is expensive.
To calculate this, tag each AI-assisted session with an outcome. Did the code ship? Was it edited heavily by a human? Did it fail tests? Did it get abandoned? Over time, this shows which workflows deserve premium models and which should use cheaper models or stricter prompts.
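One way to compute this, assuming each session is already tagged with a workflow, a cost, and an outcome, is to divide total spend per workflow by the number of accepted sessions. The workflow names and the choice to count only `"merged"` as accepted are assumptions for illustration; a team might also count lightly edited changes.

```python
from collections import defaultdict

# Assumption: only merged work counts as an accepted outcome.
ACCEPTED = {"merged"}


def cost_per_accepted_change(sessions):
    """Map each workflow to total spend divided by accepted sessions.

    Each session is a (workflow, cost_usd, outcome) tuple. A workflow with
    spend but no accepted outcome maps to None rather than a misleading number.
    """
    spend = defaultdict(float)
    accepted = defaultdict(int)
    for workflow, cost, outcome in sessions:
        spend[workflow] += cost
        if outcome in ACCEPTED:
            accepted[workflow] += 1
    return {
        w: (spend[w] / accepted[w] if accepted[w] else None)
        for w in spend
    }


# Hypothetical sessions for illustration only.
sessions = [
    ("bug_fix", 12.0, "merged"),
    ("bug_fix", 4.0, "failed_tests"),
    ("scaffolding", 1.0, "abandoned"),
]

print(cost_per_accepted_change(sessions))
# prints {'bug_fix': 16.0, 'scaffolding': None}
```

Note that the failed bug-fix attempt is charged to the merged one, so the accepted change costs $16, not $12. The $1 scaffolding run has no accepted output at all, which is why it is reported as `None` instead of looking cheap.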
Common Cost Observability Mistakes
- Tracking only total monthly spend.
- Ignoring input tokens, even though coding agents read large contexts.
- Not separating user-visible chat from background agents.
- Counting generated lines of code without measuring accepted lines of code.
- Failing to attribute MCP and tool output to the workflow that requested it.
A Simple Starting Dashboard
Start with five charts: spend by model, spend by repository, spend by workflow type, spend by tool or MCP server, and cost per merged pull request. That is usually enough to surface the largest sources of waste. If one repository has unusually high cost per PR, inspect how much context each session loads. If one tool dominates spend, narrow its queries so it returns less output into the prompt. If one model dominates cost without better outcomes, route routine tasks to a cheaper model.
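The "unusually high cost per PR" check can be approximated by comparing each repository against the median. This is a sketch under stated assumptions: the repository names and costs are invented, and the 2x threshold is an arbitrary starting point rather than a recommended value.

```python
from statistics import median


def flag_expensive_repos(cost_per_pr, factor=2.0):
    """Return repos whose cost per merged PR exceeds factor x the median."""
    baseline = median(cost_per_pr.values())
    return sorted(
        repo for repo, cost in cost_per_pr.items()
        if cost > factor * baseline
    )


# Hypothetical cost per merged PR in USD, keyed by repository.
cost_per_pr = {"api": 3.0, "web": 4.0, "mobile": 3.5, "monolith": 11.0}

print(flag_expensive_repos(cost_per_pr))  # prints ['monolith']
```

The median is used instead of the mean so that one very expensive repository does not raise the baseline it is being compared against.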
Once you know the workflow shape, use the AI Cost Estimator to compare what the same workload would cost on different models.