How Many Screenshots Can a Browser Agent Afford Before Context Costs Explode?
May 20, 2026 · 6 min read
Screenshots Are Hidden Token Spend
Browser agents do not just send text to a model. They often send screenshots, DOM extracts, tool logs, previous clicks, and task instructions. Screenshots are usually the expensive part because each image is converted into tokens that occupy model context. A practical estimate is 1,000-1,800 tokens per screenshot, depending on resolution and detail.
That means a browser agent can spend more on looking than on writing. If it captures a screenshot every turn and keeps the full history, context grows quickly even if the web task itself is simple.
The 100-Screenshot Problem
A 200k context window sounds enormous, but visual agents can fill it surprisingly fast. At 1,500 tokens per screenshot, 100 screenshots are already 150,000 tokens before counting instructions, tool results, page text, and model outputs. That leaves roughly 50,000 tokens for everything else, including reasoning and task history.
| Screenshots kept | Estimated visual tokens | Risk |
|---|---|---|
| 10 | ~15,000 | Usually safe |
| 50 | ~75,000 | Expensive but manageable |
| 100 | ~150,000 | Context pressure |
| 150 | ~225,000 | Over common context limits |
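The table's arithmetic can be reproduced with a short sketch. It assumes the article's working figures of 1,500 tokens per screenshot and a 200k-token window; the function names are illustrative, and real per-image token counts vary by model and resolution.

```python
# Working assumptions taken from the article, not measured values.
TOKENS_PER_SCREENSHOT = 1_500
CONTEXT_WINDOW = 200_000


def visual_tokens(screenshots: int, tokens_each: int = TOKENS_PER_SCREENSHOT) -> int:
    """Estimated tokens consumed by the screenshots kept in context."""
    return screenshots * tokens_each


def context_share(screenshots: int, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window taken by screenshots (can exceed 1.0)."""
    return visual_tokens(screenshots) / window


if __name__ == "__main__":
    for n in (10, 50, 100, 150):
        print(f"{n:>3} screenshots: {visual_tokens(n):>7,} tokens, "
              f"{context_share(n):.0%} of the window")
```

A share above 100% means the screenshots alone would not fit, before any instructions or page text are added.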
The Direct Cost Example
Suppose a browser agent uses 60 screenshots in a session, averaging 1,500 input tokens each. That is 90,000 visual input tokens. On Claude Sonnet 4.6 at $3.00 per million input tokens, the screenshots cost about $0.27 before output tokens. On Claude Opus 4.7 at $5.00 per million input tokens, they cost $0.45.
That sounds small for one session, but the multiplier matters. If 1,000 automated browser sessions run per day, 60 screenshots each becomes 90 million visual input tokens per day. On Sonnet 4.6, that is $270/day for screenshots alone. On Opus 4.7, it is $450/day.
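The same calculation as a minimal sketch, using the per-million input prices quoted above. The function name and parameters are illustrative, and the estimate covers screenshot input tokens only, not output tokens or caching discounts.

```python
def screenshot_cost(screenshots: int, tokens_each: int,
                    price_per_million: float, sessions: int = 1) -> float:
    """Dollar cost of screenshot input tokens across a number of sessions."""
    total_tokens = screenshots * tokens_each * sessions
    return total_tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    # Prices are the figures quoted in this article (USD per million input tokens).
    for label, price in (("Sonnet 4.6", 3.00), ("Opus 4.7", 5.00)):
        per_session = screenshot_cost(60, 1_500, price)
        per_day = screenshot_cost(60, 1_500, price, sessions=1_000)
        print(f"{label}: ${per_session:.2f} per session, ${per_day:.2f} per day")
```

Changing `screenshots` from 60 to 5 in the call shows how strongly the kept-history size drives the daily figure.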
How Many Screenshots Should You Keep?
For most browser tasks, keep the last 2-5 screenshots and summarize older ones. The agent usually needs the current page state, maybe the previous state, and a short memory of what it already tried. It rarely needs every pixel from 40 turns ago.
- Short form fill: keep 1-2 screenshots.
- Shopping or search workflow: keep 3-5 screenshots plus text summaries.
- Long research session: keep recent screenshots and compact old findings into structured notes.
- High-stakes workflow: keep audit logs separately, not necessarily in the model context.
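One way to apply these retention rules is a pruning pass over the turn history before each model call. The sketch below is an assumption about how such a history might be stored: the `Turn` class and `prune_screenshots` function are hypothetical, not part of any agent framework. It keeps the newest few screenshots and leaves only the text summary for older turns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """One agent step (hypothetical structure for illustration)."""
    summary: str                        # short text note of what happened
    screenshot: Optional[bytes] = None  # raw image bytes, if still retained


def prune_screenshots(history: list[Turn], keep_last: int = 3) -> list[Turn]:
    """Return a copy of history that retains only the newest keep_last screenshots.

    Older turns keep their text summary but drop the image.
    """
    kept = 0
    pruned: list[Turn] = []
    # Walk from newest to oldest so the most recent images are kept first.
    for turn in reversed(history):
        if turn.screenshot is not None and kept < keep_last:
            kept += 1
            pruned.append(turn)
        else:
            pruned.append(Turn(summary=turn.summary, screenshot=None))
    pruned.reverse()  # restore chronological order
    return pruned
```

The original history is left untouched, so full screenshots can still be written to a separate audit log while the model sees only the pruned copy.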
Resolution Matters Too
Do not send a 4K screenshot when a 1280×720 view would answer the question. If a small region matters, crop or zoom. If text is available in the DOM, send extracted text instead of an image. The best browser agents combine visual input with structured page data so the model does not need to infer everything from pixels.
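As a rough illustration of the downscaling step, the helper below computes target dimensions that fit inside a 1280×720 box while preserving aspect ratio. The function name and defaults are assumptions; the actual resize would be done by whatever imaging library the agent already uses.

```python
def fit_within(width: int, height: int,
               max_width: int = 1280, max_height: int = 720) -> tuple[int, int]:
    """Scale (width, height) down to fit the box; never upscale."""
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(3840, 2160))  # a 4K capture
    print(fit_within(800, 600))    # already small enough, returned unchanged
```

Cropping to the region of interest before resizing keeps small text legible at the lower resolution.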
Bottom Line
Browser agents can afford screenshots, but they cannot afford unlimited screenshot history. Count them, prune them, cache stable prompts, and compact long sessions. The difference between keeping 3 screenshots and 100 screenshots can decide whether your agent is cheap enough to run in production.
Use the AI Cost Estimator to model the text side of your workflow, then add a screenshot budget on top for browser and computer-use agents.