How Many Screenshots Can a Browser Agent Afford Before Context Costs Explode?

May 20, 2026 · 6 min read

Screenshots Are Hidden Token Spend

Browser agents do not just send text to a model. They often send screenshots, DOM extracts, tool logs, previous clicks, and task instructions. Screenshots are usually the expensive part because each image is converted into tokens that occupy the context window just like text. A practical estimate is 1,000-1,800 tokens per screenshot, depending on resolution and detail.

That means a browser agent can spend more on looking than on writing. If it captures a screenshot every turn and keeps the full history, context grows quickly even if the web task itself is simple.
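To make the growth concrete, here is a minimal sketch. It assumes one new screenshot per turn, a fixed token count per screenshot, and that every kept screenshot is resent as uncached input on each turn. The function name and parameters are illustrative and do not come from any SDK.

```python
from typing import Optional


def cumulative_visual_tokens(turns: int,
                             tokens_per_screenshot: int = 1500,
                             keep_last: Optional[int] = None) -> int:
    """Total visual input tokens sent across a whole session.

    Assumptions (hypothetical model of an agent loop):
    - one new screenshot is captured on every turn
    - every kept screenshot is resent, uncached, on each turn
    - keep_last=None means the full history is kept
    """
    total = 0
    for turn in range(1, turns + 1):
        # Number of screenshots in context on this turn.
        kept = turn if keep_last is None else min(turn, keep_last)
        total += kept * tokens_per_screenshot
    return total


full_history = cumulative_visual_tokens(60)
sliding_window = cumulative_visual_tokens(60, keep_last=3)
print(full_history, sliding_window)
```

Under these assumptions, 60 turns with full history send 2,745,000 visual tokens in total, against 265,500 with a three-screenshot window. Prompt caching or server-side history would change the numbers, so treat this as an upper-bound illustration rather than a bill.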

The 100-Screenshot Problem

A 200k context window sounds enormous, but visual agents can fill it surprisingly fast. At 1,500 tokens per screenshot, 100 screenshots are already 150,000 tokens before counting instructions, tool results, page text, and model outputs. That leaves little room for reasoning or task history.

Screenshots kept   Estimated visual tokens   Risk
10                 ~15,000                   Usually safe
50                 ~75,000                   Expensive but manageable
100                ~150,000                  Context pressure
150                ~225,000                  Over common context limits
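The same arithmetic can be turned around to ask how many screenshots fit in a given window. The sketch below is only an estimate: the 40,000-token reserve for instructions, tool results, page text, and outputs is a made-up placeholder, and the function name is hypothetical.

```python
def max_screenshots(context_window: int = 200_000,
                    tokens_per_screenshot: int = 1_500,
                    reserved_tokens: int = 40_000) -> int:
    """Screenshots that fit after reserving room for everything else.

    reserved_tokens is an assumed budget for instructions, tool results,
    page text, and model output; tune it to your own agent.
    """
    available = context_window - reserved_tokens
    # Never report a negative count if the reserve exceeds the window.
    return max(available // tokens_per_screenshot, 0)


print(max_screenshots())                            # 1,500 tokens each
print(max_screenshots(tokens_per_screenshot=1_800)) # detailed screenshots
```

With these placeholder values the budget is 106 screenshots at 1,500 tokens each and 88 at 1,800, which is why the 100-screenshot row already counts as context pressure.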

The Direct Cost Example

Suppose a browser agent uses 60 screenshots in a session, averaging 1,500 input tokens each. Counting each screenshot once, that is 90,000 visual input tokens. On Claude Sonnet 4.6 at $3.00 per million input tokens, the screenshots cost about $0.27 before output tokens. On Claude Opus 4.7 at $5.00 per million input tokens, they cost $0.45.

That sounds small for one session, but the multiplier matters. If 1,000 automated browser sessions run per day at 60 screenshots each, that adds up to 90 million visual input tokens per day. On Sonnet 4.6, that is $270/day for screenshots alone. On Opus 4.7, it is $450/day.
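The figures above can be reproduced with a few lines. This is a sketch using the per-million input prices quoted in this article; the function name and parameters are illustrative, and each screenshot is counted as input once.

```python
def screenshot_cost_usd(screenshots: int,
                        tokens_per_screenshot: int,
                        price_per_million_input: float,
                        sessions: int = 1) -> float:
    """Input cost of screenshots alone, counting each image once."""
    tokens = screenshots * tokens_per_screenshot * sessions
    return tokens * price_per_million_input / 1_000_000


# Prices per million input tokens as quoted in the article.
SONNET_PRICE = 3.00
OPUS_PRICE = 5.00

per_session = screenshot_cost_usd(60, 1_500, SONNET_PRICE)
per_day = screenshot_cost_usd(60, 1_500, SONNET_PRICE, sessions=1_000)
print(f"${per_session:.2f} per session, ${per_day:.2f} per day")
```

Swapping in `OPUS_PRICE` gives the $0.45 and $450 figures. If your agent resends history on every turn without caching, pass the total number of screenshot sends rather than the number of captures.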

How Many Screenshots Should You Keep?

For most browser tasks, keep the last 2-5 screenshots and summarize older ones. The agent usually needs the current page state, maybe the previous state, and a short memory of what it already tried. It rarely needs every pixel from 40 turns ago.

  • Short form fill: keep 1-2 screenshots.
  • Shopping or search workflow: keep 3-5 screenshots plus text summaries.
  • Long research session: keep recent screenshots and compact old findings into structured notes.
  • High-stakes workflow: keep audit logs separately, not necessarily in the model context.
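One way to apply these retention rules is a pruning pass over the agent's history before each model call. The sketch below is an assumption about how such a history might be stored: the `Turn` class and its fields are hypothetical, and writing the text summaries themselves is left to whatever summarizer you use.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    """One agent step. All field names here are hypothetical."""
    action: str                  # e.g. "clicked 'Add to cart'"
    screenshot: Optional[bytes]  # raw image bytes, or None once pruned
    summary: str = ""            # short text note about the page state


def prune_screenshots(history: List[Turn], keep_last: int = 3) -> List[Turn]:
    """Keep images only on the newest `keep_last` turns that have one.

    Older turns keep their action and summary text but drop the image,
    so the model still sees what was tried without paying for pixels.
    """
    remaining = keep_last
    pruned: List[Turn] = []
    # Walk newest to oldest so the most recent screenshots survive.
    for turn in reversed(history):
        if turn.screenshot is not None and remaining > 0:
            remaining -= 1
            pruned.append(turn)
        else:
            pruned.append(replace(turn, screenshot=None))
    pruned.reverse()  # restore chronological order
    return pruned


history = [Turn(f"step {i}", b"png-bytes", f"page state {i}") for i in range(5)]
trimmed = prune_screenshots(history, keep_last=2)
print([t.screenshot is not None for t in trimmed])
```

The original list is left untouched, so the full history, including every image, can still be written to a separate audit log for high-stakes workflows.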

Resolution Matters Too

Do not send a 4K screenshot when a 1280×720 view would answer the question. If a small region matters, crop or zoom. If text is available in the DOM, send extracted text instead of an image. The best browser agents combine visual input with structured page data so the model does not need to infer everything from pixels.
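To compare resolutions before sending anything, a rough estimate is enough. The sketch below assumes image tokens scale with pixel area, using a divisor of 750 pixels per token; that figure is a rule of thumb based on Anthropic's vision guidance and is an assumption here, since other providers tokenize differently and many resize large images before counting. Both function names are hypothetical.

```python
import math
from typing import Tuple


def fit_within(width: int, height: int,
               max_width: int = 1280, max_height: int = 720) -> Tuple[int, int]:
    """Scale dimensions down to fit a box, preserving aspect ratio.

    Images that already fit are returned unchanged (never upscaled).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(round(width * scale), 1), max(round(height * scale), 1)


def estimate_image_tokens(width: int, height: int,
                          pixels_per_token: int = 750) -> int:
    """Rough token estimate; pixels_per_token is an assumed rule of thumb."""
    return math.ceil(width * height / pixels_per_token)


target = fit_within(3840, 2160)           # 4K capture scaled to 720p
print(target, estimate_image_tokens(*target))
```

Under this assumption a 1280×720 view comes to about 1,229 tokens, inside the 1,000-1,800 range used earlier. The actual resize would be done by your screenshot tool or an image library; this only computes the target size and the estimate.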

Bottom Line

Browser agents can afford screenshots, but they cannot afford unlimited screenshot history. Count them, prune them, cache stable prompts, and compact long sessions. The difference between keeping 3 screenshots and 100 screenshots can decide whether your agent is cheap enough to run in production.

Use the AI Cost Estimator to model the text side of your workflow, then add a screenshot budget on top for browser and computer-use agents.
