How Many Screenshots Can a Browser Agent Afford Before Context Costs Explode?

By Eric Bush · May 20, 2026 · 6 min read

Screenshots Are Hidden Token Spend

Browser agents do not just send text to a model. They often send screenshots, DOM extracts, tool logs, previous clicks, and task instructions. Screenshots are usually the expensive part because each image is converted into tokens that occupy the context window just as text does. A practical estimate is 1,000-1,800 tokens per screenshot, depending on resolution and detail.

That means a browser agent can spend more on looking than on writing. If it captures a screenshot every turn and keeps the full history, context grows quickly even if the web task itself is simple.

The 100-Screenshot Problem

A 200k context window sounds enormous, but visual agents can fill it surprisingly fast. At 1,500 tokens per screenshot, 100 screenshots are already 150,000 tokens before counting instructions, tool results, page text, and model outputs. That leaves little room for reasoning or task history.

Screenshots kept	Estimated visual tokens	Risk
10	~15,000	Usually safe
50	~75,000	Expensive but manageable
100	~150,000	Context pressure
150	~225,000	Over common context limits
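The table's arithmetic can be reproduced with a few lines of Python. This is a minimal sketch, not a tokenizer: the 1,500 tokens per screenshot is the article's working estimate, and `reserved_tokens` (room set aside for instructions, page text, and model output) is a hypothetical parameter introduced here for illustration.

```python
def visual_tokens(screenshots: int, tokens_per_screenshot: int = 1_500) -> int:
    """Estimated tokens consumed by the screenshots kept in context."""
    return screenshots * tokens_per_screenshot


def max_screenshots(
    context_window: int,
    reserved_tokens: int,
    tokens_per_screenshot: int = 1_500,
) -> int:
    """How many screenshots fit once non-visual tokens are reserved.

    reserved_tokens is an assumed budget for instructions, tool results,
    page text, and model output; pick a value that matches your agent.
    """
    available = max(context_window - reserved_tokens, 0)
    return available // tokens_per_screenshot


if __name__ == "__main__":
    for kept in (10, 50, 100, 150):
        print(kept, visual_tokens(kept))
    # With a 200k window and an assumed 50k reserved for everything else:
    print(max_screenshots(200_000, 50_000))
```

Reserving part of the window up front makes the limit explicit: the more text the agent carries, the fewer screenshots it can keep.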

The Direct Cost Example

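Suppose a browser agent uses 60 screenshots in a session, averaging 1,500 input tokens each. That is 90,000 visual input tokens. On Claude Sonnet 4.6 at $3.00 per million input tokens, the screenshots cost about $0.27 before output tokens. On Claude Opus 4.7 at $5.00 per million input tokens, they cost $0.45. These figures count each screenshot once; an agent that resends its full history on every turn pays for earlier screenshots again on each request unless prompt caching applies.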

That sounds small for one session, but the multiplier matters. If 1,000 automated browser sessions run per day, 60 screenshots each becomes 90 million visual input tokens per day. On Sonnet 4.6, that is $270/day for screenshots alone. On Opus 4.7, it is $450/day.
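The cost figures above can be checked with a short calculation. The sketch below uses the article's prices and token estimate; the second function is an added illustration that assumes one screenshot per turn, the kept history resent on every request, and no prompt caching. Both function names are hypothetical.

```python
from typing import Optional


def screenshot_cost_usd(
    screenshots: int,
    tokens_per_screenshot: int,
    price_per_million_usd: float,
    sessions: int = 1,
) -> float:
    """Input cost when each screenshot is billed exactly once."""
    tokens = screenshots * tokens_per_screenshot * sessions
    return tokens * price_per_million_usd / 1_000_000


def resent_screenshot_count(turns: int, keep_last: Optional[int] = None) -> int:
    """Total screenshot copies sent across a session.

    Assumes one new screenshot per turn and that every request resends
    the kept history (all of it when keep_last is None). Caching is ignored.
    """
    total = 0
    for turn in range(1, turns + 1):
        total += turn if keep_last is None else min(turn, keep_last)
    return total


if __name__ == "__main__":
    print(screenshot_cost_usd(60, 1_500, 3.00))                  # one session, Sonnet price
    print(screenshot_cost_usd(60, 1_500, 5.00, sessions=1_000))  # daily, Opus price
    print(resent_screenshot_count(60), resent_screenshot_count(60, keep_last=3))
```

Under those assumptions, a 60-turn session that resends its full history transmits 1,830 screenshot copies, against 177 when only the last 3 are kept, which is why pruning matters more than the per-screenshot price.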

How Many Screenshots Should You Keep?

For most browser tasks, keep the last 2-5 screenshots and summarize older ones. The agent usually needs the current page state, maybe the previous state, and a short memory of what it already tried. It rarely needs every pixel from 40 turns ago.

Short form fill: keep 1-2 screenshots.
Shopping or search workflow: keep 3-5 screenshots plus text summaries.
Long research session: keep recent screenshots and compact old findings into structured notes.
High-stakes workflow: keep audit logs separately, not necessarily in the model context.
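One way to apply these guidelines is to strip image data from older turns while keeping their text summaries. The sketch below is an assumption about how history might be stored: the `Turn` structure and `prune_screenshots` function are hypothetical and not tied to any particular agent framework.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    """One agent step: a short text summary plus an optional screenshot."""
    summary: str
    screenshot: Optional[bytes] = None


def prune_screenshots(history: List[Turn], keep_last: int = 3) -> List[Turn]:
    """Return a copy of history where only the newest keep_last
    screenshots remain; older turns keep their summary text only."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    remaining = keep_last
    pruned: List[Turn] = []
    # Walk from newest to oldest so the most recent screenshots survive.
    for turn in reversed(history):
        if turn.screenshot is not None:
            if remaining > 0:
                remaining -= 1
            else:
                turn = replace(turn, screenshot=None)
        pruned.append(turn)
    pruned.reverse()
    return pruned


if __name__ == "__main__":
    history = [Turn(f"step {i}", b"fake-png-bytes") for i in range(10)]
    for turn in prune_screenshots(history, keep_last=3):
        print(turn.summary, "image" if turn.screenshot else "text only")
```

The original list is left untouched, so the full record can still be written to a separate audit log before the pruned copy is sent to the model.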

Resolution Matters Too

Do not send a 4K screenshot when a 1280×720 view would answer the question. If a small region matters, crop or zoom. If text is available in the DOM, send extracted text instead of an image. The best browser agents combine visual input with structured page data so the model does not need to infer everything from pixels.
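As a small illustration of the downscaling step, the sketch below computes target dimensions that fit inside a 1280×720 box while preserving aspect ratio. It only does the arithmetic; the actual resize would be done by whatever imaging library the agent already uses. The function name and defaults are assumptions, and how resolution maps to tokens depends on the model, so no token estimate is attempted here.

```python
from typing import Tuple


def fit_within(
    width: int,
    height: int,
    max_width: int = 1280,
    max_height: int = 720,
) -> Tuple[int, int]:
    """Scale (width, height) down to fit the box, keeping aspect ratio.

    Images already inside the box are returned unchanged (never upscaled).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(3840, 2160))  # 4K capture
    print(fit_within(800, 600))    # already small enough
```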

Bottom Line

Browser agents can afford screenshots, but they cannot afford unlimited screenshot history. Count them, prune them, cache stable prompts, and compact long sessions. The difference between keeping 3 screenshots and 100 screenshots can decide whether your agent is cheap enough to run in production.

Use the AI Cost Estimator to model the text side of your workflow, then add a screenshot budget on top for browser and computer-use agents.
