
Trying to make AI agents behave like human users in a browser could be far more expensive than wiring them directly into back-end APIs, according to new benchmark data from enterprise platform provider Reflex.
The company compared two approaches to using Anthropic’s Claude Sonnet model to operate the same web application: one through the graphical interface using screenshots and clicks, and the other through direct HTTP API calls.
In Reflex’s test, both agents were given the same instruction:
“A customer named Smith has complained about a recent order. Find the Smith with the most orders, accept all their pending reviews, and mark their most recently ordered order as delivered.”
The only difference was how Claude Sonnet interacted with the app:
- Vision agent: Used browser-use 0.12 to navigate the web UI, relying on screenshots, image processing, and optical character recognition to understand what was on screen.
- API agent: Called the same HTTP endpoints the UI relies on, receiving structured data instead of page images.
“Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app’s HTTP endpoints directly,” wrote Palash Awasthi, head of growth at Reflex, in a blog post describing the setup. Both used the same model (Claude Sonnet), the same pinned dataset and the same task; the interface was the only variable.
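For readers who want to picture the two wirings, here is a minimal, hedged sketch of the vision side. browser-use's constructor arguments have shifted between releases and the model id below is illustrative, so treat this as the shape of the setup rather than Reflex's actual harness. (The API side is sketched after the call list below.)

```python
# Illustrative vision-agent wiring (not Reflex's actual code): browser-use
# drives a real browser, screenshotting the page before every model step.
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic  # 0.1.x-era pattern; newer
                                               # releases bundle their own wrappers

TASK = (
    "A customer named Smith has complained about a recent order. "
    "Find the Smith with the most orders, accept all their pending reviews, "
    "and mark their most recently ordered order as delivered."
)

async def main() -> None:
    # Model id is illustrative; pick whichever Claude Sonnet release you target.
    agent = Agent(task=TASK, llm=ChatAnthropic(model="claude-sonnet-4-5"))
    await agent.run()  # loop: screenshot -> model decides -> click/scroll/type

asyncio.run(main())
```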
On raw performance, the API-driven agent was markedly faster. According to Reflex, the API agent finished in about 20 seconds and needed just eight calls to:
- List pending customer reviews
- Accept those reviews
- Mark the relevant order as delivered
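Reflex does not publish its endpoint routes in the excerpt, but an eight-call flow of this kind plausibly resembles the following sketch. All paths and field names here are invented for illustration; the point is that each step is a compact JSON exchange rather than a screenshot.

```python
# Hypothetical API-agent flow; every route and field name is illustrative.
import requests

BASE = "http://localhost:8000/api"  # hypothetical base URL

# 1. Find the Smith with the most orders.
customers = requests.get(f"{BASE}/customers", params={"name": "Smith"}).json()
smith = max(customers, key=lambda c: c["order_count"])

# 2. List that customer's pending reviews and accept each one.
reviews = requests.get(
    f"{BASE}/customers/{smith['id']}/reviews", params={"status": "pending"}
).json()
for review in reviews:
    requests.post(f"{BASE}/reviews/{review['id']}/accept")

# 3. Mark the most recently ordered order as delivered.
orders = requests.get(f"{BASE}/customers/{smith['id']}/orders").json()
latest = max(orders, key=lambda o: o["ordered_at"])
requests.patch(f"{BASE}/orders/{latest['id']}", json={"status": "delivered"})
```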
The vision agent struggled with the same workflow. It initially found only one of four pending reviews because it did not scroll the page, leaving three reviews off-screen and effectively invisible to the model.
Even after Reflex revised the prompt to help the vision system behave more effectively, the vision agent took around 17 minutes to complete the task, still dramatically slower than the API approach.
Token burn: ‘seeing’ costs 45x more
The more striking gap was in token usage, which directly affects compute load and, in many commercial settings, cost.
Reflex reports that the vision agent consumed roughly:
- ~500,000 input tokens
- ~38,000 output tokens
By contrast, the API agent used about:
- ~12,150 input tokens
- ~934 output tokens
By Reflex's count, that works out to the vision agent using around 45 times more tokens than the API agent to finish the same business task on the same app.
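For concreteness, here is the back-of-envelope arithmetic on those figures. The pricing is an assumption, based on Claude Sonnet's published list rates at the time of writing (about $3 per million input tokens and $15 per million output tokens); verify current rates before relying on the dollar amounts.

```python
# Back-of-envelope cost comparison using the rounded token counts above.
INPUT_RATE, OUTPUT_RATE = 3 / 1e6, 15 / 1e6  # dollars per token (assumed list rates)

def cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

vision = cost(500_000, 38_000)  # ~$2.07 per task run
api = cost(12_150, 934)         # ~$0.05 per task run
print(f"vision ≈ ${vision:.2f}, api ≈ ${api:.2f}, ratio ≈ {vision / api:.0f}x")
# With these rounded figures the ratio lands near 41x; Reflex's 45x headline
# presumably reflects the unrounded token counts.
```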
Awasthi argues that the gap reflects a fundamental architectural difference: vision agents “need to see,” and every screenshot they ingest comes with a large token footprint. Parsing a single image is significantly heavier than handling the equivalent structured response from an HTTP endpoint.
Anthropic’s own guidance underscores this cost. The company estimates that processing a 1,000×1,000-pixel image with Claude Sonnet 4.6 consumes about 1,334 tokens. Multiply that by the number of screenshots needed to navigate a non-trivial workflow and the token count climbs quickly.
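That estimate follows from Anthropic's documented rule of thumb for image tokens, roughly width times height divided by 750:

```python
import math

# Anthropic's documented approximation for image token cost:
# tokens ≈ (width_px × height_px) / 750
def image_tokens(width_px: int, height_px: int) -> int:
    return math.ceil(width_px * height_px / 750)

per_shot = image_tokens(1000, 1000)  # 1334, matching the figure above
print(per_shot, 50 * per_shot)       # 50 screenshots ≈ 66,700 tokens on pixels alone
```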
In this benchmark, the repeated screenshot capture and interpretation required for clicking around a web UI accounts for the bulk of the half-million input tokens burned by the vision agent.
By comparison, the API agent’s work is dominated by compact, structured requests and responses. The system calls known endpoints, receives JSON-like data, and asks the model to reason over that data rather than decoding pixels.
Beyond the raw token numbers, Reflex highlights that interpreting a web page visually is inherently more complex for a model than working against predefined tools and APIs. The vision agent must understand layout, detect scrolling needs, and interpret on-screen elements correctly — all from image snapshots that may hide crucial information off-screen, as the missed reviews demonstrate.
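The off-screen failure mode is easy to picture: a screenshot captures only the current viewport, so content below the fold does not exist for the model unless the agent scrolls and recaptures. A rough sketch of that loop, written here with Playwright for illustration rather than Reflex's actual browser-use harness:

```python
# Sketch of viewport-by-viewport capture (not Reflex's harness): every extra
# screenshot is another image payload the model must parse and pay tokens for.
from playwright.sync_api import sync_playwright

def capture_full_page(url: str) -> list[bytes]:
    shots = []
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        while True:
            shots.append(page.screenshot())  # one viewport = one image payload
            at_bottom = page.evaluate(
                "window.scrollY + window.innerHeight >= document.body.scrollHeight"
            )
            if at_bottom:
                break
            page.evaluate("window.scrollBy(0, window.innerHeight)")
            page.wait_for_timeout(250)  # let the page settle before recapturing
    return shots
```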
Reflex has made its test available as a benchmark for others who want to reproduce or extend the results. While detailed methodology beyond the figures above was not provided in the excerpt, the aim is to give teams a concrete way to compare “vision UI” automation against API-centric designs in their own environments.
Awasthi’s takeaway is pragmatic: vision-style agents are likely to remain important when dealing with software you do not control, where APIs are missing, incomplete or inaccessible. But for internal or controllable systems, he suggests targeting APIs first, given the large differences in speed and token consumption exposed by this experiment.