
API → Connector → Browser → Computer Use → Human: A Cost-Justified Hierarchy for Agent Tooling

TL;DR


I run a fleet of autonomous agents for my own daily operations — research, scheduling, knowledge management, recurring workflows. Most of those agents only need APIs or MCP connectors to do their work. But some tasks don’t have structured endpoints — and the same is true at enterprise scale. Verify patient eligibility on an insurer’s web portal, then update the record in a legacy desktop EHR. Pull supplier pricing from a vendor site, then enter it into an on-premise ERP that predates REST APIs. Extract data from a web-based reporting dashboard, then reconcile it in a desktop compliance tool. The pattern is the same whether you’re building for yourself or for thousands of tenants: one half of the workflow lives in a browser, the other half lives in a desktop app that was never going to ship an API. Either tool alone leaves the other half stranded.

So I ran experiments. Not to benchmark a model release — Opus 4.8 happened to drop during this work and I used it — but to answer a practical question: when an agent needs to interact with a UI, what does that actually cost to run, and what’s the right architecture? Computer use? Browser automation? Both, somehow?

The answer turned out to be sharper than I expected, and it’s the headline of this post.

This post lays out the hierarchy and the experimental data behind it. The posts that follow go deeper into specific findings.

The hierarchy you should actually use:

┌─────────────────────────────────────────────────────┐
│  Tier 1: Native API                       cheapest  │
│         ↓ (only if no API exists)                   │
│  Tier 2: Existing connector (MCP, vendor SDK)       │
│         ↓ (only if no connector)                    │
│  Tier 3: Browser-based automation                   │
│         ↓ (only if web UI not enough)               │
│  Tier 4: Computer use (vision + mouse/keyboard)     │
│         ↓ (only if model can't complete it)         │
│  Tier 5: Human                            most $    │
└─────────────────────────────────────────────────────┘

Walk down it in order. Stop at the first tier that fits the task. Don’t skip ahead to computer use because it is more general or because the demo was impressive: generality carries a roughly 45× cost penalty against a structured API. In a Reflex.dev benchmark on an admin-panel task, vision-based computer use needed 53 steps, 551K input tokens, and 17 minutes. The equivalent through structured API calls took 8 calls, 12K tokens, and 20 seconds. That is 45× cheaper and 51× faster at the same correctness. That’s not a tuning problem; it’s the architecture. Vision agents pay for every screenshot of every intermediate state.
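As a sanity check, the ratios can be reproduced from the rounded figures quoted above. This is a minimal sketch with variable names of my own choosing. Because 551K and 12K are already rounded, the token ratio lands just under 46× rather than exactly 45×; the time ratio is exact.

```python
# Figures as quoted from the Reflex.dev admin-panel benchmark (rounded in the post).
vision = {"steps": 53, "input_tokens": 551_000, "seconds": 17 * 60}
api = {"steps": 8, "input_tokens": 12_000, "seconds": 20}

# Input tokens are used here as the cost proxy, since they dominate vision loops.
token_ratio = vision["input_tokens"] / api["input_tokens"]
time_ratio = vision["seconds"] / api["seconds"]

print(f"input tokens: {token_ratio:.1f}x")  # input tokens: 45.9x
print(f"wall time:    {time_ratio:.0f}x")   # wall time:    51x
```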

This isn’t a contrarian take — it’s roughly the hierarchy Anthropic itself recommends. What’s missing from most marketing copy is how steep the cost gap is between tiers.

I confirmed it from the other direction with eight harness configurations on Bedrock. Cost ranged from $0.03 to $5.44 per run across the two benchmark tasks below. The 180× spread came entirely from implementation choices, not the model. A purpose-built Playwright script at $0.03 proved that browser automation done right is structurally cheaper than even the best-tuned computer use run ($1.17). And the same optimization logic applied to the hybrid brought that cross-app task from $5.44 down to $0.17.

At a SaaS tenant doing 1,000 lookups a month, that gap is roughly $14K per year for tuned computer use ($1.17 a run) vs $65K for the unoptimized hybrid ($5.44 a run), and the API path would be ~$400. Same outcome.
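The annual figures follow from simple multiplication. The sketch below assumes the $14K and $65K numbers correspond to the $1.17 tuned computer-use run and the $5.44 baseline hybrid run, and that an API path costs about what the $0.03 Playwright run did. That mapping is my reading of the numbers, and `annual_cost` is a hypothetical helper.

```python
def annual_cost(cost_per_run: float, runs_per_month: int = 1_000) -> float:
    """Yearly spend in USD for a fixed per-run cost and monthly volume."""
    return cost_per_run * runs_per_month * 12

# Per-run costs taken from the experiment tables; the labels are assumptions.
per_run = {
    "structured path (~API)": 0.03,
    "tuned computer use": 1.17,
    "baseline hybrid": 5.44,
}

for label, cost in per_run.items():
    print(f"{label:24s} ${annual_cost(cost):>9,.0f} / year")
```

At 12,000 runs a year this gives about $360, $14,040, and $65,280, which matches the rounded figures above.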

The architectural hierarchy is the buried lede. Computer use isn’t bad — it’s powerful in the cases it covers. But the default reach for it (because it’s general, because Anthropic markets it well, because it works for demos) is wrong for most production loops. The rest of this post is the data behind that claim.

What tiers 3 and 4 actually look like

If you’re at tier 3 or 4 — browser tool, computer use, or some hybrid of the two — there are three architectural shapes worth knowing apart. They’re not interchangeable, and confusing them is a common mistake.

| Loop | What it is | How it works | Strengths | Weaknesses |
|---|---|---|---|---|
| Tool-calling loop (Playwright, browser-use, AgentCore Browser) | Reliable execution backend with structured actions | Agent calls click(selector), type(field, text), screenshot() against a browser via CDP | Fast, cheap, precise. 89-92% success on common browser tasks. | Browser-only. Needs selectors. |
| Vision loop (Anthropic computer use, Opus 4.8) | Vision-based intelligence controlling a desktop | Screenshot → LLM reasons over pixels → mouse/keyboard actions | General purpose. Works on any app. 84% on Online-Mind2Web for Opus 4.8. | Slower. More expensive. Lower precision. |
| Hybrid | Tool-calling backend + vision as supervisor/fallback | Outer agent decides per turn which tool to call | Best of both for genuinely cross-app tasks. | More moving parts. Most expensive. |

Mapping this back to the hierarchy: the tool-calling loop is what tier 3 (browser-based automation) actually is. The vision loop is what tier 4 (computer use) actually is. Hybrid is when your task spans both — and you should only reach for it when no API or single-tool path covers the workflow.

Numbers from public benchmark reports — Playwright-driven agent frameworks land in the high 80s to low 90s on common browser tasks; vision-based computer use lands around 84%. Hybrids outperform pure vision on web tasks because selectors beat pixel guessing.

When to use which

The industry shorthand is “80% Playwright, 20% browser-use.” That’s incomplete — it leaves out computer use entirely. Here’s the actual decision tree:

Can the task be done with stable DOM selectors?
    ├── YES → Playwright (code, 0 LLM calls in loop)
    └── NO: does it require a browser?
            ├── YES, DOM accessible → browser-use (LLM adapts per page)
            └── NO browser / need desktop → Computer use (vision loop)
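The tree above can be written as a small function. This is a minimal sketch: the `Tool` enum, the `choose_tool` name, and the three boolean inputs are hypothetical, and in practice deciding whether selectors are "stable" is a judgment call that no flag captures.

```python
from enum import Enum


class Tool(Enum):
    PLAYWRIGHT = "Playwright"
    BROWSER_USE = "browser-use"
    COMPUTER_USE = "Computer use"


def choose_tool(stable_selectors: bool, needs_browser: bool, dom_accessible: bool) -> Tool:
    """Walk the decision tree top to bottom and return the first tool that fits."""
    if stable_selectors:
        # Code drives the browser; no LLM calls inside the loop.
        return Tool.PLAYWRIGHT
    if needs_browser and dom_accessible:
        # An LLM adapts per page, but still acts through the DOM.
        return Tool.BROWSER_USE
    # Desktop apps, or a browser the DOM path cannot reach (e.g. CDP blocked).
    return Tool.COMPUTER_USE


print(choose_tool(stable_selectors=False, needs_browser=True, dom_accessible=False).value)
```

The last call models a site that blocks CDP: a browser is needed, but the DOM path is unreachable, so the function falls through to computer use.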

In practice:

| Situation | Right tool | Why |
|---|---|---|
| Stable DOM, predictable navigation | Playwright | No LLM needed in the loop |
| Dynamic UI, novel pages, site changes | browser-use | LLM adapts, but ~60 calls per task |
| Desktop apps (LibreOffice, Excel, legacy software) | Computer use | No browser alternative exists |
| Site blocks CDP / Playwright connections | Computer use | Browser path unreachable |
| Cross-app: stable web + desktop | Playwright + Computer use | Each leg optimized independently |
| Cross-app: dynamic web + desktop | browser-use + Computer use | More expensive, but covers both |

The mistake most builders make is treating computer use as a more powerful browser-use. It isn’t. It’s for tasks the browser can’t reach: desktop apps, sites that block CDP, or workflows that need both. If the DOM is accessible, computer use is more expensive than a well-built browser path.

If you’ve been treating these three as interchangeable, stop. The choice within tiers 3-4 is the most important implementation decision you’ll make. The data shows the cost spread within these two tiers is larger than the cost spread between most models.

Test setup

Every run was fully autonomous — no human in the loop between task prompt and final output.

The stack: a Strands outer agent on Amazon Bedrock received the task and decided which tools to call. For browser work, it called a tool that provisioned a fresh AgentCore Browser session — AWS-managed Chromium running in an isolated container — and drove it either via browser-use (an LLM sub-agent that reasons over screenshots and generates click/type actions) or via a purpose-built Playwright script (code that navigates by URL and queries the DOM directly). For desktop work, it called a tool that booted an ephemeral Docker container with Xvfb, LibreOffice, and a terminal, then ran Anthropic’s computer-use loop against it. Each tool call was one round trip with a fresh sandbox; no state persisted across calls.

All costs are real token counts pulled from CloudWatch (AWS/Bedrock namespace, InputTokenCount, OutputTokenCount, cache read/write metrics) and converted to USD using public Bedrock pricing. Wall times are measured end-to-end including container boot, browser provisioning, and formatting.
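The conversion from token counts to dollars is a weighted sum. The sketch below is an illustration under stated assumptions, not my actual accounting code. The `TokenUsage` and `Pricing` types are hypothetical, the four counts are treated as disjoint, and the prices in the usage line are placeholders rather than real Bedrock rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts summed over a run, e.g. from InputTokenCount and OutputTokenCount."""
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. The caller supplies values from the price sheet."""
    input: float
    output: float
    cache_read: float = 0.0
    cache_write: float = 0.0


def cost_usd(usage: TokenUsage, pricing: Pricing) -> float:
    """Weighted sum of token counts, scaled from per-million prices."""
    total = (
        usage.input_tokens * pricing.input
        + usage.output_tokens * pricing.output
        + usage.cache_read_tokens * pricing.cache_read
        + usage.cache_write_tokens * pricing.cache_write
    )
    return total / 1_000_000


# Placeholder prices for illustration only; look up the real rates for your model.
example = cost_usd(TokenUsage(1_000_000, 100_000), Pricing(input=3.0, output=15.0))
print(f"${example:.2f}")
```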

              Task prompt
                   │
                   ▼
┌─────────────────────────────────────┐
│  Outer agent (Strands + Bedrock)    │
│  Decides which tool to call         │
└──────────┬──────────────────────────┘
           │
     ┌─────┴──────┐
     │            │
     ▼            ▼
┌─────────┐  ┌──────────┐
│ Browser │  │ Desktop  │
│  tool   │  │  tool    │
└────┬────┘  └────┬─────┘
     │            │
     ▼            ▼
┌─────────┐  ┌───────────────────┐
│AgentCore│  │ Docker container  │
│ Browser │  │ Xvfb + LibreOffice│
│(managed │  │ + terminal        │
│ Chrome) │  └────────┬──────────┘
└────┬────┘           │
     │           computer-use
     │           loop (Bedrock)
browser-use           │
OR Playwright         │
     │                │
     └────────┬───────┘
              │
              ▼
         Final output
 (costs logged to CloudWatch)

Each tool call gets a fresh sandbox — new browser session or new container — torn down after the call. No shared state between calls.
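The fresh-sandbox-per-call rule maps naturally onto a context manager. This is a minimal sketch of the pattern only: `Sandbox` is a hypothetical stand-in for a real AgentCore Browser session or Docker container, and `run_tool_call` is an illustrative name, not part of Strands or any AWS SDK.

```python
import uuid
from contextlib import contextmanager


class Sandbox:
    """Hypothetical stand-in for a browser session or desktop container."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self.sandbox_id = uuid.uuid4().hex
        self.alive = True

    def close(self) -> None:
        self.alive = False


@contextmanager
def fresh_sandbox(kind: str):
    """Provision a new sandbox and always tear it down, even if the call fails."""
    sandbox = Sandbox(kind)
    try:
        yield sandbox
    finally:
        sandbox.close()


def run_tool_call(kind: str, action):
    """One tool call, one sandbox. Nothing survives past the return value."""
    with fresh_sandbox(kind) as sandbox:
        return action(sandbox)


first = run_tool_call("browser", lambda sb: sb.sandbox_id)
second = run_tool_call("browser", lambda sb: sb.sandbox_id)
print(first != second)  # True
```

The consequence is that anything the tool needs to hand back must travel in the return value, which is where the artifact question later in this post comes from.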

The tests

Two benchmark tasks, all costs from CloudWatch.

Simple task — browser research only:

“It’s Thursday at lunchtime. I’m at 525 Market Street. Find me 3 highly-rated Mexican lunch options within 5 minutes’ walk. Save as markdown.”

A deliberately messy task — fuzzy criteria, a site that blocks scrapers, location-dependent results with no structured endpoint. Chosen because it forces real decision-making rather than form-filling.

| Run | What drives the browser | Wall | Cost |
|---|---|---|---|
| Playwright structured | Code, 0 LLM calls in browser loop | 1:09 | $0.03 |
| browser-use + cache | LLM (~60 Sonnet calls, cached) | 17:48 | $2.43 |
| browser-use vanilla | LLM (~60 Sonnet calls, uncached) | 20:11 | $3.41 |
| Computer use, tightened prompt | Vision loop, convergence rules | 5:11 | $1.17 |
| Computer use, base prompt | Vision loop, no stopping criteria | 20:30 (timeout) | $2.13 |

Note: Playwright and browser-use runs used Sonnet 4.6 as inner model; computer use runs used Opus 4.8. The ordering holds at comparable model tiers but the gaps narrow.


Hybrid task — browser research + desktop spreadsheet creation:

“Find 3 highly-rated Mexican restaurants near 525 Market Street. Then create a spreadsheet at /tmp/lunch.ods inside the desktop container with columns Name | Rating | Walking time.”

One half lives in a browser; the other half lives in a desktop app. Neither tool alone finishes the job.

| Run | Browser leg | Desktop leg | Wall | Cost |
|---|---|---|---|---|
| Optimized | Playwright (0 LLM calls) | CLI-first prompt | 2:12 | $0.17 |
| Browser fixed only | Playwright (0 LLM calls) | Base prompt | 2:07 | $0.16 |
| Desktop fixed only | browser-use + cache | CLI-first prompt | 12:10 | $0.29 |
| Baseline | browser-use (LLM-driven) | Base prompt | 21:52 | $5.44 |

Reading the tables

Simple task table — three findings:

The Playwright row changes the browser tier argument. The browser-use runs ($3.41/$2.43) made tier 3 look expensive compared to tuned computer use ($1.17). But browser-use is a general LLM vision agent that makes ~60 model calls per task. A purpose-built Playwright script with zero LLM calls in the browser loop ran the same task for $0.03. That’s what tier 3 looks like at its best. The hierarchy holds; the browser-use numbers were measuring the wrong implementation.

Within the browser tier, implementation dominates. $3.41 (generic LLM agent) → $2.43 (cached) → $0.03 (Playwright). Most of the $3.38 savings came from removing the LLM from the loop entirely, not from caching it. Caching is a 29% improvement; eliminating the calls is a 99% improvement.

Tier 3 done right beats tier 4 done right by 39×. Playwright at $0.03 vs computer use at $1.17. The structural reason is the same as the API vs computer use gap: when you remove LLM calls from the execution loop, cost collapses.
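The percentages and the 39× figure in these three findings come straight from the table. A short sketch of the arithmetic, with a helper name (`percent_saved`) that is mine:

```python
# Per-run costs in USD, copied from the simple-task table.
costs = {
    "browser-use vanilla": 3.41,
    "browser-use + cache": 2.43,
    "Computer use, tightened prompt": 1.17,
    "Playwright structured": 0.03,
}


def percent_saved(before: float, after: float) -> float:
    """Relative saving of `after` against `before`, in percent."""
    return 100 * (before - after) / before


caching_gain = percent_saved(costs["browser-use vanilla"], costs["browser-use + cache"])
no_llm_gain = percent_saved(costs["browser-use vanilla"], costs["Playwright structured"])
tier_gap = costs["Computer use, tightened prompt"] / costs["Playwright structured"]

print(f"caching the LLM calls:  {caching_gain:.0f}% cheaper")  # 29% cheaper
print(f"removing the LLM calls: {no_llm_gain:.0f}% cheaper")   # 99% cheaper
print(f"tier 4 vs tier 3:       {tier_gap:.0f}x")              # 39x
```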

Hybrid task table — two findings:

The browser leg dominates. Fixing only the browser leg (Playwright, base desktop) dropped the hybrid from $5.44 to $0.16 — almost all of the savings. Fixing only the desktop leg ($0.29) helped, but the browser leg was doing most of the work in the original cost.

CLI-first prompt is a model-dependent optimization. Fixing both legs ($0.17) is nearly identical to fixing just the browser leg ($0.16). The CLI-first desktop instruction barely moved the cost — because Sonnet 4.6 with the base prompt was already efficient, finding the headless CLI path on its own. The CLI-first instruction was designed to fix Opus 4.8, which would fight the GUI before recovering. With Sonnet, it solved $0.01 worth of problem. The lesson: prompt instructions compensate for model behavior; better models need less compensating.

The hybrid runs worth highlighting

The baseline hybrid ($5.44) surfaced something the single-tier runs didn’t: when the inner desktop sub-agent failed (twice, different failure modes), the outer model noticed the text summary of the failure, diagnosed the problem, and issued a completely new instruction — switching from driving LibreOffice through the GUI to using libreoffice --headless from the command line. The second attempt succeeded. That self-recovery behavior is the subject of the third post.

The optimized hybrid ($0.17) never triggered recovery — because the CLI-first prompt eliminated the failure mode entirely. Six clean desktop turns, straight to libreoffice --headless, no failed GUI attempt. The cross-app task completed for $0.17 vs $5.44. Same capability, 32× cheaper.
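For a sense of what a CLI-first desktop path can look like, here is a hedged sketch: write the rows as CSV, then let LibreOffice convert the file without opening the GUI. The agent's actual commands are not reproduced here. The `--convert-to` and `--outdir` flags are my assumption about one workable route, the helper names are hypothetical, and the binary may be called `soffice` on some installs.

```python
import csv
import shutil
import subprocess
from pathlib import Path

HEADER = ("Name", "Rating", "Walking time")


def write_rows_csv(rows, csv_path: Path) -> Path:
    """Write the header plus one row per restaurant as plain CSV."""
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADER)
        writer.writerows(rows)
    return csv_path


def headless_convert_command(csv_path: Path, out_dir: Path) -> list[str]:
    """Build the LibreOffice command that converts the CSV to .ods with no GUI."""
    return [
        "libreoffice", "--headless",
        "--convert-to", "ods",
        "--outdir", str(out_dir),
        str(csv_path),
    ]


def build_spreadsheet(rows, out_dir: Path) -> Path:
    """Write lunch.csv, convert it if LibreOffice is installed, return the .ods path."""
    csv_path = write_rows_csv(rows, out_dir / "lunch.csv")
    command = headless_convert_command(csv_path, out_dir)
    if shutil.which(command[0]) is None:
        raise RuntimeError("libreoffice not found on PATH")
    subprocess.run(command, check=True)
    return out_dir / "lunch.ods"
```

Calling `build_spreadsheet(rows, Path("/tmp"))` inside the container would leave the result at /tmp/lunch.ods, since the converter names the output after the input file. No screenshots are involved, which is why the desktop leg gets so cheap.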

The two runs together show the tradeoff: if you want to observe self-recovery behavior, run the baseline. If you want the cheapest production path, run the optimized version. For production, $0.17 is the right number — and the recovery architecture is still there if a different failure mode appears.

What the experiments showed

Three patterns across eight runs.

1. “Browser automation” is not one thing — and the difference is $3.38.
A generic LLM browser agent costs $3.41. A purpose-built Playwright script costs $0.03. Same AWS infrastructure, same task, same output quality. The difference is whether an LLM decides every click or code does. Most browser automation benchmarks don’t measure this gap because they test the agent, not the implementation.

2. The browser leg dominates hybrid cost — fix it first.
In the hybrid runs, fixing only the browser leg (Playwright) dropped cost from $5.44 to $0.16. Fixing only the desktop prompt dropped it to $0.29. The browser was doing most of the damage. When optimizing a multi-leg workflow, profile which leg is expensive before tuning both.

3. Prompt instructions compensate for model behavior; better models need less compensating.
The CLI-first desktop prompt saved ~$5 with Opus 4.8 (by preventing a GUI fight) and $0.01 with Sonnet 4.6 (which found the CLI path on its own). This is a general pattern: optimization techniques designed for one model tier may be redundant or irrelevant at another.

The hierarchy isn’t theoretical. The cost gaps are structural. Vision-based loops pay for every screenshot of every intermediate state. Structured paths don’t.

The thing I’m still unsure about

In both hybrid runs, the inner desktop agent saved files inside the ephemeral container and did not proactively return them. The outer model noticed and offered to copy them, but treated persistence as optional.

This raises a real architectural question: When a sub-agent produces artifacts, should the system assume they need to be explicitly returned, or should there be a convention for automatic (but safe) propagation?

I chose explicit for now. At small scale this is fine. At larger scale it may create either lost work or excessive cognitive load on the outer model. I haven’t decided which risk is worse.
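To make the "explicit" option concrete, here is one way it could look: the tool declares which files it intends to return, and only those are copied out before teardown. This is a sketch of the convention, not my production code. `Artifact`, `ToolResult`, and `finish_tool_call` are hypothetical names, and the dict stands in for the container filesystem.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    path: str
    content: bytes


@dataclass
class ToolResult:
    summary: str
    artifacts: list[Artifact] = field(default_factory=list)


def finish_tool_call(summary: str, sandbox_files: dict[str, bytes], declared_outputs: list[str]) -> ToolResult:
    """Copy out only the declared files; fail loudly if a declared file is missing."""
    missing = [path for path in declared_outputs if path not in sandbox_files]
    if missing:
        raise FileNotFoundError(f"declared outputs not produced: {missing}")
    artifacts = [Artifact(path, sandbox_files[path]) for path in declared_outputs]
    return ToolResult(summary=summary, artifacts=artifacts)


# The dict below stands in for files left behind in the ephemeral container.
files = {"/tmp/lunch.ods": b"...", "/tmp/scratch.log": b"..."}
result = finish_tool_call("spreadsheet created", files, ["/tmp/lunch.ods"])
print([artifact.path for artifact in result.artifacts])  # ['/tmp/lunch.ods']
```

The tradeoff is visible even at this size: scratch files never leak out, but anything the sub-agent forgets to declare is lost when the sandbox is torn down.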

So what

The hierarchy that actually holds up in practice is:

Native API → Existing connector → Browser-based automation → Computer use → Human last resort.

Walk down it in order. Stop at the first tier that can do the job. Only reach for computer use when nothing above it works — legacy desktop apps, third-party SaaS with no usable endpoints, or genuine cross-app workflows that no single structured surface covers.

The experiments confirmed what the Reflex benchmark suggested: the cost gaps are structural, not a tuning problem. Vision agents pay for rendering every intermediate state. Structured paths give the model the data directly. Even well-tuned computer use is dramatically more expensive than the tiers above it for most work.

Computer use is genuinely powerful in the narrow set of cases it uniquely enables. Treating it as the default because the demos look impressive is one of the fastest ways to turn a reasonable automation project into an expensive recurring cost.

If you only read this post, the takeaway is simple: pick the cheapest tier that can actually complete the task. The model is the easy part.


If you want to go deeper

The series digs into each finding separately:


Part 1 of the Agent Tooling series. Part 2: Harness optimization →

