TL;DR
Playwright CLI made a bet back in April 2026: stop streaming DOM/accessibility trees into an agent’s context window through MCP tool calls, and instead persist browser state to disk as compact YAML that the agent reads on demand. Three months on, Vercel Labs independently made the identical bet with agent-browser. That’s not a coincidence — it’s the shape the whole ecosystem is converging on. This post maps where every other piece fits: cloud browser infra, AI-native test reporting, and the consumer-facing agentic browsers that are a different animal entirely.
Introduction
My previous post on Playwright + Claude Code tracked one specific transition: Playwright’s MCP server giving way to a CLI-first approach for coding agents. That was useful in isolation, but it’s one tool in a much bigger space. Since then I have kept running into adjacent tools solving overlapping problems: Vercel shipping its own browser CLI, Anthropic opening the Claude for Chrome beta to all paid plans, and a wave of “AI-native” test reporting startups layering AI analysis on top of Playwright/Cypress CI runs.
This post is a landscape map, not a tutorial. Three categories, what each tool actually is, and — because a lot of the marketing in this space overlaps — a decision guide for which one you actually reach for.
The three categories
[Diagram: a coding agent (Claude Code, Cursor, etc.) calls into category 1, automation primitives: Playwright CLI, agent-browser (Vercel Labs), Stagehand, Browser Use, and Skyvern, with optional cloud execution via Browserbase, Browser Use Cloud, or Vercel Sandbox, plus Cursor Cloud Agents (a coding agent, not automation infra). CI results from category 1 feed into category 3, AI-native QA reporting: ReportPortal, TestDino, Currents.dev, Allure TestOps, and Testomat.io. Category 2, consumer / enterprise agentic browsers (Claude for Chrome, OpenAI Operator, Edge Copilot Mode), shares the same underlying mechanics in a different execution context.]
1. Automation primitives — “how does an agent drive a browser”
This is the layer your coding agent actually calls.
Playwright CLI
Playwright CLI: Microsoft’s CLI-first tool, covered in depth in my earlier post. Since then: run-code now accepts a file argument instead of only inline snippets, show --annotate adds visual/structural annotations for agents, a new drop command handles drag-and-drop, and the CLI self-checks for stale installed Skills. Worth correcting my own record here: Playwright’s MCP server is not deprecated. Microsoft still recommends it for exploratory, self-healing, long-running agentic sessions that benefit from persistent state. CLI+Skills is the pick for high-throughput coding-agent workflows where token budget dominates. Adoption is real at ~11.8k stars and 235 dependent projects, though still smaller than the open-source browser-agent projects below (agent-browser, Stagehand, Browser Use, Skyvern).
What it actually feels like to use
Setup is two commands — no .mcp.json, no server process to keep alive:
npm install -g @playwright/cli@latest
playwright-cli install --skills
That second command drops a Skill into .claude/skills/ (or the equivalent for Copilot/Cursor/Codex), so the agent discovers every command by reading playwright-cli --help on its own — no tool schema to hand-wire into context.
You can drive it by hand first, the same way the agent will:
playwright-cli open https://demo.playwright.dev/todomvc/ --headed
playwright-cli type "Buy groceries"
playwright-cli press Enter
playwright-cli type "Water flowers"
playwright-cli press Enter
playwright-cli check e21
playwright-cli screenshot
The part that actually changes how an agent works is what comes back after every command — not a blob of accessibility-tree JSON stuffed into the response, but a pointer to a YAML file on disk:
> playwright-cli goto https://demo.playwright.dev/todomvc/
### Page
- Page URL: https://demo.playwright.dev/todomvc/
- Page Title: TodoMVC
### Snapshot
[Snapshot](.playwright-cli/page-2026-02-14T19-22-42-679Z.yml)
The agent only opens that file when it actually needs to reason about the DOM — e.g. to find the e21 ref for a checkbox — instead of re-parsing a full tree on every single turn. That’s the whole “state on disk, not tool schemas in context” bet from the TL;DR, made concrete: ten actions in a row cost ten cheap CLI invocations plus however many snapshot reads the agent actually decides it needs, not ten full-page trees round-tripped through the model.
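To make the pointer idea concrete, here is a minimal Python sketch of what a wrapper around the CLI could do with that output. It assumes the output always contains a `[Snapshot](path)` line shaped like the sample above; `snapshot_path` and `read_snapshot_if_needed` are hypothetical helper names, not part of playwright-cli.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the markdown-style pointer shown in the sample output: [Snapshot](some/path.yml)
SNAPSHOT_LINK = re.compile(r"\[Snapshot\]\(([^)]+)\)")


def snapshot_path(cli_output: str) -> Optional[Path]:
    """Return the snapshot file path mentioned in a command's output, if any."""
    match = SNAPSHOT_LINK.search(cli_output)
    return Path(match.group(1)) if match else None


def read_snapshot_if_needed(cli_output: str, need_dom: bool) -> Optional[str]:
    """Open the snapshot only when the caller actually has to inspect the page."""
    path = snapshot_path(cli_output)
    if not need_dom or path is None or not path.exists():
        return None
    return path.read_text(encoding="utf-8")


# Sample output copied from the session above (assumed to be representative).
SAMPLE_OUTPUT = """\
### Page
- Page URL: https://demo.playwright.dev/todomvc/
- Page Title: TodoMVC
### Snapshot
[Snapshot](.playwright-cli/page-2026-02-14T19-22-42-679Z.yml)
"""
```

The point of the `need_dom` flag is that the cheap path (skip the file) is the default, and the expensive path (read the whole snapshot) is an explicit decision.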
In practice, an end-to-end run from Claude Code looks like this:
> Use playwright skills to test the "add todo" flow on
https://demo.playwright.dev/todomvc, then write a Playwright test for it.
Claude works it like a person would — open the page, snapshot to get refs, type and submit two todos, check one off, screenshot as evidence — and only after the manual walkthrough succeeds does it emit the automated test:
playwright-cli run-code --filename=verify.ts # sanity-check a snippet before committing to it
playwright-cli generate-locator e21 # turn a ref into a stable Playwright locator
import { test, expect } from "@playwright/test";

test("can add and complete a todo", async ({ page }) => {
  await page.goto("https://demo.playwright.dev/todomvc/");
  await page.getByPlaceholder("What needs to be done?").fill("Buy groceries");
  await page.getByPlaceholder("What needs to be done?").press("Enter");
  await page.getByPlaceholder("What needs to be done?").fill("Water flowers");
  await page.getByPlaceholder("What needs to be done?").press("Enter");
  await page.getByRole("listitem").filter({ hasText: "Buy groceries" })
    .getByRole("checkbox").check();
  await expect(page.getByRole("listitem").filter({ hasText: "Buy groceries" }))
    .toHaveClass(/completed/);
});
Same manual-exploration-then-generate-test shape as the MCP workflow in my earlier post — the difference isn’t the workflow, it’s that every step along the way is a lightweight CLI call and a snapshot pointer instead of a tool-call response carrying the accessibility tree. If something misbehaves mid-run, playwright-cli show opens a live dashboard with a screencast of the session — useful for actually watching what the agent is doing rather than reading a wall of tool-call logs after the fact.
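A rough way to reason about the token difference is a back-of-the-envelope model. The Python sketch below is only that: every number in `TokenCosts` is an assumption chosen for illustration, not a measurement of Playwright CLI or the MCP server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenCosts:
    """Illustrative per-item token costs; every number here is an assumption."""
    full_tree: int = 4_000      # accessibility tree inlined in a tool-call response
    cli_ack: int = 60           # short CLI acknowledgement plus a snapshot pointer
    snapshot_read: int = 4_000  # reading one snapshot file from disk on demand


def inline_tree_cost(actions: int, costs: TokenCosts = TokenCosts()) -> int:
    """Every action returns the full tree in its response."""
    return actions * costs.full_tree


def pointer_cost(actions: int, snapshot_reads: int, costs: TokenCosts = TokenCosts()) -> int:
    """Every action returns a pointer; the agent pays only for the reads it chooses."""
    return actions * costs.cli_ack + snapshot_reads * costs.snapshot_read
```

With these made-up numbers, ten actions with three snapshot reads come to 12,600 tokens against 40,000 for inlined trees. If the agent reads a snapshot after every single action the pointer approach comes out slightly worse (40,600), so the saving depends entirely on the agent reading selectively.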
agent-browser (Vercel Labs)
vercel-labs/agent-browser — a Rust CLI from Vercel Labs, same philosophy as Playwright CLI: accessibility-tree snapshots with persistent element refs (@e1, @e2) instead of re-serializing the DOM every turn.
- Integrations: @agent-browser/sandbox for running itself inside ephemeral Vercel Sandbox microVMs, plus its own MCP server (agent-browser mcp) for tools that still expect one. Documented to work with Claude Code, Cursor, Codex, Copilot, Windsurf, Gemini CLI, Goose, and OpenCode.
- Adoption: 37.9k stars, more than Playwright CLI, though that’s partly Vercel’s reach rather than a signal of deeper coding-agent adoption specifically. One credible independent write-up: Pulumi’s engineering blog used it with Claude Code in a self-verifying CI loop.
What it actually feels like to use
Setup mirrors Playwright CLI’s two-command simplicity:
npm install -g agent-browser
agent-browser install # downloads Chrome for Testing on first run
Driving the same TodoMVC flow by hand, using @-prefixed refs read from a snapshot instead of Playwright CLI’s bare e1/e2 style:
agent-browser open https://demo.playwright.dev/todomvc/
agent-browser snapshot
- textbox "What needs to be done?" [ref=e1]
- list "todo-list" [ref=e2]
agent-browser fill @e1 "Buy groceries"
agent-browser press Enter
agent-browser fill @e1 "Water flowers"
agent-browser press Enter
agent-browser snapshot
agent-browser check @e6
agent-browser screenshot todo.png
Same shape as the Playwright CLI session — snapshot, act on a ref, snapshot again — just with @e1-style refs instead of bare e1, and fill/press instead of type/press. The open on the first line pays a browser-launch cost once; everything after it is fast because a background daemon stays warm between commands. Under the hood that daemon is Rust talking straight to the Chrome DevTools Protocol (CDP — the interface Chrome/Chromium expose for controlling pages, dispatching input, reading network traffic), where Playwright’s own bindings reach the same protocol through a bundled Node.js driver process instead. It detects an existing Playwright or Puppeteer install to reuse the Chrome binary, but doesn’t depend on either at runtime.
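The snapshot-then-act loop boils down to turning snapshot lines into a ref lookup. Here is a minimal Python sketch, assuming snapshot lines look exactly like the two shown in the session above (`- role "name" [ref=eN]`); real snapshots may contain nesting or unnamed elements this does not handle, and `parse_refs` / `find_ref` are hypothetical helpers, not part of agent-browser.

```python
import re
from typing import Dict, NamedTuple, Optional


class Element(NamedTuple):
    role: str
    name: str


# Matches lines shaped like: - textbox "What needs to be done?" [ref=e1]
LINE = re.compile(r'^\s*-\s+(?P<role>\w+)\s+"(?P<name>[^"]*)"\s+\[ref=(?P<ref>\w+)\]')

SAMPLE_SNAPSHOT = '''\
- textbox "What needs to be done?" [ref=e1]
- list "todo-list" [ref=e2]
'''


def parse_refs(snapshot: str) -> Dict[str, Element]:
    """Map '@e1'-style refs to the role and accessible name on that snapshot line."""
    refs: Dict[str, Element] = {}
    for line in snapshot.splitlines():
        match = LINE.match(line)
        if match:
            refs["@" + match.group("ref")] = Element(match.group("role"), match.group("name"))
    return refs


def find_ref(refs: Dict[str, Element], role: str, name_contains: str) -> Optional[str]:
    """Return the first ref whose role matches and whose name contains the given text."""
    for ref, element in refs.items():
        if element.role == role and name_contains.lower() in element.name.lower():
            return ref
    return None
```

An agent would call something like `find_ref(parse_refs(snapshot), "textbox", "needs to be done")` and pass the result to the next `fill` command.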
Where it diverges from Playwright CLI is what happens once the exploration succeeds. Playwright CLI’s agent loop ends by emitting an @playwright/test file, because Playwright ships a full test runner. agent-browser doesn’t — there’s no test-runner equivalent — so the natural artifact is a repeatable script of the same CLI commands via batch, which also cuts the per-command process overhead:
agent-browser batch \
'["open", "https://demo.playwright.dev/todomvc/"]' \
'["fill", "@e1", "Buy groceries"]' \
'["press", "Enter"]' \
'["fill", "@e1", "Water flowers"]' \
'["press", "Enter"]' \
'["check", "@e6"]' \
'["screenshot", "todo.png"]'
That’s the practical tradeoff versus Playwright CLI: a faster, dependency-free daemon and one CLI surface for both exploration and repeatable automation, in exchange for giving up Playwright’s mature test-runner ecosystem (fixtures, trace viewer, CI reporters) — you’d still reach for @playwright/test or Stagehand if you actually want a maintained regression suite out of the session.
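If the steps come from a recorded exploration rather than being typed by hand, the batch invocation can be assembled programmatically. This Python sketch assumes the batch format is exactly what the example above shows (one JSON array per step as a separate argument); `build_batch_argv` is a hypothetical helper.

```python
import json
import shlex
from typing import List, Sequence


def build_batch_argv(steps: Sequence[Sequence[str]]) -> List[str]:
    """Turn [["open", url], ["press", "Enter"], ...] into an argv list for `agent-browser batch`."""
    argv = ["agent-browser", "batch"]
    for step in steps:
        if not step:
            raise ValueError("each step needs at least a command name")
        argv.append(json.dumps(list(step)))  # one JSON array per step
    return argv


TODO_FLOW = [
    ["open", "https://demo.playwright.dev/todomvc/"],
    ["fill", "@e1", "Buy groceries"],
    ["press", "Enter"],
    ["check", "@e6"],
]

# shlex.join quotes each JSON step so the line can be pasted into a POSIX shell.
command_line = shlex.join(build_batch_argv(TODO_FLOW))
```

Passing the argv list straight to `subprocess.run(argv)` avoids shell quoting altogether; the joined string is only useful for logging or for committing the command to a script.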
Stagehand
Stagehand (Browserbase, MIT, 23.3k stars) — four primitives: act, extract, observe, agent. Where Playwright CLI and agent-browser are CLIs a coding agent shells out to, Stagehand is a TypeScript SDK you write code against directly — the LLM decision loop lives inside your own script, not behind a separate command-line process.
What it actually feels like to use
Setup is a scaffolding CLI, not a global install:
npx create-browser-app
cd my-stagehand-app
cp .env.example .env # add OPENAI_API_KEY, BROWSERBASE_API_KEY, etc.
npm start
That scaffolds a runnable index.ts using the same four primitives. Adapted to the TodoMVC flow from the previous two sections:
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";
const stagehand = new Stagehand({ env: "LOCAL" }); // or "BROWSERBASE" for a cloud session
await stagehand.init();
const page = stagehand.context.pages()[0];
await page.goto("https://demo.playwright.dev/todomvc/");
await stagehand.act("type 'Buy groceries' into the new-todo box and press Enter");
await stagehand.act("type 'Water flowers' into the new-todo box and press Enter");
const observeResult = await stagehand.observe("what can I click to mark the first todo as done?");
const actResult = await stagehand.act(observeResult[0]); // replay the previewed action, not a fresh prompt
console.log(actResult.cacheStatus); // "MISS" the first time, "HIT" on every rerun after
const { count } = await stagehand.extract(
  "how many completed todos are there?",
  z.object({ count: z.number() }),
);
observe() is the preview step — it returns a candidate action without running it, so you can inspect or approve it before act() executes it. Run the same script a second time and act() reports cacheStatus: "HIT": Browserbase’s server-side cache keys on the instruction, page content, and options, so a repeated call skips the LLM entirely and replays deterministically — until the page’s structure actually changes, at which point it falls back to a fresh model call automatically. That’s the “prompt-first while exploring, code-first once stable” pitch made concrete: the same three lines cost real inference on the first run and roughly zero on every rerun after, with no separate step to convert exploration into a locator.
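The HIT/MISS behaviour is easier to picture with a toy version. The Python sketch below illustrates the general idea of a cache keyed on instruction, page content, and options; it is not Browserbase’s implementation, and `ActionCache` and `cache_key` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict, Optional, Tuple


def cache_key(instruction: str, page_content: str, options: dict) -> str:
    """Hash everything that should invalidate the cached action when it changes."""
    payload = json.dumps(
        {"instruction": instruction, "page": page_content, "options": options},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ActionCache:
    def __init__(self, resolve: Callable[[str, str], str]) -> None:
        self._resolve = resolve  # stand-in for the expensive LLM call
        self._store: Dict[str, str] = {}

    def act(self, instruction: str, page_content: str,
            options: Optional[dict] = None) -> Tuple[str, str]:
        """Return (action, status), where status is "HIT" or "MISS"."""
        key = cache_key(instruction, page_content, options or {})
        if key in self._store:
            return self._store[key], "HIT"
        action = self._resolve(instruction, page_content)
        self._store[key] = action
        return action, "MISS"
```

Because the page content is part of the key, a structural change to the page produces a different hash and falls through to the model again, which matches the fallback behaviour described above.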
For the same “just describe the outcome” shape as Playwright CLI’s and agent-browser’s agent-driven sessions, agent() wraps the whole flow:
const agent = stagehand.agent();
await agent.execute(
  "Add two todos, 'Buy groceries' and 'Water flowers', then mark the first one complete."
);
The competitive angle versus the CLI tools: Playwright CLI and agent-browser hand an agent raw commands and a snapshot file, and the agent decides what to do next every turn — there’s no built-in mechanism for skipping repeated LLM calls once a flow is known-good. Stagehand’s caching does that natively, and switching from a local Chromium to a cloud one is a one-line config change (env: "BROWSERBASE") rather than a separate sandbox package or infra wiring. Browserbase itself is managed headless Chromium at scale, plus a Search API (Exa-powered), a Fetch API, and “Agent Identity” — Free / Developer $20/mo / Startup $99/mo / custom Scale, backed by a $40M Series B at ~$300M valuation.
Browser Use
Browser Use (MIT, Python, 102.8k GitHub stars as of writing — actively maintained, latest release 0.13.3, July 2026) takes a different shape than everything above it. Where Stagehand exposes act/extract/observe as primitives you compose yourself, Browser Use is fully agentic by default: you hand it one task string and an Agent.run() loop drives the whole session end-to-end, closer in spirit to Skyvern than to Stagehand.
What it actually feels like to use
Install is a package, not a CLI-first tool like Playwright CLI or agent-browser:
uv add browser-use # or: pip install browser-use
Requires Python 3.11+. The minimal shape is a single Agent, not a sequence of act/type/click calls:
import asyncio
from browser_use import Agent, BrowserProfile, ChatBrowserUse


async def main():
    agent = Agent(
        task=(
            "Go to https://demo.playwright.dev/todomvc/, add two todos "
            "'Buy groceries' and 'Water flowers', then mark the first one complete."
        ),
        llm=ChatBrowserUse(model="openai/gpt-5.5"),
        browser_profile=BrowserProfile(headless=False),
    )
    history = await agent.run()
    print(history.final_result())


asyncio.run(main())
Compare that to the Stagehand agent() call covering the identical task in the previous section — same one-sentence-task shape, but there’s no lower-level act/observe escape hatch to drop into if you want a specific step to be deterministic instead of re-planned by the model. Extensibility works the other direction: you register custom Tools() via @tools.action decorators that the agent can call mid-run (e.g. a “read this file” or “post to Slack” action), rather than scripting page-level steps yourself.
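For readers who have not seen the decorator-registration pattern before, here is a generic Python sketch of how such a registry can work. It deliberately does not use the browser_use package: `ActionRegistry` and `post_status` are hypothetical stand-ins, and the real `Tools` class may differ in every detail.

```python
from typing import Callable, Dict


class ActionRegistry:
    """Hypothetical stand-in for a tool registry; not the browser_use Tools class."""

    def __init__(self) -> None:
        self._actions: Dict[str, Callable[..., str]] = {}
        self.descriptions: Dict[str, str] = {}

    def action(self, description: str):
        """Decorator that registers a function under its own name."""
        def decorator(fn: Callable[..., str]) -> Callable[..., str]:
            self._actions[fn.__name__] = fn
            self.descriptions[fn.__name__] = description
            return fn
        return decorator

    def call(self, name: str, **kwargs) -> str:
        """What an agent loop would do when the model picks an action by name."""
        if name not in self._actions:
            raise KeyError(f"unknown action: {name}")
        return self._actions[name](**kwargs)


registry = ActionRegistry()


@registry.action("Post a short status message to a chat channel")
def post_status(channel: str, text: str) -> str:
    # A real action would call an external API; this one only formats a string.
    return f"[{channel}] {text}"
```

The descriptions are what the model would see when choosing an action; the registry maps its choice back to a plain function call.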
There’s also a scaffolding command, same idea as Stagehand’s create-browser-app:
uvx browser-use init --template default # writes a runnable browser_use_default.py
And a separate CLI surface (browser-use, powered by a companion project called Browser Harness) that drops you into a scripted REPL for direct page control rather than task-driven automation — plus a browser-use skill command that registers itself as a Claude Code/Codex Skill, letting a coding agent drive a connected browser directly instead of going through the Python Agent class at all.
The competitive pitch: multi-provider LLM support out of the box (ChatBrowserUse proxies OpenAI, Anthropic, and Google models behind one BROWSER_USE_API_KEY, plus direct ChatOpenAI/ChatAnthropic classes and local Ollama models), and a company-run benchmark (browser-use/benchmark, “BU Bench V1”) comparing model success rates — the project is as much a benchmarking/model-optimization effort as a browser-automation library. Where it’s genuinely thinner than Stagehand: no documented cache-and-replay mechanism, so every run costs a full agentic loop of LLM calls rather than a cheap cache hit on repeat.
Unlike Stagehand’s one-line env: "BROWSERBASE" swap, cloud isn’t a constructor flag on the same Agent class here — Browser Use Cloud (cloud.browser-use.com) is a separate hosted product: submit a natural-language task over the API, it spins up a remote browser and agent, and gives you a live, watchable session URL. Pay-as-you-go from $0.06/browser-hour, or $100/mo Starter.
Skyvern
Skyvern (AGPL-3.0, Python, 22.1k stars) is the odd one out in this category: not a CLI, not a library you import and drive yourself, but a local service — a FastAPI server plus a bundled web UI — that you talk to through a Python/TS SDK or the UI directly. Its pitch is vision-LLM-assisted form-filling at scale (insurance quotes, government forms, checkout flows), with a hybrid grounding model rather than a pure no-selectors one: you can drive it with plain CSS/XPath, a natural-language prompt, or “AI fallback” (selector first, AI only on failure).
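The “selector first, AI only on failure” idea is a plain fallback pattern. A minimal Python sketch, with both locators passed in as callables so nothing here depends on Skyvern’s actual API (`locate_with_fallback` is a hypothetical name):

```python
from typing import Callable, Optional, Tuple

# A locator returns some handle for the element, or None when it finds nothing.
Locator = Callable[[], Optional[str]]


def locate_with_fallback(by_selector: Locator, by_ai: Locator) -> Tuple[str, str]:
    """Try the cheap deterministic locator first, the AI-backed one only on failure."""
    element = by_selector()
    if element is not None:
        return element, "selector"
    element = by_ai()
    if element is not None:
        return element, "ai"
    raise LookupError("neither the selector nor the AI locator found the element")
```

The second tuple element records which path was taken, which is useful for spotting selectors that have quietly started failing and are costing a model call on every run.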
What it actually feels like to use
Quickstart spins up the whole service, not just a package:
pip install "skyvern[all]"
skyvern quickstart # starts the FastAPI server + UI at http://localhost:8080
The SDK then layers Playwright-level control and AI-augmented calls on the same page object — the interesting part is that they compose, so you choose per-step how much AI you want:
import asyncio
from skyvern import Skyvern


async def main():
    skyvern = Skyvern.local()  # or Skyvern(api_key=...) for Skyvern Cloud
    browser = await skyvern.launch_cloud_browser()
    page = await browser.get_working_page()
    await page.goto("https://demo.playwright.dev/todomvc/")
    await page.click(prompt="the new-todo input box")  # AI-located, no selector written
    await page.agent.run_task(  # or hand off the whole flow as one task
        "Add two todos, 'Buy groceries' and 'Water flowers', then mark the first one complete."
    )
    result = await page.extract(
        "how many completed todos are there?", schema={"count": "int"}
    )


asyncio.run(main())
page.act(prompt) / page.extract(prompt, schema) / page.validate(prompt) are the Stagehand-shaped primitives here; page.agent.run_task(...) is the Browser-Use-shaped one-liner. page.agent.run_workflow(workflow_id) invokes a saved, structured multi-step definition built in Skyvern’s no-code workflow builder (Browser Task/Action blocks, data extraction, loops, HTTP requests, custom code) — the closest thing here to a “record once, replay forever” artifact, though it’s explicitly authored rather than auto-generated from a successful AI run.
Skyvern’s blog/technical-report material references a “Route Memorization” feature that auto-converts a successful AI path into a deterministic Playwright script — not documented in the README, so treat it as unconfirmed rather than a shipped feature.
The genuinely distinctive bits versus Browser Use and Stagehand: first-class 2FA/TOTP handling (QR, email, SMS) with dedicated docs, password-manager integration (Bitwarden today, 1Password/LastPass planned), an MCP server, and Zapier/Make/N8N integrations — this is a tool built around unattended, credentialed form-filling rather than exploratory browsing. Its 85.85% WebVoyager score is self-reported by Skyvern; a neutral aggregator (Steel’s leaderboard) repeats the same number, but that’s a compilation of vendor-reported runs, not an independent re-execution — treat it as directionally useful, not a controlled benchmark.
Cloud is Skyvern Cloud (Skyvern(api_key=...) instead of Skyvern.local()) — the README is explicit that the OSS core is feature-complete except for what it calls anti-bot measures: proxy rotation and CAPTCHA solving are cloud-exclusive, alongside no-infra hosting and parallel run capacity.
Cursor Cloud Agents
A different shape of “cloud browser” than everything above: Cursor Cloud Agents (cursor.com/cloud) isn’t infra you point a browser-automation library at — it’s a coding-agent product that happens to bundle a browser, not a standalone automation service like Browserbase or Browser Use Cloud.
You dispatch a task from wherever you already work, not just the IDE: Cursor Desktop’s “Cloud” option, the web dashboard at cursor.com/agents, the mobile apps, an @cursor mention on a GitHub/GitLab/Bitbucket PR comment, Slack, Teams, a Linear/Jira assignment, or the API. Each task gets its own ephemeral cloud VM — full desktop and terminal, not just a headless container — that clones the repo onto a working branch, installs dependencies (agent-led setup, a saved snapshot, or a custom Dockerfile via .cursor/environment.json), and gets network access. Nothing is streamed to you live by default: a dashboard lists running agents, and completion produces a merge-ready PR plus a video recording of the agent’s actual desktop/browser session, screenshots, and logs — though you can grab remote-desktop control of a still-running VM if you want to watch or intervene.
The browser-verification part isn’t a flag you set — it’s the agent deciding, mid-task, to open the app it just changed and click through it to check the result, then attaching that session as evidence rather than handing off its first untested attempt. Cursor’s own launch post claimed 30% of their internally merged PRs came from agents running this way; CEO Michael Truell has since cited 35%+ in press coverage as adoption grew. Billing is API/token-metered per model rather than a separate SKU — cloud agents always run in Max Mode, gated by a spend limit you set once, and the capability itself is bundled into Ultra/Teams/Enterprise plans.
2. Consumer / enterprise agentic browsers — same tech, different job
These use the same LLM-drives-a-browser mechanics as category 1, but the split holds up: they act inside your authenticated browser session — your email, your accounts, your cookies — not a disposable automation instance, and they ship as closed products from a single vendor rather than a library you wire into your own agent loop.
Claude for Chrome
A Chrome extension, not a separate app — install it from the Chrome Web Store and it acts inside your existing logged-in tabs. Still in beta (Anthropic’s own product page calls it “a beta feature with unique risks”), available on all paid plans rather than gated to a specific tier. Two concrete capabilities beyond one-shot “do this now” prompts: scheduled tasks that re-run on a daily/weekly/monthly cadence, and multi-tab workflows where you drag tabs into a Claude tab group so the agent can read and act across all of them at once.
The programmable primitive underneath, if you want to build the same click/type/screenshot loop yourself rather than use the extension, is Anthropic’s Computer Use tool on the Developer Platform (also available via Bedrock and Vertex): the model returns a tool_use action — screenshot, left_click, type, scroll — as JSON, your code executes it against a VM/container you run, and you feed the resulting screenshot back in.
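The shape of that loop can be sketched without calling any API. In the Python sketch below the model and the VM are both stubs: `scripted_model` replays a fixed list of actions and `fake_executor` returns a label instead of a screenshot. The action names come from the description above; the dictionary field names (`action`, `coordinate`, `text`) are assumptions for illustration, not a statement of the real request or response format.

```python
from typing import Callable, Dict, List, Optional

Action = Dict[str, object]


def run_loop(
    next_action: Callable[[List[str]], Optional[Action]],
    execute: Callable[[Action], str],
    max_steps: int = 10,
) -> List[Action]:
    """Ask the model for an action, execute it, feed the new screenshot back, repeat."""
    screenshots: List[str] = []
    performed: List[Action] = []
    for _ in range(max_steps):
        action = next_action(screenshots)
        if action is None:  # the model signals that the task is finished
            break
        screenshots.append(execute(action))
        performed.append(action)
    return performed


def scripted_model(script: List[Action]) -> Callable[[List[str]], Optional[Action]]:
    """Stand-in for the model: replays a fixed list of actions, one per turn."""
    def next_action(screenshots: List[str]) -> Optional[Action]:
        step = len(screenshots)
        return script[step] if step < len(script) else None
    return next_action


def fake_executor(action: Action) -> str:
    """Stand-in for the VM: returns a label where a real screenshot would go."""
    return f"screenshot-after-{action['action']}"
```

The `max_steps` cap matters in a real loop: without it, a model that never decides it is done keeps spending tokens and clicking.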
Worth knowing: CyberScoop reported a real vulnerability chain — a permissive *.claude.ai origin allowlist combined with a DOM XSS bug in a CAPTCHA component let another browser extension achieve zero-click prompt injection and hijack the agent’s session. Disclosed December 2025, patched in extension v1.0.41. This class of risk (an agent that can read and act on any page you have open) is structural to the category, not unique to Anthropic’s implementation — expect similar disclosures from other in-browser agents as they mature.
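Why a wildcard origin allowlist widens the attack surface is easy to show in a few lines. This Python sketch is a generic illustration, not the extension’s code: the two check functions and the example hostnames in the usage note are hypothetical.

```python
from urllib.parse import urlsplit

TRUSTED_SUFFIX = ".claude.ai"          # wildcard rule: any subdomain is trusted
TRUSTED_EXACT = {"https://claude.ai"}  # strict rule: only listed origins are trusted


def origin_of(url: str) -> str:
    """Reduce a URL to scheme://host[:port], the unit browsers use for trust decisions."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def allowed_by_wildcard(url: str) -> bool:
    host = urlsplit(url).hostname or ""
    return host == TRUSTED_SUFFIX[1:] or host.endswith(TRUSTED_SUFFIX)


def allowed_by_exact(url: str) -> bool:
    return origin_of(url) in TRUSTED_EXACT
```

Under the wildcard rule a page such as `https://widget.claude.ai/` (a made-up subdomain) is trusted just as much as the main app, so a script-injection bug in any component hosted on any subdomain inherits that trust. The exact-match rule rejects it.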
OpenAI Operator / ChatGPT agent
Folded into ChatGPT’s unified agent mode. Third-party reviews through 2026 have been mixed to critical (overcautious, constant confirmation prompts) — treat that sentiment as opinion-blog territory, not a controlled study.
Microsoft Edge Copilot Mode
Enterprise/IT-policy-gated agentic browsing for Edge for Business, still in limited preview. Notable mainly as market context: Microsoft is betting on agentic browsing at the OS/browser level with Edge, while separately shipping Playwright CLI as the dev-tool-layer bet — same company, two different audiences.
One neutral cross-tool reference point, if you want actual numbers instead of scattered vendor claims: Steel’s open leaderboard tracks Skyvern, Browser Use, Operator, and Claude Computer Use against WebVoyager/OSWorld/Online-Mind2Web. It’s still self-reported per source, dated and attributed rather than independently re-run, and the maintainers themselves flag that WebVoyager has saturated (>90% scores common at the top) — useful for rough ordering, not a rigorous head-to-head.
3. AI-native QA reporting
These tools don’t drive a browser. They consume Playwright/Cypress CI output and add an AI layer on top — failure classification, flakiness scoring, root-cause tagging. Easy to conflate with category 1 because they’re all “AI + Playwright,” but they solve “why did this test fail” rather than “how do I drive this test.”
ReportPortal
ReportPortal — open-source (Apache-2.0, EPAM-originated), self-hosted-first test analytics platform, and the most technically substantive AI story of the group: its service-auto-analyzer component runs similarity/clustering search over indexed logs and stack traces (Elasticsearch-backed) to auto-assign new failures to previously triaged defect types — genuine ML/information-retrieval, not an LLM wrapper. Ships an official MCP server (reportportal/reportportal-mcp-server) exposing auto-analysis, unique-error-analysis, quality gates, and launch/log queries to any MCP-speaking agent. Free to self-host indefinitely — the only tool in this group without a paywall between you and the AI features.
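To give a feel for what similarity-based triage means, here is a deliberately simple Python sketch: token-set Jaccard similarity against previously triaged logs. It is an illustration of the general technique only; ReportPortal’s analyzer is Elasticsearch-backed and certainly more sophisticated, and the function names, threshold, and sample logs in the usage note are assumptions.

```python
import re
from typing import Dict, Optional, Set, Tuple


def tokens(log: str) -> Set[str]:
    """Lowercase word tokens with digits dropped, so ids and timeouts don't dominate."""
    return set(re.findall(r"[a-z_]+", log.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_defect_type(
    new_log: str, triaged: Dict[str, str], threshold: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Return (defect_type, score) of the most similar triaged log, or None below threshold."""
    new_tokens = tokens(new_log)
    best: Optional[Tuple[str, float]] = None
    for old_log, defect_type in triaged.items():
        score = jaccard(new_tokens, tokens(old_log))
        if score >= threshold and (best is None or score > best[1]):
            best = (defect_type, score)
    return best
```

A failure that differs from a triaged one only in its timeout value or line number scores close to 1.0 and inherits the earlier label; anything below the threshold is left for a human to investigate.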
TestDino
TestDino — AI failure classification into four buckets (actual bug / UI-change failure / flaky / misc) with confidence scores, per-test stability tracking, GitHub PR annotations. The one thing that separates it from older reporting tools: a genuine open-source MCP server with 27 tools letting Claude/Cursor/Copilot query test-run and root-cause data directly — confirmed as a real engineering artifact, not just marketing copy, though I found no independent user reviews (G2/Reddit/HN) corroborating how well it works in practice.
Currents.dev
Currents.dev — a Playwright/Cypress-focused CI dashboard and test orchestrator: aggregates run results, auto-balances specs across CI machines (claiming up to ~50% faster than native Playwright sharding), flaky-test detection, and analytics/regression trends. Its AI story is an official MCP server plus a Playwright “agent skill” and IDE extension, letting Claude/Cursor/Copilot query historical test data directly — the flaky-detection and error-classification mechanisms themselves aren’t documented as ML- or LLM-based, so treat “AI-driven” there as marketing rather than a verified technique.
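Duration-based balancing in general can be sketched with a greedy “longest spec first, onto the least-loaded machine” heuristic. The Python below is that generic heuristic, not Currents’ algorithm, which is not described in the material I reviewed; the spec names and durations in the usage note are made up.

```python
import heapq
from typing import Dict, List, Tuple


def balance_specs(durations: Dict[str, float], machines: int) -> List[List[str]]:
    """Assign specs to machines, longest first, always onto the least-loaded machine."""
    if machines < 1:
        raise ValueError("need at least one machine")
    # Heap of (current load in seconds, machine index); the smallest load pops first.
    heap: List[Tuple[float, int]] = [(0.0, i) for i in range(machines)]
    heapq.heapify(heap)
    buckets: List[List[str]] = [[] for _ in range(machines)]
    # Sort by duration descending, then by name so ties are deterministic.
    for spec, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        buckets[index].append(spec)
        heapq.heappush(heap, (load + seconds, index))
    return buckets
```

For four specs taking 90, 60, 30, and 30 seconds on two machines, this yields loads of 120 and 90 seconds, whereas an unlucky split that puts the 90- and 60-second specs on the same machine leaves it with 150 seconds of work while the other idles after 60.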
Allure TestOps
Allure TestOps (Qameta) — the paid layer on top of the widely-used open-source Allure Report standard: manual + automation test management, CI/BTS/TMS integrations, high-load result storage. Its AI story is younger and narrower than ReportPortal’s — no native ML failure analysis — built instead around an official MCP server (public beta since release 26.1.1, ~13 tools) for creating/searching test cases via its query language and managing launches, results, and test plans from Claude Desktop, Claude Code, or Cursor.
Testomat.io
Testomat.io — a manual + automated test-management tool built around AI throughout the workflow rather than one bolt-on feature: AI Failure Clusterization (groups automated failures by detected pattern, shipped on all tiers including free — older claims that it “lacks AI analysis” are stale) and an embedded AI Agent Test Assistant. No first-party MCP server surfaced as of this research, so its AI integration is in-product rather than agent-facing.
Feature and pricing comparison, verified directly against each vendor’s own site/docs/pricing page:
| Tool | AI/ML mechanism | Agent integration | Entry price | Notes |
|---|---|---|---|---|
| ReportPortal | ML/similarity clustering (Elasticsearch-backed auto-analysis) | Official MCP server (Docker, stdio/HTTP) | Free, self-hosted (Apache-2.0) | SaaS also available, ~$570–600/mo (Startup tier), if you don't want to self-host |
| TestDino | 4-bucket failure classification with confidence scores | Open-source MCP server, 27 tools | $39/mo (annual) / $49/mo (monthly) | Free tier: 5k executions/mo |
| Currents.dev | Undocumented mechanism (marketed as "AI-driven" flaky detection) | Official MCP server + Playwright agent skill | $49/mo | Usage-based, 10k results included, $5/1k overage |
| Allure TestOps | None native — AI story is the MCP server, not ML analysis | Official MCP server, public beta, ~13 tools | $39/user/mo, sliding to $30/user/mo at scale | Per-seat, not per-execution |
| Testomat.io | AI Failure Clusterization + AI Agent Test Assistant | None confirmed | $30/mo/user ($27 annual) | AI clusterization ships on all tiers, including free |
None of these generate or drive tests — they’re analytics/reporting layers on top of whatever ran the browser in category 1.
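Because the pricing models differ (usage-based versus per-seat), a quick calculation helps when comparing them. The Python sketch below uses only the figures from the table; the assumption that overage is billed per started block of 1,000 results is mine, so check the vendor’s billing terms before relying on it.

```python
import math


def currents_monthly_cost(results: int) -> int:
    """$49 base with 10k results included, then $5 per started 1k of overage (assumed rounding)."""
    base, included, per_block, block = 49, 10_000, 5, 1_000
    overage_blocks = math.ceil(max(0, results - included) / block)
    return base + overage_blocks * per_block


def allure_monthly_cost(seats: int, per_seat: int = 39) -> int:
    """Per-seat pricing at the table's entry rate; volume discounts are not modelled."""
    return seats * per_seat
```

At 25,000 results a month the usage-based plan comes to $124, while five seats on the per-seat plan come to $195 regardless of how many tests run.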
Decision guide
| If you're... | Reach for |
|---|---|
| Testing/exploring a web app from Claude Code, token budget matters | Playwright CLI or agent-browser |
| Running the automation itself inside ephemeral Vercel Sandbox microVMs | agent-browser |
| Want natural-language actions with fallback to deterministic code, plus caching | Stagehand |
| Want one task string to drive the whole session instead of composing primitives | Browser Use |
| Automating credentialed, form-heavy flows at scale (2FA, password managers) | Skyvern |
| Need production-grade cloud browser sessions behind an API | Browserbase |
| Want a coding agent that verifies its own change by driving a browser, fully in the cloud | Cursor Cloud Agents |
| Want AI triage on top of existing Playwright/Cypress CI failures | ReportPortal — free, self-hosted, and the most substantive ML story (TestDino/Currents/Allure otherwise, if you'd rather pay for SaaS than run infra) |
| Want a human-facing agentic browser, not a coding-agent tool | Claude for Chrome |
Conclusion
The clearest signal in this space right now isn’t any single tool — it’s that Playwright CLI’s and Vercel’s agent-browser independently converged on the same architecture: state on disk, not tool schemas in context. That’s a strong sign the MCP-tool-call pattern for driving a browser is being displaced by CLI-first design specifically for coding agents, even as MCP stays relevant for other integration points.
Category 1 itself splits further than “primitives” implies: Playwright CLI, agent-browser, and Stagehand hand you building blocks to compose; Browser Use and Skyvern go further and take a single task string, running the whole agentic loop themselves. That’s a real architectural choice, not a maturity gradient — which one you want depends on whether you need a specific step to stay deterministic or you’re fine with the model re-planning every turn.
There’s a second, almost inverse convergence in category 3: TestDino, Currents.dev, ReportPortal, and Allure TestOps are all independently shipping MCP servers, but not to drive anything — they let an agent query test-run and root-cause data instead. Opposite job from category 1’s story, same protocol. Everything else in this post — consumer agentic browsers — sits in an adjacent lane that’s easy to lump in under “AI browser tools” but solves a different problem: acting inside your session rather than a disposable automation instance.
References
- https://github.com/microsoft/playwright-cli
- https://github.com/vercel-labs/agent-browser
- https://github.com/browserbase/stagehand
- https://github.com/browser-use/browser-use
- https://github.com/skyvern-ai/skyvern
- https://www.browserbase.com/
- https://cloud.browser-use.com/
- https://cursor.com/cloud
- https://reportportal.io/
- https://testdino.com/
- https://github.com/testdino-hq/testdino-mcp
- https://currents.dev/
- https://qameta.io/
- https://testomat.io/
- https://leaderboard.steel.dev/
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool
- https://cyberscoop.com/claude-chrome-extension-allows-plugins-to-hijack-ai/
- My earlier post: Testing with Playwright and Claude Code