VM sandbox vs accessibility tree. Two axes, not alternatives.

The two phrases get compared in the same blog posts, threads, and roadmap docs, as if a team building a computer use agent has to pick one. They do not. A sandbox is an execution environment, an answer to where does the agent process run. The accessibility tree is a grounding primitive, an answer to how does the agent identify what to click. They sit at different layers of the same stack, and any honest deployment ends up picking one of each. This page is the long form of that argument, anchored in one open-source repo that ships both layers next to each other.

Matthew Diakonov, Written with AI

Published May 20, 2026. 8 min read.

Direct answer (verified 2026-05-20)

They are different layers, not alternatives. VM sandbox is where the agent runs (your host machine, a Vagrant box, a Scrapybara or Browserbase remote desktop, an E2B sandbox). Accessibility tree is how the agent finds targets (structured roles and names from UIA on Windows or AXUIElement on macOS, versus pixel coordinates from a screenshot). You always pick both. Most public benchmarks compare Anthropic’s pixel-in-a-Docker reference to host-running AX frameworks like Terminator, and ascribe the differences to the sandbox axis when half the gap is on the grounding axis. The same Terminator binary runs on the host AND inside a Vagrant Windows 10 VM, gated by one env var (TERMINATOR_HEADLESS) read at virtual_display.rs line 165. The UIA walker is the same in both runs.

The two axes are independent

Pull them apart and the cells fall out cleanly. Execution environment along one axis, grounding primitive along the other. Each cell is a real shipped stack today, and the comparisons that matter run within a column, not across the whole matrix.

Four shipped stacks, two independent axes

Cell one, host + AX tree: Terminator default. Drives the user's logged-in apps via UIA/AX. No virtualization.

Cell two, host + pixels: Local screenshot agents. PyAutoGUI, OpenCV templates, Anthropic CU in a window.

Cell three, VM + AX tree: Terminator with TERMINATOR_HEADLESS=1 in a Vagrant Windows 10 VM. Same engine.

Cell four, VM + pixels: Anthropic's reference Docker container. Scrapybara, Browserbase, E2B desktops.

Anthropic’s reference Claude Computer Use container is cell four (VM + pixels). Scrapybara, Browserbase’s computer use product, and E2B’s desktop sandboxes default to cell four as well. PyAutoGUI scripts and Anthropic’s desktop client running locally sit in cell two (host + pixels). Terminator’s default install sits in cell one (host + AX). Cell three is the configuration the rest of this page is about: Terminator running unchanged inside the bundled Vagrant Windows 10 VM with TERMINATOR_HEADLESS=1 set on the guest. Same Rust crate. Same MCP tool surface. Same selector grammar. Different cell.
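The matrix is small enough to write down as data. The sketch below is illustrative Python, not code from the Terminator repo: the names ENVIRONMENTS, GROUNDINGS, cells, and axes_changed are made up for this page. It numbers the cells the way the paragraph above does and reports which axes differ between any two of them.

```python
from itertools import product

# The two independent axes from the article.
ENVIRONMENTS = ("host", "vm")        # where the agent process runs
GROUNDINGS = ("ax_tree", "pixels")   # how the agent finds what to click

# Example stacks named in the article, keyed by (environment, grounding).
EXAMPLES = {
    ("host", "ax_tree"): "Terminator default install",
    ("host", "pixels"): "PyAutoGUI script",
    ("vm", "ax_tree"): "Terminator in the Vagrant VM with TERMINATOR_HEADLESS=1",
    ("vm", "pixels"): "Anthropic reference Docker container",
}


def cells():
    """Map cell numbers one to four onto (environment, grounding) pairs."""
    pairs = product(ENVIRONMENTS, GROUNDINGS)
    return {number: pair for number, pair in enumerate(pairs, start=1)}


def axes_changed(a, b):
    """Return the names of the axes that differ between two cell numbers."""
    table = cells()
    (env_a, ground_a), (env_b, ground_b) = table[a], table[b]
    changed = []
    if env_a != env_b:
        changed.append("environment")
    if ground_a != ground_b:
        changed.append("grounding")
    return changed
```

Calling axes_changed(1, 3) reports only the environment axis, while axes_changed(1, 4) reports both, which is the difference between a single-axis comparison and a cross-cell one.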

Where the conflation came from

Anthropic’s anthropic-quickstarts/computer-use-demo repository ships a single Dockerfile that bundles four things: a virtual display server, a screenshot loop, the computer_20251124 tool that returns clicks at pixel coordinates, and a Linux guest with a small set of preinstalled apps. It works as one artifact, which is exactly why people compare it as one artifact. “Should we use the Anthropic VM sandbox or build on the accessibility tree?” reads as a clean binary, but it folds the sandbox-versus-host axis and the pixel-versus-tree axis into one question. The two changes happen to ship together in that reference; that is not the same as the two changes being the same change.

The unbundled comparison is the useful one. A team building a finance back-office bot that needs to drive an installed Windows desktop app in production should compare cell one to cell three: AX on the user’s machine versus AX inside a managed VM. A team prototyping a research agent that has to read pixel charts in Figma should compare cell two to cell four: pixels on a workstation versus pixels in a hosted sandbox. The cross-cell comparisons (one versus four, two versus three) are real product decisions, but the differences between those cells split fifty-fifty between the two axes, and pinning the differences on one axis is the bug.

Same engine, two cells: the proof in the repo

Open the repo and look at how thin the diff is between cell one (host + AX) and cell three (VM + AX). Two files do the work. The Vagrantfile provisions the guest. The headless detection function flips the session ID. The accessibility tree walker is unchanged.

Same Terminator binary, two execution environments

# Terminator on the user's host machine.
# Same Rust binary, default mode.
$ npx -y terminator-mcp-agent@latest
[info] Using session ID: 1
[info] AX walker: Windows UIA (target_os = "windows")
[info] Driving user's existing Chrome session, cookies intact.
# Agent calls get_window_tree on Slack:
# returns clean role+name where Electron exposes it,
# OmniParser fallback where it doesn't.
# Same MCP grammar as the VM run below.

no virtualization, drives user's existing app sessions
session ID 1 (real desktop)
UIA walker reads role + name from native apps directly

The Vagrantfile lives at vagrant/Vagrantfile. It pulls the stromweld/windows-10 box, gives it 8192 MB of RAM and 4 CPU cores, turns on nested paging and large pages for VT-x, and sets monitor-count to 3. It forwards port 22 for SSH, 8080 for the MCP server, and 9222 for Chrome remote debugging, and syncs the host workspace into C:/Users/vagrant/terminator. A PowerShell provisioning script then installs OpenSSH, Scoop, Git, Python, NodeJS, Rust, and Chrome. That is everything an AX-grounded agent needs to operate, and none of it touches the accessibility-tree code path.
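Once the guest is up, a quick way to confirm the forwards is to try a TCP connection from the host. The sketch below is a generic reachability check, not part of the repo. It assumes the forwarded ports appear on 127.0.0.1 under the same numbers as in the guest, which depends on how the Vagrantfile maps them and may not hold (Vagrant commonly remaps SSH, for example). The names FORWARDED, port_open, and check_forwarded are illustrative.

```python
import socket

# Guest ports the article says are forwarded for the MCP server and Chrome.
# Assumption: the host side uses the same port numbers.
FORWARDED = {"mcp_server": 8080, "chrome_debug": 9222}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_forwarded(host="127.0.0.1", ports=None):
    """Probe each named port and return a name -> reachable mapping."""
    ports = FORWARDED if ports is None else ports
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, reachable in check_forwarded().items():
        print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

A reachable port only shows that something is listening; it does not confirm that the listener is the MCP server or Chrome.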

0 lines

“default TreeWalker does not traverse windows, so we need to traverse windows manually”

crates/terminator/src/platforms/tree_search.rs, line 1

Zero is the number of lines that change in the AX walker between the two cells. The comment quoted above is the first comment in the platform shim that backs both runs. It describes a quirk of the Windows UIA API that the implementation has to work around, and that workaround code is identical on the host and inside the VM. The session-ID flip happens one layer up, in virtual_display.rs. The walker never knows which session it’s talking to.
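To make the layering concrete, here is a toy model of tree-based grounding. It is a sketch under stated assumptions, not Terminator's walker or its selector grammar: the Node type, find, and click_point are hypothetical names, and the search is a plain depth-first match on role and name. The thing to notice is the signature. The lookup takes a tree and nothing else, so there is no session parameter for it to depend on.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical accessibility node: a role, a name, and screen bounds."""
    role: str
    name: str
    bounds: tuple  # (x, y, width, height)
    children: list = field(default_factory=list)


def find(node, role, name):
    """Depth-first search for the first node matching role and name."""
    if node.role == role and node.name == name:
        return node
    for child in node.children:
        hit = find(child, role, name)
        if hit is not None:
            return hit
    return None


def click_point(node):
    """Return the integer center of a node's bounds."""
    x, y, width, height = node.bounds
    return (x + width // 2, y + height // 2)


# The same tree shape could come from a host session or a guest session.
window = Node("Window", "Notepad", (0, 0, 800, 600), [
    Node("Button", "Save", (10, 20, 80, 30)),
])
save = find(window, "Button", "Save")
print(click_point(save))  # (50, 35)
```

Whichever layer hands the tree to find decides where that tree came from. In this model that decision sits entirely outside the lookup.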

The numbers that define the boundary

Four numbers from the repo, not benchmarks. Each one is a property of the boundary between the two axes.

1 env var (TERMINATOR_HEADLESS) flips host mode to VM mode

8192 MB RAM in vagrant/Vagrantfile, Win10 box stromweld/windows-10

9222 guest port forwarded for Chrome remote debugging in the VM

0 lines of AX-walker code change between host and VM runs

The headline number is zero. Zero lines of AX-walker code differ between the cell-one and cell-three runs. If the two layers were really one (if sandbox-versus-host and pixels-versus-AX were the same axis), that number could not be zero. It is zero because they are not the same axis.

When the sandbox axis actually matters

None of this is an argument against VM sandboxing. There are three deployments where sandboxing is the right call regardless of grounding. First, server-side automation that has no user desktop to share, where the host case does not even exist and you need a guest to operate. Second, security-bounded automation where you are running scripts or third-party MCP servers whose blast radius needs to be contained to one virtual machine. Third, reproducibility-critical automation where the same Windows build, same DPI, same locale, same default-app set has to apply every run, and you want the Vagrant guest’s known-clean state instead of whatever drift accumulates on a developer’s laptop over six months.

Outside those three, the host run is cheaper and more useful, because it operates inside the user’s existing sessions. The user has Slack logged in, the browser has the cookies, the IDE has the project open. A VM means re-authing all of that inside the guest. None of those tradeoffs are about the accessibility tree. The decision is on the sandbox axis. Picking AX-grounding or pixel-grounding inside whichever environment you chose is a separate decision, made for separate reasons (token cost, app type, surface complexity).
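The three cases reduce to a short checklist. The sketch below encodes them as a hypothetical helper, choose_environment, written for this page rather than taken from any library. It deliberately takes no grounding argument, because nothing in the sandbox decision depends on AX versus pixels.

```python
def choose_environment(has_user_desktop, runs_untrusted_code, needs_reproducible_state):
    """Pick "vm" if any of the three sandbox cases applies, else "host".

    Returns the choice plus the reasons that triggered it, so the decision
    can be pasted into a roadmap doc.
    """
    reasons = []
    if not has_user_desktop:
        reasons.append("server-side run with no user desktop to share")
    if runs_untrusted_code:
        reasons.append("blast radius must be contained to one VM")
    if needs_reproducible_state:
        reasons.append("same build, DPI, locale, and default apps every run")
    return ("vm" if reasons else "host", reasons)


# An assistant acting inside the user's own logged-in apps.
print(choose_environment(True, False, False))   # ('host', [])

# A batch job on a server that also runs third-party MCP servers.
print(choose_environment(False, True, False)[0])  # vm
```

The grounding decision is made separately, on token cost, app type, and surface complexity.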

What to take to your roadmap doc

Replace any line in your roadmap that reads “VM sandbox vs accessibility tree” with two separate lines. The first asks where the agent process is going to live: on a user’s host machine, in a Vagrant or VirtualBox guest you provision yourself, or in a hosted sandbox like Scrapybara or Browserbase. Answer that one on isolation, sharing, and operational cost. The second asks how the agent identifies what to click: structured accessibility-tree nodes from UIA or AX, pixel detection from OCR or vision models, or both at once with the merging logic from tree_formatter.rs. Answer that one on token cost, app coverage, and latency. The two answers compose into a cell. Cells three and one are both legitimate Terminator deployments. Cell four is the Anthropic reference. Cell two is a PyAutoGUI script. They are all real, and they are all the answer to two questions, not one.


Questions about VM sandbox versus accessibility tree

Are VM sandbox and accessibility tree two ways to do computer use?

No. They sit at different layers. A VM sandbox is an execution environment: where the agent process runs and which OS it sees. The accessibility tree is a grounding primitive: how the agent identifies what to click. You always have to pick both. Most articles treat them as a binary because Anthropic's reference Claude Computer Use container ships a Docker image with pixel-only grounding hardcoded inside; people see that one stack and assume sandbox means pixels. It does not. AX-based agents work unchanged inside any VM that exposes its own desktop session.

Where does the conflation come from?

Anthropic's anthropic-quickstarts/computer-use-demo on GitHub ships a Dockerfile with a custom virtual display, a custom screenshot loop, and the computer_20251124 tool calling pixel coordinates. The whole stack is one artifact. When developers compare 'VM sandbox computer use' to a Playwright-style framework, they are really comparing two stacks that happen to differ on both axes at once: pixel-Docker versus AX-on-host. Then they ascribe the differences to one axis. The honest comparison fixes one axis at a time.

Can Terminator run inside a VM today?

Yes. The repository ships vagrant/Vagrantfile out of the box. It provisions a stromweld/windows-10 box with 8192 MB RAM, 4 CPU cores, and monitor-count 3, enables RDP, and installs OpenSSH, Scoop, Git, Python, NodeJS, Rust, and Chrome during provisioning. Port 8080 is forwarded for the MCP server and port 9222 for Chrome remote debugging. Set TERMINATOR_HEADLESS=1 inside the VM and the same terminator-mcp-agent npm package runs with a virtual session ID instead of the desktop session ID. The Windows UIA adapter is identical in both runs.

What is TERMINATOR_HEADLESS actually doing?

It is a string check inside is_headless_environment() at crates/terminator/src/platforms/windows/virtual_display.rs line 165. If the env var equals 'true' or '1', VirtualDisplayManager::initialize() sets session_id to 0 (a virtual session) and calls create_virtual_session(). Otherwise it sets session_id to 1 (the desktop session) and skips virtual setup. The AX walker that reads UIA roles, names, and bounds is downstream of both branches. The clustering pass, the MCP tool surface, and the click resolver do not know which branch was taken.
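The branch described above is small enough to model in a few lines. This is a Python model of the described behaviour, not the Rust source. The function names mirror the ones named in the answer, but the exact string comparison (case-sensitive, no trimming) is an assumption, and the create_virtual_session step is left out because its contents are not described here.

```python
import os

HEADLESS_VAR = "TERMINATOR_HEADLESS"


def is_headless_environment(env=None):
    """Return True when the env var equals 'true' or '1'.

    Assumption: the match is exact and case-sensitive. The real check
    may normalize the value differently.
    """
    env = os.environ if env is None else env
    return env.get(HEADLESS_VAR) in ("true", "1")


def pick_session_id(env=None):
    """Model of the branch: 0 for the virtual session, 1 for the desktop."""
    return 0 if is_headless_environment(env) else 1


print(pick_session_id({"TERMINATOR_HEADLESS": "1"}))  # 0
print(pick_session_id({}))                            # 1
```

Everything downstream of this branch receives a session that already exists. That is the sense in which one variable is enough to move between cell one and cell three.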

When should you actually pick a VM sandbox?

Three cases worth the cost. First, when the agent is going to run on a server you control and there is no user desktop to drive (overnight batch automation, CI tests against installed-software flows, anything that would otherwise need a 'dedicated Windows laptop'). Second, when isolation is a security requirement (testing a flow against installers you do not trust, untrusted scripts, third-party MCP servers whose blast radius you want bounded to a single VM). Third, when reproducibility matters more than fidelity (you want the same Windows build, same DPI, same locale, same default-app set every run). Outside those cases, running on the host is cheaper, faster, and uses the user's already-logged-in browser sessions.

When should you keep the agent on the host?

When the goal is to act for the user inside their own apps. The user has Slack open with their team, Notion with their workspace, Cursor with their project, Chrome with twelve tabs and forty cookies. Spinning up a VM means losing all of that and re-authing inside the guest. The host run preserves the user's logged-in state, their installed extensions, their existing windows, and their actual screen. Terminator was built around this case, which is why the README leads with 'Uses your browser session, no need to relogin' and 'Doesn't take over your cursor or keyboard, runs in the background.' The accessibility-tree primitive is what makes 'in the background' possible; the host stack is what makes 'your browser session' possible.

Why do pixel-only stacks usually ship inside a VM?

Two reasons. First, pixel agents capture full-screen screenshots and inject mouse and keyboard events at OS level. Running that on the user's actual machine is hostile: every click steals focus, every screenshot includes the user's private windows, and the user cannot use the machine while the agent runs. A VM gives the agent its own pixel buffer. Second, pixel grounding is portable across operating systems in a way that AX is not, so vendors ship one Linux Docker image and call it cross-platform. The VM is not solving a computer-use problem there. It is solving an isolation and packaging problem. Pick a sandbox for the right reason.

Can an accessibility-tree agent in a VM still see the host's apps?

No. By design. The Windows UIA APIs walk the session that the calling process lives in. A Terminator instance running inside a Vagrant guest sees the guest's Notepad, the guest's Chrome, the guest's Office. It does not see the host's apps. If you need the agent to drive user-installed software, run it on the host; if you need a clean slate every run, run it in the guest and install software during provisioning. The Vagrantfile in the repo does exactly that: Chrome, Rust, Python, NodeJS, and Git all get installed during 'vagrant up' so the guest is a fresh environment ready for AX-driven automation.

What about cloud sandboxes like Scrapybara, Browserbase, or E2B?

Those are hosted sandboxes (a VM somewhere with a remote-control protocol). The same orthogonality holds: you can run Terminator inside one of them by installing the npm package and setting TERMINATOR_HEADLESS=1, or you can use their built-in pixel-based tool. The choice between AX grounding and pixel grounding is independent of whether the sandbox is yours or hosted. Today these vendors ship pixel tools by default because their reference clients are pixel-only, not because their VMs cannot host an AX agent.

Where is the headless detection function in the repo?

crates/terminator/src/platforms/windows/virtual_display.rs, function is_headless_environment, around line 165. The struct VirtualDisplayConfig (line 7) carries the resolution and refresh rate. VirtualDisplayManager::initialize (line 44) reads the env var, picks the session ID, and calls create_virtual_session. The default HeadlessConfig at line 187 wires use_virtual_display directly to the env-var check, which is why one variable is enough to switch the whole stack. None of those functions live inside the UIA adapter; they sit one layer above it.

Does this all work the same on macOS?

The orthogonality argument does. The implementation is Windows-first today. The published npx terminator-mcp-agent binary targets Windows, and the Vagrantfile provisions Windows 10. macOS support is in the Rust core (the AX walker lives at crates/terminator/src/platforms/tree_search.rs) and on app.mediar.ai for users, but the headless-VM workflow in this article runs on the Windows side. On macOS the sandbox question is usually less urgent because the user is in front of a Mac running the agent on their host; on Windows the headless server case is common, which is why the VM tooling shipped there first.
