Accessibility-tree desktop agents: the browser is already a UIA element, and that is what closes the gap

The most common framing of accessibility-tree agents treats the tree as a substitute for the DOM, useful when the workflow leaves the browser. That undersells it. The tree is a superset. Every running browser is itself a node in the OS accessibility tree, and the page inside the tab is exposed as accessibility children of a role:Document element under the browser process. One selector grammar reaches both worlds. The browser-to-native gap is a tooling gap, not a semantic one.

Tags: UIAutomation · AXUIElement · role:Document · MCP · devicePixelRatio

Matthew Diakonov · 8 min read

Direct answer (verified 2026-05-06)

Desktop agents use the accessibility tree by walking the OS surface a screen reader walks: UIAutomation on Windows, AXUIElement on macOS, AT-SPI2 on Linux. They address elements by role + name selectors (role:Button name:Save) instead of pixel coordinates, and they call platform actions (UIInvokePattern on Windows, AXPress on macOS) instead of synthesizing mouse input. Because every running browser is itself a UIA / AX subtree, the same grammar also reaches the page inside an active tab.
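
For concreteness, here is what that tool surface looks like from a client. A minimal sketch in TypeScript using the official MCP SDK; the tool names (get_window_tree, click_element) and the selector string come from this article, while the argument and result shapes, and the assumption that the server speaks stdio, are mine rather than the published schema.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the agent the same way the FAQ below installs it.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "terminator-mcp-agent@latest"],
});
const client = new Client({ name: "tree-demo", version: "0.1.0" });
await client.connect(transport);

// Read: the cached UIA / AX subtree for the focused window, as JSON.
const tree = await client.callTool({ name: "get_window_tree", arguments: {} });
console.log(JSON.stringify(tree, null, 2));

// Match + act: one selector, resolved and invoked on the server side.
// The same call works whether "Save" is an OS dialog button or a page button.
await client.callTool({
  name: "click_element",
  arguments: { selector: "role:Button name:Save" },
});
```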

Reference implementation is open source. Browser process list at applications.rs:53, viewport reconciliation at server.rs:838, cached subtree fetch at tree_builder.rs:388.

The thesis: the OS already merged the two worlds

A blind user opens a Chrome tab and a screen reader announces the page heading, the buttons, the form fields. That same user triggers a download and the screen reader announces the OS Save dialog without missing a beat. There is no "browser adapter" and "OS adapter" on the assistive-tech side. There is one tree. Microsoft has spent two decades making sure browsers expose their content into UIA so that screen readers work; the side effect is that any tool that reads UIA already has structural reach into every page rendered by every shipping browser. The same is true of AXUIElement on macOS.

What is "new" is treating that tree as the default input for an agent rather than for a screen reader. The screen-reader use case reads; the agent use case reads and writes. The grammar is the same: address an element by role and name, then invoke an action that the platform exposes. The implementation is the same: walk a tree of UIA / AX elements, match a selector, dispatch through UIInvokePattern.invoke (Windows) or AXUIElementPerformAction (macOS).

What an agent actually does, end to end

Three steps, repeated for every action. The shape is the same on a Notepad Save dialog as it is on a button inside a Gmail composer.

How a tree-driven action resolves

  1. Read

    get_window_tree returns the cached UIA / AX subtree for the focused process as JSON: role, name, bounds, focus state for every element

  2. Match

    Selector grammar resolves role:Button && name:Save against the tree; the same grammar works in a native dialog or inside a browser tab (a minimal matcher for this step is sketched after the list)

  3. Act

    click_element invokes the platform action (UIInvokePattern on Windows, AXPress on macOS); no synthetic mouse input, no coordinate guessing
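
To make the Match step concrete, here is a minimal matcher over the JSON shape the Read step describes. The AxNode interface is an assumption inferred from the fields named above (role, name, bounds, focus state); Terminator's real grammar supports more operators than this sketch handles.

```typescript
// Assumed node shape for the JSON tree returned by get_window_tree.
interface AxNode {
  role: string;
  name?: string;
  bounds?: { x: number; y: number; width: number; height: number };
  focused?: boolean;
  children?: AxNode[];
}

// Depth-first search for the first node satisfying every clause of a
// selector like "role:Button && name:Save".
function resolve(node: AxNode, selector: string): AxNode | undefined {
  const clauses = selector.split("&&").map((c) => c.trim());
  const matches = clauses.every((clause) => {
    const [key, value] = clause.split(":", 2);
    if (key === "role") return node.role === value;
    if (key === "name") return node.name === value;
    return false; // unknown clause: fail closed
  });
  if (matches) return node;
  for (const child of node.children ?? []) {
    const hit = resolve(child, selector);
    if (hit) return hit;
  }
  return undefined;
}
```

The point of the sketch is that nothing in it knows about browsers: the same walk resolves the same selector against a Save dialog subtree or a role:Document subtree.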

The selector grammar does not change at the boundary

Watch a single agent task cross from inside a browser tab into the OS Save dialog. The tool calls are identical. Only the element that resolves on the other side is different.

One selector grammar, page side and OS side

The exchange runs between the agent, the MCP server, the OS accessibility tree, and the browser extension. Traced twice:

  • Page side: click_element { selector: 'role:Button name:Subscribe' } → the server finds role:Document for the active tab, then its descendants → the tree returns the UIA Subscribe element with bounds in screen px → the click lands at (x, y) in screen space
  • OS side: click_element { selector: 'role:Button name:Save' } → the same selector grammar now resolves against the file dialog → the tree returns the UIA Save button on the OS Save dialog → the click lands; the agent never knew it crossed a boundary

The agent never branches on "am I in the browser or out of it." That decision is below the API. Internally, Terminator detects browser processes by name (chrome, msedge, firefox, brave, arc, vivaldi, opera, iexplore; applications.rs:53) and routes those PIDs through a slightly different lookup path for performance, but the selector grammar exposed to the agent stays the same.

  • chrome.exe → role:Document
  • msedge.exe → role:Document
  • firefox.exe → role:Document
  • brave.exe → role:Document
  • arc.exe → role:Document
  • vivaldi.exe → role:Document
  • opera.exe → role:Document
  • iexplore.exe → role:Document

Every browser process Terminator recognizes as a tab host. The page inside lives under role:Document in the OS tree.
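
A sketch of that routing decision, assuming the process names from the list above; the function and set names are illustrative, not Terminator's actual internals.

```typescript
// Process names mirroring the list at applications.rs:53.
const BROWSER_PROCESSES = new Set([
  "chrome", "msedge", "firefox", "brave",
  "arc", "vivaldi", "opera", "iexplore",
]);

// Browsers get routed through the role:Document lookup path; everything
// else is walked as a plain native window. The selector grammar the agent
// sees is identical either way.
function isTabHost(processName: string): boolean {
  return BROWSER_PROCESSES.has(
    processName.replace(/\.exe$/i, "").toLowerCase(),
  );
}
```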

What this looks like when the workflow leaves the page

This is the case that breaks browser-only agents and works cleanly for tree-driven ones. The user clicks something, a download starts, the OS Save dialog opens, the agent picks a location, the file lands on disk, the next app opens it. None of those steps require a different tool surface.

What happens when the workflow leaves the page

The agent reaches the Subscribe button via document.querySelector. It clicks. The site triggers a download. Chrome opens the OS Save dialog. document.querySelector returns nothing for it. The agent's tool surface goes silent. The next step in the user's workflow lives in a window the agent cannot see. The user has to step in, click Save manually, and hand control back. Every flow that touches a download, a native authenticator, an Open With handler, or an OS notification breaks at this seam.

  • tool surface ends at the DOM boundary
  • downloads, dialogs, authenticators all invisible
  • user has to step in for any non-page interaction
  • can't reuse the agent for native-only workflows

The reconciliation step nobody mentions: viewport offset times devicePixelRatio

This is the anchor fact most articles on this topic skip past. DOM coordinates are CSS pixels relative to the viewport. UIA coordinates are physical pixels relative to the screen. On a Retina or 4K display these are not even close to each other. Without an explicit reconciliation, a click computed from a DOM rect lands hundreds of pixels off where it should.

Terminator's capture_browser_dom_elements at server.rs:838 does it in two steps. First it locates the UIA element with role:Document for the active tab and reads its bounds; the (x, y) of that element is the viewport offset in screen space. Second, for every DOM element it serializes, it multiplies the getBoundingClientRect values by window.devicePixelRatio (lines 911 to 914 of the same file). After both scaling steps, DOM-derived coordinates and UIA-derived coordinates are in the same frame. A click works the same regardless of which side produced the rect.

The Document-element lookup matters more than it looks. You cannot derive the viewport offset from JavaScript window properties because window.screenX and window.screenY silently drift on multi-monitor setups with non-uniform DPI scaling. The OS tree gives you the truth: the UIA Document element's screen bounds. Anchor there, scale by DPR, forget the rest.
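
Put together, the reconciliation is a few lines of arithmetic. A minimal sketch assuming the two inputs the article names, the UIA role:Document bounds and window.devicePixelRatio; the types and function name are illustrative.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

function domRectToScreen(
  cssRect: Rect,            // getBoundingClientRect(): CSS px, viewport-relative
  documentBounds: Rect,     // UIA role:Document bounds: physical px, screen-relative
  devicePixelRatio: number, // window.devicePixelRatio, e.g. 2.0 on Retina
): Rect {
  return {
    x: documentBounds.x + cssRect.x * devicePixelRatio,
    y: documentBounds.y + cssRect.y * devicePixelRatio,
    width: cssRect.width * devicePixelRatio,
    height: cssRect.height * devicePixelRatio,
  };
}

// Example: a DOM rect at (320, 200) with DPR 2.0 and a viewport anchored
// at (100, 50) lands at (740, 450): the same frame UIA bounds live in.
```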

One selector grammar. The accessibility tree is not a fallback for the DOM. It is the surface that contains it.

Source: Terminator core, applications.rs and server.rs

Where the tree wins, and where it does not

The honest list. Worth memorizing before you build an agent on top of this stack, because the failures are predictable and you want to know which fallback tool to plug in for them.

Tree fit, surface by surface

  • Native Win32, WinUI, UWP apps: tree is the right default; cached subtree fetch returns in tens of ms
  • Browser tabs (Chrome, Edge, Firefox, Brave, Arc, Vivaldi, Opera): tree reaches the page through role:Document under the browser process
  • Electron and Office apps: tree covers most controls; an Office canvas needs the in-page enrichment for cells and shapes
  • OS dialogs (Save, Open, Print, credentials, OAuth consent): tree wins outright; this is exactly what assistive tech is built for
  • macOS AppKit, SwiftUI: AXUIElement returns the same role / name shape on the same selector grammar
  • Fullscreen DirectX, OpenGL, Metal surfaces: tree is empty by design; vision is the only path
  • Canvas drawing tools (Figma surface, Excalidraw, Miro board): one opaque canvas in the tree, vision fallback required
  • Sandboxed RDP and VM viewers: AX tree does not cross the host boundary; vision fallback required

The pattern is simple. Anything that the OS itself describes structurally to a screen reader, the tree covers. Anything that paints pixels directly past the OS compositor (a game frame, a VM viewer, a custom-drawn legacy control) requires vision. A production agent stack uses the tree as the default and falls through to OCR, Omniparser, Gemini, or coordinates as a second-class path when the tree returns nothing useful. Reaching for vision first is the wasteful loop, not the other way around.
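
That ordering is easy to encode. A sketch of the dispatch, with hypothetical tryTree and takeScreenshot callbacks standing in for the real grounding modes:

```typescript
type Grounding =
  | { kind: "ax"; selector: string }
  | { kind: "vision"; screenshot: Uint8Array };

async function ground(
  selector: string,
  // Resolve the selector against the AX tree; undefined means the tree
  // was silent (game surface, VM viewer, opaque canvas).
  tryTree: (sel: string) => Promise<{ role: string; name?: string } | undefined>,
  takeScreenshot: () => Promise<Uint8Array>,
): Promise<Grounding> {
  const hit = await tryTree(selector);
  if (hit) return { kind: "ax", selector }; // fast, cheap, structural
  // Escalate to the second-class path only when the tree has nothing.
  return { kind: "vision", screenshot: await takeScreenshot() };
}
```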

Why the cached subtree fetch matters more than people think

The default UIA traversal walks one node at a time, with one cross-process IPC call per property per node. On a populated window that adds up to thousands of round trips, and the tree takes seconds to return. That latency is most of why people stop using UIA and reach for screenshots; the tree is fine, the naive traversal is the problem.

The fix lives at tree_builder.rs:388. Build a UIA CacheRequest with the seven properties an agent actually needs (ControlType, Name, BoundingRectangle, IsEnabled, IsKeyboardFocusable, HasKeyboardFocus, AutomationId), set tree scope to TreeScope::Subtree, and fetch the whole subtree in one IPC call. After that, building the tree structure is a pure-Rust walk over cached properties, no further cross-process calls. Same tree, two orders of magnitude faster on windows of typical complexity. That is the difference between "agent can do this" and "agent in theory could but no one would actually use it."
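
The economics are easier to see as a model than as prose. Everything below is illustrative: ipc stands in for UIA's per-property COM calls on one side and the CacheRequest batch fetch on the other, and the latency figure is a round number, not a benchmark.

```typescript
// The seven properties the CacheRequest at tree_builder.rs:388 pre-fetches.
const PROPS = [
  "ControlType", "Name", "BoundingRectangle", "IsEnabled",
  "IsKeyboardFocusable", "HasKeyboardFocus", "AutomationId",
] as const;

type Ipc = (call: string) => Promise<unknown>;

// Naive traversal: one cross-process round trip per property per node.
// 2,000 nodes x 7 properties = 14,000 IPC calls; at ~0.2 ms per round
// trip that is roughly 3 seconds before the tree even returns.
async function naiveWalk(nodeIds: string[], ipc: Ipc): Promise<void> {
  for (const id of nodeIds) {
    for (const prop of PROPS) {
      await ipc(`get ${prop} of ${id}`);
    }
  }
}

// Cached traversal: declare the property set up front, pull the whole
// subtree in a single IPC call, then walk the result locally with zero
// further cross-process calls.
async function cachedWalk(rootId: string, ipc: Ipc): Promise<unknown> {
  return ipc(`fetch subtree of ${rootId} with [${PROPS.join(", ")}]`);
}
```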

Building an agent that needs to cross the browser-to-native boundary?

If you are wiring a tree-driven agent and want a second pair of eyes on the selector grammar, the viewport reconciliation, or the cached subtree path, send a calendar invite.

Frequently asked questions

What is the accessibility tree, and why do desktop agents read it instead of pixels?

Every modern OS exposes a structured tree of every visible UI element so that screen readers and assistive tech can describe a window without reading pixels. On Windows that surface is UI Automation (UIA); on macOS it is AXUIElement; on Linux it is AT-SPI2. Each node carries a role (Button, Edit, MenuItem, Document), a name (Save, Subject, Cancel), enabled and focused state, and a stable bounding rectangle. A desktop agent that reads this tree gets the same description the OS gives a screen reader, in milliseconds, with no inference required. A pixel-driven agent has to reconstruct that same information from a screenshot every step, with a vision model, at thousands of visual tokens per window. The structured path is faster, cheaper, and survives DPI changes and theme changes that break pixel matching.

How does this close the gap between browser-only agents and native-app agents?

Because the browser itself is a node in the OS accessibility tree. On Windows a Chrome window appears as a UIA element with role:Document underneath the chrome.exe process; on macOS the equivalent AX role exists on the WebKit and Blink content area. The web page inside that document is already exposed as accessibility children that mirror the DOM's accessible nodes. So an agent that speaks role:Button name:Save against the OS tree can walk a native dialog, and an agent that speaks role:Button name:Subscribe against the OS tree can walk the page inside the active tab. There is no second tool, no second selector grammar, no boundary the agent has to be aware of. The tree spans both worlds because the OS already has to span both worlds for assistive tech.

If the OS tree already covers the page, why does Terminator also ship a browser extension?

Two reasons that come up in real workflows. First, the OS accessibility view of a page is structurally accurate but coarser than the DOM. It does not include CSS attributes, computed styles, the full hierarchy of generic divs, or the pseudo-state of hover and focus rings. Second, in-page JavaScript execution (filling a controlled React input, dispatching a synthetic event, reading shadow DOM) needs a real JS context. The Manifest V3 extension at crates/terminator/browser-extension/manifest.json (version 0.24.32) holds a WebSocket on 127.0.0.1:17373 and gives the agent a way to call execute_browser_script when the structural tree is not enough. The extension is an enrichment over the AX path, not a replacement for it. The agent decides which one to use per call, and the click target lands in the same screen-space coordinate either way.
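
For a sense of what that enrichment call carries, here is a sketch. execute_browser_script is the tool the article names; the { script } argument shape is my assumption, and the script body uses only standard DOM APIs (getBoundingClientRect, devicePixelRatio).

```typescript
// "client" is an MCP client wired up as in the first sketch on this page.
const script = `
  Array.from(document.querySelectorAll("button")).map((el) => {
    const r = el.getBoundingClientRect();
    return {
      name: (el.textContent || "").trim(),
      // Scale CSS px to physical px so these rects line up with UIA bounds.
      rect: {
        x: r.x * window.devicePixelRatio,
        y: r.y * window.devicePixelRatio,
        width: r.width * window.devicePixelRatio,
        height: r.height * window.devicePixelRatio,
      },
    };
  })
`;

const buttons = await client.callTool({
  name: "execute_browser_script",
  arguments: { script },
});
```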

What is the actual coordinate problem when both adapters are live?

DOM coordinates and OS coordinates are in different spaces. getBoundingClientRect returns CSS pixels relative to the viewport. UIAutomation returns physical pixels relative to the screen. On a Retina or 4K display window.devicePixelRatio is 2.0 or higher, so a CSS pixel rect at (320, 200) is at (640, 400) in the OS frame, plus the screen offset of the browser viewport. Terminator's capture_browser_dom_elements at crates/terminator-mcp-agent/src/server.rs:838 reconciles them in two steps. It first locates the UIA element with role:Document for the active tab and reads its bounds(); the (x, y) of that element is the viewport offset in screen space. Then for every DOM element it serializes, it multiplies the CSS pixel rect by window.devicePixelRatio (server.rs:911-914). After those two scaling steps a click computed from a DOM rect lands in the same place as a click computed from a UIA element. Without that reconciliation, the agent drifts on every laptop with non-100% scaling.

Does this work on macOS and Linux, not just Windows?

Yes for the tree itself. AXUIElement on macOS exposes the same role / name / bounds shape, with values like AXButton, AXTextField, AXWindow that map to the same selector grammar Terminator uses on Windows. AT-SPI2 on Linux is consistent for GTK and Qt apps; coverage of Electron and Chromium on Linux varies by distribution. The Manifest V3 browser extension that fills in the in-page enrichment is platform-agnostic because Chrome is the same on all three, but the macOS port of the MCP server has a smaller surface than the Windows build at this point. If you are building a native-only agent on macOS or Linux, the AX path is solid; the cross-OS browser extension path on macOS still has rough edges worth pinging us about before you commit.

Where does the accessibility tree fail, and what should an agent do about it?

Three concrete failure surfaces. First, fullscreen DirectX, OpenGL, or Metal surfaces (games, some IDE editors, the Figma drawing canvas) have an empty or single-node AX subtree because the app paints pixels directly. Second, sandboxed RDP and VM viewers do not bridge the AX tree across the host boundary; you see one opaque element where the remote screen is. Third, custom-drawn controls in legacy line-of-business apps that never implemented a UIA provider show up as Pane or generic Custom elements with no useful name. The right architecture is to keep the tree as the default path because it is fast and tiny-model-friendly, and stack a vision-grounded fallback for those three cases. Terminator's click_element MCP tool exposes seven grounding modes for exactly this reason; the AX selector path is mode one, the OCR / Omniparser / Gemini / DOM / coordinate paths are the fallbacks.

How fast is reading the tree compared to taking a screenshot?

On a moderately complex window (a Notepad++ session, an Outlook compose pane, a populated browser tab) the cached UIA subtree fetch in tree_builder.rs:388 with TreeScope::Subtree comes back in tens of milliseconds because the cache pre-fetches every property in one IPC call (ControlType, Name, BoundingRectangle, IsEnabled, IsKeyboardFocusable, HasKeyboardFocus, AutomationId at lines 404-412). Walking the same window node-by-node with separate IPC calls is one to two orders of magnitude slower; this is why the cached path exists. A screenshot of the same window costs roughly the same wall time on the capture side, but the perception cost on the model side is many thousands of visual tokens to read versus a few thousand text tokens for the tree. The tree wins on every dimension except the three failure surfaces above.

Will agents stop needing screenshots entirely if the tree is this good?

No. Vision still wins for fullscreen rendered surfaces, for sanity-checking that the agent is on the right screen at all, and for any task where the visual layout itself is the data (a slide layout, a chart, a design canvas). The right framing is that the accessibility tree is the default tool the agent reaches for, screenshots are the escalation when the tree is silent, and a production agent stack stitches both. The thing this page is arguing against is the symmetric mistake: building a screenshot-only agent for the desktop. That stack burns visual tokens to read text the OS already gave you for free, and it inherits every brittleness of pixel matching across DPI, theme, language, and resolution.

I am building a Claude Code or Cursor MCP agent. What is the smallest path to give it OS-level reach?

Install the MCP agent: claude mcp add terminator npx -y terminator-mcp-agent@latest. That registers the tool surface (get_window_tree, click_element, type_into_element, press_key, capture_screenshot, execute_browser_script) with your assistant. The defaults connect to the OS UIA / AX tree out of the box; the browser extension is opt-in. From inside Claude Code or Cursor, ask the agent to run get_window_tree on the focused window and inspect the JSON it returns; that is the same input shape any model is going to see. Once you can read the tree, the click and type tools accept role+name selectors directly. Repo at github.com/mediar-ai/terminator, source for everything in this article is in crates/terminator-mcp-agent/src/server.rs.
