Computer use on native apps: what the accessibility tree actually looks like to the agent

If you landed here from a thread about computer use agents, here is the part the screenshots-versus-tree debate skips. A computer use agent never touches the raw operating system tree. It reads a flattened, numbered text list. Every element it can click has a number, and the agent acts by emitting the number. No pixel coordinates. This page shows you exactly what that list looks like.

UIAutomation, AXUIElement, get_window_tree, MCP, Rust

Matthew Diakonov, Written with AI

Published May 15, 2026 · 7 min read

Direct answer (verified 2026-05-15)

A computer use agent controls a native app by reading the operating system's accessibility tree, not its pixels. Windows exposes that tree through UIAutomation, macOS through AXUIElement, Linux through AT-SPI2. The agent does not see the raw tree. A framework like Terminator serializes it into a compact numbered list where each clickable element has a 1-based index. The agent picks an index or a selector, and the framework resolves it to a real click, so the model never emits a screen coordinate.

Verifiable in the source: the format_node function in crates/terminator-mcp-agent/src/tree_formatter.rs writes one line per element and maps every index back to a real UI element.

What the agent actually receives

Most write-ups describe the accessibility tree in the abstract: structured metadata, element types, labels, parent-child links. All true, and none of it tells you what arrives in the model's context. Here is the concrete answer. When Terminator's get_window_tree tool runs on a focused native app, the response is this:

# get_window_tree on the focused Calculator process
# returns this, not the raw UIAutomation objects:

#1 [Window] Calculator (bounds: [760,300,322,500], focused)
  #2 [Group] Number pad (bounds: [760,540,322,260])
    #3 [Button] Seven (bounds: [760,540,107,65], focusable)
    #4 [Button] Eight (bounds: [867,540,107,65], focusable)
    #5 [Button] Nine  (bounds: [974,540,107,65], focusable)
  #6 [Text] Display is 0 (bounds: [760,460,322,80])
  #7 [Button] Equals (bounds: [974,735,107,65], focusable)
  - [Text] Calculator mode: Standard

That is the whole Calculator window. Indentation is tree depth. Lines that start with # followed by a number are clickable: they have a bounding rectangle and a stable index. Lines that start with a dash are structure: containers and labels the agent reads for context but should not try to click. A model does not need a screenshot to know there is a Seven button at index 3. It is right there as text.
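Here is a minimal sketch of how a consumer could tell the two kinds of lines apart. The Line enum, parse_line, and clickable_indices are hypothetical names for illustration, not part of Terminator, and the two-space indent per tree level is an assumption read off the listing above.

```rust
/// One line of the compact listing as the agent reads it.
/// Hypothetical type for illustration, not Terminator's own.
#[derive(Debug, PartialEq)]
enum Line {
    /// `#N [Role] rest`: a clickable element and its 1-based index.
    Clickable { depth: usize, index: u32, role: String, rest: String },
    /// `- [Role] rest`: structure the agent reads but does not click.
    Structure { depth: usize, role: String, rest: String },
}

/// Parse one listing line. Assumes two spaces of indentation per level.
/// Returns None for blank lines and `# comment` headers.
fn parse_line(raw: &str) -> Option<Line> {
    let trimmed = raw.trim_start();
    let depth = (raw.len() - trimmed.len()) / 2;

    // Split off the prefix: `#<number> ` or `- `.
    let (index, body) = if let Some(after_hash) = trimmed.strip_prefix('#') {
        let (number, body) = after_hash.split_once(' ')?;
        (Some(number.parse::<u32>().ok()?), body)
    } else if let Some(body) = trimmed.strip_prefix("- ") {
        (None, body)
    } else {
        return None;
    };

    // The role sits in square brackets; the rest is name plus context.
    let (role, rest) = body.strip_prefix('[')?.split_once(']')?;
    let (role, rest) = (role.to_string(), rest.trim().to_string());

    Some(match index {
        Some(index) => Line::Clickable { depth, index, role, rest },
        None => Line::Structure { depth, role, rest },
    })
}

/// Collect the indices an agent is allowed to emit for this listing.
fn clickable_indices(listing: &str) -> Vec<u32> {
    listing
        .lines()
        .filter_map(parse_line)
        .filter_map(|line| match line {
            Line::Clickable { index, .. } => Some(index),
            Line::Structure { .. } => None,
        })
        .collect()
}
```

Header lines such as "# get_window_tree on ..." are skipped because the text after the # is not a number, so only real element lines contribute an index.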

From a 40-property tree to a one-line element

The operating system tree is built for screen readers, so it is exhaustive. The serializer's job is to throw away everything a model cannot act on and keep the four things it can: where the element is, what kind it is, what it is called, and what state it is in. Watch the transformation.

get_window_tree, step by step

AXUIElement / IUIAutomationElement
├─ role, name, value, description
├─ BoundingRectangle, RuntimeId
├─ IsEnabled, IsKeyboardFocusable, ...
└─ 40+ properties per node, deeply nested

1. The OS tree

The native accessibility API holds a deep tree. Every element carries dozens of properties: control type, runtime id, bounding rectangle, pattern support, automation id. Useful for a screen reader, far too verbose for a model's context window.

The index is the whole trick

This is the part that does not show up in the usual explanations, and it is the reason an accessibility-tree agent is steadier than a screenshot agent. A vision-only computer use agent has to output a coordinate. To click Save it recognizes the button in the pixels, estimates its center, and emits something like (1218, 22). That number is wrong the moment the window moves, the DPI scale changes, or the theme shifts.

Terminator removes coordinates from the model's job entirely. When format_node assigns an element a #index, it also stores that index in a map called index_to_bounds, keyed by the number, valued by the element's role, name, bounds, and selector. The model reads the list, picks #7, and calls click_element in index mode. The framework looks 7 up, recovers the real element, and drives the click through the platform's native invoke path. The model produced an integer. It never needed to know a coordinate exists.

/// Format a UI tree as compact YAML with #index [ROLE] name format
///
/// Output format:
///   #1 [ROLE] name (bounds: [x,y,w,h], additional context)
///     #2 [ROLE] name (bounds: [x,y,w,h])
///       - ...
///
/// Elements with bounds get a clickable index first.
/// Elements without bounds use a dash prefix.

That doc comment sits directly above format_tree_as_compact_yaml in the source. The index is not a presentation detail. It is the contract between the model and the framework. The deeper version of this idea, where the same indexed list also covers OCR and vision sources, is in the seven-mode click_element router guide.
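To make the contract concrete, here is a sketch of the lookup side under stated assumptions. IndexedElement, resolve, and center are hypothetical names; the article only says the map is called index_to_bounds and holds role, name, bounds, and selector per index. The center calculation is shown for contrast with a vision agent's guess; the text above describes the real click going through the platform's invoke path.

```rust
use std::collections::HashMap;

/// What the index map is described as holding for each index.
/// Field names and types are assumptions, not the real struct.
#[derive(Debug, Clone, PartialEq)]
struct IndexedElement {
    role: String,
    name: String,
    /// [x, y, width, height] in screen pixels.
    bounds: [f64; 4],
    selector: String,
}

/// Resolve the integer the model emitted back to a stored element.
fn resolve(
    map: &HashMap<u32, IndexedElement>,
    index: u32,
) -> Result<&IndexedElement, String> {
    map.get(&index)
        .ok_or_else(|| format!("index #{index} is not in the current tree"))
}

/// Center of a bounds rectangle: the coordinate a vision-only agent would
/// have to estimate from pixels, derived here from stored bounds instead.
fn center(bounds: [f64; 4]) -> (f64, f64) {
    let [x, y, w, h] = bounds;
    (x + w / 2.0, y + h / 2.0)
}
```

If the window moves 100 pixels to the right, the stored bounds change on the next get_window_tree call and the center moves with them, but the model still emits 7 for the Equals button. The coordinate is the framework's problem, not the model's.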

Reading one node

Every indexed line follows the same grammar. Once you can read one, you can read the whole window.

#6 [Button] Save (bounds: [1180,8,80,28], focusable, disabled)

  #6          1-based clickable index. This is what the agent emits.
  [Button]    role. Button, Edit, MenuItem, CheckBox, ComboBox, ...
  Save        name. The accessible label the OS already exposes.
  bounds      [x, y, width, height] in screen pixels. Present = clickable.
  focusable   state flags: focusable, focused, disabled, selected, toggled.

The presence of a bounds rectangle is what decides whether an element gets an index at all. An element with bounds is on screen and clickable, so it earns a #index. An element without bounds gets a dash. State flags such as disabled, focused, selected and toggled are appended only when they are true, which keeps each line short. A model can answer "is the checkbox already on?" without a screenshot because the word toggled either is or is not on the line.
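The rules above are small enough to sketch as a serializer. This is an illustration of the described behavior, not Terminator's format_node: the Node struct, the flag order, and the pre-order index assignment are assumptions inferred from the sample lines.

```rust
/// A normalized UI node. Hypothetical shape for illustration.
#[derive(Default)]
struct Node {
    role: String,
    name: String,
    bounds: Option<[i32; 4]>,
    focusable: bool,
    focused: bool,
    disabled: bool,
    selected: bool,
    toggled: bool,
    children: Vec<Node>,
}

/// Write `node` and its descendants into `out`, one line per element.
/// `next_index` is the next free 1-based index; only nodes with bounds
/// consume one.
fn format_node(node: &Node, depth: usize, next_index: &mut u32, out: &mut String) {
    // Bounds come first in the context list, then only the true flags.
    let mut context: Vec<String> = Vec::new();
    if let Some([x, y, w, h]) = node.bounds {
        context.push(format!("bounds: [{x},{y},{w},{h}]"));
    }
    let flags = [
        ("focusable", node.focusable),
        ("focused", node.focused),
        ("disabled", node.disabled),
        ("selected", node.selected),
        ("toggled", node.toggled),
    ];
    context.extend(
        flags
            .iter()
            .filter(|(_, on)| *on)
            .map(|(label, _)| label.to_string()),
    );

    // Bounds decide the prefix: an index if present, a dash if not.
    let prefix = if node.bounds.is_some() {
        let index = *next_index;
        *next_index += 1;
        format!("#{index}")
    } else {
        "-".to_string()
    };

    out.push_str(&"  ".repeat(depth));
    out.push_str(&format!("{prefix} [{}] {}", node.role, node.name));
    if !context.is_empty() {
        out.push_str(&format!(" ({})", context.join(", ")));
    }
    out.push('\n');

    for child in &node.children {
        format_node(child, depth + 1, next_index, out);
    }
}
```

Called with next_index starting at 1 on the root window, this walks the tree depth-first, so indices follow reading order the way they do in the Calculator listing.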

The same window, two ways a model can see it

Raw accessibility element vs the serialized line

The operating system hands you an IUIAutomationElement (or an AXUIElement on macOS). One node carries a control type id, a runtime id, a bounding rectangle struct, a process id, an automation id, a class name, localized control type strings, and a list of supported patterns: Invoke, Toggle, Value, SelectionItem, ExpandCollapse. Multiply that by a few hundred nodes per window. Sending it to a model is impractical and pointless: it would swamp the context window, and most of those fields are plumbing the model cannot use.

40-plus properties per node, COM or Objective-C objects
deeply nested, hundreds of nodes per real window
far too large for a model context window
most fields are plumbing, not decision-relevant

Why this matters for native apps specifically

Browser automation has the DOM. It is structured, queryable, and stable, which is why Playwright works as well as it does. Step outside the browser and that structure disappears. A native app has no DOM. What it has is the accessibility tree, the one structured description of its UI the operating system already maintains for screen readers. A computer use agent that reads the tree gets a DOM-like target for Notepad, Outlook, the Settings app, and a legacy line-of-business tool, all through one interface.

And because the browser itself publishes an accessibility tree, the same agent code reaches into a web page when it needs to. The tree does not stop at the browser window the way the DOM does. Terminator is shaped like Playwright on purpose, but the surface it targets is the whole operating system rather than one tab.

Native surfaces the tree covers

Notepad, Calculator, File Explorer, Outlook, Excel, Word, Settings, Task Manager, TextEdit, Finder, System Settings, Chrome, VS Code, Slack, Spotify

Where the tree is honest about its limits

The accessibility tree is the right primary input for native apps, but it is not complete. Some surfaces render their content as pixels with no accessible structure underneath. An agent built only on the tree will hit those surfaces and stop. A production agent keeps the tree as the default and falls through to a vision source when the tree returns nothing useful.

Tree coverage by surface

Native Win32, WinUI, UWP, AppKit and Office apps: rich tree, indexed elements line up with what you see.
A web page inside a browser: the browser publishes its own accessibility tree, so the same code path works.
Electron apps: the tree exists but can be thin until Chromium accessibility is enabled in the process.
Office document canvas: cells, shapes and paragraphs render inside one element with no AX children.
Games, DirectX or OpenGL surfaces, canvas design tools: the tree is a single opaque node, fall back to vision.
Remote desktop and VM viewer windows: the accessibility bridge does not cross the host boundary.

The honest framing is not "tree or vision". It is tree first because it is fast, structured, and cheap on tokens, then vision second for the surfaces above where the tree is a single opaque node. Terminator wires both behind the same click_element tool so the agent does not have to switch interfaces when it crosses that boundary.
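The tree-first, vision-second ordering can be sketched as a simple fall-through. Everything here is hypothetical: the GroundingSource trait, the StaticSource stand-in, and the min_targets threshold are assumptions for illustration, and Terminator's real routing lives inside click_element.

```rust
/// A grounding source that can propose click targets for the current window.
/// Hypothetical trait for illustration.
trait GroundingSource {
    fn name(&self) -> &str;
    /// Labels of the targets this source can see right now.
    fn targets(&self) -> Vec<String>;
}

/// A fixed list standing in for a real tree walk or vision call.
struct StaticSource {
    name: &'static str,
    targets: Vec<String>,
}

impl GroundingSource for StaticSource {
    fn name(&self) -> &str {
        self.name
    }
    fn targets(&self) -> Vec<String> {
        self.targets.clone()
    }
}

/// Try sources in order and return the first that sees at least
/// `min_targets` elements. With the tree listed first, later sources are
/// only queried when the tree is empty or a single opaque node.
fn first_useful(
    sources: &[&dyn GroundingSource],
    min_targets: usize,
) -> Option<(String, Vec<String>)> {
    sources.iter().find_map(|source| {
        let targets = source.targets();
        (targets.len() >= min_targets).then(|| (source.name().to_string(), targets))
    })
}
```

Because find_map stops at the first match, the slower and more expensive vision source is never called for windows where the tree already answers.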

Frequently asked questions

What does the accessibility tree of a native app look like to a computer use agent?

Not the way the operating system stores it. The OS keeps a tree of accessibility elements with dozens of COM or Objective-C properties each. A computer use agent reads a flattened, compact version. Terminator's format_node function in crates/terminator/src/tree_formatter.rs walks the tree and writes one line per element: an indent for depth, a clickable index, the role in brackets, the accessible name, and a parenthesised context list of bounds and state flags. A whole Calculator window collapses to roughly a dozen short lines of text. That is what reaches the model, not the raw UIAutomation or AXUIElement objects.

Why does the agent emit an index instead of pixel coordinates?

Because pixels are a moving target and an index is not. The same Save button lands at different screen coordinates depending on DPI scale, window position, theme, and resolution. A model that outputs (x, y) has to re-solve that regression on every screen. Terminator's serializer assigns every element that has bounds a stable 1-based index and stores a mapping from each index to (role, name, bounds, selector) in a HashMap called index_to_bounds. The agent reads the list, emits the number, and the click_element tool in Mode 2 (index plus vision_type ui_tree) looks the number up and performs the click. The model never has to know a coordinate exists.

Is the tree the same on Windows and macOS?

The source APIs differ. Windows exposes UIAutomation (IUIAutomationElement), macOS exposes the Accessibility API (AXUIElement), Linux exposes AT-SPI2. Terminator has a separate adapter per platform that walks the native tree. Above those adapters the shape is normalized: every node becomes a role, a name, optional bounds, optional state flags, and a selector. So the compact list a computer use agent receives looks the same whether the native app is Notepad on Windows or TextEdit on macOS. The platform-specific quirks stay inside the adapter.
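A rough sketch of what "the quirks stay inside the adapter" can mean in practice. The Platform enum, NormalizedNode struct, and normalize function are hypothetical, and role naming is used as the one example quirk: macOS role strings carry an AX prefix (AXButton), which this sketch strips. A real adapter would handle far more than this.

```rust
/// The three platform APIs named above.
#[derive(Clone, Copy, Debug)]
enum Platform {
    Windows, // UIAutomation
    MacOs,   // AXUIElement
    Linux,   // AT-SPI2
}

/// The normalized shape described above. Hypothetical struct.
#[derive(Debug, PartialEq)]
struct NormalizedNode {
    role: String,
    name: String,
    bounds: Option<[i32; 4]>,
    flags: Vec<String>,
    selector: String,
}

/// Stand-in for a per-platform adapter: absorb one platform quirk
/// (role naming) and emit the common shape.
fn normalize(
    platform: Platform,
    raw_role: &str,
    name: &str,
    bounds: Option<[i32; 4]>,
    flags: &[&str],
) -> NormalizedNode {
    let role = match platform {
        // macOS role strings carry an "AX" prefix, e.g. "AXButton".
        Platform::MacOs => raw_role.strip_prefix("AX").unwrap_or(raw_role),
        // Passed through unchanged in this sketch; a real adapter would
        // map control type ids or AT-SPI role names here.
        Platform::Windows | Platform::Linux => raw_role,
    };
    NormalizedNode {
        role: role.to_string(),
        name: name.to_string(),
        bounds,
        flags: flags.iter().map(|flag| flag.to_string()).collect(),
        selector: format!("role:{role}|name:{name}"),
    }
}
```

Once both adapters emit the same struct, the serializer and the model see no difference between a Save button in Notepad and one in TextEdit.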

How is the accessibility tree different from the browser DOM?

The DOM only describes one web page inside one browser tab. The accessibility tree describes every window of every running application: native Win32 and AppKit apps, Office, Electron shells, and the browser itself. A browser also publishes an accessibility tree, built from the DOM plus ARIA, so a computer use agent that targets the OS tree can drive a web page and a native settings dialog with the same code path. That is the point of using the tree rather than the DOM: it does not stop at the edge of the browser window.

What happens to an element that has no bounds?

It still appears in the list, but it gets a dash prefix instead of an index. In format_node, an element with a bounds rectangle is written as #{index} [Role] name, and an element without bounds is written as - [Role] name. The dash elements are containers, layout groups, and offscreen nodes the agent should not try to click. They stay in the tree because they carry structure and labels the model uses to reason about what is on screen, but only the indexed elements are valid click targets.

Which tool actually performs the click after the model picks an index?

click_element. It has three modes documented in crates/terminator-mcp-agent/src/server.rs. Mode 1 is Selector: pass a process and a selector string like role:Button|name:Save. Mode 2 is Index: pass the number from a previous get_window_tree response plus vision_type ui_tree. Mode 3 is Coordinates: raw x and y, the escape hatch. For accessibility-tree-driven computer use, Mode 2 is the normal path. The index resolves through index_to_bounds back to the real element and the framework drives the click through the platform's invoke pattern.
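The three modes map naturally onto an enum. This is a sketch under assumptions: ClickRequest, its field names, and parse_selector are hypothetical, and the selector grammar is inferred only from the role:Button|name:Save example, so names that contain a pipe character are not handled.

```rust
/// The three click_element modes named above, as a hypothetical enum.
/// Field names are assumptions; the real tool takes MCP arguments.
#[derive(Debug, PartialEq)]
enum ClickRequest {
    /// Mode 1: a process plus a selector string.
    Selector { process: String, selector: String },
    /// Mode 2: an index from a previous get_window_tree response.
    Index { index: u32, vision_type: String },
    /// Mode 3: raw screen coordinates, the escape hatch.
    Coordinates { x: f64, y: f64 },
}

impl ClickRequest {
    /// The mode number as the article counts them.
    fn mode(&self) -> u8 {
        match self {
            ClickRequest::Selector { .. } => 1,
            ClickRequest::Index { .. } => 2,
            ClickRequest::Coordinates { .. } => 3,
        }
    }
}

/// Split a selector such as `role:Button|name:Save` into key/value pairs.
/// Assumes `|` separates segments and the first `:` in each segment
/// separates key from value. Segments without a `:` are dropped.
fn parse_selector(selector: &str) -> Vec<(String, String)> {
    selector
        .split('|')
        .filter_map(|segment| segment.split_once(':'))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}
```

Splitting on the first colon only means a name like "Calculator mode: Standard" survives intact as the value.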

Do Electron apps show up in the accessibility tree?

Partly, and it depends on the app. Electron wraps a Chromium renderer, and Chromium only publishes its internal accessibility tree when accessibility is enabled in the process. Some Electron apps expose a rich tree, some expose only the outer window frame until you trigger the accessibility flag. When the tree is thin, a computer use agent needs a fallback grounding source. Terminator handles this by letting click_element resolve indexes against OCR, Omniparser, Gemini vision, or the browser DOM in addition to the UIA tree, so a missing subtree does not stop the agent.

Can a small or local model handle the accessibility tree?

Yes, and that is one of the real advantages of the tree over screenshots. A focused window serializes to a few thousand tokens of structured text, well inside the context of a 1B-parameter model, whereas the same window as a screenshot costs many thousands of visual tokens that small models cannot reason over. Terminator ships a four-file example that drives native apps with gemma3:1b over a local Ollama endpoint. The walkthrough is in our guide on running a local model on native apps through the accessibility tree.

More on the accessibility tree as a computer use surface

Related guides

Grounding

Accessibility API for computer use agents: the seven-mode click_element router

When the accessibility tree is silent, the agent falls through to OCR, Omniparser, Gemini vision, or the browser DOM. One click tool, seven grounding paths.

Local AI

Local AI on native apps: a 1B model drives the desktop because the tree is text

The compact tree is small enough for gemma3:1b. A four-file Rust example wires the accessibility tree into a local Ollama endpoint with zero network egress.

Tree vs pixels

Accessibility tree automation vs PyAutoGUI: the two clicks are not the same operation

A tree-based click invokes the element directly and bypasses synthetic HID input. Why that matters for reliability, line by line in the source.
