Desktop automation and the accessibility tree: what one node costs to capture

Almost every explainer about the accessibility tree stops at the same place: it is a hierarchy of UI elements, each with a role and a name, and it beats screenshots. True, and not the interesting part. The interesting part is that the tree is not a structure you read. It is a structure you build, one cross-process call at a time, and a desktop automation framework lives or dies on the cuts it makes while building it. This page walks through what a single node carries, and which of its fields Terminator deliberately refuses to load by default.

Matthew Diakonov, Written with AI

Published May 15, 2026 · 11 min read

Direct answer (verified 2026-05-15)

The accessibility tree is the hierarchy of UI elements the operating system already maintains so assistive technology can describe an app. Every window, panel, button, text field, and list item is a node, and each node carries semantic data: a role, an accessible name, a value, state flags, and a bounding rectangle. Desktop automation frameworks walk this tree to find and act on elements instead of matching pixels in a screenshot. Windows exposes it through UI Automation, macOS through the Accessibility API, Linux through AT-SPI2. Terminator captures it through these native APIs and matches elements with a selector grammar shaped like Playwright's.

The tree is real, and so is its cost

Here is the mental model most people carry away from a tutorial: the accessibility tree exists, somewhere, fully formed, and a library hands it to you the way a browser hands you the DOM. That picture is wrong in a way that matters.

On Windows, the tree lives inside each running application's UI Automation provider. Your automation process does not share memory with it. When you ask for a node's name, that is a COM call that crosses a process boundary, gets serviced by the target app, and returns one string. Ask for its role, its value, its bounding rectangle, whether it is enabled: each one is a separate round trip. A single node with six interesting properties is six cross-process calls.

Now multiply. A bare Notepad window has a couple hundred nodes. A Chrome window with a real web app inside it has many thousands. A framework that captures every property on every node is making tens of thousands of cross-process calls for one tree dump, and the user watches their machine hitch while it happens. The naive implementation is not slightly slow. It is unusable inside an agent loop that re-reads the tree after every action.
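The multiplication is easy to make concrete. The sketch below is only a back-of-envelope cost model: the function names and the per-call latency are assumptions for illustration, not measurements and not part of Terminator.

```rust
/// Rough cost model: one cross-process call per property, per node.
/// Function names and inputs are illustrative assumptions.
pub fn round_trips(nodes: u64, properties_per_node: u64) -> u64 {
    nodes * properties_per_node
}

/// Estimated wall time in milliseconds for one tree dump, given an
/// assumed average latency per cross-process call in microseconds.
pub fn estimated_ms(nodes: u64, properties_per_node: u64, micros_per_call: u64) -> u64 {
    round_trips(nodes, properties_per_node) * micros_per_call / 1_000
}
```

Under an assumed 50 microseconds per call, round_trips(5_000, 17) is 85,000 calls and estimated_ms(5_000, 17, 50) comes to 4,250 ms, while the same 5,000 nodes with 2 properties each cost 10,000 calls and about 500 ms. The real latency depends on the target app, so treat the absolute numbers as placeholders; the ratio is the point.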

So the real engineering question for desktop automation is not "how do I get the accessibility tree." It is "how little of it can I get away with capturing, and how do I capture even that without freezing the host." The rest of this page is Terminator's answer, read straight out of the source.

What one node actually carries

Terminator's representation of a node is the UIElementAttributes struct in crates/terminator/src/element.rs (lines 309 to 343). It defines roughly 17 fields, listed below. They fall into three load tiers: the default Fast mode loads role and name for every node; state flags, structural counts, and the focusable-gated bounds are loaded conditionally; the remaining fields are filled only when you explicitly ask for Complete or Smart mode.

role (String): Button, Edit, CheckBox, ...
name (Option<String>): the accessible label
label (Option<String>): associated label text
text (Option<String>): rendered text content
value (Option<String>): edit / value content
description (Option<String>): help / tooltip text
application_name (Option<String>): cached per tree
properties (HashMap): free-form extra props
is_keyboard_focusable (Option<bool>): gates bounds
is_focused (Option<bool>): has keyboard focus
is_toggled (Option<bool>): checkbox / switch state
bounds (Option<(f64,f64,f64,f64)>): focusable nodes only
enabled (Option<bool>): interactable or greyed out
is_selected (Option<bool>): list item / tab state
child_count (Option<usize>): direct children
index_in_parent (Option<usize>): position among siblings

Load tiers: Fast mode (default), conditional, Complete / Smart only

One node is not just attributes. The UINode wrapper (lib.rs, lines 328 to 338) adds three more things: an id, a children vector, and a selector string. That selector is the full chained path from the root window down to this node, something like role:Window && name:Untitled >> role:Edit. It is built as the tree is walked, so every node in a captured tree is already addressable: copy its selector straight into desktop.locator() and you have a locator for it.
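How a selector can be chained while the tree is walked is easier to see in code. This is a minimal sketch, not Terminator's implementation: RawNode, segment, child_selector, and build are hypothetical names, and the UINode here is a cut-down stand-in that keeps only the fields needed for the illustration.

```rust
/// Hypothetical raw input: what a walk sees before selectors exist.
pub struct RawNode {
    pub role: String,
    pub name: Option<String>,
    pub children: Vec<RawNode>,
}

/// Cut-down stand-in for a captured node (the real struct carries more).
#[derive(Debug, Clone)]
pub struct UINode {
    pub id: Option<String>,
    pub role: String,
    pub name: Option<String>,
    pub selector: String,
    pub children: Vec<UINode>,
}

/// One selector segment: `role:X`, or `role:X && name:Y` when a name exists.
fn segment(role: &str, name: Option<&str>) -> String {
    match name {
        Some(n) if !n.is_empty() => format!("role:{} && name:{}", role, n),
        _ => format!("role:{}", role),
    }
}

/// Chain this node's segment onto its parent's selector with `>>`.
pub fn child_selector(parent: Option<&str>, role: &str, name: Option<&str>) -> String {
    let seg = segment(role, name);
    match parent {
        Some(p) => format!("{} >> {}", p, seg),
        None => seg,
    }
}

/// Build the addressable tree top-down, so each child inherits the full path.
pub fn build(raw: &RawNode, parent_selector: Option<&str>) -> UINode {
    let selector = child_selector(parent_selector, &raw.role, raw.name.as_deref());
    let children: Vec<UINode> = raw
        .children
        .iter()
        .map(|child| build(child, Some(selector.as_str())))
        .collect();
    UINode {
        id: None, // id generation is out of scope for this sketch
        role: raw.role.clone(),
        name: raw.name.clone(),
        selector,
        children,
    }
}
```

Feeding it a Window named Untitled with one unnamed Edit child yields the same chained path quoted above for the child. Building the string on the way down means no second pass is needed to make nodes addressable.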

Why the default loads two fields, not seventeen

The choice of how much to capture is a single enum. PropertyLoadingMode in platforms/mod.rs (lines 57 to 64) has three variants, and the comments in the source describe each one exactly: Fast is "only load essential properties (role + name) - fastest," Complete is "load all properties for complete element data - slower but comprehensive," and Smart is "load specific properties based on element type - balanced approach." The default for tree building is Fast.

That default looks aggressive until you remember the cost model. Selector matching needs role, name, and id. It does not need a tooltip description or a value string for the layout container three levels up that you will never touch. Loading those anyway means one extra cross-process call per field, per node, across the whole tree. Fast mode is not cutting corners. It is declining to pay for data nothing will read.
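As a rough sketch of that dial, the enum below mirrors the three variants and the Fast default described above. The properties_to_load helper and the per-role choices under Smart are assumptions made up for illustration; the source only says Smart picks properties by element type, and the conditional fields Fast can also attach are left out here.

```rust
/// Sketch of the three-way dial. Doc comments paraphrase the source comments.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PropertyLoadingMode {
    /// Only load essential properties (role + name): fastest.
    Fast,
    /// Load all properties: slower but comprehensive.
    Complete,
    /// Load specific properties based on element type: balanced.
    Smart,
}

impl Default for PropertyLoadingMode {
    fn default() -> Self {
        PropertyLoadingMode::Fast
    }
}

/// The field names listed for UIElementAttributes in this article.
pub const ALL_FIELDS: [&str; 16] = [
    "role", "name", "label", "text", "value", "description",
    "application_name", "properties", "is_keyboard_focusable", "is_focused",
    "is_toggled", "bounds", "enabled", "is_selected", "child_count",
    "index_in_parent",
];

/// Hypothetical helper: which properties a mode would fetch for a role.
/// The Smart mapping below is an invented example, not the real table.
pub fn properties_to_load(mode: PropertyLoadingMode, role: &str) -> Vec<&'static str> {
    match mode {
        PropertyLoadingMode::Fast => vec!["role", "name"],
        PropertyLoadingMode::Complete => ALL_FIELDS.to_vec(),
        PropertyLoadingMode::Smart => match role {
            "Edit" => vec!["role", "name", "value"],
            "CheckBox" => vec!["role", "name", "is_toggled"],
            _ => vec!["role", "name"],
        },
    }
}
```

Each entry in the returned list stands for one cross-process call on that node, so the length of the list is the per-node cost of the mode.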

PropertyLoadingMode, the two ends of the dial

Complete mode: every field on every node. The tree is comprehensive: you get description, value, enabled, toggled, selected, the properties map, all of it, on the layout containers as much as on the buttons. It is the right mode when you genuinely need a full audit of an app's UI. It is the wrong mode as a default, because each extra field is another COM round trip multiplied across thousands of nodes.

one cross-process call per property, per node
a deep browser tree can take seconds to dump
most captured fields are never read by a selector
too slow to re-run after every action in an agent loop

The anchor: bounds only for what you can focus

Here is the single most telling line in the whole tree builder. In element.rs, line 334, the bounds field of a node is declared with this comment attached to it:

pub bounds: Option<(f64, f64, f64, f64)>,
// Only populated for keyboard-focusable elements

That comment is a design decision written down. The tree builder's get_configurable_attributes function checks element.is_keyboard_focusable() and only when that is true does it call element.bounds() to attach a rectangle. Asking for a node's bounding rectangle is itself a cross-process call, so this is not a cosmetic skip. It is the framework declining to pay for the pixel coordinates of things you will never click.

The logic underneath it is clean. The elements you act on (buttons, edit fields, checkboxes, list items, links) are precisely the ones the OS marks keyboard-focusable. A static text label, a decorative image, a layout group: not focusable, not clicked, no bounds. Then the tree formatter closes the loop. It assigns a numeric click index only to nodes that have bounds. So the indexed, clickable subset of a captured tree is the keyboard-focusable subset, by construction, with no separate filtering pass. One inline comment, and the whole capture-cost story is consistent.
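The gate itself is small enough to sketch. Everything below is a stand-in: the Element trait, the Attributes struct, fast_attributes, and the FakeElement test double are hypothetical names, with each trait method standing for one cross-process call. Only the ordering (check is_keyboard_focusable first, call bounds only when it is true) comes from the description above.

```rust
use std::cell::Cell;

pub type Bounds = (f64, f64, f64, f64);

/// Hypothetical element interface; each method models one cross-process call.
pub trait Element {
    fn role(&self) -> String;
    fn name(&self) -> Option<String>;
    fn is_keyboard_focusable(&self) -> bool;
    fn bounds(&self) -> Option<Bounds>;
}

#[derive(Debug, PartialEq)]
pub struct Attributes {
    pub role: String,
    pub name: Option<String>,
    pub is_keyboard_focusable: Option<bool>,
    pub bounds: Option<Bounds>,
}

/// Fetch bounds only when the element can take keyboard focus.
pub fn fast_attributes<E: Element>(el: &E) -> Attributes {
    let focusable = el.is_keyboard_focusable();
    let bounds = if focusable { el.bounds() } else { None };
    Attributes {
        role: el.role(),
        name: el.name(),
        is_keyboard_focusable: Some(focusable),
        bounds,
    }
}

/// In-memory fake that counts how often bounds() is requested.
pub struct FakeElement {
    role: &'static str,
    focusable: bool,
    bounds_calls: Cell<u32>,
}

impl FakeElement {
    pub fn new(role: &'static str, focusable: bool) -> Self {
        FakeElement { role, focusable, bounds_calls: Cell::new(0) }
    }
    pub fn bounds_calls(&self) -> u32 {
        self.bounds_calls.get()
    }
}

impl Element for FakeElement {
    fn role(&self) -> String { self.role.to_string() }
    fn name(&self) -> Option<String> { None }
    fn is_keyboard_focusable(&self) -> bool { self.focusable }
    fn bounds(&self) -> Option<Bounds> {
        self.bounds_calls.set(self.bounds_calls.get() + 1);
        Some((0.0, 0.0, 100.0, 20.0))
    }
}
```

Running fast_attributes on a focusable fake and a non-focusable one shows the saving directly: the call counter on the non-focusable element stays at zero.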

“bounds: Option<(f64, f64, f64, f64)>, // Only populated for keyboard-focusable elements”

Terminator source, crates/terminator/src/element.rs, line 334, MIT licensed

Walking the tree without freezing the machine

Capturing less per node is half the job. The other half is the walk itself. build_ui_node_tree_configurable in tree_builder.rs does not just recurse blindly. It uses an explicit work queue, so a pathologically deep app cannot blow the call stack: recursion is capped at depth 100, and anything deeper gets pushed back onto the queue and processed iteratively.

One pass of the tree builder

1. Enumerate children: UIA returns the child elements of the current node.
2. Load role + name: Fast mode reads only the two essential properties.
3. Bounds if focusable: is_keyboard_focusable() gate before element.bounds().
4. Batch 50, yield 1 ms: thread::sleep keeps the host UI responsive.
5. Recurse, depth-capped: the work queue takes over past recursion depth 100.

The yield step is the one developers underestimate. The builder tracks how many elements it has processed, and every 50 of them (the yield_every_n_elements default), plus between large child batches, it calls thread::sleep for 1 millisecond. That tiny pause hands CPU back to the rest of the system so the host UI does not visibly hitch while a big tree is being captured. Children are processed in batches of 50 (batch_size), and the tree depth itself defaults to 50 levels, raised to 500 for browsers because web apps nest far deeper than native windows.
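A simplified version of that walk fits in a few dozen lines. This sketch is fully iterative (the real builder is described as recursing up to a cap and then falling back to its queue), it walks an in-memory Raw tree instead of live UI Automation elements, and the names TreeBuildConfig, yield_for, Raw, WalkStats, and walk are assumptions. The defaults of 50, 50, 50, and 1 ms are the ones quoted above.

```rust
use std::collections::VecDeque;
use std::thread;
use std::time::Duration;

pub struct TreeBuildConfig {
    pub max_depth: usize,
    pub batch_size: usize,
    pub yield_every_n_elements: usize,
    pub yield_for: Duration,
}

impl Default for TreeBuildConfig {
    fn default() -> Self {
        TreeBuildConfig {
            max_depth: 50,
            batch_size: 50,
            yield_every_n_elements: 50,
            yield_for: Duration::from_millis(1),
        }
    }
}

/// Hypothetical in-memory stand-in for a live element and its children.
pub struct Raw {
    pub role: String,
    pub children: Vec<Raw>,
}

impl Raw {
    pub fn leaf(role: &str) -> Raw {
        Raw { role: role.to_string(), children: Vec::new() }
    }
    /// A single-child chain with its leaf `levels` below the root.
    pub fn chain(levels: usize) -> Raw {
        let mut node = Raw::leaf("Leaf");
        for _ in 0..levels {
            node = Raw { role: "Group".to_string(), children: vec![node] };
        }
        node
    }
}

pub struct WalkStats {
    pub visited: usize,
    pub yields: usize,
    pub deepest: usize,
}

/// Walk with an explicit queue, so depth never consumes call stack.
pub fn walk(root: &Raw, cfg: &TreeBuildConfig) -> WalkStats {
    let mut stats = WalkStats { visited: 0, yields: 0, deepest: 0 };
    let mut queue: VecDeque<(&Raw, usize)> = VecDeque::new();
    queue.push_back((root, 0));
    while let Some((node, depth)) = queue.pop_front() {
        stats.visited += 1;
        stats.deepest = stats.deepest.max(depth);
        // Hand CPU back every N processed elements.
        if cfg.yield_every_n_elements > 0 && stats.visited % cfg.yield_every_n_elements == 0 {
            thread::sleep(cfg.yield_for);
            stats.yields += 1;
        }
        if depth >= cfg.max_depth {
            continue; // children below the depth limit are never captured
        }
        for (i, batch) in node.children.chunks(cfg.batch_size.max(1)).enumerate() {
            if i > 0 {
                // Also yield between batches of a large child list.
                thread::sleep(cfg.yield_for);
                stats.yields += 1;
            }
            for child in batch {
                queue.push_back((child, depth + 1));
            }
        }
    }
    stats
}
```

Calling walk on Raw::chain(5) with max_depth set to 2 visits three nodes and stops, which is the mechanical reason a too-shallow capture reports an element as missing when it is merely deeper than the limit.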

None of this is glamorous. It is the machinery that turns "read the accessibility tree" from a slogan into a function that reliably returns in a few hundred milliseconds without making the user's desktop stutter. When a guide tells you the tree is fast, this is the code that earned the word.

What a captured tree actually looks like

Once the builder has run, the formatter renders the tree as a compact, indented, YAML-like block. Nodes that carry bounds get a #N click index; nodes without bounds get a plain dash. Here is a trimmed dump of a Notepad window captured in Fast mode.

[Terminal capture "capture the tree": trimmed Fast-mode tree dump of a Notepad window]

Read the last two lines. 214 nodes captured, only 2 of them indexed, because only 2 are keyboard-focusable and therefore only 2 carry bounds. The MenuBar and the StatusBar text are in the tree (you can still read their names and roles) but they are not clickable targets, so the formatter does not waste an index on them. And the selector for node #1 is already written out, ready to paste into a locator. That is the Fast-mode philosophy made visible: capture what addresses and acts on elements, skip the rest.
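The indexing rule can be sketched as a small formatter. The exact line layout Terminator emits is not reproduced here; the "[Role] name" shape, the two-space indent, and the names Node, format_tree, and render are assumptions. What the sketch does take from the text is the rule itself: a node with bounds gets the next #N index and an entry in an index-to-bounds map, a node without bounds gets a plain dash.

```rust
use std::collections::BTreeMap;

pub type Bounds = (f64, f64, f64, f64);

pub struct Node {
    pub role: String,
    pub name: Option<String>,
    pub bounds: Option<Bounds>,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(role: &str, name: Option<&str>, bounds: Option<Bounds>, children: Vec<Node>) -> Node {
        Node {
            role: role.to_string(),
            name: name.map(|n| n.to_string()),
            bounds,
            children,
        }
    }
}

/// Render an indented listing and collect the click-index map in one pass.
pub fn format_tree(root: &Node) -> (String, BTreeMap<u32, Bounds>) {
    let mut out = String::new();
    let mut index_to_bounds = BTreeMap::new();
    let mut next_index = 1u32;
    render(root, 0, &mut next_index, &mut out, &mut index_to_bounds);
    (out, index_to_bounds)
}

fn render(
    node: &Node,
    depth: usize,
    next_index: &mut u32,
    out: &mut String,
    index_to_bounds: &mut BTreeMap<u32, Bounds>,
) {
    // Only nodes that carry bounds consume an index.
    let marker = match node.bounds {
        Some(b) => {
            let i = *next_index;
            *next_index += 1;
            index_to_bounds.insert(i, b);
            format!("#{}", i)
        }
        None => "-".to_string(),
    };
    let line = format!("{} [{}] {}", marker, node.role, node.name.as_deref().unwrap_or(""));
    out.push_str(&"  ".repeat(depth));
    out.push_str(line.trim_end());
    out.push('\n');
    for child in &node.children {
        render(child, depth + 1, next_index, out, index_to_bounds);
    }
}
```

Because the index is assigned during rendering, no separate filtering pass is needed: whatever subset of nodes carries bounds is the clickable subset.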

The numbers, straight from the source

17 fields on a UIElementAttributes node

2 of them loaded by the default Fast mode

50 children per batch, and tree depth default

1 ms CPU yield every 50 elements walked

Seventeen fields defined, two loaded by default. That ratio is the whole argument. Everything else (the batching, the depth limits, the millisecond yield) exists to make even those two reads, multiplied across thousands of nodes, finish fast enough that an AI agent can re-capture the tree on every single step of a workflow without the human noticing.

Where the tree is thin, and what catches the fall

The accessibility tree is the right default for desktop automation, but it is honest to name where it gets thin. Three cases come up repeatedly.

Custom-drawn surfaces. A fullscreen game, a 3D modeller, a browser canvas (Figma's drawing area, Excalidraw, Miro): these paint their own pixels and expose a single opaque node with no useful children. The tree cannot help you there because the app never built one.

Shallow trees. Some apps implement accessibility poorly. Electron apps are known for flat, unhelpful trees. And a genuinely deep web view can exceed the depth limit, which is exactly why Terminator raises the default depth to 500 for browsers and why "element not found" often means "deeper than the depth you captured" rather than "not there."

Lying providers. A few targets return success on a tree action while doing nothing, most notoriously browser web views on macOS. The tree is not wrong about what exists; it is wrong about what acting on it will do.

Terminator's response is to treat the tree as the fast first layer of a stack, not the only layer. Its own description is "accessibility tree + DOM + OCR + vision AI for maximum reliability": the tree resolves the overwhelming majority of elements at CPU speed, DOM access through a Chrome extension covers web content the tree mangles, and OCR plus vision catch the custom-drawn surfaces. The tree is the default precisely because it is cheap and structured; the other layers exist because no single layer is complete.
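The layering idea reduces to "try the cheap layer first, fall through on a miss". The sketch below shows only that control flow. Layer, Resolver, and resolve are hypothetical names, and a resolver here is just a function from a target description to an optional click point; how Terminator actually wires its layers together is not shown in this article.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    AccessibilityTree,
    Dom,
    Ocr,
    Vision,
}

pub type Point = (f64, f64);

/// Hypothetical resolver: returns a click point if this layer finds the target.
pub type Resolver = fn(&str) -> Option<Point>;

/// Try each layer in the order given and report which one answered.
pub fn resolve(target: &str, layers: &[(Layer, Resolver)]) -> Option<(Layer, Point)> {
    layers
        .iter()
        .find_map(|(layer, find)| find(target).map(|point| (*layer, point)))
}
```

Ordering the slice from cheapest to most expensive keeps the common case on the tree, and the returned Layer tells the caller which fallback, if any, was needed.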

Capturing it yourself

Everything above is one function call away. From the Node SDK, desktop.getWindowTree('notepad') returns a UINode you can walk or serialize; getWindowTreeResult additionally hands back the formatted indexed string and an index-to-bounds map for click-by-index. From an MCP agent in Claude Code, Cursor, or VS Code, the same capture is exposed as a tool: the agent asks for the tree, reads the structured nodes, and picks a selector, all running the same Rust tree builder this page describes.

install the MCP server

claude mcp add terminator "npx -y terminator-mcp-agent@latest"

or pull the SDK directly: npm i @mediar-ai/terminator


Questions about the accessibility tree

What is the accessibility tree in desktop automation?

It is the hierarchy of UI elements the operating system already maintains so screen readers can describe an app. Every window, panel, button, text field, list item, and menu entry is a node, and each node carries semantic properties: a role (Button, Edit, CheckBox), an accessible name, a value, state flags, and a bounding rectangle. Desktop automation frameworks walk this tree to find and act on elements instead of matching pixels in a screenshot. On Windows the tree is exposed by UI Automation, on macOS by the Accessibility API, on Linux by AT-SPI2. Terminator captures it through these native APIs and matches elements with a selector grammar.

Is reading the accessibility tree actually free or fast?

No, and that is the part most guides skip. The tree is not a JSON blob sitting in memory that you copy out. Each node and each property of each node is a cross-process call: your automation process asks the target application's UI Automation provider for one value at a time, over COM on Windows. A node with role, name, value, bounds, enabled state, and focus state is six separate round trips. A complex app like Chrome with a deep web view has thousands of nodes. A naive 'capture everything on every node' dump can take several seconds and tens of thousands of round trips. The engineering work in a desktop automation framework is not getting the tree, it is getting it fast enough to be usable.

What does a single tree node carry in Terminator?

The UIElementAttributes struct in crates/terminator/src/element.rs (lines 309 to 343) defines roughly 17 fields: role, name, label, text, value, description, application_name, a free-form properties map, is_keyboard_focusable, is_focused, is_toggled, bounds, enabled, is_selected, child_count, and index_in_parent. A UINode (lib.rs lines 328 to 338) wraps those attributes plus an id, a children vector, and a selector string that is the full chained path from the root to that node. The selector field is why every node in a captured tree is directly addressable: you can copy it straight into desktop.locator().

Why does the default mode only load role and name?

Because the PropertyLoadingMode enum (crates/terminator/src/platforms/mod.rs lines 57 to 64) defaults to Fast, and Fast loads only the essential properties. Selector matching needs role, name, and id; it does not need description, value, or bounds for every node in the tree. Loading all 17 fields on all nodes is the Complete mode, and it is several times slower because every extra field is another cross-process call multiplied across the whole tree. Smart mode sits between them, loading properties based on element type. Fast is the default because an agent loop that re-captures the tree after every action cannot afford Complete mode latency on each pass.

Why are bounds only populated for keyboard-focusable elements?

Look at element.rs line 334. The bounds field carries the inline comment 'Only populated for keyboard-focusable elements'. The tree builder's get_configurable_attributes function checks element.is_keyboard_focusable() and only then calls element.bounds() to attach a rectangle. The reasoning: a static text label or a layout container is not something you click, so its pixel rectangle is dead weight in the tree. The elements you actually act on (buttons, edit fields, checkboxes, list items) are exactly the keyboard-focusable ones, and those get bounds. The tree formatter then assigns a click index only to nodes that have bounds, so the indexed, clickable subset of the tree is the focusable subset by construction.

How does Terminator walk the tree without freezing the machine?

build_ui_node_tree_configurable in tree_builder.rs uses an explicit work queue instead of pure recursion, so a pathologically deep app cannot blow the stack; recursion is capped at depth 100 and anything deeper is pushed back onto the queue. Children are processed in batches (batch_size defaults to 50). Every 50 elements (yield_every_n_elements) and between large batches, the builder calls thread::sleep for 1 millisecond to yield CPU so the host UI stays responsive while the tree is being captured. Tree depth itself defaults to 50 levels, raised to 500 for browsers because web apps nest deeply. None of this is exotic; it is the unglamorous machinery that turns 'read the tree' from a slogan into a function that returns in a few hundred milliseconds.

How is the accessibility tree better than screenshots for an agent?

Two reasons. First, size and speed: a formatted accessibility tree for a typical app is a few kilobytes of structured text, while a screenshot is hundreds of kilobytes of pixels that a vision model has to process. Sending text to an LLM is faster and cheaper than sending an image. Second, precision: a node says 'this is a Button named Save at these bounds' with no guessing, where a vision model has to infer what is clickable. Terminator's own positioning is that this is what makes it deterministic and roughly 100x faster than screenshot-driven agents, because actions resolve through the tree at CPU speed instead of through an inference call per step.

When does the accessibility tree fail, and what do you fall back to?

Three known gaps. Custom controls that draw their own pixels (some games, some 3D tools, canvas drawing surfaces) show up as a single opaque node with no useful children. Some apps implement accessibility poorly: Electron apps are known for flat, shallow trees, and a deeply nested web view may exceed the default depth so you have to raise it or scope the search. And a few targets, notably browser web views on macOS, return success on a tree action while doing nothing. Terminator's answer is layered: accessibility tree first, then DOM access through a Chrome extension, then OCR, then vision AI, so the tree is the fast default and the other layers catch what it misses.

How do I capture the tree myself to see this?

From the Node SDK: const tree = desktop.getWindowTree('notepad'). It returns a UINode you can walk or serialize to JSON. getWindowTreeResult additionally gives you the formatted, indexed string and an indexToBounds map for click-by-index. From an MCP agent (Claude Code, Cursor, VS Code), the same capture is exposed as a tool, so the agent asks for the tree, reads the structured nodes, and picks a selector. Install with claude mcp add terminator "npx -y terminator-mcp-agent@latest". Either path runs the same Rust tree builder described here.

Does the tree look the same on Windows and macOS?

The shape is the same (a hierarchy of role-bearing nodes) but the providers differ. Windows UI Automation has the richest provider coverage and is Terminator's primary platform. macOS exposes the Accessibility API (AXUIElement) and requires the user to grant accessibility permission. Linux uses AT-SPI2. Role names differ across platforms (a Windows 'Edit' is a macOS 'AXTextField'), which is why Terminator's selector grammar matches on substrings and why you inspect the real tree of your target app before writing selectors rather than guessing role names.
