Accessibility API for AI agents: diff the tree, don't re-read it.

The accessibility tree is the right input for an AI agent. UIA on Windows, AXUIElement on macOS, and AT-SPI2 on Linux all expose role, name, state, and bounds for every element on screen, along with a platform-specific identifier such as UIA's AutomationId. That part most articles cover well. The part they miss is the loop. An agent that re-reads the full tree on every step burns tokens linearly with task length and ends up drowning in unchanged context. Terminator returns a ui_tree_diff after each action: only the lines that changed, with volatile #ids and bounds stripped first. That is the unlock for long-horizon agents.

Matthew Diakonov
11 min read
Two regexes strip volatile #ids and bounds before diffing
similar::TextDiff::from_lines emits only + and - lines
20 MCP action tools accept ui_diff_before_after as a parameter
Compact YAML format keeps full and delta on the same schema

The trap every other guide on this leaves you in

Read the existing playbooks for "use the accessibility API to drive an AI agent" and you will get the same recipe. Pull the UIA tree, serialize it, hand it to the model, ask the model what element to act on, send back an action. Then, almost universally, the article ends. What it does not say is what the loop looks like after action one. The honest answer is: most implementations call the tree-getter again. Every step. From scratch.

That recipe falls apart on anything longer than three or four actions. A real desktop window has 800 to 4,000 accessible elements. Outlook compose, Salesforce inside Chrome, a Jira backlog with 200 visible cards — all of them sit in that range. Re-reading those on every step pushes input tokens north of half a million for a single task. The model spends most of its attention on layout it has already seen, and gets confused about what your last click actually did because nothing in the dump is highlighted.

The fix is structural. Don't send the tree again. Send the diff.

20 MCP action tools accept ui_diff_before_after
2 regex passes strip volatile UIA attributes pre-diff
~94% input-token reduction in a 20-step Outlook task vs naive loop (504,000 vs 28,000 tree tokens)
0 lines emitted by the diff when the action was a no-op

What the loop actually looks like

Three things happen on every Terminator agent step. The accessibility tree gets snapshotted. An action tool fires. The tree gets snapshotted again, diffed against the first snapshot, and only the changed lines reach the model. Inside the agent harness, that whole flow is one MCP call with one extra parameter.

Diagram: from snapshot to delta to model context. The pre-action snapshot and the post-action snapshot feed ui_tree_diff.rs, which emits + added lines and - removed lines, or Ok(None) when nothing changed.

The two regexes and the line diff

Open crates/terminator/src/ui_tree_diff.rs and the whole thing is 100 lines, including doc comments. Two regex strips, one branch on JSON vs YAML format, one call into the similar crate, three ChangeTag arms. Every line that does anything is in the block below.

crates/terminator/src/ui_tree_diff.rs

The first regex, ` #[\w\-]+`, drops UIA-assigned indices like #12345. Those are non-deterministic across renders, especially in WinUI, WPF, and Electron apps. The second regex, `bounds: \[[^\]]+\],?\s*`, drops bounding rectangles, which shift every time the window resizes or scrolls. Both pieces of data are useful in the full tree (so the agent can click at coordinates), but they are pure noise inside a diff. The strip is the difference between a 5-line meaningful change and a 200-line layout-reflow change.
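To make the strip concrete, here is a minimal Python sketch that applies the two patterns quoted in this article. It is a stand-in for the Rust code, not a copy of it: the function name `strip_volatile` and the sample lines are assumptions made for illustration.

```python
import re

# Patterns as quoted in the article for the compact-YAML path.
ID_PATTERN = re.compile(r" #[\w\-]+")
BOUNDS_PATTERN = re.compile(r"bounds: \[[^\]]+\],?\s*")


def strip_volatile(tree: str) -> str:
    """Remove ` #id` tokens and `bounds: [...]` blocks from a tree dump."""
    without_ids = ID_PATTERN.sub("", tree)
    return BOUNDS_PATTERN.sub("", without_ids)


# Hypothetical sample: the same button, re-rendered with a new index
# and shifted 40 pixels down.
before = "  #3 [Button] Save (bounds: [10,20,44,32], focused)"
after = "  #917 [Button] Save (bounds: [10,60,44,32], focused)"

print(strip_volatile(before))                           # "  [Button] Save (focused)"
print(strip_volatile(before) == strip_volatile(after))  # True
```

After the strip the two renders are identical, so a line diff between them has nothing to report.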

0 lines

If the action did not move the tree, simple_ui_tree_diff returns Ok(None). The agent gets a literal no-op signal instead of a re-parsed identical dump.

ui_tree_diff.rs line 99, terminator-rs
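The no-op behavior is easy to sketch. The Python below uses the standard library's `difflib` in place of the `similar` crate, so treat it as an approximation of the shape, not the Terminator implementation. The function name `line_diff` and the sample trees are assumptions.

```python
import difflib
from typing import Optional


def line_diff(before: str, after: str) -> Optional[str]:
    """Return '+'/'-' prefixed changed lines, or None if nothing changed."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    changed = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines never reach the model
        changed.extend(f"- {line}" for line in old_lines[i1:i2])
        changed.extend(f"+ {line}" for line in new_lines[j1:j2])
    return "\n".join(changed) if changed else None


# Hypothetical trees, already stripped of ids and bounds.
tree_before = "[Window] Inbox\n  [Button] Reply"
tree_after = "[Window] Inbox\n  [Button] Reply\n  [Window] Compose"

print(line_diff(tree_before, tree_before))  # None: the action was a no-op
print(line_diff(tree_before, tree_after))   # "+   [Window] Compose"
```

Returning `None` instead of an empty string keeps "nothing changed" distinct from "a diff that happens to be empty text" on the harness side.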

Full tree vs delta, on the same step

Same action, two response shapes. The naive loop returns the entire window tree. The delta loop returns only the lines that changed. The dump below is what the model ingests in the naive case for a single click on Reply in Outlook.

What the agent sees per step

# get_window_tree(process: "outlook")
# returned every step in the naive loop

#1 [Window] Inbox - matt@mediar.ai - Outlook (bounds: [0,0,1920,1080], focused)
  #2 [TitleBar] (bounds: [0,0,1920,32])
    #3 [Button] Minimize (bounds: [1788,0,44,32])
    #4 [Button] Maximize (bounds: [1832,0,44,32])
    #5 [Button] Close (bounds: [1876,0,44,32])
  #6 [Pane] Ribbon (bounds: [0,32,1920,108])
    #7 [TabItem] Home (bounds: [12,32,68,32], selected)
    #8 [TabItem] Send / Receive (bounds: [80,32,124,32])
    #9 [TabItem] Folder (bounds: [204,32,68,32])
    #10 [TabItem] View (bounds: [272,32,52,32])
    ... 870 more elements ...
  #881 [List] Messages (bounds: [240,140,520,940])
    #882 [ListItem] John Doe — Quarterly review (bounds: [240,140,520,72], selected)
    ... 60 more list items ...

# ~24,000 input tokens for the agent, every turn
The delta response for the same click has about 99% fewer lines.

The model still has the full tree from turn 1 in its context. It still knows about the title bar, the ribbon, and the message list. It just does not need to re-read all of that to figure out that a click on Reply opened a compose window. The compose window is the diff. That is the whole insight.

One parameter, twenty tools

You do not have to wire the diff yourself. The MCP server accepts ui_diff_before_after: true on every tool that mutates UI state. The server captures the before-tree, fires the action, captures the after-tree, runs simple_ui_tree_diff, and includes the result in the tool response. The description on get_window_tree spells the policy out: "Do NOT call after action tools, use their ui_diff_before_after/include_tree_after_action params instead."
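If you are assembling the request by hand, the parameter rides along in the tool arguments. The sketch below builds a standard MCP `tools/call` JSON-RPC message in Python. The `ui_diff_before_after` flag is from this article; the helper name `build_tool_call`, the `selector` argument, and its value are hypothetical placeholders, so check the tool's real schema before relying on them.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request with the diff flag switched on."""
    params = {
        "name": tool,
        # ui_diff_before_after comes from the article; everything else in
        # `arguments` is supplied by the caller.
        "arguments": {**arguments, "ui_diff_before_after": True},
    }
    request = {"jsonrpc": "2.0", "id": request_id, "method": "tools/call", "params": params}
    return json.dumps(request)


# `selector` and its value are hypothetical placeholders.
payload = build_tool_call(1, "click_element", {"selector": "role:Button|name:Reply"})
print(payload)
```

Most MCP client libraries build this envelope for you; the point is only that the delta loop costs one extra key in the arguments.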

click_element, type_into_element, press_key, press_key_global, mouse_drag, scroll_element, select_option, set_selected, set_value, invoke_element, navigate_browser, open_application, activate_element, validate_element, wait_for_element, execute_browser_script, capture_screenshot, run_command, highlight_element, execute_sequence
agent-loop.ts

Five things the diff path actually does

The implementation is small, but each piece is doing real work inside the agent loop. Here is the breakdown of what every line in ui_tree_diff.rs buys you.

Two regexes do the volatility strip

` #[\w\-]+` removes per-instance indices like #12345. `bounds: \[[^\]]+\],?\s*` removes bounding-rectangle blocks. Defined at ui_tree_diff.rs lines 43 and 48.

similar::TextDiff::from_lines

Line-based diff at line 81. Equal lines are skipped (line 93). Only Insert and Delete lines reach the model, prefixed with + and -.

Ok(None) on no change

Line 99 returns no diff at all when the action did not move the tree. The agent literally sees the action was a no-op rather than re-parsing an unchanged dump.

JSON path or YAML path

Detected at line 63 by checking if the tree starts with `- [`. JSON trees go through preprocess_tree (lines 26-35); YAML trees through the regex strip (lines 40-50). Same diff after.
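A rough Python rendering of that branch, under stated assumptions: the `- [` prefix check and the recursive removal of `id` and `element_id` keys come from this article, while the function names, the pretty-printing with sorted keys, and the sample JSON are illustrative choices, not the crate's behavior.

```python
import json
from typing import Any

# Keys the article says the JSON path drops recursively.
VOLATILE_KEYS = {"id", "element_id"}


def is_compact_yaml(tree: str) -> bool:
    """Format check as described: compact YAML starts with '- ['."""
    return tree.startswith("- [")


def drop_volatile_keys(value: Any) -> Any:
    """Walk dicts and lists, removing volatile keys at every depth."""
    if isinstance(value, dict):
        return {k: drop_volatile_keys(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [drop_volatile_keys(v) for v in value]
    return value


def preprocess_json_tree(tree: str) -> str:
    """Re-serialize one key per line so a line-based diff stays meaningful."""
    cleaned = drop_volatile_keys(json.loads(tree))
    return json.dumps(cleaned, indent=2, sort_keys=True)  # sort_keys is an assumption


raw = '{"id": "123", "role": "Window", "children": [{"element_id": "9", "role": "Button", "name": "Save"}]}'
print(is_compact_yaml(raw))       # False, so this tree takes the JSON path
print(preprocess_json_tree(raw))
```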

20 MCP tools, one parameter

click_element, type_into_element, press_key, set_value, invoke_element, scroll_element, select_option, set_selected, mouse_drag, navigate_browser, open_application, activate_element, validate_element, wait_for_element, execute_browser_script, capture_screenshot, run_command, press_key_global, highlight_element, execute_sequence. All twenty accept ui_diff_before_after.

One source tree, two agent loops

The same accessibility surface can be driven two ways. One is the loop the rest of the open ecosystem ships with. The other is the loop Terminator ships out of the box. They use the same input but produce very different cost curves over a 20-step task.

Naive accessibility-tree agent vs delta-loop agent

Full tree on turn 1. Action. Full tree again. Action. Full tree again. Action. The model re-ingests the entire window after every step. Layout reflow, AutomationId churn, and bounds shifts pollute the context. Token cost scales linearly with task length.

  • 21 full snapshots over 20 steps
  • ~500,000 tree input tokens on a medium task
  • Bounds and AutomationId noise inflates apparent change
  • Hard to tell which step actually moved the UI

Five steps inside one MCP turn

Here is the sequence the server runs the moment you flip ui_diff_before_after: true. You do not write any of this; the MCP server handles it. But knowing the shape is what lets you reason about the agent loop.

One agent turn end to end

1. Turn 1: full tree, once

The MCP agent calls get_window_tree(process: 'chrome'). format_tree_as_compact_yaml runs. The agent receives the full compact YAML — every element with role, name, bounds, and state. This is the only time the model sees the entire tree.

2. Action tool, with the diff flag

The agent calls click_element / type_into_element / press_key with `ui_diff_before_after: true`. The server snapshots the tree, fires the action, snapshots again, and runs simple_ui_tree_diff between the two snapshots.

3. Volatility strip

Inside ui_tree_diff.rs, two regexes drop ` #<id>` patterns and `bounds: [...]` blocks from both snapshots. UIA's per-instance noise (re-rendered AutomationIds, layout reflow) is gone before the diff even starts.

4. Line-based diff

similar::TextDiff::from_lines walks the cleaned snapshots. Equal lines are skipped (zero output). Inserts emit `+ <line>`. Deletes emit `- <line>`. The result is concatenated and shoved into the action tool's response under `ui_tree_diff`.

5. Agent reasons against the delta

The model sees: maybe 5 lines, maybe 50, almost never 4,000. It plans the next action against the changed elements, calls another action tool, and the loop repeats. Total tree input across N actions is roughly 1*full + N*delta, not (N+1)*full.

How to wire your agent harness to use this

Six rules, none of them subtle. If your harness already has an MCP transport, every one of these is a five-minute change.

Delta-loop checklist

  • On the first agent turn, call get_window_tree once. Cache the YAML on your side. The agent gets the full tree exactly once per task.
  • On every action tool call after that, set `ui_diff_before_after: true`. The MCP server returns the changed lines, prefixed with + and -, in the result.
  • When the action triggered a navigation or window switch (open_application, navigate_browser, activate_element on a different window), use `include_tree_after_action: true` instead so the agent fully re-orients.
  • Treat `Ok(None)` (no diff returned) as 'the action did not move the tree.' That is a real, actionable signal — the model should consider the action a no-op rather than hallucinate a state change.
  • Never strip the +/- prefixes before passing the diff into the model context. The prefixes are how the agent tells removed elements from added ones; without them, role: Button could mean either appeared or disappeared.
  • If you also enable include_browser_dom or include_ocr, the diff still works — both sources land inside the same compact YAML, prefixed by source (#u, #d, #o), and TextDiff treats them as lines.
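The checklist above can be sketched as two small harness helpers. This is a minimal Python sketch, assuming the tool result arrives as a dict with the diff under a top-level `ui_tree_diff` key; the helper names are hypothetical, and treating every activate_element call as a window switch is a deliberate simplification.

```python
from typing import Optional

# Tools the checklist singles out as likely to navigate or switch windows.
# Simplification: activate_element is always treated as a window switch.
REORIENTING_TOOLS = {"open_application", "navigate_browser", "activate_element"}


def tree_flags_for(tool: str) -> dict:
    """Pick the full tree after a re-orienting action, the diff otherwise."""
    if tool in REORIENTING_TOOLS:
        return {"include_tree_after_action": True}
    return {"ui_diff_before_after": True}


def context_update(tool_result: dict) -> str:
    """Turn a tool result into the text appended to the model context."""
    # Assumption: the diff sits under a top-level `ui_tree_diff` key.
    diff: Optional[str] = tool_result.get("ui_tree_diff")
    if not diff:
        return "UI tree unchanged: treat the last action as a no-op."
    # The +/- prefixes are passed through untouched.
    return "UI changes since last action (+ added, - removed):\n" + diff


print(tree_flags_for("click_element"))     # {'ui_diff_before_after': True}
print(tree_flags_for("navigate_browser"))  # {'include_tree_after_action': True}
print(context_update({}))
```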

Side by side: which loop are you running?

If you have built an accessibility-tree agent before, this table is the question to ask of your own code. Most agents in the wild are running the left column even when they think they're running the right.

Feature | Naive accessibility-tree loop | Terminator delta loop
tree input frequency | full tree on every agent step | full tree once, delta on every step after
volatility filtering | raw IDs and bounds bleed into every diff | two regex passes strip #ids and bounds before diffing
diff engine | manual JSON.diff or string compare per agent author | similar::TextDiff::from_lines, line-based, single source of truth
where the delta is exposed | rebuilt by hand in each agent harness | ui_diff_before_after parameter on 20 MCP action tools
format consistency | full tree and partial tree often have different schemas | compact YAML for both full and delta, same parser on the agent side
no-op handling | agent receives an unchanged tree and may hallucinate progress | Ok(None) is returned, the agent literally sees nothing happened
browser + native parity | DOM and UIA come back in different shapes | both prefixed (#d, #u) inside the same compact YAML, diff treats them the same

The shortest path to trying this

Wire the MCP server into Claude Code, Cursor, or VS Code with a single command. The server ships with ui_diff_before_after already exposed on every action tool. Your existing AI coding assistant becomes a desktop agent on a delta loop, not a screenshot loop.

claude mcp add terminator "npx -y terminator-mcp-agent@latest"

MIT licensed. Source at github.com/mediar-ai/terminator. The diff path is at crates/terminator/src/ui_tree_diff.rs.

Frequently asked questions

What does an accessibility API actually give an AI agent that a screenshot doesn't?

Two things a vision model has to reconstruct from pixels. First, the role of every element (Button, ComboBox, Edit, MenuItem, ListItem) as the OS itself classifies it, not as a CNN guesses. Second, a stable selector grammar (role + name + window + AutomationId) that the agent can call back to in the next step without re-finding the element by visual coordinates. On Windows that comes from UIAutomation. On macOS it's AXUIElement. On Linux it's AT-SPI2. The agent's job changes from 'where is the Save button at this DPI' to 'invoke the element whose role is Button and whose name is Save.'

If accessibility trees are so good for agents, why do most accessibility-tree agents still struggle on long tasks?

Because they re-read the full tree after every action. A medium-complexity desktop window (Outlook compose, a Salesforce browser tab, a Jira backlog) renders 800 to 4,000 elements in its UIA tree. At the roughly 27 tokens per element implied by the Outlook example (about 900 elements, about 24,000 tokens), that's roughly 20,000 to 110,000 input tokens per snapshot. If the agent calls 20 tools to finish a task, the model has now seen the same tree 20 times, with most attributes identical. The token bill scales with steps, the latency scales with steps, and the model gets confused about what changed because nothing is highlighted. The fix is not 'use a smaller tree,' it's 'send a diff.'

Where is the diff implementation in Terminator and what does it actually do?

crates/terminator/src/ui_tree_diff.rs. The function `simple_ui_tree_diff` at line 58 takes the before-tree and after-tree as strings, detects whether they're JSON or compact YAML, and runs them through `remove_ids_and_bounds_from_compact_yaml` (lines 40-50) or `preprocess_tree` (lines 26-35). The YAML path uses two regexes: ` #[\w\-]+` to strip indices like `#12345` and `#abc-def`, and `bounds: \[[^\]]+\],?\s*` to strip bounding-rectangle blocks. The JSON path walks the value tree and drops `id` and `element_id` keys recursively. Then it calls `TextDiff::from_lines` from the `similar` crate (line 81) and emits only `+` (insert) and `-` (delete) lines. Equal lines are skipped. If nothing changed, it returns `Ok(None)` and the model sees no tree at all that turn.

Which MCP tools accept ui_diff_before_after and why does that matter?

Twenty of them, including click_element, type_into_element, press_key, press_key_global, mouse_drag, scroll_element, select_option, set_selected, set_value, invoke_element, navigate_browser, open_application, activate_element, validate_element, wait_for_element, execute_browser_script, capture_screenshot, run_command, and the meta-tool execute_sequence. Every action tool that mutates UI state takes the parameter. The tool captures a tree snapshot before it fires the action, performs the action, captures another snapshot, runs `simple_ui_tree_diff`, and includes the diff in the tool result. That removes a whole class of agent calls (the explicit get_window_tree after every action) and is exactly what the description on get_window_tree itself warns about: 'Do NOT call after action tools - use their ui_diff_before_after/include_tree_after_action params instead.'

How is the tree formatted before it's diffed?

Compact YAML, not raw UIA serialization. crates/terminator/src/tree_formatter.rs::format_tree_as_compact_yaml emits one element per line in the form `#1 [ROLE] name (bounds: [x,y,w,h], focusable, focused, selected, value: ...)`. Children are indented two spaces. Elements with bounds get a 1-based clickable index; elements without bounds get a `- [ROLE]` dash prefix. State flags are only included when true (`focused`, `selected`, `toggled`, `disabled`). This format is denser than UIA's native XML by an order of magnitude, and the regex-based strip in ui_tree_diff.rs is designed to operate on exactly this shape.
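To see the shape on a toy tree, here is a small Python formatter that follows the rules stated above: one element per line, two-space indentation, a 1-based index for elements with bounds, a dash for elements without, and state flags only when true. The `Element` class, the pre-order numbering, and the subset of flags are assumptions for illustration; this is not tree_formatter.rs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Element:
    role: str
    name: str = ""
    bounds: Optional[Tuple[int, int, int, int]] = None
    focused: bool = False
    selected: bool = False
    children: List["Element"] = field(default_factory=list)


def format_compact(root: Element) -> str:
    """Emit one line per element in the compact shape described above."""
    lines: List[str] = []
    next_index = 1  # 1-based index, handed out only to elements with bounds

    def visit(el: Element, depth: int) -> None:
        nonlocal next_index
        attrs = []
        if el.bounds is not None:
            prefix = f"#{next_index}"
            next_index += 1
            attrs.append("bounds: [" + ",".join(str(v) for v in el.bounds) + "]")
        else:
            prefix = "-"
        # State flags appear only when true.
        attrs += [flag for flag in ("focused", "selected") if getattr(el, flag)]
        line = f"{'  ' * depth}{prefix} [{el.role}]"
        if el.name:
            line += f" {el.name}"
        if attrs:
            line += f" ({', '.join(attrs)})"
        lines.append(line)
        for child in el.children:
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines)


tree = Element("Window", "Inbox", (0, 0, 1920, 1080), focused=True, children=[
    Element("Group"),
    Element("Button", "Reply", (10, 40, 60, 24)),
])
print(format_compact(tree))
```

The printed result is three lines: the window at `#1` with its bounds and `focused`, an indented `- [Group]`, and the Reply button at `#2`.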

Why strip the AutomationId and bounding rectangle before diffing?

Because both are volatile across a single application's run. `AutomationId` is generated per-instance for many WinUI, WPF, and Electron controls; the same Save button gets a new id every render pass. `bounds` shifts whenever the window is resized, scrolled, or DPI-changed, even when nothing semantically changed. If you don't strip these, every action looks like a 200-line diff because the layout reflowed. After the strip, the diff is exactly the elements that appeared, disappeared, or had a name, role, or state change. That's what the agent actually needs to reason about.

What does this mean for token cost on a real desktop task?

Take a 20-step automation that opens Outlook, drafts a reply, attaches a file, and sends it. The Outlook compose window's UIA tree is roughly 900 elements. In compact YAML that's about 24,000 tokens. Naive loop: send the full tree before step 1 and after each of 20 actions. Total: 21 * 24,000 = 504,000 tokens of tree input. Delta loop: send the full tree once, then 20 deltas of typically 5 to 30 lines each (~200 tokens). Total: 24,000 + 20 * 200 = 28,000 tokens of tree input. The agent does the same task with about 5% of the input cost on the tree side, and the model is no longer drowning in unchanged context.
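The arithmetic above is short enough to check in a few lines of Python. The inputs are the article's own estimates (24,000 tokens for the full tree, about 200 per delta, 20 steps); the function name is an assumption.

```python
def tree_input_tokens(full_tokens: int, delta_tokens: int, steps: int) -> dict:
    """Compare tree-side input cost for the naive loop and the delta loop."""
    naive = (steps + 1) * full_tokens           # full tree before step 1 and after every action
    delta = full_tokens + steps * delta_tokens  # full tree once, then one delta per action
    return {"naive": naive, "delta": delta, "ratio": delta / naive}


budget = tree_input_tokens(full_tokens=24_000, delta_tokens=200, steps=20)
print(budget["naive"])            # 504000
print(budget["delta"])            # 28000
print(f"{budget['ratio']:.1%}")   # 5.6%
```

Plug in your own window size and step count; the gap widens as either one grows, because only the naive total multiplies the full tree by the number of steps.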

Does the diff path work for browser automation too, or just native windows?

Both. UIA exposes Chrome, Edge, and Firefox windows as accessibility trees, so the same `get_window_tree(process: 'chrome')` returns a tree the diff can run on. For richer DOM data, set `include_browser_dom: true` and you get the DOM merged with the UIA tree, prefixed with `#d` for DOM elements and `#u` for UIA elements (see ElementSource in tree_formatter.rs lines 41-79). The Chrome extension Terminator ships does the bridging. The diff treats both source prefixes the same — they're just lines.

What if I want the full tree on a specific step, not the diff?

Pass `include_tree_after_action: true` instead of `ui_diff_before_after: true`. The action tools accept either. Use the full tree when the action triggered a navigation or a window switch and you genuinely want the agent to re-orient. Use the diff for incremental changes within the same window. Mixing both during a workflow is fine; both branches share the same `format_tree_as_compact_yaml` pipeline so the agent doesn't see a different schema between turns.

How is Terminator different from screenshot-driven agents like Claude computer use or browser-use?

Different input, different loop, different cost curve. Screenshot agents send images to a vision model on every step; the model burns tokens on layout it already knows and re-derives selectors that are not stable across renders. Terminator sends a structured tree once and a delta after each action, so the model can plan with cheap, fully-described elements and call back to them by selector. Vision is still available (`include_gemini_vision: true`, `include_omniparser: true`, OCR via `include_ocr: true`) for the cases where an app has no accessible surface, but it's a fallback, not the default. The deterministic outcome that Terminator advertises (>95% success rate at CPU speed) comes from the accessibility-tree-plus-delta path being the primary route.

© 2026 terminator. All rights reserved.