Accessibility API for computer use agents: seven click paths, one MCP tool.
Most articles on this topic frame it as a binary. Either you build an accessibility-tree agent that clicks by role and name, or you build a computer use agent that clicks by pixel coordinates from a screenshot. The reality of any agent that ships against real desktop apps is messier. Office documents render their content inside a single canvas with no AX children. Electron apps hide their renderer behind one outer window. Custom controls in IDEs and design tools paint pixels directly. An agent that knows only one click path stalls on those surfaces.
Terminator's MCP server exposes one click tool, click_element, that dispatches across seven distinct grounding paths under the same JSON shape. The implementation is small, the agent contract is simple, and the entire fall-through ladder is in two enums and one function. This page is the tour.
The framing problem
Pick up any guide on accessibility versus computer use and you get the same diagram: a tree on the left, a screenshot on the right, an arrow between them labeled "or." The accessibility tree is presented as the structured, deterministic, fast option; vision is presented as the slow, fragile, expensive option. Pick one.
That framing is fine for a thought experiment. It does not match what the agent actually has to do. Real desktop work crosses surfaces. A daily flow can start in Outlook (rich AX tree), navigate into a Word document (opaque canvas), drop into an Electron-shell internal tool (AX hidden), and finish in a Chrome tab (DOM tree available, AX truncated). The agent that can only click by AX selector breaks at the second step. The agent that can only click by pixel coordinates is slow and re-reads the same elements every turn.
What the agent needs is a single click tool that accepts whichever grounding source actually saw the element on this turn, and converges on the same OS-level click underneath. That is what click_element is.
The seven paths
Three click modes, five vision sources. The combinations the server actually accepts are: selector, AX-tree index, OCR index, Omniparser index, Gemini-vision index, DOM index, and raw coordinates. Seven, total. Every one of them goes through one function.
Selector → AX engine
click_element with selector: 'role:Button && name:Save'. Resolves through the platform AX engine (UIA on Windows, AXUIElement on macOS, AT-SPI2 on Linux). The most stable path; survives re-render, resize, DPI change. Use as default.
Index + UiTree
click_element with index: 88 and vision_type: 'uitree' (the default). Looks up uia_bounds, the cache populated by get_window_tree. Same AX provenance as selector mode but referenced by the integer the model just read.
Index + Dom
Browser only. include_browser_dom merges the Chromium DOM tree with UIA, prefixed #d. click_element index plus vision_type 'dom' looks up dom_bounds from the live page. Use when AX exposes only the outer Chrome window but the page itself is what matters.
Index + Omniparser
include_omniparser runs Microsoft's icon detector on the screenshot and emits indexed boxes prefixed #m. click_element index plus vision_type 'omniparser' resolves omniparser_items. Use when the target is an unlabeled icon or graphical control.
Index + Gemini
include_gemini_vision asks Gemini to label salient elements; the result is cached as vision_items and prefixed #v. click_element index plus vision_type 'gemini' resolves through that cache. Use as a heavier fallback when neither AX nor Omniparser caught the element.
Index + Ocr
include_ocr runs text recognition over the screenshot, prefixes detected text with #o, and stores ocr_bounds. click_element index plus vision_type 'ocr' clicks the center of a recognized phrase. Use for app UIs that paint text directly into a canvas.
Coordinates
click_element with x and y. No grounding, no cache lookup, just desktop.click_at_coordinates_with_type at server.rs:2552. The last-resort path when nothing else caught the element. Stable only as long as the window does not move.
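All seven paths ride the same tool signature; the server picks the path from whichever arguments are present. A minimal sketch of the three top-level argument shapes, written here as the JSON payloads an agent would emit (field names follow the descriptions above; exact casing is an assumption):

```rust
use serde_json::json;

fn main() {
    // Selector mode: the platform AX engine resolves role and name.
    let by_selector = json!({ "selector": "role:Button && name:Save" });

    // Index mode: a 1-based integer plus the cache that produced it.
    // vision_type defaults to "uitree" when omitted.
    let by_index = json!({ "index": 156, "vision_type": "omniparser" });

    // Coordinate mode: raw screen-space click, no cache lookup.
    let by_coordinates = json!({ "x": 840, "y": 46 });

    println!("{by_selector}\n{by_index}\n{by_coordinates}");
}
```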
Five vision caches feed one click router
The two enums that decide everything
The dispatch logic is two short enums and one match. ClickMode decides which top-level argument set the agent sent. VisionType, for index mode, decides which cache to dereference. Both are declared in crates/terminator-mcp-agent/src/utils.rs.
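Both fit in a screenful. A condensed sketch (variant names come from the declarations cited on this page; the derives and serde attributes in the real file are omitted):

```rust
/// Which top-level argument set the agent sent.
enum ClickMode {
    Selector,    // selector: "role:Button && name:Save"
    Index,       // index: 88, optionally with a vision_type
    Coordinates, // raw screen-space x and y
}

/// For index mode: which server-side bounds cache to dereference.
enum VisionType {
    Ocr,        // ocr_bounds        (#o lines)
    Omniparser, // omniparser_items  (#m lines)
    Gemini,     // vision_items      (#v lines)
    Dom,        // dom_bounds        (#d lines)
    UiTree,     // uia_bounds        (#u lines, the default)
}
```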
That is the entire surface area for choosing a click path. The interesting part is what the index branch does once vision_type picks a cache.
Five caches, one converging path
Each grounding source is a Mutex over a HashMap keyed by the 1-based index the agent saw in the tree: uia_bounds, ocr_bounds, omniparser_items, vision_items, dom_bounds. The match below, reconstructed in condensed form, is the vision-type fork in click_element. Every arm computes the same shape, a label and an (x, y, w, h) tuple, and every arm hands off to the same desktop call.
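This listing is a sketch assembled from that description, not the verbatim source: the uniform Cache type collapses the per-source value types into one shape, lookup is a hypothetical helper, and VisionType is the enum sketched above.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Bounds = (f64, f64, f64, f64); // (x, y, w, h) in real screen space
type Cache = Mutex<HashMap<u32, (String, Bounds)>>;

struct ServerState {
    uia_bounds: Cache,
    ocr_bounds: Cache,
    omniparser_items: Cache,
    vision_items: Cache,
    dom_bounds: Cache,
}

// Hypothetical helper: lock a cache, clone the entry for the 1-based
// index the agent copied out of the tree.
fn lookup(cache: &Cache, index: u32) -> Option<(String, Bounds)> {
    cache.lock().ok()?.get(&index).cloned()
}

// The vision-type fork: five arms, one shape, one exit.
fn resolve(state: &ServerState, vision_type: VisionType, index: u32) -> Option<(String, Bounds)> {
    match vision_type {
        VisionType::Ocr => lookup(&state.ocr_bounds, index),
        VisionType::Omniparser => lookup(&state.omniparser_items, index),
        VisionType::Gemini => lookup(&state.vision_items, index),
        VisionType::Dom => lookup(&state.dom_bounds, index),
        VisionType::UiTree => lookup(&state.uia_bounds, index),
    }
    // The caller then clicks the center of the cached box:
    // desktop.click_at_coordinates_with_type(x + w / 2.0, y + h / 2.0, ...)
}
```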
What the agent actually sees
The whole router is invisible to the model unless we make it visible. Terminator does that by prefixing each tree line with the source that produced it: #u for UIA, #d for DOM, #m for Omniparser, #v for Gemini-vision, #o for OCR. The agent reads one tree and copies the prefix and the integer straight into the next click_element call.
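The exact line layout varies per source, but the contract is the prefix plus the integer. An illustrative, not verbatim, slice of one unified tree:

```text
#u88  Button "Save"
#d12  button "Save" (live DOM)
#m156 icon bounds [840,32,68,28]
#v7   "save icon" bounds [840,32,68,28]
#o42  OCR: "Save" bounds [840,32,68,28]
```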
“The accessibility tree is the right input. It is also not the only input the agent will need.”
Terminator design notes
One agent step, two grounding sources
The agent does not have to commit to a grounding source up front. It can try the cheapest path, fall through on failure, and finish the click. Same tool, different arguments.
What it looks like end to end
One get_window_tree populates every cache. The agent then makes calls until the click lands. The bounds cache is shared across all clicks within the turn, so the second call hits a hot path.
Agent fall-through across click_element grounding sources
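A sketch of that turn as the payloads the agent might emit; argument casing is an assumption, and the flags mirror the wiring rules later on this page:

```rust
use serde_json::json;

fn main() {
    // Turn start: one tree call populates every cache the router can hit.
    let turn_start = json!({
        "tool": "get_window_tree",
        "arguments": {
            "include_ocr": true,
            "include_omniparser": true,
            "include_browser_dom": true
        }
    });

    // Cheapest path first: AX selector through the platform engine.
    let first_try = json!({
        "tool": "click_element",
        "arguments": {
            "selector": "role:Button && name:Save",
            "ui_diff_before_after": true
        }
    });

    // The AX engine returned no match (canvas-painted UI), but OCR caught
    // the label as #o42: copy prefix and integer verbatim to fall through.
    let fall_through = json!({
        "tool": "click_element",
        "arguments": {
            "index": 42,
            "vision_type": "ocr",
            "ui_diff_before_after": true
        }
    });

    println!("{turn_start}\n{first_try}\n{fall_through}");
}
```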
How an agent should pick the source
The ladder is ordered by stability. AX selectors are the most stable because the OS itself classified the element. AX indices share that provenance. DOM indices are stable until the page navigates. Omniparser and Gemini-vision boxes are stable for the screenshot they ran on. OCR is the least stable because text recognition jitters. Raw coordinates are the most fragile.
The fall-through ladder, in order
1. AX selector. Try role + name through the OS accessibility engine (UIA, AXUIElement, AT-SPI2). The most stable click; reuses the OS's own classification.
2. AX index. If the agent already has a UIA tree from get_window_tree, click by integer. Same provenance as selector, no string parsing.
3. DOM index. In Chrome, the bundled extension merges the live DOM into the same compact YAML. Use the #d-prefixed line when AX hides the page contents.
4. Omniparser index. For unlabeled icons and graphical controls. include_omniparser runs Microsoft's detector on the screenshot and indexes each box.
5. Gemini-vision index. Heavier vision pass when Omniparser missed the element. include_gemini_vision asks Gemini to label salient regions; results are indexed and cached.
6. OCR index. When the target is text painted into a canvas with no element behind it. include_ocr runs recognition; click the center of the matched phrase.
7. Raw coordinates. Last resort. Pass x and y. Stable only as long as the window does not move. Use only when none of the indexed sources caught the element.
Why index beats raw coordinates for vision-grounded clicks
A common shortcut for computer use agents is to skip indices and ask the model for absolute pixel coordinates from the screenshot directly. Terminator's bundled gemini_computer_use loop does this, because the Gemini Computer Use API was designed around it. That loop ships at crates/terminator-computer-use/src/lib.rs and uses normalized 0-999 coordinates that get pushed through four sequential transforms (resize_scale, DPI, window x, window y) before hitting the OS. It works, but every click compounds error from four sources of float math.
Index-based vision sidesteps that. The detector ran on the screenshot at capture time and the server stored the resulting bounds in real screen-space coordinates. The agent passes an integer; the server picks up the bounds and calls click. There is no per-click math, no compounding error, no chance the model re-derives a slightly different coordinate next turn.
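The contrast is easy to see side by side. A hedged sketch of the two pipelines: the constants in the first function are illustrative stand-ins, not the real convert_normalized_to_screen, and only the transform order comes from the description above.

```rust
// Normalized-coordinate path (gemini_computer_use style): every click
// replays four transforms, each a chance for float error to compound.
fn normalized_to_screen(nx: f64, ny: f64) -> (f64, f64) {
    let (capture_w, capture_h) = (1280.0, 800.0); // size the model saw
    let resize_scale = 1.5; // screenshot was downscaled before upload
    let dpi = 1.25;         // OS display scaling
    let (win_x, win_y) = (120.0, 80.0); // window origin on the desktop
    (
        (nx / 999.0) * capture_w * resize_scale * dpi + win_x,
        (ny / 999.0) * capture_h * resize_scale * dpi + win_y,
    )
}

// Index path: bounds were stored in real screen space at capture time,
// so the click is a map lookup plus center arithmetic, nothing to re-derive.
fn index_to_screen(bounds: (f64, f64, f64, f64)) -> (f64, f64) {
    let (x, y, w, h) = bounds;
    (x + w / 2.0, y + h / 2.0)
}

fn main() {
    println!("{:?}", normalized_to_screen(437.0, 52.0));
    println!("{:?}", index_to_screen((840.0, 32.0, 68.0, 28.0)));
}
```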
| Feature | Pure-AX or pure-vision agent | Terminator click_element router |
|---|---|---|
| click paths exposed to the agent | one (selector) or one (coordinates) | seven, switchable per click via the same tool signature |
| behavior on non-AX UI (Office canvas, Electron, custom controls) | the agent stalls or hallucinates a missing element | fall through to OCR, Omniparser, Gemini-vision, or DOM index |
| how vision-detected elements are addressed | raw pixel coordinates re-derived every turn | 1-based index into a server-side bounds map, stable across turns |
| DPI and window-offset handling for vision clicks | model emits normalized coords, convert_normalized_to_screen runs four transforms | bounds captured in real screen space, click_at_coordinates dispatched directly |
| where source-level grounding is declared | scattered across the agent harness | ClickMode at utils.rs:728, VisionType at utils.rs:1062, dispatch at server.rs:2486 |
| tree input shape | raw UIA XML or screenshot, depending on source | single compact YAML with #u, #o, #m, #v, #d prefixes per source |
| cost when the action did not move the tree | agent re-snapshots and re-reads the same elements | ui_diff_before_after returns the literal delta, often zero lines |
The numbers behind the router
Caches per agent turn: 5
uia_bounds, dom_bounds, ocr_bounds, omniparser_items, vision_items. Each populated by one get_window_tree call with the matching include_* flag.
Click paths exposed: 7
ClickMode::Selector, ClickMode::Index × five VisionType variants, ClickMode::Coordinates. All under one tool.
Wiring it into your agent
System prompt rules for the click_element router
- On the first agent turn, call get_window_tree with include_ocr, include_omniparser (or include_gemini_vision), and include_browser_dom in browsers. One get_window_tree call populates all five caches.
- Have the agent prefer selectors. The model emits role:X && name:Y; the AX engine does the resolve. This is the cheapest, most stable path and what most clicks should land on.
- When the AX engine returns no match, do not retry with a fuzzier selector. Look at the unified tree, pick the next-best grounding source, and call click_element with the matching index plus vision_type.
- Carry the source prefix and integer verbatim from the tree line into the click_element call. The cache topology is per-source; mixing index 25 with the wrong vision_type misses.
- Reserve raw coordinates for the case when nothing in the tree caught the target. If the agent reaches that branch, log it; it is a signal that include_omniparser or include_gemini_vision was off when it should have been on.
- Pair every action call with ui_diff_before_after: true so the agent sees the delta after the click and can decide whether to keep going on the same source or re-call get_window_tree to refresh the caches.
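Condensed into prompt form, those rules might read like this (a sketch, not shipped prompt text):

```text
Grounding priority: AX selector > AX index (#u) > DOM index (#d) >
Omniparser (#m) or Gemini vision (#v) > OCR (#o) > raw coordinates.
Copy the prefix and integer verbatim from the tree line into index
and vision_type. Never retry a failed selector with a fuzzier one;
switch grounding source instead. Set ui_diff_before_after: true on
every action. Log every raw-coordinate click.
```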
What this changes for an agent author
You stop having to commit, at agent-design time, to whether the model is "an accessibility agent" or "a vision agent." The choice moves down to per-click. Your system prompt encodes the priority order, the model emits whichever shape best matches what it just saw, and the server resolves it. The same model can run a long workflow that crosses Outlook, Word, an internal Electron app, and a Chrome tab without changing tools.
The accessibility tree stays the default because it is the cheapest and most stable input. Vision sources stay available because the AX tree alone is not enough. The router is small enough to read in one sitting, which is the whole point: it is transparent to the agent and to the engineer wiring the agent.
Wiring a computer use agent against real desktop apps?
Book a 20-minute walkthrough of the click_element router and how to drop it into your existing agent loop.
Questions developers ask before wiring this in
Frequently asked questions
Why do computer use agents need anything beyond the accessibility tree?
Because the accessibility tree is not a complete description of every pixel that matters. Office documents render their content inside a single canvas element with no AX children for individual cells, paragraphs, or shapes. Electron apps wrap a Chromium renderer and often expose only the outer window through UIA, leaving the actual UI invisible to the AX tree until the chrome.accessibility flag is on. Custom-rendered controls in games, IDEs, and design tools paint pixels directly. An agent that can only click by AX selector hits these surfaces and stops. The fix is not to give up on accessibility; it is to layer alternative grounding sources behind the same click tool so the agent can keep the AX path as default and fall through when the tree is silent.
What are the seven click paths in Terminator and where are they declared?
Three modes plus five vision sources. ClickMode (Selector, Index, Coordinates) lives at crates/terminator-mcp-agent/src/utils.rs lines 728-736. VisionType (Ocr, Omniparser, Gemini, Dom, UiTree) lives at the same file lines 1062-1067. Selector mode resolves an AX selector through the platform engine. Index mode plus UiTree resolves a 1-based clickable index against a UIA tree captured by get_window_tree. Index plus Ocr resolves against the cache populated by include_ocr. Index plus Omniparser resolves against include_omniparser. Index plus Gemini resolves against include_gemini_vision. Index plus Dom resolves against the Chrome extension's DOM bounds. Coordinates is raw screen-space x and y. All seven dispatch through one click_element function at server.rs:2486.
How does the index work for non-AX grounding sources like OCR or Omniparser?
Each grounding source caches bounds in its own map on the MCP server. server.rs holds five separate Mutex<HashMap<u32, ...>>: uia_bounds, ocr_bounds, omniparser_items, vision_items, dom_bounds. When you call get_window_tree with include_ocr or include_omniparser, the server runs the detector, assigns 1-based indices to each detected box, and stores the bounds in the matching map. The model sees lines like '#42 [Button] Save' or '#42 OCR: "Save" bounds [840,32,68,28]'. When the agent later calls click_element with index: 42 and vision_type: "omniparser", server.rs:2660 looks up entry 42 in omniparser_items, takes the box_2d, computes the center, and routes to desktop.click_at_coordinates_with_type. The agent addresses a vision-grounded element with the same call shape it would use for an AX node.
Does the agent have to choose grounding sources up front, or can it fall through?
Fall through. The recommended pattern is to call get_window_tree once with include_ocr: true, include_omniparser: true (or include_gemini_vision: true), and include_browser_dom: true if you are in a browser. The server returns a unified compact YAML with each source prefixed (#u for UIA, #o for OCR, #m for Omniparser, #v for Gemini vision, #d for DOM). The agent sees every grounded element across every source in one tree, picks the highest-confidence one for each click, and calls click_element with the matching vision_type. If the AX tree exposes the element, the agent uses index plus UiTree; if it doesn't, but OCR caught the label, the agent uses index plus Ocr; if nothing else worked but a vision model spotted an icon, the agent uses index plus Gemini. Coordinates is the last resort, used only when none of the indexed sources caught the element.
Why is index-based vision better than just clicking at raw coordinates?
Two reasons, both about agent reliability. First, the index is a stable handle the agent can carry across reasoning steps without re-deriving it. The model is more accurate calling back to '#42 the omniparser-detected Save icon at [840,32,68,28]' than computing those coordinates fresh from a screenshot every turn. Second, the server is the one that converts the index to actual screen coordinates, so DPI scaling, window offset, and any image resize the detector applied are handled in one well-tested function. Compare that to the Gemini computer use loop in terminator-computer-use, where the model emits 0-999 normalized coordinates and convert_normalized_to_screen has to apply four sequential transforms (resize_scale, DPI, window x, window y) for each click. Index dispatch skips the math; the bounds were captured at real screen coordinates from the start.
What does this look like in practice for a Claude or Gemini computer use agent?
Wire Terminator's MCP server into the agent's tool list. The agent now sees one click_element tool instead of a screenshot-coordinate tool. Its first call is get_window_tree with include_ocr: true and include_omniparser: true. The compact YAML comes back with up to five interleaved grounding sources, indexed. From that point on, every click_element call carries either a selector, an index plus vision_type, or coordinates. The agent's reasoning becomes 'this Save button is in the AX tree as #88, click that' or 'the icon I want is not in the AX tree but Omniparser caught it as #156, click that with vision_type omniparser'. The agent never has to choose at boot time whether to be an accessibility agent or a vision agent. It is both, per click.
What grounding source should the agent prefer when more than one matches?
AX tree first, DOM second (in browsers), Omniparser or Gemini-vision third, OCR fourth, raw coordinates last. The order is rank-by-stability. AX selectors and AX indices reference elements the OS itself classified, so they are robust across re-renders, DPI changes, and window resizes. DOM indices come from the live Chromium tree, also stable as long as the page hasn't navigated. Omniparser and Gemini-vision boxes are stable for as long as the captured screenshot is, then need re-grounding. OCR is the least stable because text recognition jitters around character boundaries. Raw coordinates are the most fragile because nothing on the screen is anchored to them; a window move invalidates the click. The same click_element call accepts all of them, so an agent can encode this priority in its system prompt and still emit the same JSON shape.
Does the seven-mode router add latency compared to a pure-AX or pure-vision agent?
Negligible. ClickMode::determine_mode at utils.rs:740 is a constant-time check on which arguments are present. The vision_type match in server.rs:2609 is one HashMap lookup keyed by index. The bottleneck for any of these paths is the platform-side click, not the dispatch. The interesting cost is the up-front call to get_window_tree with the include_* flags on, because OCR and Omniparser run on the captured screenshot before returning. That happens once per agent turn rather than once per click, and the result is cached and indexed for every click_element call until the next get_window_tree.
Where does Terminator's bundled gemini_computer_use loop fit, and when should I use it instead?
Use the MCP server with click_element when you are wiring an existing agent (Claude, custom OpenAI tool-use, your own harness) that already has its own reasoning loop. Use the bundled Desktop::gemini_computer_use loop in crates/terminator/src/computer_use/mod.rs:483 when you want a self-contained agent that runs end-to-end against the Gemini Computer Use API. The bundled loop is screenshot-only and uses normalized 0-999 coordinates throughout; it is the right choice if you want to ship a working agent in fifteen lines of Rust without designing your own tool layer. The MCP path is the right choice when the AX tree is the primary input and vision is a fallback. Both share the same desktop click implementation underneath.
Is the index for one source ever valid for another, or do I have to track which source produced which index?
Track it. Each grounding source has its own bounds map, and the indices are namespaced per source by the prefix on the line in get_window_tree's output (#u, #o, #m, #v, #d). Calling click_element with index: 42 and the wrong vision_type will either miss or land on a different element. In practice the agent does not re-derive this; it copies the prefix and the integer from the line it just read. The compact YAML's prefixing is what makes the seven-mode router safe to expose to an LLM without requiring it to memorize the cache topology.
Related guides on the same agent stack.
Keep reading
Accessibility API for AI agents: diff the tree, don't re-read it
Companion piece on the input loop. ui_tree_diff strips volatile #ids and bounds with two regexes, then ships only the lines that changed. 20 MCP tools accept the diff flag.
Claude computer use, the pixel-coordinate loop and the selector alternative
Claude's native computer_20251022 tool emits left_click [x, y]. Terminator's MCP gives the same model 32 selector-based tools so it can click by role and name instead.
Open source computer use AI agents, April 2026
Snapshot of which agents use accessibility APIs, which use vision, and which combine both. Where Terminator sits between the two camps, and why hybrid is winning.