Accessibility tree vs screenshot desktop automation: it is a router, not a binary choice
The framing presupposes the agent has to pick a side. Production desktop agents do not pick. Terminator's click_element is one tool whose vision_type parameter routes to five grounding sources: the accessibility tree (default), local OCR, OmniParser, Gemini Vision, and the browser DOM. The right reading of this comparison is which mode to pick per call, not which side to pick at startup.
Direct answer (verified 2026-05-04)
For surfaces that expose an accessibility provider (the bulk of Windows and macOS business apps), the accessibility tree wins. It is deterministic, ~100x faster than the screenshot path, and survives DPI, theme, and resolution changes because selectors carry semantic identity instead of pixel coordinates. Screenshot+vision wins for surfaces that have no usable AX tree: DirectX or OpenGL games, canvas-rendered design surfaces, and opaque legacy widgets that draw their own pixels. Production frameworks pick a router, not a side. Terminator's VisionType enum at crates/terminator-mcp-agent/src/utils.rs:1062 has exactly five variants and defaults to UiTree.
Authoritative sources: Terminator source, Microsoft UI Automation overview, and the Anthropic Computer Use docs for the screenshot-only baseline.
The framing is the bug
'Accessibility tree vs screenshot' reads like a debate with two sides. It is not. It is a routing decision the caller makes per click. The clearest evidence sits in the source: one tool, five labelled grounding modes, one default. The mistake the binary framing makes is treating 1990s-era line-of-business apps and 2026-era canvas-rendered design tools as the same problem. The tree handles the first cleanly and the second poorly. Vision handles the second cleanly and the first wastefully. The router framing is what falls out when you stop pretending the surface is uniform.
Two ways to frame the same question
Most articles treat 'accessibility tree vs screenshot' as a forced choice. Pick a side, take the tradeoffs, ship. The framing implies the agent commits to one grounding strategy at startup and lives with it. The reader walks away thinking they have to pick the tree (and lose canvas-rendered surfaces) or pick screenshots (and pay LLM inference on every click). Neither is what production desktop agents actually do.
- implies one strategy per agent
- treats canvas/games and business apps as the same problem
- buries the cost of LLM inference in the comparison
- ignores DPI scaling math on the screenshot side
What the router actually looks like
The five grounding modes share a tool name (click_element), a click semantics layer (left | double | right), and a result schema. They differ in what populates the index map the click reads from. Each producer is a different MCP tool and a different code path; the consumer (click_element) is one match arm.
Five vision_type values
1. ui_tree (default): Walks UIAutomation on Windows, AXUIElement on macOS. No screenshot, no LLM in the action loop. The default at utils.rs:835.
2. ocr: Captures a screenshot, runs local OCR to find text bounds. Fastest screenshot path because no remote model call is involved.
3. omniparser: Captures a screenshot, posts to the OmniParser backend at app.mediar.ai/api/omniparser/parse for icon and structural element detection.
4. gemini: Captures a screenshot, posts to the Gemini Vision backend for general-purpose element grounding by description.
5. dom: Reads the live DOM via the Chrome extension. Browser-only path, used when the AX tree under-reports a webpage.
The enum, the default, and the dispatch
Two pieces of code make the design legible: the enum with its default, quoted below, and the match arm that converts the value into a bounds lookup, sketched in the next section. Together they explain why the router never has to guess: each grounding mode owns its own index map and the caller names which one to use.
Where the modes are defined and dispatched
// crates/terminator-mcp-agent/src/utils.rs:1062
#[derive(Debug, Clone, Copy, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "lowercase")]
pub enum VisionType {
    Ocr,
    Omniparser,
    #[serde(alias = "ui_tree")]
    #[serde(alias = "uitree")]
    UiTree,
    #[serde(alias = "dom")]
    Dom,
    /// Gemini vision model elements
    #[serde(alias = "vision")]
    Gemini,
}
// utils.rs:833
/// Get vision type, defaulting to UiTree
pub fn get_vision_type(&self) -> VisionType {
    self.vision_type.unwrap_or(VisionType::UiTree)
}

One tool, two paths through it
The same tool call can resolve through the tree branch or the vision branch depending on which index map the caller is reading. The tree branch ends in a UIA pattern call inside the target process. The vision branch ends in synthetic input at coordinates the model returned. Both end in a result frame the MCP client gets back; the producer tool is what differs.
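A sketch of that dispatch, continuing the ServerState sketch above. It is reconstructed from the behaviour the FAQ describes (a per-mode lookup that returns a typed error naming the missing producer), not copied from server.rs; resolve_bounds and the String error are simplified stand-ins.

// Sketch, not the literal server.rs:2609 match arm. One consumer, five maps.
impl ServerState {
    fn resolve_bounds(&self, vision_type: VisionType, index: u32) -> Result<&Bounds, String> {
        let (map, producer) = match vision_type {
            VisionType::UiTree => (&self.uia_bounds, "get_window_tree"),
            VisionType::Ocr => (&self.ocr_bounds, "the OCR capture tool"),
            VisionType::Omniparser => (&self.omniparser_items, "the OmniParser parse tool"),
            VisionType::Gemini => (&self.vision_items, "the Gemini parse tool"),
            VisionType::Dom => (&self.dom_bounds, "the DOM snapshot tool"),
        };
        map.get(&index).ok_or_else(|| {
            // Observable failure mode: the error names the producer to call first.
            format!("index {index} not found. Call {producer} first.")
        })
    }
}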
Two clicks, one click_element
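For concreteness, here is what the two calls could look like as tool arguments, built with serde_json. The vision_type and index parameters come from the source above; click_type and the exact argument names around it are assumptions for illustration.

use serde_json::json;

fn main() {
    // Click 1: tree-grounded. Assumes get_window_tree already populated uia_bounds.
    let sidebar_click = json!({
        "tool": "click_element",
        "arguments": { "index": 12, "vision_type": "ui_tree", "click_type": "left" }
    });

    // Click 2: vision-grounded. Assumes a screenshot + Gemini parse already
    // populated vision_items. Same tool, same schema, different index map.
    let canvas_click = json!({
        "tool": "click_element",
        "arguments": { "index": 3, "vision_type": "gemini", "click_type": "left" }
    });

    println!("{sidebar_click}\n{canvas_click}");
}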
How the modes differ in practice
The modes are not symmetric. ui_tree is cheap and structural; gemini is expensive and general. Picking the right one is a tradeoff between latency, surface coverage, and the kind of error you can recover from when the lookup misses. The list below is the working reference when you sit down to write a workflow.
What each mode is honest about
- ui_tree returns a typed error if the index is missing; you call get_window_tree first to populate uia_bounds.
- ocr is local-only; no network hop, but limited to text-shaped targets.
- omniparser detects icons and non-text elements; needs network to mediar.ai.
- gemini is the most general grounding; it can find 'the trash icon next to the third row' but pays full LLM latency.
- dom only works inside browsers; complementary to ui_tree on web pages, not a replacement for the tree elsewhere.
- All five modes share the same selector grammar at the index layer; the tool surface stays one click_element.
When screenshot still wins outright
The router framing does not let the tree off the hook. There are real surfaces where the tree returns one opaque element and there is nothing more to ask it. A DirectX game window is one HWND with a frame buffer; the AX provider sees the window and nothing inside. A canvas-rendered design tool draws every layer into a single canvas element; the AX tree reports the canvas. A kiosk-mode line-of-business app sometimes ships with the UIA provider explicitly disabled, leaving SendInput at coordinates as the only way in. In all three the right vision_type is omniparser or gemini. The cost is real (one network round-trip plus model inference per click) but the alternative is an empty selector and a workflow that cannot make progress.
The honest framing in production looks like this: the tree covers roughly 95% of the clicks an agent makes against business apps, and screenshot+vision cleans up the residual 5%. The residual cannot be ignored, and it cannot be pushed onto a separate agent. It has to be available as a labelled fallback inside the same tool, with the same selector grammar at the index layer, so the agent can reach for it without context-switching. That is the shape Terminator's click_element is built around.
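Encoded as caller-side logic, that policy is a few lines. Agent, ground_via_tree, and ground_via_omniparser here are hypothetical client wrappers around the producer tools, not Terminator API; only the routing shape is the point.

// Hypothetical client-side routing: tree-default, labelled vision fallback.
struct Agent;

impl Agent {
    // Stand-ins for MCP calls: get_window_tree / screenshot + OmniParser parse.
    fn ground_via_tree(&mut self, _target: &str) -> Result<u32, String> { unimplemented!() }
    fn ground_via_omniparser(&mut self, _target: &str) -> Result<u32, String> { unimplemented!() }
    fn click(&mut self, _index: u32, _vision_type: &str) -> Result<(), String> { unimplemented!() }
}

fn click_with_fallback(agent: &mut Agent, target: &str) -> Result<(), String> {
    // ~95% path: the AX tree resolves the target; the click is a cheap pattern call.
    if let Ok(index) = agent.ground_via_tree(target) {
        return agent.click(index, "ui_tree");
    }
    // Residual ~5%: re-ground the same target from pixels, then click the
    // vision-produced index. Same tool surface, different labelled source.
    let index = agent.ground_via_omniparser(target)?;
    agent.click(index, "omniparser")
}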
What this means for picking a framework
Two reads of the same comparison usually point at two different framework styles. Vision-only frameworks (the Anthropic Computer Use beta as the canonical example, plus several MCP wrappers around pyautogui and pillow) ground every action in a screenshot and an LLM. Cost is high per click; surface coverage is uniform. Tree-only frameworks (pywinauto, FlaUI, classic UIAutomation wrappers) ground every action in the AX tree. Cost is low per click; surface coverage drops on canvas-rendered and custom-painted apps. A router framework keeps the cost-low default and adds the labelled escape hatch.
If your agent will ever touch a canvas-rendered app or a game, tree-only is going to fail you and you will end up bolting on a screenshot path under pressure. If your agent only ever touches structured business apps, vision-only is paying full LLM inference for a click that should have been a 2 ms pattern call. Picking the router up front avoids both regrets.
Caveats worth naming
A few things the binary framing tends to elide and the router framing has to handle explicitly. The accessibility tree on macOS Chrome and Safari silently no-ops on AXPress for many web views, so any production engine maintains a hardcoded browser bypass list and falls back to synthetic input. Vision models trained on one resolution scale degrade when given screenshots at a different scale; Terminator's omniparser path at server.rs lines 1141 to 1147 carries explicit DPI math to scale physical bounding boxes back to logical screen coordinates. Cross-platform coverage is uneven: terminator-rs declares the cross-platform trait at platforms/mod.rs:86 but main today ships Windows primarily, with macOS support at the core Rust level. None of these are deal-breakers; they are reasons the labelled-fallback design exists in the first place.
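The browser bypass in particular is easy to picture. A hedged sketch of its shape follows; the list contents and the function are illustrative, not Terminator's actual table.

// Illustrative shape of the caveat above: apps where AXPress silently no-ops,
// so the engine skips the pattern call and sends synthetic input instead.
const AX_PRESS_BYPASS: &[&str] = &["Google Chrome", "Safari"];

fn needs_synthetic_input(app_name: &str) -> bool {
    AX_PRESS_BYPASS.iter().any(|b| app_name.contains(b))
}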
Building an agent that has to clear both kinds of surface?
Talk through the routing model with us before you commit. Tree-default with a labelled vision fallback is what we have shipped; a call walks through the tradeoffs on your real workload.
Frequently asked questions
Which is faster, accessibility tree automation or screenshot-based automation?
The accessibility tree is faster, and the gap is structural, not implementation-specific. An AX-tree click on Windows resolves to IUIAutomationElement.FindFirst followed by IUIAutomationInvokePattern.Invoke, two COM cross-process calls that complete in single-digit milliseconds. A screenshot click has to capture the framebuffer (xcap on Windows pulls roughly 8 MB of RGBA at 1920x1080), encode to PNG, base64 the bytes, post the payload to a vision backend (OmniParser, Gemini, or local OCR), wait for the model to return bounding boxes, scale the coordinates back into screen space, then send a synthetic mouse event. Terminator's llms.txt at line 243 puts the gap at 100x ('CPU speed, not LLM inference'). The dominant cost on the screenshot path is the network round-trip and model inference, not the screenshot itself.
When should I actually use the screenshot path?
When the target surface has no accessibility provider, or has one that lies. Three categories: full-screen DirectX or OpenGL games where the OS sees one HWND containing a frame buffer; canvas-rendered design surfaces (Figma's drawing area, Photopea, web-based 3D tools) where every tool's hit region lives inside one canvas element; legacy ActiveX or custom-painted line-of-business widgets that draw their own pixels and never registered a UIA provider. In all three the AX tree returns a single opaque element and the only addressable thing is pixel coordinates. Terminator's click_element exposes vision_type='omniparser' and vision_type='gemini' for exactly these cases. Everywhere else the tree is the right default.
Can the same MCP tool do both grounding strategies in one call?
Yes, that is the design. In Terminator's MCP server, click_element is one tool with three modes (selector, index, coordinates) and the index mode takes a vision_type parameter that selects which grounding source to dereference. The enum is declared at crates/terminator-mcp-agent/src/utils.rs line 1062 with five variants: UiTree (the default at utils.rs:835), Ocr, Omniparser, Gemini, and Dom. The dispatch lives in server.rs starting at line 2609 as one match arm. The caller picks the source per click. The same agent can use ui_tree on a sidebar button and gemini on a canvas in the next call without re-instantiating anything.
Why is screenshot-based grounding still on the table if the tree is faster and more reliable?
Three honest reasons. First, the AX tree on macOS Chrome and Safari silently no-ops on AXPress for many web views, so any production desktop engine carries a hardcoded browser bypass list that falls back to synthetic input. Second, vision models like Gemini 2.0 Flash and OmniParser see things the tree does not: a numeric label drawn inside a canvas, an icon with no accessible name, a tooltip that has not yet shown. Third, vendor-built apps occasionally ship with the UIA provider disabled (older Office variants, kiosk shells, locked-down line-of-business apps), and the only entry point is the pixel grid. Production frameworks treat screenshot as a tagged fallback, not the default; that is the difference between a router and a single-source agent.
Does screenshot-based automation suffer from DPI or resolution changes?
Yes, twice. First, the captured pixel coordinates of an element shift when the user changes DPI scaling: a button at (412, 300) on a 100% scale 1920x1080 monitor lands at (515, 375) when the same window is dragged to a 200% scale 4K display. Second, vision models trained on one resolution scale degrade when given screenshots at a different resolution. Terminator's omniparser path at server.rs:1141 to 1147 carries explicit DPI scaling (`x = window_x + (box_2d[0] * inv_scale / dpi_scale_w)`) to convert physical-pixel bounding boxes back to logical screen coordinates. The accessibility tree is DPI-invariant by design: the OS reports element bounds in logical units already, and selectors carry semantic identity (role, AutomationId, name) that survives DPI, theme, and resolution changes.
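The quoted formula, restated as a function with assumed variable semantics: box_x is the model-space x coordinate, inv_scale undoes any resize applied to the screenshot before parsing, and dpi_scale_w is the monitor's horizontal DPI factor.

// Sketch of the quoted server.rs math; variable meanings are inferred, not quoted.
fn to_logical_x(window_x: f64, box_x: f64, inv_scale: f64, dpi_scale_w: f64) -> f64 {
    // x = window_x + (box_2d[0] * inv_scale / dpi_scale_w)
    window_x + (box_x * inv_scale / dpi_scale_w)
}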
Does the tree miss things the screenshot path catches?
Yes, and the gap is well-defined. Custom controls that draw their own pixels and skip the UIA provider interface show up in the tree as a single opaque element with no children: many games, some terminal emulators, certain 3D modellers, and legacy ActiveX widgets fall in this bucket. Tooltips and hover states are inconsistently exposed; the tree may not see a callout that has not been activated. Custom-rendered icons inside a canvas have no accessible name, and even when the tree sees the canvas element, it cannot tell you that the third icon from the top is the trash button. Vision models can. The right design treats screenshot-and-vision as a labelled fallback, not the default. That is what vision_type=omniparser and vision_type=gemini exist for.
How does Terminator decide which path to take if the caller does not pick one?
The caller picks one. There is no automatic selection; vision_type defaults to UiTree at utils.rs line 835 (`self.vision_type.unwrap_or(VisionType::UiTree)`). If you call get_window_tree first, the response carries indexed entries from the AX tree and the natural follow-up is click_element with vision_type=ui_tree. If you call capture_screenshot followed by an OmniParser or Gemini parse, the response carries indexed entries from the vision pass and the follow-up is click_element with vision_type=omniparser or vision_type=gemini. The server keeps five separate index maps in memory (uia_bounds, ocr_bounds, omniparser_items, vision_items, dom_bounds) keyed by the tool that populated them, so the dispatch is unambiguous.
Is screenshot-based desktop automation what Claude Computer Use does?
Yes. The Anthropic Computer Use beta operates over screenshots: the model receives a screenshot, emits pyautogui-style click(x, y) and type(text) actions, and waits for the next screenshot. This is the binary you might be reading about as 'screenshot desktop automation'. Pixel grounding is general (it works on anything visible) but slow (every action pays inference latency) and DPI-sensitive (the model has to learn the current resolution). The accessibility tree is narrower in surface coverage but deterministic on what it sees and ~100x faster per action. The honest combination: route the bulk of clicks through the tree to keep cost and latency down; route the residual through vision to cover the tree's blind spots. Terminator's terminator-computer-use crate is exactly that combination wired up.
What happens if I call vision_type=ui_tree without first calling get_window_tree?
You get a typed error, not a silent miss. The match arm at server.rs:2610 looks up the index in self.uia_bounds and returns McpError::internal_error with the message 'UI tree index N not found. Call get_window_tree first.' if the lookup fails. The same shape holds for vision_type=ocr (looks up ocr_bounds), vision_type=omniparser (omniparser_items), vision_type=gemini (vision_items), and vision_type=dom (dom_bounds). Each vision type has its own index map populated by its own producer tool. The error tells you exactly which producer to call. That is part of why the router design beats a single-source agent: the failure mode is observable instead of 'we tried the tree, fell back to vision, and silently aimed at the wrong pixel'.
Does this mean Terminator is doing screenshot automation when I use vision_type=gemini?
It is doing screenshot grounding, then dispatching the click through synthetic input at the resolved coordinates. The screenshot path takes a window or monitor capture, base64-encodes it, posts it to https://app.mediar.ai/api/vision/parse with the Gemini prompt, receives normalized bounding boxes, scales them back into screen space, and finally calls click_at_coordinates_with_type. The mouse event itself uses the same SendInput synthetic-input path that PyAutoGUI uses; Terminator does not pretend to invoke a UIA pattern when there is none to invoke. The honesty is in the labelling: vision_type=gemini means 'I am asking Gemini to find the element for me and then clicking the pixels'. vision_type=ui_tree means 'I am asking UIA to find the element and invoking the pattern'. Two different operations, one tool surface.
Adjacent reads
Accessibility tree automation vs PyAutoGUI: the two clicks are not the same operation
Companion deep-dive on the syscall layer. invoke() at element.rs:838 to 859 calls UIInvokePattern.Invoke directly. PyAutoGUI's click(x, y) always lowers to SendInput.
MCP servers vs accessibility APIs: they are different layers, not alternatives
MCP is transport. UIA and AX are the OS hooks. Terminator-rs holds the AccessibilityEngine trait. Terminator-mcp-agent has 35 #[tool(...)] handlers that all dispatch through it.
Accessibility API for computer use agents
Why a real desktop agent needs more than one grounding source. ClickMode + VisionType produces seven distinct click paths under one MCP tool.