
AX tree vs screenshot for computer use on Mac. The choice is per-element.

Almost every comparison of these two approaches treats them as an architectural coin flip: pick one for the agent, run with it. On macOS that framing breaks in production because three specific quirks of the Mac Accessibility API silently flip the right answer on roughly half the surfaces a real desktop agent encounters in a typical user session. This page is about what those quirks are, why they make per-call routing the only honest shape, and what a router that handles all three looks like in code.

Direct answer (verified 2026-05-12)

Use the AX tree by default for native Cocoa surfaces (every AppKit and SwiftUI app: Notes, Mail, Calendar, Finder, System Settings, Xcode). Fall back to screenshot plus a vision model (OCR, OmniParser, or Gemini) per call when AX returns one opaque AXGroup or six nested AXGroups with no titles, which is what Chromium-rendered apps (Slack, Discord, VS Code, Cursor, ChatGPT desktop, Claude desktop) and custom-painted canvases (Figma, Photoshop, games) actually return on Mac AX. The decision belongs at the call level, not the task level. Source for the routing shape: crates/terminator-mcp-agent/src/utils.rs lines 1060 to 1073, the VisionType enum with five values defaulting to UiTree.

Authoritative macOS AX reference: developer.apple.com/documentation/applicationservices/axuielement_h. Terminator core (cross-platform trait, AX tree walker with window-traversal workaround): github.com/mediar-ai/terminator.

The two regimes, on actual Mac surfaces

The abstract comparison (faster, cheaper, more deterministic) is real, but it hides what actually happens when you walk the tree on a Mac. Compare the AX subtrees that two real apps return when your agent walks them with full accessibility permission granted. The first is a native Cocoa app, and the tree is doing the job. The second is the Electron app a user opens five minutes later, with the same permission and the same walker, and the tree has nothing for the agent to grab.

Same AX walk, two surfaces

# Native Cocoa surface (Notes.app, top-level document window)
# Output of AXUIElementCopyAttributeValue walked recursively.
# This is what the agent reads instead of a 2576px PNG.
AXWindow "Untitled" (focused)
  AXScrollArea
    AXTextArea value: "Buy milk, eggs, sourdough" focusable
  AXToolbar
    AXButton title: "Share" enabled
    AXButton title: "Done" enabled
    AXPopUpButton title: "Format" enabled
  AXSplitGroup
    AXOutline title: "Folders"
      AXOutlineRow title: "On My Mac" selected
      AXOutlineRow title: "iCloud"
# Every node has a role (AXButton, AXTextArea, AXOutlineRow), a stable
# accessible name (the title), a bounds rect, and a flag set. The model
# clicks by saying "AXButton title=Share" and the resolver calls
# AXUIElementPerformAction(elt, kAXPressAction). No pixels involved.

every node has a role below AXGroup
accessible names are stable and localized correctly
click by AXPress runs in microseconds

Three quirks the rhetoric does not mention

The latency and token numbers you see quoted for AX-vs-screenshot (50ms vs 5s, kilobytes vs megabytes) are real but they assume the tree is there. On Mac specifically, three things change whether that assumption holds. Production agents fail in different ways at each of them, and the only honest fix is to detect the failure shape and route around it per call.

The macOS AX quirks every Mac agent project hits

Default TreeWalker does not traverse windows. The first line of crates/terminator/src/platforms/tree_search.rs in the Terminator repo is literally a TLDR comment about this: macOS AX gives you children of one window via kAXChildrenAttribute, but to see all an app's windows you have to read kAXWindowsAttribute from the app element and walk each one. Miss that and your agent thinks the app only has whatever window happens to be returned first.
System-wide focus is permission-gated. Reading kAXFocusedUIElementAttribute from the element that AXUIElementCreateSystemWide returns fails with kAXErrorAPIDisabled until your binary appears in System Settings, Privacy and Security, Accessibility, with the checkbox toggled on. The first run of any Mac AX agent fails silently here. You need to call AXIsProcessTrustedWithOptions on every cold start.
Electron and Chromium webviews collapse to AXGroup soup. Chromium speaks AX on Mac, but the bridge maps the DOM to a stack of nested AXGroups with no titles, no AXIdentifier, and no role information below Group. Slack, Discord, VS Code, Notion desktop, ChatGPT desktop, Claude desktop, Cursor: all of these return an unusable subtree even with permission granted.
Custom-painted surfaces are one opaque element. Games, Figma's canvas, Photoshop's document area, terminal emulators with GPU compositors, anything rendered through Metal or OpenGL: AX sees one AXGroup with the full window bounds. The tree is technically there, but there is nothing in it to click.
What you keep when AX works: deterministic targeting that survives DPI changes, dark mode, localization, and theme animations. Native Cocoa apps (Notes, Mail, Calendar, Reminders, System Settings, Finder, every AppKit and SwiftUI surface) expose role plus accessible name on every interactive element, and the click happens via kAXPressAction in microseconds with no model turn spent grounding from pixels.
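
The failure shapes above can be told apart with a small heuristic over the walked tree. The following is a minimal sketch, not Terminator's code: AxNode, TreeShape, and classify are hypothetical names, and the rule (a subtree is usable if any non-AXGroup node carries a non-empty title) is an assumption drawn from the descriptions on this page.

```rust
/// Simplified stand-in for one accessibility node (hypothetical type).
#[derive(Debug, Clone)]
pub struct AxNode {
    pub role: String,
    pub title: Option<String>,
    pub children: Vec<AxNode>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TreeShape {
    /// At least one non-AXGroup node has a non-empty title.
    Usable,
    /// Several nodes, none of them a titled non-group: Chromium-style soup.
    GroupSoup,
    /// Zero or one node in total: a custom-painted or empty surface.
    OpaqueSurface,
}

/// True if this node or any descendant is a titled, non-AXGroup element.
fn has_named_target(node: &AxNode) -> bool {
    let named = node.role != "AXGroup"
        && node.title.as_deref().map_or(false, |t| !t.trim().is_empty());
    named || node.children.iter().any(has_named_target)
}

/// Total number of nodes in a forest.
fn count_nodes(nodes: &[AxNode]) -> usize {
    nodes.iter().map(|n| 1 + count_nodes(&n.children)).sum()
}

/// Classify the nodes found directly under a window.
pub fn classify(window_children: &[AxNode]) -> TreeShape {
    if window_children.iter().any(has_named_target) {
        TreeShape::Usable
    } else if count_nodes(window_children) <= 1 {
        TreeShape::OpaqueSurface
    } else {
        TreeShape::GroupSoup
    }
}
```

A router would treat Usable as "stay on the AX path" and the other two shapes as a cue to set a vision fallback on the next call. Real trees may need more signals (values, descriptions, identifiers) than this sketch checks.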

The quirk written into the source

The first line of crates/terminator/src/platforms/tree_search.rs in the Terminator repo is a TLDR comment about the most subtle of these:

/// TLDR: default TreeWalker does not traverse windows,
/// so we need to traverse windows manually
use accessibility::{AXAttribute, AXUIElement, AXUIElementAttributes, Error};
// ...

pub struct TreeWalkerWithWindows {
    attr_children: AXAttribute<CFArray<AXUIElement>>,
    visited: RefCell<HashSet<AXUIElementWrapper>>,
    cycle_count: RefCell<usize>,
}

That is a workaround for an OS API behavior. macOS AX gives you children of a single window when you read kAXChildrenAttribute on an app element, but most real Mac apps run with several windows alive at once: the main document, an inspector, a sheet, a preferences panel, a contextual palette. To see them all the walker has to read kAXWindowsAttribute on the app element and walk every window separately, then dedupe with a visited set because window children can reference each other. The default walker in any off-the-shelf accessibility crate does not do this. An agent built on top of it silently misses every interactive element that is not in the first window returned, and the model has no signal that anything is missing.

This is the kind of thing the rhetoric never names because it lives one level below "use the accessibility tree." But it is exactly the kind of failure that turns a Mac agent demo into something that does not work on a user's machine, where the user has Calendar open with the inspector and the agent looks at the wrong window for an hour.
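
The window-traversal workaround described above can be illustrated without touching the real API. This is a minimal sketch over an in-memory graph, not the TreeWalkerWithWindows implementation: MockApp, NodeId, and walk_all_windows are hypothetical, with integer ids standing in for AXUIElement handles and the windows field standing in for what kAXWindowsAttribute would list.

```rust
use std::collections::{HashMap, HashSet};

type NodeId = u32;

/// Hypothetical in-memory stand-in for an app's accessibility graph.
pub struct MockApp {
    /// Top-level windows, as kAXWindowsAttribute on the app element would list them.
    pub windows: Vec<NodeId>,
    /// Children of each element, keyed by element id.
    pub children: HashMap<NodeId, Vec<NodeId>>,
}

/// Walk every window depth-first, visiting each element at most once
/// even when windows reference each other's nodes.
pub fn walk_all_windows(app: &MockApp) -> Vec<NodeId> {
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    for &window in &app.windows {
        let mut stack = vec![window];
        while let Some(id) = stack.pop() {
            // insert returns false when the id was already seen: skip it.
            if !visited.insert(id) {
                continue;
            }
            order.push(id);
            if let Some(kids) = app.children.get(&id) {
                // Push in reverse so children pop in their listed order.
                stack.extend(kids.iter().rev().copied());
            }
        }
    }
    order
}
```

The visited set is what keeps the walk finite when a child points back at a window, which is the same role the visited field plays in the struct quoted above.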

The router, drawn

Once you accept the routing is per-call, the architecture is a small piece of code at the tool layer of the MCP server. Every click_element invocation carries an optional vision_type parameter. When omitted, the server uses the AX tree. When the agent has seen an empty subtree or a single opaque AXGroup on the previous call, it sets vision_type: "omniparser" or vision_type: "gemini" and the same tool call runs through the vision path instead.

Per-element routing on a Mac agent step

Agent asks for window tree: call get_window_tree on the focused app.

Native Cocoa? AX wins: AXWindow with named children, click by role+title via kAXPressAction.

Electron or webview? The recursive AX walk returns nested AXGroups with no titles.

Route to vision fallback: call click_element with vision_type: Omniparser or Gemini.

Custom-painted canvas? One opaque AXGroup for the entire surface: vision_type: Ocr or Gemini.

The five values of VisionType in utils.rs (lines 1060 to 1073) cover the realistic Mac surfaces: UiTree for AX, Ocr for surfaces where the only signal is rendered text (terminal, custom labels), Omniparser for general visual grounding when AX is empty, Gemini for vision-language grounding when OmniParser is not enough, and Dom for browser DOM when the agent has a webdriver attached. The default returned by get_vision_type (around line 834) is UiTree, which is the right default for any task that opens with a native Cocoa app.
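
As a rough illustration of that shape, here is a sketch of a five-value enum that defaults to UiTree, plus a fallback order per surface. It is a reconstruction from the description on this page, not a copy of utils.rs: the variant names follow the text, while Surface, resolve, and fallback_order are hypothetical helpers.

```rust
/// Five grounding mechanisms, defaulting to the AX tree.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VisionType {
    Ocr,
    Omniparser,
    #[default]
    UiTree,
    Dom,
    Gemini,
}

/// Hypothetical coarse label for the surface under the cursor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Surface {
    NativeCocoa,
    ChromiumWebview,
    CustomCanvas,
}

/// An omitted per-call parameter resolves to the default (UiTree).
pub fn resolve(requested: Option<VisionType>) -> VisionType {
    requested.unwrap_or_default()
}

/// Mechanisms to try, in order, for a given surface (assumed ordering).
pub fn fallback_order(surface: Surface) -> &'static [VisionType] {
    match surface {
        Surface::NativeCocoa => &[VisionType::UiTree],
        Surface::ChromiumWebview => &[VisionType::Omniparser, VisionType::Gemini],
        Surface::CustomCanvas => &[VisionType::Ocr, VisionType::Gemini],
    }
}
```

Keeping the default on the enum itself means a caller that never mentions vision_type stays on the tree path without any extra branching.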

What the switch looks like on the wire

One round-trip walks the AX tree and finds nothing. The next call from the agent flips to vision_type Omniparser and the click happens. The model spent zero pixels on Notes a second earlier; it spends pixels here because here is where pixels are the only signal.

Slack on Mac: AX returns nothing, agent falls back to Omniparser
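
To make the two calls concrete, the sketch below builds the request bodies as JSON text. The envelope is the generic MCP tools/call shape; the argument names selector and vision_type come from the description on this page, and the selector string is a placeholder, so the real Terminator schema may differ. click_request and json_escape are hypothetical helpers.

```rust
/// Minimal JSON string escaping for quotes, backslashes, and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a tools/call request for click_element.
/// vision_type is omitted from the arguments when None, so the server default applies.
pub fn click_request(id: u64, selector: &str, vision_type: Option<&str>) -> String {
    let mut args = format!(r#""selector":"{}""#, json_escape(selector));
    if let Some(vt) = vision_type {
        args.push_str(&format!(r#","vision_type":"{}""#, json_escape(vt)));
    }
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"click_element","arguments":{{{args}}}}}}}"#
    )
}
```

The first call would be click_request(1, "Send", None), which leaves the grounding choice to the server default; after an empty result, the agent retries with click_request(2, "Send", Some("omniparser")) and nothing else about the call changes.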

Wire it up on a fresh Mac agent

Four steps that hold whether you build on top of an existing framework or roll your own AX shim. The first one is the step that silently fails on day one if you skip it.

AX-first, vision-fallback, per-call routing

1. Grant accessibility permission. Call AXIsProcessTrustedWithOptions on cold start; surface a setup screen if it returns false.
2. Walk windows explicitly. Read kAXWindowsAttribute from the app element. Do not rely on the default child walk.
3. Try AX first. If the subtree has at least one element with a named role below AXGroup, target it structurally.
4. Fall back per call. If AX returns AXGroup soup, the same click_element call switches to vision_type: Omniparser or Gemini.
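
The four steps compose into one small function at the tool layer. The following is a minimal sketch under stated assumptions: the trusted flag stands in for the result of AXIsProcessTrustedWithOptions, the two closures stand in for the real AX and vision click paths, and ClickError, Grounding, and click_with_fallback are hypothetical names rather than Terminator APIs.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ClickError {
    /// Accessibility permission has not been granted yet.
    NotTrusted,
    /// The chosen path found nothing matching the selector.
    NoMatch,
}

/// Which grounding path ended up performing the click.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grounding {
    AxTree,
    Vision,
}

/// AX-first click that falls back to vision for this one call only.
pub fn click_with_fallback<A, V>(
    trusted: bool,
    selector: &str,
    mut ax_click: A,
    mut vision_click: V,
) -> Result<Grounding, ClickError>
where
    A: FnMut(&str) -> Result<(), ClickError>,
    V: FnMut(&str) -> Result<(), ClickError>,
{
    // Step 1: stop early so the caller can show a setup screen.
    if !trusted {
        return Err(ClickError::NotTrusted);
    }
    // Steps 2 and 3: the AX closure is expected to walk all windows itself.
    match ax_click(selector) {
        Ok(()) => Ok(Grounding::AxTree),
        // Step 4: only a structural miss triggers the vision path.
        Err(ClickError::NoMatch) => vision_click(selector).map(|()| Grounding::Vision),
        Err(other) => Err(other),
    }
}
```

Because the fallback is decided inside a single call, a later click on a native surface starts on the AX path again with no state to reset.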

Terminator's published MCP server runs this shape today, but on Windows only: npx -y terminator-mcp-agent@latest ships a Windows binary, and the workspace mod.rs at line 320 still carries compile_error!("Terminator only supports Windows...") for the non-Windows targets. The macOS code paths exist in the repo (tree_search.rs, element.rs cfg blocks, the AccessibilityEngine trait at platforms/mod.rs line 86) but the binary is not yet shipping. If you need this routing on Mac today, the architecture is here to copy: read the AX shim files for the window-traversal logic, wire OmniParser or Gemini for the vision fallback, and keep the per-call switch at the tool-router layer.

Numbers worth carrying

Four small numbers from the source code, not benchmarks. They define the shape of the router and where the gaps still live.

5 vision_type values in Terminator's MCP server (Ocr, Omniparser, UiTree, Dom, Gemini)

1 default vision_type on click_element: UiTree, set in utils.rs at get_vision_type

320: the line in platforms/mod.rs where the workspace compile_error! still blocks non-Windows builds

0 screenshot bytes sent when AX has a named role and AXPress is supported

Why this matters more on Mac than on Windows

Windows UIA is mature and covers WinUI, WPF, WinForms, MFC, and most major IDEs natively with AutomationId attributes that survive localization and theme changes. The AX-everywhere default works for the majority of Windows surfaces. Mac is the opposite shape. Native Cocoa apps are well-covered by AX, but the user spends half their day in Chromium-rendered Electron apps (Slack, Discord, VS Code, Cursor, ChatGPT desktop, Claude desktop, Notion, Linear, Figma's app shell, Spotify), and Chromium on Mac maps the DOM to AXGroup soup. The AX-everywhere default is wrong for that half. Routing per call is not a fancy optimization; it is the only way to keep the agent useful across the apps a user actually opens.


Questions about AX tree vs screenshot for Mac computer use

When should a Mac computer use agent prefer AX tree over screenshot?

When the target subtree contains at least one element with a named role below AXGroup and a non-empty accessible name. That is roughly every native Cocoa surface: Notes, Mail, Calendar, Reminders, Finder, System Settings, every AppKit and SwiftUI app, plus Xcode and Office for Mac. On those surfaces, AX gives you role, title, bounds, and state in a few kilobytes of structured text per window, and the click happens via AXUIElementPerformAction with kAXPressAction in microseconds. The agent reads structured text instead of grounding from a 2576px PNG every turn.

When should the same agent fall back to screenshot plus vision?

Three real surfaces. First, Electron apps and Chromium-embedded webviews (Slack, Discord, VS Code, Notion, Cursor, ChatGPT desktop, Claude desktop): macOS Chromium maps the DOM to nested AXGroups with no titles, no AXIdentifier, and no role information. Even with full accessibility permission, the agent gets unactionable subtree shapes. Second, custom-painted surfaces: games rendered through Metal, Figma's canvas, Photoshop's document area, terminal emulators with GPU compositors. AX sees one opaque element with the window bounds and nothing inside. Third, image interpretation: any task where the user asks the agent to read a chart, an image embed, or visually compare two states. The tree does not carry that information.

Is the right decision per-task or per-element?

Per-element. A single task on Mac frequently spans both regimes: open Slack (Electron, vision), copy a message (vision), switch to Notes (AX), paste and save (AX). Routing at the task level forces you to use vision everywhere just because one surface in the task needed it, which is wasteful for the AX-friendly steps and slow for the agent loop. Routing at the call level keeps each tool invocation honest. Terminator exposes this as the vision_type parameter on click_element with five values (Ocr, Omniparser, UiTree, Dom, Gemini) that the agent selects per call, defaulting to UiTree when omitted. Source: crates/terminator-mcp-agent/src/utils.rs lines 1060 to 1073.

What is the specific macOS AX quirk most agents miss?

The default TreeWalker does not traverse windows. macOS AX returns children of one window when you read kAXChildrenAttribute on an app element, but a real Mac app frequently has several windows (the main window plus inspectors, palettes, modal sheets, preferences). To see them all you have to read kAXWindowsAttribute and walk each window separately. Terminator's repo has this written into its first comment: crates/terminator/src/platforms/tree_search.rs line 1 reads 'TLDR: default TreeWalker does not traverse windows, so we need to traverse windows manually.' The file ships a TreeWalkerWithWindows struct that enumerates kAXWindowsAttribute siblings and walks them all. If your agent uses an off-the-shelf accessibility crate without this workaround, you will silently miss elements in roughly half the apps your users open.

Does Terminator ship a macOS binary today?

Not yet. The Rust core has cross-platform scaffolding: AccessibilityEngine is shaped as a trait at crates/terminator/src/platforms/mod.rs line 86, the tree_search.rs file uses the accessibility crate to walk AXUIElement trees, and several files (element.rs around line 1883, lib.rs around line 1567) have target_os = "macos" code paths. But the same mod.rs ends at lines 318 to 320 with #[cfg(not(target_os = "windows"))] compile_error!("Terminator only supports Windows. Linux and macOS are not supported."). The published binaries on npm (terminator-mcp-agent) and pip (terminator-py) are Windows-only as of this writing. If you need this routing shape on Mac today, the architecture is here to copy: use atomacos or a pyobjc shim around AXUIElement for the AX path, wire OmniParser or Gemini for the vision path, and route per call.

Why is the screenshot path slower and more expensive on Mac specifically?

Two reasons. First, Retina. The built-in display on a 16-inch MacBook Pro is 3456 by 2234 physical pixels, which is roughly 7.7 megapixels at native resolution and around 3 megapixels after Anthropic's 2576px long-edge downscale. Every screenshot is large, every encode round-trips through PNG, every upload spends real bandwidth. Second, the round-trip shape. Native Anthropic computer use is screenshot, act, screenshot, where the post-action screenshot is mandatory because the tool result for left_click does not echo state. Every action costs two round-trips and one vision pass. The AX path on Mac is one Mach call, one InvokePattern equivalent (AXUIElementPerformAction), and one optional tree re-read. The AX path runs in microseconds; the screenshot path runs in hundreds of milliseconds plus vision pricing.

Does the AX tree get any of the visual context the screenshot does?

Not for free, and the absences matter. The tree gives you role, name, bounds, value, and state. It does not give you color, font, image content, custom-drawn icons, or anything rendered into a canvas. If your agent task involves 'click the red button' or 'find the chart that is trending down' the tree is silent on the discriminator. Production agents handle this in two ways. The first is a small per-element vision call on demand: when the agent has a target named element in the AX tree but needs to disambiguate among several matches, it takes a cropped screenshot of the bounds and asks the vision model to pick. The second is the explicit vision_type fallback per call when no usable AX subtree exists. Both are cheaper than 'screenshot every turn' but recover the visual signal where the task needs it.
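
The per-element vision call needs a crop rectangle in screenshot pixels, while AX bounds are reported in points. Below is a minimal sketch of that conversion, assuming a uniform backing scale factor (2.0 on a typical Retina display) and a top-left origin for both spaces; Rect, Crop, and crop_for_element are hypothetical names.

```rust
/// Element bounds in points, top-left origin (assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

/// Crop region in screenshot pixels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Crop {
    pub x: u32,
    pub y: u32,
    pub width: u32,
    pub height: u32,
}

/// Convert AX bounds to a padded pixel crop clamped to the screenshot.
/// Returns None when the element lies entirely outside the image.
pub fn crop_for_element(
    bounds: Rect,
    scale: f64,
    padding_px: f64,
    shot_width: u32,
    shot_height: u32,
) -> Option<Crop> {
    let left = (bounds.x * scale - padding_px).floor().max(0.0);
    let top = (bounds.y * scale - padding_px).floor().max(0.0);
    let right = ((bounds.x + bounds.width) * scale + padding_px)
        .ceil()
        .min(shot_width as f64);
    let bottom = ((bounds.y + bounds.height) * scale + padding_px)
        .ceil()
        .min(shot_height as f64);
    if right <= left || bottom <= top {
        return None;
    }
    Some(Crop {
        x: left as u32,
        y: top as u32,
        width: (right - left) as u32,
        height: (bottom - top) as u32,
    })
}
```

A few pixels of padding keep borders and focus rings in frame, which tends to help the vision model tell near-identical candidates apart while still sending a small fraction of a full screenshot.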

What is the failure mode of 'AX everywhere' on a real Mac desktop?

Three modes, in roughly the order users hit them. One: silent permission failure. AXUIElementCopyAttributeValue against another process returns kAXErrorAPIDisabled until your binary is in System Settings, Privacy and Security, Accessibility. A first-run agent that does not detect this with AXIsProcessTrustedWithOptions tells the model 'the app has no UI' and the loop dies confused. Two: window blindness. The default child walk misses non-main windows, so the agent thinks Calendar has no inspector or that Mail has no compose window. Three: Electron silence. The agent enters Slack and the tree is six nested AXGroups with no titles. The model emits 'click the Send button' and the resolver returns no matches, then synthesizes a click at coordinates the model invented from nothing. The fix in all three cases is the same: detect each failure shape and route to the vision_type fallback before the agent thinks the surface is empty.

What is the failure mode of 'screenshot everywhere' on a real Mac desktop?

The opposite shape. Every turn costs a screenshot, an encode, a vision pass, and a coordinate decision the model has to ground from raw pixels. Most native Mac surfaces are stable AX-friendly apps where the model spends its entire vision budget re-finding a button it could have looked up by role plus title for zero tokens. Latency stacks up too: a 100-action task at 2 seconds per action is 200 seconds of vision; the same task with AX routing on the native surfaces and vision routing on the two Electron surfaces is around 30 seconds. The model's accuracy also dips: a 5-pixel tooltip shift, a theme animation mid-capture, a transient hover state, all of these turn into 'click missed' on the screenshot path but never reach the AX path because the tree is theme-independent.

What is the practical default for a new Mac agent project?

AX-first with vision-fallback per call. Default every click and read to the AX tree. Detect the three failure shapes (permission denied, AXGroup soup, single opaque AXGroup) at the tool-router layer and switch to vision_type Omniparser or Gemini on those specific calls. Keep the same selector grammar on top so the agent code reads the same on both branches. The Terminator MCP server's click_element tool is shaped this way: it takes a selector, an optional vision_type, and runs the tree path by default. The agent gets to pick the grounding mechanism without restructuring the prompt or the tool surface.

