MCP servers vs accessibility APIs: they are different layers, not alternatives
The framing buries the actual question. MCP is a protocol; an AI client like Claude Desktop or Cursor speaks it to discover and call your tools. Accessibility APIs are OS hooks; UI Automation on Windows and AXUIElement on macOS report on real UI elements and act on them. They are stacked, not opposed. The useful comparison is whether the MCP server you are picking actually wraps a real accessibility engine. Terminator's repo makes the boundary literal: one Rust crate for the engine, one for the 35-tool MCP server, and the second never touches IUIAutomation without going through the first.
Direct answer (verified 2026-05-04)
MCP servers are not a replacement for accessibility APIs. MCP is the transport and tool-discovery protocol your AI client speaks. Accessibility APIs are the OS-level interfaces (UI Automation, AXUIElement, AT-SPI) that read and act on real UI elements in running apps. A useful desktop-automation MCP server wraps a real accessibility engine and exposes its operations as typed tools. An MCP server with no engine underneath is a JSON-RPC harness that ends every call in a screenshot the LLM has to read.
Authoritative sources: Model Context Protocol architecture and Microsoft UI Automation overview.
Where each layer sits
Walking the call from a Claude tool invocation down to the bit of the OS that actually clicks the button makes the layering concrete. The MCP server is parse, validate, dispatch, and repackage. The accessibility API is the call that crosses into another process and runs the action. The engine trait between them is the seam where Terminator hides the platform code.
One click, five hops
What is actually under the protocol layer
Two MCP servers can advertise tools/list with the same names and have completely different behavior in production. The difference is what the handler dispatches into. A vision-only server ends every call at a screenshot and an LLM coordinate guess. An accessibility-backed server ends at a UIA pattern invocation in the target process. The two snippets under "Two MCP servers, same protocol, different engines" below show how the source code splits.
What is actually under a vision-only server
An MCP server in this shape exposes tools like take_screenshot, click_at(x, y), and type_text. There is no accessibility engine in the loop. Each tool call ends in a screenshot the LLM has to read, and a SendInput-style synthetic event the LLM has to aim. The protocol layer is honest, but the stack underneath is screen-and-pixel automation in a JSON-RPC suit. Failure modes inherit from PyAutoGUI: DPI-sensitive coordinates, cursor takeover, fragile template matching, and an LLM in the loop on every action.
- tools shaped like screenshot + click_at(x, y)
- no IUIAutomationElement anywhere in the source
- every action pays LLM inference latency to read pixels
- wins where the target has no AX provider (games, canvas)
What each layer contributes
MCP and accessibility APIs solve disjoint problems. Mixing them up obscures the choice you are actually making when you pick an MCP server. The split below is the working list.
| Concern | MCP servers (the protocol) | Accessibility APIs (the OS) |
|---|---|---|
| What it speaks | JSON-RPC 2.0 over stdio, SSE, or streamable HTTP | COM (UIA on Windows), Mach (AX on macOS), D-Bus (AT-SPI on Linux) |
| What it knows about the desktop | Nothing. The protocol is opaque to the target apps. | The full live UI tree: roles, names, automation IDs, bounds, focused element. |
| What you call to make a click happen | tools/call with arguments shaped by the server's schema | IUIAutomationInvokePattern::Invoke (Windows) or AXUIElementPerformAction (macOS) |
| Discovery model | tools/list returns named tools and JSON schemas an LLM client can read | Walk the tree from the desktop root or a specific process; query by control type, role, or AutomationId |
| Owns the selector grammar | No. The grammar lives in the server's tool schemas. | No. UIA gives you ConditionTree, AX gives you attribute queries; both are too low-level for a tool schema. |
| Concurrency | Per-server policy; Terminator defaults to MCP_MAX_CONCURRENT=1 because a desktop has one focused window | Calls are per-process; a UIA call into Notepad and one into Excel can overlap, but two clicks into the same app cannot |
| What “works” means | The client received a CallToolResult with status and content | The pattern call returned S_OK and the target process actually ran the default action |
| Useful without the other? | Only if some other engine is on the inside (browsers, APIs, file system, vision) | Yes. It is a normal SDK; you call it from Python or Rust without an LLM in the loop. |
The row that decides everything is “what you call to make a click happen”. MCP gives you a JSON-RPC frame. The accessibility API gives you a function that runs in the target process. Without the bottom row you have a protocol pointing at nothing.
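The "Useful without the other?" row is worth seeing in code. A minimal sketch, assuming pywinauto's UIA backend on Windows and an open Save dialog; the window title and button name are illustrative, not tied to any app in particular:

```python
# The accessibility API as a normal SDK: no MCP, no LLM in the loop.
# Assumes pywinauto's UIA backend and a Save dialog on screen; the
# title and control name here are illustrative.
from pywinauto import Desktop

dlg = Desktop(backend="uia").window(title="Save As")
# For the UIA backend, ButtonWrapper.click() runs the Invoke pattern
# in the target process rather than sending synthetic mouse input.
dlg.child_window(title="Save", control_type="Button").click()
```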
One click, the five hops it takes
The path from a Claude tool call to a real button press is a handful of layered moves. Each hop owns its own job; none of them is the whole stack.
From tools/call to UIInvokePattern.Invoke
1. Client speaks MCP: Claude Desktop, Cursor, VS Code, or Claude Code emits tools/call over stdio.
2. Server validates args: rmcp's typed Parameters extractor enforces the JSON schema before any UI work.
3. Engine trait runs: AccessibilityEngine.find_element resolves the selector to a real UIA element.
4. Pattern fires: UIInvokePattern.Invoke() crosses into the target process via COM.
5. Diff returns: CallToolResult carries success and an optional UI tree diff back up the chain.
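Hop 1 has an exact shape on the wire. A sketch of the tools/call frame an MCP client emits over the stdio transport, which carries one JSON-RPC 2.0 message per line; the id and argument values are illustrative:

```python
import json
import sys

# Hop 1 as bytes on the wire: a JSON-RPC 2.0 tools/call request.
# Field values are illustrative; the envelope shape is the MCP spec's.
frame = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {
        "name": "click_element",
        "arguments": {"process": "notepad", "selector": "role:Button|name:Save"},
    },
}

# The stdio transport is newline-delimited JSON: one frame per line.
sys.stdout.write(json.dumps(frame) + "\n")
sys.stdout.flush()
```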
What lives where, in plain terms
A short working list to keep the layers separate when you are picking tooling.
Layer responsibilities
- MCP defines tools/list, tools/call, prompts, and resources over stdio, SSE, or streamable HTTP.
- Accessibility APIs are OS-level interfaces: UI Automation on Windows, AXUIElement on macOS, AT-SPI on Linux.
- The MCP server is the seam: it owns the schema, the selector grammar, the error model, and concurrency.
- The engine layer is the work: it owns FindFirst, pattern lookups, and the synthetic-input fallback path.
- If the MCP server has no engine, every tool call is a screenshot + LLM coordinate guess.
- If the engine has no MCP server, you have a normal SDK, not a tool an AI client can discover.
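For readers who think in code, here is a hypothetical Python mirror of that seam. The real trait is the Rust one at platforms/mod.rs:86; the method names below come from it, but the signatures are assumptions, not Terminator's API:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class AccessibilityEngine(ABC):
    """Hypothetical Python mirror of the Rust AccessibilityEngine trait.

    The MCP layer owns schemas, transport, and concurrency; everything
    that crosses into another process lives behind methods like these.
    """

    @abstractmethod
    def find_element(self, selector: str, process: Optional[str] = None) -> Any:
        """Resolve a selector like role:Button|name:Save to a live element."""

    @abstractmethod
    def get_window_tree(self, process: str) -> dict:
        """Dump roles, names, and AutomationIds for the LLM to ground against."""

    @abstractmethod
    def click_at_coordinates(self, x: int, y: int) -> None:
        """Synthetic-input fallback for AX-empty surfaces."""
```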
Two MCP servers, same protocol, different engines
Both snippets are valid MCP servers. Both pass schema validation, both register with Claude Desktop, both expose tools the LLM can call. The difference is what the tool handlers actually do. The first one ends every call at pyautogui. The second routes through a real AccessibilityEngine trait declared at crates/terminator/src/platforms/mod.rs:86.
MCP server, with and without an engine underneath
```python
# A vision-only MCP server. The protocol layer is honest;
# the stack underneath is pyautogui + screenshots.
import pyautogui
from PIL import ImageGrab
from mcp.server.fastmcp import FastMCP

server = FastMCP("vision-only")

@server.tool()
async def take_screenshot() -> bytes:
    return ImageGrab.grab().tobytes()

@server.tool()
async def click_at(x: int, y: int) -> dict:
    pyautogui.click(x, y)  # SendInput, no element
    return {"clicked": [x, y]}

@server.tool()
async def type_text(text: str) -> dict:
    pyautogui.typewrite(text)  # virtual key presses
    return {"typed": text}

# the LLM has to:
# 1) call take_screenshot
# 2) read the bytes
# 3) reason about coordinates
# 4) call click_at(x, y) and hope the click landed
#
# every action pays LLM inference. DPI changes break it.
# zero accessibility-API calls anywhere in this server.
```
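The promised counterpart follows: the same protocol surface, with handlers that dispatch into a real engine. Terminator's actual handlers are Rust routed through the AccessibilityEngine trait; this is a Python stand-in sketch that assumes pywinauto's UIA backend, and the tool bodies are illustrative, not Terminator's code:

```python
# An accessibility-backed MCP server: same protocol, different engine.
# Sketch only. pywinauto's UIA backend stands in for the engine layer;
# Terminator's real handlers are Rust behind the AccessibilityEngine trait.
from mcp.server.fastmcp import FastMCP
from pywinauto import Desktop

server = FastMCP("ax-backed")

@server.tool()
async def get_window_tree(process: str) -> dict:
    """Dump roles and names so the LLM grounds selectors against real elements."""
    win = Desktop(backend="uia").window(best_match=process)
    return {c.window_text(): c.element_info.control_type for c in win.descendants()}

@server.tool()
async def click_element(process: str, button: str) -> dict:
    """Resolve a live UIA element and fire its Invoke pattern. No coordinates."""
    win = Desktop(backend="uia").window(best_match=process)
    # For the UIA backend, ButtonWrapper.click() goes through the Invoke
    # pattern in the target process, not synthetic mouse input.
    win.child_window(title=button, control_type="Button").click()
    return {"invoked": button}

# no screenshot round-trip, no coordinate guess: the element is resolved
# from the live UI tree and the action runs inside the target process.
```

Why split the engine and the MCP server into separate crates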
The Terminator workspace at version 0.24.32 ships two crates that matter for this question. The first is terminator-rs (path: crates/terminator). It declares pub trait AccessibilityEngine, shaped to be cross-platform, at src/platforms/mod.rs line 86, with methods like find_element, get_window_tree, get_focused_element, and click_at_coordinates. The Windows implementation under src/platforms/windows wires uiautomation-rs and is what fires real UIA pattern calls.
The second is terminator-mcp-agent (path: crates/terminator-mcp-agent). Its src/server.rs is roughly 11,000 lines, and a grep for #[tool( returns 35 hits, one per MCP tool exposed to the client. Every one of those handlers parses arguments through rmcp's typed Parameters extractor, then dispatches through the engine. The MCP layer never calls IUIAutomation directly. That is what the seam exists for: the protocol surface can grow new tools without churning UIA code, and the engine can absorb UIA quirks without breaking the wire schema.
The honest caveat: today the engine is Windows-only. src/platforms/mod.rs lines 319 to 320 emit compile_error!("Terminator only supports Windows. Linux and macOS are not supported.") on any non-Windows target. The trait is shaped to be cross-platform, the locator grammar is platform-neutral, and previous versions of the repo had a macOS implementation, but the version on main today does not ship one. Read the trait, decide if the seam is the right one for your stack, and if you need macOS today, this specific MCP server is not it. If you need Windows desktop automation behind an MCP server, the UIA path is the production one and is what 35 tools route through.
“Most playbooks frame this as MCP or accessibility APIs. In production it is MCP plus a real engine, or it is screenshots in a JSON-RPC suit.”
From the engine trait at platforms/mod.rs:86
Picking an MCP server when desktop control is the actual goal
A short field guide if you are evaluating MCP servers for a desktop-driving agent rather than a browser one. Every one of these is a question about the engine, not the protocol.
- Does the source import a real OS automation library? Look for `uiautomation`, `pywinauto`, or `FlaUI` on Windows; `AXUIElement`, `atomacos`, or `pyobjc-framework-ApplicationServices` on macOS. If the only imports are `pyautogui` and a screenshot library, it is a vision-only server.
- Does the click tool take a selector or coordinates? A selector signature (`role:Button|name:Save`) means there is a tree-walking layer. A coordinates-only signature (`x, y`) means the LLM owns the visual reasoning on every call.
- Is there a `get_window_tree` tool? A tree dump that returns roles, names, AutomationIds, and indexed clickable elements is what makes selector-mode and index-mode click tools work. Without it, the LLM has nothing to ground against.
- What is the concurrency default? A desktop has one mouse, one keyboard, and one focused window. Servers that fan out tool calls without thinking about focus will corrupt UI state. Terminator defaults to `MCP_MAX_CONCURRENT=1`.
- Is there a fallback path for opaque targets? Real production work runs into AX-empty surfaces (games, canvas apps, custom-rendered controls). The honest answer is a layered stack: tree first, then OCR or DOM index, then vision-LLM grounding, then raw coordinates. Servers that admit this and route around the gaps tend to work in production. The audit sketch after this list automates the first three checks from the client side.
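A sketch of that audit, using the official mcp Python SDK to fetch tools/list from any stdio server. The tool-name heuristics are ours, and the server command line is whatever you are evaluating:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Heuristic buckets mirroring the checklist above; the names are illustrative.
AX_SHAPED = {"get_window_tree", "click_element", "set_value", "expand_collapse"}
VISION_SHAPED = {"take_screenshot", "click_at", "type_text"}

async def audit(command: str, args: list[str]) -> None:
    # Launch the server under test and speak MCP to it over stdio.
    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            names = {t.name for t in (await session.list_tools()).tools}
            print("tree/selector tools:", sorted(names & AX_SHAPED))
            print("pixel-only tools:  ", sorted(names & VISION_SHAPED))

asyncio.run(audit("npx", ["-y", "terminator-mcp-agent@latest"]))
```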
Driving Windows apps from an AI agent? Talk to us.
Walk through the engine boundary, the 35 MCP tools, and the production traps in the path from tools/call to UIInvokePattern.Invoke.
Frequently asked
Are MCP servers a replacement for accessibility APIs?
No. They sit at different layers and answer different questions. MCP (Model Context Protocol) is a transport spec from Anthropic that defines how an AI client (Claude Desktop, Cursor, VS Code, Claude Code, Windsurf) discovers and calls tools exposed by a server, with a typed JSON schema for arguments and results. Accessibility APIs are OS interfaces (UI Automation on Windows, AXUIElement on macOS, AT-SPI on Linux) that report on the live UI tree of running applications and let you act on elements (invoke, set value, expand, focus). An MCP server with no accessibility engine underneath is a JSON-RPC harness with nothing real to call. An accessibility engine with no MCP server is a normal SDK; you import it from Python or Rust and call methods on it. The useful product is the stack: an MCP server whose tools are thin handlers over a real accessibility engine.
What does an MCP server contribute that the accessibility API alone does not?
Three things, all on the agent side, none of them about the desktop. First, discovery: MCP defines tools/list and tools/call so an LLM client can enumerate what the server exposes without hardcoding. Second, schema: each tool ships a JSON schema that the client uses to constrain the LLM's tool calls and reject malformed arguments before they touch the desktop. Third, transport: MCP runs over stdio, SSE, or streamable HTTP, so the same server can be embedded next to Claude Desktop or sit behind an authenticated HTTP endpoint. None of this is desktop functionality. It is the interop layer that lets an LLM, running anywhere, call into your accessibility engine without bespoke glue.
What does an accessibility API contribute that MCP alone does not?
Everything that actually moves bits in another process. UIA on Windows resolves an IUIAutomationElement for a button, asks for its IUIAutomationInvokePattern, and calls Invoke. The call crosses into the target process via COM and runs the button's default action. AXUIElement on macOS does the same shape with AXUIElementCopyAttributeValue and AXUIElementPerformAction. These are the calls that talk to running apps. MCP cannot do any of it on its own; the protocol does not know what a Notepad save button is. The accessibility API is what makes the click real. The MCP server's job is to turn 'click_element with selector role:Button|name:Save in process notepad' into the right sequence of UIA calls and report back what happened.
How can I tell whether an MCP server is actually using accessibility APIs or is just pasting screenshots?
Read the source if it is open. Look for direct imports of the OS automation library: uiautomation, pywinauto, FlaUI, or the raw Win32 UIAutomationCore on Windows; AppKit AXUIElement, ApplicationServices.framework AXUIElementCopyAttributeValue, or atomacos on macOS. If the only imports are pyautogui, pillow, opencv, or a vision model client, the server is doing screen-and-pixel automation and the MCP wrapper is an illusion of structure on top. Run the server and inspect a tools/list response: a real accessibility-backed server exposes tools shaped like get_window_tree, click_element with a selector grammar, set_value, expand_collapse. A vision-only server's tools tend to be take_screenshot, click_at, type_text, and the LLM has to reason about coordinates from raw image bytes.
Why does Terminator split the engine and the MCP server into two separate Rust crates?
Because they answer different questions and have different release cycles. The terminator-rs crate (publish name on crates.io) holds the platform code: the AccessibilityEngine trait at crates/terminator/src/platforms/mod.rs line 86, the Windows implementation under crates/terminator/src/platforms/windows that wires uiautomation-rs, the locator grammar in selector.rs, and the synthetic input fallback in input.rs. It can be used as a normal Rust SDK with no LLM in the loop, and it has Python bindings (terminator-py). The terminator-mcp-agent crate is an axum and rmcp service whose only job is to expose 35 tools to an MCP client and route each one through the engine. The MCP layer can grow new tools without changing the AX layer, and the AX layer can fix UIA quirks without breaking the protocol contract. If they were one crate, every MCP feature would touch the platform code and every UIA fix would force an MCP version bump.
Does Terminator support macOS today?
Not from this repo, today. The current state is Windows-only and the AccessibilityEngine trait declaration is followed by a hard compile gate at crates/terminator/src/platforms/mod.rs lines 319 to 320: compile_error!("Terminator only supports Windows. Linux and macOS are not supported."). The trait is shaped to be cross-platform (find_element, get_window_tree, get_focused_element are platform-neutral signatures), and previous versions of the repo had a macos.rs implementation, but the version on main right now does not ship one. If you need macOS desktop automation through accessibility APIs from an MCP server today, that is not what this codebase covers. If you need Windows, the UIA path is the production one and is what 35 MCP tools route through.
If MCP is just transport, why does it matter which MCP server I pick?
Because the server is more than transport. It owns the selector grammar, the error model, the concurrency contract, and the side-effect surface. A click_element tool that takes a raw HWND and pixel coordinates puts the burden on the LLM to reason about windows. A click_element tool that takes a selector like role:Button|name:Save and a process scope puts the burden on the server to walk the AX tree and find the right element. A server with MCP_MAX_CONCURRENT defaulting to 1 (Terminator's default in main.rs) will serialize tool calls because a desktop has one focused window; a server that fans out tool calls without thinking about focus will corrupt your UI state when the agent calls two tools in parallel. The protocol does not impose any of this; the server design does.
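The concurrency contract is small enough to sketch. This is illustrative Python, not Terminator's main.rs; the semaphore stands in for whatever MCP_MAX_CONCURRENT=1 compiles down to:

```python
import asyncio

# Illustrative shape of MCP_MAX_CONCURRENT=1: one tool call touches the
# desktop at a time, because there is one focused window. The names here
# are hypothetical, not Terminator's code.
_desktop = asyncio.Semaphore(1)

async def run_tool(handler, arguments: dict):
    async with _desktop:  # serialize: parallel clicks corrupt focus state
        return await handler(**arguments)
```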
What does one click look like, end to end, in this stack?
An MCP client (say Claude Desktop) sends a tools/call JSON-RPC frame: { name: "click_element", arguments: { process: "notepad", selector: "role:Button|name:Save" } }. The Terminator MCP server in server.rs at the click_element handler (line 2486) parses arguments via rmcp's typed Parameters extractor, picks a ClickMode (Selector vs Index vs Coordinates), and calls into the AccessibilityEngine. On Windows, the engine resolves the selector through the WindowsEngine implementation that wraps uiautomation-rs, which calls IUIAutomationElement->FindFirst with the appropriate condition tree, gets the IUIAutomationInvokePattern, and calls Invoke. The pattern call crosses into Notepad's process, runs the button's default action, and returns. The MCP server packs a CallToolResult with success status and any UI diff, and sends the response back to Claude. The protocol carries the request and the response. The accessibility API is what made the click happen.
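The frame that closes the loop has a fixed shape as well. A sketch of the CallToolResult response heading back over stdio; the envelope fields follow the MCP spec, while the payload text is illustrative:

```python
# Hop 5 on the wire: the tools/call response. The id pairs it with the
# request; content and isError follow MCP's CallToolResult shape, and
# the payload text is illustrative.
response = {
    "jsonrpc": "2.0",
    "id": 7,
    "result": {
        "content": [{"type": "text", "text": "clicked role:Button|name:Save"}],
        "isError": False,
    },
}
```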
Can an LLM use an accessibility API directly without an MCP server in the middle?
Yes, if you write the glue yourself. An LLM is just a function that returns text or tool-shaped JSON. Anthropic's Computer Use beta lets Claude take screenshots and emit pyautogui-style clicks; OpenAI's function calling lets you expose any Python function as a tool. You can hand-roll a function tool that wraps pywinauto or atomacos and skip MCP entirely. What you give up is portability. Your tool definition is bound to one client SDK, the schema lives in your codebase, and switching clients (Claude Desktop to Cursor to Windsurf) means re-wiring the tool registration. MCP is an interop bet: write the server once, plug it into any MCP-aware client. Terminator's MCP server registers in Claude with claude mcp add terminator 'npx -y terminator-mcp-agent@latest' and the same binary works in Cursor and VS Code with config-file edits.
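What the hand-rolled path looks like, for scale. This sketch wires one bespoke tool through the Anthropic Messages API and pywinauto; every schema field and helper here is ours, which is exactly the portability cost the answer describes, and the model name is illustrative:

```python
import anthropic
from pywinauto import Desktop

client = anthropic.Anthropic()

# One bespoke tool definition, bound to this codebase and this client SDK.
click_button = {
    "name": "click_button",
    "description": "Click a button in a window, resolved via the UIA tree.",
    "input_schema": {
        "type": "object",
        "properties": {"window": {"type": "string"}, "button": {"type": "string"}},
        "required": ["window", "button"],
    },
}

msg = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    tools=[click_button],
    messages=[{"role": "user", "content": "Save the open Notepad file."}],
)

# Dispatch the tool call ourselves: the glue MCP would otherwise own.
for block in msg.content:
    if block.type == "tool_use" and block.name == "click_button":
        win = Desktop(backend="uia").window(best_match=block.input["window"])
        win.child_window(title=block.input["button"], control_type="Button").click()
```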
Where does this leave Playwright? Is Playwright an accessibility API or an MCP server?
Neither, originally. Playwright is a browser automation library that drives Chromium, Firefox, and WebKit through their CDP and similar protocols. It happens to use the browser's accessibility tree as one of its locator strategies (page.getByRole), which is why people call it an accessibility-tree library by analogy. Playwright MCP is a wrapper around Playwright that exposes browser automation as MCP tools. So the stack for Playwright MCP is: MCP transport, Playwright Node API, Chrome DevTools Protocol, browser process. No OS accessibility API is involved on the browser path. Terminator's stack is: MCP transport, terminator-rs SDK, Windows UI Automation, target app process. They are siblings, not the same thing. Browser-only automation: Playwright and Playwright MCP are usually the right choice. Native desktop automation: a UIA-backed MCP server is the right choice.
Companion deep-dives on the same stack.
Keep reading
Accessibility tree automation vs PyAutoGUI: the two clicks are not the same operation
Companion deep-dive on the syscall layer. invoke() at element.rs:838 to 859 calls UIInvokePattern.Invoke directly. PyAutoGUI's click(x, y) always lowers to SendInput.
Accessibility API for computer use agents: the seven-mode click_element router
Why a real desktop agent needs more than one grounding source. ClickMode + VisionType produces seven distinct click paths under one MCP tool. utils.rs:728 and 1062.
The best MCP server is the one shaped like the resource it controls
Concurrency shape beats tool count. A desktop has one mouse, one keyboard, one focused window. Terminator defaults MCP_MAX_CONCURRENT=1; here is the main.rs code.