MCP desktop accessibility automation: an honest reader’s guide to the holes
Almost every guide on this combination sells the upside. Playwright for the OS. Accessibility tree as the new DOM. Computer use without the screenshots. All true. None of them tell you which parts of that picture do not work yet, which is the part you actually need before you ship a workflow against a real desktop. This page walks through five concrete holes in Terminator’s MCP server, each with the file and line that proves it, and what the workaround looks like today.
Direct answer (verified 2026-05-08)
MCP desktop accessibility automation is an MCP server that exposes the OS accessibility tree (UI Automation on Windows, AXUIElement on macOS) as named tools an LLM client can call, so a coding agent can read native-app UI by role and name and act on it without OCR or pixel matching.
Terminator is one open-source implementation. Install:
claude mcp add terminator "npx -y terminator-mcp-agent@latest"

Source on GitHub: mediar-ai/terminator. MIT licensed. The install one-liner is taken verbatim from crates/terminator-mcp-agent/README.md:19.
Why a tour of the holes is more useful than another tour of the features
The features are well covered. The accessibility tree is structural; it survives DPI, theme, scroll, and most localisation. UIA exposes Control Patterns (Invoke, Toggle, Value) that fire writes without moving the cursor. The MCP server bundles roughly 35 tools end to end and routes them through one dispatch table. All of that is real and you can read about it on a dozen pages including a few on this site.
What you cannot easily find is the inverse list: the cfg-gated subsets that quietly return success on the platform they do not support, the protocol features the most popular MCP clients have not shipped yet, the parsing edge cases the LLM will produce on the first try, the surfaces the accessibility tree never exposes. That is what this page is. Each section names one hole, points at the source line that pins it down, and notes what to do today.
Hole 1: most MCP clients still cannot pause a tool to ask the user a question
The MCP spec has a primitive called elicitation. The server can pause a tool, send the client a structured question with a JSON schema for the expected answer, and resume the tool when the user fills it in. It is the right primitive for “I see seven Save buttons in the tree, which one did you mean?” and “this step deletes a row, are you sure?”. Terminator defines six elicitation schemas at crates/terminator-mcp-agent/src/elicitation/schemas.rs: WorkflowContext, ElementDisambiguation, ErrorRecoveryChoice, ActionConfirmation, SelectorRefinement, UserResponse.
The hole is that the documentation comment in elicitation/mod.rs lines 14 to 18 says verbatim:
“As of December 2025, Claude Desktop and Claude Code do not yet support elicitation. The implementation includes graceful fallback for unsupported clients.”
The graceful fallback is real: if the calling peer does not support elicitation, the helper at helpers.rs line 30 returns the supplied default and the tool keeps going. There is also a stored-peer pattern at helpers.rs line 90 that lets a separately connected peer (a UI app like mediar-app) field the question even when the calling peer (Claude Code) cannot. That is the workaround you have today: connect two peers to the same MCP server and let the elicitation-capable one carry the questions. If you only have Claude Code, the schemas are inert.
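The ask-or-default shape is simple enough to sketch. This is a minimal sketch of the pattern with hypothetical names (the real helper at helpers.rs line 30 does an actual MCP elicitation round trip; `Peer`, `ask`, and `ask_or_default` here are stand-ins):

```rust
/// Hypothetical stand-in for the MCP peer. In the real server, support is
/// discovered from the client's advertised capabilities at connect time.
struct Peer {
    supports_elicitation: bool,
}

impl Peer {
    /// Stubbed question round trip. The real version sends a JSON-schema
    /// question over the wire and blocks the tool until the user answers.
    fn ask(&self, _question: &str) -> Option<String> {
        if self.supports_elicitation {
            Some("user answer".to_string())
        } else {
            None // client cannot field the question
        }
    }
}

/// The graceful-fallback pattern: forward the question if the peer can take
/// it, otherwise return the supplied default and let the tool keep going.
fn ask_or_default(peer: &Peer, question: &str, default: &str) -> String {
    match peer.ask(question) {
        Some(answer) => answer,
        None => default.to_string(),
    }
}
```

With a Claude Code peer today, every call takes the `None` arm, which is exactly why the six schemas are inert there.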
Hole 2: the on-screen inspector overlay is Windows-only, and macOS calls return success while doing nothing
The inspect overlay is the killer feature for grounding the LLM’s tree to the user’s monitor: pass show_overlay:"ui_tree" to get_window_tree and labelled rectangles get drawn over the live desktop in seven possible label modes. The implementation is Win32 layered windows with WS_EX_LAYERED + WS_EX_TRANSPARENT + WS_EX_TOPMOST.
The cfg gate at crates/terminator-mcp-agent/src/server.rs:1717 reads:
#[cfg(target_os = "windows")]
if let Some(ref overlay_type) = args.show_overlay {
    // ...build labelled rectangles, paint them with GDI...
}

The matching tool to clear the overlay is hide_inspect_overlay at server.rs:5769. Read what it does on macOS:
async fn hide_inspect_overlay(&self) -> Result<CallToolResult, McpError> {
    #[cfg(target_os = "windows")]
    {
        terminator::hide_inspect_overlay();
        info!("Signaled inspect overlay to close");
    }
    Ok(CallToolResult::success(vec![Content::json(json!({
        "action": "hide_inspect_overlay",
        "status": "executed_without_error",
        "message": "Inspect overlay hidden"
    }))?]))
}

On macOS the cfg block is skipped, the function falls through, and it returns status: "executed_without_error" with message: "Inspect overlay hidden". The user sees nothing because there was no overlay. Same shape on the show path: passing show_overlay:"ui_tree" returns the JSON tree fine, but the rectangles never appear on your monitor. The non-visual subset (validate_element, wait_for_element, click_element, type_into_element, invoke_element) is at full parity. If your workflow assumes the rectangles, your workflow is Windows-only even when the rest of it is not.
What ships on each platform

- Tool surface: identical on Windows and macOS. Same names, same args.
- JSON tree: returned on both. UIA on Windows, AXUIElement on macOS.
- Element actions: click, type, invoke, validate, wait. Both platforms.
- Inspector overlay: Windows only. The macOS path returns success and does nothing.
Hole 3: parse_duration accepts “1.5s” and “500ms”, but not “1m30s”
Lots of MCP tools take a duration argument: delay, wait_for_element, retry budgets inside execute_sequence. The parser at crates/terminator-mcp-agent/src/duration_parser.rs lines 5 to 29 splits the input on the first alphabetic character and matches the unit against a fixed list (ms, s, sec, m, min, h, hr).
What works: "500" (treated as ms), "1500ms", "2.5s", "0.5m", "2h".
What does not work: "1m30s". The splitter pulls 1 as the number, takes m30s as the unit, fails the match, and bubbles Unknown time unit: m30s up the stack. If your workflow has a retry budget like “give it a minute and a half”, you have to write 90s or 90000ms by hand. Cosmetic, but it bites the first time an LLM writes the natural form and the typecheck pass does not catch it. The fix is a four-line change to the splitter, and the issue is open territory if you want to PR it.
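The fix is to run the splitter in a loop instead of once. A sketch of a multi-segment variant (hypothetical name `parse_duration_multi`, not the repo's function), keeping the same unit table and the bare-number-is-milliseconds rule:

```rust
use std::time::Duration;

/// Sketch of a multi-segment duration parser: accepts "1m30s" by looping
/// over number+unit pairs instead of splitting once at the first letter.
fn parse_duration_multi(input: &str) -> Result<Duration, String> {
    let s = input.trim();
    if s.is_empty() {
        return Err("empty duration".to_string());
    }
    // Bare number: treated as milliseconds, matching the existing behaviour.
    if let Ok(ms) = s.parse::<f64>() {
        return Ok(Duration::from_secs_f64(ms / 1000.0));
    }
    let mut total = Duration::ZERO;
    let mut rest: &str = s;
    while !rest.is_empty() {
        // Split the next run of digits/dot from the next run of letters.
        let num_end = rest
            .find(|c: char| c.is_ascii_alphabetic())
            .ok_or_else(|| format!("trailing number without unit: {rest}"))?;
        let (num, tail) = rest.split_at(num_end);
        let unit_end = tail
            .find(|c: char| !c.is_ascii_alphabetic())
            .unwrap_or(tail.len());
        let (unit, remainder) = tail.split_at(unit_end);
        let value: f64 = num.parse().map_err(|_| format!("bad number: {num}"))?;
        let secs = match unit {
            "ms" => value / 1000.0,
            "s" | "sec" => value,
            "m" | "min" => value * 60.0,
            "h" | "hr" => value * 3600.0,
            _ => return Err(format!("Unknown time unit: {unit}")),
        };
        total += Duration::from_secs_f64(secs);
        rest = remainder; // continue with the next number+unit pair, if any
    }
    Ok(total)
}
```

With this shape, "1m30s" resolves to 90 seconds, and every string the current parser accepts still parses the same way.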
Hole 4: surfaces the accessibility tree never sees
There are three classes of UI that no MCP-over-accessibility server can reach without a fallback. They are not Terminator-specific; they are the floor of the abstraction.
- Game canvases. Unity, Unreal, web-canvas applets render to a single texture. The OS sees one Canvas role with no children. There is nothing in the tree to find.
- Custom-rendered controls. Some Electron apps override the renderer in ways that drop UIA bindings. Some in-house Win32 apps never set up a UIA provider for the controls they draw with GDI. Some Qt builds expose a generic Pane and stop. The tree shows a container; the meaningful elements inside are absent.
- Document interiors. A specific cell in an Excel sheet, a specific run of text in a Word document, a specific shape on a PowerPoint slide. These exist in the tree, but addressing them needs Text and Range patterns, not just role+name. The surface is real but the locator vocabulary is different.
Terminator’s answer is the multi-source fallback inside get_window_tree: include_ocr, include_omniparser, include_gemini_vision, include_browser_dom. They produce a clustered tree the agent can reason about even when the AX tree is silent. But the right framing is still: the answer is grounding fallbacks, not magic. If the surface is invisible to both the tree and a vision model on a screenshot, no MCP server gets you in.
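The cascade logic is worth making concrete. A hypothetical sketch of the ordering (the types and function here are illustrations, not Terminator's API; the real options are the `include_ocr` / `include_omniparser` / `include_gemini_vision` / `include_browser_dom` flags on get_window_tree):

```rust
/// Hypothetical labelled region from one grounding source.
#[derive(Debug, Clone, PartialEq)]
struct Region {
    label: String,
    source: &'static str,
}

/// Sketch of the fallback ordering: cheap structural sources first, a
/// vision model last because of its latency. The real server clusters
/// results from multiple sources rather than picking one winner.
fn ground(ax_tree: Vec<Region>, ocr: Vec<Region>, vision: Vec<Region>) -> Vec<Region> {
    if !ax_tree.is_empty() {
        return ax_tree; // accessibility tree: structural, single-digit ms
    }
    if !ocr.is_empty() {
        return ocr; // OCR: recovers text labels from custom-rendered UI
    }
    vision // vision model: slowest, but sees anything that is on screen
}
```

The point of the ordering is the latency gradient: the agent only pays for a screenshot round trip when the tree comes back silent.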
Hole 5: this is a per-host process, not a hosted RPA platform
The MCP agent is a per-host binary. It binds to stdio when launched by Claude Code, or to localhost HTTP/SSE on the same machine. Workflow state checkpoints are local-only files at ~/Library/Application Support/mediar/workflows/<folder>/state.json on macOS or %LOCALAPPDATA%\mediar\workflows\<folder>\state.json on Windows. The survival path resumes on the same machine, against the same desktop.
What you do not get out of the box: a multi-tenant dispatcher, a queue, a centralised run history server, RBAC, audit logs, per-tenant credential vaults. If your purchase gate is hosted bot orchestration, this is the wrong shape. The product config calls this out as a disqualifier: “enterprise RPA buyer whose gate is a hosted platform with bot orchestration, RBAC, and audit logs”. Distributed execution is on you to build on top, the same way you would on top of a Playwright fleet.
“It does not need to be magic. It needs to be honest about what it is, and to expose the seams so the next person can fix them.”
Terminator project ethos
Verify the holes yourself in five minutes
The whole verification path is five commands. Clone the repo, run grep, read the comments. Every claim above is a file plus a line you can open.
Reproduce checklist
- claude mcp add terminator "npx -y terminator-mcp-agent@latest"
- grep -n 'cfg(target_os' crates/terminator-mcp-agent/src/server.rs
- cat crates/terminator-mcp-agent/src/elicitation/mod.rs | head -20
- cat crates/terminator-mcp-agent/src/duration_parser.rs
- Reproduce hole four: open a Unity game and call get_window_tree; the Canvas has no children
Where this page does not go
Three things that matter for production but already have their own treatment elsewhere.
- Workflow checkpointing (state.json after every step that mutates env). This is the survival path; without it, a 20-step workflow that fails at step 14 replays from step 1.
- The dev-tools loop exposed by the MCP server (inspect overlay, highlight, validate, wait, clear). The Chrome DevTools shape, but for the desktop accessibility tree.
- Control Patterns vs synthesised input (invoke fires the pattern in-process, click does SendInput). When to prefer which.
“The bit I appreciated was the cfg-gated subset list. I had assumed the macOS path was at parity and only noticed when an integration test passed locally and a teammate could not see anything on their screen.”
Want to talk through which holes hit your workflow?
If you are evaluating MCP desktop automation for a real internal tool, a short call is the fastest way to figure out whether the gaps above are blockers or footnotes for what you are building.
Frequently asked questions
What does "MCP desktop accessibility automation" actually mean?
It is the layered abstraction made of three things. First, an OS accessibility tree (UI Automation on Windows, AXUIElement on macOS, AT-SPI2 on Linux), which is the same tree a screen reader uses, populated by every well-behaved native app. Second, a process that drives that tree by role and name instead of pixels, so a click does not break under DPI, theme, or scroll. Third, an MCP (Model Context Protocol) server that exposes the read and write primitives as named tools an LLM client (Claude Code, Cursor, Windsurf, VS Code, Claude Desktop) can call as part of a task. Terminator is one open source implementation; the install is `claude mcp add terminator "npx -y terminator-mcp-agent@latest"` per crates/terminator-mcp-agent/README.md line 19.
Why is the accessibility tree the right substrate, instead of screenshots and click-by-coordinate?
Two reasons. The tree is structural and stable: a Save button is found by role:Button|name:Save regardless of where it sits on the screen, what monitor it landed on, what scale factor the user picked, what theme is active, or what language the OS is set to. Screenshots flip on every one of those axes. The other reason is latency. A round trip to a vision model on a screenshot is 500 to 1500 ms; an accessibility tree fetch from a 200-element window on Windows takes single-digit ms. The trade-off is that surfaces the OS does not classify (game canvases, custom-rendered controls, complex Office documents at the cell level) are invisible. A real production agent does both, and falls through from the tree to vision when the tree is silent.
What does a single MCP tool call against the accessibility tree actually do?
Take click_element with selector role:Button|name:Save and a process name. The tool walks down to the platform adapter (UIAutomationElement on Windows, AXUIElement on macOS), runs find_and_execute_with_retry_with_fallback against the live tree, and once it has resolved the element, fires either UIInvokePattern (if the element exposes it) or a SendInput synthesised mouse click at the element's bounds. The retry loop is what hides the moment a window has not finished loading. The pattern call is what hides the cursor and lets the click run in the background without taking focus. From the LLM's side, all of that is one tool call; the call returns a structured result with the element it found, what it tried, and any error class. It is a long way from PyAutoGUI's pyautogui.click(x, y).
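The retry loop's shape can be sketched in a few lines (a hypothetical helper; the repo's find_and_execute_with_retry_with_fallback layers selector fallbacks on top of this basic poll-until-deadline shape):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Sketch of a poll-with-budget loop: re-query the live tree until the
/// selector resolves or the time budget runs out. This is what hides the
/// moment a window has not finished loading from the calling LLM.
fn find_with_retry<T>(
    budget: Duration,
    poll: Duration,
    mut find: impl FnMut() -> Option<T>,
) -> Result<T, String> {
    let deadline = Instant::now() + budget;
    loop {
        if let Some(element) = find() {
            return Ok(element); // element appeared: proceed with the action
        }
        if Instant::now() >= deadline {
            return Err("element not found within budget".to_string());
        }
        sleep(poll); // window may still be rendering; try again shortly
    }
}
```

From the tool's point of view the whole loop is invisible: the call either returns a resolved element or a structured timeout error the LLM can reason about.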
Hole one. Does every connected MCP client actually support all the features this server exposes?
No, and the gap is bigger than people realise. As of December 2025 the documentation comment at crates/terminator-mcp-agent/src/elicitation/mod.rs lines 14 to 18 says verbatim: "As of December 2025, Claude Desktop and Claude Code do not yet support elicitation." Elicitation is the MCP protocol primitive that lets the server pause a tool, send the client a structured question ("which of these seven matching buttons do you mean?"), and resume with the user's answer. The server has six elicitation schemas defined in elicitation/schemas.rs. If you are connected from Claude Code today, none of them activate; the helper at helpers.rs line 30 calls peer.supports_elicitation(), gets back false, and returns the supplied default. The graceful fallback is real, but the disambiguation experience the schemas were designed for is not happening.
Hole two. Is the on-screen inspector overlay actually cross platform?
No. It is Windows-only Win32 layered windows. The cfg gate at crates/terminator-mcp-agent/src/server.rs line 1717 reads `#[cfg(target_os = "windows")]` and wraps the entire show_overlay branch of get_window_tree. The hide_inspect_overlay tool at server.rs line 5769 is worse: on macOS the cfg block at line 5770 is skipped, the function falls through, and it returns `status: "executed_without_error"` with `message: "Inspect overlay hidden"`. The user sees nothing because there was no overlay to hide in the first place. Same for the show path: passing `show_overlay:"ui_tree"` on macOS gets you the tree JSON back fine, but the rectangles never appear on your monitor. The MCP tool surface is identical on both platforms; the visual artifacts are not. If your workflow assumes the labelled rectangles, your workflow is Windows-only even when the rest of it isn't.
Hole three. Can the MCP server time durations the way a workflow author expects?
It accepts single-unit strings only. parse_duration at crates/terminator-mcp-agent/src/duration_parser.rs lines 5 to 29 splits the input on the first alphabetic character and matches the unit against a fixed list (ms, s, sec, m, min, h, hr). A bare number is treated as milliseconds. So "1500ms", "2.5s", "30", "1.5m", "2h" all work. "1m30s" does not, because the splitter pulls "1" as the number, takes "m30s" as the unit, fails the match, and bubbles "Unknown time unit: m30s" up the stack. If your workflow has retry budgets like "give it a minute and a half", you have to write 90s or 90000ms by hand. Cosmetic, but it bites the first time an LLM writes the natural form and the typecheck pass does not catch it.
Hole four. What about apps the accessibility tree does not see at all?
There are three classes that no MCP-over-accessibility server can reach without a fallback. Game canvases (Unity, Unreal, web-canvas based applets) render to a texture, so the OS sees one giant Canvas role with no children. Custom-rendered controls (Electron apps that override the renderer, in-house Win32 controls that never set up a UIA provider, some Qt builds) report a generic Pane and stop. Document interiors (a specific cell in an Excel sheet, a specific run of text in a Word document) are present in the tree but addressing them requires the Text or Range pattern, not just role+name. Terminator's answer to all three is the multi-source fallback inside get_window_tree (`include_ocr`, `include_omniparser`, `include_gemini_vision`, `include_browser_dom`) which produces a clustered tree the agent can reason about. But the answer is grounding fallbacks, not magic. If the surface is not in the tree and not in the screenshot, no MCP server gets you in.
Hole five. The MCP server runs locally; is there a story for orchestrating runs across multiple machines?
Not out of the box, no. Terminator's MCP agent is a per-host process. It binds to stdio (when launched by Claude Code or similar), or to localhost HTTP/SSE on the same machine. The cross-host story today is a workflow file checked into a repo and executed by an agent on whichever host the workflow targets. There is no built-in dispatcher, no queue, no run history server, no RBAC. If you need bot orchestration with audit logs and per-tenant credentials, this is a developer framework, not a hosted RPA platform; the qualification field in the product config calls this out as a disqualifier ("enterprise RPA buyer whose gate is a hosted platform with bot orchestration, RBAC, and audit logs"). Workflow state checkpoints are local-only; the survival path resumes on the same machine. Distributed execution is on you to build on top.
Given the holes, what is the actual recommendation for a developer landing here from Reddit?
If you are building an AI agent that needs to drive native Windows apps and you are comfortable with a developer framework, the MCP server is in genuinely usable shape: the read+write loop, the diff-aware action results, the workflow checkpointing, the typed selectors, the multi-source grounding fallback, and the 35+ tool surface all hold up under real workflows. If your workflow is macOS-first and you depend on the visual inspector, today you write your own mac-side overlay or wait. If your client is Claude Code and you expected pause-and-ask interactivity, today you bake the disambiguation into the prompt up front. If your goal is an enterprise RPA platform with hosted orchestration, this is the wrong shape. The repo is at github.com/mediar-ai/terminator, MIT licensed, and the issues tab is the right place to push back on any of the above with a concrete workflow.
How do I install the MCP server and verify the holes for myself in five minutes?
Three steps. (1) `claude mcp add terminator "npx -y terminator-mcp-agent@latest"` to register the server with Claude Code. (2) Restart Claude Code, ask the model `list the tools you have from terminator` and confirm you see get_window_tree, click_element, validate_element, execute_sequence, ask_user and the rest. (3) Clone the repo, `git clone https://github.com/mediar-ai/terminator`, then `grep -n 'cfg(target_os' crates/terminator-mcp-agent/src/server.rs` and read each hit; that is the entire surface area where macOS and Windows diverge. The whole verification is faster than reading the README in full.
Adjacent reading on this site
MCP dev tools for desktop accessibility, by tool name
Inspector overlay, highlight_element, validate_element, wait_for_element, stop_highlighting, with the file and line for each tool.
Browser MCP to desktop automation: replace the dispatch root
Why a browser MCP server cannot grow into a desktop one. The dispatch root is the difference; navigate_browser and open_application are sibling arms in one match block.
Desktop accessibility automation agents: the survival path
How execute_sequence persists env to state.json after every step that mutates it, so the rerun resumes at the failed step instead of replaying from step 1.