Desktop automation tools, reconsidered: one clustered view over five perception sources
Every roundup of desktop automation tools forces you into the same binary: accessibility APIs (WinAppDriver, UiPath, Power Automate Desktop) or pixels (SikuliX, Claude computer use, OpenAI Operator). Each approach has a real blind spot. This guide is about a third option: fuse both, tag every detection with its source, and let the agent pick a prefix.
The binary everyone else forces on you
Search for desktop automation tools and the same twelve products keep appearing. Each one commits to a single way of seeing the screen. That commitment is the thing an AI coding assistant has to live with once it starts running in a loop.
What every shortlist gets wrong
- Accessibility-first tools miss canvas, WebGL, and remote desktop
- Vision-first tools miss is_enabled, AutomationId, and the accessible name
- Neither approach helps when both sources partially fail
- Running both separately forces the agent to pick one up front, often wrongly
A tool that queries UIA will never see the pixel-drawn chart axis your workflow needs to click. A tool that screenshots and sends the image to a vision model will never know whether a button is actually disabled or just styled gray. If your agent has to pick one or the other at configuration time, it will be wrong on a non-trivial slice of real applications.
The anchor: five sources, one prefix each, one cluster per pixel
Terminator runs every perception method the tool supports and merges them. The core type that makes this work lives in crates/terminator/src/tree_formatter.rs. It is an enum called ElementSource with five variants: Uia, Dom, Ocr, Omniparser, Gemini. Each variant is assigned a one-character prefix: u, d, o, p, g. Every detection the tool makes from any source becomes a UnifiedElement tagged with its source and a source-local index, plus a bounding box in screen coordinates.
The agent never sees five separate trees. It sees one YAML, clustered by spatial proximity. The function that does the clustering is cluster_elements and the decision function next to it is should_cluster. That is the anchor fact for this guide.
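The shape of that type, translated into a Python sketch. Field names follow what this guide describes (source, index, display_type, text, description, bounds); the real Rust definitions in tree_formatter.rs may differ in detail:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class ElementSource(Enum):
    UIA = "u"
    DOM = "d"
    OCR = "o"
    OMNIPARSER = "p"
    GEMINI = "g"

@dataclass
class UnifiedElement:
    source: ElementSource
    index: int                    # source-local: #u1 and #d1 can coexist
    display_type: str             # e.g. "Button", "OcrWord"
    text: Optional[str]
    description: Optional[str]
    bounds: Tuple[float, float, float, float]  # (x, y, w, h), screen coords

    def prefixed_index(self) -> str:
        return f"#{self.source.value}{self.index}"

btn = UnifiedElement(ElementSource.UIA, 7, "Button", "Submit", None,
                     (450, 600, 60, 24))
print(btn.prefixed_index())  # → "#u7"
```

Because indices are source-local, a UIA detection and a DOM detection of the same button keep distinct handles (#u7, #d3) even though they share one bounding region.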
The five perception sources, each tagged with its prefix
Every source earns its place by covering a failure mode of the others. Running all five only matters because real desktops fail them in different ways on different apps.
UIA: structured metadata
Role, AccessibleName, AutomationId, BoundingRectangle, IsEnabled, IsKeyboardFocusable. Batched into one CacheRequest so a full subtree costs one COM round-trip instead of thousands.
DOM: the browser truth
A Chrome extension injects into the page and reports tag, identifier, and viewport-aligned bounds. Catches anything the browser renders outside UIA's reach, including shadow DOM and canvas overlays.
OCR: text the others missed
Tesseract runs on a captured window screenshot. Essential for Citrix, remote desktop, and any custom-drawn control that does not expose an accessible name.
Omniparser: icon and widget vision
A local model that detects clickable regions by appearance, not by text. Ships bounding boxes plus a label (icon, button, chart). Works on WebGL canvases and paint-mode UIs.
Gemini: natural-language descriptions
Optional. When enabled, returns 2D boxes with element_type, content, and a short description like 'primary call to action, filled'. Useful when the agent needs to disambiguate between visually similar controls.
Cluster: one view over all five
Each detection is turned into a UnifiedElement with source, index, display_type, text, description, bounds. Clusters emit a centroid header and a prefixed-index line per element. That is what the model sees.
How the clustering actually runs
Five separate perception pipelines produce detections. A merging step collapses detections that sit on top of each other into one group. The merging step is not magic; it is a union-find with a generous distance threshold.
Perception sources merge into one clustered view
From five pipelines to one YAML
Each source writes detections to its own cache
UIA walks the accessibility tree via one IUIAutomationCacheRequest. DOM comes from the browser extension injecting a hit-test script. OCR rasterizes the window and runs Tesseract. Omniparser runs a local vision model. Gemini returns structured JSON. Every detection carries a bounding box in screen coordinates.
ElementSource tags each detection with a prefix
UIA gets 'u', DOM 'd', OCR 'o', Omniparser 'p', Gemini 'g'. Indices stay source-local so #u1 and #d1 can both exist in the same window without collision.
Union-find clusters overlapping bounds across sources
For every pair of detections, if min_edge_distance is less than 1.5x the smaller bounding dimension, union their sets. A small OCR word sitting inside a large UIA button still clusters because the smaller dimension controls the threshold.
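A minimal Python sketch of that rule, assuming a Euclidean gap between box edges; the actual should_cluster in tree_formatter.rs may measure the gap differently:

```python
def min_edge_distance(a, b):
    """Gap between two (x, y, w, h) boxes; 0.0 if they overlap or touch."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(bx - (ax + aw), ax - (bx + bw), 0.0)
    dy = max(by - (ay + ah), ay - (by + bh), 0.0)
    return (dx * dx + dy * dy) ** 0.5

def should_cluster(a, b, ratio=1.5):
    """Cluster when the gap is under ratio x the smaller box's smaller dimension."""
    smaller_dim = min(a[2], a[3], b[2], b[3])
    return min_edge_distance(a, b) < ratio * smaller_dim

# An OCR word a few pixels off from its UIA parent still clusters:
uia_button = (450, 600, 120, 32)
ocr_word = (455, 604, 50, 16)
print(should_cluster(uia_button, ocr_word))  # True

# A control a few rows away does not:
print(should_cluster(uia_button, (450, 700, 120, 32)))  # False
```

Note that the smaller box sets the threshold: a 16-pixel-tall OCR word only tolerates a 24-pixel gap, no matter how large the UIA button next to it is.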
Clusters sort by reading order, elements sort within
Clusters sort by the (Y, X) of their first element so the agent reads top-to-bottom. Within a cluster, elements sort by reading order too, so the densest structured source (usually UIA) tends to appear first when all are present.
The agent picks a prefix and clicks
Claude, Cursor, or any MCP-speaking agent now sees one YAML with every detection stacked on the same coordinate. If #u7 is present it uses UIA selectors. If UIA only sees an empty Pane, #d3 routes through the DOM. If the window is Citrix, only #o and #g exist, and those become the targets.
The threshold: 1.5x the smaller dimension
The clustering decision is deliberately simple. Two detections cluster if their bounding boxes overlap, or if the minimum edge distance between them is less than 1.5x the smaller of the two boxes' smaller dimensions. That ratio is the whole tuning budget. It is tight enough that a button and a text label three rows away do not cluster, loose enough that an OCR word detected a few pixels offset from the UIA parent still does.
Union-find then resolves transitive closure. If A clusters with B and B with C, all three become one group. Path compression keeps the lookup cheap even when a window has hundreds of detections. After clustering, each group sorts by reading order (Y then X) and the groups themselves sort by the position of their first element, so the emitted YAML reads top-down like a normal document.
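The grouping step can be sketched in Python as a plain union-find with path compression. This illustrates the approach described above, not a transcription of cluster_elements; the pairwise decision is passed in as a function:

```python
def cluster_elements(boxes, should_cluster):
    """Group boxes transitively: if A~B and B~C, all three share a cluster."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression keeps lookups cheap
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if should_cluster(boxes[i], boxes[j]):
                union(i, j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)

    # Reading order: elements sort by (Y, X) inside a cluster,
    # clusters sort by the position of their first element.
    clusters = [sorted(g, key=lambda i: (boxes[i][1], boxes[i][0]))
                for g in groups.values()]
    clusters.sort(key=lambda g: (boxes[g[0]][1], boxes[g[0]][0]))
    return clusters

# Toy decision function for the demo: cluster when corners are within 30px.
near = lambda a, b: abs(a[0] - b[0]) < 30 and abs(a[1] - b[1]) < 30
boxes = [(0, 0, 20, 20), (10, 5, 20, 20), (200, 300, 20, 20)]
print(cluster_elements(boxes, near))  # → [[0, 1], [2]]
```

The pairwise loop is O(n²), which is fine at the scale of a single window's detections (hundreds, not millions).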
“The generous threshold was the right call. OCR bounds land a few pixels off from UIA bounds constantly, and you want them to cluster anyway.”
Implementation note, tree_formatter.rs
What the agent actually reads
When the MCP agent emits a clustered tree, every element is on a line with its source prefix, its display type, an optional name or description, and the bounding box. An agent scanning the YAML for a Submit button finds three lines inside one cluster: UIA, DOM, and OCR all confirmed the same region.
If the agent wants the structured metadata, it picks #u7. If UIA is empty and the DOM is not (common on web views and Electron apps), it picks #d3. If the session is Citrix and only OCR fired, it picks #o12. The agent does not have to decide its perception strategy up front. It reads the YAML and targets the line that has the most information.
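As an illustration, a hypothetical formatter that emits a centroid header and one prefixed line per element, in the spirit of the Submit example; the real YAML layout produced by the tool may differ:

```python
def format_cluster(elements):
    """elements: list of (prefixed_index, display_type, text, (x, y, w, h))."""
    cx = round(sum(b[0] + b[2] / 2 for *_, b in elements) / len(elements))
    cy = round(sum(b[1] + b[3] / 2 for *_, b in elements) / len(elements))
    lines = [f"@({cx}, {cy}):"]
    for idx, dtype, text, _ in elements:
        lines.append(f"  [{dtype}] {idx} '{text}'")
    return "\n".join(lines)

# Three sources confirming the same Submit button:
submit = [
    ("#u7", "Button", "Submit", (450, 600, 60, 24)),
    ("#d3", "button", "Submit", (450, 600, 60, 24)),
    ("#o12", "OcrWord", "Submit", (452, 604, 48, 16)),
]
print(format_cluster(submit))
```

One cluster, three handles: the agent scans a single region of the document and chooses the richest source present.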
The tool call that exposes it
The MCP agent registers 35 tools. The one that surfaces clustered output is get_window_tree. You set tree_output_format to clustered_yaml and toggle the sources you want. Only UIA is on by default because the extra sources each have a cost: OCR boots Tesseract, Omniparser loads a local model, Gemini calls a remote API.
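A sketch of what such a tool call might carry. The flag names below are the ones this guide mentions (include_browser_dom, include_ocr, include_omniparser, include_gemini_vision); check get_window_tree's published schema for the authoritative shape:

```python
import json

# Hypothetical MCP tool-call payload; illustrative only.
call = {
    "tool": "get_window_tree",
    "arguments": {
        "tree_output_format": "clustered_yaml",
        "include_browser_dom": True,     # adds #d lines from the Chrome extension
        "include_ocr": True,             # boots Tesseract, adds #o lines
        "include_omniparser": False,     # local vision model, adds #p lines
        "include_gemini_vision": False,  # remote API call, adds #g lines
    },
}
print(json.dumps(call, indent=2))
```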
Numbers worth pinning
None of these are benchmark claims. They are facts about how the tool is wired.
Five sources, one prefix each, a 1.5x clustering threshold, and 35 MCP tools that the agent can call once the tree is in hand. The rest of the behavior (clicks, typing, invoke, scroll) is standard desktop automation. The interesting part is the view the agent reads before it decides what to do.
How this compares to single-perception tools
| Feature | Single-perception tools | Terminator |
|---|---|---|
| Perception sources | One. Accessibility OR pixels, never both | Five. UIA, DOM, OCR, Omniparser, Gemini |
| Blind-spot fallback | None. If the primary method fails, the agent is stuck | Automatic. Agent picks a different source's prefix |
| Agent-facing format | Raw tree dump or a screenshot | Spatially-clustered YAML with prefixed indices |
| Coordinate system | Per-source, often misaligned | One screen-coordinate space for all five |
| Disabled-button detection | Guess from pixels, or query UIA only | Combine UIA is_enabled with visual state |
| Remote desktop / Citrix | Accessibility tools fail outright | OCR and Gemini sources still fire |
| Canvas and WebGL | UIA returns nothing; needs vision model | Omniparser and Gemini fill the gap |
| License | Proprietary, seat-based, or closed | MIT, open source, self-hostable |
The usual shortlist, and where it sits in this view
For context, the products that most roundups recommend. Each one makes a single perception bet. None of them publish a clustered view over multiple sources.
Most are accessibility-first studios aimed at business analysts. A few are image-matching tools aimed at QA engineers. A new cohort (Claude computer use, OpenAI Operator, AskUI) is model-first and pixel-based. They all solve specific problems well. The gap is that none of them expose a unified, spatially-aligned index to the agent driving the loop.
Why a developer framework, not a studio
A studio assumes a human is in the loop: dragging activities onto a canvas, recording a workflow, picking elements from a visual picker. An AI coding assistant writes code, calls functions, reads structured output, and recovers from errors. The interface it wants is a typed SDK plus an MCP server.
Terminator ships the Rust core, NAPI-RS bindings for Node, PyO3 bindings for Python, and the MCP agent as an npm package. One line in your MCP config gives Claude Code, Cursor, Windsurf, or Codex the ability to drive any desktop application. The clustered tree is what they read first on every iteration.
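For reference, a typical MCP config entry might look like the following; the server key name is your choice and the exact file layout depends on your client:

```json
{
  "mcpServers": {
    "terminator": {
      "command": "npx",
      "args": ["-y", "terminator-mcp-agent@latest"]
    }
  }
}
```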
See the clustered view on your own desktop
Twenty minutes, your screen, our agent. We will point it at whichever application breaks your current tool and watch the prefixes light up.
Frequently asked questions
What is a desktop automation tool, and how is it different from a browser automation tool?
A desktop automation tool drives any application on your operating system. That includes browsers but also Excel, SAP, Teams, File Explorer, QuickBooks, Photoshop, or an internal WPF tool that has no web counterpart. Browser automation tools like Playwright only see DOM nodes inside a browser process. Desktop automation needs to reach outside the browser sandbox and talk to the OS accessibility layer (Windows UI Automation on Windows, AX API on macOS, AT-SPI2 on Linux), plus fall back to pixels when those APIs return nothing useful.
Why do most tools force a choice between accessibility and vision?
Because they pick a perception method up front. WinAppDriver, UiPath, Power Automate Desktop, and Robot Framework are accessibility-first. They query UIA, find a matching element, click. SikuliX, Claude computer use, and OpenAI Operator are vision-first. They screenshot, ask a model or OpenCV where the element is, click a coordinate. Each approach has a blind spot. UIA can miss custom-drawn canvas controls and anything rendered to a WebGL surface. Vision can miss structured metadata like AutomationId or role, which is the only reliable way to tell a disabled button apart from an enabled one that happens to look gray. Terminator's contribution is refusing to pick. It runs all five perception pipelines and clusters their outputs spatially so your agent sees every detection from every source stacked on the same screen coordinates.
What exactly does the ClusteredYaml output look like?
Every detection gets a prefixed index: #u for UIA, #d for browser DOM, #o for OCR text, #p for Omniparser icon detection, #g for Gemini vision. Elements whose bounding boxes overlap or sit within 1.5x the smaller dimension of each other collapse into one cluster, labeled with a centroid coordinate. A Submit button that UIA sees as a Button control, the DOM sees as a <button> tag, and OCR reads as the word 'Submit' ends up as three lines inside one cluster at, say, @(480, 612): [Button] #u7 'Submit', [button] #d3 'Submit', [OcrWord] #o12 'Submit'. Your agent can target the most reliable source for that specific element without screenshotting the whole screen and re-running detection.
How does the clustering threshold actually work?
In crates/terminator/src/tree_formatter.rs the should_cluster function takes two bounding boxes, computes the minimum edge distance between them (zero if they overlap), and compares against 1.5x the smaller of the two bounds' smaller dimension. That ratio is deliberately generous. An OCR word detected a few pixels offset from the UIA button that contains it will still cluster. A button halfway across the screen will not. Cluster membership is resolved with union-find path compression, so if A is close to B and B is close to C, all three end up in one group. Clusters are then sorted by the Y-then-X position of their first element to give a stable reading order.
Why does an AI coding assistant need five sources and not just one?
Because real desktops fail each source in different ways. Electron apps sometimes expose UIA elements without accessible names, so UIA says 'there is a Button here' but cannot tell you what it does; the DOM inside the Electron content view can. Office ribbons expose rich UIA metadata but render custom icons that Omniparser can identify when the accessible name is a generic 'Button'. Remote desktop or Citrix sessions hand you a single pixel buffer with no accessibility tree at all, and you have to fall back to OCR and Gemini. Non-text UI elements like chart axes or drag handles usually show up in Omniparser and Gemini but are invisible to UIA. Any tool that commits to one source will fail on a sizable slice of real applications. Clustering lets the agent skip the source that failed and click the one that succeeded.
Is this different from what Claude computer use or OpenAI Operator do?
Yes. Computer use agents from Anthropic and OpenAI take a screenshot, send it to the model, get back coordinates, and click. The model sees only pixels; it does not see the accessibility tree, it does not see the DOM, and it cannot tell if a control is disabled without inferring from visual cues. Terminator runs locally on your machine, fuses structured signals (UIA, DOM) with visual signals (OCR, Omniparser, Gemini) into one view, and hands the clustered index to whichever AI coding assistant you already use. The model still makes the decision, but it picks from a set of concrete, prefix-tagged elements with real bounding boxes instead of guessing at a pixel location.
Is Terminator a developer framework or an end-user product?
Developer framework. You install it with cargo add terminator-rs (Rust), npm install @mediar-ai/terminator (Node.js), pip install terminator-py (Python), or npx -y terminator-mcp-agent@latest (MCP server). There is no studio, no drag-and-drop canvas, no bot designer. Its job is to give Claude Code, Cursor, Codex, Windsurf, and similar AI coding assistants the ability to drive any desktop application the same way those assistants already drive browsers through Playwright. That is the whole point of publishing it as an MCP server: you add one line to your MCP config and your existing AI pair programmer can suddenly read and manipulate any Windows application.
How do I see the clustered output locally?
Install the MCP agent with `npx -y terminator-mcp-agent@latest`, wire it into your AI assistant's MCP config, and call get_window_tree with tree_output_format set to 'clustered_yaml'. The result will include the prefixed indices. You can enable additional sources with include_browser_dom, include_ocr, include_omniparser, and include_gemini_vision flags. Only UIA is on by default because the extra sources each have a cost (OCR spins up Tesseract, Omniparser needs a local model, Gemini needs an API key and a network round-trip). Turn them on when the default tree is missing what you need.