The desktop automation framework that does not pick one view of the screen
Every other tool in this category commits to one detection path. Accessibility tree, or DOM, or computer vision. Terminator runs five in parallel, fuses the results into a single clustered YAML tree, and lets the AI address any element through whichever source actually saw it.
The part no other guide on this topic mentions
A Windows button has a UIA representation. A web form field has a DOM representation. A rendered button inside a Citrix window has no structured representation at all, only pixels. Vision models see pixels. OCR sees text in pixels. These are five different views of the same square of screen, and most automation frameworks pick one.
Terminator holds all five. In crates/terminator/src/tree_formatter.rs there is a single Rust enum, five variants, each mapped to a one-character prefix. The AI reads a clustered tree where every element is addressable through that prefix, and when one source misses the button, another source in the same cluster still has it.
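A minimal sketch of that enum and its prefix round-trip, modeled on the mapping the docs describe for tree_formatter.rs; the crate's real signatures and derives may differ:

```rust
/// Five detection sources, one character each: the namespace key the model writes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ElementSource {
    Uia,
    Dom,
    Ocr,
    Omniparser,
    Gemini,
}

impl ElementSource {
    /// Forward mapping: source to its one-character prefix.
    pub fn prefix(self) -> char {
        match self {
            ElementSource::Uia => 'u',
            ElementSource::Dom => 'd',
            ElementSource::Ocr => 'o',
            ElementSource::Omniparser => 'p',
            ElementSource::Gemini => 'g',
        }
    }

    /// Inverse: "u1" or "d23" back to (source, index); None on a bad prefix.
    pub fn parse_prefixed_index(s: &str) -> Option<(ElementSource, u32)> {
        let mut chars = s.chars();
        let source = match chars.next()? {
            'u' => ElementSource::Uia,
            'd' => ElementSource::Dom,
            'o' => ElementSource::Ocr,
            'p' => ElementSource::Omniparser,
            'g' => ElementSource::Gemini,
            _ => return None,
        };
        let index: u32 = chars.as_str().parse().ok()?;
        Some((source, index))
    }
}
```

Because the prefix is part of the identifier itself, two sources returning the same numeric index never collide: u1 and d1 are distinct names in distinct namespaces.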
What a clustered tree looks like
Three sources all saw the same Submit button. The framework merges them into one cluster at the same coordinates and gives each a prefixed index. The AI references u1, d1, or o1 and the click routes through the right backend.
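Based on the output format the FAQ below quotes from tree_formatter.rs, such a cluster might render like this (spacing and trailing annotations are illustrative):

```yaml
# Cluster @(100,200)
- [Button]  #u1 "Submit" (bounds: [100,200,80,30])   # UIA
- [button]  #d1 "Submit" (bounds: [100,200,80,30])   # DOM
- [OcrWord] #o1 "Submit" (bounds: [102,205,76,25])   # OCR
```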
How the five sources flow into one tree
Each source runs independently and writes to its own cache. A single call to format_clustered_tree_from_caches at line 557 takes all five cache maps, pushes every detection into a UnifiedElement, and hands the whole batch to the union-find pass.
Detection sources to clustered tree
The clustering rule, in full
Two detections get merged into one cluster if the minimum edge distance between their bounding boxes is less than 1.5 times the shorter side of the smaller element. That threshold is relative, not absolute. A 12 px icon clusters tightly. A 200 px card clusters loosely. The pass is a standard union-find on every pair.
Why relative, not absolute
A fixed pixel threshold would over-cluster on low-DPI monitors and under-cluster on a 4K display. Scaling the threshold to the element itself means a UIA button and the OCR word on top of it always merge, while two buttons 30 pixels apart in a tall toolbar stay separate. The rule is one line in should_cluster: smaller_dim * 1.5.
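The rule above can be sketched end to end. This is a runnable approximation, assuming the bounds representation shown here; the crate's UnifiedElement and union-find internals may differ:

```rust
#[derive(Clone, Copy)]
struct Bounds { x: f64, y: f64, w: f64, h: f64 }

/// Minimum edge distance between two axis-aligned boxes (0 when they overlap).
fn min_edge_distance(a: Bounds, b: Bounds) -> f64 {
    let dx = (b.x - (a.x + a.w)).max(a.x - (b.x + b.w)).max(0.0);
    let dy = (b.y - (a.y + a.h)).max(a.y - (b.y + b.h)).max(0.0);
    (dx * dx + dy * dy).sqrt()
}

/// Merge when the gap is under 1.5x the shorter side of the smaller element.
fn should_cluster(a: Bounds, b: Bounds) -> bool {
    let smaller_dim = a.w.min(a.h).min(b.w).min(b.h);
    min_edge_distance(a, b) < smaller_dim * 1.5
}

/// Union-find over every pair; returns one cluster id per element.
fn cluster(elems: &[Bounds]) -> Vec<usize> {
    fn find(parent: &mut Vec<usize>, i: usize) -> usize {
        if parent[i] != i {
            let p = parent[i];
            let root = find(parent, p);
            parent[i] = root; // path compression
        }
        parent[i]
    }
    let mut parent: Vec<usize> = (0..elems.len()).collect();
    for i in 0..elems.len() {
        for j in (i + 1)..elems.len() {
            if should_cluster(elems[i], elems[j]) {
                let ri = find(&mut parent, i);
                let rj = find(&mut parent, j);
                parent[ri] = rj;
            }
        }
    }
    (0..elems.len()).map(|i| find(&mut parent, i)).collect()
}
```

With these numbers, a UIA button and the OCR word drawn on top of it land in one cluster, while a second button 70 px below stays in its own.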
How an agent actually uses this
The model receives the clustered YAML. It does not need to choose a detection source up front. It picks a cluster, then picks a prefixed index inside that cluster based on what the click requires. Native controls? Send u-prefix. Browser form fields? Send d-prefix. Dialog with no accessibility data? Send o or p.
Where the other five-source fallback patterns break
Traditional robotic process automation tools support multiple selector types, but the author has to write the ladder by hand. You author a UIA selector, then a fallback image selector, then a fallback OCR selector, and you catch exceptions between them. That ladder is step-local: if a workflow has 60 steps, you authored 60 ladders. Terminator does this once, at the tree level, before the model even picks a target.
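What one rung of that hand-authored ladder looks like, sketched in Rust with stub functions (the names are illustrative, not any vendor's API; the stubs simulate a UIA miss so the image selector catches it):

```rust
// Step-local fallback ladder, written once per step by the author.
fn try_uia_selector(_name: &str) -> Result<&'static str, &'static str> {
    Err("control not exposed") // e.g. a custom-rendered button
}
fn try_image_match(_template: &str) -> Result<&'static str, &'static str> {
    Ok("clicked via image match")
}
fn try_ocr_click(_text: &str) -> Result<&'static str, &'static str> {
    Ok("clicked via OCR bounds")
}

fn click_submit() -> Result<&'static str, &'static str> {
    try_uia_selector("Submit")
        .or_else(|_| try_image_match("submit.png"))
        .or_else(|_| try_ocr_click("Submit"))
}
```

Multiply this by every step in the workflow and you have the maintenance burden Terminator moves to the tree level: the cluster already holds every source that resolved the element, so no per-step ladder is written at all.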
Single-source stacks vs. five-source fused stack
| Feature | Single-source | Terminator |
|---|---|---|
| Detection sources per step | One (vendor picks: UIA, image, or OCR) | Five fused (Uia, Dom, Ocr, Omniparser, Gemini) |
| Element addressing | Opaque selector ID per tool | Prefixed index like u1, d2, o3, p4, g5 |
| Fallback across sources | Author writes a try/catch ladder by hand | Cluster is populated by whichever sources resolved it |
| Cluster grouping rule | Not applicable | Union-find, threshold 1.5x the smaller element dimension |
| Works inside Citrix or remote desktop | Image matching only | OCR plus Omniparser plus Gemini, three vision sources |
| Works on a logged-in browser tab | Separate Playwright or Selenium stack | DOM source through the Chrome extension bridge |
| Author writes the fallback logic | Yes, explicit | No, the framework fuses and routes |
Real workflows where this earns its keep
Any workflow that crosses a boundary a single-source tool cannot see. Accessibility-only tools die inside Citrix. DOM-only tools die the moment a Windows dialog covers the browser. Vision-only tools read buttons but cannot invoke them through the right API. The five-source fuse lets one script walk through all of these.
SAP GUI, Excel, then a browser tab
Drive SAP through UIA, pull the CSV into Excel via UIA, then hand off to a Chrome tab already logged into Salesforce via the DOM source. One script, three apps, three detection sources.
Citrix and remote desktop sessions
Inside a Citrix window the entire UI arrives as a single bitmap. UIA sees nothing. OCR plus Omniparser still return the buttons and inputs, so the same selector grammar works.
Electron apps with missing ARIA
Slack, VS Code, Discord. The UIA tree is partial and labels are often missing. DOM from the window's devtools plus OCR fills the gaps.
Legacy Win32 plus modern WinUI
A 1998-era accounting app next to a 2026 WinUI dashboard. UIA sees both differently. Clusters fuse what the user sees as one screen.
Agentic testing across five sources
The AI does not pick one target mode and hope. It sees u1, d1, o1 in one cluster and picks whichever is most stable for the click type.
What the ElementSource prefix unlocks
Side effects of giving every detection a single-character prefix
- The AI can mix sources in one plan without re-prompting the user.
- Logs are readable: u17 always means UIA element 17, never anything else.
- Telemetry can count which source the model preferred per app.
- Snapshots diff cleanly, because prefixed IDs survive tree rebuilds.
- Two sources returning 'index 1' do not collide: u1 is not d1.
- parse_prefixed_index() gives you (source, index) in one call.
Getting this running locally
The clustered tree is not a special mode. Every Terminator tree call supports it via the source list. Start with UIA plus OCR on Windows, add DOM when the target app is a browser tab, add Omniparser or Gemini when pixels are all you have.
Install Terminator and run the MCP agent
One Cargo binary runs on Windows and macOS. curl -fsSL https://t8r.tech/install.sh | bash gets the CLI and the MCP server.
Point your IDE at the MCP server
Claude Code, Cursor, or any MCP client. The agent exposes tools like get_clustered_tree, click_prefixed, and type_into_prefixed.
Ask the model to do the workflow in plain English
The model requests the clustered tree, reads the prefixed elements, decides on a cluster and prefix per step, and the agent routes the call to the right detection source.
Record once, replay as YAML
If the model's plan is stable, export the recorded step list to a YAML workflow that runs headless. The clustering still happens per step at replay time, so a missing UIA element on tomorrow's build falls back to OCR automatically.
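What an exported step might look like, under a purely illustrative schema (the field names here are assumptions, not Terminator's documented workflow format):

```yaml
# Hypothetical step entry; field names are illustrative.
- step: click_prefixed
  target: u1                  # preferred element: UIA index 1
  cluster_hint: [100, 200]    # re-cluster at replay; fall back within the cluster
```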
“Every other tool makes you pick one detection source per step. Terminator fuses five and lets the AI pick per element.”
Terminator docs
The sources, in one line each
Each source is a different contract with the screen. Together they cover every app you actually run at work.
- Uia: the native accessibility tree (UI Automation on Windows, AX on macOS).
- Dom: the live page structure, delivered over the Chrome extension bridge.
- Ocr: text recognized from a screenshot of the window.
- Omniparser: a vision model that detects UI elements in pixels.
- Gemini: a general vision model run on the same screenshot.
What actually happens on a click
The model sends a prefixed index. The framework parses it with ElementSource::parse_prefixed_index at line 63. That returns the source and the index inside that source's cache. The executor then picks the right click backend.
Prefixed index to actual click, step by step:
1. The model emits u1, taken from the clustered YAML.
2. parse_prefixed_index turns it into (Uia, 1).
3. The executor looks up index 1 in uia_bounds: role, name, bounds, selector.
4. UIA Invoke fires a native accessibility click.
5. A ClickResult comes back with bounds_changed and title_changed.
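The source-to-backend routing can be sketched as a single match. The backend names here are illustrative, not the crate's real types; the routing facts (UIA through accessibility invoke, DOM through the extension bridge, vision sources through a coordinate click at the detected bounds) follow the behavior described in this article:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ElementSource { Uia, Dom, Ocr, Omniparser, Gemini }

/// Illustrative click backends, one per routing path.
#[derive(Debug, PartialEq, Eq)]
enum ClickBackend { UiaInvoke, DomClick, CoordinateClick }

fn backend_for(source: ElementSource) -> ClickBackend {
    match source {
        ElementSource::Uia => ClickBackend::UiaInvoke,
        ElementSource::Dom => ClickBackend::DomClick,
        // The three vision sources only know pixels, so they resolve
        // to a coordinate click at the detected bounds.
        ElementSource::Ocr | ElementSource::Omniparser | ElementSource::Gemini => {
            ClickBackend::CoordinateClick
        }
    }
}
```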
Want the five-source fuse on your workflow?
In fifteen minutes, we look at the app you are trying to automate and show you which sources actually resolve its controls.
Frequently asked about multi-source desktop automation
What does fusing five detection sources actually change for a desktop automation script?
One fewer failure mode. A pure accessibility-tree tool misses anything the app draws with a custom renderer (SAP, Qt apps, games, some Electron menus). A pure vision tool misses elements obscured behind a modal that the accessibility tree can still name. A DOM-only tool cannot see the Windows file dialog that covers the browser. Terminator runs all five in parallel, merges overlapping detections into a single cluster, and gives the model prefixed indexes like u1, d2, o3, p4, g5. When the UIA entry for a button is missing a name, the OCR entry in the same cluster still has 'Submit' on it. The script targets the cluster, not one specific source, so a single step survives one or two sources failing at once.
Where is the five-source enum defined, literally?
crates/terminator/src/tree_formatter.rs line 42. pub enum ElementSource { Uia, Dom, Ocr, Omniparser, Gemini }. The prefix() method on the next line maps each variant to a single character: 'u' for Uia, 'd' for Dom, 'o' for Ocr, 'p' for Omniparser, 'g' for Gemini. parse_prefixed_index(s) does the inverse: hand it 'u1' or 'd23' and it returns the source and the numeric index. Every element in the clustered output is addressable by its prefixed index, which is how the AI references what to click.
How does the clustering work, and why 1.5x?
cluster_elements at tree_formatter.rs line 460 runs union-find. For every pair of detections across all five sources, should_cluster() at line 452 computes the minimum edge distance between the two bounding boxes, then compares it to 1.5 times the smaller element's shorter dimension. If the distance is under that threshold, the two indexes are unioned. The result is a Vec<Vec<UnifiedElement>> where each inner vec is one visual cluster that might contain a UIA button plus its DOM counterpart plus the OCR word the user actually sees. Within a cluster the elements are sorted in reading order (Y then X), and clusters themselves are sorted the same way. The 1.5x multiplier is relative to element size, not pixels, which means a 12px icon clusters tightly and a 200px card clusters generously.
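The reading-order sort mentioned above (Y first, then X) is small enough to sketch; the crate's actual comparator may differ:

```rust
/// Sort points top-to-bottom, then left-to-right, by comparing (y, x) tuples.
fn sort_reading_order(points: &mut [(f64, f64)]) {
    points.sort_by(|a, b| (a.1, a.0).partial_cmp(&(b.1, b.0)).unwrap());
}
```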
Do all five sources run on every step, or only when needed?
Only the sources the caller asked for. The clustered tree is assembled in format_clustered_tree_from_caches at line 557, which takes five separate cache maps as arguments (uia_bounds, dom_bounds, ocr_bounds, omniparser_items, vision_items). An empty map for a source just skips it. A typical Windows script runs UIA + OCR. A browser-heavy script runs UIA + DOM + OCR. When the app is a Citrix window or a game, the caller drops UIA and runs OCR + Omniparser + Gemini. The ElementSource abstraction means the downstream code never changes, only the set of populated caches does.
How is this different from an RPA tool that has UIA, image, and OCR as separate selector types?
Legacy robotic process automation tools make the author pick one selector type per step. A UI-selector step fails if the UI is not exposed. An image step fails if the button got two pixels wider. An OCR step fails if the font changed. The author then writes a ladder of fallbacks by hand. Terminator inverts it: every step targets a cluster, and the cluster is populated by whichever sources actually resolved an element near those bounds on the last detection pass. When the click fires, the action routes through the most reliable source in the cluster for that element type (UIA for native controls, DOM for browser forms, vision fallback only if the first two did not see it). The author wrote one step, the framework chose the source.
What is a prefixed index, and why does the AI need one?
A prefixed index is a string like u17 or d3 or g42 that uniquely identifies one detected element in one detection source on one screenshot. The AI reads the clustered YAML tree, decides which element to interact with, and emits that prefixed index. Under the hood, parse_prefixed_index() splits it into (ElementSource, u32) and the executor looks the element up in the right cache. Without prefixes, two sources might both return index 1 for different elements and the AI would send ambiguous references. With prefixes, one namespace per source, guaranteed by a single-character key the model writes reliably.
Can I see the YAML output format for a real window?
Yes. The comment block at tree_formatter.rs line 546 documents it exactly. Each cluster is a block starting with '# Cluster @(x,y)' followed by lines like '- [Button] #u1 "Submit" (bounds: [100,200,80,30])' or '- [button] #d1 "Submit" (bounds: [100,200,80,30])' or '- [OcrWord] #o1 "Submit" (bounds: [102,205,76,25])'. When UIA, DOM, and OCR all see the same Submit button, the AI gets one cluster with three lines. It can click u1 and the framework will route through accessibility APIs, or click o1 and it will route through a coordinate click at the OCR bounds, depending on how the step was authored.
Does this work on macOS, Windows, and Linux the same way?
The ElementSource enum is platform-agnostic and lives in the core terminator crate. The UIA source is Windows-specific and resolves through UI Automation COM APIs under crates/terminator/src/platforms/windows. The macOS adapter uses the AX accessibility APIs under the same tree_formatter abstraction. DOM comes from the Chrome extension over the WebSocket bridge, same on every OS. OCR, Omniparser, and Gemini operate on screenshots and are OS-independent. On a machine where UIA returns nothing (Linux, or a Windows app that draws its own UI), the other four sources still populate the clustered tree and the script keeps working.