A free desktop automation tool whose output was written for an LLM to read
Most of the products on a free desktop automation roundup were built for a human operator writing a script. They return element handles, image matches, control IDs. That output reads fine if a person is sitting in front of a REPL. It does not read at all if the caller is a language model on a single conversational turn. Terminator returns something different: a compact indexed YAML tree where every clickable element is prefixed with a #N, tagged with [ROLE] and a name, and annotated with bounds and state. The same index goes back into click_element, type_into_element, and wait_for_element as a one-integer argument. That loop is the product.
Why every other free tool on the list assumes a human
The shortlist repeats itself across every write-up on this subject. AutoHotkey, AutoIt, SikuliX, Pywinauto, Robot Framework, Winium, WinAppDriver, Ui.Vision, Power Automate Desktop, UiPath Community. They are grouped by language binding or by recognition strategy, but none of the comparisons pause to ask what the tool actually returns to its caller. That matters now because the caller is no longer a person opening an editor. It is a coding assistant that received a prompt, has to figure out what is on the screen, and then has to decide which element to act on. A return value that makes sense to a human reading docs does not automatically make sense to a model reading a single tool response.
The anchor: one function emits the tree, one cache holds the mapping, one argument is all the model needs
The output format lives in crates/terminator-mcp-agent/src/tree_formatter.rs. The function format_tree_as_compact_yaml walks a serialized UI element, assigns every element with bounds a 1-based index, and writes lines of the shape #N [ROLE] name (bounds: [x,y,w,h], focusable, value: ...). Elements without bounds get a dash prefix and no index so the model never tries to click them. The function also returns an index_to_bounds HashMap that the MCP agent keeps server-side. When the model calls click_element with index: 7, the agent resolves it to the stored role, name, bounds, and original selector. The index is the whole interface.
What the tree actually looks like on the wire
This is the shape of a minimal Notepad tree the way the MCP agent emits it. The indentation encodes containment. The #N is the handle. The role is the structural type. The name is the accessible label. Everything in parentheses is context the model may or may not need, depending on the task.
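The block below is a hand-written illustration of that shape, not captured output; the element names and bounds are invented for the example:

```
#1 [Window] Untitled - Notepad (bounds: [100,100,1000,700], focusable)
  - [TitleBar] Untitled - Notepad
  #2 [MenuBar] Application (bounds: [100,132,1000,24])
    #3 [MenuItem] File (bounds: [100,132,48,24], focusable)
    #4 [MenuItem] Edit (bounds: [148,132,48,24], focusable)
  #5 [Edit] Text editor (bounds: [100,156,1000,620], focusable)
  #6 [StatusBar] Ln 1, Col 1 (bounds: [100,776,1000,24])
```

Indexed lines are clickable; the dash-prefixed TitleBar has no bounds in this sketch, so it gets no index and the model never targets it.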
None of this is a special debug view. It is the default response body of a single MCP tool call. A model seeing this for the first time has everything it needs to decide what to do next, in one block, without extra context.
The function that writes that tree
The formatter is straightforward Rust. It builds the output string, assigns the running index, and returns both the formatted text and the map used later to resolve an index back to coordinates. The interesting part is that the mapping is stored, not thrown away: that is how click_element is able to take an integer from the model and click the right element without the model ever sending bounds or selectors back.
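A Python sketch of that logic. The shipped implementation is Rust in tree_formatter.rs; the dict keys here (role, name, bounds, children) are assumptions about the serialized element shape, and the returned map carries (role, name, bounds) rather than the full tuple the agent stores.

```python
def format_tree_as_compact_yaml(node):
    """Walk a serialized UI element tree; index every element that has bounds."""
    lines, index_to_bounds = [], {}
    counter = [0]  # running 1-based index, shared across the recursion

    def walk(el, depth):
        indent = "  " * depth
        role, name = el["role"], el.get("name", "")
        bounds = el.get("bounds")
        if bounds is not None:
            counter[0] += 1
            idx = counter[0]
            x, y, w, h = bounds
            lines.append(f"{indent}#{idx} [{role}] {name} (bounds: [{x},{y},{w},{h}])")
            # Keep the mapping: this is what resolves an integer back to a click target.
            index_to_bounds[idx] = (role, name, bounds)
        else:
            # No bounds: dash prefix, no index, so the model never tries to click it.
            lines.append(f"{indent}- [{role}] {name}")
        for child in el.get("children", []):
            walk(child, depth + 1)

    walk(node, 0)
    return "\n".join(lines), index_to_bounds


demo = {
    "role": "Window", "name": "Untitled - Notepad", "bounds": [0, 0, 800, 600],
    "children": [
        {"role": "MenuBar", "children": [
            {"role": "MenuItem", "name": "File", "bounds": [0, 0, 40, 20]},
        ]},
    ],
}
text, index_to_bounds = format_tree_as_compact_yaml(demo)
print(text)
```

The point to notice is the second return value: the formatted text goes to the model, the map stays on the agent.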
The click side is simpler still. Three modes, one check per mode, and a validation that exactly one is specified.
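A Python sketch of that check. The real method is determine_mode on ClickElementArgs in utils.rs; the parameter names here are illustrative.

```python
def determine_mode(selector=None, index=None, x=None, y=None):
    """Validate that exactly one click mode was specified; return its name."""
    modes = []
    if selector is not None:
        modes.append("selector")       # e.g. "role:Button|name:Save"
    if index is not None:
        modes.append("index")          # integer from the last get_window_tree
    if x is not None or y is not None:
        if x is None or y is None:
            raise ValueError("coordinate mode needs both x and y")
        modes.append("coordinates")    # absolute screen position
    if len(modes) != 1:
        raise ValueError(f"exactly one click mode required, got {modes or 'none'}")
    return modes[0]
```

One check per mode, one validation at the end; ambiguous calls fail loudly instead of guessing.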
The loop from prompt to pixel
Here is what happens when an assistant receives a request like "open Notepad, write hello, save the file." Nothing about the sequence below is Notepad-specific; the same six steps run against any window the accessibility API can see.
One turn from the model's point of view
1. Assistant prompt: 'fill out notepad'
2. get_window_tree (UIA cache read)
3. Indexed YAML comes back: #1..#37 with role, name, bounds
4. Model reads the tree, picks #5
5. click_element with index: 5
6. UIA click at the bounds center
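The same turn as client-side pseudocode, with call_tool as a hypothetical stand-in for the MCP round trip; the argument names are assumptions, not the agent's verified schema.

```python
calls = []

def call_tool(name, args):
    """Hypothetical MCP round trip: records the call and fakes a response."""
    calls.append((name, args))
    if name == "get_window_tree":
        # In a real session this is the full indexed YAML block from the agent.
        return "#5 [Edit] Text editor (bounds: [100,156,1000,620], focusable)"
    return "ok"

# One perception-action turn: read the tree, pick an index, act on it.
tree = call_tool("get_window_tree", {"process": "notepad"})
# The model reads `tree`, decides the editor is #5, and answers with one integer.
call_tool("click_element", {"index": 5})
call_tool("type_into_element", {"index": 5, "text": "hello"})
```

No selectors, no screenshots, no bounds travel back to the agent; the integer is the whole payload.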
Traditional tool output vs. what an assistant can read
To see why the format matters, put a typical free-tool response next to the indexed YAML. Both describe the same four buttons. Only one of them is usable by a model that has no prior state.
A human-first dump vs. an agent-first tree
A stream of element handles or image match references. The script author knows what each one is because they wrote the script. A model receiving this cold has no frame of reference.
- ControlID: 0x0041A, ClassName: Button
- Image match at (420, 300) with confidence 0.87
- UIAutomationElement#0x7ff9e41b2010
- WebDriver session element-ref:c3f9-4a...
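For contrast, the same four elements in the indexed form (names and bounds invented for the illustration, in the line shape the formatter emits):

```
#1 [Button] Save (bounds: [420,300,80,30], focusable)
#2 [Button] Save As (bounds: [510,300,96,30], focusable)
#3 [Button] Cancel (bounds: [616,300,80,30], focusable)
#4 [Button] Help (bounds: [706,300,80,30], focusable)
```

A model with no prior state can act on any of these by returning one integer.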
The exchange between assistant, agent, and the OS
Three actors, eight messages, one state transition per round trip. The MCP agent is what sits between the model and the Windows or macOS accessibility API. It is the piece that knows how to read the cached tree and how to translate an index back into a coordinate click.
Assistant -> MCP agent -> OS accessibility API
The five perception modes, same tree shape every time
The indexed YAML is the unit of output. What feeds into it can change: UIA by default, browser DOM from an optional extension, OCR for canvas widgets, OmniParser for icon detection, Gemini for last-resort vision. Each one is a boolean flag on the same get_window_tree call, and each one merges its own indexed nodes into the tree the model reads.
- UIA tree (default): Always on. Indexed YAML from IUIAutomationCacheRequest. One round trip, no network, no CPU spike. This is the tree the model reads on every iteration.
- include_browser_dom: Adds DOM-level elements from the Terminator Chrome extension. Indexed the same way, so the model's prompt stays the same shape.
- include_ocr: Runs Tesseract locally and surfaces OcrWord nodes with their own #N indices. For canvas widgets UIA cannot see.
- include_omniparser: Posts the window screenshot to an Omniparser backend (self-host with OMNIPARSER_BACKEND_URL). Bounding boxes come back as indexed nodes.
- include_gemini_vision: Opt-in, BYO Gemini API key. Used only when the other tiers miss something. The only tier that can bill your card.
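A sketch of what one such request body might look like, with the tiers above toggled by flag (the flag names come from the tiers above; the exact argument schema is not verified here):

```json
{
  "tool": "get_window_tree",
  "arguments": {
    "process": "notepad",
    "include_browser_dom": false,
    "include_ocr": true,
    "include_omniparser": false,
    "include_gemini_vision": false
  }
}
```

Whatever combination is on, the response is one indexed tree, so the model's reading strategy never changes.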
The MCP agent sits between assistants and the OS
Every major AI coding assistant speaks the Model Context Protocol. On the other side, every major desktop operating system exposes an accessibility API. The agent is the bridge: one piece of code, one tree format, one set of tools, regardless of which assistant talks to it or which OS it is reading.
One agent, many assistants, every accessibility API
The four-step install that exposes the whole loop
There is no onboarding flow, no sign-in, no license key. The agent is an npx command. The assistant is whatever MCP client you already use. The rest is the model reading a tree.
1. Install the MCP agent. Add one block to your assistant's MCP config that runs 'npx -y terminator-mcp-agent@latest'. No system service, no driver binary, no daemon.
2. Ask the assistant to list windows. The first thing the model usually does is call get_applications_and_windows_list. That gives it the process names it can then pass to get_window_tree.
3. Let it get the tree. get_window_tree returns the indexed YAML block. The model reads it on that single turn. The index_to_bounds cache lives on the agent, not the model.
4. Click, type, press, wait. Every action tool accepts the index it just read. There is no middle translation step. The model converges on the right element by naming an index, not by writing a CSS-like string.
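Step one in full, assuming the standard mcpServers config schema (the server name key is your choice):

```json
{
  "mcpServers": {
    "terminator-mcp-agent": {
      "command": "npx",
      "args": ["-y", "terminator-mcp-agent@latest"]
    }
  }
}
```

Restart the assistant after saving, and the full tool set registers on the next session.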
What a real session looks like on the stdio
Here is one opening of Notepad, one get_window_tree, one click, one type, one save. Every line after the npx invocation is a log message the agent prints to stderr, so you can watch the loop resolve.
The comparison that belongs on every free-tools list
"Free" is a cost attribute. The attribute that matters once an AI assistant is the operator is shape of output, not dollars. Here is where Terminator lands on each axis a typical roundup leaves blank.
| Feature | Typical free desktop automation tool | Terminator |
|---|---|---|
| What the tool returns to the caller | Element handles, image matches, XPaths, control IDs | Indexed YAML tree (#N [ROLE] name ...) |
| Designed input shape | A human writing a script interactively | A language model on a single conversational turn |
| Click interface | Selector string or pixel coordinates | Three modes: selector, index, or (x, y) |
| Cross-process tree | Usually scoped to one app or driver session | Any running process by name or PID |
| Selector language | Tool-specific DSL or object chain | Playwright-style: role:Button\|name:Save >> name:OK |
| Accessibility-first | Image matching or Win32 messages (mostly) | UIA on Windows, AX on macOS, AT-SPI on Linux |
| AI assistant install path | None; they predate the use case | npx -y terminator-mcp-agent@latest |
| Functional gating | Seat caps, premium connectors, trial clocks (varies) | MIT, no gating |
The numbers that actually describe the surface
The whole agent is small. Roughly thirty-five MCP tools registered, three click modes, and five perception tiers that all collapse into the same indexed tree. A Notepad window typically produces a few dozen indexed nodes; a Chrome window is an order of magnitude larger, but the shape is identical.
“The whole interface between a language model and a desktop app can be one integer. That is what the tree format enables.”
tree_formatter.rs + utils.rs, Terminator MCP agent
Why this is the part a roundup never covers
A free-tools article is graded on whether it mentions the expected dozen products. The shape of the return value is outside that grade. But once you are installing a desktop automation tool to hand to an AI coding assistant rather than to write your own scripts against, the return value is the product. An image match is not an interface for a model. An element handle is not an interface for a model. An XPath is not an interface for a model. An indexed YAML tree with a side-table of bounds, retained on the agent, accepting an integer back, is. That is the gap this page tries to fill in.
Install with claude mcp add terminator "npx -y terminator-mcp-agent@latest", ask your assistant to open any window, and watch the first get_window_tree response scroll by. The format is the point.
Have your assistant read a tree with you on the call
Bring the app you want automated. We will open a live MCP session, take one get_window_tree snapshot together, and walk through how the model would act on the indices before you write any integration code.
Frequently asked questions
What makes a free desktop automation tool actually usable by an AI coding assistant?
The tool has to return a representation of the screen that a language model can read with no extra work. Human-first tools return element handles, image match coordinates, or XPaths that only make sense if you are going to read source docs and debug interactively. An assistant-first tool returns a structured, indexed, named tree where each interactive element has a single identifier the model can pass back to a click or type call. Terminator's get_window_tree returns a compact YAML tree where every clickable element is prefixed with a #N index and tagged with [ROLE] name (bounds, state flags). The same index is then accepted by click_element, type_into_element, invoke_element, and so on. The model reads the tree, picks an index, and fires the next tool call. It does not need to build a selector, parse a screenshot, or retry on pixel mismatch.
How does this tree format differ from what Pywinauto, AutoHotkey, or SikuliX return?
Pywinauto returns Python object handles that the script has to walk (app.Notepad.Edit). AutoHotkey returns window IDs and control text that you compose into WinActivate and ControlClick commands. SikuliX returns image match objects that you click by pixel. Each of those formats is readable by a human who wrote the automation. None of them are a drop-in input for a language model reasoning turn by turn. Terminator's format is the opposite: #1 [Button] Save (bounds: [420,300,80,30], focusable) is one line the model can reproduce verbatim as the argument to the next tool call. The mapping from index to role, name, bounds, and selector is stored server-side in the MCP agent's index_to_bounds cache, so the model never has to hold state about which window was active.
Where exactly does the tree format live in the source?
It is defined in crates/terminator-mcp-agent/src/tree_formatter.rs. The function format_tree_as_compact_yaml walks a SerializableUIElement, emits #N [ROLE] name (bounds: [...], additional context) for every element with bounds, and writes - [ROLE] name for elements without bounds. It also returns an index_to_bounds HashMap<u32, (role, name, bounds, selector)>. That map is what click_element looks up when the model passes index: 7 instead of a selector string.
What are the three click modes, and why does an assistant need all three?
ClickElementArgs in crates/terminator-mcp-agent/src/utils.rs has a determine_mode method that validates exactly one of three patterns: selector mode (process plus a string like 'role:Button|name:Save'), index mode (an integer index from the last get_window_tree call), and coordinate mode (absolute screen x and y). The assistant uses index mode by default because that is the cheapest turn in a conversation: one number, one tool call. Selector mode is for deterministic replay across sessions where the index would change. Coordinate mode is the last resort for things accessibility APIs cannot see, like a canvas-drawn button inside a game or a widget inside a WebGL viewport.
How many MCP tools does the agent register, and are they free to use?
The Rust implementation in server.rs exposes roughly thirty-five tools via the rmcp tool_router: get_window_tree, click_element, type_into_element, press_key, press_key_global, scroll_element, activate_element, select_option, set_selected, invoke_element, set_value, wait_for_element, validate_element, highlight_element, mouse_drag, capture_screenshot, open_application, navigate_browser, execute_browser_script, run_command, read_file, write_file, edit_file, copy_content, glob_files, grep_files, typecheck_workflow, execute_sequence, ask_user, delay, stop_execution, get_applications_and_windows_list, gemini_computer_use, and a couple of inspect overlays. Every one of them is shipped under the MIT license. The only one that can bill your card is gemini_computer_use because it calls the Google Gemini API with your own key, and it is opt-in.
Does Terminator run on macOS and Linux, or just Windows?
The core terminator crate has a platforms/ directory with adapters behind a trait. The Windows adapter (platforms/windows/) uses the IUIAutomation COM interface and IUIAutomationCacheRequest for batched tree reads. The macOS adapter uses the AX accessibility API. Linux support is partial and gated on AT-SPI. The same Rust API is re-exported into Node via the terminator-nodejs binding and into Python via terminator-python, so you install once and get the same tree shape everywhere the accessibility API can see the window.
What is the quickest way to point an AI assistant at it?
One line in your MCP config. For Claude Code or Claude Desktop: 'mcpServers': { 'terminator-mcp-agent': { 'command': 'npx', 'args': ['-y', 'terminator-mcp-agent@latest'] } }. Restart the assistant, and it registers the full tool set. The first call the model usually makes is get_window_tree with a process name like 'notepad'. The response is the indexed YAML tree, and every subsequent click or type references an index from that tree. There is no daemon to install, no license server to talk to, no bot registration step.
Why do other free tools struggle when an AI assistant drives them?
Because they were designed for a person writing a script once, not for a loop of tool calls made by a model that wakes up fresh on every turn. AutoHotkey expects the script author to know the ControlClass name in advance. SikuliX expects you to have already captured the reference image. Robot Framework expects you to write keywords. Those assumptions hold when a human is at the keyboard. They break the moment the caller is a language model that just received a prompt and has no prior context. Terminator's tree is self-describing: the model reads one YAML block and has everything it needs to act. Nothing has to be prepared ahead of time.
Is any of this a paid tier in disguise?
No. The terminator crate is on crates.io as terminator-rs, the Node binding is on npm as @mediar-ai/terminator, the Python binding is on PyPI as terminator.py, and the MCP agent is on npm as terminator-mcp-agent. All MIT, no functional gating, no online activation. Mediar (the company that develops it) sells a managed workflow product that uses Terminator underneath; that product is paid, but the underlying library and MCP agent are not.