Matthew Diakonov
11 min read

A free desktop automation tool whose output was written for an LLM to read

Most of the products on a free desktop automation roundup were built for a human operator writing a script. They return element handles, image matches, control IDs. That output reads fine if a person is sitting in front of a REPL. It does not read at all if the caller is a language model on a single conversational turn. Terminator returns something different: a compact indexed YAML tree where every clickable element is prefixed with a #N, tagged with [ROLE] and a name, and annotated with bounds and state. The same index goes back into click_element, type_into_element, and wait_for_element as a one-integer argument. That loop is the product.

• MIT-licensed, agent-first output format, 35 MCP tools
• Tree output is #N [ROLE] name (bounds, state) per line, keyed 1..N
• click_element has three modes: selector, index, coordinates
• One npx command registers the full tool set in any MCP client

Why every other free tool on the list assumes a human

The shortlist repeats itself across every write-up on this subject. AutoHotkey, AutoIt, SikuliX, Pywinauto, Robot Framework, Winium, WinAppDriver, Ui.Vision, Power Automate Desktop, UiPath Community. They are grouped by language binding or by recognition strategy, but none of the comparisons pause to ask what the tool actually returns to its caller. That matters now because the caller is no longer a person opening an editor. It is a coding assistant that received a prompt, has to figure out what is on the screen, and then has to decide which element to act on. A return value that makes sense to a human reading docs does not automatically make sense to a model reading a single tool response.

• AutoHotkey: returns window IDs, control text
• AutoIt: returns control handles, pixel color
• SikuliX: returns image match coordinates
• Pywinauto: returns Python object handles
• Robot Framework: returns keyword call results
• Winium: returns WebDriver element IDs
• WinAppDriver: returns Appium element references
• Ui.Vision RPA: returns image recognition matches
• Power Automate Desktop: returns variables in a flow graph
• UiPath Community: returns activity outputs in Studio
• Terminator: returns indexed YAML for the model to read

The anchor: one function emits the tree, one cache holds the mapping, one argument is all the model needs

The output format lives in crates/terminator-mcp-agent/src/tree_formatter.rs. The function format_tree_as_compact_yaml walks a serialized UI element, assigns every element with bounds a 1-based index, and writes lines of the shape #N [ROLE] name (bounds: [x,y,w,h], focusable, value: ...). Elements without bounds get a dash prefix and no index so the model never tries to click them. The function also returns an index_to_bounds HashMap that the MCP agent keeps server-side. When the model calls click_element with index: 7, the agent resolves it to the stored role, name, bounds, and original selector. The index is the whole interface.

What the tree actually looks like on the wire

This is the shape of a minimal Notepad tree the way the MCP agent emits it. The indentation encodes containment. The #N is the handle. The role is the structural type. The name is the accessible label. Everything in parentheses is context the model may or may not need, depending on the task.

get_window_tree response (abridged)
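The original page embeds a live response here. The block below is an illustrative reconstruction, not a captured session: the names and bounds are made up, but the line shape follows the documented format, #N [ROLE] name (bounds, state) for clickable elements and a dash prefix for elements without bounds.

```yaml
#1 [Window] Untitled - Notepad (bounds: [0,0,1280,800], focusable)
  #2 [MenuBar] Application (bounds: [0,0,1280,24])
    #3 [MenuItem] File (bounds: [0,0,40,24], focusable)
    #4 [MenuItem] Edit (bounds: [40,0,44,24], focusable)
  #5 [Document] Text Editor (bounds: [0,24,1280,752], focusable, value: "")
  - [Group] StatusBarHost
    #6 [StatusBar] Status (bounds: [0,776,1280,24])
```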

None of this is a special debug view. It is the default response body of a single MCP tool call. A model seeing this for the first time has everything it needs to decide what to do next, in one block, without extra context.

The function that writes that tree

The formatter is straightforward Rust. It builds the output string, assigns the running index, and returns both the formatted text and the map used later to resolve an index back to coordinates. The interesting part is that the mapping is stored, not thrown away: that is how click_element is able to take an integer from the model and click the right element without the model ever sending bounds or selectors back.

tree_formatter.rs (format_tree_as_compact_yaml)
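The crate's real formatter also carries state flags, values, and selectors in the side table. As a rough sketch of just the walk-and-index mechanic, assuming a simplified UiNode struct that is not the crate's actual type:

```rust
use std::collections::HashMap;

// Simplified stand-in for the serialized UI element the real formatter walks.
struct UiNode {
    role: String,
    name: String,
    bounds: Option<(i32, i32, i32, i32)>, // (x, y, w, h)
    children: Vec<UiNode>,
}

// Depth-first walk: elements with bounds get a 1-based #N index plus an entry
// in the index -> bounds side table; elements without bounds get a dash prefix
// and no index, so the model never tries to click them.
fn format_compact_yaml(root: &UiNode) -> (String, HashMap<u32, (i32, i32, i32, i32)>) {
    let mut out = String::new();
    let mut index_to_bounds = HashMap::new();
    let mut next_index = 1u32;
    walk(root, 0, &mut out, &mut index_to_bounds, &mut next_index);
    (out, index_to_bounds)
}

fn walk(
    node: &UiNode,
    depth: usize,
    out: &mut String,
    map: &mut HashMap<u32, (i32, i32, i32, i32)>,
    next_index: &mut u32,
) {
    let indent = "  ".repeat(depth);
    if let Some((x, y, w, h)) = node.bounds {
        out.push_str(&format!(
            "{indent}#{} [{}] {} (bounds: [{x},{y},{w},{h}])\n",
            *next_index, node.role, node.name
        ));
        map.insert(*next_index, (x, y, w, h));
        *next_index += 1;
    } else {
        out.push_str(&format!("{indent}- [{}] {}\n", node.role, node.name));
    }
    for child in &node.children {
        walk(child, depth + 1, out, map, next_index);
    }
}
```

The point the sketch preserves is that the map is returned alongside the text, so the caller can retain it and resolve an integer back to bounds later.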

The click side is simpler still. Three modes, one check per mode, and a validation that exactly one is specified.

utils.rs (ClickElementArgs::determine_mode)
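A condensed sketch of that exactly-one-mode validation, with field types simplified from what the real ClickElementArgs carries (the actual struct also holds process and timeout context):

```rust
// Simplified argument struct; field names follow the three documented modes.
struct ClickElementArgs {
    selector: Option<String>,
    index: Option<u32>,
    coordinates: Option<(f64, f64)>,
}

#[derive(Debug, PartialEq)]
enum ClickMode {
    Selector(String),
    Index(u32),
    Coordinates(f64, f64),
}

impl ClickElementArgs {
    // Exactly one of the three fields must be present; zero or two is an error.
    fn determine_mode(&self) -> Result<ClickMode, String> {
        match (&self.selector, self.index, self.coordinates) {
            (Some(s), None, None) => Ok(ClickMode::Selector(s.clone())),
            (None, Some(i), None) => Ok(ClickMode::Index(i)),
            (None, None, Some((x, y))) => Ok(ClickMode::Coordinates(x, y)),
            (None, None, None) => Err("one of selector, index, or coordinates is required".into()),
            _ => Err("specify exactly one of selector, index, or coordinates".into()),
        }
    }
}
```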

The loop from prompt to pixel

Here is what happens when an assistant receives a request like "open Notepad, write hello, save the file." Nothing about the sequence below is Notepad-specific; the same six steps run against any window the accessibility API can see.

One turn from the model's point of view

1. Assistant prompt: 'fill out notepad'
2. get_window_tree: UIA cache read
3. Indexed YAML: #1..#37 with role, name, bounds
4. Model reads: picks #5
5. click_element: index: 5
6. UIA click: bounds center

Traditional tool output vs. what an assistant can read

To see why the format matters, put a typical free-tool response next to the indexed YAML. Both describe the same four buttons. Only one of them is usable by a model that has no prior state.

A human-first dump vs. an agent-first tree

A stream of element handles or image match references. The script author knows what each one is because they wrote the script. A model receiving this cold has no frame of reference.

  • ControlID: 0x0041A, ClassName: Button
  • Image match at (420, 300) with confidence 0.87
  • UIAutomationElement#0x7ff9e41b2010
  • WebDriver session element-ref:c3f9-4a...
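For contrast, the same four buttons rendered in the indexed form the agent emits. The first line matches the example used later in the FAQ; the other bounds are illustrative:

```yaml
#1 [Button] Save (bounds: [420,300,80,30], focusable)
#2 [Button] Save As (bounds: [510,300,90,30], focusable)
#3 [Button] Cancel (bounds: [610,300,80,30], focusable)
#4 [Button] Help (bounds: [700,300,80,30], focusable)
```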

The exchange between assistant, agent, and the OS

Three actors, eight messages, one state transition per round trip. The MCP agent is what sits between the model and the Windows or macOS accessibility API. It is the piece that knows how to read the cached tree and how to translate an index back into a coordinate click.

Assistant -> MCP agent -> OS accessibility API

Assistant -> terminator-mcp-agent: get_window_tree(process: 'notepad')
terminator-mcp-agent -> Windows UIA / macOS AX: IUIAutomation::GetRootElement
Windows UIA / macOS AX -> terminator-mcp-agent: cached element tree
terminator-mcp-agent -> Assistant: indexed YAML (index_to_bounds stored on the agent)
Assistant -> terminator-mcp-agent: click_element(index: 5)
terminator-mcp-agent -> Windows UIA / macOS AX: IUIAutomationElement::Invoke at #5 bounds
Windows UIA / macOS AX -> terminator-mcp-agent: ok
terminator-mcp-agent -> Assistant: status: executed_without_error

The five perception modes, same tree shape every time

The indexed YAML is the unit of output. What feeds into it can change: UIA by default, browser DOM from an optional extension, OCR for canvas widgets, OmniParser for icon detection, Gemini for last-resort vision. Each one is a boolean flag on the same get_window_tree call, and each one merges its own indexed nodes into the tree the model reads.

UIA tree (default)

Always on. Indexed YAML from IUIAutomationCacheRequest. One round trip, no network, no CPU spike. This is the tree the model reads on every iteration.

include_browser_dom

Adds DOM-level elements from the Terminator Chrome extension. Indexed the same way, so the model's prompt stays the same shape.

include_ocr

Runs Tesseract locally and surfaces OcrWord nodes with their own #N indices. For canvas widgets UIA cannot see.

include_omniparser

Posts the window screenshot to an Omniparser backend (self-host with OMNIPARSER_BACKEND_URL). Bounding boxes come back as indexed nodes.

include_gemini_vision

Opt-in, BYO Gemini API key. Used only when the other tiers miss something. The only tier that can bill your card.
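Assuming standard MCP tool-call framing, enabling a tier is one extra boolean on the same call. The argument names come from the tier list above; the exact request envelope depends on your MCP client:

```json
{
  "name": "get_window_tree",
  "arguments": {
    "process": "notepad",
    "include_ocr": true
  }
}
```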

The MCP agent sits between assistants and the OS

Every major AI coding assistant speaks the Model Context Protocol. On the other side, every major desktop operating system exposes an accessibility API. The agent is the bridge: one piece of code, one tree format, one set of tools, regardless of which assistant talks to it or which OS it is reading.

One agent, many assistants, every accessibility API

Claude Code / Cursor / Codex / Windsurf -> terminator-mcp-agent -> UIA (Windows) / AX (macOS) / AT-SPI (Linux)

The four-step install that exposes the whole loop

There is no onboarding flow, no sign-in, no license key. The agent is an npx command. The assistant is whatever MCP client you already use. The rest is the model reading a tree.

1. Install the MCP agent. Add one block to your assistant's MCP config that runs 'npx -y terminator-mcp-agent@latest'. No system service, no driver binary, no daemon.

2. Ask the assistant to list windows. The first thing the model usually does is call get_applications_and_windows_list. That gives it the process names it can then pass to get_window_tree.

3. Let it get the tree. get_window_tree returns the indexed YAML block. The model reads it on that single turn. The index_to_bounds cache lives on the agent, not the model.

4. Click, type, press, wait. Every action tool accepts the index it just read. There is no middle translation step. The model converges on the right element by naming an index, not by writing a CSS-like string.

claude_desktop_config.json
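A minimal config matching the one-line setup described in the FAQ below; this is the whole file if you have no other MCP servers registered:

```json
{
  "mcpServers": {
    "terminator-mcp-agent": {
      "command": "npx",
      "args": ["-y", "terminator-mcp-agent@latest"]
    }
  }
}
```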

What a real session looks like on the stdio

Here is one opening of Notepad, one get_window_tree, one click, one type, one save. Every line after the npx invocation is a log message the agent prints to stderr, so you can watch the loop resolve.

claude_desktop -> terminator-mcp-agent (stderr tail)

The comparison that belongs on every free-tools list

"Free" is a cost attribute. The attribute that matters once an AI assistant is the operator is the shape of the output, not dollars. Here is where Terminator lands on each axis a typical roundup leaves blank.

Feature | Typical free desktop automation tool | Terminator
What the tool returns to the caller | Element handles, image matches, XPaths, control IDs | Indexed YAML tree (#N [ROLE] name ...)
Designed input shape | A human writing a script interactively | A language model on a single conversational turn
Click interface | Selector string or pixel coordinates | Three modes: selector, index, or (x, y)
Cross-process tree | Usually scoped to one app or driver session | Any running process by name or PID
Selector language | Tool-specific DSL or object chain | Playwright-style: role:Button|name:Save >> name:OK
Accessibility-first | Image matching or Win32 messages (mostly) | UIA on Windows, AX on macOS, AT-SPI on Linux
AI assistant install path | None; they predate the use case | npx -y terminator-mcp-agent@latest
Functional gating | Seat caps, premium connectors, trial clocks (varies) | MIT, no gating

The numbers that actually describe the surface

The whole agent is small: 35 MCP tools registered, 3 click modes, and 5 perception tiers that all collapse into the same indexed tree. A Notepad window typically produces about 37 indexed nodes; a Chrome window is an order of magnitude larger, but the shape is identical.

1 index

The whole interface between a language model and a desktop app can be one integer. That is what the tree format enables.

tree_formatter.rs + utils.rs, Terminator MCP agent

Why this is the part a roundup never covers

A free-tools article is graded on whether it mentions the expected dozen products. The shape of the return value is outside that grade. But once you are installing a desktop automation tool to hand to an AI coding assistant rather than to write your own scripts against, the return value is the product. An image match is not an interface for a model. An element handle is not an interface for a model. An XPath is not an interface for a model. An indexed YAML tree with a side-table of bounds, retained on the agent, accepting an integer back, is. That is the gap this page tries to fill in.

Install with claude mcp add terminator "npx -y terminator-mcp-agent@latest", ask your assistant to open any window, and watch the first get_window_tree response scroll by. The format is the point.

Have your assistant read a tree with you on the call

Bring the app you want automated. We will open a live MCP session, take one get_window_tree snapshot together, and walk through how the model would act on the indices before you write any integration code.


Frequently asked questions

What makes a free desktop automation tool actually usable by an AI coding assistant?

The tool has to return a representation of the screen that a language model can read with no extra work. Human-first tools return element handles, image match coordinates, or XPaths that only make sense if you are going to read source docs and debug interactively. An assistant-first tool returns a structured, indexed, named tree where each interactive element has a single identifier the model can pass back to a click or type call. Terminator's get_window_tree returns a compact YAML tree where every clickable element is prefixed with a #N index and tagged with [ROLE] name (bounds, state flags). The same index is then accepted by click_element, type_into_element, invoke_element, and so on. The model reads the tree, picks an index, and fires the next tool call. It does not need to build a selector, parse a screenshot, or retry on pixel mismatch.

How does this tree format differ from what Pywinauto, AutoHotkey, or SikuliX return?

Pywinauto returns Python object handles that the script has to walk (app.Notepad.Edit). AutoHotkey returns window IDs and control text that you compose into WinActivate and ControlClick commands. SikuliX returns image match objects that you click by pixel. Each of those formats is readable by a human who wrote the automation. None of them are a drop-in input for a language model reasoning turn by turn. Terminator's format is the opposite: #1 [Button] Save (bounds: [420,300,80,30], focusable) is one line the model can reproduce verbatim as the argument to the next tool call. The mapping from index to role, name, bounds, and selector is stored server-side in the MCP agent's index_to_bounds cache, so the model never has to hold state about which window was active.

Where exactly does the tree format live in the source?

It is defined in crates/terminator-mcp-agent/src/tree_formatter.rs. The function format_tree_as_compact_yaml walks a SerializableUIElement, emits #N [ROLE] name (bounds: [...], additional context) for every element with bounds, and writes - [ROLE] name for elements without bounds. It also returns an index_to_bounds HashMap<u32, (role, name, bounds, selector)>. That map is what click_element looks up when the model passes index: 7 instead of a selector string.

What are the three click modes, and why does an assistant need all three?

ClickElementArgs in crates/terminator-mcp-agent/src/utils.rs has a determine_mode method that validates exactly one of three patterns: selector mode (process plus a string like 'role:Button|name:Save'), index mode (an integer index from the last get_window_tree call), and coordinate mode (absolute screen x and y). The assistant uses index mode by default because that is the cheapest turn in a conversation: one number, one tool call. Selector mode is for deterministic replay across sessions where the index would change. Coordinate mode is the last resort for things accessibility APIs cannot see, like a canvas-drawn button inside a game or a widget inside a WebGL viewport.

How many MCP tools does the agent register, and are they free to use?

The Rust implementation in server.rs exposes roughly thirty-five tools via the rmcp tool_router: get_window_tree, click_element, type_into_element, press_key, press_key_global, scroll_element, activate_element, select_option, set_selected, invoke_element, set_value, wait_for_element, validate_element, highlight_element, mouse_drag, capture_screenshot, open_application, navigate_browser, execute_browser_script, run_command, read_file, write_file, edit_file, copy_content, glob_files, grep_files, typecheck_workflow, execute_sequence, ask_user, delay, stop_execution, get_applications_and_windows_list, gemini_computer_use, and a couple of inspect overlays. Every one of them is shipped under the MIT license. The only one that can bill your card is gemini_computer_use because it calls the Google Gemini API with your own key, and it is opt-in.

Does Terminator run on macOS and Linux, or just Windows?

The core terminator crate has a platforms/ directory with adapters behind a trait. The Windows adapter (platforms/windows/) uses the IUIAutomation COM interface and IUIAutomationCacheRequest for batched tree reads. The macOS adapter uses the AX accessibility API. Linux support is partial and gated on AT-SPI. The same Rust API is re-exported into Node via the terminator-nodejs binding and into Python via terminator-python, so you install once and get the same tree shape everywhere the accessibility API can see the window.

What is the quickest way to point an AI assistant at it?

One line in your MCP config. For Claude Code or Claude Desktop: 'mcpServers': { 'terminator-mcp-agent': { 'command': 'npx', 'args': ['-y', 'terminator-mcp-agent@latest'] } }. Restart the assistant, and it registers the full tool set. The first call the model usually makes is get_window_tree with a process name like 'notepad'. The response is the indexed YAML tree, and every subsequent click or type references an index from that tree. There is no daemon to install, no license server to talk to, no bot registration step.

Why do other free tools struggle when an AI assistant drives them?

Because they were designed for a person writing a script once, not for a loop of tool calls made by a model that wakes up fresh on every turn. AutoHotkey expects the script author to know the ControlClass name in advance. SikuliX expects you to have already captured the reference image. Robot Framework expects you to write keywords. Those assumptions hold when a human is at the keyboard. They break the moment the caller is a language model that just received a prompt and has no prior context. Terminator's tree is self-describing: the model reads one YAML block and has everything it needs to act. Nothing has to be prepared ahead of time.

Is any of this a paid tier in disguise?

No. The terminator crate is on crates.io as terminator-rs, the Node binding is on npm as @mediar-ai/terminator, the Python binding is on PyPI as terminator.py, and the MCP agent is on npm as terminator-mcp-agent. All MIT, no functional gating, no online activation. Mediar (the company that develops it) sells a managed workflow product that uses Terminator underneath; that product is paid, but the underlying library and MCP agent are not.

terminator: Desktop automation SDK
© 2026 terminator. All rights reserved.