Guide · April 2026

Open source computer use agents in April 2026, and the math most survey posts skip

Every list of open source computer use agents sorts the field into vision, accessibility, or hybrid and moves on. The interesting code lives in the hyphen. Terminator is the project that publishes it: four arithmetic transforms in Rust that turn a vision model's normalized 0-999 click into a real desktop pixel, living in the same MCP process as 32 selector-based tools.

$ claude mcp add terminator "npx -y terminator-mcp-agent@latest"

Install the MCP server in Claude Code (also works for Cursor, VS Code, Windsurf via MCP config)

Matthew Diakonov
11 min read
4.8 from 1.3k Rust crate installs / month
Source: terminator-computer-use/src/lib.rs
MIT license, runs locally
Ships the hybrid bridge in public

The list posts all say the same three words

Search “open source computer use agents April 2026” and the first page repeats itself. UI-TARS, Browser Use, Microsoft UFO, Agent S, Agent Zero, Open Interpreter, LaVague, Fazm. Each post labels every entry vision, accessibility, or hybrid. None of them open the source on a hybrid agent and show what hybrid actually means in code. That is the gap this page fills, using the one project in that list whose hybrid bridge ships in public: Terminator.

The bridge is nine lines of arithmetic. Written out with comments it is about thirty. But getting any of those lines wrong puts every click half an inch to the left of the button, and none of the other posts show you what those lines actually look like.

The field, April 2026

UI-TARS · Browser Use · Open Interpreter · Microsoft UFO · Agent S · Agent Zero · LaVague · Fazm · Terminator · Self-Operating Computer

999 — Max norm coord Gemini emits
4 — Transforms per click
32 — MCP tools on the other path
24 — Function keys handled, f1 to f24

The two architectures, end to end

A pure-vision agent sees a PNG, picks a pixel, and hands the pixel back. A pure-selector agent reads the accessibility tree, matches a node by role and name, and invokes it through the OS. Every survey post acknowledges both. What they rarely acknowledge is that a hybrid agent has to do both at the same time in the same process, and that means the boundary between “what the model returned” and “what the OS accepts” is real code someone has to write.

Vision-native vs selector-native, in practice

(Vision path = screenshot loop. Selector path = Terminator MCP.)

What the model looks at
- Vision: A PNG screenshot, every turn, re-uploaded. That is the whole input.
- Selector: The accessibility tree as YAML or JSON, fetched once, diffed on change.

What the model returns
- Vision: A pixel coordinate pair like [487, 341], or a normalized 0-999 pair the harness must project back to the screen.
- Selector: A selector string like "role:Button && name:Save". No coordinates leave the model.

Where DPI / resize / window offset is handled
- Vision: In your harness. If you forget any of the four transforms, clicks land off the button. Silently.
- Selector: Nowhere the model can see. UIA gives the element, we invoke it.

Cost per action
- Vision: One image upload plus one model call. Scales with workflow length.
- Selector: One MCP stdio call. The model round-trip is amortised across the whole plan.

Behaviour on a button moving 12 pixels
- Vision: The old coordinate misses. The LLM does not notice unless the screenshot still shows the button.
- Selector: role:Button && name:Save still resolves. A selector is an identity, not a coordinate.

When it is genuinely the right tool
- Vision: Canvas apps. Games. Any surface the accessibility tree does not describe. Drawing software. Custom-rendered dashboards.
- Selector: Normal desktop apps, browsers, anywhere the OS already knows what the buttons are.

Both paths ship in the same Terminator binary. You can call click_element and gemini_computer_use against the same desktop in the same session.

What a hybrid call actually moves through

A single Gemini Computer Use round-trip passes through three inputs on the way in and three transforms on the way out. In Terminator both ends meet inside the terminator-computer-use crate.

terminator-computer-use: one round-trip

Desktop capture → image resize → base64 encode → model returns a normalized 0-999 coordinate → convert_normalized_to_screen (inside terminator-computer-use) → UIA click

The nine lines almost nobody publishes

Here is the function at crates/terminator-computer-use/src/lib.rs:336. It is the whole reason hybrid computer use is possible on a Windows laptop with a HiDPI display.

crates/terminator-computer-use/src/lib.rs
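A minimal sketch of the function, reconstructed from the four transforms described in the next section. The signature and parameter names are assumptions, not the shipped source:

```rust
/// Sketch of the coordinate bridge. Parameter names are assumptions; the real
/// function lives at crates/terminator-computer-use/src/lib.rs:336.
fn convert_normalized_to_screen(
    x_norm: f64,       // 0-999 grid value emitted by the model
    y_norm: f64,
    screenshot_w: f64, // dimensions of the (resized) image the model saw
    screenshot_h: f64,
    resize_scale: f64, // factor used to shrink the capture before upload
    dpi_scale: f64,    // Windows display scaling, e.g. 1.5 for 150%
    window_x: f64,     // window top-left on the virtual desktop (logical px)
    window_y: f64,
) -> (i32, i32) {
    // Step 1: leave the 0-999 model grid for screenshot pixel space.
    let x = x_norm * screenshot_w / 1000.0;
    let y = y_norm * screenshot_h / 1000.0;
    // Step 2: undo the pre-upload resize to get raw captured pixels.
    let x = x / resize_scale;
    let y = y / resize_scale;
    // Step 3: physical pixels -> logical pixels for UI Automation.
    let x = x / dpi_scale;
    let y = y / dpi_scale;
    // Step 4: window-relative -> virtual-desktop coordinates.
    ((x + window_x).round() as i32, (y + window_y).round() as i32)
}
```

With every scale at 1.0 and a zero offset the function is the identity projection from the 0-999 grid onto the screenshot; each real-world condition (resize, HiDPI, window position) adds exactly one of the remaining lines.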

You can verify this by cloning the repo and running cargo test -p terminator-computer-use; the tests at the bottom of the same file cover the no-scaling case, the window-offset case, and the f1 through f24 range, so the bridge is not just shipped, it is checked in CI.

The four transforms, one at a time


Step 1 — Leave model space

Gemini Computer Use emits x and y in a normalized 0-999 grid. Multiply by screenshot_w and screenshot_h, divide by 1000. You are now in screenshot pixel space.


Step 2 — Undo the resize

Vision models have an image size cap. Terminator shrinks the capture before sending, recording a resize_scale. Divide by that scale to get back to the raw captured pixels.
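For illustration, computing that scale might look like this. The function name is hypothetical, and the 1024 px long-edge cap is taken from the FAQ below, which says Gemini Computer Use caps the long edge around 1024:

```rust
// Hypothetical helper: the factor used to shrink a capture so its long edge
// fits a cap (the article mentions a cap around 1024 px).
fn resize_scale(raw_w: u32, raw_h: u32, long_edge_cap: u32) -> f64 {
    let long_edge = raw_w.max(raw_h) as f64;
    // Never upscale: a capture already under the cap keeps scale 1.0.
    (long_edge_cap as f64 / long_edge).min(1.0)
}
```

A 2048×1152 capture with a 1024 cap gives resize_scale = 0.5; Step 2 divides by that 0.5 to get back to raw pixels.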


Step 3 — Physical to logical pixels

Screenshots live in physical pixels. Windows UI Automation lives in logical pixels. On a 150% scaling display the divisor is 1.5. Skip this step and every click on a HiDPI monitor lands a third of the way down the button.


Step 4 — Add the window offset

Up to this point every coordinate has been relative to the window we screenshotted. Now add window_x and window_y to land on the virtual desktop. This is what we actually click.
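Put together with assumed numbers (none of them from the repo): a 2000×1200 capture shrunk by half to a 1000×600 upload, a 150% display, a window whose top-left sits at logical (100, 50), and a model click at (612, 284) on the 0-999 grid:

```rust
// Worked example of the four steps, all inputs assumed for illustration.
fn worked_example() -> (i32, i32) {
    let (x_norm, y_norm) = (612.0_f64, 284.0_f64);
    let (screenshot_w, screenshot_h) = (1000.0, 600.0); // resized upload
    let resize_scale = 0.5; // capture was shrunk to half before upload
    let dpi_scale = 1.5;    // 150% Windows display scaling
    let (window_x, window_y) = (100.0, 50.0); // window top-left, logical px

    // Step 1: model grid -> screenshot pixels: (612.0, 170.4)
    let (x, y) = (x_norm * screenshot_w / 1000.0, y_norm * screenshot_h / 1000.0);
    // Step 2: undo the resize -> raw capture pixels: (1224.0, 340.8)
    let (x, y) = (x / resize_scale, y / resize_scale);
    // Step 3: physical -> logical pixels: (816.0, 227.2)
    let (x, y) = (x / dpi_scale, y / dpi_scale);
    // Step 4: add the window offset -> virtual desktop coordinates.
    ((x + window_x).round() as i32, (y + window_y).round() as i32)
}
```

The click lands at (916, 277). Skip step 3 and it lands at (1324, 391) instead, well off the button.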

Try it in your editor

Run the bridge against your own desktop in under a minute

One command wires Terminator into Claude Code as an MCP server. The same binary exposes the 32 selector tools and the Gemini Computer Use loop. Swap claude for cursor, or edit Windsurf's mcp.json, to use it elsewhere.

$ claude mcp add terminator "npx -y terminator-mcp-agent@latest"

Install the MCP server in Claude Code (also works for Cursor, VS Code, Windsurf via MCP config)

The key translator is the smaller bridge, and it fails loudly

A vision model does not know or care about Windows UI Automation key syntax. It emits what it learned: control+a, Meta+Shift+T. Terminator ships a deterministic translator that refuses to guess. Unknown keys return a typed error, not a best-effort attempt. This matters because a silent miss on a keyboard shortcut is indistinguishable from a slow app on a noisy CI run.

crates/terminator-computer-use/src/lib.rs (trimmed)
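A minimal sketch of the behaviour described here and in the FAQ: plus-separated tokens, bracketed modifier forms, named keys, an explicit f1-f24 range, and a typed error for anything unknown. The exact bracket strings and helper names are assumptions, not the shipped code:

```rust
fn capitalize(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
        None => String::new(),
    }
}

/// Sketch of the key translator; bracket forms are assumptions.
fn translate_gemini_keys(input: &str) -> Result<String, String> {
    let tokens: Vec<&str> = input.split('+').collect();
    let mut out = String::new();
    for (i, tok) in tokens.iter().enumerate() {
        let is_last = i + 1 == tokens.len();
        let lower = tok.to_ascii_lowercase();
        let piece = match lower.as_str() {
            // Modifiers map to bracketed UIA forms.
            "control" | "ctrl" => "{Ctrl}".to_string(),
            "shift" => "{Shift}".to_string(),
            "alt" => "{Alt}".to_string(),
            "meta" | "win" => "{Win}".to_string(),
            // Named keys.
            "enter" => "{Enter}".to_string(),
            "tab" => "{Tab}".to_string(),
            "escape" | "esc" => "{Esc}".to_string(),
            "backspace" => "{Backspace}".to_string(),
            "delete" => "{Delete}".to_string(),
            "space" => " ".to_string(), // assumed literal space
            "up" | "down" | "left" | "right" => format!("{{{}}}", capitalize(&lower)),
            // Explicit 1..=24 range: a hallucinated 'f99' fails loudly.
            f if f.starts_with('f')
                && f[1..].parse::<u8>().map_or(false, |n| (1..=24).contains(&n)) =>
            {
                format!("{{F{}}}", &f[1..])
            }
            // Single ASCII characters are only valid as the last segment.
            c if c.len() == 1 && c.is_ascii() && is_last => c.to_string(),
            other => return Err(format!("unknown key token: {other}")),
        };
        out.push_str(&piece);
    }
    Ok(out)
}
```

So "control+a" becomes "{Ctrl}a" and "Meta+Shift+T" becomes "{Win}{Shift}t", while "f99" returns an error instead of a best-effort keystroke.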

The two paths live next to each other in one match block

In Terminator's MCP server, gemini_computer_use is not a separate binary or a plugin. It is one arm of the same dispatch_tool match block that exposes click_element, type_into_element, and the other 29 selector-based tools. A model driving an agent session can choose, per step, whether to spend a screenshot round-trip or a selector lookup.

crates/terminator-mcp-agent/src/server.rs
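The shape of that dispatch, sketched with stub bodies standing in for the real handlers. The tool names are from the repository's tool list; the signature and the stubs are assumptions, not the shipped server.rs:

```rust
// Stubs standing in for the real handlers; purely illustrative.
fn click_element(_args: &str) -> Result<String, String> {
    Ok("element invoked via UIA".into())
}
fn type_into_element(_args: &str) -> Result<String, String> {
    Ok("text typed".into())
}
fn gemini_computer_use(_args: &str) -> Result<String, String> {
    Ok("vision step executed".into())
}

// The point: the vision loop is one arm of the same match block that
// dispatches the selector-based tools, not a separate binary or plugin.
fn dispatch_tool(tool_name: &str, args: &str) -> Result<String, String> {
    match tool_name {
        // Selector path: resolved through the accessibility tree.
        "click_element" => click_element(args),
        "type_into_element" => type_into_element(args),
        // ... the remaining selector-based arms elided ...
        // Vision path: same process, same match, one screenshot round-trip.
        "gemini_computer_use" => gemini_computer_use(args),
        other => Err(format!("unknown tool: {other}")),
    }
}
```

Because both arms live in one match, a model can interleave cheap selector lookups and expensive vision steps within a single session without switching servers.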

What actually happens per vision step

Read this like a latency budget. Every request arrow is a wait you are paying for. The selector path collapses the three left-most arrows into zero, which is why it wins for normal apps.

gemini_computer_use, one step

Your agent → MCP (Terminator): tool_use: gemini_computer_use { goal: "post this to Slack" }
MCP: captures window bitmap + dpi_scale + window_x, window_y
MCP → Gemini backend: call_computer_use_backend(image, goal, previous_actions)
Gemini backend → MCP: function_call { name: "click_at", args: { x: 612, y: 284 } }
MCP: convert_normalized_to_screen(612, 284, ...) -> (x_screen, y_screen)
MCP → Windows UIA: ElementFromPoint(x_screen, y_screen).Invoke()
Windows UIA → MCP: click dispatched, UI diff captured
MCP → Your agent: ComputerUseStep { success: true, screenshot, reasoning }

The cast, April 2026

Six open source projects that credibly call themselves computer use agents today. Each picks a different place on the architecture spectrum.

Browser Use

DOM-aware browser agent. Lives inside the page, not on top of the OS. Crossed 50k stars on GitHub in Q1 2026. Great fit when the target is a web tab.

UI-TARS

Purpose-built vision model for computer use, from ByteDance. Screenshots in, actions out. Runs fully local once the weights are pulled.

Microsoft UFO

Windows-native dual-agent system built on UI Automation. Reached production quality in early 2026. Accessibility-first, vision on demand.

Agent S

Simular AI's open agentic framework. Uses a computer like a human: plans, observes, acts. Screenshot-driven with memory.

Open Interpreter

Code-execution-first. It will write and run Python or Node over your OS as readily as it will click things. Not strictly a GUI agent but lives in the same space.

Terminator

Windows UIA selectors plus an embedded Gemini Computer Use loop in the same MCP process. The hybrid bridge and the selector stack ship in the same binary.

300 s — Backend timeout per call
1024 px — Image long edge cap before upload
24 — Function keys handled (f1-f24)
1 — Binary ships both paths

Ship the hybrid bridge with your next agent

Open source, MIT licensed, Rust core with Python and TypeScript bindings. One npx command registers the MCP server against Claude Code, Cursor, VS Code, or Windsurf. The convert_normalized_to_screen function and the 32 selector tools live in the same binary.

Install Terminator

What the public bridge actually unlocks

When the coordinate conversion and the key translator are real code you can read, three things get easier. You can run the same agent harness on a 100% display and a 175% display without re-tuning anything, because dpi_scale is just a parameter. You can swap the model backend by changing one env var (GEMINI_COMPUTER_USE_BACKEND_URL) and the math continues to work. And you can write deterministic tests for the bridge (the file has them) so a regression in the resize path does not silently move every click.

Building an agent that needs to click real buttons?

20 minutes with the maintainers. Bring the app you are trying to automate and the framework you have been fighting (PyAutoGUI, AutoHotkey, UIA, screenshot loops). We will sketch a selector-first plan with a vision fallback for the gnarly surfaces.

Frequently asked questions

What does 'open source computer use agent' mean in April 2026?

A computer use agent is a system where an AI model can see the state of a computer and drive its input devices — cursor, keyboard, scroll — toward a goal. Open source means the harness, the tools, and usually the system prompt are on GitHub under a permissive license; the model itself can be proprietary (Claude, Gemini) or open (UI-TARS, Qwen-VL). In April 2026 the space splits into three architectures: pure screenshot-and-vision (works on any OS, slow and expensive), pure accessibility-tree selectors (fast and robust, OS-specific), and hybrid systems that start with selectors and fall back to vision when the tree does not describe the element.

What does Terminator do that most of the other open source agents do not?

Two things in one binary. First, it exposes 32 selector-based tools over MCP, so a coding assistant like Claude or Cursor drives the desktop by role:Button && name:Save rather than pixel coordinates. Second, it also exposes a vision loop via the gemini_computer_use tool, and it ships the bridge code that connects the two worlds. That bridge is the convert_normalized_to_screen function at crates/terminator-computer-use/src/lib.rs line 336, which takes a model's 0-999 normalized coordinate and applies four transforms (divide by 1000, divide by resize_scale, divide by dpi_scale, add window offset) to produce a real pixel. Most hybrid agents keep that math private; Terminator ships it in public Rust.

Why four transforms? Is one of them optional?

None of them are optional on a real Windows machine. Step 1 exists because the model is trained on a normalized 0-999 grid, not your display. Step 2 exists because we have to resize the screenshot before uploading (Gemini Computer Use caps the long edge around 1024 pixels); the scale we used to shrink has to be undone to find where in the original capture the click belongs. Step 3 exists because screenshots live in physical pixels but UI Automation coordinates live in logical pixels, and on a 150% scaled display those differ by a factor of 1.5. Step 4 exists because up to that point we have been measuring inside the screenshotted window, and we have to add the window's top-left to land on the virtual desktop. Skip any one of them and your click is either off the button or off the screen.

Where does the key translator fit in?

Gemini emits keyboard input in one dialect (control+a, Meta+Shift+T) and Windows UIA accepts a different one ({Ctrl}a, {Win}{Shift}t). translate_gemini_keys in the same file (lines 212 to 313) walks each plus-separated token, maps modifiers to bracket forms, handles enter, tab, escape, backspace, delete, space, arrows, and the f1 through f24 range with an explicit 1..=24 match so a model hallucinating 'f99' fails loudly instead of silently. Single ASCII characters are only valid as the last segment of a combination. It is small code, but it is the kind of code a survey post never shows.

How do I see the 32 selector-based tools for myself?

Open crates/terminator-mcp-agent/src/server.rs and jump to the dispatch_tool function (around line 9953 at the time of writing). There is a single match block on tool_name; each arm is one MCP tool. At the time of this page: get_window_tree, get_applications_and_windows_list, click_element, type_into_element, press_key, press_key_global, validate_element, wait_for_element, activate_element, navigate_browser, execute_browser_script, open_application, scroll_element, mouse_drag, highlight_element, select_option, set_selected, capture_screenshot, invoke_element, set_value, execute_sequence, run_command, delay, stop_highlighting, stop_execution, gemini_computer_use, read_file, write_file, edit_file, copy_content, glob_files, grep_files. The vision loop is one of the 32.

Is Terminator Windows-only?

Today, yes. The README is explicit: Windows support is the stable surface; macOS and Linux are not currently supported for the automation features. The computer use crate itself compiles cross-platform, but the UIA invocations that execute the converted coordinates are Windows APIs. If you are on macOS and want an open source, accessibility-first agent, Fazm is the closest counterpart in the April 2026 space.

When should I prefer the vision path and when should I prefer selectors?

Prefer selectors whenever the accessibility tree describes what you want. Regular apps, browsers, native Windows dialogs — these all have a tree, and role + name selectors survive layout changes that break pixel coordinates. Prefer vision for surfaces that do not expose structure: canvas-based drawing apps, games, custom-rendered dashboards, WebGL content, remote desktops, anything where the 'button' is just painted pixels. A well-built hybrid agent tries the cheap path first and only pays for the screenshot round-trip when there is nothing structural to target.

Where does Terminator's MCP agent send the screenshot?

To https://app.mediar.ai/api/vision/computer-use by default, overridable with the GEMINI_COMPUTER_USE_BACKEND_URL environment variable. The payload is a JSON body with three fields: image (base64 PNG), goal (the user task), previous_actions (history with their screenshots and success flags). The backend proxies to Gemini. Source: terminator-computer-use/src/lib.rs line 140, call_computer_use_backend, with a 300-second timeout.
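From that description, the request body plausibly looks like the fragment below. The three top-level field names are from the answer above; the shape of each previous_actions entry beyond its screenshot and success flag is an assumption:

```json
{
  "image": "<base64 PNG of the resized capture>",
  "goal": "post this to Slack",
  "previous_actions": [
    { "screenshot": "<base64 PNG>", "success": true }
  ]
}
```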

terminator — Desktop automation SDK
© 2026 terminator. All rights reserved.