Open source computer use agent SDK: the deterministic + vision hybrid most guides skip
Most articles on this topic line up the same four projects: a browser-only library, a hosted operator product without source, a shell-execution agent, and a vision-loop framework. They argue about which one is “real.” They miss the more interesting answer. The SDK that sits closest to a working desktop agent is the one that ships both paths in the same install: a deterministic accessibility-tree runtime for the 90% of actions the OS already knows how to describe, and a built-in Gemini 2.5 Computer Use vision loop for the rest. Terminator is that SDK. Below is the actual Rust source that wires the two together, including the function that turns model output into screen pixels.
same runtime, every binding
The shape of an open source computer use SDK
A computer use agent has to do three things: observe the desktop, choose an action, and execute it. Most open source projects ship one of the three. A recorder gives you observation. A vision loop gives you choice. A driver like xdotool or PyAutoGUI gives you execution. Calling any one of those an SDK undersells what teams building real automations actually need.
Terminator ships all three, in the same install, with the same type system across Rust, Node, and Python, plus an MCP server that exposes them to any agent host. The interesting design choice below the surface is that observation and choice come in two flavors. The fast flavor reads the OS accessibility tree directly, so “click Save” resolves locally without a model. The slow flavor talks to Gemini 2.5 Computer Use through a 402-line Rust crate, so a fully alien UI still works. You pick at the action level, not at the install level.
Both paths in one script
This is what the SDK looks like in practice. Two calls, one process, one MCP install. The first call resolves a selector against the live UIA tree. The second hands the goal to a vision loop and lets it work autonomously until done.
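A sketch of that script in the Rust flavor. The selector string and the two-path split come straight from this article; the constructor flags, the locator method shapes, and the name of the vision-call method are assumptions, so treat this as pseudocode-with-types and check the terminator-rs docs before copying:

```rust
use terminator::Desktop; // import path assumption; package is terminator-rs

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let desktop = Desktop::new(false, false)?; // constructor flags: assumption

    // Path 1, deterministic: resolve the selector against the live UIA tree.
    // No model call; the OS already knows this element's role and name.
    let save = desktop.locator("role:Button && name:Save").first(None).await?;
    save.click()?;

    // Path 2, vision: hand the goal to the agentic loop for UI the tree
    // can't describe. Method name mirrors the gemini_computer_use MCP
    // tool; the crate-level name is an assumption.
    desktop
        .gemini_computer_use("drag the red node onto the canvas grid")
        .await?;

    Ok(())
}
```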
How the pieces talk to each other
Every binding (Rust, Node, Python) and every host (Claude Code, Cursor, VS Code, Windsurf) routes through the same dispatch_tool in the MCP server, which then chooses between the deterministic UIA path and the vision-loop crate. The vision-loop crate itself calls a single backend endpoint, by default hosted but switchable with one environment variable.
Terminator SDK -> dispatch_tool -> deterministic OR vision
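The routing is easiest to see as code. A minimal sketch of the dispatcher's shape, not the real server.rs: the stub tool functions and the string error type are placeholders, and the real match block has 32 arms:

```rust
use serde_json::Value;

// Stubs standing in for the real tool implementations.
async fn click_element(_args: Value) -> Result<Value, String> { todo!() }
async fn get_window_tree(_args: Value) -> Result<Value, String> { todo!() }
async fn gemini_computer_use(_args: Value) -> Result<Value, String> { todo!() }

// Every binding and host funnels through one function like this,
// which picks the deterministic or the vision path per tool call.
async fn dispatch_tool(name: &str, args: Value) -> Result<Value, String> {
    match name {
        // Deterministic path: resolved against the OS accessibility tree.
        "click_element" => click_element(args).await,
        "get_window_tree" => get_window_tree(args).await,
        // Vision path: delegates to the terminator-computer-use loop.
        "gemini_computer_use" => gemini_computer_use(args).await,
        other => Err(format!("unknown tool: {other}")),
    }
}
```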
What ships in the install
Seven crates and packages, all under the same MIT license, all speaking to the same Rust core. The deterministic primitives are in terminator-rs. The vision integration is its own focused 402-line crate, which is small enough to read in one sitting.
Rust core
terminator-rs on crates.io. The platform adapters (Windows UIA, macOS AX), the selector engine, and the screenshot pipeline live here. Everything else is bindings.
Node SDK
@mediar-ai/terminator. napi-rs bindings, ships prebuilt binaries. Same Desktop, Locator, Element types as the Rust core.
Python SDK
pip install terminator. PyO3 bindings, .pyi stubs in the repo for IDE completion.
MCP agent
terminator-mcp-agent. 32 tools defined as match arms in server.rs. Compiled into a single binary, run via npx for any MCP host.
terminator-computer-use
402-line crate. Wraps Gemini 2.5 Computer Use into the runtime: backend call, key translation, normalized-to-screen pixel math.
Workflow recorder
terminator-workflow-recorder. Captures real human input and emits replayable YAML, so the deterministic path can be authored without writing selectors by hand.
CLI + KV
@mediar-ai/cli to drive workflows from a shell. @mediar-ai/kv to share state between deterministic and vision steps in the same run.
terminator-computer-use: 402 lines, three jobs
The vision-loop integration is a single Rust crate at crates/terminator-computer-use/src/lib.rs. It is exactly 402 lines and it has three jobs: speak HTTP to a vision backend, translate Gemini-flavored keys into Windows uiautomation strings, and convert the model's 0-999 normalized coordinates into real screen pixels. That's the whole crate. The Desktop method that drives the agentic loop sits in crates/terminator/src/computer_use/mod.rs and the MCP wrapper that exposes it to Claude / Cursor / VS Code sits at crates/terminator-mcp-agent/src/server.rs:8635. Below: the three functions that matter.
1. The HTTP boundary, lib.rs:140
One async function. One environment variable. One JSON shape. If you want to run your own model, replace the URL.
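The original listing did not survive here, so below is a sketch reconstructed from the contract described in the FAQ at the end of this article. Field names match that description; the fallback URL path is an assumption (only the app.mediar.ai host is documented):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize)]
struct BackendRequest<'a> {
    image: &'a str,                            // base64 PNG of the window
    goal: &'a str,                             // natural-language goal
    previous_actions: &'a [serde_json::Value], // history of prior steps
}

#[derive(Deserialize)]
struct BackendResponse {
    completed: bool,
    function_call: Option<serde_json::Value>,
    text: Option<String>,
    safety_decision: Option<serde_json::Value>,
}

async fn call_computer_use_backend(
    image_b64: &str,
    goal: &str,
    previous_actions: &[serde_json::Value],
) -> Result<BackendResponse, reqwest::Error> {
    // One environment variable decides the backend; the default hosted
    // endpoint's exact path is an assumption here.
    let url = std::env::var("GEMINI_COMPUTER_USE_BACKEND_URL")
        .unwrap_or_else(|_| "https://app.mediar.ai/api/computer-use".to_string());
    reqwest::Client::new()
        .post(&url)
        .json(&BackendRequest { image: image_b64, goal, previous_actions })
        .send()
        .await?
        .error_for_status()?
        .json()
        .await
}
```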
2. Key translation, lib.rs:212
Gemini 2.5 Computer Use emits keys in human format, like control+a or Meta+Shift+T. Windows UI Automation needs braced tokens like {Ctrl}a with meta mapped onto the Windows key. The translation is a flat match block. There are also unit tests in the same file pinning the mappings.
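A sketch of the translation's shape, reconstructed from the two examples above rather than copied from lib.rs; the real match block covers more keys, and its exact case handling should be checked against those unit tests:

```rust
/// Gemini emits human-format combos like "control+a" or "Meta+Shift+T";
/// Windows uiautomation wants braced tokens like "{Ctrl}a". Mapping
/// table abridged; non-modifier characters pass through unbraced.
fn translate_gemini_keys(combo: &str) -> String {
    combo
        .split('+')
        .map(|k| match k.to_ascii_lowercase().as_str() {
            "control" | "ctrl" => "{Ctrl}".to_string(),
            "shift" => "{Shift}".to_string(),
            "alt" => "{Alt}".to_string(),
            // Gemini's "meta" maps onto the Windows key.
            "meta" | "super" => "{Win}".to_string(),
            "enter" | "return" => "{Enter}".to_string(),
            "tab" => "{Tab}".to_string(),
            "escape" | "esc" => "{Esc}".to_string(),
            _ => k.to_string(),
        })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn pins_the_examples_from_the_text() {
        assert_eq!(translate_gemini_keys("control+a"), "{Ctrl}a");
        assert_eq!(translate_gemini_keys("Meta+Shift+T"), "{Win}{Shift}T");
    }
}
```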
3. The pixel math, lib.rs:336
This is the function nobody else shows. The vision model returns coordinates in a 0-999 normalized space, regardless of the screenshot it was shown. Mapping back to your monitor is a stack of four conversions: map the 0-999 range onto the screenshot the model saw, undo the resize, undo the DPI scale, add the window offset. If any of those steps is wrong by a few pixels, the agent clicks the wrong thing.
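A sketch of those four steps, with the parameter order taken from the loop at the end of this article. One loud caveat: whether the DPI factor multiplies or divides depends on whether the screenshot was captured in logical or physical pixels, so verify the direction against the unit tests in lib.rs:

```rust
/// Sketch of the normalized-to-screen mapping. The real function lives
/// at lib.rs:336; the arithmetic here follows the prose above.
fn convert_normalized_to_screen(
    norm_x: f64, norm_y: f64,   // model output, 0-999 normalized space
    win_x: i32, win_y: i32,     // window origin in screen coordinates
    img_w: u32, img_h: u32,     // dimensions of the screenshot the model saw
    dpi_scale: f64,             // OS DPI scale, e.g. 1.5 for 150%
    resize_scale: f64,          // shrink factor applied before upload
) -> (i32, i32) {
    // 1. map 0-999 onto the screenshot the model actually saw
    let px = norm_x / 999.0 * img_w as f64;
    let py = norm_y / 999.0 * img_h as f64;
    // 2. undo the pre-upload resize (longest edge capped at 1920)
    let px = px / resize_scale;
    let py = py / resize_scale;
    // 3. DPI correction; direction is an assumption, check lib.rs tests
    let px = px * dpi_scale;
    let py = py * dpi_scale;
    // 4. add the window offset to land in absolute screen space
    ((win_x as f64 + px).round() as i32, (win_y as f64 + py).round() as i32)
}
```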
“The whole vision integration is 402 lines, MIT-licensed, and lives in one crate. The pixel-coordinate math is 25 lines. If you want to swap Gemini for another vision model, you change one URL and adapt one function call.”
crates/terminator-computer-use/src/lib.rs (HEAD)
Numbers you can verify
The line counts and tool counts come from running wc -l and grep on the repo. The 100x figure is the README's own claim about the deterministic path beating pixel-loop agents on long workflows; treat it as a target rather than a guarantee.
One step of the agentic vision loop
The MCP host fires gemini_computer_use once; inside, the SDK repeats the same step until the model returns completed: true or hits max_steps. A cancellation token is wired through, so the stop_execution tool can break the loop mid-step.
gemini_computer_use: one step
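In code, the control flow of one run looks roughly like this. The stub types and helper names are stand-ins (the article only names ComputerUseStep), and the token type assumes tokio_util's CancellationToken:

```rust
use tokio_util::sync::CancellationToken;

// Stand-ins so the control flow below is self-contained; the real
// implementations live in the terminator crates.
struct Step; // stands in for ComputerUseStep
struct Resp { completed: bool, function_call: Option<serde_json::Value> }
fn capture_and_resize_window() -> Result<Vec<u8>, String> { todo!() }
async fn call_backend(_: &[u8], _: &str, _: &[Step]) -> Result<Resp, String> { todo!() }
fn execute_function_call(_: Option<serde_json::Value>) -> Result<Step, String> { todo!() }

// One gemini_computer_use run: loop until the backend reports completed
// or max_steps is hit. The cancellation token lets stop_execution break
// the loop between steps.
async fn run_vision_loop(
    goal: &str,
    max_steps: usize,
    cancel: CancellationToken,
) -> Result<Vec<Step>, String> {
    let mut history = Vec::new();
    for _ in 0..max_steps {
        if cancel.is_cancelled() {
            return Err("cancelled by stop_execution".into());
        }
        let shot = capture_and_resize_window()?;               // screenshot, longest edge 1920
        let resp = call_backend(&shot, goal, &history).await?; // POST image + goal + history
        if resp.completed { break; }
        history.push(execute_function_call(resp.function_call)?); // keys, pixels, act
    }
    Ok(history)
}
```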
What you reach for, when
A simple decision rule: if the OS knows the element exists (it shows up in Accessibility Insights or inspect.exe), use a selector. If not, drop into vision. Same agent, different tool calls.
| Feature | Vision-only agent | Terminator (selector / vision hybrid) |
|---|---|---|
| Click a button you know exists | Vision loop: screenshot, infer pixels, click, screenshot, verify | click_element({ selector: 'role:Button && name:Save' }) - one MCP call |
| Read text from a dialog | Screenshot + OCR or model vision | get_window_tree returns Name and Value strings from the OS accessibility API |
| Drive a fully alien UI (canvas, game) | Vision loop, every action | gemini_computer_use({ process, goal }) - one tool call, agentic loop until done |
| Mix both in one workflow | Two installs, two harnesses | Same script. Selector path for known UI, vision path for the unknown. |
| Audit what the agent did | Screenshots only | ComputerUseStep history persisted to %LOCALAPPDATA%/terminator/executions/ |
How the SDK shape compares
Most projects in this space cover one slice of the problem. This is a feature-by-feature look at the slice they cover vs. what an SDK with both paths needs to ship.
| Feature | Single-path SDK | Terminator |
|---|---|---|
| Selector-driven actions (no model in the loop) | No (vision-only) or browser-only | Yes. 32 typed tools, Windows UIA + macOS AX (Windows GA, macOS partial). |
| Vision-loop fallback in the same package | Either/or, separate projects | gemini_computer_use ships in the same install. Same process. |
| Languages the SDK ships | Usually one (Python or TS) | Rust (terminator-rs), Node (@mediar-ai/terminator), Python (terminator), CLI, MCP. |
| MCP server included | Rare | Yes: 32 tools, one npx command, works in Claude Code, Cursor, VS Code, Windsurf. |
| Vision backend pluggable | Hardcoded to one provider | GEMINI_COMPUTER_USE_BACKEND_URL env var, point at any compatible endpoint. |
| Pixel coordinate math written out | Hidden inside the framework | lib.rs:336 convert_normalized_to_screen, ~25 lines of arithmetic, MIT-licensed. |
| Cursor / keyboard takeover | Yes, breaks during runs | Selector path uses accessibility APIs, your cursor stays free during the deterministic phase. |
| License | Mixed (Apache, AGPL, custom) | MIT. |
The audit checklist most people don't run
Before you commit to an open source SDK in this space, these are the things worth verifying yourself by opening the repo. Every item below is checkable in under five minutes.
Things to confirm in the source
- There is one place where tool names are defined (not duplicated between code and prompt).
- The vision-backend URL is pluggable, not hardcoded to a vendor.
- The pixel-coordinate conversion is visible and unit-tested.
- The deterministic and vision paths share the same action executors.
- Cancellation tokens propagate from the MCP host into the inner loop.
- License is permissive enough for commercial deployment (MIT or Apache-2.0).
- The same SDK is published in at least two languages so you are not locked in.
Two commands to a working hybrid agent
One command for the SDK, one command for the MCP runtime. The same npx package works for Claude Code, Cursor, VS Code, and Windsurf. Substitute pip install terminator or cargo add terminator-rs for the matching language.
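Concretely, for the Node + Claude Code combination (both commands appear verbatim in the FAQ below):

```sh
# SDK (Node flavor; prebuilt napi-rs binaries, no local toolchain needed)
npm i @mediar-ai/terminator

# MCP runtime, registered with Claude Code over stdio
claude mcp add terminator "npx -y terminator-mcp-agent@latest"
```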
From npm install to first hybrid run
The end-to-end flow once the SDK is installed. Five steps. The interesting one is step three: this is where you decide which actions belong on the deterministic path and which deserve the vision loop.
Install the SDK in your language of choice
npm i @mediar-ai/terminator (Node, prebuilt napi-rs binary), pip install terminator (Python, PyO3), cargo add terminator-rs (Rust). All three speak to the same core.
Inspect the target app's accessibility tree
Use Accessibility Insights for Windows or inspect.exe to walk the tree. Note the role and name of each element you plan to interact with. These become your selectors.
Decide where vision belongs
Anything the tree exposes (buttons, edits, menu items, lists) goes through Locator. Anything the tree refuses to describe (canvas elements, custom-rendered widgets, games) goes through gemini_computer_use.
Wire the MCP server if you want agent-host control
claude mcp add terminator "npx -y terminator-mcp-agent@latest" registers the server in Claude Code with stdio transport. Same npx command in the mcpServers JSON for Cursor, VS Code, Windsurf.
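For the JSON-configured hosts, the equivalent mcpServers block looks like this; the "terminator" key is your own label, and command plus args mirror the npx invocation:

```json
{
  "mcpServers": {
    "terminator": {
      "command": "npx",
      "args": ["-y", "terminator-mcp-agent@latest"]
    }
  }
}
```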
Run, inspect, iterate
Each run drops a JSON of ComputerUseStep entries plus per-step screenshots into %LOCALAPPDATA%/terminator/executions/. Replay or grep them when something fails. The deterministic path also captures a UI tree diff so you can see what changed between actions.
Vision-only vs hybrid, one click
The single click that separates the two. On the left: a pure vision loop, screenshot-in, coordinate-out, every action. On the right: the same outcome with the deterministic path, no model invoked. Both shapes coexist in the SDK; you only pay for the right side when accessibility actually fails.
one click, two implementations
```rust
// the inner loop of a pure-vision agent
while !done {
    // take screenshot
    let png = capture_window();
    // POST to model
    let resp = call_vision(png, goal, &history).await?;
    if resp.completed { break; }
    // parse function_call
    let fc = resp.function_call.unwrap();
    // convert normalized to screen pixels
    let (x, y) = convert_normalized_to_screen(
        fc.x, fc.y, win_x, win_y, w, h, dpi, scale,
    );
    // execute click
    desktop.click_at(x, y).await?;
    // record the action; the next iteration takes a fresh screenshot
    history.push(fc);
}
```

Bring an open source computer use SDK into your stack
Walk through your target apps with us, see the deterministic and vision paths run live, and leave with a working hybrid agent in your repo.
Questions readers actually ask
What does an open source computer use agent SDK actually need to ship?
Three things, at minimum. First, a way to observe the desktop: either a screenshot pipeline or an accessibility tree reader. Second, a way to act on it: cursor, keyboard, focus, scroll, drag, application launch. Third, a model loop that converts a goal into a sequence of actions and recovers from failures. Most projects with this label ship one of the three (a recorder, a tree reader, a vision loop) and call themselves an SDK. Terminator ships all three, with bindings in Rust, Node, and Python, plus an MCP server so the same runtime is reachable from Claude Code, Cursor, VS Code, and Windsurf without writing glue.
Why a hybrid deterministic + vision design instead of pure vision?
Because most desktop work is boring and structured. Clicking a Save button, typing into a known field, opening an app, reading a status string. The OS accessibility tree already knows where each of those elements is, what role they have, and what they are named. Sending a screenshot to a vision model to find the same element is paying for a model inference plus an image-token charge for something the OS gave you for free. Terminator's selector path uses Windows UIA or macOS AX directly, in-process. Vision is reserved for cases where the tree is empty or lies, which on Windows is mostly games, canvas-heavy apps, and a few Electron windows that opted out of accessibility. The README pitches a 100x speedup and a 95%+ success rate over pixel-loop agents on this premise.
Where is the Gemini Computer Use integration in the source?
It is its own crate: crates/terminator-computer-use, 402 lines total. The interesting functions are call_computer_use_backend at lib.rs line 140 (the HTTP call to the vision endpoint), translate_gemini_keys at line 212 (maps Gemini's human key format like control+a into Windows uiautomation's brace format like {Ctrl}a), and convert_normalized_to_screen at line 336 (converts Gemini's 0-999 normalized output coordinates into absolute screen pixels accounting for screenshot resize, DPI scale, and window offset). The Desktop method that ties them together lives in crates/terminator/src/computer_use/mod.rs, and the MCP tool wrapper that calls it is at crates/terminator-mcp-agent/src/server.rs line 8635.
Can I point the vision loop at my own backend?
Yes, that is the design. call_computer_use_backend reads the GEMINI_COMPUTER_USE_BACKEND_URL environment variable, falling back to the hosted Mediar endpoint at app.mediar.ai. The contract is a JSON POST with image (base64 PNG), goal (string), and previous_actions (array). The response is a small JSON object with completed (boolean), an optional function_call (the next action), an optional text (model reasoning), and an optional safety_decision. If you want to run Gemini 2.5 Computer Use yourself, OpenAI Operator, Anthropic computer use, or your own model fine-tuned on UI traces, implement that contract and set the env var. The whole protocol is two structs in lib.rs, ComputerUseFunctionCall and ComputerUseBackendResponse, both visible at the top of the file.
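The contract is small enough that a compatible backend fits in one file. A minimal sketch, assuming axum, with field names matching the description above; the hardcoded "always completed" reply is a placeholder for your own model call:

```rust
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
#[allow(dead_code)]
struct Request {
    image: String,                           // base64 PNG
    goal: String,
    previous_actions: Vec<serde_json::Value>,
}

#[derive(Serialize)]
struct Response {
    completed: bool,
    function_call: Option<serde_json::Value>,
    text: Option<String>,
    safety_decision: Option<serde_json::Value>,
}

async fn step(Json(req): Json<Request>) -> Json<Response> {
    // Call your own model here; return one action per step in
    // function_call, or completed: true when the goal is done.
    Json(Response {
        completed: true,
        function_call: None,
        text: Some(format!("goal was: {}", req.goal)),
        safety_decision: None,
    })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/", post(step));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8787").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```

Set GEMINI_COMPUTER_USE_BACKEND_URL=http://127.0.0.1:8787 and the loop runs against it unchanged.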
How do I install the SDK and the MCP runtime in one go?
Two commands. npm i @mediar-ai/terminator gives you the Node SDK with prebuilt napi-rs binaries. claude mcp add terminator "npx -y terminator-mcp-agent@latest" registers the MCP server with Claude Code (the same npx command works in Cursor, VS Code, and Windsurf via their respective mcpServers JSON blocks). Python is pip install terminator, Rust is cargo add terminator-rs. All of them speak to the same Rust core, so a workflow recorded in one environment runs in another without translation.
Is it really only Windows?
The deterministic path is fully featured on Windows and partial on macOS today. Linux is unsupported. The vision path (gemini_computer_use) targets Windows in the current build because the action executors call into the Windows UIA invoke and click APIs after the coordinate conversion. The architecture is platform-clean (the computer_use crate has no OS-specific code, only types and pixel math), so a macOS executor is a matter of wiring the platform-specific click and type adapters that already exist for the selector path.
What does the agentic vision loop actually look like step-by-step?
Capture the target window with terminator-rs's screenshot API. Resize so the longest edge is 1920 pixels and remember the resize scale. Encode as base64 PNG. POST to the backend with the goal and any previous actions. Parse the response: if completed is true, finish. Otherwise, take the function_call (one of click_at, type_text_at, scroll_document, key_combination, drag, wait, etc.), translate any keys via translate_gemini_keys, run convert_normalized_to_screen on the coordinates, and execute the action through Terminator's deterministic backend. Take a fresh screenshot, append the previous action with its success or error, and loop. Each step is recorded as a ComputerUseStep in the result and the per-step PNG is dropped into %LOCALAPPDATA%/terminator/executions/ for replay.
How is this different from browser-use, OpenAI Operator, or open-interpreter?
Scope and shape. browser-use is a Python library focused on Chromium DOMs through Playwright. OpenAI Operator is a hosted product with no SDK to self-host. open-interpreter is a code-execution loop that drives a shell, not a desktop. Terminator is a cross-language SDK plus an MCP runtime, the deterministic path covers the entire desktop (any Windows app, not just browsers), and the vision path is a thin 402-line crate sitting on top of the same primitives. You can mix them in a single script: drive Excel via UIA selectors, drop into Gemini 2.5 Computer Use for a canvas-heavy CRM screen, then go back to selectors for the final report.
Is the system prompt and tool list compiled into the binary?
Yes. crates/terminator-mcp-agent/build.rs scans server.rs at build time, finds the dispatch_tool match block, and extracts the tool names into the MCP_TOOLS environment variable via cargo:rustc-env. crates/terminator-mcp-agent/src/prompt.rs reads env!("MCP_TOOLS") and inlines the list into the system instructions. The practical consequence: the model cannot be told about a tool the dispatcher does not handle, and the dispatcher cannot ship a tool the model is not told about. Same list by construction, every build.
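A sketch of the mechanism; the naive line-by-line scan stands in for whatever parsing the real build.rs does, but the cargo:rustc-env handoff is the part that matters:

```rust
// build.rs -- sketch of the tool-name extraction described above.
use std::fs;

fn main() {
    println!("cargo:rerun-if-changed=src/server.rs");
    let src = fs::read_to_string("src/server.rs").expect("read server.rs");
    // Collect every string literal used as a match arm, e.g. `"click_element" =>`.
    let tools: Vec<String> = src
        .lines()
        .filter_map(|l| {
            let rest = l.trim().strip_prefix('"')?;
            let (name, tail) = rest.split_once('"')?;
            tail.trim_start().starts_with("=>").then(|| name.to_string())
        })
        .collect();
    // Inlined into the binary; prompt.rs reads it back with env!("MCP_TOOLS").
    println!("cargo:rustc-env=MCP_TOOLS={}", tools.join(","));
}
```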
More from the Terminator guides
Claude computer use, and the selector-based path nobody talks about
What Claude actually emits under computer_20251022, and the 32 selector tools that replace the pixel loop.
Accessibility API for computer use agents
How Windows UIA and macOS AX expose a tree the model can target by role and name instead of pixels.
Terminator on GitHub
Core Rust crates, MCP agent, Node and Python bindings, workflow recorder. MIT licensed.