Run vLLM locally with a desktop agent
Most write-ups on this pair either stop at “here is how to start vLLM” or mock a fake desktop with Playwright. None of them show the specific JSON contract a real desktop automation framework expects to receive. Terminator ships that contract in public Rust, reads one environment variable, and will send every screenshot to a URL of your choosing. That is the whole bridge.
The whole bridge is one HTTP endpoint
Terminator’s computer-use loop lives in the terminator-computer-use crate. The entire backend integration is one function that reads one env var, POSTs a fixed JSON body, and expects a fixed JSON response. There is no SDK-shaped adapter, no model-specific client, no gRPC. Just reqwest, serde, and a 300-second timeout.
What your shim gets, what your shim returns
Three inputs cross the wire. Three outputs come back. That is the contract. If you can produce the right output from the right input, you can drive the desktop with any model vLLM can serve.
Inputs and outputs across the localhost boundary
Numbers that shape the shim: the 0..999 coordinate grid, the 1920-pixel long-edge cap, the 300-second request timeout, and the 3-item history window. All four are hard-coded on the Rust side, not negotiable from the model.
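Put concretely, the two bodies that cross that boundary look like this. A hand-written sketch using the field names described in this article; the values are illustrative placeholders, not real capture data.

```python
import json

# Sketch of the request Terminator POSTs to the backend URL.
request_body = {
    "image": "iVBORw0KGgo...",           # base64 PNG, no data: prefix
    "goal": "Open the settings page and enable dark mode",
    "previous_actions": [                # at most 3 entries, oldest dropped
        {
            "action": "click_at",
            "result": "ok",
            "screenshot": "iVBORw0KGgo...",  # post-action frame, base64 PNG
            "url": "https://example.com/settings",
        }
    ],
}

# Sketch of the reply the shim must return.
response_body = {
    "completed": False,                  # True ends the loop; "text" becomes the answer
    "function_call": {
        "name": "click_at",              # must be one of the accepted action names
        "args": {"x": 512, "y": 340},    # integers on the 0..999 normalized grid
    },
    "text": None,
    "safety_decision": None,             # or "require_confirmation" to gate the step
}

print(json.dumps(response_body))
```

If you can produce the second shape from the first, the Rust side neither knows nor cares what model sits behind the URL.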
The nine action names you are allowed to emit
The dispatch match in crates/terminator/src/computer_use/mod.rs enumerates every action the Rust side will accept and execute. If your model returns anything outside this set, the Rust side logs Unknown action and ends the loop. Train your prompt on this vocabulary and ignore the rest of the function-calling schema the model was fine-tuned on.
click_at
args: {x, y}. Coords in 0-999. Terminator inverts the resize + DPI + window offset transforms and fires a real mouse click at the screen pixel.
type_text_at
args: {x, y, text, press_enter?}. Clicks to focus, Ctrl+A to clear, types via root().type_text, optionally presses Enter. Use this instead of raw 'type'.
key_combination
args: {keys}. A string like 'control+a' or 'Meta+Shift+T'. Terminator's translator maps it to uiautomation format internally. F1 through F24 supported.
scroll_document / scroll_at
args: {direction, magnitude?, x?, y?}. If x and y are provided, Terminator clicks there first to set focus, then scrolls through root().scroll.
drag_and_drop
args: {x, y, destination_x, destination_y}. Both in 0-999. start_x/start_y or end_x/end_y are also accepted for compatibility.
navigate / search
navigate{url} activates the app, Ctrl+L focuses the address bar, types the URL, hits Enter. search{query} opens google.com/search?q= for the query you return.
A 180-line FastAPI shim that actually works
Here is the bridge, end to end. The shim speaks Terminator’s contract on its POST endpoint and vLLM’s OpenAI-compatible chat API on the outbound side. Set VLLM_MODEL to whichever VL model you decided to serve.
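The full FastAPI version wraps an async OpenAI-compatible client; the core translation it performs can be sketched as two pure functions. The function names, prompt text, and token limit below are my own illustration, not the shim's literal source.

```python
import json

SYSTEM_PROMPT = (
    "You control a desktop. Reply with exactly one JSON object: "
    '{"completed": bool, "function_call": {"name": ..., "args": ...} | null, '
    '"text": str | null}. Coordinates are integers on a 0..999 grid.'
)

def build_chat_request(payload: dict, model: str) -> dict:
    """Translate Terminator's POST body into an OpenAI-style chat request."""
    history = "\n".join(
        f"- {a.get('action')}: {a.get('result')}"
        for a in payload.get("previous_actions", [])
    ) or "(none)"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text",
                 "text": f"Goal: {payload['goal']}\nPrevious actions:\n{history}"},
                # vLLM's OpenAI-compatible route takes data URLs, so the
                # prefix is rebuilt here; Terminator sends bare base64.
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + payload["image"]}},
            ]},
        ],
        "max_tokens": 256,
    }

def parse_model_reply(raw: str) -> dict:
    """Translate the model's JSON text back into Terminator's response shape."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fail soft: a null function_call is a no_action, not a crashed loop.
        return {"completed": False, "function_call": None,
                "text": raw, "safety_decision": None}
    return {
        "completed": bool(parsed.get("completed", False)),
        "function_call": parsed.get("function_call"),
        "text": parsed.get("text"),
        "safety_decision": parsed.get("safety_decision"),
    }
```

Everything else in the shim is plumbing: a FastAPI POST route that calls these two functions with the chat completion in between.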
What happens when you press run
The Rust loop runs six steps on every iteration, most of which you never see. Knowing the sequence matters because your shim sits between steps 3 and 4, and the pre/post conditions on either side dictate what the model can and cannot assume.
Step 1 — Capture the target window
Desktop::gemini_computer_use finds the target process, grabs its window, screenshots it with the native OS API, and converts BGRA to RGBA. No other window is captured.
Step 2 — Resize to 1920 long-edge
capture_window_for_computer_use enforces a MAX_DIM of 1920 pixels on the longer side. Larger monitors get Lanczos-resized and the scale factor is retained so we can invert it later. This is what keeps the image small enough for a local VL model to process in a reasonable latency budget.
Step 3 — POST to your backend URL
The base64 PNG, the goal string, and the last 3 (action, result, screenshot, url) tuples are serialized into one JSON payload. If GEMINI_COMPUTER_USE_BACKEND_URL is set, it goes to your shim. If not, it goes to app.mediar.ai.
Step 4 — Receive one function_call
Your shim replies with completed=false and a function_call whose name is one of the 9 valid action strings. args contains integers in 0..999 for x and y, or a text field for typing, or a keys field for key combinations.
Step 5 — Normalized coord -> screen pixel
convert_normalized_to_screen divides by 1000, inverts the resize scale, inverts the DPI scale, adds the window's top-left offset, and clicks. On a 4K display at 150% scale this is where you stop missing buttons by a third of their height.
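The step-5 transform chain, mirrored in Python for clarity. This is a sketch of the description above, not the crate's code; the argument names and the direction of each inversion follow the prose here, so treat it as illustrative.

```python
def convert_normalized_to_screen(
    nx: int, ny: int,
    img_w: int, img_h: int,          # dimensions of the (possibly resized) screenshot
    resize_scale: float,             # scale applied by the 1920 long-edge resize
    dpi_scale: float,                # OS display scaling, e.g. 1.5 for 150%
    window_x: int, window_y: int,    # captured window's top-left in screen pixels
) -> tuple[int, int]:
    # 0..999 grid -> pixel inside the screenshot that was sent
    px = nx / 1000.0 * img_w
    py = ny / 1000.0 * img_h
    # invert the long-edge resize (multiply back up by 1/resize_scale)
    px /= resize_scale
    py /= resize_scale
    # invert the DPI scale, then add the window offset
    px /= dpi_scale
    py /= dpi_scale
    return round(px) + window_x, round(py) + window_y
```

Skipping any one of the four stages reproduces a classic failure mode: forget the DPI inversion and every click on a 150% display lands a third short.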
Step 6 — 1000ms settle, capture again
Terminator sleeps 1000ms, captures a post-action screenshot, and appends it to previous_actions. If the list exceeds 3 entries, the oldest is removed. That is why your prompt template should never assume it can see the whole session history in one frame.
One step of the agentic loop, actors and messages
Boot the stack and drive it
Three long-running processes, one env var, and a client. The vLLM server and the shim both bind to loopback, so you get the full latency benefits of a local backend (no TLS handshake, no inter-AZ hop, no per-request billing). A Qwen2.5-VL-7B on a 24GB card returns a function_call in roughly 1-2 seconds on a 1600px screenshot.
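The boot sequence might look like the following. The model name, ports, and file names are illustrative; match them to your own shim and hardware.

```shell
# 1. Serve the VL model on loopback (GPU box)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000 &

# 2. Serve the shim (CPU is fine) — assumes shim.py exposes `app`
uvicorn shim:app --host 127.0.0.1 --port 9000 &

# 3. Point Terminator at the shim before starting the client
export GEMINI_COMPUTER_USE_BACKEND_URL=http://127.0.0.1:9000/computer-use
```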
The client is one block of TypeScript
Set the env var before you construct Desktop. Everything else is the same call shape you would make against the hosted backend. The onStep callback fires after each dispatched action so you can log, visualize, or gate on errors.
The HTTP contract, every line of it
If your shim diverges from the hosted backend on any of these rows, the Rust side will fail closed rather than silently mis-dispatch.
| Feature | Hosted default | Your shim |
|---|---|---|
| Request method + content type | POST application/json with {image, goal, previous_actions}. | Same. Terminator does not care which process on localhost answers. |
| image field encoding | base64 PNG string, no data: prefix, no url. | Identical on the shim side. You rebuild a data URL before passing to vLLM. |
| previous_actions length | Up to 3 items. Oldest is dropped at line 679 of mod.rs. | Your shim never sees more than 3, so prompt compression is the Rust side's job, not yours. |
| completed=true semantics | Terminator breaks the agentic loop and records final_status='success'. | Emit completed=true from the shim when the VL model says the goal is satisfied. No more screenshots are sent. |
| function_call.name not in the 9-action set | warn! log line, final_status='failed'. The loop ends. | Guard against this in the shim. Return null to force the Rust side to treat the step as no_action. |
| safety_decision='require_confirmation' | Terminator breaks with final_status='needs_confirmation' and records the pending args. | Your shim can gate destructive actions (delete, send, pay) behind this value. The Rust side honours it. |
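A shim-side gate for that last row could look like the sketch below. The keyword heuristic and helper name are mine; a production gate would classify the action's intent rather than substring-match its args.

```python
import json

# Hypothetical shim-side gate: intercept destructive verbs before they
# reach the desktop. The keyword list is illustrative, not exhaustive.
DESTRUCTIVE_HINTS = ("delete", "send", "pay", "overwrite")

def gate_action(response: dict) -> dict:
    """Replace a risky function_call with a require_confirmation response."""
    call = response.get("function_call") or {}
    serialized = json.dumps(call).lower()
    if any(hint in serialized for hint in DESTRUCTIVE_HINTS):
        return {
            "completed": False,
            "function_call": None,
            "text": f"Confirmation required before: {call.get('name')}",
            "safety_decision": "require_confirmation",
        }
    return response
```

Terminator breaks with final_status='needs_confirmation', your harness prompts the human, and the session can be resumed with the approved step.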
Counting tokens when the model only ever sees four screenshots
The turn-history cap is the single most important fact about designing a prompt for this loop. A 20-step plan is not 20 screenshots of context. It is at most four screenshots on any given inference: the current frame plus the last three (action, result, post-action) tuples. Four images, not twenty. Plan for that.
On a 1600 by 1000 screenshot, Qwen2.5-VL encodes each image into roughly 1200 visual tokens at default settings. Four images is about 5000 image tokens per turn plus a few hundred text tokens for the system message, goal, and previous action names. That fits comfortably in an 8K context window, which is what you want to keep time-to-first-token low. If you need longer memory, persist a textual trajectory outside the shim and inject it as a small rolling summary in the system message. Do not try to sneak extra screenshots past the cap. The Rust side drops them.
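The arithmetic above as a sanity check. All three constants are the rough estimates from this section, not measured values.

```python
# Back-of-envelope token budget per turn for Qwen2.5-VL at default settings.
IMG_TOKENS = 1200   # ~visual tokens per 1600x1000 screenshot (approximate)
MAX_IMAGES = 4      # current frame + up to 3 history frames (hard cap)
TEXT_TOKENS = 400   # system message, goal, previous action names (rough)

per_turn = MAX_IMAGES * IMG_TOKENS + TEXT_TOKENS
print(per_turn)     # well inside an 8K context window
```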
Shipping a local-first desktop agent?
Book a 20-minute call with the Terminator team. We will walk through the shim on your hardware and show where other teams got stuck.
Frequently asked questions
Which env var do I actually need to set?
GEMINI_COMPUTER_USE_BACKEND_URL. It is read in crates/terminator-computer-use/src/lib.rs on line 145 and falls back to https://app.mediar.ai/api/vision/computer-use when unset. Set it to your shim's endpoint before constructing the Desktop object, and every screenshot in that session goes to your shim instead of Mediar's hosted route.
Does Terminator actually require Gemini, or is the env var a real hand-off?
It is a real hand-off. The Rust side does not speak Gemini. It POSTs a fixed JSON payload and expects a fixed response shape. Any process on any URL that honours the contract can be the brain. The 'gemini' prefix in the env var is a historical naming choice, not a binding. You can run Qwen2.5-VL, Llama 3.2 Vision, InternVL, Pixtral, or any other VL model vLLM supports.
What is the exact JSON the shim has to return?
{"completed": bool, "function_call": {"name": string, "args": object} | null, "text": string | null, "safety_decision": "require_confirmation" | null}. If completed is true, the loop ends and text is treated as the final answer. If completed is false, function_call.name must be one of click_at, type_text_at, key_combination, scroll_document, scroll_at, drag_and_drop, wait_5_seconds, hover_at, navigate, or search. Anything else logs a warning and ends the loop.
How are coordinates supposed to be encoded?
Integers in a 0..999 grid that is normalized to the screenshot you were sent. Terminator divides by 1000 to get a pixel inside the (possibly resized) image, then multiplies back up by 1/resize_scale, 1/dpi_scale, and finally adds the captured window's top-left offset. The shim does none of this math. It just returns the normalized numbers the VL model emitted.
Why does previous_actions get capped at 3?
Line 679 of crates/terminator/src/computer_use/mod.rs runs `if previous_actions.len() > 3 { previous_actions.remove(0); }` at the end of every step. Each past action carries a full base64 PNG, and without the cap a 20-step plan would balloon the request body to 20 full screenshots. Three is enough context for action chaining and keeps the POST payload under typical proxy limits. Design your prompts knowing the model sees at most the current screen plus 3 prior screens.
Do I need a GPU to run the shim?
You need a GPU for the VL model vLLM is serving, not for the shim itself. The shim is a FastAPI process that translates JSON. Qwen2.5-VL-7B fits comfortably on a single 24GB card in FP16, and vLLM's paged-attention kernels make it practical for one user driving one desktop session. Smaller models like Qwen2.5-VL-3B run on 12GB. The shim can be a CPU-only container talking to vLLM over localhost.
What happens to the screenshot stream? Does anything leave my machine?
With GEMINI_COMPUTER_USE_BACKEND_URL pointed at localhost, nothing leaves the machine on the inference path. Terminator writes every screenshot to %LOCALAPPDATA%/terminator/executions/ as <timestamp>_geminiComputerUse_<process>_NNN.png. Your shim receives the same bytes over a loopback socket, forwards them to vLLM on another loopback socket, and returns JSON. The only wire traffic is to whatever target the agent is navigating to (e.g., a browser fetching a page).
Can the shim enforce safety gates before destructive actions?
Yes. Return safety_decision='require_confirmation' instead of a function_call and Terminator will break the loop with final_status='needs_confirmation', write the pending (action, args, text) into pending_confirmation, and return. Your outer harness can inspect the pending args, prompt the human, and resume. This is how you put 'send', 'delete', 'pay', or 'overwrite' behind an approval step without trusting the model.
Is this the same as running Ollama instead of vLLM?
Same shape, different backend. The Terminator example folder at crates/terminator-mcp-agent/examples/terminator-ai-summarizer uses the ollama-rs crate for a simpler summarize-the-screen loop. For the full computer-use agentic loop, you still need a shim that honours the JSON contract above because Ollama's /api/generate response shape is not the one Terminator's Rust side expects. If you prefer Ollama, replace the AsyncOpenAI client in the shim with an Ollama client and keep everything else.
How do I debug a misaligned click on a HiDPI monitor?
The four coordinate transforms are in convert_normalized_to_screen. If the click lands above-and-left of the target, the resize_scale is being undercounted. If it lands in the right spot on your primary monitor but wrong on your secondary monitor, window_x and window_y are wrong (the capture happened on the wrong window). Terminator saves the initial screenshot at executions/<id>_000_initial.png and every post-action screenshot at executions/<id>_NNN_after.png, so you can confirm what the model actually saw versus what it tried to click.
Adjacent guides on the Terminator stack
Keep reading
Open source computer use agents, April 2026
The four coordinate transforms every hybrid agent runs on each click, shown in public Rust. Companion piece to the vLLM shim.
Claude computer use, and the selector path nobody explains
Anthropic's native computer use tool is a pixel-coordinate loop. Terminator also exposes 32 selector-based MCP tools for when the screenshot path is overkill.
terminator-computer-use on GitHub
The crate that owns the backend URL, the 9-action dispatch table, and the normalized coordinate converter. MIT licensed.