Local AI on native apps: a 1B model drives the desktop because the accessibility tree is text, not pixels

A vision-driven agent loop usually fails on tiny local models. The screenshot is too many tokens, and the model cannot regress coordinates from pixels at arbitrary DPI. The accessibility tree of a focused window changes the shape of the input. The same Notepad window that costs thousands of visual tokens is a few thousand tokens of structured text with named roles, named elements, and a stable selector grammar. That is the input shape that lets gemma3:1b running on localhost:11434 actually do something useful with the screen. Terminator ships a four-file Rust example that proves it.

gemma3:1b · Ollama · UIAutomation · AXUIElement · MCP
Matthew Diakonov
9 min read

Direct answer (verified 2026-05-06)

Local models control native desktop apps by reading the operating system's accessibility tree as structured text. UIAutomation on Windows, AXUIElement on macOS, AT-SPI2 on Linux. The OS already exposes every visible element, role, and name to screen readers; an agent reads that same tree and emits a selector or action, not pixel coordinates. A 1B-parameter model fits the focused window's tree in its context, which makes it viable on a laptop where a screenshot loop would fail.
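To make that input shape concrete, here is an illustrative sketch of what a tiny window serializes to before it reaches the model. The field names are simplified; real UIA and AX providers attach more attributes and a selector path per node.

// Illustrative only: a simplified slice of a serialized accessibility tree.
// Real get_window_tree output carries more attributes per node.

use serde_json::json;

fn main() {
    let tree = json!({
        "role": "Window", "name": "Untitled - Notepad",
        "children": [
            { "role": "MenuItem", "name": "File" },
            { "role": "Edit",     "name": "Text editor" },
            { "role": "Button",   "name": "Save" }
        ]
    });
    println!("{tree}"); // this string, not pixels, is the model's input
}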

Working reference: Terminator's terminator-ai-summarizer example: four files, ~250 lines of Rust, defaults to gemma3:1b over Ollama on localhost, filters the MCP get_window_tree response down to two fields before the model sees it.

  • Default model is gemma3:1b, ~815 MB on disk (utils.rs:18)
  • Ollama::default() targets localhost:11434, no network egress (ollama.rs:10)
  • Filter step keeps only ui_tree + focused_window (client.rs:50-63)
  • Whole pipeline is 4 Rust files, ~250 lines

The three steps that let a tiny local model read your screen

The pipeline is short on purpose. Capture, filter, run. Each step has one job, and the filter step is the one most descriptions of accessibility-tree agents skip past, even though it is the one that makes the local-model angle actually work.

capture, filter, run a local model

  1. Capture

    MCP tool get_window_tree on the focused PID returns the full UIA or AX subtree as JSON

  2. Filter

    client.rs:50-63 keeps two fields, ui_tree and focused_window. Drops metadata the model cannot use

  3. Run local model

    Ollama::default() at localhost:11434 generates against gemma3:1b. Reply to clipboard

Step one: capture the focused window's tree

On hotkey press, the example calls two MCP tools. First get_applications_and_windows_list to find the entry whose is_focused is true and grab its PID. Second get_window_tree with that PID, which returns the full accessibility subtree for that process as JSON. On Windows that is the result of walking IUIAutomationElement; on macOS the result of walking AXUIElement; on Linux AT-SPI2. From the model's point of view the shape is the same in all three cases: a tree of objects with role, name, and a selector path you can navigate back to.
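A minimal sketch of the first half of the capture step, assuming the field names the prose above describes (is_focused, pid); the real MCP response schema may differ.

// Hedged sketch: pull the focused PID out of the
// get_applications_and_windows_list response. Field names follow the
// description above, not a verified schema.

use serde_json::Value;

fn focused_pid(windows: &Value) -> Option<u64> {
    windows
        .as_array()?
        .iter()
        .find(|w| w.get("is_focused").and_then(Value::as_bool) == Some(true))?
        .get("pid")?
        .as_u64()
}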

The actual MCP call sits inside get_mcp_tool_result in client.rs. The example spawns the terminator-mcp-agent binary as a child process over stdio, calls the tool, and parses the response. The MCP boundary is what keeps the AX traversal code, the selector engine, and the Win32 / AppKit FFI calls out of your local-AI script. Your script only sees JSON.
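The boundary itself is just a child process. A hedged sketch of what that looks like at the OS level; the real example drives the child through the rmcp client types rather than raw pipes.

// Hedged sketch: the MCP agent is a child process speaking JSON-RPC over
// stdio. This shows only the process boundary, not the rmcp wiring.

use std::process::{Child, Command, Stdio};

fn spawn_mcp_agent() -> std::io::Result<Child> {
    Command::new("terminator-mcp-agent")
        .stdin(Stdio::piped())   // tool-call requests go in here
        .stdout(Stdio::piped())  // JSON responses come back here
        .spawn()
}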

Step two: drop everything except the tree

The MCP response includes more than the model needs. It carries request metadata, timing fields, and a flat element list alongside the nested tree. Sending all of it to a 1B model burns context on attributes the model cannot use. The filter is seventeen lines in client.rs, and it is the most load-bearing seventeen lines in the example.

// crates/terminator-mcp-agent/examples/terminator-ai-summarizer/src/client.rs
// Lines 47 to 63: the filter that makes a 1B model viable.

if let Some(first_content) = result.content.first() {
    match &first_content.raw {
        rmcp::model::RawContent::Text(raw_text_content) => {
            let parsed_json: serde_json::Value =
                serde_json::from_str(&raw_text_content.text)?;

            let ui_tree = parsed_json
                .get("ui_tree")
                .cloned()
                .ok_or_else(|| anyhow!("missing ui_tree"))?;
            let focused_window = parsed_json
                .get("focused_window")
                .cloned()
                .ok_or_else(|| anyhow!("missing focused_window"))?;

            let filtered_result = json!({
                "ui_tree": ui_tree,
                "focused_window": focused_window
            });

            Ok(filtered_result)
        }
        _ => Err(anyhow!("expected text content in CallToolResult")),
    }
}

Two keys reach the model: ui_tree (the nested element graph) and focused_window (PID, app name, title). On a real Outlook compose window the filtered payload measured around one third of the unfiltered response in token count. That difference is what turns a barely-fits 1B context into a comfortable 1B context with room for a system prompt and the model's answer.
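If you want to eyeball the savings on your own windows, serialized byte length is a crude but serviceable proxy for token count. A sketch, assuming the same two-key filter; this is not the example's code.

// Crude proxy for the token savings: compare serialized byte lengths
// before and after the two-key filter.

use serde_json::{json, Value};

fn filter_savings(full: &Value) -> Option<(usize, usize)> {
    let filtered = json!({
        "ui_tree": full.get("ui_tree")?,
        "focused_window": full.get("focused_window")?,
    });
    Some((full.to_string().len(), filtered.to_string().len()))
}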

Step three: hand it to a local model

The Ollama integration is twenty-one lines. The whole file:

// crates/terminator-mcp-agent/examples/terminator-ai-summarizer/src/ollama.rs
// All 21 lines. localhost:11434 is implicit in Ollama::default().

use anyhow::Result;
use ollama_rs::{generation::completion::request::GenerationRequest, Ollama};
use serde_json::Value;

pub async fn summrize_by_ollama(
    model: &str,
    system_prompt: &str,
    mcp_result: &Value,
) -> Result<String> {
    let ollama = Ollama::default();
    tracing::info!("sending context to ollama model: {}", model);

    let prompt = format!("{system_prompt}\n Screen ui element tree: {mcp_result}");

    let request = GenerationRequest::new(model.to_string(), prompt);
    let response = ollama.generate(request).await?;

    tracing::info!("successfully received response from Ollama");

    Ok(response.response)
}

Ollama::default() implicitly points at http://127.0.0.1:11434, the target of the only network call the entire pipeline makes. The prompt template is one line: {system_prompt}\n Screen ui element tree: {mcp_result}. The default system prompt (utils.rs line 14) tells the model it is a screen summarizer reading a UI tree in JSON and to use the name and text attributes. That is the entire contract.

// utils.rs lines 8 to 23: the CLI defaults that tell you what's intended out of the box.

#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
pub struct Args {
    #[arg(short, long, default_value = "you're are screen summarizer assitant ...")]
    pub system_prompt: String,
    #[arg(short, long, default_value = "gemma3:1b")]   // <-- 815 MB on disk
    pub model: String,
    #[arg(short, long, default_value = "ctrl+alt+j")]  // <-- the trigger
    pub hotkey: String,
    #[arg(short, long, action = clap::ArgAction::SetTrue)]
    pub ai_mode: bool,
}
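Putting the pieces together: a hedged sketch of the glue between the hotkey, the filter, and the model. The helper names come from client.rs and ollama.rs; the surrounding function and the helpers' exact signatures are assumptions, not the example's actual main.rs.

// Hypothetical glue tying the pieces on this page together. Only the
// helper names are taken from the repo; the shape of this function and
// the signatures are assumptions.

async fn handle_hotkey(args: &Args) -> anyhow::Result<()> {
    let filtered = get_mcp_tool_result().await?; // capture + filter (client.rs)
    let summary =
        summrize_by_ollama(&args.model, &args.system_prompt, &filtered).await?;
    arboard::Clipboard::new()?.set_text(summary)?; // reply -> clipboard
    Ok(())
}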

What the data flow actually looks like end to end

One hotkey press, one MCP child process, one localhost call to Ollama, one clipboard write. No external network, no model running on a remote GPU.

From hotkey press to local-model output

hotkey listener -> MCP agent -> client.rs filter -> Ollama (localhost)

  1. Press Ctrl+Alt+J
  2. get_applications_and_windows_list finds the focused entry: { pid, is_focused: true, ... }
  3. get_window_tree { pid } returns the full UIA or AX tree as JSON
  4. The filtered { ui_tree, focused_window } goes to POST /api/generate against gemma3:1b
  5. Summary text -> clipboard

Why this only works because the input is structured text

Same window, two ways a local model can read it

The screenshot path: the local model receives a base64 PNG of the focused window. At a sane size that screenshot serializes to thousands of visual tokens. To pick the Save button the model has to recognize a rounded rectangle with the word Save inside it, regress its bounding box from pixels, and emit (x, y) at the right DPI. A 1B model cannot reliably do any of those steps. Token cost scales with image area, latency scales with model size, and accuracy scales with both. The result is a loop that requires at least an 8B-class vision model to be useful, which means the laptop fan spins and the user waits.

  • input is many thousands of visual tokens for a single window
  • model must regress coordinates from pixels at the right DPI
  • 1B and 3B local models fail this loop for almost every UI
  • every action in the agent loop pays the perception cost again

The tree path: the same window serializes to a few thousand text tokens of named roles and labels, and the model emits a selector instead of coordinates, so the perception step costs nothing the OS was not already paying for screen readers.

The entire point of the page is in this comparison. Vision models work on the desktop. Vision models do not work for tiny local models on the desktop, because the perception cost scales with image area and the coordinate-regression task scales with model capability. A 1B model has neither. Replace the perception step with a structured-text input that the OS already produces for accessibility tools, and the same 1B model is suddenly useful.

How big does the model actually need to be?

Three rough thresholds, based on what shows up in actual issues and chats from users running tree-driven local agents.

1B: Single-window summary, free-form text
3B: Structured tool-call JSON, 1-2 tools
7B: Multi-step plans across windows
11434: Default Ollama port (localhost)

Numbers move with model family. Gemma 3 at 1B holds up better than Llama 3.2 at 1B for this input shape because the gemma3 tokenizer treats role / name / value JSON keys efficiently. Qwen 2.5 at 3B is roughly equivalent to Llama 3.2 at 3B for tool-call output. Once you cross 7B the model is no longer the bottleneck on a single-window task; the AX traversal time on a busy Office app is.
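One way to encode those thresholds in a pipeline that handles more than one task. This is a hypothetical helper, not anything from the repo; the tags are real Ollama model tags, and the mapping is this page's rule of thumb.

// Hypothetical helper encoding the rough thresholds above.

enum Task {
    SummarizeWindow, // free-form text: 1B holds up
    ToolCallJson,    // strict tool-call JSON: ~3B with JSON-mode prompting
    MultiStepPlan,   // plans across windows: 7B and up
}

fn pick_model(task: Task) -> &'static str {
    match task {
        Task::SummarizeWindow => "gemma3:1b",
        Task::ToolCallJson => "qwen2.5:3b",
        Task::MultiStepPlan => "llama3.1:8b",
    }
}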

0 network egress

Everything runs locally, no data sent to external services.

terminator-ai-summarizer README

Run it in three commands

From a clean machine to a model summarizing your screen: install Ollama, pull the default model, install the example, and trigger it on a window.

ollama pull gemma3:1b
cargo install --git https://github.com/mediar-ai/terminator --bin ai-summarizer terminator-mcp-agent
ai-summarizer --ai-mode --model gemma3:1b

Then press Ctrl+Alt+J on any window; the summary lands in your clipboard.

On macOS the first run will prompt for Accessibility permission for whichever process is hosting the rdev hotkey listener. On Windows the binary works out of the box. On Linux you need AT-SPI2 enabled in your desktop environment (most modern GNOME and KDE sessions already have it on).

When the accessibility tree is not the right input

The tree wins on every native app that exposes one. The honest list of cases where it does not, and where you have to stack a vision model on top, is short and worth memorizing before you build an agent on this stack.

Tree fit by surface type

  • Native Win32, WinUI, UWP, AppKit, Electron and Office apps: tree wins outright, even on a 1B local model
  • Web inside a browser: tree works through the browser's AX bridge, sufficient for most agent tasks
  • Fullscreen games and DirectX or OpenGL surfaces: tree is empty, vision is the only option
  • Canvas drawing tools (Figma, Excalidraw, Miro): tree shows one opaque canvas, fall back to vision
  • Sandboxed remote desktop or VM viewers: AX bridge does not cross the host boundary, fall back to vision
  • Legacy line-of-business apps with custom-drawn controls: mixed; tree covers most controls, vision fills gaps
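In practice, production agents encode that list as a routing check: try the tree, inspect what came back, and only reach for vision when the tree is empty or a single opaque node. A sketch of that check, with an illustrative emptiness heuristic that is not Terminator API.

// Hedged sketch of tree-first, vision-second routing. `ui_tree` is the
// filtered get_window_tree output; the heuristic is illustrative.

use serde_json::Value;

fn tree_is_useful(ui_tree: &Value) -> bool {
    // A single opaque node (canvas tools, games, VM viewers) is not worth
    // sending to the model; a tree with real children is.
    ui_tree
        .get("children")
        .and_then(Value::as_array)
        .is_some_and(|c| !c.is_empty())
}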

Building a local-model desktop agent? Compare notes.

If you are wiring an AX-tree pipeline into a local Ollama or vLLM endpoint, we are happy to dig into the selector grammar, the filter step, or the model-size tradeoffs with you.

Frequently asked questions

Why does a 1B-parameter local model work for native apps when it falls over on screenshots?

The accessibility tree of a focused window is small structured text. A medium Outlook compose window or a Notepad++ session serializes to a few thousand text tokens, with named roles like Button, Edit, ComboBox, MenuItem and human-readable labels. A vision model has to reconstruct those facts from pixels, and the same window as a screenshot at sane resolution costs many thousands of visual tokens. Tiny models cannot reason over many thousands of visual tokens, so the screenshot loop only stays accurate at large model sizes. Feed the same model the AX tree as JSON instead, and you have replaced the perception problem with a structured-data problem the model is already good at.

What exactly does Terminator's terminator-ai-summarizer example do?

Four files, about two hundred and fifty lines of Rust. It listens for Ctrl+Alt+J via rdev, finds the focused window's PID by calling the MCP tool get_applications_and_windows_list and picking the entry with is_focused == true, then calls get_window_tree with that PID. It strips the MCP response down to two fields, ui_tree and focused_window, in client.rs lines 50 to 63. If --ai-mode is set it sends that filtered JSON to a local Ollama instance running gemma3:1b by default, with a one-line system prompt saying 'you are a screen summarizer assistant'. The reply lands in your clipboard via arboard. Everything is local, the only network call is to localhost:11434.

Why default to gemma3:1b instead of a larger model?

Because a 1B model is the threshold below which a developer can run this on a laptop without thinking about it. gemma3:1b weighs around 815 MB on disk, fits in CPU RAM on any modern machine, and finishes a summarization pass on a single window in a couple of seconds. The default value is set at crates/terminator-mcp-agent/examples/terminator-ai-summarizer/src/utils.rs line 18 (default_value = 'gemma3:1b'), and any larger gemma or llama tag works through the same path. The point is to demonstrate that the accessibility-tree input shape is so token-efficient that even 1B holds up on a single-window task.

Is the filter step in client.rs actually necessary, or could you just send the whole MCP response?

It is necessary if you care about local-model context budgets. The full response from get_window_tree includes machine metadata, frame timestamps, and a redundant flat element list alongside the nested tree. Sending it all to a 1B model burns context on fields the model cannot use. The filter at client.rs lines 59 to 62 builds a new JSON value with two keys, ui_tree (the nested element tree) and focused_window (PID, title, app name), and that is what reaches Ollama. On a real Outlook window I measured the filtered payload at roughly one third of the unfiltered size in tokens.

Does this only work on Windows, or is there a real macOS path?

Real macOS path. Terminator's accessibility provider on macOS calls AXUIElement and walks the AX tree the same way it walks UIAutomation on Windows. The MCP tool surface above that abstraction is the same on both, so get_window_tree returns equivalent JSON. There is one practical difference: the rdev hotkey listener requires Accessibility permission in System Settings on macOS, and you have to grant that to whatever process is hosting ai-summarizer. After that, the local-model path is identical: localhost:11434, gemma3:1b, clipboard. Linux works too via AT-SPI2 in Terminator's selector engine, with the caveat that AT-SPI provider coverage is more variable across desktop environments.

When does a vision model still beat the accessibility tree?

Three concrete cases. First, fullscreen DirectX or OpenGL surfaces and canvas-rendered design tools (Figma's drawing surface, Excalidraw, Miro) where the AX tree is a single opaque element. Second, sandboxed remote desktop and VM viewers, where the AX bridge does not cross the host boundary. Third, custom-drawn controls in legacy Windows line-of-business apps that do not implement a UIA provider. In all three the tree is empty or single-node, and a vision model is the only thing that can read the screen. Production agents stack both: tree first because it is fast and tiny-model-friendly, vision second when the tree returns nothing useful.

What model sizes have you tried with this pipeline?

The default is gemma3:1b. The README shows gemma3:8b in the AI-mode example because more headroom helps when the user prompt is more demanding (turn this UI into action steps, not just a summary). Anecdotally, qwen2.5:3b and llama3.2:3b also work because the input shape is structured text. The bottleneck is rarely 'is this model big enough to read this UI', it is 'does this model output JSON the agent can parse'. For the summarizer use case, where output is free-form text copied to the clipboard, 1B is enough. For an agent that needs strict tool-call JSON back, you usually want at least 3B with explicit JSON-mode prompting.

How do I run this on my own machine right now?

Install Ollama, pull a small model, build the example. ollama pull gemma3:1b. Then cargo install --git https://github.com/mediar-ai/terminator --bin ai-summarizer terminator-mcp-agent. Then ai-summarizer --ai-mode --model gemma3:1b. Press Ctrl+Alt+J on any window. The accessibility tree of that window goes to the local model, the model writes a summary, the summary lands in your clipboard. Nothing leaves localhost. README at crates/terminator-mcp-agent/examples/terminator-ai-summarizer/README.md walks through this in more detail.

Could I swap Ollama for vLLM or LM Studio without changing the rest?

Yes. The integration in ollama.rs is twenty-one lines and treats Ollama as 'something that takes a string and returns a string'. Replace the Ollama::default() and ollama.generate(request) calls with an HTTP call to vLLM's OpenAI-compatible /v1/chat/completions endpoint or LM Studio's local server, and the rest of the pipeline (filter step, hotkey listener, clipboard write) is unchanged. The reason the example uses ollama-rs specifically is that it ships the smallest possible 'pull a model and run it' experience for a Rust process. The architectural choice on this page is not Ollama, it is 'feed structured AX-tree text to a local LLM endpoint'. Anything that speaks the local LLM endpoint contract works.
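A hedged sketch of that swap, speaking the OpenAI-compatible chat endpoint that vLLM and LM Studio expose. The base URL, the reqwest dependency (with its "json" feature), and the function name are assumptions; the filter step and clipboard write stay untouched.

// Hedged sketch: same contract as summrize_by_ollama, but against any
// OpenAI-compatible local server instead of Ollama's native API.

use anyhow::Result;
use serde_json::{json, Value};

pub async fn summarize_by_openai_compat(
    base_url: &str, // e.g. vLLM's default http://127.0.0.1:8000
    model: &str,
    system_prompt: &str,
    mcp_result: &Value,
) -> Result<String> {
    let body = json!({
        "model": model,
        "messages": [
            { "role": "system", "content": system_prompt },
            { "role": "user",
              "content": format!("Screen ui element tree: {mcp_result}") }
        ]
    });
    let resp: Value = reqwest::Client::new()
        .post(format!("{base_url}/v1/chat/completions"))
        .json(&body)
        .send()
        .await?
        .json()
        .await?;
    // Pull the assistant text out of the standard chat-completions shape.
    Ok(resp["choices"][0]["message"]["content"]
        .as_str()
        .unwrap_or_default()
        .to_string())
}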

Why not just use the Mediar hosted backend? Why bother running the model locally?

Three reasons developers have actually given. One, screen contents are sensitive and they do not want any pixel or any element label leaving the machine. Two, they are running on a flight or a customer site without reliable internet and they need an agent loop that does not hard-fail when the network drops. Three, latency. A round trip to localhost on the same machine where the AX tree is captured is dominated by model inference time, not network, and a small model on CPU can finish in under a second on a recent laptop. The hosted backend is the right call for production-quality output on a fast network. Local is the right call when any of those three constraints bite.
