tree diff, not screenshot diff

Computer use agent state tracking, done with two regexes and a tree diff

Every guide on this topic says the same thing: take a screenshot before, take a screenshot after, ask the model what changed. That works, but it costs you a model turn per action plus a vision inference that cannot tell a button becoming enabled from a subpixel shift in font rendering.

Terminator does it differently. The framework captures the accessibility tree before the action, performs the action, waits 1500ms for the UI to settle, captures the tree again, strips the volatile fields (IDs and bounds) out of both, and runs a line diff. The result is a structural description of what changed, returned in the same response as the action result. No second model call. No vision compare. A window that moved 118 pixels but did not re-render reports zero changes.

ui_diff_before_after, execute_with_ui_diff, simple_ui_tree_diff, settle_delay_ms: 1500

Matthew Diakonov, Written with AI

Published May 14, 2026. 9 min read

Direct answer (verified 2026-05-14)

Most computer use agents track UI state by re-snapshotting the screen or the accessibility tree after every action and asking the model to compare. Terminator does it structurally. Pass ui_diff_before_after: true to any action tool. The framework grabs the accessibility tree before the action, runs the action, waits 1500ms, grabs the tree again, strips IDs and bounds from both copies, and returns the diff lines in the action response under ui_diff and has_ui_changes. No second model call. No vision compare.

Source: crates/terminator/src/ui_tree_diff.rs, 227 lines, and execute_with_ui_diff in lib.rs from line 1748.
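
The sequence above can be sketched in TypeScript. This is an illustration under stated assumptions, not Terminator's implementation: captureTree, performAction, normalize, and diffLines are hypothetical injected functions. Only the response field names ui_diff and has_ui_changes and the 1500ms default come from the description above.

```typescript
// Illustrative sketch only; none of these function names are Terminator APIs.
interface UiDiffResponse {
  ui_diff: string; // "+" and "-" lines, empty when nothing changed
  has_ui_changes: boolean;
}

interface DiffDeps {
  captureTree: () => Promise<string>; // hypothetical: serialized accessibility tree
  performAction: () => Promise<void>; // hypothetical: the click, keypress, etc.
  normalize: (tree: string) => string; // strips volatile fields (IDs, bounds)
  diffLines: (before: string, after: string) => string[]; // changed lines only
}

// Pure step: turn two raw trees into the response shape.
function buildDiffResponse(
  rawBefore: string,
  rawAfter: string,
  deps: Pick<DiffDeps, "normalize" | "diffLines">,
): UiDiffResponse {
  const changed = deps.diffLines(deps.normalize(rawBefore), deps.normalize(rawAfter));
  return { ui_diff: changed.join("\n"), has_ui_changes: changed.length > 0 };
}

// Ordering matters: capture, act, settle, capture again, then diff.
async function actWithUiDiff(deps: DiffDeps, settleDelayMs = 1500): Promise<UiDiffResponse> {
  const rawBefore = await deps.captureTree();
  await deps.performAction();
  await new Promise<void>((resolve) => setTimeout(resolve, settleDelayMs));
  const rawAfter = await deps.captureTree();
  return buildDiffResponse(rawBefore, rawAfter, deps);
}
```

The point of the shape is that the diff is computed before the response is returned, so the caller never needs a second round trip to learn what changed.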

The reason you cannot just diff the raw tree

The accessibility tree of a real foreground window has 200 to 800 nodes on a typical screen. Every node carries an AutomationId, an internal element_id, and a bounding rectangle. On many apps the AutomationId rotates on every new session. The internal element_id rotates on every tree fetch. The bounds shift any time a parent reflows by a pixel, which happens during animations, DPI changes, and font fallbacks.

If you write a naive tree_after == tree_before check, you get a diff on every fetch even when nothing changed. The volume is roughly: one +/- pair per element under the parent that animated. For a Save dialog with 40 children, that is 80 noise lines. Real changes get buried.

naive raw-tree diff after a single Save click

- [Button] Save #id8d3a (bounds: [892,512,80,30], focusable)
+ [Button] Save #idee71 (bounds: [892,514,80,30], focusable)
- [Group] Toolbar #id2b91 (bounds: [12,8,1280,42])
+ [Group] Toolbar #id7f02 (bounds: [12,8,1280,42])
- [Edit] Filename #idab43 (bounds: [420,460,260,28])
+ [Edit] Filename #idc197 (bounds: [420,462,260,28])
... 218 more lines of pure noise ...

Two pixels of vertical shift on the Save button and the Filename field (the dialog was still settling along an animation easing curve, nudging them down), plus fresh IDs on every node. None of it is a state change. All of it is in the diff. An agent that reads this output learns nothing about whether the Save actually did anything.
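
The noise volume is easy to reproduce. The sketch below is a deliberately naive comparison, not Terminator's code: naiveDiff is a hypothetical helper that treats any line missing verbatim from the other tree as changed. On the three elements above, every line counts as changed, which gives six noise lines, the same two-per-element ratio that yields 80 lines for 40 children.

```typescript
// Naive comparison: a line counts as changed unless the other tree contains it verbatim.
function naiveDiff(beforeTree: string, afterTree: string): string[] {
  const beforeLines = beforeTree.split("\n");
  const afterLines = afterTree.split("\n");
  const beforeSet = new Set(beforeLines);
  const afterSet = new Set(afterLines);
  const removed = beforeLines.filter((l) => !afterSet.has(l)).map((l) => "- " + l);
  const added = afterLines.filter((l) => !beforeSet.has(l)).map((l) => "+ " + l);
  return [...removed, ...added];
}

// The three elements from the sample above, before and after the click.
const rawTreeBefore = [
  "[Button] Save #id8d3a (bounds: [892,512,80,30], focusable)",
  "[Group] Toolbar #id2b91 (bounds: [12,8,1280,42])",
  "[Edit] Filename #idab43 (bounds: [420,460,260,28])",
].join("\n");

const rawTreeAfter = [
  "[Button] Save #idee71 (bounds: [892,514,80,30], focusable)",
  "[Group] Toolbar #id7f02 (bounds: [12,8,1280,42])",
  "[Edit] Filename #idc197 (bounds: [420,462,260,28])",
].join("\n");

// Every element shows up twice even though no state changed.
const noiseLines = naiveDiff(rawTreeBefore, rawTreeAfter);
```

Note that the Toolbar line is flagged even though its bounds are identical, because the rotated ID alone is enough to make the raw strings differ.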

The two regexes that fix it

Open crates/terminator/src/ui_tree_diff.rs in the Terminator repo. The whole module is 227 lines. The state tracking primitive lives in two functions.

The compact YAML format the framework uses for tree dumps looks like - [Button] Save #id8d3a (bounds: [892,512,80,30], focusable). Two regex substitutions get applied to every line before the diff runs:

ui_tree_diff.rs lines 40-50

pub fn remove_ids_and_bounds_from_compact_yaml(yaml_str: &str) -> String {
    // Remove #id patterns (e.g., #12345, #abc-def-123)
    // This regex matches: space + # + word characters (letters, numbers, hyphens)
    let id_re = Regex::new(r" #[\w\-]+").unwrap();
    let result = id_re.replace_all(yaml_str, "");

    // Remove bounds patterns: "bounds: [x,y,w,h]" with optional trailing comma/space
    // Matches: "bounds: [123,456,789,100]" or "bounds: [123,456,789,100], "
    let bounds_re = Regex::new(r"bounds: \[[^\]]+\],?\s*").unwrap();
    bounds_re.replace_all(&result, "").to_string()
}
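
If you work from the Node side and want to experiment with the same cleanup, the two substitutions can be mirrored in TypeScript. This is an illustrative port, not code from the Terminator repo; stripIdsAndBounds is a hypothetical name.

```typescript
// Illustrative TypeScript mirror of the two substitutions; not part of Terminator.
function stripIdsAndBounds(compactYaml: string): string {
  // " #id8d3a" style markers: a space, a hash, then word characters or hyphens
  const withoutIds = compactYaml.replace(/ #[\w-]+/g, "");
  // "bounds: [x,y,w,h]" plus an optional trailing comma and whitespace
  return withoutIds.replace(/bounds: \[[^\]]+\],?\s*/g, "");
}

const saveLineBefore = "- [Button] Save #id8d3a (bounds: [892,512,80,30], focusable)";
const saveLineAfter = "- [Button] Save #idee71 (bounds: [892,514,80,30], focusable)";

// Both lines reduce to "- [Button] Save (focusable)", so a line diff sees no change.
const cleanedSaveBefore = stripIdsAndBounds(saveLineBefore);
const cleanedSaveAfter = stripIdsAndBounds(saveLineAfter);
```

Two properties of the patterns are worth knowing. An element whose parentheses held only bounds is left with an empty pair, so the Toolbar line becomes "- [Group] Toolbar ()". And the ID pattern is purely textual, so an element name that contains a space followed by a hash and a word (for example "Issue #42") loses that fragment as well.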

For trees serialized as JSON (the larger, more verbose form), a separate function preprocess_tree parses with serde_json, walks the object, drops every key named id or element_id, and re-serializes with pretty printing. After preprocessing, both cleaned strings go into similar::TextDiff::from_lines, a Rust line-diffing API that plays roughly the role of Python's difflib.ndiff, and only the lines tagged Insert or Delete come back. Equal lines are dropped.
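
The JSON path can be sketched the same way. The code below is a hypothetical TypeScript rendering of the behavior described above (parse, drop id and element_id at every depth, pretty-print), not a translation of the actual Rust function; the names dropVolatileKeys and preprocessTreeSketch are made up for illustration.

```typescript
// Hypothetical sketch of the JSON path: drop "id" and "element_id" keys at every depth.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const VOLATILE_KEYS = new Set(["id", "element_id"]);

function dropVolatileKeys(node: Json): Json {
  if (Array.isArray(node)) {
    return node.map((child) => dropVolatileKeys(child));
  }
  if (node !== null && typeof node === "object") {
    const cleaned: { [key: string]: Json } = {};
    for (const [key, value] of Object.entries(node)) {
      if (!VOLATILE_KEYS.has(key)) cleaned[key] = dropVolatileKeys(value);
    }
    return cleaned;
  }
  return node; // primitives pass through untouched
}

// Parse, clean, and pretty-print so each field lands on its own line for the line diff.
function preprocessTreeSketch(treeJson: string): string {
  const parsed = JSON.parse(treeJson) as Json;
  return JSON.stringify(dropVolatileKeys(parsed), null, 2);
}
```

Pretty printing matters here because the comparison that follows is line based: one field per line keeps a single changed attribute from flagging the whole element.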

Anchor fact

A window that moves 118 pixels without a re-render returns zero changes.

The repo carries this as a test. Open ui_tree_diff.rs at line 202: test_simple_ui_tree_diff_yaml_bounds_change_no_diff. The two trees have the same structure, but in the second one the Reply button sits 118 pixels lower on the Y axis and its parent group is 118 pixels taller. The assertion is diff.is_none(). If you regress the regex you break this test.

ui_tree_diff.rs lines 201-212

#[test]
fn test_simple_ui_tree_diff_yaml_bounds_change_no_diff() {
    // Same structure, only bounds changed (element moved down 118px)
    let tree1 = "- [Group] Comment from flappy-goose (bounds: [26,472,617,367], focusable)\n  - [Button] Reply (bounds: [128,961,82,34], focusable)";
    let tree2 = "- [Group] Comment from flappy-goose (bounds: [26,472,617,485], focusable)\n  - [Button] Reply (bounds: [128,1079,82,34], focusable)";

    let diff = simple_ui_tree_diff(tree1, tree2).unwrap();
    assert!(diff.is_none(), "Bounds-only changes should not produce a diff");
}
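
To see why the assertion holds, here is a small TypeScript stand-in for the whole strip-then-diff step. It is a sketch under assumptions: simpleUiTreeDiffSketch is a hypothetical name, and the line diff is a plain longest-common-subsequence walk rather than the similar crate's algorithm, so ordering of output lines may differ from the real thing.

```typescript
// Illustrative stand-in, not the Rust function: strip volatile fields, then line-diff.
function stripVolatile(tree: string): string {
  return tree.replace(/ #[\w-]+/g, "").replace(/bounds: \[[^\]]+\],?\s*/g, "");
}

// Longest-common-subsequence line diff that keeps only inserted and deleted lines.
function changedLines(beforeText: string, afterText: string): string[] {
  const a = beforeText.split("\n");
  const b = afterText.split("\n");
  const width = b.length + 1;
  // lcs[i * width + j] holds the LCS length of a[i..] and b[j..]
  const lcs: number[] = new Array<number>((a.length + 1) * width).fill(0);
  const at = (i: number, j: number): number => lcs[i * width + j] ?? 0;
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i * width + j] =
        a[i] === b[j] ? at(i + 1, j + 1) + 1 : Math.max(at(i + 1, j), at(i, j + 1));
    }
  }
  const out: string[] = [];
  let ai = 0;
  let bj = 0;
  while (ai < a.length && bj < b.length) {
    if (a[ai] === b[bj]) {
      ai++; // equal line: dropped on purpose
      bj++;
    } else if (at(ai + 1, bj) >= at(ai, bj + 1)) {
      out.push("- " + a[ai]);
      ai++;
    } else {
      out.push("+ " + b[bj]);
      bj++;
    }
  }
  while (ai < a.length) out.push("- " + a[ai++]);
  while (bj < b.length) out.push("+ " + b[bj++]);
  return out;
}

// Returns null when nothing semantic changed, mirroring the is_none() assertion.
function simpleUiTreeDiffSketch(before: string, after: string): string | null {
  const lines = changedLines(stripVolatile(before), stripVolatile(after));
  return lines.length > 0 ? lines.join("\n") : null;
}
```

Fed the two trees from the test, both sides reduce to the same two lines, so the sketch returns null. Append a new child to the second tree and exactly one plus line comes back.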

The full call shape, end to end

One click, traced through the agent loop. The MCP server is in the middle. The tree capture engine is the UIA backend on Windows or the AX backend on macOS. The diff engine is the 227-line module above.

One action, one structural diff returned to the agent

What the agent gets back is two fields: ui_diff (a string of plus and minus lines) and has_ui_changes (a boolean). The model reads them in the same turn it issued the action. Branch logic is local: if has_ui_changes is false, retry with a different selector. If new dialog lines appeared, focus the next action on those. If the target node disappeared (a dialog closed, a menu item was consumed), the action most likely succeeded.
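
The branch logic above can be written down as a small routing function. This is a hypothetical sketch: routeOnDiff, the NextStep variants, and the idea of matching on a [Dialog] role tag are illustrative choices, and only the two response fields come from the text.

```typescript
// Hypothetical routing helper; only ui_diff and has_ui_changes come from the response.
interface ActionResponse {
  ui_diff: string;
  has_ui_changes: boolean;
}

type NextStep =
  | { kind: "retry_with_other_selector" }
  | { kind: "focus_on_new_nodes"; added: string[] }
  | { kind: "target_gone"; removed: string[] }
  | { kind: "inspect_changes"; added: string[]; removed: string[] };

function routeOnDiff(response: ActionResponse, targetLabel: string): NextStep {
  // Nothing semantic changed: the selector probably hit the wrong element.
  if (!response.has_ui_changes) return { kind: "retry_with_other_selector" };

  const lines = response.ui_diff.split("\n");
  const added = lines.filter((l) => l.startsWith("+")).map((l) => l.slice(1).trim());
  const removed = lines.filter((l) => l.startsWith("-")).map((l) => l.slice(1).trim());

  // A new dialog appeared: aim the next action at the new nodes.
  if (added.some((l) => l.includes("[Dialog]"))) return { kind: "focus_on_new_nodes", added };
  // The element we acted on is gone: treat as likely success.
  if (removed.some((l) => l.includes(targetLabel))) return { kind: "target_gone", removed };
  return { kind: "inspect_changes", added, removed };
}
```

For example, routeOnDiff(response, "[Button] Save") after a Save click returns focus_on_new_nodes when a Save As dialog opened, and retry_with_other_selector when the click landed on a dead area.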

What gets stripped, and why each one matters

The preprocessing pass is the entire point. Without it the diff is noise. With it the diff is signal. Here is what falls out before the line comparison runs.

Volatile fields removed before diffing

AutomationId values rotate per-session on some apps (notably Office, every browser, anything WPF). Diffing them produces a +/- pair for every element under the action's parent.
element_id is Terminator's own internal identifier, regenerated on each get_window_tree call. It exists for selector caching, not for state comparison.
bounds: [x,y,w,h] shifts when a window animates, when a parent reflows after a font fallback, or when DPI scaling kicks in on a second monitor. Pure visual noise from a semantic standpoint.
Equal lines (similar::ChangeTag::Equal) are dropped on purpose. The diff is a description of what changed, not a re-statement of what stayed the same. Keeps the diff inside a few hundred tokens even for large trees.
JSON path is handled separately by preprocess_tree: parse with serde_json, walk the object, drop keys named id and element_id, re-serialize with pretty printing.
Compact YAML path is handled with two regex substitutions because the YAML never gets parsed back into a tree; it stays as the literal indented representation an LLM is good at reading.

Screenshot diff vs structural diff

Both approaches answer the question "what changed after my action." They answer it at different layers. The cost profile, the failure modes, and the LLM token budget look very different.

The two approaches, same question, different answers

Take a full screenshot before and after. Send both images to the model. Ask the model to compare. The model spends a second inference on a visual comparison that cannot tell what is structural from what is decorative.

Two full vision inferences per action
Subpixel jitter and animation easing count as changes
Font fallback renders as a real diff to the model
DPI scaling changes everything at once
Costs tokens proportional to screen resolution

The settle delay, in one paragraph

Most native UI animations finish inside one second. Capture the AFTER tree too early and you snapshot a half-rendered dialog with children that vanish on the next frame. The default of 1500ms is the empirical floor that produces a clean diff for the common cases (Office, Chrome, Electron apps, native Windows shell). The number lives at line 1814 of crates/terminator/src/lib.rs as the unwrap_or default on opts.settle_delay_ms. Override it per call when you know more about the action. Keyboard sequences inside a single text field finish in under 200ms, so 200-500 is fine. Actions that trigger a server round trip before the next dialog renders want 2500-5000.
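
One way to keep these overrides out of individual call sites is a small lookup. The action categories and the specific values below are assumptions picked from inside the ranges quoted above; diffOptionsFor is a hypothetical helper, and only ui_diff_before_after and settle_delay_ms are names taken from this article.

```typescript
// Hypothetical action categories; the delays sit inside the ranges quoted above.
type ActionKind = "keyboard_in_field" | "standard" | "network_round_trip";

const SETTLE_DELAY_MS: Record<ActionKind, number> = {
  keyboard_in_field: 300, // 200-500ms: keystrokes inside one text field
  standard: 1500, // the framework default
  network_round_trip: 3000, // 2500-5000ms: the UI waits on a server response
};

// Builds per-call options using the two parameter names from this article.
function diffOptionsFor(kind: ActionKind): {
  ui_diff_before_after: boolean;
  settle_delay_ms: number;
} {
  return { ui_diff_before_after: true, settle_delay_ms: SETTLE_DELAY_MS[kind] };
}
```

If an action in the network category still returns a half-built dialog, raise its value toward the top of the range rather than changing the default for everything.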

Wiring it into your own loop

Four steps. The first one is the install you already have if you ship anything on top of Terminator.

From install to first structural diff

1. Install Terminator's MCP server
Run claude mcp add terminator "npx -y terminator-mcp-agent@latest". The same npx command goes into the MCP config block for Cursor, VS Code, and Windsurf.
2. Set ui_diff_before_after: true on the first action call
The system prompt the MCP server ships already pushes the assistant to do this. If you want belt-and-suspenders, mention it once in your own instructions to Claude.
3. Read ui_diff in the response
Lines starting with + are new in the after tree. Lines starting with - were removed. No diff lines means the action did not change the UI semantically, which is a real failure mode worth handling.
4. Override settle_delay_ms when the action waits on a network
Default is 1500ms. Increase to 2500-5000ms for clicks that trigger a server round-trip before the dialog renders. Decrease to 200-500ms for keypress sequences inside a single text field.

node SDK: state tracking on a click

import { Desktop } from "terminator.js";

const desktop = new Desktop();
const saveBtn = await desktop
  .locator("process:notepad >> role:Button && name:Save")
  .first();

const result = await saveBtn.click({
  uiDiffBeforeAfter: true,
  uiDiffMaxDepth: 30,
});

if (result.uiDiff?.hasChanges) {
  // result.uiDiff.diff is a string of + and - lines
  // route on it: dialog opened, button disabled, etc.
} else {
  // the click resolved but nothing semantic changed
  // retry, re-acquire tree, or surface failure
}

rust: the lower-level shape

use terminator::{Desktop, UiDiffOptions};

let desktop = Desktop::new_default()?;
let options = UiDiffOptions {
    settle_delay_ms: Some(1500),
    ..Default::default()
};

let (result, element, diff) = desktop
    .execute_with_ui_diff(
        "process:notepad >> role:Button && name:Save",
        |el| el.click(),
        Some(options),
    )
    .await?;

match diff {
    Some(d) if d.has_changes => { /* d.diff is a String of +/- lines */ }
    Some(_) => { /* "No UI changes detected" */ }
    None    => { /* tree capture failed; action still happened */ }
}

On the MCP path (Claude Code, Cursor, VS Code, Windsurf), there is nothing to wire up on your side. The server prompt the assistant receives already steers it toward ui_diff_before_after: true on every action tool. The diff arrives in the response JSON. The assistant reads it before deciding what to do next. The changelog entry at version 0.24 made the parameter mandatory at the MCP layer specifically because agents that omitted it kept losing track of state.

What this does not solve

Structural state tracking is not a replacement for vision in every case. There are two regimes where the tree diff is the wrong tool.

First, apps with empty accessibility trees. Older games, custom DirectX surfaces, some Java Swing dialogs, and a handful of Electron apps that never wire up ARIA. The before and after trees are both effectively empty. The diff is always empty. You need vision here. Terminator's SDK ships a Gemini Computer Use adapter for this case; the click_element tool also exposes an include_omniparser path that runs an OCR plus visual model on top of the screenshot. The structural diff coexists with the visual grounding; you do not have to pick one.

Second, changes that are purely visual. A button glow, a color shift in a chart, a hover state that fades in. None of those show up in the accessibility tree because they are not announced to assistive tech. If your agent has to act on visual-only feedback, you want a vision pass; the tree diff will report has_changes: false and be technically correct but not useful. The honest split is to use the tree diff for navigation and semantic state, and vision for the cases where the application explicitly opted out of announcing the change.
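
The first regime, at least, can be detected mechanically. The sketch below is a hypothetical heuristic, not something Terminator ships: it counts element lines in the compact tree format and falls back to vision when both captures are nearly empty. The threshold of 3 is an arbitrary assumption to tune per app, and the heuristic does nothing for the second regime, where the tree is rich but the change is visual only.

```typescript
// Hypothetical heuristic: treat a tree with very few element lines as "effectively empty".
type Tracker = "tree_diff" | "vision";

function countElementLines(compactYaml: string): number {
  // Element lines in the compact format look like "- [Role] Name (...)"
  return compactYaml.split("\n").filter((l) => /^\s*- \[[^\]]+\]/.test(l)).length;
}

function pickTracker(beforeTree: string, afterTree: string, minElements = 3): Tracker {
  const sparse =
    countElementLines(beforeTree) < minElements && countElementLines(afterTree) < minElements;
  return sparse ? "vision" : "tree_diff";
}
```

Requiring both captures to be sparse avoids a false fallback when the action itself opened or closed most of the UI.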

Frequently asked questions

What is computer use agent state tracking, in one sentence?

It is the part of an agent loop that answers the question 'did my action actually change the UI, and how?' without taking another full screenshot or re-querying the whole accessibility tree by hand. For Terminator that answer is a structured diff returned alongside the click or type response, listing only the lines of the tree that changed after the volatile fields (IDs, bounds) were stripped. For a typical screenshot-only computer use loop it is a second model call against a pair of images.

Why isn't 'diff the accessibility tree before and after' enough on its own?

Because the raw tree is full of fields that change every time you fetch it even when nothing visible changed. AutomationId values regenerate, runtime element_id values rotate, and bounding rectangles shift by a pixel when a parent reflows. If you diff the raw trees you get a wall of noise on every click. The interesting signal (a new dialog appeared, a button became enabled, a text field gained focus) drowns. Terminator's preprocess_tree and remove_ids_and_bounds_from_compact_yaml functions remove id, element_id, and bounds entries before the line diff runs, so the diff is over semantic structure only. A window that moved 118 pixels without re-rendering returns no diff at all. There is a test for exactly that case (test_simple_ui_tree_diff_yaml_bounds_change_no_diff at line 202 of ui_tree_diff.rs).

Where exactly in the code does this happen?

Two functions in crates/terminator/src/ui_tree_diff.rs do the work. remove_ids_and_bounds_from_compact_yaml uses a regex ` #[\w\-]+` (literal space, hash, then word characters or hyphens) to strip the inline id markers from compact YAML lines like `- [Button] Submit #id123 (bounds: [10,20,100,30], focusable)`. Then it uses `bounds: \[[^\]]+\],?\s*` to strip the bounds payload. preprocess_tree handles the JSON variant: it walks the tree and drops any key named id or element_id. The cleaned strings then go into similar::TextDiff::from_lines, the Rust equivalent of Python's difflib.ndiff, and only the lines tagged Insert or Delete come back. Equal lines are skipped. The whole module is 227 lines including tests.

How does an agent actually receive the diff?

If you are using the Rust crate directly, you call desktop.execute_with_ui_diff(selector, action, Some(UiDiffOptions { settle_delay_ms: Some(1500), .. })) and get back a tuple of (action_result, element, Option<UiDiffResult>). UiDiffResult has two fields: diff (a string of + and - lines) and has_changes (a boolean). If you are on the Node.js SDK, you pass `uiDiffBeforeAfter: true` to .click() or .typeText() options and the result object includes a uiDiff field with the same shape. If you are using the MCP server from Claude Code, Cursor, VS Code, or Windsurf, you set `ui_diff_before_after: true` on click_element, type_into_element, press_key, or any other action tool and the response JSON carries ui_diff and has_ui_changes. The system prompt the MCP server ships explicitly tells the assistant not to call get_window_tree after an action, because the diff is already attached to the action response.

What does the structured diff look like in practice?

A list of lines that changed after the strip pass. Lines starting with `-` were in the before tree and not in the after. Lines starting with `+` are new. For a Save dialog opening, you would see something like `+ - [Dialog] Save As (focusable)` followed by `+ - [Edit] File name` and `+ - [Button] Save`. Plus and minus signs are the only markers; equal lines are dropped on purpose to keep the diff short enough for an LLM to read within a small token budget. The diff is plain text, not JSON, because the model already speaks diff syntax fluently.

Why is the default settle delay 1500 milliseconds?

Because most native UI animations (Windows Fluent transitions, macOS view animations, web view fade-ins inside Electron apps) complete inside one second, and you need a margin. If you capture the AFTER tree too early you see the half-rendered intermediate state, which produces a diff full of partially-built dialog children that disappear on the next frame. 1500ms is the empirical floor that produces a clean diff for the common cases. The number is the unwrap_or default on opts.settle_delay_ms inside execute_with_ui_diff in crates/terminator/src/lib.rs around line 1814. Override it per call: shorter (200-500ms) for purely keyboard actions inside a single text field, longer (2500-5000ms) for actions that trigger a network round-trip the UI is waiting on.

How is this different from Anthropic's computer use loop or other screenshot-based agents?

Screenshot-based loops track state by comparing two raster images. The model has to do the comparison, which costs another full inference, and the comparison is pixel-aware (not semantically aware), so font rendering changes, subpixel jitter, and animation easing all register as differences. Terminator's loop tracks state structurally at the accessibility tree level. The diff is computed in Rust before the model ever sees it, so the per-action token cost drops to a handful of tokens unless something semantically changed. Vision still has a place in the loop (Terminator's open source SDK ships a Gemini Computer Use adapter for cases where the tree is empty), but for the common case where the app exposes a real accessibility tree, you want structural state tracking, not visual.

Does the diff capture cost a full extra tree walk?

Yes. It walks the tree twice (once before, once after) and computes a line diff between the two compact YAML strings. On a typical foreground window with 200-800 elements each capture costs in the tens of milliseconds on Windows UIA, plus the settle delay. The tradeoff is that you save a model call per action. If your action tool would otherwise be followed by get_window_tree (the LLM's default move to figure out what happened), you have replaced one tree walk plus one full model turn with two tree walks and a small diff string. Cheaper in tokens, usually faster, and more honest about what changed.
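
A back-of-envelope latency comparison makes the tradeoff concrete. Every number fed in here is an assumed input, not a measurement: the article only says a capture costs tens of milliseconds and the default settle delay is 1500ms, and model turn latency depends entirely on your provider. The function names are hypothetical.

```typescript
// Back-of-envelope latency comparison; all inputs are assumptions, not measurements.
interface CostInputs {
  treeWalkMs: number; // one accessibility tree capture
  settleDelayMs: number; // wait between action and second capture
  modelTurnMs: number; // one extra model round trip
}

// Diff path: capture before, settle, capture after. No extra model turn.
function overheadWithDiffMs(c: CostInputs): number {
  return 2 * c.treeWalkMs + c.settleDelayMs;
}

// Follow-up path: the model spends a turn asking for the tree, then one capture runs.
function overheadWithFollowUpMs(c: CostInputs): number {
  return c.treeWalkMs + c.modelTurnMs;
}

// Example inputs: 30ms per capture, default settle delay, an assumed 3s model turn.
const exampleInputs: CostInputs = { treeWalkMs: 30, settleDelayMs: 1500, modelTurnMs: 3000 };
const diffPathMs = overheadWithDiffMs(exampleInputs); // 2 * 30 + 1500 = 1560
const followUpPathMs = overheadWithFollowUpMs(exampleInputs); // 30 + 3000 = 3030
```

With these inputs the diff path adds about half the latency of the follow-up path. The comparison flips if your model turns are faster than the settle delay, which is one more reason to shorten settle_delay_ms for actions you know are quick.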

What if the action did nothing visible?

You see `No UI changes detected` in the diff field and has_changes false. This is the failure mode the diff was built to surface. A click that resolved against the right element but did not change anything (because the button was already pressed, or the element is a dead area inside a misnamed container) shows up as `has_changes: false` instead of being silently reported as success. Pair this with the system prompt rule that says 'never hallucinate success' and the agent can decide whether to retry, re-acquire the tree, or escalate. The output of the action and the structural state delta are separated on purpose: one tells you the call returned ok, the other tells you the world moved.

Can I turn it off if I don't want the diff?

Yes. On the Rust SDK, just call the non-diff variants of the action methods (element.click(), element.type_text()). On the Node SDK, leave uiDiffBeforeAfter false (it defaults to false at the binding layer to keep simple scripts fast). On the MCP server, the server prompt strongly encourages it for agent loops; the changelog entry at version 0.24 made the parameter mandatory at the MCP layer specifically because agents that omitted it kept losing track of state. If you are running a deterministic recorded workflow where you already know the expected state, you can turn the diff off and gain back the settle delay. If you are running an LLM-driven loop, leave it on.

Building a computer use agent that has to track real desktop state?

Bring the loop you have. We will look at where state tracking is leaking and whether the structural diff fits.

Related reading

Keep reading

Accessibility API for computer use agents: the cache topology under click_element

Sentence to desktop automation script, the structural answer

Open source computer use agent SDK: deterministic + vision in one install

Accessibility tree closes the browser-to-native gap
