Desktop application automation where every click returns a post-state diff
Every guide on desktop application automation compares pricing, recorder UIs, and supported scripting languages. Those comparisons skip the one thing that decides whether your script is useful or dangerous in production: when you call element.click(), can the caller tell whether the click actually reached a handler? Terminator’s click is a five-phase function that validates the element, picks coordinates with a fallback chain, snapshots state before and after, and returns a ClickResult whose details string contains window_title_changed and bounds_changed. That line is at element.rs line 713.
The problem nobody in the space admits to
You write a script that automates SAP Logon. The script clicks the connect button. The click returns true. The next step reads the username field. It times out. Fifteen minutes of log reading later, you discover the UIA provider for that SAP dialog reported the button as visible and enabled, and the click fired, but a modal dialog from a different process had stolen focus and the click landed on an invisible pane. Your tool told you success.
This is the default behavior of almost every desktop application automation library in production use. Raw mouse-injection tools (AutoIt, AutoHotkey, pywinauto’s mouse module) send the click and return. UIA-based tools (Microsoft’s UI Automation, White, FlaUI in its InvokePattern mode) call the element’s invoke handler and return success if the call dispatched. Neither class captures state around the click. Both can report success after a click that did nothing.
The fix is not a bigger model or a smarter selector. It is a function that takes a snapshot before the click, sends a real mouse event, takes a snapshot after, and returns the diff. Five phases. Fifty-six lines of Rust. That function is in Terminator.
The five phases, walked end to end
This is the entire click pipeline. Each phase has a single responsibility and produces an output consumed by the next. Phase ordering is deterministic. A phase that errors short-circuits the rest.
Inside element.click() on Windows
Phase 1. validate_clickable
is_visible, is_enabled, ensure_in_viewport. An off-screen or disabled element returns an error here. No mouse event sent yet.
It returns AutomationError::ElementNotVisible or ElementNotEnabled if the UI is not ready. The 800ms wait_for_stable_bounds call was removed in a later pass and replaced with the tree-capture delay upstream, which is why you see the explanatory comment on line 407.

Phase 2. determine_click_coordinates
Ask UIA for GetClickablePoint. If it returns None or an error, fall back to bounds center. Return a tuple of (x, y, method, path_used).
On the primary path the method is ClickablePoint and the path is UIA::GetClickablePoint. On the fallback path they become BoundsCenter and UIA::BoundingRectangle. Both end up in the final ClickResult so the caller can see which route was taken.

Phase 3. Snapshot pre-state
Read the current window title and element bounds into pre_window_title and pre_bounds. Two UIA reads. They become the left side of the diff.
Phase 4. execute_mouse_click
Delegate to super::input::send_mouse_click with ClickType::Left. This is a real injected mouse event at the OS level, not an accessibility-layer Invoke call.
Phase 5. Diff post-state
Read window title and bounds again. Compute window_title_changed and bounds_changed. Format the details string. Return the full ClickResult.
let details = format!("path={path_used}; validated=true; window_title_changed={window_title_changed}; bounds_changed={bounds_changed}; pre_title=... post_title=... duration_ms=...");
crates/terminator/src/platforms/windows/element.rs line 713. The final string every click call returns to the caller.
The whole function, verbatim
Fifty-six lines. The PHASE N comments are in the source, not inserted for this guide. They exist so the flow is readable when you land inside the function from a stack trace.
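The verbatim Rust lives in the repo. As a reading aid, here is the same five-phase flow sketched in TypeScript — everything behind the UiStub interface is a stand-in for the real UIA and input calls, so treat this as a sketch of the contract, not the implementation:

```typescript
// Illustrative five-phase click pipeline. All UIA and mouse calls are stubs;
// only the phase ordering, the fallback, and the details format follow the guide.
type Bounds = { x: number; y: number; width: number; height: number };
type ClickResult = { method: string; coordinates: [number, number]; details: string };

interface UiStub {
  isVisible(): boolean;                       // PHASE 1 inputs
  isEnabled(): boolean;
  inViewport(): boolean;
  clickablePoint(): [number, number] | null;  // GetClickablePoint; null on custom controls
  bounds(): Bounds;
  windowTitle(): string;
  sendMouseClick(x: number, y: number): void; // a real OS event in the actual code
}

function click(ui: UiStub): ClickResult {
  // PHASE 1: validate_clickable — error out before any mouse event fires
  if (!ui.isVisible()) throw new Error("ElementNotVisible");
  if (!ui.isEnabled()) throw new Error("ElementNotEnabled");
  if (!ui.inViewport()) throw new Error("ElementNotVisible");

  // PHASE 2: determine_click_coordinates — preferred path, then bounds center
  const p = ui.clickablePoint();
  const b = ui.bounds();
  const [x, y, method, pathUsed] = p
    ? ([p[0], p[1], "ClickablePoint", "UIA::GetClickablePoint"] as const)
    : ([b.x + b.width / 2, b.y + b.height / 2, "BoundsCenter", "UIA::BoundingRectangle"] as const);

  // PHASE 3: snapshot pre-state (two UIA reads)
  const preTitle = ui.windowTitle();
  const preBounds = JSON.stringify(ui.bounds());

  // PHASE 4: inject the mouse event
  const start = Date.now();
  ui.sendMouseClick(x, y);

  // PHASE 5: snapshot post-state, compute the diff, format details
  const postTitle = ui.windowTitle();
  const postBounds = JSON.stringify(ui.bounds());
  const details =
    `path=${pathUsed}; validated=true; ` +
    `window_title_changed=${preTitle !== postTitle}; ` +
    `bounds_changed=${preBounds !== postBounds}; ` +
    `pre_title='${preTitle}'; post_title='${postTitle}'; ` +
    `duration_ms=${Date.now() - start}`;
  return { method, coordinates: [x, y], details };
}
```

The details string here follows the format quoted earlier in the guide; a phase that throws short-circuits everything after it, exactly as described.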
Phase 1, verbatim: three checks in order
Visibility, enabled state, and viewport membership. The multi-monitor visibility test is a separate function, is_visible_on_any_monitor, that iterates xcap::Monitor::all() and checks whether the element’s bounds intersect any physical display. An element placed on a disconnected second monitor fails here instead of producing a phantom click at (0, 0).
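The per-monitor test reduces to a rectangle-intersection check. A sketch, with the monitor list passed in where the real function iterates xcap::Monitor::all():

```typescript
type Rect = { x: number; y: number; width: number; height: number };

// Does the element's bounding rect overlap any physical display?
// Mirrors the idea behind is_visible_on_any_monitor; the monitor
// list is a parameter here, a live xcap query in the real code.
function visibleOnAnyMonitor(el: Rect, monitors: Rect[]): boolean {
  return monitors.some(
    (m) =>
      el.x < m.x + m.width &&
      el.x + el.width > m.x &&
      el.y < m.y + m.height &&
      el.y + el.height > m.y
  );
}
```

An element parked on a disconnected display fails this check and the click errors out in phase 1 instead of firing at a phantom coordinate.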
Phase 2, verbatim: the coordinate fallback nobody admits to
GetClickablePoint is the UIA-preferred way to ask “where should I click?”. It returns the point an assistive technology would click, which accounts for offset from the element’s bounding rectangle, split buttons, and controls that are reachable only through a sub-region. It also fails silently on custom controls that do not implement it. Older WPF apps, Java Swing windows under the JAWS bridge, and many Electron apps with minimal UIA providers hit the Ok(None) branch. Terminator falls through to the bounds center and stamps BoundsCenter into the result so the caller knows which route was used.
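A minimal sketch of the phase-2 tuple, assuming only what this guide states about the two paths — the getClickablePoint parameter stands in for the UIA call, which may legitimately return null:

```typescript
type Point = [number, number];
type Bounds = { x: number; y: number; width: number; height: number };

// Returns (x, y, method, path_used). Method and path names follow the guide:
// ClickablePoint / UIA::GetClickablePoint on the preferred route,
// BoundsCenter / UIA::BoundingRectangle on the fallback.
function determineClickCoordinates(
  getClickablePoint: () => Point | null,
  bounds: Bounds
): [number, number, string, string] {
  const p = getClickablePoint();
  if (p) return [p[0], p[1], "ClickablePoint", "UIA::GetClickablePoint"];
  // Fallback: center of the bounding rectangle.
  return [
    bounds.x + bounds.width / 2,
    bounds.y + bounds.height / 2,
    "BoundsCenter",
    "UIA::BoundingRectangle",
  ];
}
```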
The five phases, bundled into a bento
Each card is one phase. Phase 2 is the one with the fallback, so it gets the wide slot.
Phase 1. Validate
is_visible covers multi-monitor placement, is_enabled rejects greyed-out controls, ensure_in_viewport confirms the element sits inside the visible region. A click on an off-screen element returns an ElementNotVisible error before a single mouse event fires.
Phase 2. Pick coordinates
UIA::GetClickablePoint first. On failure, BoundingRectangle center. The chosen path is stamped into the ClickResult so you know which one won. Few comparable tools expose which path was taken at all.
Phase 3. Snapshot pre-state
Window title and bounds go into local variables before the click. Two UIA calls, fewer than 20ms on a warm tree. These become the left side of the diff.
Phase 4. Click
A real OS-level mouse event via super::input::send_mouse_click. Not InvokePattern. Not a synthetic focus event. The handler chain fires exactly as if a human moved the cursor and pressed the left button.
Phase 5. Diff post-state
Window title and bounds are read again. window_title_changed and bounds_changed are derived booleans that land in the returned details string. If both are false and you expected a reaction, you know.
Inputs, phases, and the ClickResult
Four inputs into the click function, one classifier in the middle, three outputs on the other side. This is the shape callers see when they integrate the SDK.
element.click() pipeline
What every click returns
The struct is defined in crates/terminator/src/lib.rs. Three fields. Serializable so it crosses IPC into the TypeScript and Python SDKs without reshaping.
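The field names below follow what this guide reports elsewhere (method, coordinates, details); the concrete TypeScript types are assumptions, shown mainly to make the serializability point concrete:

```typescript
// Shape consistent with the three fields the guide describes. The exact
// Rust field types are not shown on this page, so treat these as assumptions.
interface ClickResult {
  method: string;                // "ClickablePoint" or "BoundsCenter"
  coordinates: [number, number]; // where the mouse event was injected
  details: string;               // semicolon-delimited key=value diff
}

// Round-trips through JSON, which is what "crosses IPC without reshaping" requires.
const example: ClickResult = {
  method: "ClickablePoint",
  coordinates: [412, 230],
  details:
    "path=UIA::GetClickablePoint; validated=true; window_title_changed=true; " +
    "bounds_changed=false; pre_title='Notepad'; post_title='Save As'; duration_ms=94",
};
```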
Three real ClickResults
Same API, three different outcomes. Notice how the details string tells the caller exactly what happened without needing a screenshot or a log scrape.
Click opened a dialog
Fallback path, bounds changed
Click fired, nothing reacted
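The three strings behind those cards are not reproduced on this page. The lines below are constructed to match the documented details format — the first mirrors the FAQ example later in this guide; the titles and durations in the other two are illustrative, not captured output:

```
path=UIA::GetClickablePoint; validated=true; window_title_changed=true; bounds_changed=false; pre_title='Notepad'; post_title='Save As'; duration_ms=94
path=UIA::BoundingRectangle; validated=true; window_title_changed=false; bounds_changed=true; pre_title='Settings'; post_title='Settings'; duration_ms=61
path=UIA::GetClickablePoint; validated=true; window_title_changed=false; bounds_changed=false; pre_title='Main'; post_title='Main'; duration_ms=47
```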
The same call from the TypeScript SDK
The Rust core is the source of truth, but almost no one writes Rust directly against it. The TypeScript SDK (terminator.js) and the Python SDK (terminator-py) both return the same ClickResult object. This is the pattern we recommend in every agent tool and every workflow step.
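The SDK call itself is not reproduced here. Assuming only the ClickResult shape described in this guide, the recommended caller-side pattern — branch on the diff, never on the mere return of the call — looks like this:

```typescript
interface ClickResult { method: string; coordinates: [number, number]; details: string }

// Decide whether a click that was supposed to open a dialog actually did.
// The two diff booleans in the details string are the cheapest signals.
function clickProducedReaction(result: ClickResult): boolean {
  return (
    result.details.includes("window_title_changed=true") ||
    result.details.includes("bounds_changed=true")
  );
}
```

If both flags come back false when a dialog was expected, retry or re-read the UI tree instead of proceeding to the next step.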
Numbers from the source
Line 713 in element.rs is where the details string is formatted. Everything interesting a caller wants to assert on is concatenated here: path, window_title_changed, bounds_changed, duration_ms.
The explicit comment at element.rs:407 documents the removal of a wait_for_stable_bounds step. The tree-capture delay upstream covers the same stability check, so the click path is 800ms shorter per call with no loss of correctness.
Two coordinate-selection paths. The preferred one is UIA::GetClickablePoint. The fallback is UIA::BoundingRectangle center. The chosen path is stamped into ClickResult.method.
Feature-by-feature, versus the usual desktop tools
Every tool in this table is production software used to automate native Windows apps. The rows describe behaviors that are either present in or absent from the return value of their click primitive, not opinions about the product overall.
| Feature | UiPath / AutoIt / pywinauto / FlaUI | Terminator |
|---|---|---|
| Per-click return value tells you whether the UI reacted | No (void return or boolean) | ClickResult with pre/post diff, lib.rs line 191 |
| Coordinate selection with explicit fallback chain | Either UIA-only or mouse-only, no mixing | ClickablePoint → BoundsCenter, element.rs line 415 |
| Visibility, enabled, and viewport validated before click | Mouse-injection tools skip this; UIA-only tools do partial | validate_clickable, element.rs line 389 |
| Real OS mouse event (not InvokePattern) | Most prefer InvokePattern for speed | send_mouse_click, element.rs line 698 |
| Path taken (ClickablePoint or BoundsCenter) recorded in result | Not exposed | path_used field stamped into details string |
| Duration per click measured and returned | Rare; usually requires wrapping the call yourself | duration_ms in details, element.rs line 713 |
Verify in the repo
Five grep lines against the public mediar-ai/terminator repo. The phase comments, the diff computation, the fallback function, and the ClickResult struct are all at the line numbers quoted in this guide.
Want to see the details string for your own click path?
Bring a flaky desktop workflow. We will run it through Terminator and walk through each ClickResult together so you can see exactly which clicks land and which do not.
Frequently asked questions
What is desktop application automation, in one sentence?
Programmatic control of native applications on Windows and macOS by talking to the operating system's accessibility layer and the input subsystem, so a script or an AI agent can drive Notepad, Excel, SAP, VLC, or any other installed app without a human touching the mouse. Web automation is a subset that targets browsers; desktop automation covers the far larger remaining surface, where browser DevTools and Selenium cannot reach.
Why do most desktop automation libraries not tell you whether a click worked?
Because they ship one of two code paths. Path one is InvokePattern via the accessibility API, which calls the element's invoke handler directly and returns success if the call dispatched. It does not check that the dispatch produced a visible effect, because the API was not designed to. Path two is a raw mouse.click(x, y), which also has no feedback channel; the mouse subsystem acknowledges the injected event and returns. Neither path captures window title, bounds, or focus state before and after the click, so neither can tell you whether the click actually hit the control. Terminator's click method on element.rs line 666 wraps a real mouse click with a pre-state snapshot and a post-state snapshot and returns the diff in the ClickResult.details string.
Where exactly do Terminator's five click phases live in the source?
One function, one file. crates/terminator/src/platforms/windows/element.rs. Lines 666 to 722 define fn click. Phase 1 calls validate_clickable on line 675, which is defined at line 389 and runs is_visible, is_enabled, and ensure_in_viewport in that order. Phase 2 calls determine_click_coordinates on line 679, defined at line 415, which first tries UIA::GetClickablePoint and falls back to the bounding rectangle's center. Phase 3 snapshots the window title and bounds on lines 682 to 688. Phase 4 sends the physical mouse click on line 698 via super::input::send_mouse_click. Phase 5 snapshots post-state on lines 702 to 708 and builds the details string on line 713.
What does the ClickResult.details string actually look like?
A semicolon-delimited list of key=value pairs. A real example from a click that opened a new window: path=UIA::GetClickablePoint; validated=true; window_title_changed=true; bounds_changed=false; pre_title='Notepad'; post_title='Save As'; duration_ms=94. The path field is either UIA::GetClickablePoint when UIA returned a clickable point, or UIA::BoundingRectangle when the library fell back to the bounds center. window_title_changed and bounds_changed are the two cheapest signals that something reacted to the click. If both are false and you expected a dialog, your selector probably hit the wrong element.
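Because the format is flat key=value pairs, callers can parse it in a few lines. A sketch — the parser is ours, not part of the SDK:

```typescript
// Parse the semicolon-delimited key=value details string into a map.
// Splits on ';' and on the first '=' within each pair, so values that
// contain '::' or quotes survive intact.
function parseDetails(details: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const pair of details.split(";")) {
    const i = pair.indexOf("=");
    if (i === -1) continue;
    out[pair.slice(0, i).trim()] = pair.slice(i + 1).trim();
  }
  return out;
}
```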
What happens when UIA cannot return a ClickablePoint?
The function on element.rs lines 415 to 450 matches on the Ok(None) or Err branch and pulls bounds via UIA::BoundingRectangle, then computes (x + width/2, y + height/2). The path_used string in the returned ClickResult becomes BoundsCenter and the fallback is logged at debug level. This matters for custom controls whose UIA implementation reports bounds but skips GetClickablePoint, which is common in older WPF apps, Swing windows running under JAWS bridges, and Electron apps that expose a minimal UIA provider.
Is this faster or slower than a raw mouse.click?
Marginally slower per click, dramatically faster per successful workflow. The validation phase adds roughly one UIA round-trip, typically under 10ms on a modern machine. The pre and post snapshots are two more UIA calls each, another 10 to 20ms. In exchange, the caller gets a structured signal on every click about whether it produced a UI change. A workflow that would otherwise fail silently and require a human to inspect a screenshot now fails loudly with a details string that says window_title_changed=false, and the next line in the script can branch. The 800ms wait_for_stable_bounds delay was explicitly removed (see the comment on element.rs line 407 and line 701) because the tree-capture delay already covers it.
How does the AI agent use the post-state diff?
Terminator's MCP agent exposes click as a tool. The tool response includes the full ClickResult, so the model sees method, coordinates, and details on every call. When the model sees window_title_changed=false after a click that was supposed to open a dialog, it does not assume success. It either retries, asks the UI tree for the current state, or falls back to a different selector strategy. This is how the 95%+ success rate claim in the Terminator README is achieved: not by running a bigger model, but by giving the model a verifiable signal after every physical action.
Does the same five-phase flow run on macOS?
The phase structure is the same, the APIs are different. macOS uses the AX accessibility framework (AXUIElement) instead of UIAutomation. The shared trait is defined in crates/terminator/src/platforms/mod.rs, and the macOS implementation returns the same ClickResult type so downstream code does not branch on OS. The macOS equivalent of GetClickablePoint is AXUIElementCopyAttributeValue with kAXPositionAttribute plus kAXSizeAttribute, and the fallback is the same bounds-center computation. The point is that the contract (validate, coordinate, snapshot, click, diff) is portable across OSes; the plumbing behind each phase is not.
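That portable contract can be written down independently of the per-OS plumbing. A sketch — interface and class names are illustrative, not the crate's API:

```typescript
interface ClickResult { method: string; coordinates: [number, number]; details: string }

// One return type, per-OS plumbing: Windows backs this with UIAutomation,
// macOS with AXUIElement. Downstream code never branches on OS because
// the shared type is all it sees.
interface PlatformElement {
  click(): ClickResult;
}

// Mock macOS-flavored implementation to show the shared shape. The
// details string and "AX::BoundsCenter" path label are illustrative.
class MockMacElement implements PlatformElement {
  click(): ClickResult {
    // AX position + size feed the same bounds-center computation.
    const b = { x: 0, y: 0, width: 100, height: 40 };
    return {
      method: "BoundsCenter",
      coordinates: [b.x + b.width / 2, b.y + b.height / 2],
      details: "path=AX::BoundsCenter; validated=true; window_title_changed=false; bounds_changed=false",
    };
  }
}
```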
Can I skip the verification and just send the click?
Yes, but it is not the default. The lower-level click_at_coordinates trait method on platforms/mod.rs line 237 sends a raw click at the given (x, y) with no validation and no state snapshot. You give up the details string. Most users do not call it directly; it exists for the recorder replay layer, which already knows the target coordinates and only cares about faithfully reproducing the event stream. For anything that resembles a test assertion or an agent tool, use element.click() and read the returned ClickResult.