Accessibility APIs vs deterministic scripts. The word ‘deterministic’ is doing too much work.

The framing of this comparison gives the wrong answer before you start. People who call coordinate-and-sleep scripts ‘deterministic’ mean the inputs are deterministic: same (x, y), same Sleep, 2000, same keystrokes, every run. That is true and almost useless. The thing you care about when a script runs in production is the outcome, and on that axis the picture flips. Coordinate scripts are deterministic-in-inputs and non-deterministic-in-outcomes. Accessibility-API scripts use semantic inputs that look fuzzier but produce outcomes that hold across every environment change that breaks coord scripts.

Matthew Diakonov

Direct answer (verified 2026-05-14)

Accessibility APIs are more deterministic in outcomes. A coord-and-sleep script will produce the same input sequence forever, but its outcomes drift with DPI, theme, window position, screen resolution, animation timing, and whatever your machine happens to do in the background. An accessibility-API script (Terminator, pywinauto, atomacos, Hammerspoon) targets elements by their structural identity in the OS's accessibility tree, which survives all of those. When the resolution fails it fails with a typed error, not a silent misclick. Authoritative reference for the underlying API: Microsoft UI Automation overview. Source for the wait-loop pattern shown below: terminator/src/locator.rs.

  • Coordinate scripts encode the machine, not the task
  • Accessibility selectors carry semantic identity that survives DPI, theme, locale, window position
  • Real waits compile to polled predicates, not Sleep N
  • The recorder's contains_relative_time filter (events.rs:67-81) is what outcome-determinism looks like in source

What ‘deterministic scripts’ usually means in practice

The phrase shows up most in three contexts. The first is AutoHotkey and AutoIt scripts on Windows: short imperative programs that move the mouse to a coordinate, click, send keystrokes, and sleep between steps. The second is PyAutoGUI in Python: the same idea with a different syntax, usually paired with locateOnScreen for template-matching against a saved screenshot. The third is the recorded-macro lane: the user runs a recorder, performs the workflow once, and the tool captures the sequence of mouse moves and keystrokes for replay. Older RPA tools, Sikuli, Pulover's Macro Creator, and the legacy desktop-test suites built on UI Recorder all sit here.

What unites all three is that the script encodes the state of one specific machine at one specific moment. The position of a button, the time the previous action took, the appearance of an icon: all baked into the script as literal values. The script is ‘deterministic’ in the sense that two runs on that same machine produce the same syscalls. On a different machine the same syscalls land in different places.
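The arithmetic behind that last sentence can be sketched in a few lines of Python. The layout model here is an assumption made for illustration: a button sits at a fixed logical offset inside its window, and the display scale factor multiplies that offset uniformly. `Window` and `button_center` are hypothetical names, and real layout engines do not always scale this cleanly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical model of one machine's state at replay time."""
    origin_x: int   # top-left corner of the window on screen, physical pixels
    origin_y: int
    scale: float    # display scale factor: 1.0 = 100%, 2.0 = 200%


def button_center(win: Window, logical_x: int, logical_y: int) -> tuple[int, int]:
    """Physical-pixel position of a button placed at a logical offset in the window."""
    return (win.origin_x + round(logical_x * win.scale),
            win.origin_y + round(logical_y * win.scale))


# The coordinate the script author recorded on their own machine.
RECORDED_CLICK = (412, 300)

authoring = Window(origin_x=0, origin_y=0, scale=1.0)
hi_dpi = Window(origin_x=0, origin_y=0, scale=2.0)      # same window, 200% scale
dragged = Window(origin_x=150, origin_y=80, scale=1.0)  # same scale, window moved

for label, win in [("authoring", authoring), ("200% scale", hi_dpi), ("dragged", dragged)]:
    target = button_center(win, 412, 300)
    print(f"{label:12} button at {target}, recorded click hits: {target == RECORDED_CLICK}")
```

The recorded click never changes; only the button moves. Under this model the script is correct on exactly one combination of window origin and scale.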

Same task, written both ways

Save an Excel workbook with a specific filename, the canonical hello-world for desktop automation in office workflows. The coord-script version is shorter to write and feels more direct. The selector version is shorter to keep alive.

Save report-q2.xlsx, two ways

; autohotkey, pyautogui, autoit, recorded macros: same shape.
; the script encodes screen coordinates and time, and prays.

Sleep, 2000              ; wait for the window
MouseMove, 412, 300      ; the Save button on my machine, today
Click
Sleep, 500               ; wait for the save dialog
Send, report-q2.xlsx
Sleep, 200
MouseMove, 720, 540      ; the Save button on the dialog
Click

; the inputs are deterministic: same coordinates, same sleeps,
; every run. the outcomes are not deterministic: any of these
; will silently land the click on the wrong thing.
;
; - user is on a 4K monitor at 200% scale (button at 824, 600)
; - window opened off-center (button anywhere)
; - save took 2.3s, not 2.0s (clicked the canvas)
; - a notification toast popped at top-right (clicked the toast)
; - the dialog opened above another dialog (clicked the wrong dialog)
; - the user switched to dark mode (different image hash if template-matching)
;
; every fix is "make the inputs more specific": longer sleep,
; smaller template, more retries. none of those make the outcomes
; deterministic. they just narrow the band of luck.

  • every action depends on a wall-clock guess
  • every click depends on a screen coordinate
  • no predicate on whether the target exists, is visible, is enabled
  • fails silently when any environment input changes

What a wait actually compiles to

The single most expensive line in a deterministic script is Sleep, 2000. It is expensive twice: once because it always waits the full 2000ms even when the screen is ready in 200ms, and once because it does not wait long enough when the screen takes 2300ms. The accessibility-API equivalent is a polled predicate, and the Rust source for the canonical implementation is short enough to read in one breath.

The wait, on both sides

; the deterministic-script equivalent. canonical autohotkey shape.
; pyautogui's pyautogui.PAUSE has the same effect for every action.
; autoit's Sleep() is identical. recorded macros bake this in.

Sleep, 2000        ; wait for some duration we picked on our machine
MouseMove, 412, 300
Click

; the script does not check whether the element exists.
; does not check whether it is visible.
; does not check whether it is enabled.
; does not check whether anything is focused.
; the OS dispatches the click at (412, 300) into whatever window
; happens to be in foreground after the 2000ms.
;
; the "input" is deterministic: 2000ms, then click at (412, 300).
; the "outcome" is whatever the OS happens to be showing at that point.

The accessibility-API side of the comparison is roughly the body of Locator::wait_for in terminator/src/locator.rs, lines 170 to 233. The poll interval is fixed at 100ms (line 186). The four typed conditions are WaitCondition::Exists, Visible, Enabled, and Focused (lines 13 to 22). The loop returns the moment the predicate holds and returns a typed AutomationError::Timeout once the budget is spent; it does nothing else. There is no fudge factor for ‘the screen might still be animating’, because the predicate is the answer to that question.
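The loop shape described above can be sketched in Python. This is an illustration under stated assumptions, not a port of the Terminator source: `wait_for`, `WaitCondition`, `AutomationTimeout`, and the `probe` callback are hypothetical names, and the element is modelled as a plain dict. Only the 100ms poll interval and the four condition names come from the text.

```python
import time
from enum import Enum
from typing import Callable, Optional


class WaitCondition(Enum):
    EXISTS = "exists"
    VISIBLE = "visible"
    ENABLED = "enabled"
    FOCUSED = "focused"


class AutomationTimeout(Exception):
    """Typed failure: the predicate never held within the budget."""


# A probe returns the element's current state, or None if it is not in the tree.
Probe = Callable[[], Optional[dict]]


def wait_for(probe: Probe, condition: WaitCondition, timeout_ms: int,
             poll_ms: int = 100) -> dict:
    """Poll until `condition` holds for the probed element, or raise on timeout."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        element = probe()
        if element is not None:
            if condition is WaitCondition.EXISTS or element.get(condition.value, False):
                return element  # return the moment the predicate holds
        if time.monotonic() >= deadline:
            raise AutomationTimeout(f"{condition.name} not met within {timeout_ms}ms")
        time.sleep(poll_ms / 1000)


# Usage with a fake element that becomes visible on the third poll.
polls = {"count": 0}


def fake_save_button() -> Optional[dict]:
    polls["count"] += 1
    return {"name": "Save", "visible": polls["count"] >= 3, "enabled": True}


found = wait_for(fake_save_button, WaitCondition.VISIBLE, timeout_ms=5000)
print(found["name"], "visible after", polls["count"], "polls")
```

The 5000ms budget is an upper bound, not a cost: in this run the call returns after two poll intervals. A fixed sleep pays the full duration every time and still cannot report that the element never appeared.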

What replay looks like the next day

The interesting place to look at outcome-determinism is at replay time. The coord-based script makes a contract with the OS: dispatch these synthetic input events at these coordinates after this delay. The OS keeps that contract. The contract just does not say anything about what the target will be.

Coord-script replay: the click lands on whatever happens to be at (412, 300)

Participants: Script, OS HID layer, Window manager, Target app. Messages, in order:

  • Sleep 2000ms (no predicate)
  • (returns; window may or may not exist)
  • SendInput LBUTTONDOWN at (412, 300)
  • WM_LBUTTONDOWN dispatched
  • WM_LBUTTONDOWN to whatever is at (412, 300)
  • (silently clicked the wrong element)

The selector-based script makes a different contract: resolve this structural identity in the tree, wait until it is visible, then invoke its default action. The contract is over the OS's structural state, which is what the user was actually pointing at.

Selector-script replay: the action runs in the target process when the predicate holds

Participants: Script, Locator, UIA tree, Target app. Messages, in order:

  • wait("Visible", 5000)
  • find_element(role:Button && name:Save)
  • Element { is_visible: false }
  • sleep 100ms, poll again
  • Element { is_visible: true }
  • IUIAutomationInvokePattern::Invoke
  • Ok(()) - action ran in the target process

The first sequence has no ‘the wrong window is foregrounded’ arrow because the OS does not need one; it just dispatches. The second sequence has no ‘the wrong window is foregrounded’ arrow because the selector scope-clause (process:excel) excludes everything that is not Excel before resolution starts.
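A toy model of that scoping step, in Python, may make the point easier to see. Everything here is hypothetical: the flat list standing in for the accessibility tree, the `Element` record, and the `parse_selector` and `resolve` helpers. Only the selector text (`role:Button && name:Save`) and the `process:excel` scope come from the article, and real selector engines handle far more than exact matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    process: str
    role: str
    name: str


def parse_selector(selector: str) -> dict[str, str]:
    """Split 'role:Button && name:Save' into {'role': 'Button', 'name': 'Save'}."""
    clauses = {}
    for clause in selector.split("&&"):
        key, _, value = clause.strip().partition(":")
        clauses[key] = value
    return clauses


def resolve(tree: list[Element], process: str, selector: str) -> list[Element]:
    """Apply the process scope first, then match the remaining clauses exactly."""
    wanted = parse_selector(selector)
    in_scope = [el for el in tree if el.process == process]
    return [el for el in in_scope
            if all(getattr(el, key) == value for key, value in wanted.items())]


# A desktop where a notification toast also exposes a button named "Save".
desktop = [
    Element(process="notifier", role="Button", name="Save"),
    Element(process="excel", role="Button", name="Save"),
    Element(process="excel", role="Edit", name="File name"),
]

matches = resolve(desktop, process="excel", selector="role:Button && name:Save")
print(matches)
```

The toast's button has the same role and name, but it is discarded before matching begins, so which window happens to be in the foreground never enters the resolution.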

The defense recorders need (and what the coord recorders skip)

Even semantic selectors can drift if a recorder is not careful. A button labelled ‘3 hours ago’ on a Slack message becomes ‘yesterday’ tomorrow, ‘last week’ in a week, and ‘last month’ eventually. A recorder that captured name:3 hours ago into a selector chain produces a deterministic-looking script that matches nothing tomorrow. The fix is straightforward and lives in real source: filter the names the recorder is allowed to encode.

What Terminator's workflow recorder filters before writing a selector

  • `" ago"` (catches "3 hours ago", "5 minutes ago", "10 days ago" on every messaging app, every social feed, every notification surface). Source: events.rs:71.
  • `"just now"`, `"yesterday"`, `"today"`, `"last week"`, `"last month"` (catches the human-readable timestamps on Slack messages, Gmail rows, calendar events, Jira tickets). Source: events.rs:72-76.
  • `" min"`, `" mins"`, `" hr"`, `" hrs"` (catches "5 min ago", "2 hr ago" shorthands that some apps use instead of full English). Source: events.rs:77-80.
  • Empty strings, BSTR placeholders, `variant()`, `variant(empty)`, `<unknown>`, `<null>`, COM null patterns (catches the Win32 IUIAutomationElement returning a non-string default the recorder must not encode as a selector value). Source: events.rs:8-36.
  • Browser-internal Chromium and Firefox window names like `legacy window`, `chrome_widgetwin`, `intermediate d3d window`, and panes whose title is just `<doc> - Google Chrome`, because the `process:` clause already pins the browser. Source: events.rs:703-712.

The filter is defined in terminator-workflow-recorder/src/events.rs lines 67 to 81 (contains_relative_time), with the empty-name and null-like-value filter at lines 8 to 36 (is_empty_string, NULL_LIKE_VALUES). A coord-based recorder does not need this filter because it does not encode meaning in the first place. It captures (412, 300) and replays (412, 300). The replay outcome happens to be wrong; the recorder cannot tell. Outcome-determinism is not free in either approach: it costs a list of relative-time substrings on the AX side, and on the coord side it costs everything the script does.
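A name filter of this kind can be sketched in Python using only the substrings and placeholder values listed above. It is not a port of events.rs: the case-insensitive matching, the `selector_name_clause` helper, and the exact contents of the two tuples are assumptions made for illustration.

```python
from typing import Optional

# Substrings the article lists as relative-time markers.
RELATIVE_TIME_MARKERS = (
    " ago", "just now", "yesterday", "today", "last week", "last month",
    " min", " mins", " hr", " hrs",
)

# Placeholder values the article lists as null-like names.
NULL_LIKE_VALUES = ("variant()", "variant(empty)", "<unknown>", "<null>")


def contains_relative_time(name: str) -> bool:
    """True if the accessible name carries a timestamp that changes with the calendar."""
    lowered = name.lower()
    return any(marker in lowered for marker in RELATIVE_TIME_MARKERS)


def is_null_like(name: str) -> bool:
    """True for empty names and placeholder strings that identify nothing."""
    stripped = name.strip().lower()
    return stripped == "" or stripped in NULL_LIKE_VALUES


def selector_name_clause(name: str) -> Optional[str]:
    """Return a name clause the recorder may encode, or None if the name would drift."""
    if is_null_like(name) or contains_relative_time(name):
        return None
    return f"name:{name}"


for candidate in ["Save", "3 hours ago", "Yesterday", "variant(empty)", ""]:
    print(repr(candidate), "->", selector_name_clause(candidate))
```

Substring matching is deliberately coarse. In this sketch a stable label such as "Set minimum" is also rejected because it contains " min", which costs the recorder one name clause but never produces a selector that silently stops matching.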

When to use which, written out

The honest rule is short. The accessibility tree is the right default for everything that exposes a tree. Coordinate input is the right answer for the residual targets where the tree is empty or single-node. Most production scripts mix the two; the question is what your default is and where you fall back, not which one to pick exclusively.

The decision, in order

  • Are you driving the same OS, same DPI, same theme, same monitor, same user, every time, forever? A deterministic script will work. This is roughly nobody outside a fixed-image CI runner.
  • Does your script ever run on a different machine, a different monitor, a different OS version, a different theme? Use the accessibility tree. The selectors are the only thing that survives.
  • Are you encoding a wait as `Sleep, 2000`? Replace it with `wait_for(Visible, 5000)` or the equivalent in your library. The first is deterministic in input. The second is deterministic in outcome.
  • Are you recording a user session for replay? Make sure the recorder filters drifting names (relative timestamps, counts, dates). Otherwise yesterday's recording matches a button that says something else today.
  • Is the target a fullscreen DirectX game, a canvas-rendered design tool (Figma's drawing surface), or a remote desktop viewer where the AX bridge does not cross the boundary? Deterministic coordinate input is the only option. The tree is empty.
  • Are you mixing the two (most production agents do)? Default to the tree. Fall back to synthetic input only on the call sites where the tree returns nothing useful.
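The last rule in the list above, default to the tree and fall back per call site, might look like this in Python. The two callbacks are hypothetical hooks that a real script would wire to its own automation library; nothing here is a real library's API.

```python
from typing import Callable, Optional

# Hypothetical hooks, to be wired to whatever libraries a script actually uses.
InvokeBySelector = Callable[[str], bool]   # True if the tree resolved and ran the action
ClickAt = Callable[[int, int], None]       # synthetic input at screen coordinates


def act(selector: str, fallback_xy: Optional[tuple[int, int]],
        invoke: InvokeBySelector, click_at: ClickAt) -> str:
    """Default to the accessibility tree; use coordinates only where the tree has nothing."""
    if invoke(selector):
        return "tree"
    if fallback_xy is None:
        raise LookupError(f"no element for {selector!r} and no coordinate fallback")
    click_at(*fallback_xy)
    return "coordinates"


# Fakes: a tree that knows one button, and a click recorder.
known_selectors = {"role:Button && name:Save"}
clicks: list[tuple[int, int]] = []


def fake_invoke(selector: str) -> bool:
    return selector in known_selectors


def fake_click(x: int, y: int) -> None:
    clicks.append((x, y))


path_button = act("role:Button && name:Save", None, fake_invoke, fake_click)
path_canvas = act("role:Canvas && name:Drawing", (640, 360), fake_invoke, fake_click)
print(path_button, path_canvas, clicks)
```

Making the fallback an explicit argument keeps coordinates confined to the call sites that opted in. A selector with no fallback fails loudly instead of degrading into a blind click.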

Why the framing matters for AI agents

A growing category of computer-use agents (ChatGPT Agents, Anthropic computer use, BrowserUse) follows the deterministic-script pattern with one substitution: an LLM picks the coordinates from a screenshot instead of the author hardcoding them. The shape is the same: the agent decides where to click, then synthesizes a click at that location through the OS's HID input layer. The outcome is bound to whatever is at those pixels at that moment, and to whatever the model picked given the image it saw.

An agent built on accessibility APIs uses the model differently. The model picks a selector (role:Button && name:Save), the framework resolves it through the OS tree, and the framework calls the element's pattern. The model never needs to ground from pixels because the tree is text. Per-action latency is bounded by IPC instead of inference, which is the source of the roughly 100x speed delta Terminator claims at llms.txt line 243. The shape difference matters more than the speed: a model that picks selectors and gets typed errors back is in a closed loop the framework can recover from. A model that picks pixels and gets ‘the click happened’ back is in an open loop.

Questions about accessibility APIs vs deterministic scripts

What do people actually mean by 'deterministic scripts'?

Coordinate-and-time-based scripts. AutoHotkey one-liners, AutoIt routines, PyAutoGUI sequences, recorded macros from older RPA tools (Selenium IDE-style recorders, Sikuli, Pulover's Macro Creator), and the hardcoded fixtures inside legacy QA test suites for desktop apps. The shared property is that the script encodes the OS state implicitly: a position on the screen, a number of milliseconds to wait, a screenshot to template-match against. Given the same machine in the same state, every replay produces the same byte-for-byte sequence of system calls. The word 'deterministic' here describes the inputs.

Why is that not actually deterministic in production?

Because the inputs encode the machine, not the task. A coordinate of (412, 300) is the Save button on a 1920x1080 display at 100% DPI with the window opened in its default position. The same script on a 4K display at 200% DPI, with the window opened off-center because the user dragged it last session, lands the click on the canvas. The script's behavior changed without the script itself changing. Production environments are full of these silent state changes: DPI, theme, locale, screen resolution, animation timings, notification toasts, background activity that holds the foreground window for a beat. Each one quietly flips a script that was 'deterministic' on the developer's laptop into a script that does the wrong thing on the user's laptop.

In what sense are accessibility APIs deterministic?

In outcomes. The script asks for an element by role, accessible name, and AutomationId. The OS resolves that to the structural identity the application registered. The Save button is the same UIAutomationElement on every monitor, in every theme, at every DPI. When the script asks for it, it gets it; when the element does not exist, the script gets a typed error (ElementNotFound, ElementNotVisible, ElementNotEnabled, Timeout) instead of a silently misdirected click. The inputs look fuzzier than coordinates, but the outcomes are stable across every environment change that breaks coord scripts.
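What "a typed error instead of a silently misdirected click" means for the caller can be shown with a small Python sketch. The exception names mirror the ones mentioned above, but the class hierarchy, the `Element` record, and the `invoke` function are hypothetical stand-ins, not any library's real API.

```python
from dataclasses import dataclass


class AutomationError(Exception):
    """Base class so callers can catch every typed failure at once."""


class ElementNotFound(AutomationError):
    pass


class ElementNotVisible(AutomationError):
    pass


class ElementNotEnabled(AutomationError):
    pass


@dataclass
class Element:
    role: str
    name: str
    visible: bool = True
    enabled: bool = True
    invoked: bool = False


def invoke(tree: list[Element], role: str, name: str) -> Element:
    """Run the element's default action, or raise a typed error saying why not."""
    for element in tree:
        if element.role == role and element.name == name:
            if not element.visible:
                raise ElementNotVisible(f"{role} {name!r} is hidden")
            if not element.enabled:
                raise ElementNotEnabled(f"{role} {name!r} is disabled")
            element.invoked = True
            return element
    raise ElementNotFound(f"no {role} named {name!r}")


tree = [Element("Button", "Save", enabled=False), Element("Button", "Cancel")]

for role, name in [("Button", "Cancel"), ("Button", "Save"), ("Button", "Export")]:
    try:
        invoke(tree, role, name)
        print(f"{name}: invoked")
    except AutomationError as err:
        print(f"{name}: {type(err).__name__}: {err}")
```

Each failure names its cause, so the caller can choose a response: wait for a disabled button, re-resolve a missing one, or stop. A coordinate click offers no equivalent signal, because the click always "succeeds".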

What does 'wait for the screen to look like X' compile to in each approach?

In a deterministic script it compiles to Sleep, N where N is the number of milliseconds the script author guessed on their machine. The script does not check what the screen looks like; it sleeps and continues. In an accessibility-API script it compiles to a poll loop with a typed predicate. Terminator's wait_for at crates/terminator/src/locator.rs lines 170 to 233 is the canonical shape: a tokio loop with a 100ms poll interval, returning Ok(element) the instant the requested condition holds (one of Exists, Visible, Enabled, Focused) and returning Err(Timeout) once the budget is spent. The second approach knows when the screen looks like X. The first knows only that some milliseconds have passed.

What happens when the script records a user session and replays it later?

Coordinate-based recorders capture (x, y) and replay it as a SendInput. Selector-based recorders walk the AX tree from the clicked element up to a stable parent and capture a selector chain. Both look 'deterministic' until you replay them a week later. The interesting part is that even semantic selectors can drift if the recorder captures names that change with calendar time. Terminator's workflow recorder explicitly filters those: events.rs lines 67 to 81 define contains_relative_time, which rejects any name matching ' ago', 'just now', 'yesterday', 'today', 'last week', 'last month', or the short-form ' min'/' hr' suffixes. A button labelled '3 hours ago' would otherwise produce a selector that matches nothing tomorrow. This is what defensive coding for outcome-determinism actually looks like; coord-based recorders have nothing to filter because they do not encode meaning in the first place.

Are there cases where a deterministic script is still the right answer?

Three real cases. Fullscreen DirectX or OpenGL games where the only thing on the screen is a frame buffer and the AX tree is one opaque element. Canvas-rendered design tools (Figma's drawing surface, Excalidraw, Miro, Photoshop document area) where every tool lives inside a single canvas element with no children. Sandboxed remote desktop or VM viewers where the AX bridge does not cross the host boundary. In these targets the tree is empty or single-node, and coordinates are the only addressable thing. Everywhere else (Win32, WinUI, WPF, AppKit, SwiftUI, every Electron app, every web page through the browser's AX bridge), the tree is the better default.

How does this connect to AI agents and computer use?

An agent that screenshots, asks an LLM to output (x, y), and PyAutoGUIs the coordinates is a deterministic script with an LLM as the input-generator. Same brittleness, plus the latency of model inference per action. An agent that resolves a selector through the AX tree and calls a pattern (Invoke, Toggle, ExpandCollapse, Value, SelectionItem) is using the OS's own structural identity. The model's job is to pick a selector; the framework's job is to resolve it. The first design is bounded by inference latency (hundreds of milliseconds per action), the second by IPC latency (single-digit milliseconds per action). Over an agent loop with hundreds of steps, the difference compounds to roughly the published 100x speed delta Terminator claims in its llms.txt at line 243.

What does outcome-determinism actually look like in code?

Three small pieces. First, every action takes a selector with structural identity (role, accessible name, AutomationId, process scope), not a coordinate. Second, every wait is a predicate over the tree, not a clock-time sleep; the wait returns the moment the predicate holds. Third, every failure is a typed error, not a click that landed on the wrong element. The Terminator Rust core does each of these in one place: Locator::wait, Locator::wait_for, Locator::validate, all in crates/terminator/src/locator.rs. The Python and Node bindings (terminator-py, @mediar-ai/terminator) and the MCP agent (terminator-mcp-agent) all route through that same locator. If you are evaluating whether a library is deterministic-in-outcomes, those three properties are the checklist.
