Matthew Diakonov, Written with AI

Published April 23, 2026 · 15 min read

Automation for desktop applications, judged on its selector language, not its action list

Nearly every guide to automation for desktop applications grades products by the actions they can perform: click, type, scroll, screenshot. The actions are trivial. Desktop automation frameworks differ in quality because of the selector language you write to decide which element to act on. This page is about Terminator's selector language, the Shunting Yard parser that compiles it, and the spatial relators that make elements targetable on real desktop UIs.

MIT-licensed, written in Rust, cross-platform via UIA, AX, and AT-SPI2

24 selector AST variants, compiled by Shunting Yard

Boolean operators && || ! with explicit precedence

Spatial relators rightof, leftof, above, below, near

near threshold hardcoded at 50.0 pixels

One selector language, every desktop app

Boolean + spatial + accessibility tree

Every desktop app has an accessibility tree.

Most frameworks query it with one attribute at a time.

Terminator parses Boolean and spatial expressions.

role:Button && !name:Close || rightof:name:Username

Compiled by Shunting Yard, evaluated on the tree.


The part of desktop automation that everyone underestimates

When a developer first tries to automate something outside the browser, they reach for coordinate clicks or image matching. Both are obvious, both feel like they work, and both are wrong. Coordinates lie the moment someone moves a window, changes a theme, or switches to a high-DPI display. Image templates lie the moment a button style tweaks. The only durable way to target elements in a desktop application is to query the operating system's accessibility tree and describe elements by what they structurally are, not where they happen to sit in pixels today.

Which means the interesting design question for a desktop automation framework is not "does it have a click action" (everyone does) but "what language do I use to describe the element I want to click". That is the selector language. Terminator's is a typed algebra with Boolean operators and spatial relators, compiled to an AST by a Shunting Yard parser and evaluated against the OS accessibility tree on Windows, macOS, and Linux.

24 Selector AST variants

5 Spatial relators

3 Boolean operators

50 Pixel threshold for near:

Coordinate scripts vs a selector language, in one side-by-side

This is the example I wish existed when I was first looking for automation for a desktop application. On the left is the version every tutorial shows you. On the right is the same flow written against Terminator's selector language. Count the things that would have to change on the left if the Slack window moved, resized, changed theme, or got translated.

Same login flow, two philosophies

# the typical "automation for desktop application" starter code
import pyautogui
import time

# log into an app
pyautogui.click(420, 312)          # username field at screen coords
pyautogui.typewrite("alice", 0.05)
pyautogui.click(420, 362)          # password field, 50px below
pyautogui.typewrite("hunter2", 0.05)
pyautogui.click(520, 412)          # submit button

# the moment someone moves the window, resizes it,
# uses high-dpi, or changes theme, all of these coords lie.
time.sleep(2)
pyautogui.locateOnScreen("inbox.png")   # image match fallback

-33% lines, zero coordinates

The AST has 24 variants and every operator maps to one

The selector type is defined in crates/terminator/src/selector.rs. It is a plain Rust enum, not a trait object. Every selector you type at the API layer ends up as one of these variants. Role is a struct variant because role and name are so often combined, so role:Button && name:Save compiles to a single Role { role: "Button", name: Some("Save") }, not an And of two selectors. That keeps the hot path fast.

crates/terminator/src/selector.rs
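The Rust enum itself is not reproduced here. As a rough illustration of the Role folding described above, the following Python sketch models three of the variants as dataclasses. The combine_and helper is hypothetical and only shows the idea that a role and a name collapse into one node; it is not taken from the Terminator source.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Role:
    """Stand-in for the Role { role, name } struct variant."""
    role: str
    name: Optional[str] = None


@dataclass(frozen=True)
class Name:
    """Stand-in for the Name(String) variant."""
    name: str


@dataclass(frozen=True)
class And:
    """Stand-in for the And(Vec<Selector>) variant."""
    operands: tuple


Selector = Union[Role, Name, And]


def combine_and(left: Selector, right: Selector) -> Selector:
    """Hypothetical helper: fold `role:X && name:Y` into a single Role node."""
    if isinstance(left, Role) and left.name is None and isinstance(right, Name):
        return Role(left.role, right.name)
    # Any other pairing stays a generic And of two selectors.
    return And((left, right))


print(combine_and(Role("Button"), Name("Save")))
# Role(role='Button', name='Save')
```

The point of the single node is that the evaluator can test role and name in one comparison per element instead of intersecting two result sets.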

Operator precedence, stated plainly

Every algebra needs precedence. Terminator's is the boring one every programming language uses: Not binds tightest, And is next, Or is loosest. The function below lives in the same file and is the whole rule. If you want to override it, you use parentheses, which the tokenizer already produces dedicated tokens for.

crates/terminator/src/selector.rs

So role:Button && !name:Disabled || role:MenuItem is parsed as (role:Button AND (NOT name:Disabled)) OR role:MenuItem. If that is not what you meant, add parens: role:Button && (!name:Disabled || role:MenuItem). It is the same mental model as any expression parser you have worked with, just wearing a selector skin.
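The precedence rule is small enough to restate as a lookup table. The Python sketch below uses the values quoted in this article (Or=1, And=2, Not=3); the table and function names are illustrative and do not mirror the Rust function.

```python
# Binding strength per operator; a higher number binds tighter.
PRECEDENCE = {"||": 1, "&&": 2, "!": 3}


def precedence(op: str) -> int:
    """Return the binding strength of a Boolean selector operator."""
    return PRECEDENCE[op]


def binds_tighter(a: str, b: str) -> bool:
    """True when operator `a` groups its operands before operator `b` does."""
    return precedence(a) > precedence(b)


tightest_first = sorted(PRECEDENCE, key=precedence, reverse=True)
print(tightest_first)  # ['!', '&&', '||']
```

Because ! outranks && and && outranks ||, the negation attaches to name:Disabled first, the And groups next, and the Or joins whatever is left.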

Shunting Yard, verbatim

The parser itself is Dijkstra's Shunting Yard algorithm, written in Rust. It keeps an output queue for operands and an operator stack for operators, applies higher-or-equal precedence operators before pushing a new one, and drains the stack at the end. Nothing exotic. The value is that every valid expression becomes one AST node with predictable shape, and every invalid expression becomes a Selector::Invalid with a specific reason.

crates/terminator/src/selector.rs
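The Rust listing is not reproduced here. A minimal Python sketch of the same technique follows, under a few assumptions: the input is already tokenized into strings, AST nodes are plain tuples, and errors are raised as ValueError rather than returned as a Selector::Invalid value. It also flattens same-kind And and Or operands, as the article describes.

```python
PRECEDENCE = {"||": 1, "&&": 2, "!": 3}
NAMES = {"||": "or", "&&": "and", "!": "not"}


def apply_operator(op, output):
    """Pop the operands of `op` from the output stack and push the combined node."""
    if op == "!":
        if not output:
            raise ValueError("'!' is missing its operand")
        output.append(("not", output.pop()))
        return
    if len(output) < 2:
        raise ValueError(f"'{op}' is missing an operand")
    right, left = output.pop(), output.pop()
    kind = NAMES[op]
    operands = []
    for node in (left, right):
        # Flatten same-kind children so a && b && c becomes one n-ary node.
        if isinstance(node, tuple) and node[0] == kind:
            operands.extend(node[1])
        else:
            operands.append(node)
    output.append((kind, operands))


def parse(tokens):
    """Shunting Yard over pre-tokenized input, building a tuple AST directly."""
    output, operators = [], []
    for token in tokens:
        if token == "(":
            operators.append(token)
        elif token == ")":
            while operators and operators[-1] != "(":
                apply_operator(operators.pop(), output)
            if not operators:
                raise ValueError("unbalanced ')'")
            operators.pop()  # discard the matching "("
        elif token == "!":
            operators.append(token)  # prefix operator: nothing to apply yet
        elif token in PRECEDENCE:
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]):
                apply_operator(operators.pop(), output)
            operators.append(token)
        else:
            output.append(token)  # atomic selector such as "role:Button"
    while operators:
        op = operators.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        apply_operator(op, output)
    if len(output) != 1:
        raise ValueError("expression did not reduce to a single selector")
    return output[0]


ast = parse(["role:Button", "&&", "!", "name:Disabled", "||", "role:MenuItem"])
print(ast)
# ('or', [('and', ['role:Button', ('not', 'name:Disabled')]), 'role:MenuItem'])
```

The printed tree has the shape the precedence rule predicts: the Or sits at the root, the And is its first operand, and the Not wraps only name:Disabled.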

50.0

“The Euclidean distance threshold that decides whether two elements are 'near' each other.”

crates/terminator/src/platforms/windows/engine.rs line 1815, hardcoded constant NEAR_THRESHOLD

Spatial relators are the part that actually makes layouts targetable

Boolean operators let you compose. Spatial relators let you describe layout. rightof, leftof, above, and below are not just "candidate has bigger X than anchor". Each one requires perpendicular-axis overlap so the relationship is actually visual: a button directly beside a label counts as rightof, a button three rows down does not. near takes a different tack: it measures the Euclidean distance between the bounding-box centers and compares it against a hardcoded 50.0-pixel threshold, so the relation stays tight rather than wobbly. The whole function lives in one match arm in the Windows engine.

crates/terminator/src/platforms/windows/engine.rs

Because the relator requires the anchor selector to resolve first, the engine evaluates the expression in the right order: inner selector, then filter candidates by bounds. rightof:(role:Text && name:Username) resolves the Text element with name Username, grabs its bounds, and then evaluates the outer selector against every visible element, keeping only those whose rectangles sit to the right and share rows. This is why you never write offsets in Terminator. The bounds math is the offsets.
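To make the bounds math concrete, here is a small Python sketch of the four directional relators. Only the rightof test (candidate_left >= anchor_right, plus vertical overlap) is stated in this article; the other three are assumed to mirror it, and the Rect type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounding box in screen pixels (hypothetical stand-in for element bounds)."""
    left: float
    top: float
    right: float
    bottom: float


def overlaps_vertically(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share at least some rows."""
    return a.top < b.bottom and a.bottom > b.top


def overlaps_horizontally(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share at least some columns."""
    return a.left < b.right and a.right > b.left


def right_of(candidate: Rect, anchor: Rect) -> bool:
    # Half-plane test plus perpendicular-axis overlap, as described for rightof.
    return candidate.left >= anchor.right and overlaps_vertically(candidate, anchor)


def left_of(candidate: Rect, anchor: Rect) -> bool:
    # Assumed mirror image of right_of.
    return candidate.right <= anchor.left and overlaps_vertically(candidate, anchor)


def above(candidate: Rect, anchor: Rect) -> bool:
    # Assumed: same idea rotated 90 degrees.
    return candidate.bottom <= anchor.top and overlaps_horizontally(candidate, anchor)


def below(candidate: Rect, anchor: Rect) -> bool:
    # Assumed mirror image of above.
    return candidate.top >= anchor.bottom and overlaps_horizontally(candidate, anchor)


label = Rect(100, 200, 180, 220)       # "Username" label
same_row = Rect(200, 198, 400, 222)    # edit box beside the label
lower_row = Rect(200, 320, 400, 344)   # edit box several rows further down

print(right_of(same_row, label))   # True
print(right_of(lower_row, label))  # False
```

The second call fails only on the overlap check: the lower edit box has a larger X than the label, but it shares no rows with it, so it does not count as being to the right.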

Where a selector string goes between typing it and clicking something

From string to platform adapter

Six steps from a selector string to a match

Tokenize the input

tokenize() at selector.rs:94 walks the string one character at a time. && becomes Token::And, either || or a comma (,) becomes Token::Or, ! becomes Token::Not, and parentheses become LParen/RParen. Anything else accumulates into a Token::Selector. Whitespace inside text: values is preserved; whitespace outside tokens is dropped.

Check for unbalanced parens

has_unbalanced_parens() at selector.rs:76 scans the string for opening and closing parentheses, incrementing and decrementing a depth counter. A closing parenthesis with no matching opener, or a nonzero depth at the end of the string, returns a parse error before the Shunting Yard stage even begins.

Run Shunting Yard

parse_boolean_expression() at selector.rs:215 pushes atomic selectors onto an output queue and operators onto an operator stack. When an operator comes in, higher-or-equal precedence operators on the stack get applied first. Parentheses group explicitly. The output is a single Selector AST root.

Flatten nested And and Or

apply_operator() at selector.rs:275 checks whether an And or Or operand is itself already an And or Or of the same kind. If so, it concatenates the inner vecs rather than nesting. role:A && role:B && role:C compiles to a single And with three operands, not a binary tree. The engine loops instead of recursing.

Dispatch to the platform engine

Each Selector variant has a match arm in platforms/windows/engine.rs (IUIAutomation), platforms/macos/engine.rs (AXUIElement), and platforms/linux/engine.rs (AT-SPI2). Role queries turn into TreeScope walks. Spatial relators turn into bounding-box math. Chain iterates the list, passing the previous result as the new root.

Return elements or Selector::Invalid

A successful parse returns a Selector tree the engine can walk. A failed parse returns Selector::Invalid with a reason string so the caller can log it, show it to the user, or surface it into the MCP response. The framework never silently matches nothing for a malformed input.
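The first two steps above can be sketched in a few lines of Python. This is a simplified illustration, not the Rust implementation: it borrows the function names from the article, represents tokens as plain strings, trims whitespace around each atomic selector (an assumption), and leaves out the special handling of text: values.

```python
def has_unbalanced_parens(text: str) -> bool:
    """Return True if a parenthesis closes before opening or never closes."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing without a matching opener
                return True
    return depth != 0


def tokenize(text: str) -> list[str]:
    """Split a selector string into operator tokens and atomic selectors."""
    tokens: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Emit the accumulated atomic selector, if any, without outer whitespace.
        atom = "".join(current).strip()
        if atom:
            tokens.append(atom)
        current.clear()

    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        ch = text[i]
        if pair in ("&&", "||"):
            flush()
            tokens.append(pair)
            i += 2
        elif ch == ",":
            flush()
            tokens.append("||")  # a comma is treated as Or
            i += 1
        elif ch in "!()":
            flush()
            tokens.append(ch)
            i += 1
        else:
            current.append(ch)
            i += 1
    flush()
    return tokens


print(tokenize("role:Button && !name:Close || role:MenuItem"))
# ['role:Button', '&&', '!', 'name:Close', '||', 'role:MenuItem']
print(has_unbalanced_parens("(role:Button && name:Save"))  # True
```

Running the balance check before tokenizing keeps the later stages simple: by the time the operator stack sees a parenthesis, it is already known to have a partner.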

What this looks like when an agent actually runs it

Install the MCP agent, point Claude Code or Cursor at it, and watch a selector get tokenized, parsed, and resolved against a running Slack window. The log below is the kind of thing you see in the agent stream. Parsed AST is printed verbatim so you can tell at a glance what the parser thought you meant, before any action fires.

terminator-mcp-agent stdio stream

24-variant AST

Every selector compiles to one of 24 Selector enum variants defined in crates/terminator/src/selector.rs. Role is a struct, not just a string, so role:Button && name:Save is a single Role variant with Some(name) rather than a boolean AND of two selectors. This keeps the common case fast.

Shunting Yard parser

parse_boolean_expression at selector.rs:215 is a classic Shunting Yard implementation. Operator precedence is explicit (Or=1, And=2, Not=3), nested Ands and Ors are flattened into single Vec nodes so the engine evaluates them in one pass, and malformed input returns Selector::Invalid rather than panicking.

Spatial operators

rightof, leftof, above, below, near. The first four use half-plane bounds tests with a perpendicular-axis overlap check so 'to the right' means literally aligned rows, not merely higher X. near uses a 50.0-pixel Euclidean distance between element centers, defined as const NEAR_THRESHOLD: f64 = 50.0.

Descendant combinator >>

Chains compile to Selector::Chain(Vec<Selector>). Each step runs against the previous result. process:slack.exe >> role:Window && name:Slack >> role:Button && name:Send is three links the engine walks in order, each with its own TreeScope tuning to keep the tree walk cheap.

Parent and nth navigation

The .. token produces Selector::Parent, which moves up one level in the UIA tree. nth:N selects the Nth match from a result set (0-indexed). nth-N selects from the end, so nth-1 is the last and nth-2 is the second-to-last. has: takes another selector and returns only elements whose descendants match it.

Text with special chars

The tokenizer special-cases text: values so parentheses, colons, and single pipes inside the text value are preserved rather than treated as operators. text:'RPA Hospital (MGP)? : r/foo' survives the parse intact. Only unescaped && and || get tokenized as Boolean operators.
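The nth: and nth- indexing rules described under "Parent and nth navigation" map naturally onto a signed index. The Python sketch below assumes that mapping (nth-1 becomes -1, in the spirit of the Nth(i32) variant); both helper names are hypothetical.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def parse_nth(token: str) -> int:
    """Turn 'nth:2' into 2 and 'nth-1' into -1 (a signed index)."""
    if token.startswith("nth:"):
        return int(token[len("nth:"):])
    if token.startswith("nth-"):
        return -int(token[len("nth-"):])
    raise ValueError(f"not an nth selector: {token!r}")


def select_nth(matches: Sequence[T], index: int) -> Optional[T]:
    """Pick one match; negative indexes count back from the end."""
    if -len(matches) <= index < len(matches):
        return matches[index]
    return None  # out of range means no match


menu = ["File", "Edit", "View", "Help"]
print(select_nth(menu, parse_nth("nth:0")))  # File
print(select_nth(menu, parse_nth("nth-1")))  # Help
print(select_nth(menu, parse_nth("nth-2")))  # View
```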

The full variant inventory, because a selector language with hidden features is not a selector language

Everything below is a public Selector variant. If a surface API supports a selector, it supports every one of these. There are no private extensions, no hidden flags, no proprietary operators that the Enterprise tier unlocks.

Selector enum variants, from selector.rs

Role { role, name }
Id(String)
Name(String)
Text(String)
Path(String)
NativeId(String)
Attributes(BTreeMap)
Filter(usize)
Chain(Vec<Selector>)
ClassName(String)
Visible(bool)
LocalizedRole(String)
Process(String)
RightOf(Box<Selector>)
LeftOf(Box<Selector>)
Above(Box<Selector>)
Below(Box<Selector>)
Near(Box<Selector>)
Nth(i32)
Has(Box<Selector>)
Parent
And(Vec<Selector>)
Or(Vec<Selector>)
Not(Box<Selector>)

What survives, what does not

A selector language is only as useful as the scenarios in which it keeps working. The whole point of building on the accessibility tree instead of pixels is that the tree is stable under the changes that break coordinate scripts. These are the scenarios where the selector survives.

Window moves

Selector keeps working. Bounds math is recomputed at query time, no frozen coordinates.

Theme changes

Role, name, and AutomationId do not change with theme. Image-match selectors break here; accessibility-tree selectors do not.

High-DPI scaling

Bounds are delivered by UIA in the display's native coordinate space; spatial relators scale with the window automatically.

Localization

Use LocalizedRole for display strings or AutomationId for stable IDs; role: and id: are locale-invariant.

Two of the same widget

Compose with Boolean operators and spatial relators: rightof:(role:Text && name:Password) && role:Edit.

Dialog nested inside dialog

Chain with >> so selectors are scoped. has: asserts that a parent must contain a specific descendant before matching.

How this compares to the other things called "automation for desktop application"

Feature	Typical desktop automation tooling	Terminator
Selector language	Coordinates or image templates	Boolean algebra + spatial relators, parsed by Shunting Yard
Logical operators	None (one attribute per call)	&&, ||, !, parentheses, operator precedence Or=1 And=2 Not=3
Spatial relators	Manual offset math	rightof, leftof, above, below, near (50px Euclidean threshold)
Chaining	Nested function calls	>> descendant combinator, .. parent, nth:, has:
Parse errors	Runtime exception or silent no-match	Selector::Invalid variant with reason string, caught at parse time
Cross-platform	Windows only or web only	Windows UIA, macOS AX, Linux AT-SPI2, identical syntax
Works with AI coding assistants	Bolt-on after the fact	MCP server native, works with Claude Code, Cursor, VS Code, Windsurf
License	Per-seat commercial	MIT, source on GitHub

One last note on where to read the code

All of this is in the public repository under crates/terminator/src/selector.rs for the parser and crates/terminator/src/platforms/ for the per-OS evaluator. If you are the kind of reader who bounced off the marketing pages and wanted the actual file that decides "right of means candidate_left >= anchor_right && vertical_overlap", you have it: crates/terminator/src/platforms/windows/engine.rs.

Want the selector language pointed at your own desktop app?

Twenty minutes with the team, we write a live selector against whatever you have open, and you see the AST and the tree walk in real time.

Frequently asked questions

What does automation for a desktop application actually mean in 2026?

It means having a program, rather than a person, drive an application that runs outside the browser the way a human would. That includes Outlook, SAP, Excel, Photoshop, internal WPF and WinForms tools, Electron apps like Slack and Notion, and native Mac apps. The way a serious framework does this is by querying the operating system's accessibility layer (UI Automation on Windows, AXUIElement on macOS, AT-SPI2 on Linux) to locate elements by role, name, and other attributes, then synthesizing input events or invoking UIA control patterns to act on them. Image matching and coordinate pushing are last-resort techniques, not the primary interface.

Why does Terminator ship a whole selector language instead of a Python API?

Because desktop UIs are messy and the selector is where the mess lives. A Python function call like click_button(name='Save') collapses the moment you have two Save buttons, or a Save button that is named differently in a localized build, or a Save button that only exists inside one dialog. A selector language lets you express 'a Button whose name is Save AND which is inside the Export dialog', or 'the Checkbox to the right of the Username label', or 'the nth-from-last MenuItem that is not disabled', all in one string. Terminator's parser lives in crates/terminator/src/selector.rs and compiles these expressions into a Selector enum with 24 variants (Role, Id, Name, Text, Path, NativeId, Attributes, Filter, Chain, ClassName, Visible, LocalizedRole, Process, RightOf, LeftOf, Above, Below, Near, Nth, Has, Parent, And, Or, Not), plus Invalid for parse errors.

How does the Boolean part work?

The tokenizer in selector.rs walks the string character by character, recognizing && as And, either || or a comma (,) as Or, ! as Not, and parentheses as grouping. Whitespace outside a token is skipped; whitespace inside a text: value is preserved, which is how text:'RPA Hospital (MGP)? : r/foo' survives parsing. Operator precedence is defined at lines 206 through 212: Or is precedence 1, And is precedence 2, Not is precedence 3. The parser is a textbook Shunting Yard implementation at lines 215 through 272. It builds an AST directly, flattens nested Ands into a single Vec<Selector> and nested Ors the same way, and returns a Selector::Invalid if the expression is malformed rather than panicking. You write role:Button && !name:Close || role:MenuItem and you get back a tree that the engine can walk.

How exactly does the near: relator decide something is 'near'?

It takes the center point of the anchor element's bounding box, the center point of the candidate element's bounding box, computes the Euclidean distance between them, and returns true when that distance is strictly less than NEAR_THRESHOLD, which is defined as 50.0 pixels at crates/terminator/src/platforms/windows/engine.rs line 1815. It is a hardcoded constant, not configurable, intentionally tight. rightof:/leftof: and above:/below: work differently: they use half-plane tests (candidate_left >= anchor_right for rightof) combined with an overlap check on the perpendicular axis (candidate_top < anchor_bottom && candidate_bottom > anchor_top for rightof, so the candidate must share vertical rows with the anchor). The relevant math sits at engine.rs lines 1783 through 1826.
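A worked example makes the strict inequality visible. The Python sketch below follows the description above (center-to-center Euclidean distance, strictly less than 50.0); the tuple layout (left, top, right, bottom) and the function names are assumptions made for illustration.

```python
import math

NEAR_THRESHOLD = 50.0  # pixels, the constant quoted in this article


def center(rect):
    """Center point of a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = rect
    return ((left + right) / 2, (top + bottom) / 2)


def is_near(candidate, anchor) -> bool:
    """True when the two centers are strictly closer than NEAR_THRESHOLD."""
    (cx, cy), (ax, ay) = center(candidate), center(anchor)
    dx, dy = cx - ax, cy - ay
    return math.sqrt(dx * dx + dy * dy) < NEAR_THRESHOLD


anchor = (90, 90, 110, 110)        # center (100, 100)
close = (120, 120, 140, 140)       # center (130, 130), about 42.4 px away
boundary = (120, 130, 140, 150)    # center (130, 140), exactly 50.0 px away

print(is_near(close, anchor))     # True
print(is_near(boundary, anchor))  # False
```

The boundary case is a 30-40-50 triangle, so the distance is exactly 50.0 and the strict comparison rejects it.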

What does a real selector look like when I use all of this together?

A login dialog click might read process:slack.exe >> role:Window && name:'Slack' >> rightof:(role:Text && name:'Username') && role:Edit. That says: inside the Slack process, under the Slack window, find an Edit element whose bounding box sits to the right of the 'Username' label and shares vertical rows with it. The descendant combinator >> chains selectors to walk down the accessibility tree. The .. token moves up to a parent. nth:0 picks the first match, nth-1 picks the last, nth-2 the second-to-last. has: asserts that an element contains a matching descendant. Each of these compiles to a specific Selector variant and the engine evaluates them against the UIA tree with timeout and depth parameters you can set per call.
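The scoping effect of >> and has: is easier to see on a toy tree. The Python sketch below is not the Terminator API: Node, run_chain, has, and role are hypothetical names, and the tree is built by hand rather than read from an accessibility API. It only demonstrates that each link of a chain searches beneath the results of the previous link.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    """Toy accessibility-tree node (hypothetical, for illustration only)."""
    role: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)


Predicate = Callable[[Node], bool]


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node below `node`, depth-first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def run_chain(root: Node, steps: List[Predicate]) -> List[Node]:
    """Evaluate each step against the descendants of the previous step's matches."""
    current = [root]
    for matches in steps:
        found, seen = [], set()
        for scope in current:
            for node in descendants(scope):
                if id(node) not in seen and matches(node):
                    seen.add(id(node))
                    found.append(node)
        current = found
    return current


def has(inner: Predicate) -> Predicate:
    """Match elements that contain at least one descendant matching `inner`."""
    return lambda node: any(inner(d) for d in descendants(node))


def role(wanted: str, name: str = "") -> Predicate:
    """Match by role, and by name when one is given."""
    return lambda n: n.role == wanted and (not name or n.name == name)


dialog_send = Node("Button", "Send")
toolbar_send = Node("Button", "Send")
desktop = Node("Desktop", children=[
    Node("Window", "Slack", [
        Node("Dialog", "Sign in", [dialog_send]),
        Node("Pane", "Toolbar", [toolbar_send]),
    ]),
])

scoped = run_chain(desktop, [role("Window", "Slack"), role("Dialog"), role("Button", "Send")])
unscoped = run_chain(desktop, [role("Button", "Send")])
print(len(scoped), len(unscoped))  # 1 2
```

Without the chain, both Send buttons match. With it, only the button under the dialog survives, which is the same disambiguation the real selector achieves by scoping to the Slack window first.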

Does the selector language work the same on Windows, macOS, and Linux?

The selector syntax is identical. The engine adapters differ. On Windows the selector runs against IUIAutomation and walks the UIA tree via TreeScope. On macOS it runs against AXUIElement. On Linux it runs against AT-SPI2. A few attribute names translate: id: resolves to AutomationId on Windows and AXIdentifier on macOS, classname: resolves to Win32 ClassName on Windows and AXRoleDescription on macOS. The Boolean and spatial operators are platform-independent because the bounds math only needs an element rectangle, which every accessibility API exposes. In practice you write one selector string and it runs everywhere the framework supports.

How is this different from Playwright's locator system?

Playwright is DOM-only. Its selectors evaluate against HTML elements, ARIA attributes, and CSS pseudo-classes, all of which exist only inside a browser. Terminator's selectors evaluate against an accessibility tree that spans every application on your desktop, so the same vocabulary (role, name, id, classname, text, visible, rightof, leftof, above, below, near, nth, has, parent, and, or, not) works inside Excel, SAP, Photoshop, VS Code, Slack, Chrome, and your own WinForms tool. The shape of the language is deliberately Playwright-adjacent so a web-automation engineer can pick it up in a day, but the backend is completely different. It is like Playwright for the whole OS.

How do I debug a selector that is not matching?

Three ways. First, the Selector::Invalid variant preserves the reason string so a bad expression like role:Button && && name:Save comes back as Invalid('Invalid expression: multiple selectors without operators') rather than silently returning zero matches. Second, the tree_formatter module prints the accessibility tree of a running window so you can see exactly what role, name, and AutomationId the target element actually exposes. Third, the MCP agent has a tool that takes a selector plus a screenshot and shows which nodes matched, which failed, and why. The combination catches the three common failure modes: the name you thought you saw is really an AutomationId, the role you guessed is Pane rather than Group, or the element is under a different process than you assumed.

Can I use this selector language with my AI coding assistant?

Yes. Terminator ships an MCP server, terminator-mcp-agent, that exposes the selector language as tool calls for Claude Code, Cursor, VS Code, and Windsurf. Your assistant reads the accessibility tree of your running application, composes a selector in the exact vocabulary described above, and invokes click, type, or press_key actions against it. Because the selector language is expressive enough to target by role, Boolean composition, and spatial layout, the assistant does not have to fall back to pixel coordinates or screenshot recognition, which is where most LLM-driven desktop agents get slow and unreliable. Install with npx -y terminator-mcp-agent@latest and point your assistant's MCP config at it.
