Accessibility selectors vs screenshot automation: one is a query, the other is a guess

By Matthew Diakonov · Written with AI · Updated May 15, 2026 · 9 min read

Both approaches end with the same click. They get there in completely different ways. A screenshot tool crops a picture of a button and hunts for that picture on screen. An accessibility selector asks the operating system a structured question and gets back the element. That gap is why one of them goes flaky the week someone ships a redesign and the other does not.

Short answer

Use accessibility selectors whenever the app exposes an accessibility tree, which is almost every native Windows and macOS application. A selector is a structural query that names what an element is (its role, its accessible name, its automation id). A screenshot only records what a pixel region looks like. Reach for screenshots and OCR only for the surfaces the tree genuinely cannot describe: custom-painted canvases, GPU-composited game UIs, and remote-desktop pixel streams. Terminator is built selector-first for exactly that reason, with vision kept as a fallback.

The same task, both ways

Here is "click the Save button" written against a screenshot library and against a selector. Look less at the line count and more at the comment block at the bottom of each side. That is where the real cost lives.

Click the Save button

import time

import pyautogui  # confidence= also requires opencv-python to be installed


def locate_save():
    # Recent PyAutoGUI releases raise ImageNotFoundException on a miss;
    # older releases return None. Normalize both to None.
    try:
        return pyautogui.locateOnScreen("save_button.png", confidence=0.9)
    except pyautogui.ImageNotFoundException:
        return None


def click_save():
    # the button is identified by a cropped screenshot
    loc = locate_save()
    if loc is None:
        # retry once in case the toolbar is still animating
        time.sleep(0.5)
        loc = locate_save()
    if loc is None:
        raise RuntimeError("Save button not found on screen")
    pyautogui.click(pyautogui.center(loc))

# save_button.png has to be re-cropped on every theme,
# DPI scale, locale, or toolbar change. There is no query.

The selector version is 33% shorter, and it has no template image to maintain.

The screenshot version is not wrong. It works, today, on this machine, in this theme, at this resolution. The problem is that every one of those qualifiers is load-bearing. The cropped save_button.png is a snapshot of a moment, and the moment keeps ending. A selector is not a snapshot. It is a description that stays true as long as the button still exists and is still called Save.

A selector is a query. A coordinate is not.

This is the distinction that most write-ups on the topic skip. They land on "accessibility selectors are more stable than pixels" and stop, as if a selector were just a sturdier string. It is not a string. It is a query, and a screenshot approach has no query language at all.

A screenshot tool gives you two primitives: locateOnScreen(image) and click(x, y). Every other concept you need has to be hand-built on top of those. "The third row" becomes pixel arithmetic. "The field next to the Email label" becomes a cropped region and a hope that the layout never shifts. "A button that is visible and not disabled" cannot be expressed at all, because a picture does not carry an enabled flag.
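To make "pixel arithmetic" concrete, here is a minimal sketch of what "the third row" turns into once all you have is a located header image. The `Box` type, the helper names, and every pixel value are hypothetical, chosen only for illustration; no real screen is read.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """A located template match: left, top, width, height in pixels."""
    left: int
    top: int
    width: int
    height: int


def third_row_center(header: Box, row_height: int) -> Tuple[int, int]:
    """Guess the center of row 3 by offsetting from a located table header."""
    x = header.left + header.width // 2
    y = header.top + header.height + 2 * row_height + row_height // 2
    return x, y


def row_at(header: Box, row_height: int, y: int) -> int:
    """1-based index of the row that actually contains pixel row y."""
    return (y - (header.top + header.height)) // row_height + 1


# Measured once at 100% scale: rows are 32 px tall.
header_100 = Box(100, 200, 400, 24)
click_100 = third_row_center(header_100, row_height=32)
print(click_100, "lands in row", row_at(header_100, 32, click_100[1]))

# At 150% scale the rows are really 48 px tall, but the script still has 32 baked in.
header_150 = Box(150, 300, 600, 36)
click_150 = third_row_center(header_150, row_height=32)
print(click_150, "lands in row", row_at(header_150, 48, click_150[1]))
```

With these numbers the first click lands in row 3 and the second lands in row 2. The arithmetic did not fail loudly; it just stopped meaning "the third row".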

A selector carries all of that as first-class syntax. You can compose by role and name, filter by visibility, scope to a process, walk to a parent, require a descendant, pick the nth match, and combine clauses with boolean operators. The accessibility tree already knows what every node is; the selector engine just lets you ask.
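A toy model shows why "visible and not disabled" is trivial to ask of a tree. This is a sketch, not Terminator's API: the `Node` class, `walk`, and `find` are hypothetical stand-ins for what an accessibility tree exposes, and the window contents are invented.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """A toy stand-in for one accessibility node (not a real OS handle)."""
    role: str
    name: str = ""
    visible: bool = True
    enabled: bool = True
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and every descendant, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node, role: str, name: str) -> List[Node]:
    """Nodes that match role AND name, and are both visible and enabled."""
    return [
        n for n in walk(root)
        if n.role == role and n.name == name and n.visible and n.enabled
    ]


window = Node("Window", "Untitled - Editor", children=[
    Node("ToolBar", "Main", children=[
        Node("Button", "Open"),
        Node("Button", "Save"),
        Node("Button", "Save", visible=False),  # hidden duplicate in an overflow menu
    ]),
    Node("Edit", "Document"),
])

matches = find(window, role="Button", name="Save")
print(len(matches), matches[0].role, matches[0].name)
```

The hidden duplicate is filtered out by a flag the node already carries. A cropped image of the same button has no such flag to consult.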

Real selector strings Terminator parses

role:Button && name:Save
rightof:name:Username
below:name:OK
has:role:Edit
process:notepad >> role:Edit
!role:Button && visible:true
name:Save || name:Submit
window:Calculator >> role:Button >> name:Seven
near:text:Cancel
nativeid:42
nth:0
..

25 ways to name an element, zero ways to name a pixel

This is the part you can check yourself. Terminator's selector engine lives in crates/terminator/src/selector.rs. The Selector enum at the top of that file defines 25 variants. They group into five kinds of question you can ask about an element:

Identity. role:, name:, id:, nativeid:, classname:, text:. What the element is and what it is called.
Spatial. rightof:, leftof:, above:, below:, near:. An element described by its position relative to another element, computed from the tree at runtime.
Structure. has: (a Playwright-style descendant requirement), .. (parent navigation), >> (chain through the hierarchy), and nth: for ordinal selection.
Filters. visible: and process:. State and scope a picture cannot record.
Boolean. &&, ||, and !, with parentheses. The test file boolean_selector_tests.rs exercises real expressions like (role:button && name:Submit) || (role:link && name:Cancel) and !role:button && visible:true.
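The boolean group is the easiest one to reason about in code. The sketch below is a toy evaluator for the two expressions quoted above; it is not the parser in selector.rs. It assumes that ! binds tighter than &&, which binds tighter than ||, that matching is case-insensitive, and that values contain no spaces. The element dictionaries are invented.

```python
import re
from typing import Callable, Dict, List, Optional

Element = Dict[str, str]
Predicate = Callable[[Element], bool]

# Operators first, then any run of characters that is not an operator or a space.
TOKEN = re.compile(r"\s*(&&|\|\||[()!]|[^\s()!&|]+)")


def tokenize(text: str) -> List[str]:
    tokens: List[str] = []
    pos, text = 0, text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text: str) -> Predicate:
    """Compile a boolean selector into a predicate (assumed precedence: !, &&, ||)."""
    tokens = tokenize(text)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or() -> Predicate:
        options = [parse_and()]
        while peek() == "||":
            take()
            options.append(parse_and())
        return lambda el: any(p(el) for p in options)

    def parse_and() -> Predicate:
        clauses = [parse_unary()]
        while peek() == "&&":
            take()
            clauses.append(parse_unary())
        return lambda el: all(p(el) for p in clauses)

    def parse_unary() -> Predicate:
        token = peek()
        if token == "!":
            take()
            inner = parse_unary()
            return lambda el: not inner(el)
        if token == "(":
            take()
            inner = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return inner
        if token is None or ":" not in token:
            raise ValueError(f"expected key:value, got {token!r}")
        take()
        key, _, value = token.partition(":")
        return lambda el: el.get(key, "").lower() == value.lower()

    predicate = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return predicate


elements = [
    {"role": "Button", "name": "Submit", "visible": "true"},
    {"role": "Link", "name": "Cancel", "visible": "true"},
    {"role": "Button", "name": "Cancel", "visible": "true"},
    {"role": "Text", "name": "Hint", "visible": "true"},
    {"role": "Text", "name": "Tooltip", "visible": "false"},
]

submit_or_cancel = parse("(role:button && name:Submit) || (role:link && name:Cancel)")
visible_non_buttons = parse("!role:button && visible:true")
print([e["name"] for e in elements if submit_or_cancel(e)])
print([e["name"] for e in elements if visible_non_buttons(e)])
```

The point of the exercise is that the expression composes: each clause is a predicate over a node, and operators combine predicates. There is no equivalent operation on two cropped images.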

Now count the equivalents on the screenshot side. A template-match library exposes one way to name a target: a cropped image. A vision model adds one more: a natural-language description that it grounds into a bounding box. Neither is composable. You cannot AND two template images. You cannot ask a picture for the element to the left of another element without writing the geometry yourself. The asymmetry is not a matter of polish. One side is a language and the other side is a pair of primitives.

Watch one selector narrow the tree

A selector resolves by progressively filtering nodes. Each clause shrinks the candidate set until one element is left. A screenshot never narrows: it carries the whole window into every match attempt.

role:Button && name:Save resolving

The whole window

A single desktop window can expose hundreds of accessibility nodes: every button, field, menu item, and label.
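The narrowing can be sketched with a flat list of nodes and one filter per clause. The node list is invented and far smaller than a real window's tree; the stage labels mirror the selector's clauses but the filtering code is an illustration, not Terminator's resolver.

```python
from typing import List, Tuple

# (role, name) pairs standing in for the nodes of one window.
nodes: List[Tuple[str, str]] = [
    ("Window", "Untitled - Editor"),
    ("MenuItem", "File"),
    ("MenuItem", "Edit"),
    ("Button", "New"),
    ("Button", "Open"),
    ("Button", "Save"),
    ("Button", "Print"),
    ("Edit", "Document"),
    ("Text", "Ln 1, Col 1"),
]

# Each stage keeps only the candidates that satisfy one more clause.
stages = [
    ("whole window", lambda n: True),
    ("role:Button", lambda n: n[0] == "Button"),
    ("name:Save", lambda n: n[1] == "Save"),
]

candidates = nodes
counts = []
for label, keep in stages:
    candidates = [n for n in candidates if keep(n)]
    counts.append(len(candidates))
    print(f"{label:<14}{len(candidates)} candidate(s)")
```

Here the candidate set goes from nine nodes to four buttons to the single Save button.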

How each side resolves one click

The two paths below do not just differ in reliability. They differ in how many moving parts sit between "I want to click Save" and an actual click. The selector path is one request and one response. The screenshot path adds a capture, an inference step, and a coordinate that is only ever a best guess.

Selector resolution vs screenshot resolution

The red step at the end of the screenshot path is the one that bites. Once you have a coordinate, you have committed to it. If the window scrolled, if a notification pushed the layout down, if the inference was off by twelve pixels, the click lands on the wrong thing and the script keeps going as if it succeeded. The selector path never produces a loose coordinate to be wrong about: it hands you an element, and the click goes through the operating system's own accessibility action for that element.
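The stale-coordinate failure is easy to reproduce in a toy layout. In this sketch the rectangles, the 40-pixel shift, and the helper names are all hypothetical; `hit_test` stands in for "whatever is under the cursor when the click lands".

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[int, int, int, int]  # left, top, width, height
Point = Tuple[int, int]


def center(rect: Rect) -> Point:
    left, top, width, height = rect
    return left + width // 2, top + height // 2


def hit_test(layout: Dict[str, Rect], point: Point) -> Optional[str]:
    """Name of the element that contains the point, if any."""
    x, y = point
    for name, (left, top, width, height) in layout.items():
        if left <= x < left + width and top <= y < top + height:
            return name
    return None


def pushed_down(layout: Dict[str, Rect], dy: int) -> Dict[str, Rect]:
    """The same layout after something above it grew by dy pixels."""
    return {name: (l, t + dy, w, h) for name, (l, t, w, h) in layout.items()}


before = {"Delete": (10, 10, 80, 30), "Save": (10, 50, 80, 30)}
after = pushed_down(before, 40)  # a notification banner appears

stale_point = center(before["Save"])  # coordinate committed before the shift
fresh_point = center(after["Save"])   # bounds re-read from the element at click time

print("stale coordinate hits:", hit_test(after, stale_point))
print("element lookup hits:  ", hit_test(after, fresh_point))
```

The stale coordinate now sits on Delete, and nothing in the screenshot path reports that. Resolving the element again at click time hits Save because the element, not the coordinate, was the thing being addressed.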

Where a screenshot is still the right call

Selector-first does not mean selector-only. There are real surfaces where the accessibility tree has nothing useful in it, and on those surfaces a screenshot plus OCR or a vision model is the honest answer. The mistake is using screenshots as the default and falling back to selectors, instead of the other way around.

Screenshot fallback: yes for the first three, no for the last two

The surface is custom-painted: a game UI, a Figma-style canvas, a chart rendered straight to a bitmap. The accessibility tree exposes one opaque node and there is nothing inside it to select.
You are driving a remote-desktop or VNC stream where only pixels cross the wire and no accessibility tree exists on your side of the connection.
You are testing the visual result itself: layout, spacing, color, a rendering regression. That is exactly what a screenshot is for, and a selector cannot see any of it.
The app is a normal native Windows or macOS program. It almost certainly exposes a tree, so a selector will be faster and will not break the next time someone restyles a button.
You reached for screenshots because the tree 'looked hard to read'. Inspect it once with Accessibility Insights or Accessibility Inspector and the selector usually writes itself.

The dividing line is whether the element exists in the tree at all. A button in a normal AppKit, WinUI, WPF, or WinForms app has a node with a role and a name. A button drawn by a game engine onto a Metal or Direct3D surface does not: the whole window is one opaque node. The first is a selector's job. The second is genuinely a screenshot's job. Confusing the two, in either direction, is where automations go brittle.
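The ordering argument, selectors first and vision only on a miss, fits in a few lines. This is a sketch of the control flow only: `resolve` and the lookup callables are hypothetical, and the return values are placeholders rather than real element handles or detections.

```python
from typing import Callable, List, Optional, Tuple


def resolve(
    selector_lookup: Callable[[], Optional[object]],
    vision_lookup: Callable[[], Optional[object]],
) -> Tuple[str, object]:
    """Try the accessibility tree first; pay for vision only when the tree has nothing."""
    element = selector_lookup()
    if element is not None:
        return "selector", element
    region = vision_lookup()
    if region is not None:
        return "vision", region
    raise LookupError("target not found by selector or by vision")


calls: List[str] = []


def tree_hit() -> Optional[object]:
    calls.append("selector")
    return "Button 'Save'"  # placeholder for an element handle


def tree_miss() -> Optional[object]:
    calls.append("selector")
    return None  # e.g. one opaque node for a game canvas


def vision_hit() -> Optional[object]:
    calls.append("vision")
    return (412, 300, 64, 24)  # placeholder bounding box


native = resolve(tree_hit, vision_hit)
calls_native = list(calls)
calls.clear()
canvas = resolve(tree_miss, vision_hit)
calls_canvas = list(calls)

print(native, calls_native)
print(canvas, calls_canvas)
```

For the native app the vision path never runs. For the canvas it runs exactly once, after the tree has been asked and has come back empty.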

How Terminator draws that line

Terminator is a desktop automation framework with an API shaped like Playwright, except it targets the whole operating system instead of just the browser. It drives apps through the native accessibility APIs: UI Automation on Windows, the Accessibility API on macOS. Element lookup is selector-first by default, which is why the selector engine in selector.rs is as expressive as it is. OCR and vision detection ship in the box, but they are the fallback for surfaces the tree cannot describe, not the primary mechanism.

It is a developer framework, not a consumer app. You pull in the Rust crate terminator-rs, the Python bindings, or the Node package, or you wire the MCP server into Claude Code, Cursor, or VS Code so an AI assistant can drive real desktop apps as a tool. Whichever entry point you pick, the selector is the unit of work, and the selector is a query. The full prefix list is documented in the selector cheat sheet in the open-source repository, and the core crate is published as terminator-rs on crates.io.

Accessibility selectors vs screenshot automation: common questions

Are accessibility selectors actually faster than screenshot matching?

Yes, and the reason is structural. A selector resolves through accessibility API calls that return an element handle and its bounds rect directly. Screenshot matching first captures a bitmap of the window, then runs template matching or a vision model over that bitmap before it has any coordinate at all. The selector skips both the capture and the inference. For an AI agent the difference is also a model turn: with a selector the agent names the element, with screenshots it spends a turn grounding a coordinate from pixels.

Do accessibility selectors break when the UI is restyled?

No, and that is the whole point of using them. A selector names an element by its accessibility role and its accessible name, not by pixels. Dark mode, a 150 percent DPI scale, a moved toolbar, a refreshed icon set: none of that changes role:Button && name:Save. A cropped template image breaks on every one of those changes, because the picture it was cropped from no longer matches the screen.

What about apps that do not expose an accessibility tree?

Some surfaces genuinely do not expose a usable tree: custom-painted canvases, GPU-composited game UIs, remote-desktop pixel streams. There a selector has nothing structural to bind to, and screenshots plus OCR are the correct tool. Terminator keeps OCR and vision detection as a built-in fallback for exactly those cases. The point is not that screenshots are never right, it is that they should be the exception, not the default.

Can a screenshot tool target 'the field to the right of the Email label'?

Not directly. A screenshot tool matches a template image or a fixed coordinate, so a relationship like 'to the right of' has to be hand-coded as pixel math against a layout you hope never moves. Terminator's selector engine has spatial operators built in: rightof:, leftof:, above:, below:, and near:. You write rightof:name:Email and the resolver computes the geometry from the accessibility tree at runtime.
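One plausible reading of "to the right of" can be sketched from bounds rectangles alone. The rule used here (nearest element that starts past the anchor's right edge and overlaps it vertically) is an assumption for illustration; the article does not spell out Terminator's exact geometric rule. The `Item` class and the form layout are invented.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Item:
    """A toy element with the bounds an accessibility tree would report."""
    role: str
    name: str
    x: float
    y: float
    width: float
    height: float


def right_of(anchor: Item, candidates: List[Item]) -> Optional[Item]:
    """Nearest candidate past the anchor's right edge that overlaps it vertically."""
    right_edge = anchor.x + anchor.width

    def overlaps_vertically(c: Item) -> bool:
        return c.y < anchor.y + anchor.height and anchor.y < c.y + c.height

    hits = [
        c for c in candidates
        if c is not anchor and c.x >= right_edge and overlaps_vertically(c)
    ]
    return min(hits, key=lambda c: c.x - right_edge, default=None)


email_label = Item("Text", "Email", 20, 100, 60, 20)
email_field = Item("Edit", "", 90, 98, 200, 24)
password_label = Item("Text", "Password", 20, 140, 80, 20)
password_field = Item("Edit", "", 110, 138, 200, 24)
form = [email_label, email_field, password_label, password_field]

print(right_of(email_label, form))

# The same form after a banner pushes everything down by 40 px.
moved_form = [replace(item, y=item.y + 40) for item in form]
print(right_of(moved_form[0], moved_form))
```

Because the geometry is computed from live bounds, the same query still returns the email field after the layout moves. The hand-coded pixel offset would need to be re-measured.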

Is this just Playwright?

Same idea, wider scope. Playwright's getByRole locators query the browser's accessibility tree, which is why they survive CSS refactors. Terminator applies the same selector model to the operating system accessibility tree, so it reaches every native app, not only web pages. The selector syntax is intentionally shaped like Playwright's: role and name, chaining, :has(), parent navigation.

How do I find the role and name to put in a selector?

Inspect the live tree once. On Windows use Accessibility Insights, on macOS use Accessibility Inspector. Terminator can also dump a window's tree with getWindowTree. You read the node, copy its role and its accessible name into a selector, and you are done. It is the same workflow as opening dev tools to copy a CSS selector, except the tree is the OS accessibility tree.
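The "read the node, copy it into a selector" step looks roughly like this. The sketch prints a toy tree in an indented form and builds a selector string from one node. The `Node` class and the output format are hypothetical; they do not reproduce what getWindowTree or the inspector tools actually emit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A toy accessibility node with just a role, a name, and children."""
    role: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def dump(node: Node, depth: int = 0) -> List[str]:
    """One line per node, indented by depth, showing role and accessible name."""
    lines = [f"{'  ' * depth}{node.role} \"{node.name}\""]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


def selector_for(node: Node) -> str:
    """Compose a role-and-name selector string in the syntax used in this article."""
    return f"role:{node.role} && name:{node.name}"


save = Node("Button", "Save")
window = Node("Window", "Untitled - Editor", children=[
    Node("ToolBar", "Main", children=[Node("Button", "Open"), save]),
    Node("Edit", "Document"),
])

print("\n".join(dump(window)))
print(selector_for(save))
```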

Does Terminator ever use screenshots?

Yes, as a deliberate fallback rather than the default. Selectors resolve first; OCR and vision detection kick in only for surfaces the accessibility tree does not describe. That is the opposite of a screenshot-first tool, which treats the tree as an afterthought and grounds everything from pixels even when a clean structural address was available.
