Structural locators vs pixel matching for computer use

If your agent finds a button by matching a cropped image or replaying a coordinate, you have stored the one property of a UI that is guaranteed to change: how it looked the day you captured it. A structural locator stores something that does not move when the pixels do. This is the difference, with the line of code that makes it real.

Matthew Diakonov, Written with AI

Published May 22, 2026 · 7 min

Direct answer (verified 2026-05-22)

A structural locator addresses a UI element by what it is: role, name, and position in the accessibility tree. It re-resolves that query against the live UI every time it acts. Pixel matching addresses an element by what it looked like: a captured image or a fixed coordinate, frozen at capture time, which breaks the instant resolution, theme, scale, localization, or layout shifts. For a computer-use agent the locator is deterministic and survives UI change; the pixel match is a one-shot guess against a frame that is already stale. You can read the locator mechanism in crates/terminator/src/locator.rs.

Two definitions, side by side

Strip away the framework names and the two approaches store fundamentally different things. One stores a picture. One stores a question.

Pixel matching

A photograph

A cropped reference image or a recorded coordinate. The appearance is captured at one moment and compared against later screenshots by correlation. When the screen changes, the comparison drifts. The control still works; the picture no longer fits.

Structural locator

A question

A query over the accessibility tree: role:Button|name:Save. It holds no element and no coordinate. It re-asks the tree every time you act, so it answers correctly even after the layout moves.

What each one actually stores, and when it resolves

The reliability gap is not about accuracy on a static screen. On a frozen screenshot a good template match and a locator both find the button. The gap opens the moment anything between capture and action changes. Compare what each approach holds in memory between steps.

The artifact you keep between steps

# What a pixel matcher keeps
template = crop of "Save" button, captured at 2x scale, light theme
last_hit = (412, 238)   # where it matched last run

# Next run, the window opened on a 1x external monitor,
# the user is in dark mode, and the toolbar gained one icon.
match_score = 0.61      # below threshold, or worse:
click(412, 238)         # confident click into empty space

stores appearance: a crop and a coordinate
scale, theme, locale, or layout change invalidates it
a bad match still returns a coordinate, so the click lands somewhere wrong
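For contrast, the locator side of the same comparison can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Terminator's API: `Node`, `ToyLocator`, `parse_selector`, and `walk` are hypothetical names, and the parser handles only the `role:Button|name:Save` form used in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element in a toy accessibility tree (hypothetical model)."""
    role: str
    name: str
    bounds: tuple  # (x, y, width, height)
    children: list = field(default_factory=list)


def parse_selector(selector: str) -> dict:
    """Split 'role:Button|name:Save' into {'role': 'Button', 'name': 'Save'}."""
    return dict(part.split(":", 1) for part in selector.split("|"))


def walk(node):
    """Yield every node in the tree, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


@dataclass(frozen=True)
class ToyLocator:
    """Holds only the query. There is no field for a found element."""
    selector: str

    def resolve(self, root: Node) -> Node:
        wanted = parse_selector(self.selector)
        for node in walk(root):
            if all(getattr(node, key) == value for key, value in wanted.items()):
                return node
        raise LookupError(f"no element matches {self.selector}")


save = ToyLocator("role:Button|name:Save")

# First frame: Save sits at x=400.
before = Node("Window", "Editor", (0, 0, 800, 600), [
    Node("Button", "Save", (400, 230, 24, 16)),
])
# Later frame: the toolbar gained an icon and Save moved right.
after = Node("Window", "Editor", (0, 0, 800, 600), [
    Node("Button", "Share", (400, 230, 24, 16)),
    Node("Button", "Save", (430, 230, 24, 16)),
])

print(save.resolve(before).bounds)  # (400, 230, 24, 16)
print(save.resolve(after).bounds)   # (430, 230, 24, 16)
```

The locator object is identical in both calls. Only the tree it is pointed at changed, and a stored coordinate of x=400 would now land on Share.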

The line that makes it a locator, not a snapshot

The reason a structural locator is resilient is not a clever matching heuristic. It is that the locator deliberately resolves nothing until you act. Here is the struct, verbatim in spirit from crates/terminator/src/locator.rs:

// crates/terminator/src/locator.rs
pub struct Locator {
    engine: Arc<dyn AccessibilityEngine>,
    selector: Selector,
    timeout: Duration,        // default for this locator instance
    root: Option<UIElement>,
}

// One-time search by default. Opt into polling by setting a timeout.
const DEFAULT_LOCATOR_TIMEOUT: Duration = Duration::from_secs(0);

There is no field for a found element. Every action method re-runs the query against the live tree: wait() calls the engine’s find_element again, all() calls find_elements again, and wait_for() loops on a 100 millisecond poll, re-checking the element against one of four conditions until it holds:

pub enum WaitCondition {
    Exists,
    Visible,
    Enabled,
    Focused,
}

// wait_for(): poll_interval = Duration::from_millis(100)
// loop: validate() -> check condition -> sleep 100ms -> repeat
// on timeout: AutomationError::Timeout, with the selector in the message

That is the uncopyable part. A pixel matcher cannot adopt this behavior because it has nothing to re-resolve: its target is a crop, and a crop is just as stale on the second poll as the first. The locator can poll because its target is a description, and the description is still true after the button moves.
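The polling behavior described above can be sketched in Python. This is a minimal illustration of the loop shape, not a port of the Rust source: `wait_for`, `LocatorTimeout`, and the `find` callback are hypothetical, and element state is modeled as a plain dict.

```python
import time
from enum import Enum


class WaitCondition(Enum):
    EXISTS = "exists"
    VISIBLE = "visible"
    ENABLED = "enabled"
    FOCUSED = "focused"


class LocatorTimeout(Exception):
    """Hypothetical stand-in for AutomationError::Timeout."""


def wait_for(find, selector, condition, timeout_s, poll_interval_s=0.1):
    """Re-run `find` until the element satisfies `condition` or time runs out.

    `find` is any callable that queries the live UI and returns a dict of
    element state, or None when nothing matches.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        element = find(selector)  # fresh lookup on every poll, nothing cached
        if element is not None:
            if condition is WaitCondition.EXISTS or element.get(condition.value, False):
                return element
        if time.monotonic() >= deadline:
            raise LocatorTimeout(
                f"Timed out after {timeout_s}s waiting for element {selector}"
            )
        time.sleep(poll_interval_s)


# A fake UI where Save becomes enabled on the third lookup.
calls = {"count": 0}


def fake_find(selector):
    calls["count"] += 1
    return {"name": "Save", "visible": True, "enabled": calls["count"] >= 3}


wait_for(fake_find, "role:Button|name:Save", WaitCondition.ENABLED, timeout_s=1.0)
print(calls["count"])  # 3
```

With `timeout_s=0` the loop runs exactly one lookup and raises if it fails, which mirrors the one-shot default. The selector string travels inside the exception message, so the caller knows which query failed.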

What it looks like in code

The public API is shaped like Playwright, so the lazy-resolution model reads the way you expect. You build a locator, optionally wait for a condition, then act. Each call resolves fresh. This is from examples/notepad.py and the Node test suite:

# Python: build a locator, resolve it now, act on it
editor   = desktop.open_application("notepad.exe")
add_tab  = await editor.locator("name:Add New Tab").first()
add_tab.click()

document = await editor.locator("role:Document").first()
document.type_text("hello from terminator!")

// Node: wait for a condition before acting
const win = await desktop
  .locator("role:window")
  .waitFor("exists", 5000);   // re-resolves until it exists or 5s passes

Notice there is no screenshot in that loop and no coordinate anywhere. The agent never says “click at (412, 238)”. It says “find the thing named Save and invoke it,” and the framework re-answers that on the spot.

Four numbers that define the behavior

These come straight from the source, not a benchmark. They are the dials that govern how a locator resolves.

0 s default locator timeout: one-shot search unless you ask it to wait (DEFAULT_LOCATOR_TIMEOUT)

100 ms poll interval while a locator waits for a condition (wait_for)

4 wait conditions a locator re-checks each poll: exists, visible, enabled, focused

5 spatial-relation selector kinds: right of, left of, above, below, near

The failure mode is the real difference

Both approaches succeed on a clean run. What separates them is what happens when the target is not where it was. A pixel matcher returns its best-correlation coordinate no matter how poor the match, so a missed target becomes a click into empty space that corrupts the next several steps silently. A locator that cannot resolve raises a typed, located error and stops.

The timeout message carries the selector string, so when an agent stalls you know exactly which element it could not find. That is the difference between a five-minute fix and an hour of bisecting a run that drifted three steps after the actual failure.
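The silent half of that failure is easy to reproduce with a toy matcher. The sketch below is pure Python on tiny binary grids and uses a simple agreement score; it is not OpenCV's matchTemplate or PyAutoGUI's locateOnScreen, and the function names `best_match`, `naive_target`, and `checked_target` are hypothetical.

```python
def best_match(screen, template):
    """Slide `template` over `screen`; return (score, (row, col)) of the best fit.

    Score is the fraction of template pixels that agree, from 0.0 to 1.0.
    """
    th, tw = len(template), len(template[0])
    sh, sw = len(screen), len(screen[0])
    best = (-1.0, (0, 0))
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            agree = sum(
                screen[r + i][c + j] == template[i][j]
                for i in range(th)
                for j in range(tw)
            )
            score = agree / (th * tw)
            if score > best[0]:
                best = (score, (r, c))
    return best


def naive_target(screen, template):
    """Always returns a coordinate, however weak the match."""
    _score, position = best_match(screen, template)
    return position


def checked_target(screen, template, threshold=0.9):
    """Refuses to return a coordinate when the best score is below threshold."""
    score, position = best_match(screen, template)
    if score < threshold:
        raise LookupError(f"best score {score:.2f} is below threshold {threshold}")
    return position


template = [[1, 0],
            [0, 1]]
with_button = [[0, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 0]]
without_button = [[0] * 4 for _ in range(4)]

print(best_match(with_button, template))       # (1.0, (1, 1))
print(best_match(without_button, template))    # (0.5, (0, 0))
print(naive_target(without_button, template))  # (0, 0)
```

An argmax over scores always has a winner, so `naive_target` hands back `(0, 0)` on a screen that does not contain the button at all. A threshold, as in `checked_target`, turns that into an error, but the threshold itself has to be tuned per theme and scale, which is the fragility the article describes.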

Where pixel matching still wins

A structural locator needs structure to query. On surfaces that have none, the locator is the worse tool and you should reach for pixels. A game rendered to a GPU canvas, a remote-desktop viewport delivered as a single video stream, the document area of Figma or Photoshop, a custom-drawn chart: the accessibility tree hands back one opaque node with the window bounds and nothing actionable inside. There is no role and no name to match, so a vision model or a pixel detector is the only thing carrying signal there.

The honest answer is not “locators always, pixels never.” It is: use a locator wherever the tree has structure, which is the large majority of native and well-built desktop apps, and fall back to pixels only on the opaque surfaces. The mistake the brittle agents make is using pixels everywhere because pixels are the only thing a pure-screenshot loop can see. For how to merge both signals into one tool result when you do need them together, see the accessibility tree vs pixel breakdown.
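One way to picture that rule is a per-surface check before choosing a targeting strategy. The sketch below is an assumption about how such a check could look, not how Terminator decides: the dict-shaped tree, `has_structure`, and `pick_strategy` are all hypothetical.

```python
def has_structure(node):
    """True if any descendant carries both a role and a non-empty name."""
    for child in node.get("children", []):
        if child.get("role") and child.get("name"):
            return True
        if has_structure(child):
            return True
    return False


def pick_strategy(window):
    """Prefer a locator when the tree is queryable, else fall back to pixels."""
    return "locator" if has_structure(window) else "pixels"


notepad = {
    "role": "Window", "name": "Notepad",
    "children": [{"role": "Document", "name": "Text editor", "children": []}],
}
# A GPU-rendered surface: one unnamed pane and nothing inside it.
game = {
    "role": "Window", "name": "Game",
    "children": [{"role": "Pane", "name": "", "children": []}],
}

print(pick_strategy(notepad))  # locator
print(pick_strategy(game))     # pixels
```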

Structural locators vs pixel matching: common questions

What is a structural locator, in one sentence?

A description of an element by what it is: its role (Button, Edit, Document), its name, its place in the accessibility tree, and optionally its spatial relation to another element. The locator does not hold a pixel, a coordinate, or even a resolved element handle. It holds the query. You point it at a live window and it walks the tree to find a match at the moment you act. In Terminator that object is the Locator struct in crates/terminator/src/locator.rs, which carries an engine reference, a Selector, a timeout, and an optional root, and nothing else.

What is pixel matching, and how is it different?

Pixel matching covers two related techniques. Template matching crops a reference image of a control and slides it across a fresh screenshot looking for the best correlation (PyAutoGUI's locateOnScreen, OpenCV's matchTemplate). Coordinate replay records that the Save button was at (412, 238) and clicks there next time. Both freeze the target at capture time. A structural locator freezes nothing: it stores the description and resolves it against whatever the screen looks like when the action runs.

Why does pixel matching break so easily?

Because the thing it stored is appearance, and appearance is the least stable property of a UI. Change the display scale and every cached coordinate is off by the scale factor. Switch from light to dark theme and the template no longer correlates. Resize the window, collapse a sidebar, localize the button text, or let the OS animate a transition mid-frame, and the captured crop is matching against pixels that have moved or recolored. The control is still there, still does the same thing, still has the same role and name. Only its pixels changed, and pixels are exactly what the match keyed on.
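The scale-factor case is plain arithmetic. The sketch below assumes the recorded click was stored in physical pixels and that the button keeps the same scale-independent position, here chosen as (206, 119) purely for illustration; `to_physical` is a hypothetical helper.

```python
def to_physical(logical_point, scale):
    """Convert a scale-independent point to physical pixels."""
    x, y = logical_point
    return (round(x * scale), round(y * scale))


button_logical = (206, 119)  # assumed position in scale-independent units

recorded_click = to_physical(button_logical, 2.0)  # captured on a 2x display
actual_now = to_physical(button_logical, 1.0)      # replayed on a 1x display
miss = (recorded_click[0] - actual_now[0], recorded_click[1] - actual_now[1])

print(recorded_click)  # (412, 238)
print(actual_now)      # (206, 119)
print(miss)            # (206, 119)
```

The button did not move in the layout, yet the replayed click lands more than two hundred pixels away from it.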

Does a structural locator resolve once and cache the element?

No, and that is the whole point. Each action method on the locator re-runs the search. wait() calls the engine's find_element again. all() calls find_elements again. wait_for() loops, calling validate() every 100 milliseconds until the condition holds or the timeout fires. There is no stale handle to go bad between the moment you build the locator and the moment you click. The query is the durable artifact; the element is recomputed on demand.

What happens when a structural locator cannot find its target?

You get a typed error, not a wrong click. The engine returns ElementNotFound, and the locator upgrades it to AutomationError::Timeout with the selector string baked into the message, for example 'Timed out after 1s waiting for element role:Button|name:Save'. A pixel match in the same situation returns its best-correlation coordinate regardless of whether the match is any good, so the agent clicks confidently into empty space and the failure surfaces three steps later as nonsense. Loud, located failure beats silent, displaced failure every time you are debugging an agent.

Can a structural locator express position, the way a screenshot region can?

Yes, structurally rather than absolutely. The Selector enum in crates/terminator/src/selector.rs has RightOf, LeftOf, Above, Below, and Near, each wrapping another selector. So 'the field to the right of the Total label' is one query that re-resolves spatially as the layout reflows, instead of a fixed rectangle that points at the wrong place after a resize. You also get Has for parent-by-child matching and Nth for index selection, and you chain them with .locator() to scope into a subtree.
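A spatial relation of this kind can be sketched as a filter over bounding boxes. The rule used here (candidate starts at or past the anchor's right edge, overlaps it vertically, nearest one wins) is an assumption for illustration and may differ from how Terminator's RightOf resolves; `Element`, `find`, and `right_of` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int


def find(elements, role, name):
    """Return the first element with this role and name."""
    for element in elements:
        if element.role == role and element.name == name:
            return element
    raise LookupError(f"no element matches role:{role}|name:{name}")


def right_of(anchor, elements, role):
    """Nearest element of `role` to the right of `anchor` on the same row."""
    anchor_right = anchor.x + anchor.width
    candidates = [
        e for e in elements
        if e.role == role
        and e.x >= anchor_right                 # starts past the anchor
        and e.y < anchor.y + anchor.height      # vertical ranges overlap
        and anchor.y < e.y + e.height
    ]
    if not candidates:
        raise LookupError(f"nothing with role {role} right of {anchor.name}")
    return min(candidates, key=lambda e: e.x - anchor_right)


before = [
    Element("Text", "Total", 20, 100, 60, 20),
    Element("Edit", "Amount", 100, 100, 120, 20),
    Element("Edit", "Notes", 100, 140, 120, 20),
]
# After a resize the whole form moved down and the fields shifted right.
after = [
    Element("Text", "Total", 20, 300, 60, 20),
    Element("Edit", "Amount", 200, 300, 120, 20),
    Element("Edit", "Notes", 200, 340, 120, 20),
]

for layout in (before, after):
    field = right_of(find(layout, "Text", "Total"), layout, "Edit")
    print(field.name, (field.x, field.y))
# Amount (100, 100)
# Amount (200, 300)
```

The query stays the same across both layouts. A fixed rectangle recorded from the first layout would point at empty space in the second.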

Is this just Playwright locators for the desktop?

Deliberately, yes. The API is shaped like Playwright's: desktop.locator("role:window").waitFor("visible", 5000), then chain into it, then act. The difference is the backend. Playwright resolves against the browser DOM. Terminator resolves against the OS accessibility tree (UIAutomation on Windows, AXUIElement on macOS), so the same lazy, re-resolving locator model reaches Notepad, Settings, Office, and native dialogs, not just a browser tab.

When is pixel matching actually the right call?

When there is no structure to query. A game rendered to a GPU canvas, a remote-desktop viewport that arrives as a single video stream, a Figma or Photoshop document area, a custom-painted chart: the accessibility tree returns one opaque node with the window bounds and nothing inside. There is no role and no name to match on, so a vision or pixel signal is the only thing carrying information. The honest architecture uses a locator wherever the tree has structure and falls back to pixels only on those opaque surfaces.

Where can I read the implementation?

The Locator type, its wait/validate/wait_for methods, the 100ms poll interval, and the WaitCondition enum are all in crates/terminator/src/locator.rs in github.com/mediar-ai/terminator. The selector grammar, including the spatial relations, is in crates/terminator/src/selector.rs. Runnable examples that build locators and act on them are in the examples folder, for instance examples/notepad.py.

Deeper dives into the same stack

Adjacent reads

Accessibility selectors vs screenshot automation (Alternative)

The selector grammar in depth: a selector is a query, a screenshot is a guess. Role, name, spatial relations, and boolean composition in one string.

Accessibility tree vs pixel for computer use: the framing is wrong (Alternative)

When the tree is empty, pixels carry the signal. How Terminator clusters UIA, DOM, OCR, and vision detections into one prefixed list the model clicks by.

Accessibility tree vs PyAutoGUI for desktop automation (Guide)

Why selector-driven trees beat pixel coordinates and OpenCV template matching for production desktop automation.