Three axes
Why accessibility APIs beat OCR and pixel matching for OS-level automation
OCR and pixel matching identify a button by what it looks like to the camera. Accessibility APIs identify it by what the developer named it in code. That gap shows up in three places: latency, stability, and i18n. The third one is the one no other writeup on this topic mentions.
Direct answer (verified 2026-05-17)
Accessibility APIs identify UI elements by semantic identity set in app code (role plus AutomationId plus class name) instead of by pixel patterns the OS happened to render. The practical consequences:
- Latency. One COM call into the UIA runtime that is already loaded in both processes, vs a full screen capture plus ML inference plus coordinate math plus synthetic input.
- Stability. AutomationId is set once in the app's source. DPI changes, theme changes, font substitution, and animation frames do not move it. Pixel coordinates move on every one of those.
- i18n. The role enum and AutomationId are the same in every locale. Windows OCR (Windows.Media.Ocr) is bound to whichever language packs are installed on the user's machine, via OcrEngine.TryCreateFromUserProfileLanguages(). Switch the customer's Windows display language, and OCR breaks; accessibility selectors do not.
Authoritative source for the OCR-locale behavior: Microsoft docs for OcrEngine.TryCreateFromUserProfileLanguages.
Axis 1 of 3
Latency: one COM call vs a pipeline of screen capture, inference, and synthetic input
Calling invoke() on a UIA element resolves to a single COM call into the IUIAutomationInvokePattern proxy that is already loaded inside the target process. The runtime returns when the target acknowledges. There is no frame buffer involved, no pixel math, no synthetic input event posted to the OS message queue.
The OCR path has a strict minimum number of stages, and you pay all of them on every click. Terminator's own implementation of the Windows OCR path is the cleanest evidence: it lives in crates/terminator/src/platforms/windows/engine.rs starting at line 720, and the stages are visible in the source.
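What OCR-then-click actually executes
- Lock the framebuffer and copy the pixels out.
- Convert RGBA to BGRA and wrap the buffer in a SoftwareBitmap.
- Hand the bitmap to the Windows.Media.Ocr engine and await async recognition.
- Walk the returned lines and words to find the target text.
- Compute screen coordinates and call SendInput to actually click.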
The UIA-pattern alternative collapses every one of those stages into a single message: caller calls invoke_pat.invoke(), target app acknowledges. Terminator's llms.txt at line 243 frames the resulting performance gap as "100x faster (CPU speed, not LLM inference)" specifically against screenshot-based agents like ChatGPT Agents, Claude computer use, and BrowserUse. The multiplier is not the point. The shape of the cost curve is: pattern invocation is bounded by CPU, OCR-then-click is bounded by whatever ML model is doing the recognition.
Axis 2 of 3
Stability: AutomationId is set once in source, pixels are reconstructed every render
The standard argument for accessibility-API stability stops at DPI: if the user drags the window to a 4K monitor, an (x, y) click misses but a tree path does not. That argument is correct and undersold. There are at least five other shifts that move pixels without moving identity, and any one of them is enough to break a pixel-matched script:
- Subpixel anti-aliasing. ClearType on Windows, font smoothing on macOS. The same character at the same size renders to different pixels on different machines.
- Theme changes. Light vs dark, custom accent color, high-contrast mode. Every button repaints.
- Font substitution. If the user does not have the requested font installed, the OS picks a near-match with different glyph widths. The button is now five pixels wider.
- Scrollbar width and chrome. Different OS versions, different accessibility settings, different per-app window-decoration policies all push the content area by a few pixels.
- Animation. Pixel matching mid-fade matches the wrong frame. Half the time the script clicks too early; the other half it clicks the wrong place.
None of those shifts move AutomationId, ControlType, or the element's position in the UIA tree. The selector role:Button && id:save_btn resolves the same element on a 1080p ClearType light-theme box with Segoe UI 9pt as it does on a 4K dark-theme box with a Segoe UI substitute at 11pt mid-fade-in. That is the entire stability story.
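A minimal sketch of that claim, assuming a hypothetical `Element` struct and two made-up layouts rather than Terminator's real types. It models one of the shifts above (font substitution widening a neighbouring button) and compares a lookup by role and AutomationId with a lookup by a recorded pixel coordinate.

```rust
/// Hypothetical, simplified stand-in for a UIA element (not Terminator's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    pub control_type: &'static str,   // locale-independent role, e.g. "Button"
    pub automation_id: &'static str,  // set once in the app's source
    pub bounds: (i32, i32, i32, i32), // x, y, width, height as currently rendered
}

/// Resolve by semantic identity. Rendering details never enter the comparison.
pub fn find_by_role_and_id<'a>(tree: &'a [Element], role: &str, id: &str) -> Option<&'a Element> {
    tree.iter()
        .find(|e| e.control_type == role && e.automation_id == id)
}

/// Resolve by pixel position. The answer depends entirely on the current layout.
pub fn find_at_point(tree: &[Element], x: i32, y: i32) -> Option<&Element> {
    tree.iter().find(|e| {
        let (ex, ey, w, h) = e.bounds;
        x >= ex && x < ex + w && y >= ey && y < ey + h
    })
}

/// Layout with the requested font: Save is centred on (240, 215).
pub fn layout_requested_font() -> Vec<Element> {
    vec![
        Element { control_type: "Button", automation_id: "cancel_btn", bounds: (100, 200, 80, 30) },
        Element { control_type: "Button", automation_id: "save_btn", bounds: (200, 200, 80, 30) },
    ]
}

/// Same dialog after font substitution: wider glyphs push Save to the right.
pub fn layout_substituted_font() -> Vec<Element> {
    vec![
        Element { control_type: "Button", automation_id: "cancel_btn", bounds: (100, 200, 150, 30) },
        Element { control_type: "Button", automation_id: "save_btn", bounds: (270, 200, 80, 30) },
    ]
}
```

In this toy model, the point (240, 215) recorded against the first layout lands inside Cancel in the second one, while `find_by_role_and_id(&tree, "Button", "save_btn")` returns the Save element in both.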
The two failure modes that do break an accessibility selector are both honest: the developer renamed the control's AutomationId (which shows up as a typed ElementNotFoundError, not a silent-wrong-click), or the control is a custom-drawn widget that never implemented a UIA provider (in which case the tree is single-node and you legitimately need the OCR fallback below).
Axis 3 of 3 — the one nobody covers
i18n: your OCR is pinned to the user's installed language packs
This is the axis the existing guides on accessibility-vs-OCR all miss. Pull up Terminator's Windows OCR engine creation:
// crates/terminator/src/platforms/windows/engine.rs:763
let ocr_engine = WinOcrEngine::TryCreateFromUserProfileLanguages()
    .map_err(|e| {
        AutomationError::PlatformError(
            format!("Failed to create Windows OCR engine: {e}"),
        )
    })?;
That single Microsoft API call, Windows.Media.Ocr.OcrEngine.TryCreateFromUserProfileLanguages(), decides which languages your automation can read. The documentation is explicit: it iterates the language profiles installed in the user's settings and tries to build an OCR engine that supports them. If none of the installed languages are supported by the OCR runtime, it returns null and Terminator raises a PlatformError. There is no graceful fallback. There is no auto-download. The user's machine is the source of truth for which scripts are even readable.
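The selection behavior described above can be modelled in a few lines. This is a toy model, not the Windows API: `pick_ocr_language`, `OcrSetupError`, and the rule "first profile language with an available recognizer wins" are assumptions made for illustration.

```rust
/// Hypothetical error type for the toy model below.
#[derive(Debug, PartialEq)]
pub enum OcrSetupError {
    /// None of the user's profile languages had an OCR recognizer available.
    NoSupportedLanguage { tried: Vec<String> },
}

/// Pick the first language from the user's profile that the OCR runtime can read.
/// `profile` and `available` are plain BCP-47 tags such as "en-US".
pub fn pick_ocr_language(profile: &[&str], available: &[&str]) -> Result<String, OcrSetupError> {
    profile
        .iter()
        .copied()
        .find(|lang| available.contains(lang))
        .map(String::from)
        // No download and no guess: the caller gets an explicit error instead.
        .ok_or_else(|| OcrSetupError::NoSupportedLanguage {
            tried: profile.iter().map(|l| l.to_string()).collect(),
        })
}
```

Under these assumptions, a profile of `["ja-JP", "en-US"]` on a machine with only an English recognizer yields an English engine pointed at Japanese text, and a profile of `["ja-JP"]` alone yields an error. Neither outcome is controlled by the automation author.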
One automation, two locales
OCR script targeting a Save button by reading pixels. Selector logic: screenshot the window, OCR-detect the word 'Save', click its center.
- Works on en-US Windows with English OCR installed
- Returns garbage on ja-JP Windows: the button now reads 保存
- Returns garbage on de-DE Windows: the button now reads Speichern
- Fails on every locale that does not have an English OCR pack installed
The trap people fall into is selecting on the visible text instead. Terminator exposes a name: selector that matches the accessible name, and a text: selector that matches visible text. Both of those are translated. So is LocalizedControlType, the property Narrator reads aloud ("button" in English, "Schaltfläche" in German). Terminator's property mapping at crates/terminator/src/platforms/windows/utils.rs lines 165 to 175 keeps LocalizedControlType and the locale-independent ControlType as separate fields specifically so you can choose which one your selector hits.
The discipline that survives locale flips is simple: write selectors against role: (UIA ControlType enum, never translated) and id: (AutomationId, set once in source). Treat name: and text: as last resorts, only when the app developer did not give you an AutomationId. If your selector grammar starts with a string the user can see in their language, you have implicitly opted in to localization risk.
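A small sketch of why that discipline holds, assuming a hypothetical `Control` struct and a made-up `save_button_for` helper that returns the same Save button as three locales would expose it. Only the accessible name differs between locales.

```rust
/// Hypothetical model of one control as the accessibility tree exposes it.
#[derive(Debug, Clone, PartialEq)]
pub struct Control {
    pub role: &'static str,          // ControlType, never translated
    pub automation_id: &'static str, // set once in source
    pub name: &'static str,          // accessible name, translated per locale
}

/// The same Save button under different Windows display languages.
pub fn save_button_for(locale: &str) -> Control {
    let name = match locale {
        "ja-JP" => "保存",
        "de-DE" => "Speichern",
        _ => "Save",
    };
    Control { role: "Button", automation_id: "save_btn", name }
}

/// Equivalent in spirit to `role:Button && id:save_btn`.
pub fn matches_role_and_id(c: &Control, role: &str, id: &str) -> bool {
    c.role == role && c.automation_id == id
}

/// Equivalent in spirit to `name:Save`.
pub fn matches_name(c: &Control, name: &str) -> bool {
    c.name == name
}
```

In this model `matches_role_and_id(&c, "Button", "save_btn")` holds for all three locales, while `matches_name(&c, "Save")` holds only for en-US.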
“No pixel-based automation or image matching by default, though OCR and vision AI are available as supplementary detection methods.”
terminator/llms.txt:9
When OCR and pixel matching are actually the right call
None of the above means OCR is wrong. It means OCR is a fallback, not a default. Three honest cases where you reach for it:
- The target renders through a frame buffer. A fullscreen game on DirectX or OpenGL, a 3D modeller, a custom CAD canvas. The window is one accessibility node and pixels are the only addressable thing.
- The accessibility bridge does not cross the host boundary. A remote desktop session, a VM viewer, or a Citrix-streamed app. The host machine sees a single video surface where the guest's tree should be.
- The app embeds an HTML5 canvas or WebGL surface for its real UI. Figma's drawing surface, a browser-based game engine, a canvas-backed data grid. Each tool's hit region lives inside one canvas element with no children.
Terminator ships exactly that fallback. The ocr_screenshot_with_bounds method at engine.rs:720 returns a tree of OCR-detected lines and words with bounding rectangles in absolute screen coordinates, which you can feed back into the same selector grammar. The right mental model is: try the tree, and only if the tree is empty fall through to OCR. That order is the difference between a flaky script and a deterministic one.
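The tree-first ordering can be sketched as follows. The `locate` function, the `Target` enum, and the `run_ocr` closure are hypothetical names for illustration and do not mirror Terminator's API. The OCR pass is passed as a closure so that it only runs when the tree is empty.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Rect {
    pub x: i32,
    pub y: i32,
    pub width: i32,
    pub height: i32,
}

/// Where the target was found (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Target {
    TreeElement { automation_id: String },
    OcrRegion { text: String, bounds: Rect },
}

/// Try the accessibility tree first; run OCR only when the tree exposes nothing.
/// `tree_ids` holds the AutomationIds the app exposes (empty for a bare canvas).
pub fn locate<F>(tree_ids: &[&str], wanted_id: &str, wanted_text: &str, run_ocr: F) -> Option<Target>
where
    F: FnOnce() -> Vec<(String, Rect)>,
{
    if !tree_ids.is_empty() {
        // The app exposes a tree: a miss here is a real "not found",
        // not a cue to start guessing from pixels.
        return tree_ids
            .iter()
            .find(|id| **id == wanted_id)
            .map(|id| Target::TreeElement { automation_id: (*id).to_string() });
    }
    // Empty tree (game, remote session, canvas): fall through to OCR.
    run_ocr()
        .into_iter()
        .find(|(text, _)| text.as_str() == wanted_text)
        .map(|(text, bounds)| Target::OcrRegion { text, bounds })
}
```

One design choice worth noting: when the tree is non-empty but the id is missing, this sketch returns `None` rather than falling back to OCR, so a renamed AutomationId surfaces as an explicit miss instead of a guessed click.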
What Terminator actually does
Terminator is an open source desktop automation framework for Windows and macOS. The selector grammar is intentionally shaped like Playwright, but the targets are the whole OS. It exposes:
- A Rust core at terminator-rs on crates.io with Windows UIA and macOS AX adapters.
- Node.js bindings at @mediar-ai/terminator via NAPI-RS.
- Python bindings at terminator-py via PyO3.
- An MCP server at npx -y terminator-mcp-agent@latest that exposes desktop control to Claude Code, Cursor, VS Code, and Windsurf as MCP tools.
The source is at github.com/mediar-ai/terminator, MIT licensed. The line numbers cited in this guide are from the current main branch as of 2026-05-17. If the file moves between then and the time you read this, the structure of the argument still holds: structured selectors win on latency, stability, and i18n; OCR is the escape hatch when the tree is empty.
Frequently asked questions
What is the literal difference between an accessibility API and OCR or pixel matching?
An accessibility API reads a tree of UI elements that the app developer (and the OS framework) populated with semantic identity: role, AutomationId, class name, accessible label, bounding rectangle. OCR reads pixels and runs an ML model to guess what text those pixels represent. Pixel matching takes a reference screenshot and looks for a region of the current screen that matches it pixel-for-pixel within some tolerance. The accessibility API answer is symbolic data the app emitted on purpose. The OCR and pixel matching answers are reconstructions of intent from pixels the OS happened to render at a moment in time. When the rendering changes (DPI, theme, locale, anti-aliasing, scrollbar widths, font substitution) the reconstruction can drift while the symbolic answer stays put.
Is the latency advantage really that big?
Yes, and the reason is structural, not a benchmark trick. Pattern invocation through UIA on Windows is one COM call into the accessibility runtime that is already loaded into both processes. There is no screen capture, no buffer conversion, no inference. The OCR path on Windows is at minimum: lock the framebuffer, copy pixels out, convert RGBA to BGRA (Terminator does this at engine.rs:734), wrap it in a SoftwareBitmap, hand it to the Windows.Media.Ocr engine, await async recognition, walk the returned lines and words, then still have to compute screen coordinates and call SendInput to actually click. Terminator's own llms.txt at line 243 frames the result as CPU speed instead of LLM inference. The exact multiplier varies by app, but the ceiling for OCR is set by inference time, not by your CPU.
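To give a sense of what just one of those stages involves, here is a minimal sketch of an RGBA-to-BGRA conversion over a tightly packed 8-bit buffer. It is an illustration written for this guide, not the code at engine.rs:734, and the function name is hypothetical.

```rust
/// Swap the red and blue channels of a tightly packed 8-bit RGBA buffer in place.
/// Returns an error if the buffer length is not a whole number of 4-byte pixels.
pub fn rgba_to_bgra_in_place(pixels: &mut [u8]) -> Result<(), String> {
    if pixels.len() % 4 != 0 {
        return Err(format!(
            "buffer length {} is not a multiple of 4",
            pixels.len()
        ));
    }
    for px in pixels.chunks_exact_mut(4) {
        // Layout per pixel is [R, G, B, A]; swapping 0 and 2 gives [B, G, R, A].
        px.swap(0, 2);
    }
    Ok(())
}
```

Every pixel of the captured frame is touched here before recognition even starts, and this is only one stage of the pipeline. The pattern-invocation path has no equivalent step.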
What is the i18n problem with OCR exactly?
OS-level OCR engines like Windows.Media.Ocr ship language packs separately from the engine. Terminator constructs its OCR engine at crates/terminator/src/platforms/windows/engine.rs line 763 with WinOcrEngine::TryCreateFromUserProfileLanguages(), which Microsoft documents as falling back to whichever languages are installed in the user's profile. An English-only Windows install pointed at a Japanese app produces garbage hiragana-to-ASCII transliteration, an English-and-German install pointed at a Korean app produces nothing usable, and your automation breaks the moment a customer in a different locale runs it. Accessibility APIs sidestep this entirely. AutomationId is set by the app developer in source code, in ASCII, once. It does not change when the OS display language changes. role:Button is a UIA enum value, not a translated string. Your selector role:Button && id:save_btn resolves the same element on en-US, ja-JP, ko-KR, and ar-EG without any per-locale work.
What about LocalizedControlType? Doesn't that mean the accessibility tree is also localized?
There are two properties here: one is localized and the other is not. Terminator handles both at crates/terminator/src/platforms/windows/utils.rs lines 165 to 175. ControlType is an enum (Button, Edit, CheckBox) and is locale-independent. LocalizedControlType is the human-readable string Narrator reads out loud (in English: 'button'; in German: 'Schaltfläche') and is locale-dependent. The lesson is to write selectors against ControlType and AutomationId, never against LocalizedControlType or the user-facing Name. Terminator's role: prefix maps to ControlType, and id: maps to AutomationId. If you find yourself selecting on name:Save you have implicitly opted in to localization risk, and Terminator's docs are explicit about that.
When should I actually reach for OCR or pixel matching?
When the target does not expose a tree. Three honest cases: a fullscreen game or a 3D modeller rendered through DirectX or OpenGL where the only surface on screen is a frame buffer; a remote desktop or VM viewer where the accessibility bridge does not cross the host boundary; a canvas-based design tool where each hit region lives inside one accessibility node. Terminator ships OCR for exactly this fallback role: the Windows engine surfaces ocr_screenshot_with_bounds at engine.rs:720 returning a tree of OCR lines and words in absolute screen coordinates, but the selector grammar still prefers role:, id:, classname:, and nativeid: when they resolve. The framework's own llms.txt at line 9 puts it as 'no pixel-based automation or image matching by default, though OCR and vision AI are available as supplementary detection methods'. Default is structured; OCR is the escape hatch.
Does this also apply to web apps inside Electron or WebView2?
Yes, with one caveat. Electron and WebView2 expose their DOM as an accessibility tree through Chromium's accessibility layer, so role: and id: selectors work just like in a native app. The caveat is that AXPress and AXClick on macOS Chrome and Safari sometimes silently no-op, which is why a production AX engine maintains a hardcoded fallback list for those apps and synthesizes input instead. Terminator's locator grammar is the same across native Win32, WPF, UWP, Cocoa, Electron, and WebView2 surfaces, and the framework decides at runtime whether to fire a UIA pattern or fall back to SendInput. Either way, you do not reach for OCR or pixel matching by default.
Why is pixel matching so brittle, beyond the obvious DPI argument?
DPI is the headline, but five other shifts also break pixel matching. Subpixel anti-aliasing settings (ClearType on Windows, font smoothing on macOS) re-render the same character at the same size into slightly different pixels per machine. Theme changes (light vs dark, accent color) repaint every button. Font substitution silently swaps a font you don't have for a near-match with different glyph widths. Scrollbar widths differ between OS versions and accessibility settings, pushing every element a few pixels. And animation: pixel matching against an element mid-animation matches the wrong frame. Every one of those moves the pixel coordinates of a button. None of them move its AutomationId or its UIA tree position.
What does Terminator give me to enforce this discipline in code?
Selectors with explicit prefixes. role:Button binds to UIA ControlType. id:save_btn binds to AutomationId. nativeid: binds to the OS-specific identifier (AutomationId on Windows, AXIdentifier on macOS). classname: binds to ClassName. Combinators &&, ||, !, and >> let you compose those. The pos:x,y selector exists but is documented as 'last resort'. There is no name-localized: selector, by design. If you want to type-check at the locator layer, write your locators as constants and reuse them, and lint for any selector that begins with name: or text: because those are the two strings most likely to be localized. The full grammar is in docs/SELECTORS_CHEATSHEET.md in the Terminator repo.
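The lint suggested above could start as small as this. It is a deliberately naive sketch: `localized_atoms` is a hypothetical helper, it splits on the `&&`, `||`, and `>>` combinators as plain substrings, and it will misjudge a selector whose value itself contains one of those sequences.

```rust
/// Prefixes whose values are strings the user sees, and so are likely to be translated.
const LOCALIZED_PREFIXES: [&str; 2] = ["name:", "text:"];

/// Return every atom in `selector` that starts with a localized prefix.
/// Naive by design: splits on combinators textually and strips a leading `!`.
pub fn localized_atoms(selector: &str) -> Vec<String> {
    selector
        .split("&&")
        .flat_map(|part| part.split("||"))
        .flat_map(|part| part.split(">>"))
        .map(|atom| atom.trim().trim_start_matches('!').trim())
        .filter(|atom| LOCALIZED_PREFIXES.iter().any(|p| atom.starts_with(*p)))
        .map(String::from)
        .collect()
}
```

Run over a file of locator constants, a non-empty result for any selector is the signal to go looking for an AutomationId instead.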