Self-driving agents and the legacy desktop ceiling: what actually breaks, and what to do about it
Every autonomous AI agent demo runs on a clean Chrome window or a freshly installed VS Code. The wall those agents hit in real customer environments is the population of legacy desktop apps that still drives day-to-day work: a 1998 MFC ledger, a Delphi-built shop-floor terminal, a WinForms claims system that has not been recompiled in twelve years. The agent does not stall there because it lacks reasoning; it stalls because the accessibility tree it relies on returns nothing useful for those windows, and it has no fall-through to a second grounding source.
Direct answer (verified 2026-05-06)
Self-driving agents break on legacy desktop apps the moment the accessibility tree returns a single Pane with no name, or a bare LegacyIAccessibleRole of ROLE_SYSTEM_CLIENT (0xA, the generic client-area role) with no default action. The agent has no element to ground its next click on, the loop stalls or hallucinates a coordinate, and reliability drops to that of a screenshot-only agent. The fix is structural: pair the UIA tree with the MSAA LegacyIAccessible bridge plus a fall-through to OCR, an icon detector, and vision, all behind one tool signature so the model never plans its own grounding source.
The ceiling, in one sentence
A self-driving agent is only as autonomous as its weakest grounding source on the surface in front of it. On a modern UIA-clean app the planner can be a 7B model and still hit 90%+ task completion because the click target is unambiguous. On a 1998 MFC window the planner can be a frontier model and you still get 30% completion if the runtime cannot read the controls. The autonomy level is not a property of the model. It is a property of the runtime that translates the model's intent into a click.
What “legacy” means here, structurally
For the purposes of a self-driving agent, an app is “legacy” when its accessibility metadata sits on the wrong side of three Microsoft transitions. First, it predates UIA (introduced with Windows Vista, 2006), so the tree is populated by the MSAA-to-UIA bridge rather than by native UIA providers. Second, it was built before the UIA pattern model became normative (InvokePattern, ValuePattern, ExpandCollapsePattern), so its controls expose only the older IAccessible default-action verb. Third, in many cases the developer never implemented IAccessible at all, so what UIA can extract is whatever the default Win32 IAccessible proxy can scrape from the HWND tree. The end result is a UIA tree that is shallow, mostly unnamed, and whose only non-empty fields are in the LegacyIAccessible* namespace.
The eight properties that keep an autonomous agent alive on those windows
Every UIA wrapper that wants to recover something useful from a legacy control reads from the MSAA bridge. In Terminator, the mapping is in crates/terminator/src/platforms/windows/utils.rs at lines 198-205:
```rust
// Text properties
"LegacyIAccessibleValue" => Some(UIProperty::LegacyIAccessibleValue),
"LegacyIAccessibleDescription" => Some(UIProperty::LegacyIAccessibleDescription),
"LegacyIAccessibleRole" => Some(UIProperty::LegacyIAccessibleRole),
"LegacyIAccessibleState" => Some(UIProperty::LegacyIAccessibleState),
"LegacyIAccessibleHelp" => Some(UIProperty::LegacyIAccessibleHelp),
"LegacyIAccessibleKeyboardShortcut" => Some(UIProperty::LegacyIAccessibleKeyboardShortcut),
"LegacyIAccessibleName" => Some(UIProperty::LegacyIAccessibleName),
"LegacyIAccessibleDefaultAction" => Some(UIProperty::LegacyIAccessibleDefaultAction),
```
Those eight properties are the difference between an agent that can save a record in a 25-year-old MFC app and an agent that emits coordinates derived from a screenshot. The LegacyIAccessibleName field carries the button label that UIA's Name often does not. The LegacyIAccessibleDefaultAction field carries the verb (“Press”, “Open”, “Toggle”) that lets the runtime invoke the control without a synthesized mouse event. The LegacyIAccessibleRole field disambiguates a button from a menu item from an edit when the modern ControlType comes back as Pane by default.
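As a sketch of how a runtime might consume those properties, the fallback order fits in a few lines. Everything below is illustrative: UIProperty mirrors the names in the mapping above, but Element and get_property are hypothetical stand-ins, not Terminator's actual API.

```rust
// Illustrative sketch, not Terminator's real API: the point is the fallback
// order, modern UIA property first, MSAA bridge second.
#[derive(Clone, Copy)]
enum UIProperty {
    Name,                           // modern UIA Name
    LegacyIAccessibleName,          // MSAA bridge label
    LegacyIAccessibleDefaultAction, // MSAA verb: "Press", "Open", "Toggle"
}

// Hypothetical element abstraction; a real wrapper would sit on UIAutomationCore.
trait Element {
    fn get_property(&self, prop: UIProperty) -> Option<String>;
}

// Resolve a label for matching a name: clause against a legacy control.
fn resolve_name(el: &impl Element) -> Option<String> {
    el.get_property(UIProperty::Name)
        .filter(|n| !n.is_empty()) // legacy windows often report Name as ""
        .or_else(|| el.get_property(UIProperty::LegacyIAccessibleName))
}

// The same shape applies to invocation: prefer InvokePattern, and fall back to
// the LegacyIAccessibleDefaultAction verb before synthesizing a mouse event.
```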
The two loops, side by side
Self-driving agent loop with one grounding source (UIA tree only, or screenshots only). Step 1: the model proposes "click Save in customer ledger". Step 2: the runtime queries UIA on the legacy MFC window. Step 3: the result is a single Pane element with role:Pane, an empty name, no InvokePattern, and no children of interest. Step 4: the agent has nothing to click. It either stalls, retries with a vision call (expensive, slow, often wrong on busy enterprise UIs), or hallucinates a coordinate. Reliability past a 5 to 10 step horizon collapses to roughly that of a screenshot-only loop.

Self-driving agent loop with fall-through grounding. Steps 1 and 2 are identical. Step 3: the UIA query comes back empty, so the runtime reads the LegacyIAccessible bridge; if that is also empty, it indexes the window with OCR or an icon detector, and only then consults a vision model. Step 4: the runtime returns a resolvable element or an indexed bounding box, the click lands, and the loop moves to the next step. The model never has to know which tier answered; it only sees that the tool succeeded.
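A minimal sketch of that fall-through, under the assumption that each tier exposes a query function (all four stubs below are hypothetical; only the priority order is taken from this article):

```rust
// Opaque stand-in for a real automation element handle.
struct ElementHandle;

// What the runtime hands back to the tool layer: either a structural element
// it can invoke directly, or a bounding-box point it can click.
enum Grounding {
    Element(ElementHandle),
    Point { x: i32, y: i32 },
}

// Runtime-side grounding: try each tier in priority order and stop at the
// first usable answer. The planning model never sees which tier fired.
fn ground(selector: &str) -> Option<Grounding> {
    query_uia(selector)                          // tier 1: modern UIA tree
        .or_else(|| query_msaa_bridge(selector)) // tier 2: LegacyIAccessible*
        .map(Grounding::Element)
        .or_else(|| ocr_or_icon_index(selector)) // tier 3: OCR / icon detector
        .or_else(|| vision_detect(selector))     // tier 4: vision model
    // tier 5, raw coordinates, is the caller's escape hatch, not a query
}

// Hypothetical backends; each would be a real grounding source in practice.
fn query_uia(_s: &str) -> Option<ElementHandle> { None }
fn query_msaa_bridge(_s: &str) -> Option<ElementHandle> { None }
fn ocr_or_icon_index(_s: &str) -> Option<Grounding> { None }
fn vision_detect(_s: &str) -> Option<Grounding> { None }
```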
Why one tool signature, not two
The recurring temptation is to expose the grounding sources as separate tools (click_by_selector, click_by_ocr, click_by_vision, click_by_coords) and let the model pick. That fails for a structural reason. The model has no way of knowing which source will work on the surface it is looking at without trying. So it tries one, observes the failure, tries another, observes that failure, and burns its planning budget on bookkeeping. The grounding source is a runtime concern, not a planning concern. It belongs in one tool that takes a selector or an index, and the runtime decides which source to consult by trying them in priority order.
Terminator's click_element accepts three modes (selector, index, raw coordinates) and one vision_type field that tells the runtime which index source the model is referring to. The five legal index sources are defined as one enum at crates/terminator-mcp-agent/src/utils.rs lines 1062-1073:
```rust
#[derive(Debug, Clone, Copy, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "lowercase")]
pub enum VisionType {
    Ocr,
    Omniparser,
    UiTree,
    Dom,
    Gemini,
}
```

From the model's view there is one tool. From the runtime's view there are five sources of grounding plus a coordinate escape hatch, and the LegacyIAccessible bridge sits inside the UiTree path, so a selector-based call already gets the legacy-aware fall-through for free.
One step on a legacy MFC window, then one step on an owner-drawn grid
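The walkthrough fits in two tool calls. The payloads below are a sketch pieced together from the fields named in this article (selector, vision_type, index, and get_window_tree's include_omniparser flag); treat the exact argument shapes as illustrative and defer to the server's published schema.

```rust
use serde_json::json;

fn main() {
    // Legacy MFC window: one selector-based call. The runtime resolves it via
    // the UIA tree, then the LegacyIAccessible bridge, so a Save button whose
    // UIA Name is empty still matches on LegacyIAccessibleName, and the MSAA
    // default-action verb ("Press") invokes it without a synthesized click.
    let mfc_step = json!({
        "tool": "click_element",
        "arguments": { "selector": "role:Button && name:Save" }
    });

    // Owner-drawn grid: nothing structural exists, so first request an
    // indexed view of the window, then click by index against that view.
    let grid_scan = json!({
        "tool": "get_window_tree",
        "arguments": { "include_omniparser": true }
    });
    let grid_click = json!({
        "tool": "click_element",
        "arguments": { "vision_type": "omniparser", "index": 17 } // index from the scan
    });

    println!("{mfc_step}\n{grid_scan}\n{grid_click}");
}
```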
The ceiling, by tier
Stack the grounding sources from cheapest to most expensive, and the autonomy ceiling on a given UI is whichever tier first returns a usable element. A self-driving agent is not asking which tier is “best” on average; it is asking which one matches the surface in front of it right now.
Grounding tiers, in priority order
1. UIA tree (modern). WPF, UWP, Catalyst, well-behaved WinForms. Role + Name + AutomationId + InvokePattern all present. A selector matches in one query.
2. LegacyIAccessible (MSAA bridge). Old MFC, Delphi, classic WinForms, Win32 LOB apps. The UIA Name is empty; the eight LegacyIAccessible* properties carry the metadata. Mapped at utils.rs:198-205.
3. OCR / Omniparser. Owner-drawn grids, custom controls, anything with rendered text but no IAccessible. The agent indexes visible text or detected icons and clicks by index.
4. Vision model. Canvases, PDFs, swap-chain text, screen-share embeds. The slowest and most expensive tier; consulted only when the three above fail.
5. Raw coordinates. Last resort. The agent emits an (x, y) derived from one of the upper tiers (a vision detection box, an OCR bounding box, a known offset); almost never the planning model's own guess.
The three failure modes that survive even fall-through
Even a well-implemented chain has a bottom. Three categories of UI surface stay hard for any autonomous loop:
- Owner-drawn controls. Custom grids, third-party drawing libraries that paint via raw GDI, in-house chart widgets that never call any IAccessible API. The MSAA bridge returns ROLE_SYSTEM_CLIENT (0xA) with no name and no default action. OCR or vision is the only path, and latency goes up by an order of magnitude on these steps.
- Apps with UIA explicitly disabled. A small population of game-adjacent and industrial apps disables UIA for performance. The tree is empty by design, so vision is the only viable source. Plan workflows around a small number of these steps, never as the steady state.
- Direct2D / swap-chain text. Modern UIs that render text into a swap chain bypass GDI text rendering and are therefore often invisible to UIA. OCR is the recovery path; even a strong vision model handles them comfortably because the rendered text is sharp.
The point of the chain is not that it eliminates these. It is that the agent does not collapse when it hits one. It degrades to a slower tier, the loop continues, and the cost shows up only on the steps where it is unavoidable.
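That degradation decision can be made mechanically. Here is a sketch of the owner-drawn test, using the MSAA constant directly; the heuristic framing is mine, but the failure signature (generic client role, no name, no default action) is the one described above.

```rust
// ROLE_SYSTEM_CLIENT is MSAA's catch-all client-area role (oleacc.h).
const ROLE_SYSTEM_CLIENT: u32 = 0x0A;

// Heuristic sketch: an element that reports the generic client role with no
// label and no default action is almost certainly owner-drawn, so the runtime
// should skip straight to the OCR/vision tiers instead of retrying selectors.
fn looks_owner_drawn(role: u32, name: &str, default_action: &str) -> bool {
    role == ROLE_SYSTEM_CLIENT && name.is_empty() && default_action.is_empty()
}
```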
The minimum change that moves the ceiling
If you are running an autonomous loop on top of Anthropic computer use, OpenAI Operator, or Gemini computer use today, the smallest useful change is to stop letting the model emit raw screen coordinates as its first move. Put a structural-grounding tool in front of the model and let the runtime fall through to vision only when the structural sources fail. Concretely:
- Install an MCP server whose click_element tool reads the UIA tree and the LegacyIAccessible bridge before falling through to OCR and vision. (Terminator is one such server: claude mcp add terminator "npx -y terminator-mcp-agent@latest".)
- Wire the same tool into Cursor, VS Code, or Windsurf via the same MCP config so the agent can drive native windows the same way it drives the editor.
- Reserve the model's coordinate-emitting capability for the canvases and PDFs where structural grounding genuinely cannot help. Treat any coordinate the model emits on a non-canvas surface as a smell, not a feature; one way to enforce that policy is sketched below.
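A sketch of that coordinate policy as a runtime gate. Both flags are hypothetical bookkeeping the runtime would have to maintain; nothing here is Terminator's API.

```rust
/// Illustrative policy gate for model-emitted coordinates.
/// `derived_from_upper_tier` means the (x, y) came from an OCR box, an icon
/// detection, or a vision detection rather than the planning model's own
/// guess; `surface_is_canvas` means the structural tiers all came back empty.
fn allow_raw_coordinates(derived_from_upper_tier: bool, surface_is_canvas: bool) -> bool {
    // Derived coordinates are fine anywhere; the model's own guesses are only
    // tolerated on canvas-like surfaces where structural grounding cannot help.
    derived_from_upper_tier || surface_is_canvas
}
```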
Hitting the legacy ceiling on a real workflow?
If your autonomous loop stalls on a specific Win32 or MFC window, send the screenshot and the UIA dump. We will look at it with you and tell you which tier in the chain is missing.
Frequently asked questions
Where do self-driving agents actually break on legacy desktop apps?
They break at the grounding step, not at the planning step. The model proposes a perfectly reasonable next action ("click the Save button in the customer ledger window") and asks the runtime which element to click. On a modern UWP or WPF app the runtime returns a UI Automation element with role:Button, name:Save, AutomationId set, an InvokePattern attached. On a 1998 MFC line-of-business app the runtime returns a single Pane with no name, no AutomationId, and no InvokePattern. There is nothing to click. The agent has two bad options: emit a coordinate guessed from the screenshot, or stall. Both produce the kind of flake that makes computer-use loops unusable past a 5 to 10 step horizon.
Is this just a Windows problem, or does macOS have the same ceiling?
macOS has its own ceiling but a different shape. AppKit and Mac Catalyst apps publish a reasonably rich AX tree by default; the failures cluster around Electron child windows, custom-rendered controls, and apps that explicitly disable accessibility for performance. Windows is where the legacy population is large enough to be its own category: tens of thousands of internal LOB apps written in MFC, Delphi, PowerBuilder, classic WinForms, and Win32 with hand-rolled WM_PAINT. Microsoft's answer for those is the MSAA-to-UIA bridge, exposed in UIA as the LegacyIAccessible property family. That bridge is what an agent has to read from when the modern UIA properties are empty.
What does the LegacyIAccessible bridge actually give an agent?
Eight properties: LegacyIAccessibleName, LegacyIAccessibleValue, LegacyIAccessibleDescription, LegacyIAccessibleRole, LegacyIAccessibleState, LegacyIAccessibleHelp, LegacyIAccessibleKeyboardShortcut, and LegacyIAccessibleDefaultAction. They are the UIA wrapper around the MSAA IAccessible interface that legacy controls have implemented since the late 1990s. In Terminator's code they are mapped explicitly in crates/terminator/src/platforms/windows/utils.rs at lines 198 to 205, so a selector like role:Button && name:Save will fall back to LegacyIAccessibleName and LegacyIAccessibleDefaultAction when the UIA Name and InvokePattern come back empty. That is enough to recover most legacy buttons, menu items, and edits. It is not enough for owner-drawn lists, custom grids, or anything painted by a third-party drawing library.
Why is one grounding source structurally insufficient for a self-driving agent?
Because the population of UI surfaces an autonomous agent will encounter is non-uniform. A modern Office canvas, a WinForms grid, an Electron Chromium child window, an OS-level toast, a PDF, a screen-share embed: each has a different relationship with the accessibility tree. UIA covers the first set, the LegacyIAccessible bridge covers the second, OCR covers anything with rendered text, an icon-detection model like Omniparser covers icon controls, a vision model covers everything else, and raw screen coordinates are the last resort. An agent that only reads UIA will fail roughly 100% of the time on the canvas surfaces. An agent that only sends screenshots will be expensive, slow, and will fluff the modern surfaces it could have hit deterministically. The only working shape is a fall-through chain, and the chain has to live behind one tool signature so the model does not have to plan which grounding source to use.
What does that fall-through chain look like in code, concretely?
In Terminator's MCP server it is an enum with five variants at crates/terminator-mcp-agent/src/utils.rs lines 1062 to 1073: Ocr, Omniparser, UiTree, Dom, Gemini. The click_element tool accepts a vision_type field that names which source the index came from, and an optional x and y for raw coordinate mode. The agent calls click_element with role:Button name:Save first; if that 404s on this UI, it calls get_window_tree with include_omniparser or include_gemini_vision to get an indexed list of icon-shaped or vision-detected items, and then clicks by index. The selector grammar and the index grammar both go through the same dispatch arm. From the model's view it is one tool. From the runtime's view it is five sources of grounding plus a coordinate escape hatch.
How is this different from "computer use" agents that already use vision?
Computer use models are vision-first. The model emits an (x, y) per click and the runtime is responsible for clicking those exact coordinates. That works, but it is the most expensive and most flake-prone of the available grounding sources. The relevant change is to invert the default: structural grounding (UIA, LegacyIAccessible, DOM) is the first try, and vision is the fallback when the structural sources cannot resolve the element. Terminator has a Gemini computer-use arm in the same dispatch (server.rs has it as one match arm next to click_element); it is there for canvases and PDFs, not as the steady state. That inversion is what makes a 50-step workflow feasible without a full-screen screenshot per step.
What is the practical ceiling, then? Where do even fall-through agents fail?
Three places. First, owner-drawn controls that paint themselves with raw GDI and never call any IAccessible API. The LegacyIAccessible role comes back as ROLE_SYSTEM_CLIENT (0xA) with no name and no default action. Vision and OCR are the only path. Second, applications that explicitly disable UIA for performance reasons (some game-adjacent industrial apps do this), where you get nothing structural at all. Third, dialogs that render text via Direct2D into a swap chain, where OCR is needed because the text is not in the tree. In all three the agent still works, just at the lowest tier of the chain (OCR or vision), with the corresponding latency and cost. The point of the chain is that the agent does not collapse when it hits one of these; it degrades.
If I'm running an autonomous loop today, what's the minimal change I should make?
Stop letting the model plan its own grounding. Give it one tool that takes a selector or an index and let the runtime decide which source to consult. If your stack is OpenAI Operator or Anthropic computer use, that means putting an MCP server in front of them whose click and type tools accept role:/name: selectors and fall through to vision only when the selector misses. Terminator is one such server (one MCP install line: claude mcp add terminator "npx -y terminator-mcp-agent@latest"). The same shape can be built on top of pywinauto, FlaUI, or raw UIAutomationCore.dll, but the LegacyIAccessible bridge has to be wired in for any of it to help on legacy LOB apps.
Is this a problem that goes away as legacy apps get rewritten?
On a long enough timeline, yes. In practice the LOB application replacement cycle measured in actual customer environments is closer to 15 years than 5, and the tail of unmaintained MFC and Delphi apps inside large organizations is enormous. Any autonomous agent claiming end-to-end automation of office work has to handle that tail or it is only automating the modern half. The accessibility-tree-plus-fall-through pattern is the only one that scales across both halves without a separate code path per app.
Keep reading
Accessibility API for computer use agents: the seven-mode click_element router
The full breakdown of all seven grounding modes the click_element MCP tool dispatches across, with the file references in utils.rs.
Accessibility tree vs PyAutoGUI for desktop automation
Why pixel matching is the wrong default for an autonomous loop, and what the structural alternative looks like end to end.
Browser agents leaving the DOM
The other half of the autonomy ceiling: browser agents stall the moment a workflow leaves the page. Same selector grammar, both sides of the boundary.