The everyday computer-use modal failure mode.
A save prompt slides up. A consent sheet appears. A cookie banner covers the field. Your computer-use agent keeps clicking. The OS already labelled that overlay as modal. The agent never read the label.
This is the single most common way pixel-driven agents fall over on real desktop work. Not jailbreaks. Not hallucinations. Not stuck CAPTCHAs. The boring overlay on Tuesday morning.
Direct answer (verified 2026-05-11)
Vision-driven agents only see pixels. The operating system already names modals via the UIA IsDialog property (Windows 10 build 10.0.17763.0 onwards). A pixel-loop agent can't read that bit, so a save prompt looks like overlapping pixels instead of a modal interrupt. A tree-walking agent like Terminator reads it directly and lets you write role:Dialog as a selector.
Authoritative source: AutomationProperties.IsDialog on Microsoft Learn. Wired into Terminator at crates/terminator/src/platforms/windows/utils.rs line 189.
What this looks like in real footage
Picture Excel. Your agent is told to save the file. It opens the File menu, the model picks coordinates, the harness clicks. A dialog comes up: “Save as .xlsx or keep .csv?” It dims the spreadsheet behind it. The agent takes a fresh screenshot and ships it back to the vision model. The model returns click coordinates. Sometimes the coordinates land on a button on the dialog. Often they don't, because the model hasn't learned that the dim layer means “ignore me”. The click is absorbed by the modal overlay. The harness takes another screenshot. The screen looks the same. The agent reports success. The file was never saved.
This is one trace. Multiply by ten modals per hour for an active desktop session and the failure mode is the rate-limiting step on every long-running agent.
Pixel-loop trace: a modal swallows the click
Every red line is a turn where the agent had no signal that something modal was on screen. The vision model is doing its job; the harness just never told it the foreground became uninteractable.
The bit the OS already publishes
Microsoft added a single boolean to the UI Automation property set in Windows 10 version 1809 (build 10.0.17763.0, October 2018): UIA_IsDialogPropertyId. When true, the element is semantically modal. Screen readers use this bit to change their announcement order: the dialog's title, then the focused control, then the content up to the focused control. XAML's Flyout and ContentDialog default it to true. Win32 message boxes (window class #32770) hit a class-name fallback in every UIA client and are treated as dialogs even when the property is absent.
Terminator's selector engine wires this directly into the grammar. The string "IsDialog" is mapped to UIProperty::IsDialog at this exact line:
pub(crate) fn string_to_ui_property(key: &str) -> Option<UIProperty> {
match key {
// ...
"IsControlElement" => Some(UIProperty::IsControlElement),
"IsRequiredForForm" => Some(UIProperty::IsRequiredForForm),
"IsDialog" => Some(UIProperty::IsDialog), // <- line 189
// ...
}
}

The tree builder treats Window and Dialog as the same container kind at tree_builder.rs:323, so the selector role:Dialog resolves against the IsDialog-marked subtree and the legacy class-name path in one shot. You write the bit; Terminator does the COM call.
Same task, two loops
Below, the same task runs through both loops: save a spreadsheet. The difference between them is one validation step that costs roughly a millisecond and zero model tokens.
Screenshot the desktop. Ship the PNG to the vision model. Receive click coordinates. SendInput at those coordinates. Screenshot again. If a modal appeared, the previous click may have missed; the model is now staring at a dimmed spreadsheet with an overlay and has to decide on its own whether the overlay is the new target. Most modern vision models guess right most of the time, but the failure rate compounds across long sessions: every screenshot costs an Anthropic round-trip plus model inference, and the agent has no primitive that says 'a modal is up, deal with it first'.
- no signal from OS that a modal exists
- every modal check costs a screenshot and a model call
- destructive defaults (Discard, Delete) are guessed at
- silent failure: next screenshot looks plausible, agent reports done
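The silent-failure mechanics in that list can be sketched in a few lines. This is an illustrative model, not any real harness API: the Screen type, clickAt, and pixelLoopSave are hypothetical names standing in for the screenshot-and-click loop.

```typescript
// Hypothetical sketch: why a pixel loop fails silently when a modal
// absorbs the click. None of these names are a real harness API.

interface Screen {
  modalOpen: boolean; // the OS knows this; the pixel loop never asks
  saved: boolean;
}

// A click at model-chosen coordinates: if a modal is up and the
// coordinates miss its buttons, the click is swallowed silently.
function clickAt(screen: Screen, hitModalButton: boolean): void {
  if (screen.modalOpen) {
    if (hitModalButton) {
      screen.modalOpen = false;
      screen.saved = true;
    }
    // else: click absorbed by the overlay; no state change, no error
  }
}

// The pixel loop has no modal primitive: it clicks, screenshots, and
// declares success because the next frame "looks plausible".
function pixelLoopSave(screen: Screen, modelGuessedRight: boolean): string {
  clickAt(screen, modelGuessedRight);
  return "done"; // reported regardless of whether the file was saved
}

const screen: Screen = { modalOpen: true, saved: false };
const report = pixelLoopSave(screen, /* model guessed wrong */ false);
// report is "done" while screen.saved is still false: the silent failure
```

The point of the sketch is the last two lines: nothing in the loop ever surfaces the mismatch between the reported status and the actual state.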
Tree-walking trace: the modal is named, not guessed
The three lines that handle a modal
This is the entire pattern, in TypeScript against the Terminator Node binding. Drop it before any action that touches an app you don't fully own.
import { Desktop } from "terminator.js";
const desktop = new Desktop();
// Before any action: did a modal appear since last frame?
const dialog = await desktop.locator("role:Dialog").validate(1000);
if (dialog.exists) {
// Inside the dialog, find a Button whose name maps to "continue task"
// (your call: Save, OK, Allow, Yes, Continue, Got it).
const ok = await dialog.element!
.locator("role:Button && (name:Save || name:OK || name:Allow)")
.first(2000);
await ok.invoke();
}
// Now do whatever you were trying to do.
await desktop
.locator("process:excel >> role:MenuItem && name:Save")
.first(3000)
  .then((el) => el.invoke());

Three things to note. First, validate() does not throw on absence; it returns { exists: false }. That keeps the happy path branchless. Second, invoke() on a UIA UIInvokePattern runs inside the target process. Your cursor stays where it is. You can keep typing in another window while the agent works. Third, the same selector grammar works for Win32 MessageBox, XAML ContentDialog, WPF dialogs, WinForms, and most Electron sheets because all of them advertise themselves through UIA.
What “everyday” actually covers
When people say computer-use agents work on benchmarks but fall over in production, this is usually what they mean. Benchmarks don't have a Tuesday-morning permission sheet on the second screen. Real desktops do, every hour. Here is the list a long session crosses:
The boring modals an agent meets every hour
- Save changes before closing? (every editor on every OS, every day)
- Windows UAC consent prompt for elevation
- macOS permission sheet: Screen Recording, Accessibility, Files and Folders
- Cookie banner overlay on a Chromium tab (first-party, third-party, both)
- Browser autofill drop-down covering the next form field
- Streaming service Are-you-still-watching prompt
- Slack join-call modal slid in from the right
- Word's document-recovery side panel on first open after crash
- Chrome unsaved-form confirmation on tab close
- JavaScript alert() and confirm() in any web app
None of these are interesting. All of them break a pixel-loop agent. All of them resolve through role:Dialog (Windows UIA), AXSheet / AXAlert / kAXDialogSubrole (macOS AX), or ROLE_DIALOG (AT-SPI on Linux). Same selector idea, three different OS backends, one harness primitive that the vision-loop crowd has chosen not to ship.
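The normalization across backends can be sketched as a lookup: one logical role, three sets of platform names. The DIALOG_ROLES map and expandDialogSelector below are illustrative names for the idea, not Terminator's actual implementation.

```typescript
// Sketch: normalizing one logical "Dialog" role across three OS
// accessibility backends. Illustrative, not Terminator's real code.

type Platform = "windows" | "macos" | "linux";

const DIALOG_ROLES: Record<Platform, string[]> = {
  windows: ["Dialog"],                                  // UIA ControlType + IsDialog property
  macos:   ["AXSheet", "AXAlert", "kAXDialogSubrole"],  // AX roles and subrole
  linux:   ["ROLE_DIALOG"],                             // AT-SPI role (plus modal state)
};

// A cross-platform selector like role:Dialog expands to the backend names
// before the per-platform tree walk runs.
function expandDialogSelector(platform: Platform): string[] {
  return DIALOG_ROLES[platform];
}
```

The agent-facing selector stays identical on every OS; only the expansion table changes underneath.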
Why pixel-loop agents keep skipping the bit
The argument for vision-only is real: pixels work on everything, including apps that don't expose accessibility data, including games, including the sandbox of a remote VM. The argument against is that on the 95% of everyday desktop apps that do publish a UIA / AX tree, the harness is leaving a free signal on the table. The OS already paid the cost of marking modals; the agent is paying it again at the model layer, in tokens and screenshots.
Anthropic's own docs admit it indirectly: the fix for the everyday modal failure is to nudge the model with text (“ask Claude to press Enter or click the primary button”). That works until the primary button is destructive, the modal is stacked, or the prompt is on a monitor the agent isn't currently looking at. None of those edges exist for a tree-walking agent because role:Dialog traverses every desktop, every monitor, every top-level window, and returns the modals by structural role, not by which pixels are currently on screen.
Terminator's bet is that the right shape is to expose the OS primitives directly through an MCP server and let the model spend its inference budget on planning. The model picks what to do; the framework resolves the selector; the action runs as a COM call inside the target process. The everyday modal becomes a four-call MCP loop instead of a three-screenshot Anthropic round-trip.
Ship an agent that doesn't lose to the everyday modal.
If you're building a computer-use agent that has to survive real desktop sessions, the modal handling is the part nobody benchmarks. Bring a target app, we'll walk through how Terminator's role:Dialog primitive plugs into your loop.
Frequently asked questions
What does "everyday computer-use modal failure mode" actually mean?
A computer-use agent is mid-task. Something modal pops up: a save prompt, a permissions sheet, a cookie banner, an unsaved-changes confirmation, a Windows UAC prompt, a browser autofill suggestion, a JavaScript alert(). Anything that interrupts the foreground task. A vision-driven agent looking at a screenshot sees overlapping pixels. It does not see a semantic interrupt. It may keep clicking against the dimmed UI underneath, may guess at the primary button, may report the task as done because the next screenshot looks plausible. The agent is technically following its plan; the OS has already changed the rules and the agent never heard the change.
Doesn't a smart enough vision model just learn to recognize modals?
Sometimes. Anthropic's own computer-use docs note that 'error rates are higher with dynamically changing interfaces, pop-up dialogs, and complex multi-step authentication processes' and recommend you ask the model to 'press Enter or click the primary button' as a workaround. That is a prompt-engineering bandage, not a primitive. It does not survive (a) modals whose primary button is destructive (Discard), (b) modals where the primary button is below the fold on a narrow viewport, (c) modals stacked on top of other modals, or (d) modals on second monitors the agent isn't currently screenshotting. The OS has a one-bit answer to all of these; the agent is choosing not to read it.
What is the IsDialog property, exactly?
UI Automation exposes a per-element boolean property called IsDialog. When true, the element is semantically a modal dialog; assistive technologies use it to change how they announce the element. Microsoft introduced it in Windows 10 build 10.0.17763.0 (October 2018, the 1809 release). XAML controls Flyout and ContentDialog default to IsDialog=true. Terminator's selector engine maps the string 'IsDialog' onto UIProperty::IsDialog at crates/terminator/src/platforms/windows/utils.rs line 189, which means you can write the bit into a selector and Terminator will resolve it through the same COM call the screen-reader uses.
How does Terminator handle a modal in practice?
Two patterns. Defensive: at the top of every action, run desktop.locator('role:Dialog').validate(1000) and dismiss whatever you find before continuing. Reactive: catch ElementNotFoundError on your real target, then check for a Dialog, dismiss it, retry. Both are three lines of code, both work because role:Dialog resolves via UIA's class-name list plus the IsDialog property. The dismiss is usually a wait_for_element on role:Button && (name:OK || name:Cancel || name:Save || name:Discard) inside the Dialog. The whole loop happens in single-digit milliseconds, with no model call and no screenshot.
Why don't pixel-loop agents just call the same API?
They can. They mostly don't. The Anthropic computer_20251022 tool ships with screenshot + click(x,y) + type and nothing else; the model never receives the accessibility tree as a tool result. Gemini Computer Use is the same shape. OpenAI's CUA is the same shape. The whole bet is that the model is the ontology, so the harness stays minimal. Terminator's bet is the opposite: the OS already publishes the ontology (IsDialog, ControlType, AutomationId, Name), so the harness should expose it and let the model spend tokens on planning, not on pixel-reading. The MCP server exposes 35 selector-based tools for this reason.
Does this only apply to Windows?
The IsDialog property is Windows UIA specifically. macOS exposes the same semantics through a different shape: AXSheet for app-modal sheets, AXWindow with AXSubrole=kAXDialogSubrole for free-floating dialogs, AXAlert for alert panels. AT-SPI on Linux uses the role ROLE_DIALOG plus the modal state. The selector idea is the same in every case: you ask the OS what is on screen by structural role, not by pixel pattern. Terminator's selector grammar normalizes role:Dialog across platforms; the Windows backend resolves it via IsDialog and class-name fallback, the macOS backend resolves via AXSubrole and AXRole.
Which everyday modals does this catch?
The boring ones, which is the point. Save changes before closing. Windows UAC consent. macOS permission prompts (Files and Folders, Screen Recording, Accessibility). Cookie banner overlays in Chromium. Browser autofill drop-downs. The Are-you-still-watching prompt on streaming sites. The unsaved-form confirmation Chrome shows on tab close. The Word document-recovery panel. Slack's join-call notification. Almost every productivity app has 3 to 6 of these. A real desktop session probably crosses 10 modals per hour for an active user. A computer-use agent driving the same session crosses the same 10 modals, and every one is a chance to fail silently.
What's the proof Terminator's selector actually resolves IsDialog and not just classnames?
Two files. crates/terminator/src/platforms/windows/utils.rs line 189 maps the string 'IsDialog' to UIProperty::IsDialog, which is the Windows UIA enum entry that asks for UIA_IsDialogPropertyId. crates/terminator/src/platforms/windows/tree_builder.rs line 323 branches on 'Window' or 'Dialog' for container elements when loading smart attributes. The role:Dialog selector resolves through these two paths together: the tree builder identifies dialog containers, the property lookup confirms IsDialog=true. The Microsoft docs page for AutomationProperties.IsDialog confirms the property was introduced in Windows 10 version 1809 (build 10.0.17763.0) and that Flyout and ContentDialog default to true.
What if the modal is a legacy Win32 message box, not a XAML ContentDialog?
Same outcome, different path. The legacy MessageBox API produces a window with class name #32770, which UIA already knows to label as a dialog before IsDialog existed as a property. Terminator's tree builder treats Window and Dialog the same way at tree_builder.rs line 323. The role:Dialog selector matches both. The IsDialog property is the modern path; the class-name list is the fallback. Both produce a true value for role:Dialog at the selector layer, so your dismissal logic doesn't need to know which path matched.
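The two-path resolution described above can be sketched as a single predicate: modern IsDialog property first, legacy class-name fallback second. The UiaElement shape here is an illustrative stand-in for a real UIA element, not the actual COM interface.

```typescript
// Sketch of the two-path role:Dialog match. Illustrative types only.

interface UiaElement {
  isDialog?: boolean;  // UIA_IsDialogPropertyId; absent on legacy windows
  className: string;   // Win32 window class, e.g. "#32770" for MessageBox
}

const LEGACY_DIALOG_CLASSES = new Set(["#32770"]);

// role:Dialog matches if either path says "dialog".
function matchesRoleDialog(el: UiaElement): boolean {
  if (el.isDialog === true) return true;           // modern path (1809+)
  return LEGACY_DIALOG_CLASSES.has(el.className);  // class-name fallback
}

// A XAML ContentDialog and a Win32 MessageBox both resolve:
matchesRoleDialog({ isDialog: true, className: "Popup" }); // true
matchesRoleDialog({ className: "#32770" });                // true
```

Dismissal logic written against matchesRoleDialog never needs to know which branch fired, which is the whole point of the fallback.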
Where do I start if I want to wire this into my own agent?
Install Terminator's MCP server: claude mcp add terminator 'npx -y terminator-mcp-agent@latest'. Then teach your agent one defensive step: before every action, call validate_element with role:Dialog and a one-second timeout. If it resolves, get the dialog's children, find the Button you want, click_element on it, then retry the original action. The whole loop is four MCP calls and zero screenshots. The Terminator repo at github.com/mediar-ai/terminator ships examples that do this for the most common modal classes.
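The four-call loop can be sketched as plain data. validate_element and click_element are named above; the argument shapes, and the reuse of validate_element to locate the button, are assumptions for illustration rather than the real MCP schema.

```typescript
// Hypothetical serialization of the four-call defensive loop.
// Tool argument shapes are assumptions, not the real MCP schema.

type McpCall = { tool: string; args: Record<string, string | number> };

function defensiveModalLoop(originalSelector: string, buttonName: string): McpCall[] {
  const dismiss = `role:Dialog >> role:Button && name:${buttonName}`;
  return [
    // 1. Is a modal up? Short timeout, no screenshot.
    { tool: "validate_element", args: { selector: "role:Dialog", timeout_ms: 1000 } },
    // 2. Locate the dismissal button inside it (assumed call shape).
    { tool: "validate_element", args: { selector: dismiss, timeout_ms: 2000 } },
    // 3. Click it.
    { tool: "click_element", args: { selector: dismiss } },
    // 4. Retry the original action.
    { tool: "click_element", args: { selector: originalSelector } },
  ];
}

const calls = defensiveModalLoop("role:MenuItem && name:Save", "OK");
// four calls, zero screenshots
```

Compare the token cost: four selector strings against three full-desktop PNGs per modal.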
Adjacent reading
Claude computer use: the pixel-coordinate loop and the selector alternative
Anthropic's native computer-use tool sends a screenshot per click. Terminator's MCP lets Claude click by role and name resolved against the UIA tree.
Accessibility tree vs PyAutoGUI: two clicks, two operations, two failure modes
Pattern invoke() runs inside the target process. SendInput synthesizes HID events. The difference shows up the first time a modal appears.
Accessibility-tree desktop agents: closing the browser-to-native gap
Playwright reads the DOM. Terminator reads the OS UIA / AX tree, including modals the DOM never knew about.
Open-source computer-use agent SDK: where the tree fits
An SDK that lets your agent ask the OS for the structure on screen, not the pixels.