Tool surface, not just selectors

Browser agents leaving the DOM, and the tool surface that does not break when they do

Every browser agent has the same failure mode. The workflow starts in the page, hits a download, a save dialog, an “Open With” handoff, or a desktop authenticator, and the tool surface ends. The fix is not to keep the agent in the DOM. The fix is one selector grammar that spans the page and the OS at the same time.

Matthew Diakonov, Written with AI

Published May 1, 2026 · 9 min read

Direct answer (verified May 2026)

One tool surface with two adapters.

A Manifest V3 Chrome extension that runs JavaScript in the active tab, plus the OS accessibility tree (UIAutomation on Windows) for everything outside the tab. Both adapters share one selector grammar (role:Button name:Save) so the agent does not change tools when the workflow crosses out of the page. In Terminator that is the terminator-mcp-agent npm package, with the bridge listening on ws://127.0.0.1:17373.

The failure trace, in five steps

A real workflow looks like this. The user has asked the agent to export a report from a web app, save it to disk, open it in Acrobat, and fill in two fields. The agent gets through step one and dies in step two. Almost every browser agent fails on the same boundary, in the same order.

Step 1, in the DOM, the easy part

The agent calls click_element with selector role:Button name:Export. The MCP server detects the target PID is a browser process (chrome, msedge, edge, firefox, brave, or opera based on browser_script.rs:20-32), routes the click through the UIA tree of the browser window, and the page handler fires.

Step 2, the page handler triggers a download, the DOM ends

Chrome's download flow takes over. The page does not get a new event. document.activeElement is still the Export button. A DOM-only agent would now look at a snapshot identical to its previous one and either retry or give up. The Save dialog is a separate top-level window owned by chrome.exe.

Step 3, the agent reads the new UIA tree

The agent calls get_window_tree on window:Save As. UIA returns the dialog tree: role:Edit name:File name:, role:ComboBox name:Save as type:, role:Button name:Save, role:Button name:Cancel. The tool is the same one the agent used to read the page tree two steps ago. The selector format is the same. There is no separate library, no separate runtime.

Step 4, type and click, the dialog goes away

type_into_element with role:Edit name:File name: and the path string. click_element with role:Button name:Save. The dialog closes. The download lands at the path the agent wrote. The browser is back in focus.

Step 5, the next app, still the same tools

open_application name:Acrobat. UIA wakes the new process. The agent walks Acrobat's tree exactly the way it walked the page and the dialog. If the agent needs the page DOM again at any point (to extract a JSON value the page has but the UIA tree does not), execute_browser_script is one tool call away.
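The five steps above can be written down as a sequence of MCP tool calls. The sketch below is illustrative only: the tool names and selectors come from the walkthrough, the JSON-RPC envelope follows MCP's tools/call method, and the argument keys ("selector", "text", "name") and the save path are assumptions made for the example, not Terminator's documented schema.

```python
import json


def tool_call(request_id, name, arguments):
    """Build one MCP `tools/call` request in its JSON-RPC 2.0 envelope."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical destination path, used only for this example.
SAVE_PATH = r"C:\Users\me\Documents\report.pdf"

# Tool names and selectors are taken from the walkthrough.
# Argument keys ("selector", "text", "name") are assumed for illustration.
steps = [
    ("click_element", {"selector": "role:Button name:Export"}),        # step 1: page
    ("get_window_tree", {"selector": "window:Save As"}),               # step 3: dialog tree
    ("type_into_element", {"selector": "role:Edit name:File name:",    # step 4: fill path
                           "text": SAVE_PATH}),
    ("click_element", {"selector": "role:Button name:Save"}),          # step 4: confirm
    ("open_application", {"name": "Acrobat"}),                         # step 5: next app
]

requests = [tool_call(i, name, args) for i, (name, args) in enumerate(steps, start=1)]

for req in requests:
    print(json.dumps(req))
```

Note that click_element appears on both sides of the boundary: once against the page and once against the native dialog. Only the selector string changes.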

What changes when the agent has both adapters live

The same five-step task looks different from two perspectives. The DOM-only agent sees only what the page tells it. The dual-adapter agent sees the page plus the OS, and treats both as one tree of role and name pairs. The DOM-only view goes like this.

The agent clicks the Export button, gets a download URL, and loses its footing. The Save dialog blocks the page, document.querySelectorAll returns the same tree as before, and no new event fires in the page context. The agent retries the Export click, gets a second dialog, and stalls. The CDP downloads listener eventually reports the file path, but the next step (open it in Acrobat, fill the form, save it back) is a tree the browser cannot see.

Save dialog is invisible from the page context
CDP downloads listener returns a path, not a UI
Next app (Acrobat, Excel, native auth) is fully outside scope
Recovery is vision fallback or human handoff

Where the page tells the agent it is on its own

These are the boundaries every browser agent crosses, sorted roughly by how often they break a real workflow. None of them are bugs in the agent. They are the design of the browser sandbox.

Download save dialog

Triggered by every link with content-disposition: attachment, plus every download() call. The page never sees the dialog. CDP returns the eventual file path; the dialog UI is unreachable from the agent.

Open With... handler

The OS picks the application that handles a custom protocol or a downloaded mime type. The browser hands the URL to chrome.exe's IPC and is done. Whatever the user clicks in the system dialog never returns to the page.

Print dialog

window.print() opens a Chrome-rendered preview that exposes only a thin DOM the page does not own. The Print and Cancel buttons live in the browser chrome's UIA tree, not the page DOM.

Permission and credential prompts

Geolocation, clipboard, USB, and password autofill prompts render as native chrome elements. document.activeElement still points at whatever the page focused last; the prompt is a sibling tree.

OAuth handoff to a desktop app

When an OAuth flow finishes by deep-linking into a native authenticator (Microsoft Authenticator, Okta Verify, 1Password), control leaves the browser entirely. The page sees only the redirect URL.

Downloaded file -> editor

The most common case for office workflows. The browser produced an Excel file. Now the agent has to open it, type values, and re-upload. The DOM is irrelevant for the middle of that.

The router, in one diagram

The agent talks to one MCP server. The MCP server picks an adapter per tool call. Page-shaped calls go through the Manifest V3 extension over a localhost WebSocket. OS-shaped calls go through UIAutomation. Every native dialog, every native app, every browser chrome element is reachable through the same tool surface.
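A minimal sketch of that per-call routing, under stated assumptions: the PAGE_TOOLS set, the Route type, and the route function are hypothetical names for illustration. The article only states that execute_browser_script travels through the extension and that clicks and typing go through UIAutomation; the real dispatch logic may differ.

```python
from dataclasses import dataclass

# Assumption: only execute_browser_script is treated as "page-shaped" here.
# The real server may route more tools through the extension.
PAGE_TOOLS = {"execute_browser_script"}

EXTENSION_ENDPOINT = "ws://127.0.0.1:17373"  # bridge address given in the article


@dataclass(frozen=True)
class Route:
    adapter: str    # "extension" or "uia"
    transport: str  # how the call reaches the adapter


def route(tool_name: str) -> Route:
    """Pick an adapter for one tool call."""
    if tool_name in PAGE_TOOLS:
        # Page-shaped: run JavaScript in the active tab via the MV3 extension.
        return Route("extension", EXTENSION_ENDPOINT)
    # OS-shaped: walk the accessibility tree of whatever window owns the target.
    return Route("uia", "UIAutomation")


for tool in ("click_element", "execute_browser_script", "type_into_element"):
    print(tool, "->", route(tool).adapter)
```

The agent never sees this choice. It sends a tool name and a selector, and the server decides which adapter answers.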

Diagram: One MCP, two adapters

A real call sequence

The first three messages live inside the browser chrome but outside the DOM. The middle two are the file save dialog. The last three are the page DOM via the extension. From the agent’s prompt the only differences are the selector strings.

Sequence diagram: Agent -> MCP -> adapter

The coordinate trick that makes it work

A subtle thing makes “same tool surface” actually true at runtime. DOM coordinates are CSS pixels relative to the viewport. OS coordinates are physical pixels relative to the screen. On a 4K monitor with a 1.5x DPI scale, those numbers do not match. getBoundingClientRect() says one thing. UIA says another. A click that mixes the two will land in the wrong place.

Terminator reconciles them in capture_browser_dom_elements at crates/terminator-mcp-agent/src/server.rs:838. First, it looks up the UIA element with role:Document for the focused tab and reads its screen-bounds (x, y) as the viewport offset. Second, it scales every DOM rect by window.devicePixelRatio on the way out of the JS context (server.rs:911-914). The result is that the DOM and the UIA tree report coordinates in the same space. A click computed from a DOM rect lands where a UIA click on the same element would land, and the agent can mix the two tools in one workflow without coordinate drift.
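The arithmetic is small enough to show directly. This is a sketch of the two-step reconciliation as described above, not the code in server.rs: the Rect type, the function name, and the order of operations (scale first, then add the Document offset) are assumptions drawn from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    width: float
    height: float


def dom_rect_to_screen(dom_rect, document_origin, device_pixel_ratio):
    """Convert a CSS-pixel, viewport-relative rect to physical screen pixels.

    document_origin is the (x, y) screen position of the UIA role:Document
    element for the focused tab, already in physical pixels.
    """
    origin_x, origin_y = document_origin
    return Rect(
        x=origin_x + dom_rect.x * device_pixel_ratio,
        y=origin_y + dom_rect.y * device_pixel_ratio,
        width=dom_rect.width * device_pixel_ratio,
        height=dom_rect.height * device_pixel_ratio,
    )


# A button that getBoundingClientRect() reports at (100, 200), 80x30 CSS pixels,
# in a tab whose Document element starts 120 physical pixels below the screen top,
# on a display with a 1.5x scale factor.
button = Rect(x=100, y=200, width=80, height=30)
screen = dom_rect_to_screen(button, document_origin=(0, 120), device_pixel_ratio=1.5)

print(screen)  # Rect(x=150.0, y=420.0, width=120.0, height=45.0)
center = (screen.x + screen.width / 2, screen.y + screen.height / 2)
print(center)  # (210.0, 442.5)
```

Without the scale factor the click would target (100, 320) instead of the button's real position, which is the drift the paragraph above describes.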

The exact set of boundaries the bridge covers

The list below is the operational answer to “what counts as leaving the DOM.” Every item is a place where DOM-only tools return nothing useful and UIA-based tools return a structured tree.

Outside the DOM, still inside the tool surface

Chrome download flow with "Ask where to save" enabled
Per-file Save dialog after a content-disposition response
OS-level Open With handler dialog
window.print() preview and the Print / Cancel buttons
Geolocation, clipboard, USB, and password permission prompts
OAuth deep-link into a native authenticator app
Downloaded Excel / PDF opened in Office or Acrobat
Files dragged from the page into a desktop app
Page reads from a desktop clipboard manager

Versus a CDP-only setup

The closest thing to this in the browser-agent world is a Playwright or CDP-driven loop with vision fallback. Here is what the two approaches do and do not cover, on the same tasks.

Feature	DOM-only browser agent	Terminator
DOM access (querySelector, eval JS)	Yes, primary surface (Playwright eval, CDP Runtime.evaluate, browser-use)	Yes, via execute_browser_script through Manifest V3 extension on ws://127.0.0.1:17373
OS file save / open dialog	No. CDP downloads listener returns a path, the dialog UI is unreachable	Yes. The dialog is a UIA tree (role:Edit, role:Button) the same tool grammar walks
Native app launched from a download	No. Out of scope; the browser process does not see it	Yes. open_application + UIA tree of the new process
OAuth that hands off to a desktop authenticator	Partial. Page sees the redirect URL; the click in the desktop auth app needs a separate framework	Yes. Same MCP tools click the auth app's UIA tree, then return to the browser
Coordinate system reconciliation	DOM-only. CSS pixels relative to viewport; physical-pixel calls are out of scope	DOM coords scaled by window.devicePixelRatio plus UIA Document offset (server.rs:911-914)
Recovery when DOM goes silent	Vision fallback (LLM looks at a screenshot and guesses pixels)	Tree fallback. UIA tree is structured, role-typed, and always present for the focused window
Same selector grammar across surfaces	No. Page selectors and OS selectors are separate worlds	Yes. role:Button name:Save works in the page chrome, in the file dialog, and in Acrobat

Setup, in one line

For Claude Code, the MCP install is a single command. For Cursor, VS Code, and Windsurf, add the same command to the mcpServers block of the MCP config. The Chrome extension is bundled inside the npm package; the agent prints the load-unpacked path the first time you call execute_browser_script.

# Claude Code
claude mcp add terminator "npx -y terminator-mcp-agent@latest"

# Cursor / VS Code / Windsurf MCP config
{
  "mcpServers": {
    "terminator-mcp-agent": {
      "command": "npx",
      "args": ["-y", "terminator-mcp-agent@latest"]
    }
  }
}
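If the MCP config already lists other servers, the entry has to be merged rather than pasted over them. A small sketch of that merge, assuming the config has been loaded into a dict; the helper name and the "other-server" entry are hypothetical.

```python
import json

TERMINATOR_ENTRY = {
    "command": "npx",
    "args": ["-y", "terminator-mcp-agent@latest"],
}


def add_terminator_server(config):
    """Return a copy of an MCP config with the Terminator server added.

    Existing entries under "mcpServers" are preserved; the input is not mutated.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["terminator-mcp-agent"] = dict(TERMINATOR_ENTRY)
    merged["mcpServers"] = servers
    return merged


# Hypothetical pre-existing config with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_terminator_server(existing), indent=2))
```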

Currently Windows-only on the binary side. The macOS Rust adapter was deleted on 2025-12-16 (commit 0c11011c) to focus on the Windows UIAutomation path where the team has the most depth. If you are building this on macOS today, Hammerspoon’s axuielement Lua module is the closest equivalent.

The honest version of the tradeoff

Two adapters mean two failure modes. The Manifest V3 extension can fall asleep (MV3 terminates a service worker after 30 seconds of inactivity), and the bridge runs a 30-second reconnect loop in 500 ms ticks before it gives up. UIA on Windows can return a stale tree if a window has just appeared and has not finished its first paint; the workaround there is the wait_for_element tool with a small timeout. Neither failure mode is silent: both surface as MCP tool errors with the cause in the message.
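The reconnect behaviour is a bounded polling loop. The sketch below shows the shape of such a loop with the 30-second budget and 500 ms tick from the paragraph above; the function name and the injectable sleep and clock parameters are assumptions for illustration, and the actual loop in extension_bridge.rs is written in Rust and may be structured differently.

```python
import time


def wait_for_bridge(is_connected, timeout_s=30.0, tick_s=0.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll is_connected() every tick_s seconds until it is true or time runs out.

    Returns True on connection, False once timeout_s has elapsed.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_connected():
            return True
        if clock() >= deadline:
            return False  # caller surfaces this as a tool error
        sleep(tick_s)


# Demo: the extension "wakes up" on the third poll. sleep is a no-op here
# so the example returns immediately.
attempts = iter([False, False, True])
print(wait_for_bridge(lambda: next(attempts), sleep=lambda s: None))  # True
```

With the default budget the loop checks the connection 61 times (once at the start, then once per tick for 60 ticks) before reporting failure.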

The other tradeoff is platform reach. Browser-only agents run on every desktop OS. UIA-based tools run on Windows. If your agent has to ship cross-OS today, you give up the post-DOM half of this story on macOS and Linux until the platform adapters return. That is a real cost, and it is the reason the broader page on macOS accessibility automation says what it says.

Building an agent that needs to leave the DOM?

Walk through your workflow with us. We will tell you which steps the bridge handles today and which ones still need work.

Frequently asked questions

What does "leaving the DOM" actually mean for a browser agent?

Any moment where the next thing the agent has to interact with is not a DOM node. The most common ones are the OS file save dialog after a download, the "Open With..." handler that picks a native app, the system print dialog, the credentials autofill dialog, the OS notification that says "app X wants to access Y", a 2FA code that lands in a desktop authenticator app, or the moment a downloaded file has to be opened in Excel or a PDF viewer to finish the task. The DOM ends, document.querySelector returns nothing useful, and the Playwright or CDP-style toolset goes silent. The agent did not lose the task. It lost the tool surface that the task requires for the next click.

Why can't a browser agent just call showSaveFilePicker and stay in the DOM?

Two reasons. First, showSaveFilePicker is gated behind a user-activation requirement: it only works in response to a real click or keypress, and it returns a different (security-restricted) UI in headless or scripted browsers. Second, downloads triggered through a normal anchor tag, a content-disposition response, or a third-party site do not go through the File System Access API at all. They trigger Chrome's built-in download flow, which in turn surfaces the OS save dialog or the per-file prompt depending on the user's setting. Neither of those is reachable through the page's JavaScript execution context. The DOM does not own them.

How does Terminator keep the same tool surface working on both sides of the boundary?

It runs two adapters under the same MCP tool grammar. Inside the page, a Manifest V3 Chrome extension named "Terminator Bridge" (manifest at crates/terminator/browser-extension/manifest.json, version 0.24.32) holds a WebSocket connection to the MCP server on ws://127.0.0.1:17373 (extension_bridge.rs:32) and runs scripts in the active tab through the chrome.scripting and debugger APIs. Outside the page, the OS accessibility tree (UIAutomation on Windows) exposes every native element of every running app. The MCP tools (click_element, type_into_element, press_key, execute_browser_script, set_value, capture_screenshot) take a single selector format and dispatch to whichever adapter the target lives in. The agent does not call a different tool when it crosses from a Submit button in the page to the Save button in the file dialog.

What's special about the coordinate system when both adapters are live at once?

Browser agents and OS agents normally live in different coordinate spaces. The DOM gives you CSS pixels relative to the viewport. The OS gives you physical pixels relative to the screen. Terminator's capture_browser_dom_elements at crates/terminator-mcp-agent/src/server.rs:838 reconciles them in two steps. It looks up the UIA element with role:Document for the focused tab, reads its screen bounds, and uses the (x, y) of that element as the viewport offset. Then for every DOM element it serializes, it multiplies getBoundingClientRect's CSS pixel coords by window.devicePixelRatio (server.rs:911-914) so they are in physical pixels. The result is that a DOM element's reported coordinates and a UIA element's reported coordinates are in the same space. A click that is computed from the DOM lands in the same place a click computed from the UIA tree would land. Without that step, you cannot mix the two adapters in one workflow without drift on a 4K monitor or on a Retina display.

Which browsers does the bridge detect, and what happens for the rest?

Five browser process-name patterns are detected at crates/terminator/src/browser_script.rs:20-32: "chrome", "msedge" or "edge", "firefox", "brave", and "opera". The detection picks the right extension target so the WebSocket message is dispatched to the correct active tab. For any other browser the code falls through to "chrome" as the default. If the extension is not installed in the running browser, execute_browser_script blocks on the connection handshake (extension_bridge.rs has a 30-second wait loop in 500 ms ticks before giving up) and then returns an error to the agent. UIA-based clicks (click_element, type_into_element) keep working in that browser regardless, because they do not touch the extension at all.
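A rough sketch of that detection, assuming case-insensitive substring matching on the process name. The pattern list and the "chrome" fallback come from the answer above; the matching strategy, the function name, and the family labels returned are assumptions, and the Rust code at browser_script.rs may match differently.

```python
# (pattern, browser family) pairs, checked in order.
# "msedge" and "edge" are listed as one family in the article.
BROWSER_PATTERNS = [
    ("chrome", "chrome"),
    ("msedge", "edge"),
    ("edge", "edge"),
    ("firefox", "firefox"),
    ("brave", "brave"),
    ("opera", "opera"),
]

DEFAULT_BROWSER = "chrome"  # fallback for any unrecognized browser


def detect_browser(process_name: str) -> str:
    """Map a process name such as 'msedge.exe' to a browser family."""
    lowered = process_name.lower()
    for pattern, family in BROWSER_PATTERNS:
        if pattern in lowered:
            return family
    return DEFAULT_BROWSER


for name in ("chrome.exe", "msedge.exe", "FIREFOX.EXE", "vivaldi.exe"):
    print(name, "->", detect_browser(name))
```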

How is this different from a Playwright agent that uses a CDP downloads listener?

A CDP downloads listener gives you the file path of a finished download, not control over the dialog. If the user has "Ask where to save each file" enabled, the dialog still pops up and the listener does not see it. If the next step in the workflow is to open that downloaded PDF in Acrobat, click "Edit PDF", and type a value into a text box, you are now fully outside the browser and CDP gives you nothing. A UIA-based tool grammar continues to expose every step. You write the same selector format (role:Button, name:Edit PDF, window:Acrobat) you would write to click a button in the browser chrome.

What about file dialogs the agent has to fill in, not just click through?

The Windows file save dialog is a UIA tree of its own. It exposes a role:Edit element named "File name:" and a role:Button named "Save". In Terminator's tool surface that is type_into_element with selector "role:Edit window:Save As" followed by click_element with selector "role:Button name:Save". The agent does not need to know that the previous step was a click on a DOM anchor and the current step is a click on a Win32 button. The MCP tools are the same; the dispatch happens internally. This is the same point as the "one tool surface" answer above, but it matters specifically for save dialogs because most browser-agent failures end on exactly this dialog.

Is this approach safe? The Manifest V3 extension asks for the debugger permission.

Yes, by design. The manifest at crates/terminator/browser-extension/manifest.json declares debugger, tabs, scripting, activeTab, webNavigation, alarms, and storage. The debugger permission is only needed for cases where the chrome.scripting API alone cannot reach a frame (cross-origin iframes that do not match the content-script origin pattern, mostly). The extension only accepts WebSocket messages from 127.0.0.1, so a remote site cannot connect to it. The local-only socket plus a single-origin handshake is the same trust model Playwright and Puppeteer use for their CDP endpoints. If you are running this on a shared workstation, you should still treat the listening port as a trust boundary and not run untrusted code on the same machine, but the same is true of any local automation framework.

Why is this a Windows-first story, and what does that mean for macOS?

Terminator currently ships Windows-only binaries for its Node.js, Python, and MCP server packages. The Windows UIAutomation tree is the platform of record for desktop automation: Microsoft maintains it, third-party apps target it for accessibility compliance, and the surface is reasonably stable across Windows 10 and 11. macOS support existed at the Rust core for several months and was deleted on 2025-12-16 (commit 0c11011c) to focus on the path with the most depth. macOS users today get the browser-side tools through the extension, but the "clicking native dialogs" half of this story does not work on macOS in Terminator yet. If you are building this on macOS, the right starting points are Hammerspoon's axuielement Lua module or MacPaw's macapptree.

What does it cost to add this to an existing Claude Code or Cursor agent?

One MCP install line. For Claude Code: claude mcp add terminator "npx -y terminator-mcp-agent@latest". For Cursor or VS Code, add the same command to the mcpServers block of the MCP config. The Chrome extension is bundled inside the npm package and the agent will print a load-unpacked path on first use. Once both are loaded, the agent gets every tool described above (click_element, type_into_element, execute_browser_script, navigate_browser, run_command, plus another roughly twenty action tools that accept ui_diff_before_after for delta-based tree updates). No code change in the agent's prompt is required; the new tools just appear.