Alternative / Opus 4.7 computer use
Claude Opus 4.7 computer use, without the screenshots
Anthropic's beta computer tool ships with one observation channel: a base64 PNG. Opus 4.7 looks at the screenshot, picks pixel coordinates, your code clicks, you screenshot again. The 2576px image input ceiling and 1:1 coordinate mapping make that loop smoother than it was on 4.6, but the loop is still pixels in, pixels out, pixels in, pixels out. There is a different contract that does not include the screenshot at all. It is what this page is about.
Direct answer (verified 2026-05-08)
Wire Opus 4.7 to Terminator's MCP server with claude mcp add terminator "npx -y terminator-mcp-agent@latest", then set ui_diff_before_after: true and tree_output_format: "compact_yaml" on every action tool. The server resolves selectors locally against the OS accessibility tree and returns the changed elements as YAML inside the same response Opus reads. No screenshot crosses the wire in either direction. The diff fields are defined in crates/terminator-mcp-agent/src/utils.rs at the DiffTreeOptions struct.
Native computer tool reference: platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool. Opus 4.7 release notes: anthropic.com/news/claude-opus-4-7.
Two contracts, one Opus 4.7
The shape of the bytes the model sees is what divides these two paths. Both let Opus 4.7 click a real button in a real app. They disagree on what comes back after the click. The JSON below is the literal tool result your code has to return on the native path.
Tool result for one click, two contracts
// What you must return after every Anthropic computer_20251124 action.
// The tool result content includes a base64-encoded PNG screenshot.
// Opus 4.7 needs this image to know what changed; there is no other channel.
// Source: platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool
//   "Encode screenshots as base64 PNG or JPEG"
//   "Screenshot images (see Vision pricing)"
{
  "role": "user",
  "content": [{
    "type": "tool_result",
    "tool_use_id": "toolu_01A09q90qw90lq917835lq9",
    "content": [{
      "type": "image",
      "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": "iVBORw0KGgoAAAANSUhEUgAA...  // ~3.75MP at 2576px long edge
                 ...several hundred KB of base64 here...
                 ...truncated for sanity..."
      }
    }]
  }]
}
// Cost shape: image input billed via Anthropic vision pricing.
// Latency shape: PNG encode + upload + model re-decode every action.
// Failure shape: a tooltip moved one pixel; the next click misses;
// you replay the loop from scratch.
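Assembling that tool result in your own harness is mechanical. A minimal Python sketch, assuming you already have the raw PNG bytes from whatever screenshotter you use (the helper name is ours; the message shape follows the docs excerpt above):

```python
import base64
import json

def screenshot_tool_result(tool_use_id: str, png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as the tool_result message the native
    computer tool expects after every action."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": [{
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                },
            }],
        }],
    }

# Even a modest fake screenshot shows the payload inflation:
fake_png = b"\x89PNG" + b"\x00" * 300_000          # ~300 KB of pixels
msg = screenshot_tool_result("toolu_01A09q90qw90lq917835lq9", fake_png)
wire_size = len(json.dumps(msg))
print(wire_size)  # base64 adds ~33% on top of the raw bytes
```

Every action in the loop ships a message of roughly this size back to the API; the MCP path's equivalent tool result is a few hundred bytes of YAML.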
- every action requires a screenshot to know what changed
- image input billed via vision pricing on every step
- model has to re-ground from raw pixels, every turn
The round-trip, drawn
Three messages versus one. The native computer tool needs a screenshot before the click and a fresh screenshot after, because the tool result for left_click is just a confirmation; there is no state in it. The MCP path bundles act-and-observe into one tool call.
Native computer_20251124: screenshot, act, screenshot
Terminator MCP click_element with ui_diff_before_after: true
What the diff actually carries
The compact YAML format prefixes every element that has bounds with a sequential index, then writes the role, accessible name, and a small set of state flags in parentheses. The same indices are cached server-side for the next tool call, which is what enables the second mode of click_element: the model can refer to elements by index and never write coordinates. The format definition lives in crates/terminator-mcp-agent/src/tree_formatter.rs at format_tree_as_compact_yaml (line 56) and format_node (line 82).
# Excerpt of a real diff after pressing Ctrl+S in Notepad.
# This is what arrives in the tool_result content.
# Indices are stable across the response;
# the model uses them on the next click_element call.
added:
  - "#1 [Window] Save As bounds: [800,400,600,420] focused"
  - "  #2 [Edit] File name: value: 'untitled.txt' focusable"
  - "  #3 [ComboBox] Save as type: value: 'Text Documents' focusable"
  - "  #4 [Button] Save bounds: [1200,780,80,32] focusable"
  - "  #5 [Button] Cancel bounds: [1290,780,80,32] focusable"
removed:
  - "#7 [MenuItem] Save As... focused"
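The mechanics behind that added/removed shape are easy to see in miniature. The real implementation is the Rust DiffTreeOptions path in utils.rs; this toy Python version keys elements on (role, accessible name) only to illustrate the output shape, not the server's actual matching logic:

```python
def tree_diff(before: list[dict], after: list[dict]) -> dict:
    """Toy diff of two flat element lists keyed on (role, name).
    Illustrative only; the real diff lives in the Rust server."""
    key = lambda e: (e["role"], e["name"])
    before_keys = {key(e) for e in before}
    after_keys = {key(e) for e in after}
    added = [e for e in after if key(e) not in before_keys]
    removed = [e for e in before if key(e) not in after_keys]
    return {"added": added, "removed": removed}

# Before Ctrl+S: an open File menu. After: the Save As dialog.
before = [
    {"role": "MenuItem", "name": "Save As..."},
    {"role": "Edit", "name": "Text editor"},
]
after = [
    {"role": "Edit", "name": "Text editor"},
    {"role": "Window", "name": "Save As"},
    {"role": "Button", "name": "Save"},
]
diff = tree_diff(before, after)
print(diff["added"])    # the new Save As dialog elements
print(diff["removed"])  # the menu item that triggered it
```

Unchanged elements (the text editor) never appear in the response, which is why the payload stays at a few hundred bytes no matter how large the full tree is.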
Three things are happening in those few hundred bytes that a screenshot does not get for free. Element identity is anchored to role and accessible name, not pixel position. State (focused, focusable, value) is structured, not implied. And every entry has a stable index, so the model's next message can be click_element({ index: 4 }) and the server resolves it without further model thinking.
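The server-side index cache that makes click_element({ index: 4 }) resolvable can be sketched in a few lines. The class and field names here are illustrative, not the Rust structs in server.rs; the point is only the refresh-then-resolve lifecycle:

```python
class IndexCache:
    """Toy version of the server-side (index -> element) cache that
    click_element's index mode reads. Names are ours, not the Rust code's."""
    def __init__(self):
        self._cache: dict[int, dict] = {}

    def refresh(self, elements: list[dict]) -> None:
        # Re-indexed on every tree or diff response, mirroring what the
        # server does after each action with ui_diff_before_after on.
        self._cache = {i: e for i, e in enumerate(elements, start=1)}

    def resolve(self, index: int) -> dict:
        return self._cache[index]

cache = IndexCache()
cache.refresh([
    {"role": "Window", "name": "Save As"},
    {"role": "Edit", "name": "File name:"},
    {"role": "ComboBox", "name": "Save as type:"},
    {"role": "Button", "name": "Save"},
    {"role": "Button", "name": "Cancel"},
])
target = cache.resolve(4)  # the model said: click_element({ "index": 4 })
print(target)  # {'role': 'Button', 'name': 'Save'}
```

Because the cache is refreshed on every response, an index is only valid until the next action; the model always works from the most recent diff, which is exactly what the loop hands it.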
Numbers that matter
Four numbers worth holding in your head when you choose between these contracts on Opus 4.7. The first two are Anthropic's, sourced from the computer use docs. The second two are Terminator's, sourced from crates/terminator-mcp-agent/src/server.rs.
735 input tokens per native tool definition is the Anthropic-published per-session overhead for adding the computer tool. That is small. The screenshot input on every action is what adds up; vision pricing on a 3.75MP base64 PNG, ten or twenty times per task, is where most of the bill goes. The MCP path replaces every one of those screenshot uploads with a few hundred bytes of YAML.
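A back-of-envelope comparison makes the shape concrete. This assumes Anthropic's published approximation for vision input, tokens ≈ width × height / 750, and a ~4 characters-per-token rule of thumb for the YAML; both are estimates, not billing statements:

```python
def image_tokens(width: int, height: int) -> int:
    # Anthropic's documented approximation for vision input cost.
    return (width * height) // 750

# One screenshot at the 2576px long-edge ceiling, 16:10-ish frame:
per_screenshot = image_tokens(2576, 1610)
# A compact YAML diff is a few hundred bytes; ~4 chars per token:
per_diff = 400 // 4

task_steps = 20  # a mid-sized desktop task
print(per_screenshot * task_steps)  # screenshot-loop image tokens
print(per_diff * task_steps)        # tree-diff tokens, same task
```

The one-time 735-token tool definition disappears into the noise either way; the per-step multiplier is what separates the two contracts by two orders of magnitude.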
Wire it up
Four steps. The first three are once-per-environment; the fourth is the contract you give to Opus 4.7 in the system prompt or tool wrapper.
Opus 4.7 with no screenshots, end to end
1. Install the MCP server: claude mcp add terminator "npx -y terminator-mcp-agent@latest"
2. Get the tree once: call get_window_tree at task start. The model receives indexed YAML.
3. Act with diff on: every action takes ui_diff_before_after: true. No more screenshots.
4. Read the diff, repeat: Opus 4.7 sees added/removed elements and picks the next tool.
On Claude Desktop, Cursor, VS Code, or Windsurf, the registration command is the same: claude mcp add terminator "npx -y terminator-mcp-agent@latest". The first call to get_window_tree for a process returns the indexed YAML and seeds the server-side index cache. Every subsequent action with ui_diff_before_after: true updates the cache and gives Opus 4.7 the next set of indices. The screenshot tool stays available as capture_screenshot, but the model has to choose to call it. By default it does not.
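The loop those steps describe can be sketched with a stub standing in for the MCP session. The tool names (get_window_tree, click_element) and flags (ui_diff_before_after, tree_output_format) come from the doc above; the stub's argument names and canned payloads are ours:

```python
class StubTerminator:
    """Stand-in for an MCP client session; a real harness would route
    these calls through the registered terminator server."""
    def call(self, tool: str, args: dict) -> dict:
        if tool == "get_window_tree":
            return {"tree": "#1 [Window] Notepad ..."}   # canned
        return {"diff": {"added": ["#1 [Window] Save As ..."],
                         "removed": []}}                  # canned

def run_task(client: StubTerminator, actions: list[dict]) -> list[dict]:
    # Step 2: one full tree at task start seeds the index cache.
    observations = [client.call("get_window_tree", {"window": "Notepad"})]
    # Steps 3-4: every action carries the diff flags; the model reads
    # each returned diff before choosing the next tool call.
    for action in actions:
        args = {**action,
                "ui_diff_before_after": True,
                "tree_output_format": "compact_yaml"}
        observations.append(client.call("click_element", args))
    return observations

obs = run_task(StubTerminator(), [{"index": 4}])
print(len(obs))  # the initial tree plus one diffed action
```

Note what is absent: no screenshot call, no coordinate math, no second observation round-trip per action.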
What you give up by removing screenshots
The honest answer is: a few real things, on a small set of surfaces. The structural path is the right default for line-of-business apps, browsers, IDEs, and most native productivity tools. It is the wrong default for games and visual interpretation. The point is to make the screenshot opt-in instead of automatic.
Tradeoffs to know up front
- If your target surface is fully custom-rendered, like a game or a Direct3D editor, the accessibility tree may not have it. You still want a screenshot fallback there.
- Vision-based grounding (zoom, OCR, Omniparser, Gemini) is genuinely useful when an Electron app exposes a div soup with no role or name. Terminator's vision_type parameter routes there explicitly when needed; it is not the default.
- If you want the model to interpret a chart, an image embed, or anything visual that is not a control, you have to send pixels. Tree diffs do not describe images.
- What you do gain: every action returns a structured diff Opus 4.7 can reason about; image input becomes opt-in for the cases that genuinely need it; the same workflow runs the same way on a CI box and a developer laptop.
Why this fits Opus 4.7 specifically
Two of the changes Anthropic shipped with 4.7 push agents toward the tree-diff shape rather than the screenshot shape. Lower default tool-call frequency means Opus 4.7, left to its own taste, prefers to think more and act fewer times per turn. A screenshot loop that demands one model turn per click runs against that grain. A tool that returns a structured diff after each action lets the model think once, dispatch one tool, and read a meaningful response, which is the cadence Opus 4.7 already wants. The xhigh effort level Anthropic recommends for agentic work compounds this: deeper reasoning per step pays off when each step changes more state, which is exactly what an action plus a tree diff offers and what a single click does not.
Questions about driving Opus 4.7 without screenshots
What is the alternative to Anthropic's computer use tool for Claude Opus 4.7?
Terminator's MCP server, registered with one command: claude mcp add terminator "npx -y terminator-mcp-agent@latest". Once registered, Opus 4.7 sees 35 typed tools (click_element, type_into_element, press_key, scroll_element, set_value, validate_element, wait_for_element, navigate_browser, capture_screenshot, execute_sequence, and others) instead of one schema-less computer tool. Each tool resolves a selector against the OS accessibility tree (Windows UIA today, macOS AX in the core but not yet on the published binary) and returns a structured JSON result. The crucial flag is ui_diff_before_after: true on every action tool, which makes the response include the changed elements as compact YAML lines. Opus 4.7 reads that and picks the next tool without ever sending a screenshot.
How does Opus 4.7 know what happened on the screen if it never sees a screenshot?
Through the diff response. Terminator captures the UI tree before the action runs, captures it again after, computes the diff, and returns it inline as compact YAML. The format is defined in crates/terminator-mcp-agent/src/tree_formatter.rs at format_tree_as_compact_yaml around line 56. A typical diff looks like 'added: #1 [Window] Save As (focused, bounds: [800,400,600,420]) #2 [Button] Save (focusable)' and 'removed: #7 [MenuItem] Save As... (focused)'. The model has the role, accessible name, focused state, bounds, and a stable index for every element that changed. That is enough information to decide the next click. The screenshot loop exists because the native computer tool has no other channel to report state; the MCP server has the tree, so it does not need pixels.
Is this faster than the screenshot loop on Opus 4.7?
Yes, by two mechanisms. First, image input drops out. Anthropic charges screenshots through standard vision pricing (per their computer use docs), and a 2576px long-edge image is roughly 3.75MP, which is a substantial input on every step. The MCP path returns a few hundred bytes of YAML instead. Second, the inner loop avoids round-trips. Each native computer use action requires the model to emit a click, your code to execute it, your code to take a fresh screenshot, and the screenshot to ride back to Anthropic. The MCP path runs the click-and-diff inside one tool result; UIA's IUIAutomationElement.FindFirst and InvokePattern.Invoke calls finish in milliseconds, with no network hop. Opus 4.7's documented preference for fewer tool calls per turn at the default effort level lines up with this shape: the model thinks once, dispatches one tool, reads the diff, thinks again.
Where does the index in the YAML diff come from, and can the model click by index?
The compact YAML formatter assigns sequential indices (#1, #2, #3, ...) to every element with bounds, and stores a server-side cache of (index → role, name, bounds, selector) for the most recent tree. That is exactly what click_element's Index mode consumes. The model can call click_element with { "index": 5 } and Terminator looks up the cached element and clicks it. No coordinates emitted, no selector authored, no screenshot rendered. The Index mode is documented in server.rs at the click_element handler around line 2486 (Mode 2 - Index) and is what makes the tree-diff loop self-contained: the model gets indices on every diff, refers to them on the next call.
When does the screenshot path still beat the tree-diff path?
Three honest cases. One: fully custom-painted surfaces. Games rendered through Direct3D, Office canvases, terminal emulators with custom GPU compositors, and many CAD tools expose one opaque accessibility node for an entire surface. The tree has nothing useful; you need vision. Two: visual interpretation tasks. If the user asks 'is this chart trending up?' or 'is the photo upside down?' the model has to look at pixels. The tree carries no semantics about images. Three: web pages where the structural DOM is hostile (Cloudflare interstitials, single-page apps that route tooltips into a portal far from the visible parent). Terminator handles these by exposing capture_screenshot as one of the 35 MCP tools and by adding a vision_type parameter on click_element with values UiTree, Ocr, Omniparser, Gemini, and Dom. The right pattern is structural by default, vision by exception, with the agent making the choice per call.
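The structural-by-default, vision-by-exception pattern amounts to a small per-call router. The vision_type values below come from click_element's documented enum; the decision predicates and thresholds are illustrative, not anything Terminator ships:

```python
def choose_vision_type(element_count: int, task_kind: str) -> str:
    """Toy per-call router for structural-by-default grounding.
    Thresholds and task_kind labels are illustrative assumptions."""
    if task_kind == "interpret_image":
        return "Gemini"      # pixels are the content itself
    if element_count <= 1:
        return "Omniparser"  # opaque surface: one node for the whole app
    return "UiTree"          # default: structural grounding

print(choose_vision_type(412, "click"))            # UiTree
print(choose_vision_type(1, "click"))              # Omniparser
print(choose_vision_type(300, "interpret_image"))  # Gemini
```

In practice the agent itself makes this choice per call, but the shape is the same: vision is a labeled exception, not the ambient observation channel.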
Does Anthropic's beta computer use tool work alongside Terminator's MCP server?
Yes. They are not mutually exclusive: one is a built-in Anthropic tool, the other is an MCP server. You can register both. The pragmatic split is to give Opus 4.7 access to Terminator's selector-and-index tools as the primary surface, and keep computer_20251124 around for the cases the accessibility tree does not cover. Opus 4.7 picks per task. The cost shape changes accordingly: structural tools cost a few hundred input tokens per call; the computer tool's first invocation per session adds roughly 735 input tokens for the schema, plus vision pricing on every screenshot. If 90 percent of the work hits the tree, 90 percent of the cost moves off image input.
What model versions of the computer tool does Opus 4.7 support, and which do I want?
Opus 4.7 supports computer_20251124 (with beta header computer-use-2025-11-24). The newer schema introduces enable_zoom and a zoom action that takes a region [x1, y1, x2, y2] and returns that region at full resolution. Zoom is useful when you genuinely need pixels and want to keep input cost down. The older computer_20250124 still works with prior models but does not give Opus 4.7 anything it does not already get from the newer schema. If you choose the screenshot path, use computer_20251124. If you choose the MCP path, the version of computer_20251124 is irrelevant because you do not enable the tool.
What about platforms? Does this work on macOS or only Windows?
The MCP path is Windows-only on the published binary today. Terminator's core Rust crate has the cross-platform AccessibilityEngine trait at crates/terminator/src/platforms/mod.rs line 86 with macOS AX scaffolding, but the main branch ends with a compile_error! at lines 319-320 stating 'Terminator only supports Windows'. The npm package terminator-mcp-agent ships a Windows binary. If you need Opus 4.7 driving macOS desktops without screenshots today, this is not the right tool yet. If you need macOS with screenshots, you can still use the native Anthropic computer tool and pair it with a thin click-by-coordinate executor like cliclick. Linux uses AT-SPI2 in the Rust core but is similarly not packaged as an MCP binary.
Why would the model trust an indexed list of elements more than a screenshot?
It is not about trust, it is about what the model is good at. Pixels are a noisy substrate for clicks: tooltips reflow, scroll bars introduce subpixel offsets, DPI scaling and theme changes shift everything by a few pixels. The model has to ground every step from raw pixels each time. The accessibility tree is what the OS itself uses to route assistive technology; it has stable role names (Button, Edit, Window), localized accessible names that survive theme changes, and a fixed parent-child structure. A model handed '#3 [Button] Save (focusable)' has unambiguous targeting. A model handed a 2576x1620 PNG has to find the Save button through computer vision every single step. The pixel path's characteristic failure mode is brittle clicks; the structured path has no equivalent.
Specific deep dives into the same stack
Adjacent reads
Claude Opus 4.7 desktop automation: fewer tool calls, bigger workflows
What changed in Opus 4.7 for desktop work: 2576px input, 1:1 coordinates, fewer default tool calls. Why execute_sequence is the right shape for it.
Claude computer use: the pixel-coordinate loop, and the selector-based path
Anthropic's native computer tool, broken open. Side-by-side with Terminator's selector tools. server.rs source, no marketing.
Accessibility API for computer use agents: the seven-mode click_element router
ClickMode (Selector, Index, Coordinates) plus VisionType (UiTree, Ocr, Omniparser, Gemini, Dom). One tool, seven grounding paths.