Robotic desktop automation, without a Studio: the robot as an importable library your AI coding assistant calls
Every long-standing guide on this topic frames the robot the same way: a bot process you build by dragging activities onto a canvas inside a vendor Studio, published to an orchestrator, and run on an attended or unattended endpoint. That shape made sense when the builder was a business analyst. It does not make sense when the builder is an AI coding assistant. This page is about the other shape: the robot is a library, the recorder writes YAML, and your IDE is the editor.
The shape every other guide assumes
Read any of the canonical writeups on this topic and you will see the same diagram. A user sits at a desktop, a bot process runs on the same machine, the bot was built in a Studio application, the Studio exported a package, an orchestrator schedules the package. That is a reasonable shape, but it assumes the builder is a human with a mouse.
What the existing playbooks miss
- Legacy RDA guides discuss attended vs unattended as the primary axis
- They assume the robot is a bot process you build in a vendor Studio
- None document what a recorder should actually capture (14 event types)
- None describe a synthesized typing event that replaces per-keystroke replay
- None frame the recording as a file your AI pair programmer edits directly
The industry that popularized this shape built good products. But it is the wrong shape for a world where your pair programmer is an LLM. An LLM cannot open UiPath Studio, cannot drag activities onto a canvas, cannot read an XAML package back into its context window, cannot make a pull request against a proprietary bot format. It can, however, read a YAML file, edit it, diff it, and commit it.
The anchor: 14 event variants in a Rust enum
Open crates/terminator-workflow-recorder/src/events.rs and scroll to line 475. There is a Rust enum called WorkflowEvent with 14 variants. That is the vocabulary the recorder writes into YAML, and the vocabulary a replay consumer has to understand. Nothing else exists at this layer. No visual flow, no activity library, no orchestrator package format.
Seven of the 14 are raw OS events (Mouse, Keyboard, Clipboard, TextSelection, DragDrop, Hotkey, PendingAction). Seven are high-level events synthesized by the recorder from patterns in the raw stream. The high-level ones are what a useful replay actually walks.
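The split can be sketched in TypeScript. The 14 kind names below come straight from the WorkflowEvent enum; the union shape and the helper are illustrative, not the crate's API:

```typescript
// The 14 variant names from WorkflowEvent, sketched as a discriminated set.
// Field payloads are omitted; only the kind tag matters for routing.
type RawEventKind =
  | "Mouse" | "Keyboard" | "Clipboard" | "TextSelection"
  | "DragDrop" | "Hotkey" | "PendingAction";

type HighLevelEventKind =
  | "TextInputCompleted" | "ApplicationSwitch" | "BrowserTabNavigation"
  | "Click" | "BrowserClick" | "BrowserTextInput" | "FileOpened";

type WorkflowEventKind = RawEventKind | HighLevelEventKind;

const RAW: readonly RawEventKind[] = [
  "Mouse", "Keyboard", "Clipboard", "TextSelection",
  "DragDrop", "Hotkey", "PendingAction",
];

// A replay consumer typically walks only the high-level kinds.
function isHighLevel(kind: WorkflowEventKind): boolean {
  return !(RAW as readonly string[]).includes(kind);
}
```

A replay loop can call isHighLevel first and skip everything raw without knowing anything else about the event.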
The fourteen variants, one by one
Each variant exists because replaying without it would be lossy or fragile. These are not Lego blocks in a visual IDE; they are the full surface area of a recording. If a kind of user action is not expressible as one of these 14 variants, the recorder does not capture it, and the replay cannot reconstruct it.
Mouse
Down, up, click, double-click, right-click, move, wheel, drag-start, drag-end, drop. Always paired with screen coordinates and the UIElement under the cursor at capture time.
Keyboard
Raw key code, key-up/key-down flag, scan code, and modifier flags (ctrl, alt, shift, win). Useful for hotkeys and shortcut detection, rarely useful for replay on its own.
Clipboard
Copy, cut, paste, clear. Includes content, format, size in bytes, and a truncated flag when the clipboard payload was too large to inline. The AI assistant sees what moved across apps.
TextSelection
Selected text, start and end positions in screen coordinates, and a SelectionMethod (MouseDrag, DoubleClick for word, TripleClick for line, KeyboardShortcut, ContextMenu). This is the signal that a user read something, not just clicked it.
DragDrop
Start and end positions, source UIElement, data type, dragged content. Lets a replay reconstruct a drag from one pane of a native app to another without a mouse path hack.
Hotkey
Combination string (Ctrl+C, Alt+Tab), action, global or app-specific flag, process name. Hotkeys need their own variant because replaying them as raw key events misses the application-level handler the OS actually runs.
TextInputCompleted
The whole reason this recorder exists. Carries text_value, field_name, field_type, typing_duration_ms, keystroke_count, and input_method (Typed, Pasted, AutoFilled). 16 raw keystrokes collapse into one typeText replay step.
ApplicationSwitch
Method (AltTab, TaskbarClick, WindowsKeyShortcut, StartMenu, WindowClick, Other), from-process, to-process. Context for the replay to know whether to call activate on a window or to spawn a new process.
BrowserTabNavigation
URL, tab identifier, title. When the focused window is a Chromium browser, the recorder pulls this through the companion extension so navigation steps replay by URL rather than by clicking a tab whose position shifted.
Click
A high-level Click event built on top of Mouse. Carries the resolved UIElement selector, the button, and a CompletionHint so the replay knows whether to invoke() or click() the element.
BrowserClick
A click that landed inside a Chromium browser, with the DOM selector, the viewport-aligned bounds, and the iframe path alongside the usual UIA metadata. Replay can use the DOM selector when UIA returns an empty Pane.
BrowserTextInput
A typed input inside a browser. Mirrors TextInputCompleted but carries the DOM selector and the form field identifier. Replay calls executeBrowserScript for reliability on React, Vue, and Svelte inputs that ignore raw key events.
FileOpened
Detected via window-title change plus a filesystem search. Absolute path, application, detection method. The replay knows whether to call openFile or to recreate the file from content before re-opening.
PendingAction
Emitted immediately when a user begins an action, before the follow-up UI capture finishes. Used by the replay engine to short-circuit loops that would otherwise race with slow UIA tree snapshots on heavy applications.
The variant that makes the recorder worth using: TextInputCompleted
The seven high-level events are all useful. One of them is the reason a recorder for an AI coding assistant has to exist in the first place. When a user types hello@world.com into an email field, the raw event stream looks like a wall. 16 keyboard-down events, 16 keyboard-up events, a handful of focus updates, three or four accessibility tree refreshes. Replaying that exact sequence breaks the moment the target field has autocomplete, autocorrect, or an input mask that consumes keystrokes in its own way.
TextInputCompleted replaces all of that with one event that carries the text_value verbatim, plus the field metadata and the timing. The replay calls typeText once. The downstream application sees a clean final value, whatever its input pipeline does.
“Raw keystroke replay is the single most common reason RPA flows break on app upgrades. Record the typing burst as a semantic event, not as keystrokes.”
Implementation note, TextInputCompleted design
How the synthesis actually works
The function that builds a TextInputCompletedEvent is get_completion_event in crates/terminator-workflow-recorder/src/recorder/windows/structs.rs. It reads the focused field's text once, sleeps 50 ms on failure, reads again, and falls back to a tagged error string rather than silently dropping the typing burst.
The 50 ms retry is not arbitrary. UIA's tree snapshot is eventually consistent; a focus change on a busy window sometimes lands before the property cache refreshes. A single retry at 50 ms catches almost every transient failure in practice without adding observable latency to the recording.
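The read-retry-fallback shape is easy to port. A minimal sketch in TypeScript, with a read callback standing in for the UIA element.text(0) call; the function and parameter names here are hypothetical, not the crate's:

```typescript
// Read a field's final text with one delayed retry, falling back to a
// tagged error string so the typing burst is never silently dropped.
async function readFinalText(
  read: () => Promise<string>, // stand-in for the UIA text read
  keystrokeCount: number,
  retryDelayMs = 50,
): Promise<string> {
  try {
    return await read();
  } catch {
    // UIA is eventually consistent; give the property cache one chance
    // to catch up before giving up on the read.
    await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
    try {
      return await read();
    } catch {
      // Mirrors the recorder's tagged fallback value.
      return `[Text extraction failed - ${keystrokeCount} keystrokes]`;
    }
  }
}
```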
From hooks to YAML, step by step
The full pipeline, from the moment the user starts a recording to the moment an AI assistant reads the resulting YAML, runs six stages. Each stage has exactly one job and hands a fully-formed event or file to the next stage.
Recorder pipeline
Install the OS-level hooks
On Windows, SetWindowsHookEx with WH_MOUSE_LL and WH_KEYBOARD_LL for raw input, plus UIA Automation events for focus and structure changes. Browser context, if any, attaches through a companion Chrome extension.
Buffer raw events into a per-field typing session
Every time the focused element changes, the previous field's keystroke buffer closes. keystroke_count and has_typing_activity are tracked on a TypingSession struct. Whitespace-only sessions are dropped.
Read the final text, with one retry
Call element.text(0) on the just-defocused field. If the UIA read fails, sleep 50 ms and retry once. On a second failure, emit the event with '[Text extraction failed - N keystrokes]' so the recording never silently loses data.
Synthesize the high-level event
Build a TextInputCompletedEvent with text_value, field_name, field_type, input_method, typing_duration_ms, keystroke_count, process_name, and the resolved EventMetadata. Push it to the event channel.
Emit YAML on stop
The recorder's on_stop handler walks the event stream, keeps the replayable variants (TextInputCompleted, Click, Hotkey, ApplicationSwitch, BrowserTabNavigation, Clipboard, FileOpened) and collapses the rest. Writes workflow.yml.
Hand the YAML to the AI assistant
Claude, Cursor, or Codex reads workflow.yml through the filesystem or through the MCP get_workflow tool. It can edit steps, insert branches, replace hard-coded text with variables, and rerun any slice with --start-from and --end-at.
What the YAML looks like
Here is the shape of a real recording, lightly abridged: two text inputs into an Excel sheet, a clipboard copy, a browser tab navigation to a Google Sheet. Notice what the recording does not contain: no pixel coordinates, no raw key codes, no position of the Excel window. The selectors are structured and portable.
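No full schema is reproduced in this excerpt, so treat the following as a hedged illustration: the event kinds and field names come from the variant descriptions above, while the values and exact nesting are invented.

```yaml
# Illustrative shape only; the exact schema is whatever the recorder
# crate serializes. Kinds and field names follow the variants above.
events:
  - kind: TextInputCompleted
    selector: "role:Edit && name:Name Box"
    text_value: "Q3 totals"
    field_type: edit
    input_method: Typed
    keystroke_count: 9
    typing_duration_ms: 1840
    process_name: EXCEL.EXE
  - kind: TextInputCompleted
    selector: "role:Edit && name:Formula Bar"
    text_value: "=SUM(B2:B14)"
    field_type: edit
    input_method: Typed
    keystroke_count: 12
    typing_duration_ms: 2310
    process_name: EXCEL.EXE
  - kind: Clipboard
    action: Copy
    content: "=SUM(B2:B14)"
    process_name: EXCEL.EXE
  - kind: BrowserTabNavigation
    url: "https://docs.google.com/spreadsheets/d/..."
    title: "Q3 rollup"
```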
What the replay looks like
The replay is whatever code your AI coding assistant writes around the library. No canvas, no bot runtime. In TypeScript, it is ~40 lines that switch on the event kind and call the right library method. The TextInputCompleted branch is the point of all of this: one typeText call replaces the 16 raw keyboard events the user made.
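A sketch of that consumer, with the library calls mocked out behind an interface so the shape is clear. The driver method names are assumptions standing in for the real bindings; the switch structure is the point:

```typescript
// Minimal replay loop: switch on the high-level event kind and call the
// corresponding driver method once per event.
interface RecordedEvent {
  kind: string;
  selector?: string;
  text_value?: string;
  combination?: string;
  url?: string;
}

// Mock of the automation library surface; real binding names may differ.
interface DesktopDriver {
  typeText(selector: string, text: string): void;
  click(selector: string): void;
  pressHotkey(combination: string): void;
  navigate(url: string): void;
}

function replay(events: RecordedEvent[], driver: DesktopDriver): void {
  for (const ev of events) {
    switch (ev.kind) {
      case "TextInputCompleted":
      case "BrowserTextInput":
        // One call replaces the whole recorded keystroke burst.
        driver.typeText(ev.selector!, ev.text_value!);
        break;
      case "Click":
      case "BrowserClick":
        driver.click(ev.selector!);
        break;
      case "Hotkey":
        driver.pressHotkey(ev.combination!);
        break;
      case "BrowserTabNavigation":
        driver.navigate(ev.url!);
        break;
      default:
        // Raw events (Mouse, Keyboard, ...) are intentionally skipped.
        break;
    }
  }
}
```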
Studio vs library, side by side
The practical difference between the two shapes is not a feature list; it is whether the artifact your automation produces is something an AI assistant can read and modify on its own.
The same automation, two shapes
Open UiPath Studio, Power Automate Desktop, or Automation Anywhere Bot Creator. Drag activities onto a canvas. Connect them. Save a proprietary package. Publish to the orchestrator. Hope the orchestrator version matches the Studio version.
- Editor is proprietary, not your IDE
- File format is a binary or XML package
- AI coding assistants cannot read or edit the flow
- Runtime lives behind a licensing server
- Version control is bolted on, not native
Where each approach actually differs
Not a feature shootout. The rows that matter are the ones that change who the editor is and what the file format looks like on disk.
| Feature | Legacy RDA Studios | Terminator |
|---|---|---|
| Editor | Vendor Studio (UiPath, PAD, AA Bot Creator) | Your IDE with your AI coding assistant |
| Workflow file format | Proprietary XAML, XML, or binary package | Plain YAML, diffable in git |
| Event taxonomy in recordings | Mouse and keyboard only, maybe clipboard | 14 variants, including TextInputCompleted synthesis |
| Typing burst replay | Raw per-keystroke, breaks on autocomplete | One typeText call from the synthesized event |
| Selector style | Visual picker, often pixel or brittle id | role:Button && name:Save, derived from UIA |
| AI pair programmer integration | None. Assistants cannot read the canvas | Assistant reads/writes YAML, calls MCP tools |
| License | Seat-based, proprietary, orchestrator tax | MIT, self-hosted, no orchestrator required |
| Attended and unattended | Split products with different price tiers | Same library, deployment choice only |
Why one typeText beats 16 key events in the wild
The synthesis matters most on applications that add logic between keystrokes: address fields with autocomplete, IDE-style editors with suggestion popups, forms with input masks that insert separators as you type. A raw-keystroke replay fights those features and usually loses. A typeText call writes the final value and lets the field's input pipeline do whatever it would do if the user had pasted from the clipboard.
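A toy model makes the breakage concrete. The masked field below runs its mask logic on key events but accepts a direct set wholesale, the way many masked inputs treat a paste; the class and its behavior are invented purely for illustration:

```typescript
// Toy input-mask field: key events trigger mask logic, a direct set does not.
class MaskedPhoneField {
  value = "";

  pressKey(ch: string): void {
    this.value += ch;
    // Mask logic: auto-insert a dash after each group of three digits.
    const digits = this.value.replace(/-/g, "");
    if (digits.length % 3 === 0 && digits.length < 9) {
      this.value += "-";
    }
  }

  setValue(text: string): void {
    this.value = text; // paste-like path: mask logic is not re-applied
  }
}

// What the recorder read out of the field after the user finished typing.
const recordedFinalText = "555-123-456";

// Naive per-keystroke replay re-types the final text, and the mask
// fires again on every separator, doubling them.
const naive = new MaskedPhoneField();
for (const ch of recordedFinalText) naive.pressKey(ch);

// Semantic replay writes the final value once and the field stays correct.
const semantic = new MaskedPhoneField();
semantic.setValue(recordedFinalText);
```

The keystroke loop and the single set diverge on any field whose input pipeline reacts to individual key events, which is exactly the class of fields TextInputCompleted exists for.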
The other synthesis wins come from the remaining high-level events. ApplicationSwitch captures why a window change happened (AltTab vs TaskbarClick vs WindowClick), so the replay picks the right method to bring a window to front. Hotkey records the combination as a string, so the replay triggers the application-level handler instead of hoping raw key events line up. FileOpened carries the absolute path, so the replay can recreate the file before re-opening it if the path moved.
“The recorder is the piece that changed how we ship internal automations. We stopped treating RPA as a separate discipline. It is just code the AI wrote from a YAML the recorder produced.”
Getting started in three commands
There is no installer, no license key, no orchestrator signup. The recorder is a Cargo dependency; the replay is a CLI; the AI-assistant integration is an MCP server. Each is a single command.
The MCP agent ships 35 tools exposed to your AI assistant through a standard stdio or HTTP transport. Run claude mcp add terminator "npx -y terminator-mcp-agent@latest" and Claude Code can call get_window_tree, invoke selectors, record a flow, and edit the resulting YAML, all without leaving your terminal.
A closing note on counting
The count of 14 variants is not marketing. It is whatever the enum defines the day you open crates/terminator-workflow-recorder/src/events.rs and count the variants yourself. If the project adds a fifteenth variant next month, you will see it in the YAML the moment you upgrade the crate. The only promise this page makes is that the vocabulary is a Rust enum, checked in, readable, editable, and strictly the source of truth for what a recorder can capture and what a replay can do.
Record the flow once, let the assistant do the rest
Book a 20-minute walkthrough and record a real workflow from your machine into YAML on the call.
Frequently asked questions
What does robotic desktop automation actually mean in a developer context?
The phrase comes from enterprise RPA. A 'robot' runs on an individual's desktop and drives any application on that machine: Excel, SAP, a custom WPF tool, Teams, File Explorer, anything with a window. Historically the robot is a process built inside a vendor Studio (UiPath Studio, Power Automate Desktop, Automation Anywhere Bot Creator) using a visual flowchart. Terminator keeps the definition (any desktop app, cross-application, native UI) but drops the Studio. The robot is a library: cargo add terminator-rs, npm install @mediar-ai/terminator, pip install terminator-py, or npx -y terminator-mcp-agent@latest. It is imported by code an AI coding assistant wrote, and it drives the desktop through the OS accessibility API.
What exactly does the workflow recorder capture?
Fourteen event variants, defined in the WorkflowEvent enum at crates/terminator-workflow-recorder/src/events.rs: Mouse, Keyboard, Clipboard, TextSelection, DragDrop, Hotkey, TextInputCompleted, ApplicationSwitch, BrowserTabNavigation, Click, BrowserClick, BrowserTextInput, FileOpened, PendingAction. Seven (Mouse, Keyboard, Clipboard, TextSelection, DragDrop, Hotkey, PendingAction) are raw OS events; the other seven are synthesized high-level events the recorder assembles from patterns in the raw stream. A replay consumer usually ignores the raw events and walks the high-level ones.
Why does TextInputCompleted matter?
Because replaying raw keyboard events is fragile. If a user types 'hello@world.com' into a field, the recorder sees 16 key-down and 16 key-up events. Replaying that sequence breaks the moment the target app applies autocomplete, autocorrect, or an input mask. TextInputCompleted waits for the typing session to end, reads the final text_value out of the focused UIA element, and emits one event with text_value, field_name, field_type, typing_duration_ms, keystroke_count, and input_method. The replay calls element.typeText('hello@world.com') once. That replays cleanly even across apps that inject their own input handlers between keystrokes.
What happens if the recorder cannot read the final text?
The get_completion_event function in crates/terminator-workflow-recorder/src/recorder/windows/structs.rs calls element.text(0) once. If the COM call fails (common when the focus moved before the UIA tree refreshed), it sleeps 50 ms and retries. If the second attempt also fails, the event still emits with text_value set to '[Text extraction failed - N keystrokes]' where N is the recorded keystroke count. That means the recording never silently drops a typing burst. The replay step is lossy in that case, but the downstream AI assistant sees the failure explicitly and can regenerate the text from context or from an earlier step.
Is this attended or unattended automation?
Both, depending on how you wire it. The MCP server runs as an inline process your AI coding assistant talks to, which is attended in the traditional sense: the user is at the machine and the assistant drives the UI while they watch. The same recording, once exported to YAML, can run headless on an Azure VM through the terminator CLI, which is unattended. The attended/unattended axis that dominates legacy RDA marketing is not a library-level choice here; it is a deployment choice.
Why record at all when an LLM can drive the UI live?
Two reasons. Cost: a recorded workflow replays at CPU speed, not at the speed of a round trip to a vision model, so runs that would cost dollars in tokens cost fractions of a cent. Determinism: a recording has explicit selectors like 'role:Button && name:Save', not pixel guesses. If the app ships a new version that shifts the button 40 pixels to the right, the recording still runs. A screenshot-driven agent has to re-solve the layout every time.
How do I actually run a recording locally?
Add terminator-workflow-recorder as a Cargo dependency, start the recorder, do the work once, stop it, and serialize the event stream to YAML. The CLI ships this as one command: terminator record, which emits a workflow.yml with the 14 event kinds. To replay: terminator mcp run workflow.yml. To replay a slice for debugging: terminator mcp run workflow.yml --start-from 'step_5' --end-at 'step_8'. There is no drag-and-drop designer, no canvas, no studio. The YAML is source code and you edit it the same way you edit any other text file.
How is this different from Power Automate Desktop or UiPath Studio?
Power Automate Desktop and UiPath Studio are visual IDEs. You open a canvas, drag activities onto it, connect them, save a bot package, deploy it to an orchestrator. The automation lives inside a proprietary editor and a proprietary runtime. Terminator is a Rust library with Node.js, Python, and MCP bindings. The automation lives in your repo as code and YAML. There is no orchestrator, no vendor lock-in, no seat licensing. The AI coding assistant you already use (Claude Code, Cursor, Codex, Windsurf, VS Code) becomes the editor. MIT licensed, self-hostable.
What makes a selector replay-safe?
Selectors in Terminator are structured. A recorded click against a Save button becomes 'role:Button && name:Save' or, when the app exposes a stable AutomationId, 'id:save-btn'. They are never pixel coordinates and never brittle '#cssIdWithRandomSuffix' strings, because those change non-deterministically across machines and app versions. The recorder derives the selector from the UIA element the raw event hit, and the YAML stores the selector alongside the action. Replays resolve the selector fresh every time, so a window positioned differently or a button rendered with a different font still gets clicked.
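Fresh resolution can be sketched against a toy accessibility tree. The resolver below matches on role and name only, never on position, which is why a moved window or reflowed layout still resolves; the node shape and parser are illustrative, not the library's internals:

```typescript
// Toy accessibility node and a resolver for "role:X && name:Y" selectors.
interface UiaNode {
  role: string;
  name?: string;
  children?: UiaNode[];
}

// Parse "role:Button && name:Save" into [key, value] criteria pairs.
function parseSelector(selector: string): Array<[string, string]> {
  return selector.split("&&").map((part) => {
    const [key, ...rest] = part.trim().split(":");
    return [key, rest.join(":")] as [string, string];
  });
}

// Depth-first search: the element's position in the tree and on screen
// is irrelevant, only its properties have to match.
function resolve(root: UiaNode, selector: string): UiaNode | null {
  const criteria = parseSelector(selector);
  const matches = (n: UiaNode) =>
    criteria.every(([key, value]) => (n as any)[key] === value);
  if (matches(root)) return root;
  for (const child of root.children ?? []) {
    const hit = resolve(child, selector);
    if (hit) return hit;
  }
  return null;
}
```

Because the match runs against properties rather than coordinates, the same selector resolves whether the Save button is the first child or the fifth, on a laptop screen or a 4K monitor.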
Does the recorder work for browser automation too?
Yes, and it is one of the reasons there are separate BrowserClick and BrowserTextInput variants in the enum. When the focused window is a Chromium browser, a companion extension provides DOM-level context that the raw Windows UIA tree does not carry. Browser events in the YAML include the DOM selector, the tab URL, and the iframe path, alongside the usual UIA metadata. A replay of a browser step can target whichever layer is most reliable for the target app.