Full Windows desktop control. Mouse, keyboard, screenshots - interact with any Windows application like a human.
Windows Control Skill
Full desktop automation for Windows. Control mouse, keyboard, and screen like a human user.
Quick Start
All scripts are in skills/windows-control/scripts/
Screenshot
py screenshot.py > output.b64
Returns base64 PNG of entire screen.
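To view the capture, the base64 text has to be decoded back into PNG bytes. The following is a minimal sketch, assuming output.b64 holds nothing but the base64 string; `decode_screenshot` is a hypothetical helper, not one of the skill's scripts. It reads raw bytes because Windows PowerShell 5.1 writes UTF-16 when redirecting with `>`, while cmd.exe writes plain bytes.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def decode_screenshot(raw: bytes) -> bytes:
    """Turn the redirected output of screenshot.py back into PNG bytes."""
    # A UTF-16 byte-order mark means the shell re-encoded the text on redirect.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = raw.decode("utf-16")
    else:
        text = raw.decode("utf-8-sig")  # also tolerates a UTF-8 BOM
    # Drop newlines and other whitespace before decoding.
    png = base64.b64decode("".join(text.split()))
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return png


# Usage (file names follow the command above):
#   from pathlib import Path
#   png = decode_screenshot(Path("output.b64").read_bytes())
#   Path("screenshot.png").write_bytes(png)
```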
Click
py click.py 500 300 # Left click at (500, 300)
py click.py 500 300 right # Right click
py click.py 500 300 left 2 # Double click
Type Text
py type_text.py "Hello World"
Types text at current cursor position (10ms between keys).
Press Keys
py key_press.py "enter"
py key_press.py "ctrl+s"
py key_press.py "alt+tab"
py key_press.py "ctrl+shift+esc"
Move Mouse
py mouse_move.py 500 300
Moves mouse to coordinates (smooth 0.2s animation).
Scroll
py scroll.py up 5 # Scroll up 5 notches
py scroll.py down 10 # Scroll down 10 notches
Window Management (NEW!)
py focus_window.py "Chrome" # Bring window to front
py minimize_window.py "Notepad" # Minimize window
py maximize_window.py "VS Code" # Maximize window
py close_window.py "Calculator" # Close window
py get_active_window.py # Get title of active window
Advanced Actions (NEW!)
Click by text (No coordinates needed!)
py click_text.py "Save" # Click "Save" button anywhere
py click_text.py "Submit" "Chrome" # Click "Submit" in Chrome only
Drag and Drop
py drag.py 100 100 500 300 # Drag from (100,100) to (500,300)
Robust Automation (Wait/Find)
py wait_for_text.py "Ready" "App" 30 # Wait up to 30s for text
py wait_for_window.py "Notepad" 10 # Wait for window to appear
py find_text.py "Login" "Chrome" # Get coordinates of text
py list_windows.py # List all open windows
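The wait scripts take a timeout in seconds. How they are implemented is not stated here; a common approach is to poll a check until a deadline passes. A minimal sketch of that pattern, with `wait_until` and its parameters as hypothetical names:

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 0.5,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # timed out without the condition becoming true
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays; in normal use the defaults are enough, for example `wait_until(lambda: "Ready" in read_text(), timeout=30)` where `read_text` is whatever function returns the window text.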
Read Window Text
py read_window.py "Notepad" # Read all text from Notepad
py read_window.py "Visual Studio" # Read text from VS Code
py read_window.py "Chrome" # Read text from browser
Uses Windows UI Automation to extract actual text (not OCR). Much faster and more accurate than screenshots!
Read UI Elements (NEW!)
py read_ui_elements.py "Chrome" # All interactive elements
py read_ui_elements.py "Chrome" --buttons-only # Just buttons
py read_ui_elements.py "Chrome" --links-only # Just links
py read_ui_elements.py "Chrome" --json # JSON output
Returns buttons, links, tabs, checkboxes, dropdowns with coordinates for clicking.
Read Webpage Content (NEW!)
py read_webpage.py # Read active browser
py read_webpage.py "Chrome" # Target Chrome specifically
py read_webpage.py "Chrome" --buttons # Include buttons
py read_webpage.py "Chrome" --links # Include links with coords
py read_webpage.py "Chrome" --full # All elements (inputs, images)
py read_webpage.py "Chrome" --json # JSON output
Enhanced browser content extraction with headings, text, buttons, and links.
Handle Dialogs (NEW!)
# List all open dialogs
py handle_dialog.py list
# Read current dialog content
py handle_dialog.py read
py handle_dialog.py read --json
# Click button in dialog
py handle_dialog.py click "OK"
py handle_dialog.py click "Save"
py handle_dialog.py click "Yes"
# Type into dialog text field
py handle_dialog.py type "myfile.txt"
py handle_dialog.py type "C:\path\to\file" --field 0
# Dismiss dialog (auto-finds OK/Close/Cancel)
py handle_dialog.py dismiss
# Wait for dialog to appear
py handle_dialog.py wait --timeout 10
py handle_dialog.py wait "Save As" --timeout 5
Handles Save/Open dialogs, message boxes, alerts, confirmations, etc.
Click Element by Name (NEW!)
py click_element.py "Save" # Click "Save" anywhere
py click_element.py "OK" --window "Notepad" # In specific window
py click_element.py "Submit" --type Button # Only buttons
py click_element.py "File" --type MenuItem # Menu items
py click_element.py --list # List clickable elements
py click_element.py --list --window "Chrome" # List in specific window
Click buttons, links, menu items by name without needing coordinates.
Read Screen Region (OCR - Optional)
py read_region.py 100 100 500 300 # Read text from coordinates
Note: Requires Tesseract OCR installation. Use read_window.py instead for better results.
Workflow Pattern
1. Read window - Extract text from specific window (fast, accurate)
2. Read UI elements - Get buttons, links with coordinates
3. Screenshot (if needed) - See visual layout
4. Act - Click element by name or coordinates
5. Handle dialogs - Interact with popups/save dialogs
6. Read window - Verify changes
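Steps 1, 4 and 6 of the pattern above (read, act, verify) can be sketched as a small Python wrapper around the scripts. This is an illustration, not part of the skill: `run_script` and `click_and_verify` are hypothetical helpers, and the sketch assumes each script prints its result to stdout and exits non-zero on failure.

```python
import subprocess
from typing import Callable

SCRIPTS_DIR = "skills/windows-control/scripts"  # location given in Quick Start


def run_script(name: str, *args: str, runner: Callable = subprocess.run) -> str:
    """Run one skill script with the py launcher and return its stdout."""
    completed = runner(
        ["py", f"{SCRIPTS_DIR}/{name}", *args],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


def click_and_verify(window: str, element: str, expected: str,
                     run: Callable[..., str] = run_script) -> bool:
    """Read the window, click an element by name, then read again to verify."""
    before = run("read_window.py", window)
    run("click_element.py", element, "--window", window)
    after = run("read_window.py", window)
    # Assumption: success means the text changed and now contains `expected`.
    return after != before and expected in after
```

For example, `click_and_verify("Notepad", "Save", "Saved")` would report whether the window text changed to include "Saved" after the click.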
Screen Coordinates
- Origin (0, 0) is the top-left corner
- Your screen: 2560x1440 (check with screenshot)
- Use coordinates from screenshot analysis
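If a screenshot is downscaled before it is analyzed (an assumption; nothing here says the skill resizes images), coordinates found in the image have to be mapped back to screen pixels before clicking. A minimal sketch, with `to_screen_coords` as a hypothetical helper and the screen size taken from the list above:

```python
def to_screen_coords(x: float, y: float, image_size: tuple[int, int],
                     screen_size: tuple[int, int] = (2560, 1440)) -> tuple[int, int]:
    """Map a point in a (possibly resized) screenshot to absolute screen pixels."""
    img_w, img_h = image_size
    scr_w, scr_h = screen_size
    sx = round(x * scr_w / img_w)
    sy = round(y * scr_h / img_h)
    # Clamp to the valid pixel range; the origin (0, 0) is the top-left corner.
    return min(max(sx, 0), scr_w - 1), min(max(sy, 0), scr_h - 1)
```

The result can be passed straight to click.py or mouse_move.py, since those take absolute coordinates.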
Examples
Open Notepad and type
# Press Windows key
py key_press.py "win"
# Type "notepad"
py type_text.py "notepad"
# Press Enter
py key_press.py "enter"
# Wait a moment, then type
py type_text.py "Hello from AI!"
# Save
py key_press.py "ctrl+s"
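The "wait a moment" step matters because Notepad needs time to open before it can receive text. One way to script the whole sequence is a list of steps with a pause after each. This is a sketch only: `run_steps` is a hypothetical helper and the pause lengths are guesses that may need tuning per machine.

```python
import subprocess
import time

SCRIPTS_DIR = "skills/windows-control/scripts"  # location given in Quick Start

# (script, argument, pause in seconds afterwards); pause values are assumptions.
NOTEPAD_STEPS = [
    ("key_press.py", "win", 0.5),
    ("type_text.py", "notepad", 0.5),
    ("key_press.py", "enter", 2.0),   # give Notepad time to open
    ("type_text.py", "Hello from AI!", 0.2),
    ("key_press.py", "ctrl+s", 0.0),
]


def run_steps(steps, runner=subprocess.run, sleep=time.sleep) -> None:
    """Run each script in order, pausing after it so the UI can catch up."""
    for script, arg, pause in steps:
        # check=True stops the sequence if a script exits with an error.
        runner(["py", f"{SCRIPTS_DIR}/{script}", arg], check=True)
        if pause:
            sleep(pause)
```

Calling `run_steps(NOTEPAD_STEPS)` replays the commands above in order. A fixed pause is the simplest option; waiting for the window with wait_for_window.py is likely more reliable than guessing a delay.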
Click in VS Code
# Read current VS Code content
py read_window.py "Visual Studio Code"
# Click at specific location (e.g., file explorer)
py click.py 50 100
# Type filename
py type_text.py "test.js"
# Press Enter
py key_press.py "enter"
# Verify new file opened
py read_window.py "Visual Studio Code"
Monitor Notepad changes
# Read current content
py read_window.py "Notepad"
# User types something...
# Read updated content (no screenshot needed!)
py read_window.py "Notepad"
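To see what changed between the two reads, the captured text can be compared line by line. A minimal sketch using Python's standard difflib module; `added_lines` is a hypothetical helper, and it assumes the output of read_window.py has been captured as a string both times.

```python
import difflib


def added_lines(before: str, after: str) -> list[str]:
    """Return the lines present in `after` that were not in `before`."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes new lines with "+ "; strip that two-character marker.
    return [line[2:] for line in diff if line.startswith("+ ")]
```

For example, `added_lines("Hello", "Hello\nWorld")` returns `["World"]`.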
Text Reading Methods
Method 1: Windows UI Automation (BEST)
- Use read_window.py for any window
- Use read_ui_elements.py for buttons/links with coordinates
- Use read_webpage.py for browser content with structure
- Gets actual text data (not image-based)

- Use click_element.py to click buttons/links by name
- No coordinates needed - finds elements automatically
- Works across all windows or target specific window

- Use handle_dialog.py for popups, save dialogs, alerts
- Read dialog content, click buttons, type text
- Auto-dismiss with common buttons (OK, Cancel, etc.)

- Take full screenshot
- AI reads text visually
- Slower but works for any content

- Use read_region.py with Tesseract
- Requires additional installation
- Good for images/PDFs with text
Safety Features
- pyautogui.FAILSAFE = True (move mouse to top-left to abort)
- Small delays between actions
- Smooth mouse movements (not instant jumps)
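The three safety features above map onto settings that pyautogui exposes: `FAILSAFE`, `PAUSE`, and the `duration` argument of `moveTo`. The sketch below shows one way they could be applied; it is not the skill's actual code. The helper names are hypothetical and the 0.1 s pause is an assumed value. The module is passed in as a parameter so the helpers can be tried without a display.

```python
def apply_safety_defaults(gui, pause: float = 0.1) -> None:
    """Enable the corner abort and add a short delay after every action."""
    gui.FAILSAFE = True   # moving the mouse to the top-left corner aborts
    gui.PAUSE = pause     # seconds to wait after each pyautogui call (assumed value)


def smooth_move(gui, x: int, y: int, duration: float = 0.2) -> bool:
    """Glide to (x, y); return False if the user triggered the fail-safe."""
    try:
        gui.moveTo(x, y, duration=duration)
        return True
    except gui.FailSafeException:
        return False


# Usage on a real desktop:
#   import pyautogui
#   apply_safety_defaults(pyautogui)
#   smooth_move(pyautogui, 500, 300)
```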
Requirements
- Python 3.11+
- pyautogui (installed ✅)
- pillow (installed ✅)
Tips
- Always screenshot first to see current state
- Coordinates are absolute (not relative to windows)
- Wait briefly after clicks for UI to update
- Use ctrl+z friendly actions when possible
---
Status: ✅ READY FOR USE (v2.0 - Dialog & UI Elements)
Created: 2026-02-01
Updated: 2026-02-02