12 KiB
name, description, description_zh, allowed-tools
| name | description | description_zh | allowed-tools |
|---|---|---|---|
| agent-browser | Browser automation for AI agents. Triggers on requests to open websites, fill forms, click buttons, take screenshots, scrape data, test web apps, login to sites, or automate browser tasks. | 浏览器自动化工具。当用户需要打开网站、填写表单、点击按钮、截图、抓取数据、测试 Web 应用、登录网站或执行浏览器自动化任务时使用。 | Bash(agent-browser:*) |
Browser Automation with agent-browser
First-Time Setup (Important!)
Before using agent-browser for the first time, you MUST install the browser binary:
agent-browser install
This downloads and installs Chromium. You only need to do this once.
Session Requirement (Critical!)
Always specify a session name when using agent-browser. Without it, you may encounter the error:
"Browser not launched. Call launch first."
There are two ways to specify a session:
- Environment variable (recommended for chained commands):
export AGENT_BROWSER_SESSION=mysession
agent-browser --headed open https://example.com
agent-browser snapshot -i
agent-browser click @e1
- Command-line flag:
agent-browser --session mysession --headed open https://example.com
agent-browser --session mysession snapshot -i
agent-browser --session mysession click @e1
The session name can be any string (e.g., taobao, google, test1). Using consistent session names allows you to maintain browser state across multiple commands.
Browser Mode
Mode Selection Rules:
- Virtual Machine / Remote Server: Use
--headlessmode (no display available) - Local Machine (Default): Use
--headedmode (visible browser window for better debugging) - User Request: Follow user's explicit preference (headless/无头模式 or headed/有头模式)
How to detect environment:
- Check
<env>section in system context for platform info - VM indicators: Linux without display, SSH session, Docker container, cloud server
- If uncertain, try
--headedfirst; if it fails with display error, fallback to--headless
Options:
--headed: Visible browser window (default for local machines)--headless: Invisible browser (default for VMs/servers, or when user requests)
Core Workflow
Every browser automation follows this pattern:
- Setup session:
export AGENT_BROWSER_SESSION=mysite - Navigate:
agent-browser --headed open <url> - Snapshot:
agent-browser snapshot -i(get element refs like@e1,@e2) - Interact: Use refs to click, fill, select
- Re-snapshot: After navigation or DOM changes, get fresh refs
# Set session first (or use --session flag on each command)
export AGENT_BROWSER_SESSION=mysite
agent-browser --headed open https://example.com/form
agent-browser snapshot -i
# Output: @e1 [input type="email"], @e2 [input type="password"], @e3 [button] "Submit"
agent-browser fill @e1 "user@example.com"
agent-browser fill @e2 "password123"
agent-browser click @e3
agent-browser wait --load networkidle
agent-browser snapshot -i # Check result
Essential Commands
Note: All commands below assume you have set AGENT_BROWSER_SESSION or use --session <name>.
# First-time setup (run once)
agent-browser install # Install Chromium browser
# Navigation (ALWAYS use --headed by default)
agent-browser --headed open <url> # Navigate with visible browser (DEFAULT)
agent-browser --headless open <url> # Navigate headless (only if user explicitly requests)
agent-browser close # Close browser
# Snapshot
agent-browser snapshot -i # Interactive elements with refs (recommended)
agent-browser snapshot -s "#selector" # Scope to CSS selector
# Interaction (use @refs from snapshot)
agent-browser click @e1 # Click element
agent-browser fill @e2 "text" # Clear and type text
agent-browser type @e2 "text" # Type without clearing
agent-browser select @e1 "option" # Select dropdown option
agent-browser check @e1 # Check checkbox
agent-browser press Enter # Press key
agent-browser scroll down 500 # Scroll page
# Get information
agent-browser get text @e1 # Get element text
agent-browser get url # Get current URL
agent-browser get title # Get page title
# Wait
agent-browser wait @e1 # Wait for element
agent-browser wait --load networkidle # Wait for network idle
agent-browser wait --url "**/page" # Wait for URL pattern
agent-browser wait 2000 # Wait milliseconds
# Capture
agent-browser screenshot # Screenshot to temp dir
agent-browser screenshot --full # Full page screenshot
agent-browser pdf output.pdf # Save as PDF
Common Patterns
Form Submission
export AGENT_BROWSER_SESSION=signup
agent-browser --headed open https://example.com/signup
agent-browser snapshot -i
agent-browser fill @e1 "Jane Doe"
agent-browser fill @e2 "jane@example.com"
agent-browser select @e3 "California"
agent-browser check @e4
agent-browser click @e5
agent-browser wait --load networkidle
Authentication with State Persistence
# Login once and save state
export AGENT_BROWSER_SESSION=auth
agent-browser --headed open https://app.example.com/login
agent-browser snapshot -i
agent-browser fill @e1 "$USERNAME"
agent-browser fill @e2 "$PASSWORD"
agent-browser click @e3
agent-browser wait --url "**/dashboard"
agent-browser state save auth.json
# Reuse in future sessions
agent-browser state load auth.json
agent-browser --headed open https://app.example.com/dashboard
Data Extraction
export AGENT_BROWSER_SESSION=scrape
agent-browser --headed open https://example.com/products
agent-browser snapshot -i
agent-browser get text @e5 # Get specific element text
agent-browser get text body > page.txt # Get all page text
# JSON output for parsing
agent-browser snapshot -i --json
agent-browser get text @e1 --json
Parallel Sessions
agent-browser --headed --session site1 open https://site-a.com
agent-browser --headed --session site2 open https://site-b.com
agent-browser --session site1 snapshot -i
agent-browser --session site2 snapshot -i
agent-browser session list
Headless Mode (Only When Requested)
# Only use headless when user explicitly requests it
agent-browser --headless open https://example.com
Debugging Tools
agent-browser --headed open https://example.com
agent-browser highlight @e1 # Highlight element
agent-browser record start demo.webm # Record session
iOS Simulator (Mobile Safari)
# List available iOS simulators
agent-browser device list
# Launch Safari on a specific device
agent-browser -p ios --device "iPhone 16 Pro" open https://example.com
# Same workflow as desktop - snapshot, interact, re-snapshot
agent-browser -p ios snapshot -i
agent-browser -p ios tap @e1 # Tap (alias for click)
agent-browser -p ios fill @e2 "text"
agent-browser -p ios swipe up # Mobile-specific gesture
# Take screenshot
agent-browser -p ios screenshot mobile.png
# Close session (shuts down simulator)
agent-browser -p ios close
Requirements: macOS with Xcode, Appium (npm install -g appium && appium driver install xcuitest)
Real devices: Works with physical iOS devices if pre-configured. Use --device "<UDID>" where UDID is from xcrun xctrace list devices.
Ref Lifecycle (Important)
Refs (@e1, @e2, etc.) are invalidated when the page changes. Always re-snapshot after:
- Clicking links or buttons that navigate
- Form submissions
- Dynamic content loading (dropdowns, modals)
agent-browser click @e5 # Navigates to new page
agent-browser snapshot -i # MUST re-snapshot
agent-browser click @e1 # Use new refs
Semantic Locators (Alternative to Refs)
When refs are unavailable or unreliable, use semantic locators:
agent-browser find text "Sign In" click
agent-browser find label "Email" fill "user@test.com"
agent-browser find role button click --name "Submit"
agent-browser find placeholder "Search" type "query"
agent-browser find testid "submit-btn" click
Deep-Dive Documentation
| Reference | When to Use |
|---|---|
| references/commands.md | Full command reference with all options |
| references/snapshot-refs.md | Ref lifecycle, invalidation rules, troubleshooting |
| references/session-management.md | Parallel sessions, state persistence, concurrent scraping |
| references/authentication.md | Login flows, OAuth, 2FA handling, state reuse |
| references/video-recording.md | Recording workflows for debugging and documentation |
| references/proxy-support.md | Proxy configuration, geo-testing, rotating proxies |
Ready-to-Use Templates
| Template | Description |
|---|---|
| templates/form-automation.sh | Form filling with validation |
| templates/authenticated-session.sh | Login once, reuse state |
| templates/capture-workflow.sh | Content extraction with screenshots |
./templates/form-automation.sh https://example.com/form
./templates/authenticated-session.sh https://app.example.com/login
./templates/capture-workflow.sh https://example.com ./output
Troubleshooting
"Browser not launched. Call launch first."
This error can occur due to two reasons:
Reason 1: No session specified
Fix by setting the session environment variable or using the --session flag:
export AGENT_BROWSER_SESSION=mysession
agent-browser --headed open https://example.com
Or:
agent-browser --session mysession --headed open https://example.com
Reason 2: Stale/zombie sessions (more common)
If you've specified a session but still get this error, there may be stale sessions from previous runs where the browser process has terminated but the session record remains. This is the most common cause.
First, check for existing sessions:
agent-browser session list
If you see sessions listed, close them before starting a new one:
# Close all stale sessions first, then open
agent-browser close 2>/dev/null
agent-browser --session mysession close 2>/dev/null
sleep 1
agent-browser --session mysession --headed open https://example.com
Recommended pattern for reliable startup:
# Always close potential stale sessions before opening
agent-browser --session mysite close 2>/dev/null; sleep 1; agent-browser --session mysite --headed open https://example.com
Browser binary not found
Run the install command first:
agent-browser install
Other Issues
If agent-browser encounters issues (installation problems, display errors, compatibility issues), suggest the user to:
Alternative: Enable Browser MCP Server
The built-in Browser MCP server provides similar browser automation capabilities through MCP tools. Guide the user to:
- Open Settings > MCP Servers
- Find and enable the Browser MCP server
- Use the MCP browser tools (
mcp__browser-use__*) instead ofagent-browserCLI
This is especially useful when:
agent-browserCLI is not installed or has dependency issues- Running in environments where CLI tools are restricted
- User prefers MCP-based browser automation