feat: support operator-browserbase (#132)
* feat: support operator-browserbase

* chore: md

* chore: test

* chore: types

* release: publish beta packages

* chore: docs

* chore: typo

* refactor: pnpm

* chore: reset gui agent

* chore: types predictionParsed Close #136

* chore: types

* fix: onData end trigger twice

* chore: typo

* chore: mouse speed

* chore: copyright

* chore: type

* chore: publish
ycjcl868 authored Feb 25, 2025
1 parent 5e7f016 commit 4e0883f
Showing 75 changed files with 7,997 additions and 272 deletions.
9 changes: 9 additions & 0 deletions .changeset/fast-insects-flash.md
@@ -0,0 +1,9 @@
---
'@ui-tars/operator-browserbase': patch
'@ui-tars/operator-nut-js': patch
'@ui-tars/shared': patch
'@ui-tars/cli': patch
'@ui-tars/sdk': patch
---

chore: open-operator
9 changes: 6 additions & 3 deletions .changeset/pre.json
@@ -4,15 +4,18 @@
"initialVersions": {
"ui-tars-desktop": "0.0.6",
"@ui-tars/action-parser": "1.2.0-beta.10",
"@ui-tars/cli": "1.2.0-beta.10",
"@ui-tars/cli": "1.2.0-beta.12",
"@ui-tars/electron-ipc": "1.2.0-beta.10",
"@ui-tars/operator-nut-js": "1.2.0-beta.10",
"@ui-tars/sdk": "1.2.0-beta.10",
"@ui-tars/operator-browserbase": "1.2.0-beta.11",
"@ui-tars/operator-nut-js": "1.2.0-beta.11",
"@ui-tars/sdk": "1.2.0-beta.11",
"@ui-tars/shared": "1.2.0-beta.10",
"@ui-tars/utio": "1.2.0-beta.10"
},
"changesets": [
"fast-insects-flash",
"selfish-humans-drive",
"short-shoes-tap",
"strange-schools-help",
"witty-points-rescue"
]
7 changes: 7 additions & 0 deletions .changeset/short-shoes-tap.md
@@ -0,0 +1,7 @@
---
'@ui-tars/operator-browserbase': patch
'@ui-tars/operator-nut-js': patch
'@ui-tars/sdk': patch
---

chore: types
3 changes: 2 additions & 1 deletion CONTRIBUTING.md
@@ -72,7 +72,8 @@ This is a [Monorepo](https://pnpm.io/workspaces) project including the following
│   ├── utio # UTIO (UI-TARS Insights and Observation)
│   ├── visualizer # Sharing HTML Visualization Reporter
│ └── operators # Automation operators
│   └── nut-js # Nut.js is a framework for building automation operators
│ ├── browserbase # Browserbase integration
│   └── nut-js # Nut.js integration
├── docs # Documentation of the project
├── rfcs # RFCs (Request for Comments) for the project
99 changes: 86 additions & 13 deletions docs/sdk.md
@@ -69,17 +69,38 @@ Ok to proceed? (y) y
## Agent Execution Process

```mermaid
flowchart LR
User[Instruction] --> Agent[GUIAgent]
Agent --> Operator[Operator]
Operator --> Screenshot[Screenshot]
Screenshot --> Model[Model]
Model --> Prediction[Invoke]
Prediction --> Agent
Agent --> Operator
Operator --> Action[Execute]
sequenceDiagram
participant user as User
participant guiAgent as GUI Agent
participant model as UI-TARS Model
participant operator as Operator
user -->> guiAgent: "`instruction` + <br /> `Operator.MANUAL.ACTION_SPACES`"
activate user
activate guiAgent
loop status === StatusEnum.RUNNING
guiAgent ->> operator: screenshot()
activate operator
operator -->> guiAgent: base64, Physical screen size
deactivate operator
guiAgent ->> model: instruction + actionSpaces + screenshots.slice(-5)
model -->> guiAgent: `prediction`: click(start_box='(27,496)')
guiAgent -->> user: prediction, next action
guiAgent ->> operator: execute(prediction)
activate operator
operator -->> guiAgent: success
deactivate operator
end
deactivate guiAgent
deactivate user
```
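
The sequence diagram above can be read as a loop of screenshot, predict, and execute. The following is a minimal sketch of that loop, not the SDK's actual implementation: `SketchOperator`, `SketchModel`, `runAgentLoop`, `makeScriptedModel`, the `'running'`/`'end'` status strings, and the `maxSteps` cap are all hypothetical names introduced here for illustration.

```typescript
// Hypothetical stand-ins for illustration; not the real @ui-tars/sdk types.
type Status = 'running' | 'end';

interface Screenshot {
  base64: string;
  width: number;
  height: number;
}

interface SketchOperator {
  screenshot(): Promise<Screenshot>;
  execute(prediction: string): Promise<{ status: Status }>;
}

type SketchModel = (input: {
  instruction: string;
  actionSpaces: string[];
  screenshots: string[];
}) => Promise<string>;

// Repeat screenshot -> predict -> execute while the status is still 'running'.
async function runAgentLoop(
  instruction: string,
  actionSpaces: string[],
  operator: SketchOperator,
  model: SketchModel,
  maxSteps = 10, // safety cap, an assumption of this sketch
): Promise<string[]> {
  const screenshots: string[] = [];
  const predictions: string[] = [];
  let status: Status = 'running';

  for (let step = 0; step < maxSteps && status === 'running'; step++) {
    const shot = await operator.screenshot();
    screenshots.push(shot.base64);

    // Only the most recent five screenshots are sent, as in the diagram.
    const prediction = await model({
      instruction,
      actionSpaces,
      screenshots: screenshots.slice(-5),
    });
    predictions.push(prediction);

    const result = await operator.execute(prediction);
    status = result.status;
  }
  return predictions;
}

// Mock operator: reports 'end' once the model predicts finished().
const mockOperator: SketchOperator = {
  async screenshot() {
    return { base64: 'fake-image', width: 1920, height: 1080 };
  },
  async execute(prediction) {
    const status: Status = prediction.startsWith('finished') ? 'end' : 'running';
    return { status };
  },
};

// Mock model factory: replays a fixed script of predictions in order.
function makeScriptedModel(script: string[]): SketchModel {
  let index = 0;
  return async () => script[Math.min(index++, script.length - 1)];
}
```

With the mocks above, `runAgentLoop('open settings', actionSpaces, mockOperator, makeScriptedModel([...]))` stops as soon as the scripted model returns `finished()`, or after `maxSteps` iterations if it never does.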


### Basic Usage

Basic usage relies on the `@ui-tars/sdk` package. Here's a basic example of using the SDK:
@@ -259,22 +280,57 @@ interface ExecuteParams {
Advanced SDK usage relies on `@ui-tars/sdk/core`. You can create custom operators by extending the base `Operator` class:

```typescript
import { Operator, ScreenshotOutput, ExecuteParams } from '@ui-tars/sdk/core';
import {
Operator,
parseBoxToScreenCoords,
type ScreenshotOutput,
type ExecuteParams,
type ExecuteOutput,
} from '@ui-tars/sdk/core';
import { Jimp } from 'jimp';

export class CustomOperator extends Operator {
// Define the action spaces and their descriptions, to be spliced into the UI-TARS system prompt
static MANUAL = {
ACTION_SPACES: [
'click(start_box="") # click on the element at the specified coordinates',
'type(content="") # type the specified content into the current input field',
'scroll(direction="") # scroll the page in the specified direction',
'finished() # finish the task',
// ...more_actions
],
};

public async screenshot(): Promise<ScreenshotOutput> {
// Implement screenshot functionality
const base64 = 'base64-encoded-image';
const buffer = Buffer.from(base64, 'base64');
const image = await Jimp.read(buffer);

return {
base64: 'base64-encoded-image',
width: 1920,
height: 1080,
width: image.width,
height: image.height,
scaleFactor: 1
};
}

async execute(params: ExecuteParams): Promise<void> {
async execute(params: ExecuteParams): Promise<ExecuteOutput> {
const { parsedPrediction, screenWidth, screenHeight, scaleFactor } = params;
// Implement action execution logic

// if click action, get coordinates from parsedPrediction
const startBoxStr = parsedPrediction?.action_inputs?.start_box || '';
const { x: startX, y: startY } = parseBoxToScreenCoords({
boxStr: startBoxStr,
screenWidth,
screenHeight,
});

if (parsedPrediction?.action_type === 'finished') {
// finish the GUIAgent task
return { status: StatusEnum.END };
}
}
}
```
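
The `execute()` example above converts `start_box` into pixel coordinates with `parseBoxToScreenCoords`. As a rough illustration of that kind of conversion, here is a minimal sketch which assumes the box string holds relative coordinates in the range [0, 1], given either as a point or as a box whose center is used. `boxToScreenPoint` is a hypothetical helper written for this note; it is not the SDK's implementation, and the real function may parse a different format.

```typescript
// Hypothetical helper for illustration; NOT the SDK's parseBoxToScreenCoords.
// Assumption: boxStr looks like "[x1,y1,x2,y2]" or "[x,y]" with values in [0, 1].
function boxToScreenPoint(
  boxStr: string,
  screenWidth: number,
  screenHeight: number,
): { x: number; y: number } | null {
  const nums = boxStr
    .replace(/[\[\]()]/g, '') // drop surrounding brackets or parentheses
    .split(',')
    .map((part) => Number.parseFloat(part.trim()));

  // Reject empty or malformed input instead of guessing a position.
  if (nums.length < 2 || nums.some((n) => Number.isNaN(n))) {
    return null;
  }

  // A two-value input is treated as a point; a four-value input as a box.
  const [x1, y1, x2 = x1, y2 = y1] = nums;

  // Use the center of the box and scale it to screen pixels.
  return {
    x: Math.round(((x1 + x2) / 2) * screenWidth),
    y: Math.round(((y1 + y2) / 2) * screenHeight),
  };
}
```

Returning `null` for an empty string matters here because the example falls back to `''` when the prediction has no `start_box`, for instance on a `finished()` action.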
@@ -283,6 +339,23 @@ Required methods:
- `screenshot()`: Captures the current screen state
- `execute()`: Performs the requested action based on model predictions

Optional static properties:
- `MANUAL`: Describes the operator to the UI-TARS model
  - `ACTION_SPACES`: The actions the operator supports, each with a short description, to be spliced into the system prompt

The action spaces are loaded into `GUIAgent` through the system prompt:

```ts
const guiAgent = new GUIAgent({
// ... other config
systemPrompt: `
// ... other system prompt
${CustomOperator.MANUAL.ACTION_SPACES.join('\n')}
`,
operator: new CustomOperator(),
});
```

### Custom Model Implementation

You can implement custom model logic by extending the `UITarsModel` class:
8 changes: 8 additions & 0 deletions examples/operator-browserbase/.env.example
@@ -0,0 +1,8 @@
# UI-TARS API Configuration
UI_TARS_BASE_URL=your_ui_tars_base_url_here
UI_TARS_API_KEY=your_ui_tars_api_key_here
UI_TARS_MODEL=your_ui_tars_model_here

# Browserbase Configuration
BROWSERBASE_API_KEY=your_browserbase_api_key_here
BROWSERBASE_PROJECT_ID=your_browserbase_project_id_here
81 changes: 81 additions & 0 deletions examples/operator-browserbase/.gitignore
@@ -0,0 +1,81 @@
# Compiled source #
###################
*.com
*.class
*.dll
*.exe
*.o
*.so

# Packages #
############
# it's better to unpack these files and commit the raw source
# git has its own built in compression methods
*.7z
*.dmg
*.gz
*.iso
*.jar
*.rar
*.tar
*.zip

# Logs and databases #
######################
*.log
*.sql
*.sqlite

# OS generated files #
######################
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# IDE and Editor folders #
##########################
.idea/
.vscode/
*.swp
*.swo
*~

# Node.js #
###########
node_modules/
npm-debug.log
.next

# Python #
##########
*.py[cod]
__pycache__/
*.so

# Java #
########
*.class
*.jar
*.war
*.ear

# Gradle #
##########
.gradle
/build/

# Maven #
#########
target/

# Miscellaneous #
#################
*.bak
*.tmp
*.temp
.env
.env.local
92 changes: 92 additions & 0 deletions examples/operator-browserbase/README.md
@@ -0,0 +1,92 @@
# Open Operator

> [!WARNING]
> This is simply a proof of concept.
> Browserbase aims not to compete with web agents, but rather to provide all the necessary tools for anybody to build their own web agent. We strongly recommend you check out both [Browserbase](https://www.browserbase.com) and our open source project [Stagehand](https://www.stagehand.dev) to build your own web agent.

[![Deploy with Vercel](https://vercel.com/button)](https://vercel.com/new/clone?repository-url=https%3A%2F%2Fgithub.com%2Fbrowserbase%2Fopen-operator&env=OPENAI_API_KEY,BROWSERBASE_API_KEY,BROWSERBASE_PROJECT_ID&envDescription=API%20keys%20needed%20to%20run%20Open%20Operator&envLink=https%3A%2F%2Fgithub.com%2Fbrowserbase%2Fopen-operator%23environment-variables)

https://github.com/user-attachments/assets/354c3b8b-681f-4ad0-9ab9-365dbde894af

## Getting Started

First, install the dependencies for this repository. This requires [pnpm](https://pnpm.io/installation#using-other-package-managers).

<!-- This doesn't work with NPM, haven't tested with yarn -->

```bash
pnpm install
```

Next, copy the example environment variables:

```bash
cp .env.example .env.local
```

You'll need to set up your API keys:

1. Set up a UI-TARS model service (base URL, API key, and model name) by following [UI-TARS](https://github.com/bytedance/UI-TARS)
2. Get your Browserbase API key and project ID from [Browserbase](https://www.browserbase.com)

Update `.env.local` with your API keys:

- `UI_TARS_BASE_URL`: Your UI-TARS base URL
- `UI_TARS_API_KEY`: Your UI-TARS API key
- `UI_TARS_MODEL`: Your UI-TARS model name
- `BROWSERBASE_API_KEY`: Your Browserbase API key
- `BROWSERBASE_PROJECT_ID`: Your Browserbase project ID
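
A missing or empty variable in `.env.local` is easier to diagnose if it is caught at startup. The sketch below shows one way to check the five variables listed above. `readConfig` and `REQUIRED_ENV` are hypothetical names for this example and are not part of the example app; in a Node.js process you would pass `process.env` as the argument.

```typescript
// The variable names come from .env.example; everything else is illustrative.
const REQUIRED_ENV = [
  'UI_TARS_BASE_URL',
  'UI_TARS_API_KEY',
  'UI_TARS_MODEL',
  'BROWSERBASE_API_KEY',
  'BROWSERBASE_PROJECT_ID',
] as const;

type EnvKey = (typeof REQUIRED_ENV)[number];

// Return a fully populated config, or throw with the names of the missing variables.
function readConfig(env: Record<string, string | undefined>): Record<EnvKey, string> {
  const missing = REQUIRED_ENV.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === '';
  });

  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  const config = {} as Record<EnvKey, string>;
  for (const key of REQUIRED_ENV) {
    config[key] = env[key] as string; // safe: checked above
  }
  return config;
}

// Sample input with made-up values, standing in for process.env.
const sampleConfig = readConfig({
  UI_TARS_BASE_URL: 'https://example.invalid/v1',
  UI_TARS_API_KEY: 'sample-key',
  UI_TARS_MODEL: 'sample-model',
  BROWSERBASE_API_KEY: 'sample-browserbase-key',
  BROWSERBASE_PROJECT_ID: 'sample-project',
});
```

Reporting every missing name in one error saves a round trip compared with failing on the first missing variable.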

Then, run the development server:

<!-- This doesn't work with NPM, haven't tested with yarn -->

```bash
pnpm dev
```

Open [http://localhost:3000](http://localhost:3000) with your browser to see Open Operator in action.

## How It Works

Building a web agent is a complex task. You need to understand the user's intent, convert it into headless browser operations, and execute actions, each of which can be incredibly complex on its own.

![public/agent_mess.png](public/agent_mess.png)

Stagehand is a tool that helps you build web agents. It allows you to convert natural language into headless browser operations, execute actions on the browser, and extract results back into structured data.

![public/stagehand_clean.png](public/stagehand_clean.png)

Under the hood, we have a very simple agent loop that just calls Stagehand to convert the user's intent into headless browser operations, and then calls Browserbase to execute those operations.

![public/agent_loop.png](public/agent_loop.png)

Stagehand uses Browserbase to execute actions on the browser, and OpenAI to understand the user's intent.

For more on this, check out the code at [this commit](https://github.com/browserbase/open-operator/blob/6f2fba55b3d271be61819dc11e64b1ada52646ac/index.ts).

### Key Technologies

- **[Browserbase](https://www.browserbase.com)**: Powers the core browser automation and interaction capabilities
- **[Stagehand](https://www.stagehand.dev)**: Handles precise DOM manipulation and state management
- **[Next.js](https://nextjs.org)**: Provides the modern web framework foundation
- **[OpenAI](https://openai.com)**: Enables natural language understanding and decision making

## Contributing

We welcome contributions! Whether it's:

- Adding new features
- Improving documentation
- Reporting bugs
- Suggesting enhancements

Please feel free to open issues and pull requests.

## License

Open Operator is open source software licensed under the MIT license.

## Acknowledgments

This project is inspired by OpenAI's Operator feature and builds upon various open source technologies including Next.js, React, Browserbase, and Stagehand.