Merge pull request #42 from fixie-ai/ben/add-websockets

Add websockets and data messages
fixie-ai · Nov 25, 2024 · 46d90ca
2 parents 6bc2478 + 87dfc3f
commit 46d90ca
Showing 5 changed files with 282 additions and 114 deletions.
diff --git a/docs/astro.config.mjs b/docs/astro.config.mjs
@@ -37,9 +37,9 @@ export default defineConfig({
         label: 'Guides',
         collapsed: false,
         items: [
+          'guides/connectionoptions',
           'tools',
           'guides/stages',
-          'guides/telephony',
           'guides/clienttoolstutorial',
           'guides/callstagestutorial',
         ]
@@ -52,8 +52,13 @@ export default defineConfig({
       },
       {
         label: 'SDK',
-        link: 'sdk'
+        collapsed: false,
+        items: [
+          'sdk',
+          'datamessages'
+        ]
       },
+
     ],
     components: {}
   }), 

diff --git a/docs/src/content/docs/datamessages.mdx b/docs/src/content/docs/datamessages.mdx
@@ -0,0 +1,88 @@
+---
+title: "Data Messages"
+description: Protocol documentation for messages exchanged between client and server during Ultravox calls.
+---
+
+Data messages are used to communicate non-audio information between your client and an Ultravox server during calls. These messages work across WebRTC data channels and WebSocket connections.
+
+All messages are JSON objects with camelCase keys containing:
+- A required `type` field identifying the message type
+- Additional fields specific to each message type
+
+## Messages at a Glance
+The table below lists every message type; details on each appear in the sections that follow. The Sender column shows who sends the message: client messages go from the client to the server, and server messages go from the server to the client.
+| Message | Sender | Description |
+| --------------------------------------------- | ------ | ---------------------------------------------------- |
+| [Ping](#ping)                                 | Client | Measures round-trip data latency.                    |
+| [Pong](#pong)                                 | Server | Server reply to a ping message.                      |
+| [State](#state)                               | Server | Indicates the server's current state.                |
+| [Transcript](#transcript)                     | Server | Contains text for an utterance made during the call.                  |
+| [InputTextMessage](#inputtextmessage)         | Client | Used to send a user message to the agent via text.   |
+| [SetOutputMedium](#setoutputmedium)           | Client | Sets server's output medium to text or voice.        |
+| [ClientToolInvocation](#clienttoolinvocation) | Server | Asks the client to invoke a client tool.             |
+| [ClientToolResult](#clienttoolresult)         | Client | Contains the result of a client tool invocation.     |
+| [Debug](#debug)                               | Server | Useful for application debugging.                    |
+| [PlaybackClearBuffer](#playbackclearbuffer)   | Server | Used to clear buffered output audio. WebSocket only. |
+
+
+## Ping
+A message sent by the client to measure round-trip data message latency.
+- `type: "ping"` 
+- `timestamp`: Float. Client timestamp for latency measurement.
+
+## Pong
+A message sent by the server in response to a Ping message. The timestamp is copied from that Ping message.
+- `type: "pong"`
+- `timestamp`: Float. Matching ping timestamp.
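
As a rough illustration of the Ping/Pong exchange, the Python sketch below builds a ping payload and derives round-trip latency from the matching pong. The helper names `make_ping` and `latency_from_pong` are hypothetical and not part of any Ultravox SDK. Using a monotonic clock in seconds is also an assumption, since the field is only documented as a float that the server echoes back.

```python
import json
import time
from typing import Optional


def make_ping(now: Optional[float] = None) -> str:
    """Serialize a ping message stamped with the client's clock reading."""
    timestamp = time.monotonic() if now is None else now
    return json.dumps({"type": "ping", "timestamp": timestamp})


def latency_from_pong(raw: str, now: Optional[float] = None) -> Optional[float]:
    """Return the round-trip time if `raw` is a pong message, else None."""
    message = json.loads(raw)
    if message.get("type") != "pong":
        return None
    received = time.monotonic() if now is None else now
    # The server copies the ping timestamp, so the difference is the round trip.
    return received - message["timestamp"]
```

The optional `now` argument exists only so the helpers can be exercised with fixed clock values.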
+
+## State
+A message sent by the server to indicate its current state.
+- `type: "state"`
+- `state`: Current session state
+
+## Transcript
+A message containing text transcripts of user and agent utterances.
+- `type: "transcript"`
+- `role`: "user" or "agent". Who emitted the utterance.
+- `medium`: "text" or "voice". The medium through which the utterance was emitted.
+- `text`: String. The full text of the transcript so far. Mutually exclusive with `delta`: exactly one of the two is set.
+- `delta`: String. The additional transcript text added since the last agent transcript message. Mutually exclusive with `text`.
+- `final`: Boolean. Whether this is the last update for this utterance; when false, more updates are expected.
+- `ordinal`: Integer. Used for ordering transcripts within a call.
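
The difference between `text` and `delta` can be sketched as a small accumulator. This is a minimal example under stated assumptions: `apply_transcript` is a hypothetical helper, and keying utterances by `ordinal` is an assumption (the field is documented only as an ordering aid).

```python
import json
from typing import Dict


def apply_transcript(transcripts: Dict[int, str], raw: str) -> Dict[int, str]:
    """Fold one transcript message into a mapping of ordinal -> text so far."""
    message = json.loads(raw)
    if message.get("type") != "transcript":
        return transcripts
    ordinal = message["ordinal"]
    if message.get("text") is not None:
        # `text` carries the whole utterance so far, so it replaces what we had.
        transcripts[ordinal] = message["text"]
    elif message.get("delta") is not None:
        # `delta` carries only the newly added portion, so it is appended.
        transcripts[ordinal] = transcripts.get(ordinal, "") + message["delta"]
    return transcripts
```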
+
+## InputTextMessage
+A user message sent via text.
+- `type: "input_text_message"`
+- `text`: String. The content of the user message.
+
+## SetOutputMedium
+Message sent by the client to set the server's output medium.
+- `type: "set_output_medium"`
+- `medium`: Either "voice" or "text".
+
+## ClientToolInvocation
+Sent by the server to ask the client to invoke a client-implemented tool with the given parameters. The client is expected to send back a ClientToolResultMessage with a matching invocation_id.
+- `type: "client_tool_invocation"`
+- `tool_name`: String. Tool to invoke
+- `invocation_id`: String. Unique invocation ID
+- `parameters`: Dict[String, Any]. Tool parameters
+
+## ClientToolResult
+Contains the result of a client-implemented tool invocation.
+- `type: "client_tool_result"`
+- `invocation_id`: String. Matches corresponding invocation.
+- `result`: String. Tool execution result. Often a JSON string. May be omitted for errors.
+- `response_type`: String. Defaults to "tool-response".
+- `error_type`: Optional string. Should be omitted when `result` is set. Otherwise, should be "undefined" if a tool with the given name does not exist, or "implementation-error" for any other failure.
+- `error_message`: Optional string. Error details if the invocation failed.
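
The invocation/result round trip could look roughly like the following Python sketch. `handle_invocation` and the `tools` registry are hypothetical names used for illustration. The keys are copied exactly as written in the field lists above, so verify their casing against the live protocol before relying on them.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry type: tool name -> local implementation.
ToolFn = Callable[[Dict[str, Any]], Any]


def handle_invocation(raw: str, tools: Dict[str, ToolFn]) -> str:
    """Run the requested client tool and serialize a client_tool_result reply."""
    invocation = json.loads(raw)
    reply: Dict[str, Any] = {
        "type": "client_tool_result",
        "invocation_id": invocation["invocation_id"],
    }
    tool = tools.get(invocation["tool_name"])
    if tool is None:
        # No tool registered under this name.
        reply["error_type"] = "undefined"
        reply["error_message"] = f"Unknown tool: {invocation['tool_name']}"
    else:
        try:
            # Results are sent as strings; JSON-encode whatever the tool returns.
            reply["result"] = json.dumps(tool(invocation.get("parameters", {})))
        except Exception as exc:
            # Any failure inside the tool is reported back to the server.
            reply["error_type"] = "implementation-error"
            reply["error_message"] = str(exc)
    return json.dumps(reply)
```

On success the reply carries only `result`; on failure it carries `error_type` and `error_message` and omits `result`, matching the field descriptions above.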
+
+## Debug
+A message sent by the server to communicate debug information.
+- `type: "debug"`
+- `message`: String. Debug information
+- Disabled by default
+
+## PlaybackClearBuffer
+Message sent by our server to clear buffered output audio. Integrators should drop as much unplayed output audio as possible in order for interruptions to function properly.
+- `type: "playback_clear_buffer"`
+- WebSocket connections only
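
One way an integrator might honor this message is to keep unplayed agent audio in a queue and empty it when the message arrives. The `PlaybackBuffer` class below is a hypothetical sketch, not an Ultravox API. It assumes binary WebSocket frames are agent audio and text frames are JSON data messages.

```python
import json
from collections import deque
from typing import Deque, Union


class PlaybackBuffer:
    """Holds agent audio chunks that have been received but not yet played."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def handle(self, message: Union[bytes, str]) -> None:
        """Queue audio frames; drop queued audio on playback_clear_buffer."""
        if isinstance(message, bytes):
            self._chunks.append(message)
        elif json.loads(message).get("type") == "playback_clear_buffer":
            self._chunks.clear()

    def next_chunk(self) -> bytes:
        """Return the next chunk to play, or empty bytes if nothing is queued."""
        return self._chunks.popleft() if self._chunks else b""

    def pending_bytes(self) -> int:
        """Total size of audio still waiting to be played."""
        return sum(len(chunk) for chunk in self._chunks)
```

Audio already handed to the sound device cannot be recalled by this queue, so a real player may also need to flush its device-level buffer.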
diff --git a/docs/src/content/docs/guides/connectionoptions.mdx b/docs/src/content/docs/guides/connectionoptions.mdx
@@ -0,0 +1,182 @@
+---
+title: "Connection Options: WebRTC, Telephony, and WebSockets"
+description: Use Ultravox to make and receive calls using WebRTC, via Twilio, or over direct WebSocket connections.
+---
+
+import { Steps, Tabs, TabItem } from '@astrojs/starlight/components';
+
+The Ultravox API allows you to create AI-powered voice applications that can interact through various protocols:
+
+- **WebRTC** → Default protocol for browser and mobile applications.
+- **Regular Phone Numbers** → Receive incoming or make outgoing phone calls (via Twilio).
+- **WebSockets** → Direct server-to-server integration.
+
+## Choosing a Protocol
+Choose your integration method based on your needs:
+
+- **WebRTC**: Best for most integrations, especially for any client deployment (for example, browsers or mobile clients). This is the default. Get started with the Ultravox client [SDK](../sdk).
+- **Twilio**: For traditional phone network integration.
+- **WebSocket**: For server-to-server integration, especially when you already have high-bandwidth connections between your server and clients.
+
+## Twilio Integration
+Ultravox integrates with Twilio, so you can build AI agents that make outgoing calls and answer incoming calls over regular phone networks. This opens up a wide range of possibilities for customer service, automated outreach, and other voice-based AI applications.
+
+For more information on Twilio, refer to the [Twilio documentation](https://www.twilio.com/docs).
+
+:::note[Twilio Support]
+We currently integrate with Twilio. Please let us know if there's another integration you'd like to see.
+:::
+
+### Creating a Phone Call with Twilio
+:::tip[Prerequisites]
+Make sure you have:
+1. An active Twilio account
+1. A phone number purchased from Twilio
+:::
+
+Creating an Ultravox call that works with Twilio is just like creating a WebRTC call, but there are two parameters to the [Create Call](./api/calls/#create-call) command worth special attention:
+
+<table class="w-full">
+    <tr class="w-full">
+        <th class="w-1/12"></th>
+        <th class="w-1/12"></th>
+        <th class="w-10/12"></th>
+    </tr>
+    <tr>
+        <td class="font-mono">medium</td>
+        <td>object</td>
+        <td>Tells Ultravox which protocol to use. <br />For Twilio, must be set to `{"twilio": {}}` and sets the call to use Twilio [Media Streams](https://www.twilio.com/docs/voice/media-streams). Defaults to `{"webRtc": {}}` which sets the protocol to WebRTC.</td>
+    </tr>
+    <tr>
+        <td class="font-mono">firstSpeaker</td>
+        <td>string</td>
+        <td>Tells Ultravox who should speak first. For outgoing calls, typically set to `"FIRST_SPEAKER_USER"`. The default is `"FIRST_SPEAKER_AGENT"`.</td>
+    </tr>
+</table>
+
+Adding these to the request body when creating the call would look like this:
+
+```javascript
+{
+  "systemPrompt": "You are a helpful assistant...",
+  ...
+  "medium": {
+    "twilio": {}
+  },
+  "firstSpeaker": "FIRST_SPEAKER_USER"
+}
+```
+
+Ultravox will return a `joinUrl` that can then be used with Twilio for outgoing or incoming calls.
+
+### Outgoing Calls
+
+It only takes two steps to make an outgoing call to regular phone numbers through Twilio:
+<Steps>
+1. **Create an Ultravox Call** → Create a new call (see [above](#creating-a-phone-call-with-twilio)), and get a `joinUrl`.
+
+1. **Initiate Twilio Phone call** → Use the `joinUrl` with a Twilio [`<Stream>`](https://www.twilio.com/docs/voice/twiml/stream).
+
+    ```javascript
+    // Example using the twilio node library
+    const call = await client.calls.create({
+        twiml: `<Response>
+                    <Connect>
+                        <Stream url="${joinUrl}"/>
+                    </Connect>
+                </Response>`,
+        to: phoneNumber, // the number you are calling
+        from: twilioPhoneNumber // your twilio number
+    });
+    ```
+</Steps>
+
+See the [twilio-outgoing-call](https://github.com/fixie-ai/ultradox/tree/main/examples/twilio-outgoing-call) example for more.
+
+This example shows one of the many options Twilio provides for making outgoing calls. Consult the [Twilio docs](https://www.twilio.com/docs) for more details.
+
+### Incoming Calls
+Incoming calls require essentially the same two steps as outgoing calls:
+
+<Steps>
+1. **Create an Ultravox Call** → Create a new call (see [above](#creating-a-phone-call-with-twilio)), and get a `joinUrl`. *Note: for incoming calls you will want to keep `firstSpeaker` set to the default ("FIRST_SPEAKER_AGENT").*
+
+1. **Receive Inbound Twilio Phone call** → Use the `joinUrl` with a Twilio [`<Stream>`](https://www.twilio.com/docs/voice/twiml/stream).
+
+    ```xml
+    <?xml version="1.0" encoding="UTF-8"?>
+    <!-- Example TwiML Response -->
+    <Response>
+        <Connect>
+            <Stream url="your_ultravox_join_url" />
+        </Connect>
+    </Response>
+
+    ```
+</Steps>
+
+The above shows how to create a TwiML response and use that for handling the inbound call. Consult the [Twilio docs](https://www.twilio.com/docs) for more on all the options Twilio provides for handling phone calls.
+
+## WebSocket Integration
+
+:::caution[Server-to-Server Only]
+WebSocket connections are designed for server-to-server communication. For browser or mobile applications, use our client SDKs with WebRTC for optimal performance. WebSocket connections over TCP can experience audio blocking and ordering constraints that make them unsuitable for direct client use.
+:::
+
+### Creating a WebSocket Call
+
+Creating a WebSocket-based call with Ultravox requires setting `medium` to `serverWebSocket` and passing in parameters for sample rates and buffer size.
+
+- **inputSampleRate** (required): Sample rate for input (user) audio (e.g., 48000).
+- **outputSampleRate** (optional): Sample rate for output (agent) audio (defaults to inputSampleRate).
+- **clientBufferSizeMs** (optional): Size of the client-side audio buffer in milliseconds (defaults to 60). Smaller buffers allow faster interruptions but may cause audio underflow if network latency fluctuates too much. For the best of both worlds, set this to a large value (e.g., 30000) and implement support for [PlaybackClearBuffer](../datamessages#playbackclearbuffer) messages.
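
To get a feel for what these values mean in bytes, the arithmetic can be sketched as below. This assumes s16le PCM (2 bytes per sample) and a single audio channel; the channel count is an assumption not stated here, and `buffer_bytes` is a hypothetical helper.

```python
BYTES_PER_SAMPLE = 2  # s16le: signed 16-bit little-endian samples


def buffer_bytes(sample_rate_hz: int, buffer_ms: int, channels: int = 1) -> int:
    """Bytes of PCM audio needed to fill `buffer_ms` milliseconds."""
    return sample_rate_hz * BYTES_PER_SAMPLE * channels * buffer_ms // 1000


default_size = buffer_bytes(48000, 60)     # 5760 bytes at the 60 ms default
large_size = buffer_bytes(48000, 30000)    # 2880000 bytes for a 30 s buffer
```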
+
+### Example: Creating an Ultravox Call with WebSockets
+```javascript
+    const response = await fetch('https://api.ultravox.ai/api/calls', {
+        method: 'POST',
+        headers: {
+            'X-API-Key': 'your_api_key',
+            'Content-Type': 'application/json'
+        },
+        body: JSON.stringify({
+            systemPrompt: "You are a helpful assistant...",
+            model: "fixie-ai/ultravox",
+            voice: "Mark",
+            medium: {
+                serverWebSocket: {
+                    inputSampleRate: 48000,
+                    outputSampleRate: 48000,
+                    clientBufferSizeMs: 30000
+                }
+            }
+        })
+    });
+
+    const { joinUrl } = await response.json();
+```
+
+### Example: Joining a Call with WebSockets
+See [Data Messages](../datamessages) for more information on all available messages.
+```python
+import asyncio
+import websockets
+
+async def run_call(join_url: str):
+    # run_call is an illustrative wrapper; `await` needs an async context.
+    socket = await websockets.connect(join_url)
+    audio_send_task = asyncio.create_task(_send_audio(socket))
+    async for message in socket:
+        if isinstance(message, bytes):
+            ...  # Handle agent audio data
+        else:
+            ...  # Handle data message. See "Data Messages"
+
+...
+
+async def _send_audio(socket: websockets.WebSocketClientProtocol):
+    # some_audio_source is a placeholder for your own async audio iterator
+    async for chunk in some_audio_source:
+        # chunk should be a bytes object containing s16le PCM audio from the user
+        await socket.send(chunk)
+```
+
+:::note[Data Messages]
+WebSocket connections use the same message format as WebRTC data channels. See our [Data Messages](../datamessages) documentation for detailed message specifications.
+:::