Add Speech to Speech example notebook, closes #789

neuml · Sep 27, 2024 · 67321b0 · 67321b0
1 parent f3138c2
commit 67321b0
Show file tree

Hide file tree

Showing 4 changed files with 265 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -137,7 +137,7 @@ A novel feature of txtai is that it can provide both an answer and source citati
 | [Build RAG pipelines with txtai](https://github.com/neuml/txtai/blob/master/examples/52_Build_RAG_pipelines_with_txtai.ipynb) | Guide on retrieval augmented generation including how to create citations | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/52_Build_RAG_pipelines_with_txtai.ipynb) |
 | [How RAG with txtai works](https://github.com/neuml/txtai/blob/master/examples/63_How_RAG_with_txtai_works.ipynb) | Create RAG processes, API services and Docker instances | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/63_How_RAG_with_txtai_works.ipynb) |
 | [Advanced RAG with graph path traversal](https://github.com/neuml/txtai/blob/master/examples/58_Advanced_RAG_with_graph_path_traversal.ipynb) | Graph path traversal to collect complex sets of data for advanced RAG | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/58_Advanced_RAG_with_graph_path_traversal.ipynb) |
-| [Advanced RAG with guided generation](https://github.com/neuml/txtai/blob/master/examples/60_Advanced_RAG_with_guided_generation.ipynb) | Retrieval Augmented and Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/60_Advanced_RAG_with_guided_generation.ipynb) |
+| [Speech to Speech RAG](https://github.com/neuml/txtai/blob/master/examples/65_Speech_to_Speech_RAG.ipynb) [▶️](https://www.youtube.com/watch?v=tH8QWwkVMKA) | Full cycle speech to speech workflow with RAG | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/65_Speech_to_Speech_RAG.ipynb) |
 
 ### Language Model Workflows
 

diff --git a/docs/examples.md b/docs/examples.md
@@ -39,6 +39,7 @@ LLM chains, retrieval augmented generation (RAG), chat with your data, pipelines
 | [Advanced RAG with guided generation](https://github.com/neuml/txtai/blob/master/examples/60_Advanced_RAG_with_guided_generation.ipynb) | Retrieval Augmented and Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/60_Advanced_RAG_with_guided_generation.ipynb) |
 | [RAG with llama.cpp and external API services](https://github.com/neuml/txtai/blob/master/examples/62_RAG_with_llama_cpp_and_external_API_services.ipynb) | RAG with additional vector and LLM frameworks | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/62_RAG_with_llama_cpp_and_external_API_services.ipynb) |
 | [How RAG with txtai works](https://github.com/neuml/txtai/blob/master/examples/63_How_RAG_with_txtai_works.ipynb) | Create RAG processes, API services and Docker instances | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/63_How_RAG_with_txtai_works.ipynb) |
+| [Speech to Speech RAG](https://github.com/neuml/txtai/blob/master/examples/65_Speech_to_Speech_RAG.ipynb) [▶️](https://www.youtube.com/watch?v=tH8QWwkVMKA) | Full cycle speech to speech workflow with RAG | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/65_Speech_to_Speech_RAG.ipynb) |
 
 ## Pipelines
 

diff --git a/docs/usecases.md b/docs/usecases.md
@@ -54,7 +54,7 @@ A novel feature of txtai is that it can provide both an answer and source citati
 | [Build RAG pipelines with txtai](https://github.com/neuml/txtai/blob/master/examples/52_Build_RAG_pipelines_with_txtai.ipynb) | Guide on retrieval augmented generation including how to create citations | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/52_Build_RAG_pipelines_with_txtai.ipynb) |
 | [How RAG with txtai works](https://github.com/neuml/txtai/blob/master/examples/63_How_RAG_with_txtai_works.ipynb) | Create RAG processes, API services and Docker instances | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/63_How_RAG_with_txtai_works.ipynb) |
 | [Advanced RAG with graph path traversal](https://github.com/neuml/txtai/blob/master/examples/58_Advanced_RAG_with_graph_path_traversal.ipynb) | Graph path traversal to collect complex sets of data for advanced RAG | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/58_Advanced_RAG_with_graph_path_traversal.ipynb) |
-| [Advanced RAG with guided generation](https://github.com/neuml/txtai/blob/master/examples/60_Advanced_RAG_with_guided_generation.ipynb) | Retrieval Augmented and Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/60_Advanced_RAG_with_guided_generation.ipynb) |
+| [Speech to Speech RAG](https://github.com/neuml/txtai/blob/master/examples/65_Speech_to_Speech_RAG.ipynb) [▶️](https://www.youtube.com/watch?v=tH8QWwkVMKA) | Full cycle speech to speech workflow with RAG | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/65_Speech_to_Speech_RAG.ipynb) |
 
 ## Language Model Workflows
 

diff --git a/examples/65_Speech_to_Speech_RAG.ipynb b/examples/65_Speech_to_Speech_RAG.ipynb
@@ -0,0 +1,262 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Speech to Speech RAG\n",
+    "\n",
+    "[txtai](https://github.com/neuml/txtai) is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.\n",
+    "\n",
+    "There are many articles, notebooks and examples covering how to perform vector search and/or retrieval augmented generation (RAG) with txtai. A lesser known component of txtai is it's built-in workflow component.\n",
+    "\n",
+    "Workflows are a simple yet powerful construct that takes a callable and returns elements. Workflows enable efficient processing of pipeline data. Workflows are streaming by nature and work on data in batches. This allows large volumes of data to be processed efficiently.\n",
+    "\n",
+    "This notebook will demonstrate how to to build a Speech to Speech (S2S) workflow with txtai.\n",
+    "\n",
+    "_Note: This process is intended to run on local machines due to it's use of input and output audio devices._"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Install dependencies\n",
+    "\n",
+    "Install `txtai` and all dependencies."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline-audio] autoawq"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Define the S2S RAG Workflow\n",
+    "\n",
+    "The next section defines the Speech to Speech (S2S) RAG workflow. The objective of this workflow is to respond to a user request in near real-time.\n",
+    "\n",
+    "txtai supports workflow definitions in Python and with YAML. We'll cover both methods.\n",
+    "\n",
+    "The S2S workflow below starts with a microphone pipeline, which streams and processes input audio. The microphone pipeline has voice activity detection (VAD) built-in. When speech is detected, the pipeline returns the captured audio data. Next, the speech is transcribed to text and then passed to a RAG pipeline prompt. Finally, the RAG result is run through a text to speech (TTS) pipeline and streamed to an output audio device."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "\n",
+    "from txtai import Embeddings, RAG\n",
+    "from txtai.pipeline import AudioStream, Microphone, TextToSpeech, Transcription\n",
+    "from txtai.workflow import Workflow, StreamTask, Task\n",
+    "\n",
+    "# Enable DEBUG logging\n",
+    "logging.basicConfig()\n",
+    "logging.getLogger().setLevel(logging.DEBUG)\n",
+    "\n",
+    "# Microphone\n",
+    "microphone = Microphone()\n",
+    "\n",
+    "# Transcription\n",
+    "transcribe = Transcription(\"distil-whisper/distil-large-v3\")\n",
+    "\n",
+    "# Embeddings database\n",
+    "embeddings = Embeddings()\n",
+    "embeddings.load(provider=\"huggingface-hub\", container=\"neuml/txtai-wikipedia\")\n",
+    "\n",
+    "# Define prompt template\n",
+    "template = \"\"\"\n",
+    "Answer the following question using only the context below. Only include information\n",
+    "specifically discussed. Answer the question without explaining how you found the answer.\n",
+    "\n",
+    "question: {question}\n",
+    "context: {context}\"\"\"\n",
+    "\n",
+    "# Create RAG pipeline\n",
+    "rag = RAG(\n",
+    "    embeddings,\n",
+    "    \"hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4\",\n",
+    "    system=\"You are a friendly assistant. You answer questions from users.\",\n",
+    "    template=template,\n",
+    "    context=10\n",
+    ")\n",
+    "\n",
+    "# Text to speech\n",
+    "tts = TextToSpeech(\"neuml/vctk-vits-onnx\")\n",
+    "\n",
+    "# Audio stream\n",
+    "audiostream = AudioStream()\n",
+    "\n",
+    "# Define speech to speech workflow\n",
+    "workflow = Workflow(tasks=[\n",
+    "    Task(action=microphone),\n",
+    "    Task(action=transcribe, unpack=False),\n",
+    "    StreamTask(action=lambda x: rag(x, maxlength=4096, stream=True), batch=True),\n",
+    "    StreamTask(action=lambda x: tts(x, stream=True, speaker=15), batch=True),\n",
+    "    StreamTask(action=audiostream, batch=True)\n",
+    "])\n",
+    "\n",
+    "while True:\n",
+    "    print(\"Waiting for input...\")\n",
+    "    list(workflow([None]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Given that the input and outputs are audio, you'll have to use your imagination if you're reading this as an article.\n",
+    "\n",
+    "[Check out this video](https://www.youtube.com/watch?v=tH8QWwkVMKA) to see the workflow in action! The following examples are run:\n",
+    "\n",
+    "- Tell me about the Roman Empire\n",
+    "- Explain how faster than light travel could work\n",
+    "- Write a short poem about the Vikings\n",
+    "- Tell me about the Roman Empire in French"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# S2S Workflow in YAML\n",
+    "\n",
+    "A crucial feature of txtai workflows is that they can be defined with YAML. This enables building workflows in a low-code and/or no-code setting. These YAML workflows can then be \"dockerized\" and run.\n",
+    "\n",
+    "Let's define the same workflow below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile s2s.yml\n",
+    "# Microphone\n",
+    "microphone:\n",
+    "\n",
+    "# Transcription\n",
+    "transcription:\n",
+    "  path: distil-whisper/distil-large-v3\n",
+    "\n",
+    "# Embeddings database\n",
+    "cloud:\n",
+    "  provider: huggingface-hub\n",
+    "  container: neuml/txtai-wikipedia\n",
+    "\n",
+    "embeddings:\n",
+    "\n",
+    "# RAG\n",
+    "rag:\n",
+    "  path: \"hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4\"\n",
+    "  system: You are a friendly assistant. You answer questions from users.\n",
+    "  template: |\n",
+    "    Answer the following question using only the context below. Only include information\n",
+    "    specifically discussed. Answer the question without explaining how you found the answer.\n",
+    "\n",
+    "    question: {question}\n",
+    "    context: {context}\n",
+    "  context: 10\n",
+    "\n",
+    "# TTS\n",
+    "texttospeech:\n",
+    "  path: neuml/vctk-vits-onnx\n",
+    "\n",
+    "# AudioStream\n",
+    "audiostream:\n",
+    "\n",
+    "# Speech to Speech Chat workflow\n",
+    "workflow:\n",
+    "  s2s:\n",
+    "    tasks:\n",
+    "      - microphone\n",
+    "      - action: transcription\n",
+    "        unpack: False\n",
+    "      - task: stream\n",
+    "        action: rag\n",
+    "        args:\n",
+    "          maxlength: 4096\n",
+    "          stream: True\n",
+    "        batch: True\n",
+    "      - task: stream\n",
+    "        action: texttospeech\n",
+    "        args:\n",
+    "          stream: True\n",
+    "          speaker: 15\n",
+    "        batch: True\n",
+    "      - task: stream\n",
+    "        action: audiostream\n",
+    "        batch: True"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from txtai import Application\n",
+    "\n",
+    "app = Application(\"s2s.yml\")\n",
+    "while True:\n",
+    "    print(\"Waiting for input...\")\n",
+    "    list(app.workflow(\"s2s\", [None]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once again, the same idea, just a different way to do it. In the video demo, the following query was asked.\n",
+    "\n",
+    "- As a Patriots fan, who would you guess is my favorite quarterback of all time is?\n",
+    "- I'm tall and run fast, what do you think the best soccer position for me is?\n",
+    "- I run slow, what do you think the best soccer position for me is?\n",
+    "\n",
+    "With YAML workflows, it's possible to fully define the process outside of code such as with a web interface. Perhaps someday we'll see this with [txtai.cloud](https://txtai.cloud) 😀"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Wrapping up\n",
+    "\n",
+    "This notebook demonstrated how to build a Speech to Speech (S2S) workflow with txtai. While the workflow uses an off-the-shelf embeddings database, a custom embeddings database can easily be swapped in. From there, we have S2S with our own data!"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "local",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.20"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}