Merge pull request #1492 from jinhonglin-ryan/master

tutorial 1: s3+langchain+milvus; tutorial 2: s3+vectorETL+milvus

zc277584121 authored Feb 6, 2025
2 parents 6b1f7b6 + 650b331 commit 53ad6de

bootcamp/tutorials/integration/ETL_using_vectorETL.ipynb (282 additions, 0 deletions)
{
"cells": [
{
"cell_type": "markdown",
"source": [
"<a href=\"https://colab.research.google.com/github/milvus-io/bootcamp/blob/master/bootcamp/tutorials/integration/ETL_using_vectorETL.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a> <a href=\"https://github.com/milvus-io/bootcamp/blob/master/bootcamp/tutorials/integration/ETL_using_vectorETL.ipynb\" target=\"_blank\">\n",
" <img src=\"https://img.shields.io/badge/View%20on%20GitHub-555555?style=flat&logo=github&logoColor=white\" alt=\"GitHub Repository\"/>\n",
"</a>"
],
"metadata": {
"collapsed": false
},
"id": "647b44d2c4bbfeff"
},
{
"cell_type": "markdown",
"source": [
"# Efficient Data Loading into Milvus with VectorETL\n",
"\n",
"In this tutorial, we'll explore how to efficiently load data into Milvus using [VectorETL](https://github.com/ContextData/VectorETL), a lightweight ETL framework designed for vector databases. VectorETL simplifies the process of extracting data from various sources, transforming it into vector embeddings using AI models, and storing it in Milvus for fast and scalable retrieval. By the end of this tutorial, you'll have a working ETL pipeline that allows you to integrate and manage vector search systems with ease. Let’s dive in!"
],
"metadata": {
"collapsed": false
},
"id": "8d69e7741bcaee96"
},
{
"cell_type": "markdown",
"source": [
"## Preparation\n",
"\n",
"### Dependency and Environment"
],
"metadata": {
"collapsed": false
},
"id": "8bf54f650a2417fc"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"! pip install --upgrade vector-etl pymilvus"
],
"metadata": {
"collapsed": false
},
"id": "8d09d8c98ac830b6",
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"> If you are using Google Colab, you may need to **restart the runtime** to enable the newly installed dependencies (click the \"Runtime\" menu at the top of the screen and select \"Restart session\" from the dropdown menu)."
],
"metadata": {
"collapsed": false
},
"id": "27644806ccfc3b80"
},
{
"cell_type": "markdown",
"source": [
"VectorETL supports multiple data sources, including Amazon S3, Google Cloud Storage, and local files. You can check out the full list of supported sources [here](https://github.com/ContextData/VectorETL?tab=readme-ov-file#source-configuration). In this tutorial, we’ll focus on Amazon S3 as the data source.\n",
"\n",
"Since we will load documents from Amazon S3, you need to set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` as environment variables to securely access your S3 bucket. Additionally, we will use OpenAI's `text-embedding-ada-002` embedding model to generate embeddings for the data, so you should also set your [API key](https://platform.openai.com/docs/quickstart) as the `OPENAI_API_KEY` environment variable.\n"
],
"metadata": {
"collapsed": false
},
"id": "ddf9b65fea467b12"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"your-openai-api-key\"\n",
"os.environ[\"AWS_ACCESS_KEY_ID\"] = \"your-aws-access-key-id\"\n",
"os.environ[\"AWS_SECRET_ACCESS_KEY\"] = \"your-aws-secret-access-key\""
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2025-02-05T23:16:00.803349Z",
"start_time": "2025-02-05T23:16:00.799852Z"
}
},
"id": "d338d2506d913ae4",
"execution_count": 1
},
{
"cell_type": "markdown",
"source": [
"## Workflow\n",
"\n",
"### Defining the Data Source (Amazon S3)\n",
"\n",
"In this case, we are extracting documents from an Amazon S3 bucket. VectorETL allows us to specify the bucket name, the path to the files, and the type of data we are working with. "
],
"metadata": {
"collapsed": false
},
"id": "bbf3e94e386225b"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"source = {\n",
" \"source_data_type\": \"Amazon S3\",\n",
" \"bucket_name\": \"my-bucket\",\n",
" \"key\": \"path/to/files/\",\n",
" \"file_type\": \".csv\",\n",
" \"aws_access_key_id\": os.environ[\"AWS_ACCESS_KEY_ID\"],\n",
" \"aws_secret_access_key\": os.environ[\"AWS_SECRET_ACCESS_KEY\"],\n",
"}"
],
"metadata": {
"collapsed": false
},
"id": "a3e009d70023cd67"
},
{
"cell_type": "markdown",
"source": [
"### Configuring the Embedding Model (OpenAI)\n",
"\n",
"Once the data source is set up, we need to define the embedding model that will transform our textual data into vector embeddings. We use OpenAI’s `text-embedding-ada-002` in this example."
],
"metadata": {
"collapsed": false
},
"id": "66b7731a565958fe"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"embedding = {\n",
" \"embedding_model\": \"OpenAI\",\n",
" \"api_key\": os.environ[\"OPENAI_API_KEY\"],\n",
" \"model_name\": \"text-embedding-ada-002\",\n",
"}"
],
"metadata": {
"collapsed": false
},
"id": "c96302937c219b5e"
},
{
"cell_type": "markdown",
"source": [
"### Setting Up Milvus as the Target Database\n",
"\n",
"We need to store the generated embeddings in Milvus. Here, we define our Milvus connection parameters using Milvus Lite. "
],
"metadata": {
"collapsed": false
},
"id": "e8b7f7c360de5795"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"target = {\n",
" \"target_database\": \"Milvus\",\n",
" \"host\": \"./milvus.db\", # os.environ[\"ZILLIZ_CLOUD_PUBLIC_ENDPOINT\"] if using Zilliz Cloud\n",
" \"api_key\": \"\", # os.environ[\"ZILLIZ_CLOUD_TOKEN\"] if using Zilliz Cloud\n",
" \"collection_name\": \"my_collection\",\n",
" \"vector_dim\": 1536, # 1536 for text-embedding-ada-002\n",
"}"
],
"metadata": {
"collapsed": false
},
"id": "c0012154b331aee4"
},
{
"cell_type": "markdown",
"source": [
"> For the `host` and `api_key`:\n",
"> - Setting the `host` to a local file, e.g. `./milvus.db`, and leaving `api_key` empty is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n",
"> - If you have a large amount of data, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, use the server URI, e.g. `http://localhost:19530`, as your `host` and leave `api_key` empty.\n",
"> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, set the `host` and `api_key` to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) of your Zilliz Cloud cluster."
],
"metadata": {
"collapsed": false
},
"id": "5737e7eeb2a17a4c"
},
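{
"cell_type": "markdown",
"source": [
"> For reference, below is a sketch of the same `target` configuration pointed at Zilliz Cloud instead of Milvus Lite. It reuses the `ZILLIZ_CLOUD_PUBLIC_ENDPOINT` and `ZILLIZ_CLOUD_TOKEN` environment variable names suggested in the comments of the cell above, which you would need to set beforehand."
],
"metadata": {
"collapsed": false
},
"id": "zilliz-cloud-target-sketch"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Sketch: the same target configuration, pointed at Zilliz Cloud.\n",
"# Assumes ZILLIZ_CLOUD_PUBLIC_ENDPOINT and ZILLIZ_CLOUD_TOKEN are set\n",
"# as environment variables (see the comments in the cell above).\n",
"zilliz_target = {\n",
" \"target_database\": \"Milvus\",\n",
" \"host\": os.environ[\"ZILLIZ_CLOUD_PUBLIC_ENDPOINT\"],\n",
" \"api_key\": os.environ[\"ZILLIZ_CLOUD_TOKEN\"],\n",
" \"collection_name\": \"my_collection\",\n",
" \"vector_dim\": 1536, # 1536 for text-embedding-ada-002\n",
"}"
],
"metadata": {
"collapsed": false
},
"id": "zilliz-cloud-target-sketch-code",
"execution_count": null
},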
{
"cell_type": "markdown",
"source": [
"### Specifying Columns for Embedding\n",
"\n",
"Now, we need to specify which columns from our CSV files should be converted into embeddings. This ensures that only the relevant text fields are processed, optimizing both efficiency and storage."
],
"metadata": {
"collapsed": false
},
"id": "d49e589854a5b06f"
},
{
"cell_type": "code",
"execution_count": null,
"id": "initial_id",
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"embed_columns = [\"col_1\", \"col_2\", \"col_3\"]"
]
},
{
"cell_type": "markdown",
"source": [
"### Creating and Executing the VectorETL Pipeline\n",
"\n",
"With all configurations in place, we now initialize the ETL pipeline, set up the data flow, and execute it."
],
"metadata": {
"collapsed": false
},
"id": "f36dc52eb8473024"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"from vector_etl import create_flow\n",
"\n",
"flow = create_flow()\n",
"flow.set_source(source)\n",
"flow.set_embedding(embedding)\n",
"flow.set_target(target)\n",
"flow.set_embed_columns(embed_columns)\n",
"\n",
"# Execute the flow\n",
"flow.execute()"
],
"metadata": {
"collapsed": false
},
"id": "d2e41a45912b36f7"
},
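{
"cell_type": "markdown",
"source": [
"Once the flow finishes, you can check the results directly with `pymilvus`. The following is a minimal verification sketch, assuming the Milvus Lite file `./milvus.db` and the collection name `my_collection` configured above; the collection schema and field names are created by VectorETL, so inspect the `describe_collection` output before querying specific fields."
],
"metadata": {
"collapsed": false
},
"id": "verify-loaded-data"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"from pymilvus import MilvusClient\n",
"\n",
"# Connect to the same Milvus Lite file the pipeline wrote to.\n",
"client = MilvusClient(uri=\"./milvus.db\")\n",
"\n",
"# Confirm the collection exists, inspect its schema, and check the row count.\n",
"print(client.list_collections())\n",
"print(client.describe_collection(collection_name=\"my_collection\"))\n",
"print(client.get_collection_stats(collection_name=\"my_collection\"))"
],
"metadata": {
"collapsed": false
},
"id": "verify-loaded-data-code",
"execution_count": null
},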
{
"cell_type": "markdown",
"source": [
"In this tutorial, we built an end-to-end ETL pipeline that moves documents from Amazon S3 into Milvus using VectorETL. VectorETL is flexible about its data sources, so you can choose whichever source best fits your application's needs. Thanks to VectorETL’s modular design, you can easily extend this pipeline to support other data sources and embedding models, making it a powerful tool for AI and data engineering workflows!"
],
"metadata": {
"collapsed": false
},
"id": "22d6ae0349563f71"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}