From 845eed264b8124f9657fff523dc36b063b751278 Mon Sep 17 00:00:00 2001 From: ChengZi Date: Fri, 10 Jan 2025 17:40:09 +0800 Subject: [PATCH 1/3] full_text_search_with_langchain Signed-off-by: ChengZi --- .../full_text_search_with_langchain.ipynb | 675 ++++++++++++++++++ 1 file changed, 675 insertions(+) create mode 100644 bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb diff --git a/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb b/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb new file mode 100644 index 000000000..e277b02d8 --- /dev/null +++ b/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb @@ -0,0 +1,675 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "\"Open \n", + " \"GitHub\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using full-text search with LangChain and Milvus\n", + "\n", + "[Full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search) retrieves documents with specific terms or phrases in text datasets and ranks results by relevance. It overcomes semantic search limitations to provide accurate, context-relevant results. Also, it simplifies vector searches, accepting raw text and automatically converting it into sparse embeddings without manual generation. By integrating full-text search with semantic-based dense vector search, you can enhance the accuracy and relevance of search results.\n", + "\n", + "BM25 is an important ranking algorithm in full-text search. Using the BM25 algorithm for relevance scoring, this feature is particularly valuable in retrieval-augmented generation (RAG) scenarios, where it prioritizes documents that closely match specific search terms. \n", + "\n", + "Milvus 2.5 introduced the full-text search [feature](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md). As a further layer of framework, LangChain's Milvus integration has also launched this feature, making it easy to integrate full-text search into your application.\n", + "\n", + "In this tutorial, we will show you how to use LangChain and Milvus to use full-text search into your application.\n", + "\n", + "> - Full text search is available in Milvus Standalone and Milvus Distributed but not Milvus Lite, although adding it to Milvus Lite is on the roadmap.\n", + "> - Before reading this tutorial, you need to have a basic understanding of [full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search). In addition, you also need to know the [basic usage](https://milvus.io/docs/basic_usage_langchain.md) of LangChain Milvus integration.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "Before running this notebook, make sure you have the following dependencies installed:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "! 
pip install --upgrade --quiet langchain langchain-core langchain-community langchain-text-splitters langchain-milvus langchain-openai langchain-voyageai bs4" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "> If you are using Google Colab, to enable dependencies just installed, you may need to **restart the runtime** (click on the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "We will use the models from OpenAI and VoyageAI. You should prepare the environment variables `OPENAI_API_KEY` from [OpenAI](https://platform.openai.com/docs/quickstart) and `VOYAGE_API_KEY` from [VoyageAI](https://docs.voyageai.com/docs/api-key-and-installation)." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"\n", + "os.environ[\"VOYAGE_API_KEY\"] = \"pa-***********\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Install and start the Milvus server following this [guide](https://milvus.io/docs/install_standalone-docker-compose.md). And set your Milvus server `URI` (or optional `TOKEN`)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "URI = \"http://localhost:19530\"\n", + "# TOKEN = ..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Prepare some examples documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_core.documents import Document\n", + "\n", + "docs = [\n", + " Document(page_content=\"I like apple\", metadata={\"foo\": \"bar\"}),\n", + " Document(page_content=\"I like banana\", metadata={\"foo\": \"baz\"}),\n", + " Document(page_content=\"I like orange\", metadata={\"foo\": \"qux\"}),\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization with BM25 Function\n", + "### Hybrid Search\n", + "\n", + "Unlike simply passing an embedding to the `VectorStore`, the Milvus VectorStore provides a `builtin_function` parameter. 
Through this parameter, you can pass an instance of the BM25 function.\n", + "\n", + "Here is a simple example of combining OpenAI embeddings with the BM25 function from Milvus:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "from langchain_milvus import Milvus, BM25BuiltInFunction\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "\n", + "vectorstore = Milvus.from_documents(\n", + " documents=docs,\n", + " embedding=OpenAIEmbeddings(),\n", + " builtin_function=BM25BuiltInFunction(),\n", + " # `dense` is for OpenAI embeddings, `sparse` is the output field of BM25 function\n", + " vector_field=[\"dense\", \"sparse\"],\n", + " connection_args={\n", + " \"uri\": URI,\n", + " },\n", + " consistency_level=\"Strong\",\n", + " drop_old=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the code above, we define an instance of `BM25BuiltInFunction` and pass it to the `Milvus` object. `BM25BuiltInFunction` is a lightweight wrapper class for the [`Function`](https://milvus.io/docs/manage-collections.md#Function) in Milvus.\n", + "\n", + "You can specify the input and output fields for this function in the parameters of the `BM25BuiltInFunction` instance by passing the following two field parameters:\n", + "- `input_field_names` (str): The name of the input field, default is `text`. It indicates which field this function reads as input.\n", + "- `output_field_names` (str): The name of the output field, default is `sparse`. It indicates which field this function outputs the computed result to.\n", + "\n", + "Note that in the Milvus initialization parameters mentioned above, we also specify `vector_field=[\"dense\", \"sparse\"]`. 
Since the `sparse` field is the output field defined by the `BM25BuiltInFunction`, the other `dense` field will be automatically assigned to the output field of OpenAIEmbeddings.\n", + "\n", + "In practice, especially when combining multiple embeddings or functions, we recommend clearly specifying the input and output fields for each function to avoid confusion.\n", + "\n", + "In the following example, it specifies the input and output fields of BM25BuiltInFunction, and three vector fields, which makes it clear which field each built-in function and each vector embedding.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['dense1', 'dense2', 'sparse']" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_voyageai import VoyageAIEmbeddings\n", + "\n", + "embedding1 = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "embedding2 = VoyageAIEmbeddings(model=\"voyage-3\")\n", + "\n", + "vectorstore = Milvus.from_documents(\n", + " documents=docs,\n", + " embedding=[embedding1, embedding2],\n", + " builtin_function=BM25BuiltInFunction(\n", + " input_field_names=\"text\", output_field_names=\"sparse\"\n", + " ),\n", + " text_field=\"text\", # `text` is the input field name of BM25BuiltInFunction\n", + " # `sparse` is the output field name of BM25BuiltInFunction, and `dense1` and `dense2` are the output field names of OpenAIEmbeddings and VoyageAIEmbeddings\n", + " vector_field=[\"dense1\", \"dense2\", \"sparse\"],\n", + " connection_args={\n", + " \"uri\": URI,\n", + " },\n", + " consistency_level=\"Strong\",\n", + " drop_old=True,\n", + ")\n", + "\n", + "vectorstore.vector_fields" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this example, we have three vector fields. Among them, `sparse` is used as the output field for `BM25BuiltInFunction`, while the other two, `dense1` and `dense2`, are automatically assigned as the output fields for `OpenAIEmbeddings` and `VoyageAIEmbeddings`, respectively. \n", + "\n", + "In this way, you can define multiple vector fields and assign different combinations of embeddings or functions to them, enabling hybrid search.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When performing hybrid search, we just pass in the query text and optionally set the topK and reranker parameters. The `vectorstore` instance will automatically handle the vector embeddings and built-in functions and finally use a reranker to refine the results. From the user's end, we don't need to care about the underlying implementation details of the searching process." 
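If you want to control how the results from the different vector fields are fused, the reranker can also be selected explicitly through keyword arguments. Below is a minimal sketch: the weighted ranker mirrors the commented-out parameters in the next cell, while the RRF variant is our assumption about the equivalent rank-fusion option; adjust the weights to match the number and order of your vector fields.

```python
# Weighted fusion: one weight per vector field, in the same order as
# vector_field=["dense1", "dense2", "sparse"] defined above.
hits_weighted = vectorstore.similarity_search(
    "Do I like oranges?",
    k=1,
    ranker_type="weighted",
    ranker_params={"weights": [0.3, 0.3, 0.4]},
)

# RRF (Reciprocal Rank Fusion): fuses by rank instead of weighted scores.
# (Assumed to be exposed the same way; `k` here is the RRF smoothing constant.)
hits_rrf = vectorstore.similarity_search(
    "Do I like oranges?",
    k=1,
    ranker_type="rrf",
    ranker_params={"k": 100},
)
```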
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'foo': 'qux', 'pk': 454646931479251686}, page_content='I like orange')]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vectorstore.similarity_search(\n", + " \"Do I like oranges?\", k=1\n", + ") # , ranker_type=\"weighted\", ranker_params={\"weights\":[0.3, 0.3, 0.4]})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For more information about how to use the hybrid search, you can refer to the [Hybrid Search introduction](https://milvus.io/docs/multi-vector-search.md#Hybrid-Search) and this [LangChain Milvus hybrid search tutorial](https://milvus.io/docs/milvus_hybrid_search_retriever.md) ." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### BM25 search without embedding\n", + "\n", + "If you want to perform lexical frequency-based full-text search using only a single BM25 function without using any embedding-based semantic similarity search, you can set the embedding parameter input to `None` and keep only the `builtin_function` parameter input as the BM25 function instance. For example: " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['sparse']" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vectorstore = Milvus.from_documents(\n", + " documents=docs,\n", + " embedding=None,\n", + " builtin_function=BM25BuiltInFunction(\n", + " output_field_names=\"sparse\",\n", + " ),\n", + " vector_field=\"sparse\",\n", + " connection_args={\n", + " \"uri\": URI,\n", + " },\n", + " consistency_level=\"Strong\",\n", + " drop_old=True,\n", + ")\n", + "\n", + "vectorstore.vector_fields" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Customize analyzer\n", + "\n", + "Analyzers are essential tools in text processing that convert raw text into structured, searchable formats. They play a key role in enabling efficient indexing and retrieval by breaking down input text into tokens and refining these tokens through a combination of tokenizers and filters. For more information, you can refer [this guide](https://milvus.io/docs/analyzer-overview.md#Analyzer-Overview) to learn more about analyzers in Milvus.\n", + "\n", + "Milvus supports two types of analyzers: **Built-in Analyzers** and **Custom Analyzers**. By default, the `BM25BuiltInFunction` will use the [default standard analyzer](https://milvus.io/docs/standard-analyzer.md), which makes it effective for most languages. 
\n", + "\n", + "However, if you want to use a different analyzer or customize the analyzer, you can pass in the `analyzer_params` parameter in the `BM25BuiltInFunction` initialization.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "analyzer_params_custom = {\n", + " \"tokenizer\": \"standard\",\n", + " \"filter\": [\n", + " \"lowercase\", # Built-in filter\n", + " {\"type\": \"length\", \"max\": 40}, # Custom filter\n", + " {\"type\": \"stop\", \"stop_words\": [\"of\", \"to\"]}, # Custom filter\n", + " ],\n", + "}\n", + "\n", + "\n", + "vectorstore = Milvus.from_documents(\n", + " documents=docs,\n", + " embedding=OpenAIEmbeddings(),\n", + " builtin_function=BM25BuiltInFunction(\n", + " output_field_names=\"sparse\",\n", + " enable_match=True,\n", + " analyzer_params=analyzer_params_custom,\n", + " ),\n", + " vector_field=[\"dense\", \"sparse\"],\n", + " connection_args={\n", + " \"uri\": URI,\n", + " },\n", + " consistency_level=\"Strong\",\n", + " drop_old=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can take a look at the schema of the Milvus collection and make sure the customized analyzer is set up correctly." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': , 'params': {'max_length': 65535, 'enable_match': True, 'enable_analyzer': True, 'analyzer_params': {'tokenizer': 'standard', 'filter': ['lowercase', {'type': 'length', 'max': 40}, {'type': 'stop', 'stop_words': ['of', 'to']}]}}}, {'name': 'pk', 'description': '', 'type': , 'is_primary': True, 'auto_id': True}, {'name': 'dense', 'description': '', 'type': , 'params': {'dim': 1536}}, {'name': 'sparse', 'description': '', 'type': , 'is_function_output': True}, {'name': 'foo', 'description': '', 'type': , 'params': {'max_length': 65535}}], 'enable_dynamic_field': False, 'functions': [{'name': 'bm25_function_7c99f463', 'description': '', 'type': , 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vectorstore.col.schema" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For more concept details, e.g., `analyzer`, `tokenizer`, `filter`, `enable_match`, `analyzer_params`, please refer to the [analyzer documentation](https://milvus.io/docs/analyzer-overview.md)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best practice of RAG\n", + "We have learned how to use the basic BM25 build-in function in LangChain and Milvus. Let's introduce the best practice of RAG in combination with this usage.\n", + "\n", + "\n", + "![](../../../../images/advanced_rag/hybrid_and_rerank.png)\n", + "\n", + "This diagram shows the Hybrid Retrieve & Reranking process, combining BM25 for keyword matching and vector search for semantic retrieval. Results from both methods are merged, reranked, and passed to an LLM to generate the final answer.\n", + "\n", + "Hybrid search balances precision and semantic understanding, improving accuracy and robustness for diverse queries. It retrieves candidates with BM25 full-text search and vector search, ensuring both semantic, context-aware, and accurate retrieval.\n", + "\n", + "Let's get started with an example." 
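One small housekeeping step before loading web pages: `WebBaseLoader` reads the `USER_AGENT` environment variable to identify your requests and prints a warning when it is unset (you can see that warning in the output of the data-loading cell below). Setting it is optional; the agent string here is just an illustrative value.

```python
import os

# Optional: identify your HTTP requests when crawling the example pages.
os.environ["USER_AGENT"] = "full-text-search-tutorial/0.1"
```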
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Prepare the data\n", + "\n", + "We use the Langchain WebBaseLoader to load documents from web sources and split them into chunks using the RecursiveCharacterTextSplitter.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "USER_AGENT environment variable not set, consider setting it to identify your requests.\n" + ] + }, + { + "data": { + "text/plain": [ + "Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.\\nComponent One: Planning#\\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\\nTask Decomposition#\\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.\\nTask decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.\\nAnother quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. 
Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains.\\nSelf-Reflection#')" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import bs4\n", + "from langchain_community.document_loaders import WebBaseLoader\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "\n", + "# Create a WebBaseLoader instance to load documents from web sources\n", + "loader = WebBaseLoader(\n", + " web_paths=(\n", + " \"https://lilianweng.github.io/posts/2023-06-23-agent/\",\n", + " \"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/\",\n", + " ),\n", + " bs_kwargs=dict(\n", + " parse_only=bs4.SoupStrainer(\n", + " class_=(\"post-content\", \"post-title\", \"post-header\")\n", + " )\n", + " ),\n", + ")\n", + "# Load documents from web sources using the loader\n", + "documents = loader.load()\n", + "# Initialize a RecursiveCharacterTextSplitter for splitting text into chunks\n", + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)\n", + "\n", + "# Split the documents into chunks using the text_splitter\n", + "docs = text_splitter.split_documents(documents)\n", + "\n", + "# Let's take a look at the first document\n", + "docs[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load the document into Milvus vector store\n", + "As the introduction above, we initialize and load the prepared documents into Milvus vector store, which contains two vector fields: `dense` is for the OpenAI embedding and `sparse` is for the BM25 function." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "vectorstore = Milvus.from_documents(\n", + " documents=docs,\n", + " embedding=OpenAIEmbeddings(),\n", + " builtin_function=BM25BuiltInFunction(),\n", + " vector_field=[\"dense\", \"sparse\"],\n", + " connection_args={\n", + " \"uri\": URI,\n", + " },\n", + " consistency_level=\"Strong\",\n", + " drop_old=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build RAG chain\n", + "We prepare the LLM instance and prompt, then conbine them into a RAG pipeline using the LangChain Expression Language." 
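The next cell converts the vector store into a retriever with default search settings. If you want the hybrid retriever to return more chunks, or to use a specific fusion strategy, the usual LangChain pattern is to pass `search_kwargs`, which are forwarded to `similarity_search`. This is only a sketch and assumes the `ranker_type`/`ranker_params` keywords shown earlier are accepted on this path as well:

```python
# Sketch: retrieve 5 chunks and fuse the `dense` and `sparse` results with a
# weighted reranker (weights follow the order of the two vector fields).
retriever_tuned = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "ranker_type": "weighted",
        "ranker_params": {"weights": [0.6, 0.4]},
    }
)
```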
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_core.prompts import PromptTemplate\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_openai import ChatOpenAI\n", + "\n", + "# Initialize the OpenAI language model for response generation\n", + "llm = ChatOpenAI(model_name=\"gpt-4o\", temperature=0)\n", + "\n", + "# Define the prompt template for generating AI responses\n", + "PROMPT_TEMPLATE = \"\"\"\n", + "Human: You are an AI assistant, and provides answers to questions by using fact based and statistical information when possible.\n", + "Use the following pieces of information to provide a concise answer to the question enclosed in tags.\n", + "If you don't know the answer, just say that you don't know, don't try to make up an answer.\n", + "\n", + "{context}\n", + "\n", + "\n", + "\n", + "{question}\n", + "\n", + "\n", + "The response should be specific and use statistics or numbers when possible.\n", + "\n", + "Assistant:\"\"\"\n", + "\n", + "# Create a PromptTemplate instance with the defined template and input variables\n", + "prompt = PromptTemplate(\n", + " template=PROMPT_TEMPLATE, input_variables=[\"context\", \"question\"]\n", + ")\n", + "# Convert the vector store to a retriever\n", + "retriever = vectorstore.as_retriever()\n", + "\n", + "\n", + "# Define a function to format the retrieved documents\n", + "def format_docs(docs):\n", + " return \"\\n\\n\".join(doc.page_content for doc in docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "Use the LCEL(LangChain Expression Language) to build a RAG chain." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation\n", + "rag_chain = (\n", + " {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "\n", + "# rag_chain.get_graph().print_ascii()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Invoke the RAG chain with a specific question and retrieve the response" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. This method offloads the solution step to a runtime, such as a Python interpreter, effectively decoupling complex computation and reasoning. PAL and PoT rely on language models with strong coding skills to perform these tasks.'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "query = \"What is PAL and PoT?\"\n", + "res = rag_chain.invoke(query)\n", + "res" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "Congratulations! 
You have built a hybrid(dense vector + sparse bm25 function) search RAG chain powered by Milvus and LangChain." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From ea49f33afa2d2fc283241097402f4a6ff07c588e Mon Sep 17 00:00:00 2001 From: ChengZi Date: Mon, 13 Jan 2025 16:34:59 +0800 Subject: [PATCH 2/3] optimize full_text_search_with_langchain Signed-off-by: ChengZi --- .../full_text_search_with_langchain.ipynb | 99 +++++++++---------- 1 file changed, 46 insertions(+), 53 deletions(-) diff --git a/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb b/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb index e277b02d8..0f78d360a 100644 --- a/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb +++ b/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb @@ -18,19 +18,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Using full-text search with LangChain and Milvus\n", + "# Using Full-Text Search with LangChain and Milvus\n", "\n", - "[Full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search) retrieves documents with specific terms or phrases in text datasets and ranks results by relevance. It overcomes semantic search limitations to provide accurate, context-relevant results. Also, it simplifies vector searches, accepting raw text and automatically converting it into sparse embeddings without manual generation. By integrating full-text search with semantic-based dense vector search, you can enhance the accuracy and relevance of search results.\n", + "[Full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search) is a traditional method for retrieving documents that contain specific terms or phrases by directly matching keywords within the text. It ranks results based on relevance, typically determined by factors such as term frequency and proximity. While semantic search excels at understanding intent and context, full-text search provides precision for exact keyword matching, making it a valuable complementary tool. The BM25 algorithm is a popular ranking method for full-text search, particularly useful in Retrieval-Augmented Generation (RAG).\n", "\n", - "BM25 is an important ranking algorithm in full-text search. Using the BM25 algorithm for relevance scoring, this feature is particularly valuable in retrieval-augmented generation (RAG) scenarios, where it prioritizes documents that closely match specific search terms. \n", + "Since [Milvus 2.5](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md), full-text search is natively supported through the `Sparse-BM25` approach, by representing the BM25 algorithm as sparse vectors. Milvus accepts raw text as input and automatically converts it into sparse vectors stored in a specified field, eliminating the need for manual sparse embedding generation.\n", "\n", - "Milvus 2.5 introduced the full-text search [feature](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md). 
As a further layer of framework, LangChain's Milvus integration has also launched this feature, making it easy to integrate full-text search into your application.\n", + "LangChain's integration with Milvus has also introduced this feature, simplifying the process of incorporating full-text search into RAG applications. By combining full-text search with semantic search with dense vectors, you can achieve a hybrid approach that leverages both semantic context from dense embeddings and precise keyword relevance from word matching. This integration enhances the accuracy, relevance, and user experience of search systems.\n", "\n", - "In this tutorial, we will show you how to use LangChain and Milvus to use full-text search into your application.\n", + "This tutorial will show how to use LangChain and Milvus to implement full-text search in your application.\n", "\n", - "> - Full text search is available in Milvus Standalone and Milvus Distributed but not Milvus Lite, although adding it to Milvus Lite is on the roadmap.\n", - "> - Before reading this tutorial, you need to have a basic understanding of [full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search). In addition, you also need to know the [basic usage](https://milvus.io/docs/basic_usage_langchain.md) of LangChain Milvus integration.\n", - "\n" + "> - Full-text search is available in Milvus Standalone and Milvus Distributed, but not in Milvus Lite, although it is on the roadmap for future inclusion. It will also be available in Zilliz Cloud (fully-managed Milvus) soon. Please reach out to support@zilliz.com for more information.\n", + "> - Before proceeding with this tutorial, ensure you have a basic understanding of [full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search) and the [basic usage](https://milvus.io/docs/basic_usage_langchain.md) of LangChain Milvus integration." ] }, { @@ -48,7 +47,7 @@ "metadata": {}, "outputs": [], "source": [ - "! pip install --upgrade --quiet langchain langchain-core langchain-community langchain-text-splitters langchain-milvus langchain-openai langchain-voyageai bs4" + "! pip install --upgrade --quiet langchain langchain-core langchain-community langchain-text-splitters langchain-milvus langchain-openai bs4 #langchain-voyageai" ] }, { @@ -72,12 +71,12 @@ } }, "source": [ - "We will use the models from OpenAI and VoyageAI. You should prepare the environment variables `OPENAI_API_KEY` from [OpenAI](https://platform.openai.com/docs/quickstart) and `VOYAGE_API_KEY` from [VoyageAI](https://docs.voyageai.com/docs/api-key-and-installation)." + "We will use the models from OpenAI. You should prepare the environment variables `OPENAI_API_KEY` from [OpenAI](https://platform.openai.com/docs/quickstart)." 
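Full-text search requires Milvus 2.5 or later (and, as noted above, is not available in Milvus Lite), so it can be worth confirming which version your server is running before continuing. A quick optional check with `pymilvus`, assuming the server is already reachable at the default standalone address used below:

```python
from pymilvus import connections, utility

# Optional sanity check: Sparse-BM25 full-text search needs Milvus >= 2.5.
connections.connect(uri="http://localhost:19530")
print(utility.get_server_version())
```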
] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": { "collapsed": false, "jupyter": { @@ -91,8 +90,7 @@ "source": [ "import os\n", "\n", - "os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"\n", - "os.environ[\"VOYAGE_API_KEY\"] = \"pa-***********\"" + "os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"" ] }, { @@ -104,7 +102,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -121,16 +119,16 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from langchain_core.documents import Document\n", "\n", "docs = [\n", - " Document(page_content=\"I like apple\", metadata={\"foo\": \"bar\"}),\n", - " Document(page_content=\"I like banana\", metadata={\"foo\": \"baz\"}),\n", - " Document(page_content=\"I like orange\", metadata={\"foo\": \"qux\"}),\n", + " Document(page_content=\"I like apple\", metadata={\"category\": \"fruit\"}),\n", + " Document(page_content=\"I like swimming\", metadata={\"category\": \"sport\"}),\n", + " Document(page_content=\"I like dogs\", metadata={\"category\": \"pets\"}),\n", "]" ] }, @@ -141,14 +139,14 @@ "## Initialization with BM25 Function\n", "### Hybrid Search\n", "\n", - "Unlike simply passing an embedding to the `VectorStore`, the Milvus VectorStore provides a `builtin_function` parameter. Through this parameter, you can pass an instance of the BM25 function.\n", + "For full-text search Milvus VectorStore accepts a `builtin_function` parameter. Through this parameter, you can pass in an instance of the `BM25BuiltInFunction`. This is different than semantic search which usually passes dense embeddings to the `VectorStore`, \n", "\n", - "Here is a simple example of combining OpenAI embeddings with the BM25 function from Milvus:" + "Here is a simple example of hybrid search in Milvus with OpenAI dense embedding for semantic search and BM25 for full-text search:" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { @@ -198,7 +196,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -207,16 +205,18 @@ "['dense1', 'dense2', 'sparse']" ] }, - "execution_count": 6, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "from langchain_voyageai import VoyageAIEmbeddings\n", + "# from langchain_voyageai import VoyageAIEmbeddings\n", + "\n", + "embedding1 = OpenAIEmbeddings(model=\"text-embedding-ada-002\")\n", + "embedding2 = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "# embedding2 = VoyageAIEmbeddings(model=\"voyage-3\") # You can also use embedding from other embedding model providers, e.g VoyageAIEmbeddings\n", "\n", - "embedding1 = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", - "embedding2 = VoyageAIEmbeddings(model=\"voyage-3\")\n", "\n", "vectorstore = Milvus.from_documents(\n", " documents=docs,\n", @@ -225,7 +225,7 @@ " input_field_names=\"text\", output_field_names=\"sparse\"\n", " ),\n", " text_field=\"text\", # `text` is the input field name of BM25BuiltInFunction\n", - " # `sparse` is the output field name of BM25BuiltInFunction, and `dense1` and `dense2` are the output field names of OpenAIEmbeddings and VoyageAIEmbeddings\n", + " # `sparse` is the output field name of BM25BuiltInFunction, and `dense1` and `dense2` are the output field names of embedding1 and embedding2\n", " 
vector_field=[\"dense1\", \"dense2\", \"sparse\"],\n", " connection_args={\n", " \"uri\": URI,\n", @@ -241,7 +241,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this example, we have three vector fields. Among them, `sparse` is used as the output field for `BM25BuiltInFunction`, while the other two, `dense1` and `dense2`, are automatically assigned as the output fields for `OpenAIEmbeddings` and `VoyageAIEmbeddings`, respectively. \n", + "In this example, we have three vector fields. Among them, `sparse` is used as the output field for `BM25BuiltInFunction`, while the other two, `dense1` and `dense2`, are automatically assigned as the output fields for the two `OpenAIEmbeddings` models. \n", "\n", "In this way, you can define multiple vector fields and assign different combinations of embeddings or functions to them, enabling hybrid search.\n" ] @@ -255,23 +255,23 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(metadata={'foo': 'qux', 'pk': 454646931479251686}, page_content='I like orange')]" + "[Document(metadata={'pk': 454646931479251826, 'category': 'fruit'}, page_content='I like apple')]" ] }, - "execution_count": 7, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorstore.similarity_search(\n", - " \"Do I like oranges?\", k=1\n", + " \"Do I like apple?\", k=1\n", ") # , ranker_type=\"weighted\", ranker_params={\"weights\":[0.3, 0.3, 0.4]})" ] }, @@ -293,7 +293,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -302,7 +302,7 @@ "['sparse']" ] }, - "execution_count": 8, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -341,7 +341,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -381,16 +381,16 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "{'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': , 'params': {'max_length': 65535, 'enable_match': True, 'enable_analyzer': True, 'analyzer_params': {'tokenizer': 'standard', 'filter': ['lowercase', {'type': 'length', 'max': 40}, {'type': 'stop', 'stop_words': ['of', 'to']}]}}}, {'name': 'pk', 'description': '', 'type': , 'is_primary': True, 'auto_id': True}, {'name': 'dense', 'description': '', 'type': , 'params': {'dim': 1536}}, {'name': 'sparse', 'description': '', 'type': , 'is_function_output': True}, {'name': 'foo', 'description': '', 'type': , 'params': {'max_length': 65535}}], 'enable_dynamic_field': False, 'functions': [{'name': 'bm25_function_7c99f463', 'description': '', 'type': , 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}" + "{'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': , 'params': {'max_length': 65535, 'enable_match': True, 'enable_analyzer': True, 'analyzer_params': {'tokenizer': 'standard', 'filter': ['lowercase', {'type': 'length', 'max': 40}, {'type': 'stop', 'stop_words': ['of', 'to']}]}}}, {'name': 'pk', 'description': '', 'type': , 'is_primary': True, 'auto_id': True}, {'name': 'dense', 'description': '', 'type': , 'params': {'dim': 1536}}, {'name': 'sparse', 'description': '', 'type': , 'is_function_output': True}, {'name': 'category', 'description': '', 'type': , 'params': {'max_length': 
65535}}], 'enable_dynamic_field': False, 'functions': [{'name': 'bm25_function_333d45a1', 'description': '', 'type': , 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}" ] }, - "execution_count": 10, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -435,23 +435,16 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 16, "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "USER_AGENT environment variable not set, consider setting it to identify your requests.\n" - ] - }, { "data": { "text/plain": [ "Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.\\nComponent One: Planning#\\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\\nTask Decomposition#\\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.\\nTask decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.\\nAnother quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. 
Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains.\\nSelf-Reflection#')" ] }, - "execution_count": 11, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -495,7 +488,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -522,7 +515,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 12, "metadata": { "collapsed": false, "jupyter": { @@ -586,7 +579,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 13, "metadata": { "collapsed": false, "jupyter": { @@ -618,16 +611,16 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. This method offloads the solution step to a runtime, such as a Python interpreter, effectively decoupling complex computation and reasoning. PAL and PoT rely on language models with strong coding skills to perform these tasks.'" + "'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. This method offloads the solution step to a runtime, such as a Python interpreter, allowing for complex computation and reasoning to be handled externally. PAL and PoT rely on language models with strong coding skills to effectively perform these tasks.'" ] }, - "execution_count": 16, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } From 4fd8ad2e594de49a92b073d810ab383febeacd53 Mon Sep 17 00:00:00 2001 From: ChengZi Date: Mon, 13 Jan 2025 19:37:53 +0800 Subject: [PATCH 3/3] refine full_text_search_with_langchain Signed-off-by: ChengZi --- .../full_text_search_with_langchain.ipynb | 44 +++++++++---------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb b/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb index 0f78d360a..7765ac29c 100644 --- a/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb +++ b/bootcamp/tutorials/integration/langchain/full_text_search_with_langchain.ipynb @@ -22,7 +22,7 @@ "\n", "[Full-text search](https://milvus.io/docs/full-text-search.md#Full-Text-Search) is a traditional method for retrieving documents that contain specific terms or phrases by directly matching keywords within the text. It ranks results based on relevance, typically determined by factors such as term frequency and proximity. While semantic search excels at understanding intent and context, full-text search provides precision for exact keyword matching, making it a valuable complementary tool. 
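For reference, this keyword relevance is usually computed with the classic BM25 scoring function, which scores a document $D$ against a query $Q = \{q_1, \dots, q_n\}$ as

$$
\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(q_i, D)$ is the term frequency of $q_i$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are tuning parameters (commonly $k_1 \in [1.2, 2.0]$ and $b = 0.75$).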
The BM25 algorithm is a popular ranking method for full-text search, particularly useful in Retrieval-Augmented Generation (RAG).\n", "\n", - "Since [Milvus 2.5](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md), full-text search is natively supported through the `Sparse-BM25` approach, by representing the BM25 algorithm as sparse vectors. Milvus accepts raw text as input and automatically converts it into sparse vectors stored in a specified field, eliminating the need for manual sparse embedding generation.\n", + "Since [Milvus 2.5](https://milvus.io/blog/introduce-milvus-2-5-full-text-search-powerful-metadata-filtering-and-more.md), full-text search is natively supported through the Sparse-BM25 approach, by representing the BM25 algorithm as sparse vectors. Milvus accepts raw text as input and automatically converts it into sparse vectors stored in a specified field, eliminating the need for manual sparse embedding generation.\n", "\n", "LangChain's integration with Milvus has also introduced this feature, simplifying the process of incorporating full-text search into RAG applications. By combining full-text search with semantic search with dense vectors, you can achieve a hybrid approach that leverages both semantic context from dense embeddings and precise keyword relevance from word matching. This integration enhances the accuracy, relevance, and user experience of search systems.\n", "\n", @@ -97,7 +97,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Install and start the Milvus server following this [guide](https://milvus.io/docs/install_standalone-docker-compose.md). And set your Milvus server `URI` (or optional `TOKEN`)" + "Specify your Milvus server `URI` (and optionally the `TOKEN`). For how to install and start the Milvus server following this [guide](https://milvus.io/docs/install_standalone-docker-compose.md). " ] }, { @@ -126,7 +126,7 @@ "from langchain_core.documents import Document\n", "\n", "docs = [\n", - " Document(page_content=\"I like apple\", metadata={\"category\": \"fruit\"}),\n", + " Document(page_content=\"I like this apple\", metadata={\"category\": \"fruit\"}),\n", " Document(page_content=\"I like swimming\", metadata={\"category\": \"sport\"}),\n", " Document(page_content=\"I like dogs\", metadata={\"category\": \"pets\"}),\n", "]" @@ -181,17 +181,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the code above, we define an instance of `BM25BuiltInFunction` and pass it to the `Milvus` object. `BM25BuiltInFunction` is a lightweight wrapper class for the [`Function`](https://milvus.io/docs/manage-collections.md#Function) in Milvus.\n", + "In the code above, we define an instance of `BM25BuiltInFunction` and pass it to the `Milvus` object. `BM25BuiltInFunction` is a lightweight wrapper class for [`Function`](https://milvus.io/docs/manage-collections.md#Function) in Milvus.\n", "\n", - "You can specify the input and output fields for this function in the parameters of the `BM25BuiltInFunction` instance by passing the following two field parameters:\n", + "You can specify the input and output fields for this function in the parameters of the `BM25BuiltInFunction`:\n", "- `input_field_names` (str): The name of the input field, default is `text`. It indicates which field this function reads as input.\n", "- `output_field_names` (str): The name of the output field, default is `sparse`. 
It indicates which field this function outputs the computed result to.\n", "\n", - "Note that in the Milvus initialization parameters mentioned above, we also specify `vector_field=[\"dense\", \"sparse\"]`. Since the `sparse` field is the output field defined by the `BM25BuiltInFunction`, the other `dense` field will be automatically assigned to the output field of OpenAIEmbeddings.\n", + "Note that in the Milvus initialization parameters mentioned above, we also specify `vector_field=[\"dense\", \"sparse\"]`. Since the `sparse` field is taken as the output field defined by the `BM25BuiltInFunction`, the other `dense` field will be automatically assigned to the output field of OpenAIEmbeddings.\n", "\n", - "In practice, especially when combining multiple embeddings or functions, we recommend clearly specifying the input and output fields for each function to avoid confusion.\n", + "In practice, especially when combining multiple embeddings or functions, we recommend explicitly specifying the input and output fields for each function to avoid ambiguity.\n", "\n", - "In the following example, it specifies the input and output fields of BM25BuiltInFunction, and three vector fields, which makes it clear which field each built-in function and each vector embedding.\n" + "In the following example, we specify the input and output fields of `BM25BuiltInFunction` explicitly, making it clear which field the built-in function is for.\n" ] }, { @@ -241,16 +241,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this example, we have three vector fields. Among them, `sparse` is used as the output field for `BM25BuiltInFunction`, while the other two, `dense1` and `dense2`, are automatically assigned as the output fields for the two `OpenAIEmbeddings` models. \n", + "In this example, we have three vector fields. Among them, `sparse` is used as the output field for `BM25BuiltInFunction`, while the other two, `dense1` and `dense2`, are automatically assigned as the output fields for the two `OpenAIEmbeddings` models (based on the order). \n", "\n", - "In this way, you can define multiple vector fields and assign different combinations of embeddings or functions to them, enabling hybrid search.\n" + "In this way, you can define multiple vector fields and assign different combinations of embeddings or functions to them, to implement hybrid search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "When performing hybrid search, we just pass in the query text and optionally set the topK and reranker parameters. The `vectorstore` instance will automatically handle the vector embeddings and built-in functions and finally use a reranker to refine the results. From the user's end, we don't need to care about the underlying implementation details of the searching process." + "When performing hybrid search, we just need to pass in the query text and optionally set the topK and reranker parameters. The `vectorstore` instance will automatically handle the vector embeddings and built-in functions and finally use a reranker to refine the results. The underlying implementation details of the searching process are hidden from the user." 
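If you also want to see the fused relevance score of each hit, or restrict the candidates with a metadata filter, the usual options of the Milvus vector store apply. A short sketch, assuming `similarity_search_with_score` and the `expr` boolean-expression filter behave here as they do for a single-vector store:

```python
# Inspect the fused score returned for each hit.
for doc, score in vectorstore.similarity_search_with_score("Do I like apples?", k=2):
    print(score, doc.page_content)

# Restrict candidates with a Milvus boolean expression on the metadata field.
fruit_hits = vectorstore.similarity_search(
    "What do I like?", k=2, expr='category == "fruit"'
)
```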
] }, { @@ -261,7 +261,7 @@ { "data": { "text/plain": [ - "[Document(metadata={'pk': 454646931479251826, 'category': 'fruit'}, page_content='I like apple')]" + "[Document(metadata={'category': 'fruit', 'pk': 454646931479251897}, page_content='I like this apple')]" ] }, "execution_count": 6, @@ -271,7 +271,7 @@ ], "source": [ "vectorstore.similarity_search(\n", - " \"Do I like apple?\", k=1\n", + " \"Do I like apples?\", k=1\n", ") # , ranker_type=\"weighted\", ranker_params={\"weights\":[0.3, 0.3, 0.4]})" ] }, @@ -279,7 +279,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about how to use the hybrid search, you can refer to the [Hybrid Search introduction](https://milvus.io/docs/multi-vector-search.md#Hybrid-Search) and this [LangChain Milvus hybrid search tutorial](https://milvus.io/docs/milvus_hybrid_search_retriever.md) ." + "For more information about hybrid search, you can refer to the [Hybrid Search introduction](https://milvus.io/docs/multi-vector-search.md#Hybrid-Search) and this [LangChain Milvus hybrid search tutorial](https://milvus.io/docs/milvus_hybrid_search_retriever.md) ." ] }, { @@ -288,7 +288,7 @@ "source": [ "### BM25 search without embedding\n", "\n", - "If you want to perform lexical frequency-based full-text search using only a single BM25 function without using any embedding-based semantic similarity search, you can set the embedding parameter input to `None` and keep only the `builtin_function` parameter input as the BM25 function instance. For example: " + "If you want to perform only full-text search with BM25 function without using any embedding-based semantic search, you can set the embedding parameter to `None` and keep only the `builtin_function` specified as the BM25 function instance. The vector field only has \"sparse\" field. For example: " ] }, { @@ -331,11 +331,11 @@ "source": [ "## Customize analyzer\n", "\n", - "Analyzers are essential tools in text processing that convert raw text into structured, searchable formats. They play a key role in enabling efficient indexing and retrieval by breaking down input text into tokens and refining these tokens through a combination of tokenizers and filters. For more information, you can refer [this guide](https://milvus.io/docs/analyzer-overview.md#Analyzer-Overview) to learn more about analyzers in Milvus.\n", + "Analyzers are essential in full-text search by breaking the sentence into tokens and performing lexical analysis like stemming and stop word removal. Analyzers are usually language-specific. You can refer to [this guide](https://milvus.io/docs/analyzer-overview.md#Analyzer-Overview) to learn more about analyzers in Milvus.\n", "\n", - "Milvus supports two types of analyzers: **Built-in Analyzers** and **Custom Analyzers**. By default, the `BM25BuiltInFunction` will use the [default standard analyzer](https://milvus.io/docs/standard-analyzer.md), which makes it effective for most languages. \n", + "Milvus supports two types of analyzers: **Built-in Analyzers** and **Custom Analyzers**. By default, the `BM25BuiltInFunction` will use the [standard built-in analyzer](https://milvus.io/docs/standard-analyzer.md), which is the most basic analyzer that tokenizes the text with punctuation. 
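Milvus also ships other built-in analyzers that can be selected simply by name. For example, for an English-only corpus you could switch the BM25 function to the `english` built-in analyzer (which, per the Milvus analyzer docs, adds lowercasing, stemming, and English stop-word removal). A minimal sketch:

```python
# Built-in analyzer selected by name instead of a custom tokenizer/filter chain.
english_bm25 = BM25BuiltInFunction(
    output_field_names="sparse",
    analyzer_params={"type": "english"},
)
```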
\n", "\n", - "However, if you want to use a different analyzer or customize the analyzer, you can pass in the `analyzer_params` parameter in the `BM25BuiltInFunction` initialization.\n", + "If you want to use a different analyzer or customize the analyzer, you can pass in the `analyzer_params` parameter in the `BM25BuiltInFunction` initialization.\n", "\n" ] }, @@ -387,7 +387,7 @@ { "data": { "text/plain": [ - "{'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': , 'params': {'max_length': 65535, 'enable_match': True, 'enable_analyzer': True, 'analyzer_params': {'tokenizer': 'standard', 'filter': ['lowercase', {'type': 'length', 'max': 40}, {'type': 'stop', 'stop_words': ['of', 'to']}]}}}, {'name': 'pk', 'description': '', 'type': , 'is_primary': True, 'auto_id': True}, {'name': 'dense', 'description': '', 'type': , 'params': {'dim': 1536}}, {'name': 'sparse', 'description': '', 'type': , 'is_function_output': True}, {'name': 'category', 'description': '', 'type': , 'params': {'max_length': 65535}}], 'enable_dynamic_field': False, 'functions': [{'name': 'bm25_function_333d45a1', 'description': '', 'type': , 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}" + "{'auto_id': True, 'description': '', 'fields': [{'name': 'text', 'description': '', 'type': , 'params': {'max_length': 65535, 'enable_match': True, 'enable_analyzer': True, 'analyzer_params': {'tokenizer': 'standard', 'filter': ['lowercase', {'type': 'length', 'max': 40}, {'type': 'stop', 'stop_words': ['of', 'to']}]}}}, {'name': 'pk', 'description': '', 'type': , 'is_primary': True, 'auto_id': True}, {'name': 'dense', 'description': '', 'type': , 'params': {'dim': 1536}}, {'name': 'sparse', 'description': '', 'type': , 'is_function_output': True}, {'name': 'category', 'description': '', 'type': , 'params': {'max_length': 65535}}], 'enable_dynamic_field': False, 'functions': [{'name': 'bm25_function_de368e79', 'description': '', 'type': , 'input_field_names': ['text'], 'output_field_names': ['sparse'], 'params': {}}]}" ] }, "execution_count": 9, @@ -410,8 +410,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Best practice of RAG\n", - "We have learned how to use the basic BM25 build-in function in LangChain and Milvus. Let's introduce the best practice of RAG in combination with this usage.\n", + "## Using Hybrid Search and Reranking in RAG\n", + "We have learned how to use the basic BM25 build-in function in LangChain and Milvus. Let's introduce an optimized RAG implementation with hybrid search and reranking.\n", "\n", "\n", "![](../../../../images/advanced_rag/hybrid_and_rerank.png)\n", @@ -617,7 +617,7 @@ { "data": { "text/plain": [ - "'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. This method offloads the solution step to a runtime, such as a Python interpreter, allowing for complex computation and reasoning to be handled externally. PAL and PoT rely on language models with strong coding skills to effectively perform these tasks.'" + "'PAL (Program-aided Language models) and PoT (Program of Thoughts prompting) are approaches that involve using language models to generate programming language statements to solve natural language reasoning problems. 
This method offloads the solution step to a runtime, such as a Python interpreter, allowing for complex computation and reasoning to be handled externally. PAL and PoT rely on language models with strong coding skills to effectively generate and execute these programming statements.'" ] }, "execution_count": 15,