See the app live here: Aesop Review Chatbot
This project employs a novel approach to handling and analyzing transcribed YouTube product reviews by leveraging Retrieval Augmented Generation (RAG) with an emphasis on speed and efficiency. Instead of using traditional document embeddings, which can be computationally intensive and slow for real-time applications, the project adopts a text splitting strategy. This method significantly increases the processing speed, making the system more responsive and efficient for end-users.
- Text Splitting: The project utilizes the `RecursiveCharacterTextSplitter` from the `LangChain` library to split the transcribed text into smaller chunks. Specifically, the text is divided into chunks of 400 characters with an overlap of 40 characters, using separators such as new lines and spaces. This fine-grained splitting strategy ensures that context is preserved while making the retrieval process faster and more efficient.
- Embeddings: For converting text chunks into vector representations, the project uses `HuggingFaceEmbeddings` with the model `"all-MiniLM-L6-v2"`. These embeddings are known for their balance between performance and computational efficiency, making them an ideal choice for real-time applications where speed is crucial.
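A minimal sketch of this splitting and embedding step, using the classic LangChain import paths (the exact separator list and the `transcript` source are illustrative assumptions):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings

# Assumed: the transcribed YouTube review text lives in a local file.
transcript = open("transcript.txt").read()

# Split into 400-character chunks with 40-character overlap,
# breaking on new lines and spaces as described above.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=40,
    separators=["\n", " "],
)
chunks = splitter.split_text(transcript)

# Lightweight sentence-transformer embeddings for each chunk.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
```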
The vector representations of the text chunks are stored in a `FAISS` vector database. `FAISS` is an efficient library for similarity search and clustering of dense vectors, which further contributes to the speed and efficiency of the retrieval process. By using `FAISS`, the project can quickly perform similarity searches among the chunks, enabling the RAG model to retrieve relevant information in response to user queries.
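Continuing the sketch above, indexing the chunks and querying the index might look like this (the example query is an assumption):

```python
from langchain.vectorstores import FAISS

# Build the FAISS index from the text chunks and their embeddings.
vectorstore = FAISS.from_texts(chunks, embeddings)

# Retrieve the chunks most similar to a user query.
docs = vectorstore.similarity_search(
    "What does the reviewer think of the hand cream?", k=4
)
```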
The decision to use text splitting and `FAISS` over traditional document embeddings is driven by the need for speed. Document embeddings, while effective for capturing the semantic meaning of entire documents, can be slow to compute and cumbersome to work with in real-time applications. In contrast, splitting the text into smaller chunks and using efficient embeddings allow for faster processing and retrieval, significantly improving the responsiveness of the chatbot interface.
- Efficient Data Handling: By splitting texts instead of using traditional embeddings, the project achieves faster processing times.
- Lightweight Model Usage: `llama.cpp` offers an efficient way to use large language models without the need for extensive hardware resources (see the sketch after this list).
- You can also compare its performance against Llama-2-7b-chat-hf using the "LLAMA2-7B_Aesop_Reviewer.ipynb" file.
- This chatbot has conversational memory and can hold follow-up conversations within the same session.
- It runs on a Mac M2 Pro.
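A sketch of how the pieces above could be wired into a chat chain with conversational memory, loading the quantized model through LangChain's `LlamaCpp` wrapper (the model path matches the download step below; `n_ctx`, `temperature`, and the example question are assumptions):

```python
from langchain.llms import LlamaCpp
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Load the quantized Mistral model via llama.cpp.
llm = LlamaCpp(
    model_path="models/mistral-7b-instruct-v0.1.Q5_0.gguf",
    n_ctx=2048,        # assumed context window
    temperature=0.7,   # assumed sampling temperature
)

# Buffer memory lets the chain answer follow-up questions in the same session.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Combine the FAISS retriever from the earlier sketch with the local LLM.
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory,
)

print(chain({"question": "What does the reviewer say about the scent?"})["answer"])
```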
You will also need to change how you install the `llama-cpp-python` package depending on your OS and whether you are planning on using a GPU or not.
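For example (the Metal flag below follows the `llama-cpp-python` documentation for Apple Silicon; adjust the build flags for your own platform):

```
# Default CPU-only build:
pip install llama-cpp-python

# Apple Silicon with Metal GPU acceleration:
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```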
- Install `llama-cpp-python`
- Install `langchain`
- Install `streamlit`
- Install `beautifulsoup4`
- Install `sentence-transformers`
- Install `docarray`
- Install `pydantic` 1.10.8
- Download the Mistral model from TheBloke's repo on HuggingFace: mistral-7b-instruct-v0.1.Q5_0.gguf
- Place the model file in the `models` subfolder
- Run streamlit
The setup assumes you have `python` already installed and the `venv` module available.
- Download the code or clone the repository.
- Inside the root folder of the repository, initialize a Python virtual environment:

```
python -m venv venv
```
- Activate the Python environment:

```
source venv/bin/activate
```
- Install the required modules (`langchain`, `llama.cpp`, and `streamlit`, along with `beautifulsoup4`, `pymupdf`, `sentence-transformers`, `docarray`, and `pydantic` 1.10.8):

```
pip install -r requirements.txt
```
- Create a subdirectory to place the models in:

```
mkdir -p models
```
- Download the `Mistral7b` quantized model from `huggingface` at the following link: mistral-7b-instruct-v0.1.Q5_0.gguf
- Start `streamlit`:

```
streamlit run main.py
```