Commit

Merge branch 'main' into main

akshayballal95 authored Sep 11, 2024
2 parents 3ea2c6a + 903973f commit 13ce368

Showing 8 changed files with 210 additions and 95 deletions.
61 changes: 43 additions & 18 deletions CONTRIBUTING.md
@@ -1,31 +1,56 @@
## 🚀 Getting Started

To get started, check the [Issues Section] for tasks labeled **"Good First Issue"** or **"Help Needed"**. These issues are perfect for new contributors or those looking to make a valuable impact quickly.

If you find an issue you want to tackle:

1. Comment on the issue to let us know you'd like to work on it.
2. Wait for confirmation; an admin will assign the issue to you.

## 💻 Setting Up Your Development Environment

To start working on the project, follow these steps:

1. Fork the Repository: Begin by forking the repository from the `dev` branch. We do not allow direct contributions to the `main` branch.
2. Clone Your Fork: After forking, clone the repository to your local machine.
3. Create a New Branch: For each contribution, create a new branch following the naming convention `feature/your-feature-name` or `bugfix/your-bug-name`.

## 🛠️ Contributing Guidelines

### 🔍 Reporting Bugs

If you find a bug, here's how to report it effectively:

1. Title: Use a clear and descriptive title, with appropriate labels.
2. Description: Provide a detailed description of the issue, including:
   - Steps to reproduce the problem.
   - Expected and actual behavior.
   - Any relevant logs, screenshots, or additional context.
3. Submit the Bug Report: Open a new issue in the [Issues Section] and include all the details. This helps us understand and resolve the problem faster.
## 🐍 Contributing to Python Code

If you're contributing to the Python codebase, follow these steps:

1. Create an Independent File: Write your code in a new file within the `python` folder.
2. Build with Maturin: After writing your code, use `maturin build` to build the package.
3. Import the Module: Use the following import syntax:
   `from embed_anything.<Library_name> import *`
4. Call the Function: Then, call the function using:
   `from embed_anything import <function_name>`

Feel free to open an issue if you encounter any problems during the process. A minimal sketch of this workflow is shown below.
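For illustration only, here is a rough sketch of that workflow. The module name `my_module` and the function `embed_custom` are hypothetical placeholders, not part of the actual package:

```python
# python/python/embed_anything/my_module.py -- hypothetical new file
# (the file name and function below are placeholders for illustration)

def embed_custom(path: str) -> list[float]:
    """Toy placeholder for a function exposed by a new module."""
    # A real contribution would call into the compiled Rust core here.
    return [0.0]

# After `maturin build` (or `maturin develop` for local testing), import it:
# from embed_anything.my_module import *
# from embed_anything import embed_custom
# embed_custom("test_files/test.pdf")
```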

## 🧩 Contributing to Adapters

To contribute to adapters, follow these guidelines (a rough sketch follows below):

1. Implement Adapter Class: Create an `Adapter` class that supports the `create`, `add`, and `delete` operations for your specific use case.
2. Check Existing Adapters: Use the existing Pinecone and Weaviate adapters as references to maintain consistency in structure and functionality.
3. Testing: Ensure your adapter is tested thoroughly before submitting a pull request.
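As a sketch of the expected shape, assuming the `create`/`add`/`delete` interface described above; the class name and method signatures here are illustrative assumptions, so check the actual Pinecone and Weaviate adapters for the exact interface:

```python
from typing import Any


class MyVectorDBAdapter:
    """Hypothetical adapter for a vector database (illustrative only)."""

    def create(self, dimension: int, metric: str = "cosine") -> None:
        # Create (or connect to) an index with the given dimensionality.
        raise NotImplementedError

    def add(self, embeddings: list[dict[str, Any]]) -> None:
        # Upsert a batch of embedded chunks (vector + metadata) into the index.
        raise NotImplementedError

    def delete(self, index_name: str) -> None:
        # Drop an index, or remove its vectors.
        raise NotImplementedError
```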

### 🔄 Submitting a Pull Request

Once your contribution is ready:

1. Push Your Branch: Push your branch to your forked repository.
2. Submit a Pull Request (PR): Open a PR from your branch to the `dev` branch of the main repository. Ensure your PR includes:
   - A clear description of the changes.
   - Any relevant issue numbers (e.g., "Closes #123").
3. Wait for Review: A maintainer will review your PR. Please be responsive to any feedback or requested changes.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

114 changes: 84 additions & 30 deletions README.md
@@ -20,7 +20,7 @@
<div align="center">

<p align="center">
<b>Generate and stream your embeddings with a minimalist and lightning-fast framework built in Rust 🦀</b>
<br />
<a href="https://starlightsearch.github.io/EmbedAnything/references/"><strong>Explore the docs »</strong></a>
<br />
@@ -98,10 +98,11 @@ We support a range of models, that can be supported by Candle, We have given a s

## How to add a custom model and chunk size
```python
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
```


@@ -126,7 +127,79 @@ embed_config = EmbedConfig(jina=jina_config)
`pip install embed-anything`


# Usage



## ➡️ Usage for 0.3 and later versions


### To use local embedding: we support Bert and Jina

```python
model = EmbeddingModel.from_pretrained_local(
WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
```



### For multimodal embedding: we support CLIP

Requirements: a directory with the pictures you want to search; for example, `test_files` with images of cats, dogs, etc.

```python
import embed_anything
import numpy as np
from PIL import Image
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# The text field of each EmbedData holds the image path
Image.open(data[max_index].text).show()
```

## Audio Embedding using Whisper
Requirements: audio `.wav` files.


```python
import embed_anything
from embed_anything import (
AudioDecoderModel,
EmbeddingModel,
embed_audio_file,
TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
"openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
embed_anything.WhichModel.Bert,
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
"test_files/audio/samples_hp0.wav",
audio_decoder=audio_decoder,
embeder=embeder,
text_embed_config=config,
)
print(data[0].metadata)

```

## ➡️ Usage for 0.2

### To use local embedding: we support Bert and Jina

@@ -191,6 +264,7 @@ print("Time taken: ", end_time - start_time)




## 🚧 Contributing to EmbedAnything


@@ -216,35 +290,15 @@ One of the aims of EmbedAnything is to allow AI engineers to easily use state of
✅Custom chunk size <br />
✅Pinecone Adapter, to directly save it on it. <br />
✅Zero-shot application <br />
✅Vector database integration via streaming adapters <br />
✅Refactoring for intuitive functions <br />

Yet to be done: <br />
☑️Introducing chunkwise streaming instead of file <br />
☑️Graph embedding -- build deepwalks embeddings depth first and word to vec <br />
☑️Video Embedding <br />
☑️Yolo Clip <br />
☑️Add more Vector Database Adapters


## ✔️ Code of Conduct:

Please read our [Code of Conduct] to understand the expectations we have for all contributors participating in this project. By participating, you agree to abide by our Code of Conduct.





15 changes: 15 additions & 0 deletions SECURITY.md
@@ -0,0 +1,15 @@
# Security Policy

## Supported Versions

The following versions of the project are currently supported with security updates.

| Version | Supported |
| ------- | ------------------ |
| 0.2.x | :white_check_mark: |

## Reporting a Vulnerability



9 changes: 9 additions & 0 deletions docs/blog.md
@@ -0,0 +1,9 @@
## Blog

### `embed-anything == 0.3.0` 🎉

1. Code refactored: all the major functions are refactored, making calling models more intuitive and optimized. Check out our docs and usage.
2. Better folder management for Python and Rust. In the past, we have seen confusion regarding how to contribute to Python. Here is a guide on how to do it.
3. Async support and fixed image streaming.
4. Vector streaming by chunks allows you to stream embeddings as a set of chunks; a sketch follows below.
5. Adapter examples for Weaviate, Pinecone, and Elastic, with more on the way…
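As a rough illustration of item 4, here is a minimal sketch of vector streaming. It assumes a `pinecone_adapter` built as in the adapter examples, and the keyword names follow the current usage examples rather than a definitive API:

```python
import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

# Build a text embedding model, mirroring the usage examples in the README.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)

# pinecone_adapter = ...  # assumed to be set up as in the adapter examples

# With an adapter attached, embeddings are streamed to the vector database
# chunk by chunk instead of accumulating in memory.
data = embed_anything.embed_file(
    "test_files/test.pdf", embeder=model, config=config, adapter=pinecone_adapter
)
```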
4 changes: 2 additions & 2 deletions python/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "embed_anything_python"
version = "0.2.2"
version = "0.3.0"
edition = "2021"

[lib]
@@ -16,4 +16,4 @@ tokio = { version = "1.39.0", features = ["rt-multi-thread"]}
extension-module = ["pyo3/extension-module"]
mkl = ["embed_anything/mkl"]
accelerate = ["embed_anything/accelerate"]
cuda = ["embed_anything/cuda"]
cuda = ["embed_anything/cuda"]
98 changes: 55 additions & 43 deletions python/python/embed_anything/__init__.py
@@ -14,45 +14,50 @@
```python
import embed_anything
from embed_anything import EmbedData, EmbeddingModel, WhichModel
import numpy as np

# For text files
model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

# For images
model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)

# For audio files
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from
# https://huggingface.co/distil-whisper or
# https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
```
@@ -78,17 +83,24 @@
# Initialize the PineconeEmbedder class
pinecone_adapter.create_index(dimension=512, metric="cosine")
# bert_model = EmbeddingModel.from_pretrained_hf(
#     WhichModel.Bert, "sentence-transformers/all-MiniLM-L12-v2", revision="main"
# )
clip_model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Clip, "openai/clip-vit-base-patch16", revision="main"
)
embed_config = TextEmbedConfig(chunk_size=512, batch_size=32)
data = embed_anything.embed_image_directory(
    "test_files",
    embeder=clip_model,
    adapter=pinecone_adapter,
    # config=embed_config,
)
```