Commit

Merge branch 'main' into main

akshayballal95 authored Sep 11, 2024
2 parents 3ea2c6a + 903973f commit 13ce368

Showing 8 changed files with 210 additions and 95 deletions.
61 changes: 43 additions & 18 deletions CONTRIBUTING.md
@@ -1,31 +1,56 @@
## 🚀 Getting Started

To get started, check the [Issues Section] for tasks labeled **"Good First Issue"** or **"Help Needed"**. These issues are perfect for new contributors or those looking to make a valuable impact quickly.

If you find an issue you want to tackle:

1. Comment on the issue to let us know you'd like to work on it.
2. Wait for confirmation; an admin will assign the issue to you.

## 💻 Setting Up Your Development Environment

To start working on the project, follow these steps:

1. Fork the Repository: Begin by forking the repository from the `dev` branch. We do not allow direct contributions to the `main` branch.
2. Clone Your Fork: After forking, clone the repository to your local machine.
3. Create a New Branch: For each contribution, create a new branch following the naming convention `feature/your-feature-name` or `bugfix/your-bug-name`.

## 🛠️ Contributing Guidelines

### 🔍 Reporting Bugs

If you find a bug, here's how to report it effectively:

1. Title: Use a clear and descriptive title, with appropriate labels.
2. Description: Provide a detailed description of the issue, including:
   - Steps to reproduce the problem.
   - Expected and actual behavior.
   - Any relevant logs, screenshots, or additional context.
3. Submit the Bug Report: Open a new issue in the [Issues Section] and include all the details. This helps us understand and resolve the problem faster.
## 🐍 Contributing to Python Code

If you're contributing to the Python codebase, follow these steps:

1. Create an Independent File: Write your code in a new file within the `python` folder.
2. Build with Maturin: After writing your code, use `maturin build` to build the package.
3. Import the Module: Use the following import syntax:
   `from embed_anything.<Library_name> import *`
4. Call the Function: Then, call the function using:
   `from embed_anything import <function_name>`

Feel free to open an issue if you encounter any problems during the process. A minimal sketch of this workflow is shown below.
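For illustration only, here is a rough sketch of that workflow. The module name `my_module` and the function `embed_custom` are hypothetical placeholders, not part of the actual package:

```python
# python/python/embed_anything/my_module.py -- hypothetical new file
# (the file name and function below are placeholders for illustration)

def embed_custom(path: str) -> list[float]:
    """Toy placeholder for a function exposed by a new module."""
    # A real contribution would call into the compiled Rust core here.
    return [0.0]

# After `maturin build` (or `maturin develop` for local testing), import it:
# from embed_anything.my_module import *
# from embed_anything import embed_custom
# embed_custom("test_files/test.pdf")
```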

## 🧩 Contributing to Adapters

To contribute to adapters, follow these guidelines (a rough sketch follows below):

1. Implement Adapter Class: Create an `Adapter` class that supports the `create`, `add`, and `delete` operations for your specific use case.
2. Check Existing Adapters: Use the existing Pinecone and Weaviate adapters as references to maintain consistency in structure and functionality.
3. Testing: Ensure your adapter is tested thoroughly before submitting a pull request.
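As a sketch of the expected shape, assuming the `create`/`add`/`delete` interface described above; the class name and method signatures here are illustrative assumptions, so check the actual Pinecone and Weaviate adapters for the exact interface:

```python
from typing import Any


class MyVectorDBAdapter:
    """Hypothetical adapter for a vector database (illustrative only)."""

    def create(self, dimension: int, metric: str = "cosine") -> None:
        # Create (or connect to) an index with the given dimensionality.
        raise NotImplementedError

    def add(self, embeddings: list[dict[str, Any]]) -> None:
        # Upsert a batch of embedded chunks (vector + metadata) into the index.
        raise NotImplementedError

    def delete(self, index_name: str) -> None:
        # Drop an index, or remove its vectors.
        raise NotImplementedError
```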

### 🔄 Submitting a Pull Request

Once your contribution is ready:

1. Push Your Branch: Push your branch to your forked repository.
2. Submit a Pull Request (PR): Open a PR from your branch to the `dev` branch of the main repository. Ensure your PR includes:
   - A clear description of the changes.
   - Any relevant issue numbers (e.g., "Closes #123").
3. Wait for Review: A maintainer will review your PR. Please be responsive to any feedback or requested changes.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

114 changes: 84 additions & 30 deletions README.md
@@ -20,7 +20,7 @@
<div align="center">

<p align="center">
<b>Generate and stream your embeddings with a minimalist and lightning-fast framework built in Rust 🦀</b>
<br />
<a href="https://starlightsearch.github.io/EmbedAnything/references/"><strong>Explore the docs »</strong></a>
<br />
@@ -98,10 +98,11 @@ We support a range of models, that can be supported by Candle, We have given a s

## How to add a custom model and chunk size
```python
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_file("file_address", embeder=model, config=config)
```


@@ -126,7 +127,79 @@ embed_config = EmbedConfig(jina=jina_config)
`pip install embed-anything`


# Usage



## ➡️ Usage for 0.3 and later versions


### To use local embedding: we support Bert and Jina

```python
model = EmbeddingModel.from_pretrained_local(
WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
```



### For multimodal embedding: we support CLIP

Requirements: a directory with the pictures you want to search; for example, `test_files` with images of cats, dogs, etc.

```python
import embed_anything
import numpy as np
from PIL import Image
from embed_anything import EmbedData

model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
# The text field of each EmbedData holds the image path
Image.open(data[max_index].text).show()
```

## Audio Embedding using Whisper
Requirements: audio `.wav` files.


```python
import embed_anything
from embed_anything import (
AudioDecoderModel,
EmbeddingModel,
embed_audio_file,
TextEmbedConfig,
)
# choose any whisper or distilwhisper model from https://huggingface.co/distil-whisper or https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
"openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
embed_anything.WhichModel.Bert,
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
"test_files/audio/samples_hp0.wav",
audio_decoder=audio_decoder,
embeder=embeder,
text_embed_config=config,
)
print(data[0].metadata)

```

## ➡️ Usage for 0.2

### To use local embedding: we support Bert and Jina

@@ -191,6 +264,7 @@ print("Time taken: ", end_time - start_time)




## 🚧 Contributing to EmbedAnything


@@ -216,35 +290,15 @@ One of the aims of EmbedAnything is to allow AI engineers to easily use state of
✅Custom chunk size <br />
✅Pinecone Adapter, to directly save it on it. <br />
✅Zero-shot application <br />
✅Vector database integration via streaming adapters <br />
✅Refactoring for intuitive functions <br />

Yet to be done: <br />
☑️Introducing chunkwise streaming instead of file <br />
☑️Graph embedding -- build deepwalks embeddings depth first and word to vec <br />
☑️Video Embedding <br />
☑️Yolo Clip <br />
☑️Add more Vector Database Adapters


## ✔️ Code of Conduct:

Please read our [Code of Conduct] to understand the expectations we have for all contributors participating in this project. By participating, you agree to abide by our Code of Conduct.





15 changes: 15 additions & 0 deletions SECURITY.md
@@ -0,0 +1,15 @@
# Security Policy

## Supported Versions

The following versions of the project are currently supported with security updates.

| Version | Supported |
| ------- | ------------------ |
| 0.2.x | :white_check_mark: |

## Reporting a Vulnerability



9 changes: 9 additions & 0 deletions docs/blog.md
@@ -0,0 +1,9 @@
## Blog

### `embed-anything == 0.3.0` 🎉

1. Code refactored: all the major functions are refactored, making calling models more intuitive and optimized. Check out our docs and usage.
2. Better folder management for Python and Rust. In the past, we have seen confusion regarding how to contribute to Python. Here is a guide on how to do it.
3. Async support and fixed image streaming.
4. Vector streaming by chunks allows you to stream embeddings as a set of chunks; a sketch follows below.
5. Adapter examples for Weaviate, Pinecone, and Elastic, with more on the way…
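As a rough illustration of item 4, here is a minimal sketch of vector streaming. It assumes a `pinecone_adapter` built as in the adapter examples, and the keyword names follow the current usage examples rather than a definitive API:

```python
import embed_anything
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig

# Build a text embedding model, mirroring the usage examples in the README.
model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)

# pinecone_adapter = ...  # assumed to be set up as in the adapter examples

# With an adapter attached, embeddings are streamed to the vector database
# chunk by chunk instead of accumulating in memory.
data = embed_anything.embed_file(
    "test_files/test.pdf", embeder=model, config=config, adapter=pinecone_adapter
)
```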
4 changes: 2 additions & 2 deletions python/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "embed_anything_python"
version = "0.2.2"
version = "0.3.0"
edition = "2021"

[lib]
@@ -16,4 +16,4 @@ tokio = { version = "1.39.0", features = ["rt-multi-thread"]}
extension-module = ["pyo3/extension-module"]
mkl = ["embed_anything/mkl"]
accelerate = ["embed_anything/accelerate"]
cuda = ["embed_anything/cuda"]
cuda = ["embed_anything/cuda"]
98 changes: 55 additions & 43 deletions python/python/embed_anything/__init__.py
@@ -14,45 +14,50 @@
```python
import embed_anything
from embed_anything import EmbedData, EmbeddingModel, WhichModel
import numpy as np

# For text files
model = EmbeddingModel.from_pretrained_local(
    WhichModel.Bert, model_id="Hugging_face_link"
)
data = embed_anything.embed_file("test_files/test.pdf", embeder=model)

# For images
model = embed_anything.EmbeddingModel.from_pretrained_local(
    embed_anything.WhichModel.Clip,
    model_id="openai/clip-vit-base-patch16",
    # revision="refs/pr/15",
)
data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
embeddings = np.array([d.embedding for d in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
    embed_anything.embed_query(query, embeder=model)[0].embedding
)

# For audio files
from embed_anything import (
    AudioDecoderModel,
    EmbeddingModel,
    embed_audio_file,
    TextEmbedConfig,
)
# choose any whisper or distilwhisper model from
# https://huggingface.co/distil-whisper or
# https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
audio_decoder = AudioDecoderModel.from_pretrained_hf(
    "openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
embeder = EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Bert,
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    revision="main",
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
    "test_files/audio/samples_hp0.wav",
    audio_decoder=audio_decoder,
    embeder=embeder,
    text_embed_config=config,
)
```
@@ -78,17 +83,24 @@
# Initialize the PineconeEmbedder class
pinecone_adapter.create_index(dimension=512, metric="cosine")
# bert_model = EmbeddingModel.from_pretrained_hf(
#     WhichModel.Bert, "sentence-transformers/all-MiniLM-L12-v2", revision="main"
# )
clip_model = EmbeddingModel.from_pretrained_hf(
    WhichModel.Clip, "openai/clip-vit-base-patch16", revision="main"
)
embed_config = TextEmbedConfig(chunk_size=512, batch_size=32)
data = embed_anything.embed_image_directory(
    "test_files",
    embeder=clip_model,
    adapter=pinecone_adapter,
    # config=embed_config,
)
```