SMIT is a versatile tool designed to streamline the integration of audio modality into your LLMs. Currently, SMIT exclusively supports audio as a new modality. However, our goal is to expand its capabilities to accommodate any new modality seamlessly. We welcome contributions from the open-source community to help us achieve this aim.
Welcome to SMIT! Follow these simple steps to get started:
Begin by cloning the SMIT repository to your local machine using Git:
git clone https://github.com/Thytu/SMIT/
cd SMIT
We highly recommend using a virtual environment to manage dependencies and prevent conflicts. Create and activate a virtual environment using your preferred tool (e.g., virtualenv, conda):
# Example using virtualenv
virtualenv venv
source venv/bin/activate
Once inside the project directory and your virtual environment is activated, install the required dependencies listed in requirements.txt using pip:
pip install -r requirements.txt
You can quickly run the default example provided in SMIT by executing the following command:
python src/main.py
This will train the amazing abacaj/phi-2-super model to perform ASR on the librispeech_asr dataset, using facebook/hubert-large-ls960-ft as the speech encoder, reproducing the Thytu/phi-2-audio-super model.
Important
Make sure you have at least 30GB of available VRAM to execute this command successfully. For users with >=80GB of VRAM, it's recommended to deactivate quantization and decrease the batch size to expedite training. You can achieve this by running:
python src/main.py ~model.decoder.quantization_config ++training.training_args.per_device_train_batch_size=1
To customize your own Large Language Model (LLM), create a configuration file. You can use the provided config file template as a starting point. Then, use Hydra syntax to provide your configuration file:
python src/main.py model=my_config
Hydra offers extensive options for parameter overriding, allowing you to tailor the model according to your specific requirements. Refer to Hydra documentation for more details on customization options.
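As a rough illustration of what such a configuration file might contain, here is a hypothetical sketch. The key paths `decoder.quantization_config` and `training.training_args.per_device_train_batch_size` mirror the overrides shown above; every other field name and value is an assumption, so refer to the actual config template in the repository:

```yaml
# Hypothetical model config sketch (e.g. placed alongside the provided template).
# Only the key paths echoed by the CLI overrides above are grounded; the rest
# (field names, values) are illustrative assumptions.
decoder:
  name: abacaj/phi-2-super            # assumed field: HF id of the base LLM
  quantization_config:                # removable via `~model.decoder.quantization_config`
    load_in_4bit: true
encoder:
  name: facebook/hubert-large-ls960-ft  # assumed field: speech encoder
```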
Once your model is trained, you can effortlessly load it for inference:
model = SMIT.from_pretrained("path_to_your_safetensor")
For inference tasks, you can use the `generate` method:
model.generate("Tell me how to add a modality to my model")
To employ the `generate` method with multiple modalities, follow this approach:
model.generate(
    prompt=[
        "Tell me how to add a modality to my model",
        "Transcribe this audio from speech to text {audio}",
    ],
    raw_speech=[None, your_audio],
)
Note
When providing multiple prompts, ensure that the length of `raw_speech` matches the length of `prompt`.
SMIT simplifies the process of enhancing your LLM with audio capabilities, following the principles outlined in this paper. By linking a speech encoder to a decoder through a trainable linear projector, SMIT adds the audio modality to your LLM. It automates the integration process, making it as easy as configuring a single file.
To use SMIT, simply define your desired configuration in the provided config file; SMIT will then handle the rest, seamlessly incorporating the audio modality into your models.
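To make the projector idea concrete, here is a minimal NumPy sketch (not SMIT's actual code): the trainable linear projector maps speech-encoder hidden states into the decoder's embedding space, so projected audio frames can be concatenated with text token embeddings and fed to the LLM as one sequence. The dimensions below are assumptions chosen for illustration (1024 matches HuBERT-large's hidden size, 2560 phi-2's):

```python
import numpy as np

# Assumed dimensions for illustration: speech-encoder hidden size -> LLM embedding size.
ENC_DIM, DEC_DIM = 1024, 2560

rng = np.random.default_rng(0)

# The trainable projector is just a linear layer: a weight matrix and a bias.
W = rng.normal(scale=0.02, size=(ENC_DIM, DEC_DIM))
b = np.zeros(DEC_DIM)

def project(speech_features: np.ndarray) -> np.ndarray:
    """Map speech-encoder frames of shape (T, ENC_DIM) into the LLM embedding space (T, DEC_DIM)."""
    return speech_features @ W + b

# 50 frames of speech-encoder output...
speech_features = rng.normal(size=(50, ENC_DIM))
audio_embeds = project(speech_features)

# ...concatenated with 10 text token embeddings into a single input sequence.
text_embeds = rng.normal(size=(10, DEC_DIM))
inputs_embeds = np.concatenate([audio_embeds, text_embeds], axis=0)
print(inputs_embeds.shape)  # (60, 2560)
```

During training, only `W` and `b` would be updated while the encoder and decoder stay frozen (or are fine-tuned, depending on your configuration).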
There are multiple ways to contribute to this project, either regarding the UX (e.g. improving the docs or making the example faster) or regarding the core product itself (e.g. handling the vision modality). Any contributions you make are greatly appreciated; if you have a suggestion that would make this better, feel free to tell me :D You can also check the open issues for more things to improve.
Don't forget to give the project a star! 🌟 Thanks again!
This project draws significant inspiration from the An Embarrassingly Simple Approach for LLM with Strong ASR Capacity paper. I thank the authors for sharing their expertise. Huge thanks to the CoolKids for their help in debugging some pesky issues I ran into. And last but definitely not the least, a massive thank you to Oursin – this project simply wouldn't exist without you!
Hey, I'm Valentin De Matos, passionate about AI and always working on some new side project.
You can reach me at [email protected] and if you want more information you can always