Official Repo of M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

(👉 Under construction! The key code has been uploaded, but the current version still contains some redundancies and the commands/instructions are not yet ready for a formal release. We will update the repository gradually; please stay tuned.)

This repository contains the official PyTorch implementation of M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning. Our work builds on LLaVA, and we thank its authors for their great work.

Figure 1: Overview of our M2PT approach. Visual prompts are embedded into each layer of the visual encoder, and textual prompts are embedded into each layer of the LLM. These prompts facilitate the extraction and alignment of features across modalities (e.g., vision, language). Cross-modal interaction between visual and textual features is strengthened through this layered integration, ultimately improving the model's capability on zero-shot instruction learning tasks.

Install

  1. Clone this repository and navigate to the M2PT folder
git clone git@github.com:William-wAng618/M2PT.git
cd M2PT
  2. Install the package
conda create -n M2PT python=3.10 -y
conda activate M2PT
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Stage-one LLaVA_align Weights

The weights for the stage-1 alignment are liuhaotian/llava-pretrain-vicuna-7b-v1.3 and lmsys/vicuna-7b-v1.3; please download both for M2PT (see the example below).
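One way to fetch both checkpoints is with the Hugging Face CLI; the local directories below are placeholders rather than paths the repo requires, so adjust them to your setup.

pip install -U "huggingface_hub[cli]"
# Placeholder target directories -- point them wherever you keep checkpoints.
huggingface-cli download liuhaotian/llava-pretrain-vicuna-7b-v1.3 --local-dir ./checkpoints/llava-pretrain-vicuna-7b-v1.3
huggingface-cli download lmsys/vicuna-7b-v1.3 --local-dir ./checkpoints/vicuna-7b-v1.3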

M2PT-emnlp2024

  1. Prepare data.

Please download the annotations of the Vision-Flan 191k dataset and place them in playground (an example command follows the directory tree below).

├── M2PT
│   └── playground
│       └── Vision-Flan (unzip here)
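For example, assuming the annotations arrive as a zip archive (the file name below is a placeholder; use the actual name from the Vision-Flan release):

# The archive name is a placeholder -- substitute the real Vision-Flan file.
mkdir -p playground
unzip vision-flan_191k_annotations.zip -d playground/Vision-Flan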
  2. Start training.

There are several parameters to note in scripts/PT_full_schedule.sh (an example of setting them follows the run command below):

  • --PT_len_llm: the number of textual prompts added to the LLM.
  • --PT_len_vision_encoder: the number of visual prompts added to the vision encoder.

Then run:

bash scripts/PT_full_schedule.sh
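For instance, assuming the two flags already appear in the script with numeric arguments, you could set both prompt lengths and launch training as follows; the values 10/10 are placeholders, not the settings used in the paper.

# Placeholder values (10/10) -- assumes the flags already appear in the
# script with numeric arguments; otherwise edit the script by hand.
sed -i 's/--PT_len_llm [0-9]*/--PT_len_llm 10/' scripts/PT_full_schedule.sh
sed -i 's/--PT_len_vision_encoder [0-9]*/--PT_len_vision_encoder 10/' scripts/PT_full_schedule.sh
bash scripts/PT_full_schedule.sh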
  3. Evaluation. Please use:
./M2PT/eval/model_vqa_loader_PT_mme.py
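A hypothetical invocation for the MME benchmark is sketched below; the flag names follow the upstream LLaVA evaluation scripts and all paths are placeholders, so check the script's own argument parser before running.

# Hypothetical example: flag names mirror the upstream LLaVA eval scripts
# and may differ here; every path is a placeholder.
python M2PT/eval/model_vqa_loader_PT_mme.py \
    --model-path ./checkpoints/M2PT-7b \
    --question-file ./playground/data/eval/MME/llava_mme.jsonl \
    --image-folder ./playground/data/eval/MME/MME_Benchmark_release_version \
    --answers-file ./playground/data/eval/MME/answers/M2PT-7b.jsonl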
