Official Repo of M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning
(👉 Under construction! The key code has been uploaded, but the current version still contains some redundancies and the commands/instructions are not yet ready for a formal release. I will update it gradually; please stay tuned.)
This repository contains the official PyTorch implementation of M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning. Our work builds on LLaVA, and we thank its authors for their great work.
Figure 1: Overview of our M2PT approach. Here, visual prompts are embedded into each layer of the Visual Encoder, and textual prompts are embedded into each layer of the LLM. These prompts facilitate the extraction and alignment of features across modalities (e.g., vision, language). The cross-modality interaction between visual and textual features is enhanced through layered integration, ultimately improving the model's capability in zero-shot instruction learning tasks.
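As a rough illustration of the layer-wise prompting described above, here is a minimal PyTorch sketch of deep prompt tuning: learnable prompt tokens are prepended to the input of every (frozen) transformer layer and stripped again before the next layer. The class and variable names are invented for illustration; this is not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class PromptedEncoderLayer(nn.Module):
    """Wraps a transformer layer and prepends learnable prompt tokens to its input."""

    def __init__(self, layer: nn.Module, prompt_len: int, hidden_dim: int):
        super().__init__()
        self.layer = layer
        # One set of learnable prompts per layer (deep prompt tuning).
        self.prompts = nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz = hidden_states.size(0)
        prompts = self.prompts.unsqueeze(0).expand(bsz, -1, -1)
        # Prepend prompts, run the (frozen) layer, then drop the prompt positions
        # so the sequence length seen by the next layer is unchanged.
        out = self.layer(torch.cat([prompts, hidden_states], dim=1))
        return out[:, self.prompts.size(0):, :]


# Toy usage with a small stack of vanilla transformer encoder layers.
hidden_dim, prompt_len = 64, 4
base_layers = [nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True) for _ in range(2)]
model = nn.Sequential(*[PromptedEncoderLayer(l, prompt_len, hidden_dim) for l in base_layers])

for p in model.parameters():       # freeze everything...
    p.requires_grad_(False)
for m in model:                    # ...except the prompt tokens
    m.prompts.requires_grad_(True)

x = torch.randn(2, 16, hidden_dim)  # (batch, seq_len, hidden)
print(model(x).shape)               # torch.Size([2, 16, 64])
```

In M2PT this idea is applied to both towers: visual prompts in the vision encoder layers and textual prompts in the LLM layers, with only the prompts being tuned.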
- Clone this repository and navigate to the M2PT folder
git clone git@github.com:William-wAng618/M2PT.git
cd M2PT
- Install Package
conda create -n M2PT python=3.10 -y
conda activate M2PT
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
The weights for the stage-1 alignment are liuhaotian/llava-pretrain-vicuna-7b-v1.3 and lmsys/vicuna-7b-v1.3; please download them for M2PT.
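One hedged way to fetch both checkpoints is via huggingface_hub; the local directories below are arbitrary placeholders, so point them wherever you keep checkpoints.

```python
from huggingface_hub import snapshot_download

# Download the stage-1 alignment weights and the Vicuna-7B v1.3 base LLM.
# The target directories are placeholders, not paths expected by the repo.
snapshot_download(repo_id="liuhaotian/llava-pretrain-vicuna-7b-v1.3",
                  local_dir="checkpoints/llava-pretrain-vicuna-7b-v1.3")
snapshot_download(repo_id="lmsys/vicuna-7b-v1.3",
                  local_dir="checkpoints/vicuna-7b-v1.3")
```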
- Prepare data.
Please download the annotations of the Vision-Flan 191k data and place them in `playground` (a quick sanity-check snippet follows the directory tree below).
├── M2PT
│   └── playground
│       └── Vision-Flan (unzip here)
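Optionally, you can sanity-check that the annotations landed where the training script can find them. The JSON filename below is a placeholder; substitute the actual annotation file from the Vision-Flan release.

```python
import json
from pathlib import Path

# "annotations.json" is a placeholder name; replace it with the actual
# annotation file shipped with the Vision-Flan 191k download.
ann_path = Path("playground/Vision-Flan/annotations.json")
assert ann_path.exists(), f"annotation file not found at {ann_path}"

with ann_path.open() as f:
    data = json.load(f)
print(f"loaded {len(data)} instruction examples")
```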
- Start training.
There are several parameters to note in `scripts/PT_full_schedule.sh` (a rough parameter-count sketch follows this list):
- `--PT_len_llm`: the number of textual prompts added to the LLM.
- `--PT_len_vision_encoder`: the number of visual prompts added to the vision encoder.
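A back-of-the-envelope sketch of how many trainable parameters these two flags introduce, assuming one prompt set per layer (as in the figure above) and the usual vicuna-7b-v1.3 (hidden 4096, 32 layers) and CLIP ViT-L vision tower (hidden 1024, 24 layers) dimensions; these sizes are assumptions, not values read from the repository.

```python
# Rough count of the extra trainable parameters added by the prompt flags.
# Dimensions assume vicuna-7b-v1.3 (4096 x 32) and a CLIP ViT-L vision tower
# (1024 x 24); adjust if your backbone differs.
PT_len_llm = 10
PT_len_vision_encoder = 10

textual = PT_len_llm * 4096 * 32            # prompts in every LLM layer
visual = PT_len_vision_encoder * 1024 * 24  # prompts in every vision-encoder layer
print(f"~{(textual + visual) / 1e6:.2f}M extra trainable parameters")
```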
Then run:
bash scripts/PT_full_schedule.sh
- Evaluation. Please use:
./M2PT/eval/model_vqa_loader_PT_mme.py