🧩 TokenCompose: Text-to-Image Diffusion with Token-level Supervision

Zirui Wang^{1, 3} · Zhizhou Sha^{2, 3} · Zheng Ding³ · Yilin Wang^{2, 3} · Zhuowen Tu³

¹Princeton University · ²Tsinghua University · ³University of California, San Diego

CVPR 2024

Project done while Zirui Wang, Zhizhou Sha and Yilin Wang interned at UC San Diego.

Project Page | arXiv | X (Twitter)

Updates

If you use our method or model in your research project, we are happy to add a cross-reference here in the updates. :)

[04/04/2024] 🔥 Our training methodology is incorporated into CoMat, which shows enhanced text-to-image attribute assignments.
[02/26/2024] 🔥 TokenCompose is accepted to CVPR 2024!
[02/20/2024] 🔥 TokenCompose is used as a base model in the RealCompo paper for enhanced compositionality.

A Stable Diffusion model finetuned with token-level consistency terms for enhanced multi-category instance composition and photorealism.

Method	Multi-category Instance Composition									Photorealism		Efficiency
	Object Accuracy	COCO MG2	COCO MG3	COCO MG4	COCO MG5	ADE20K MG2	ADE20K MG3	ADE20K MG4	ADE20K MG5	FID (COCO)	FID (Flickr30K)	Latency
SD 1.4	29.86	90.72_1.33	50.74_0.89	11.68_0.45	0.88_0.21	89.81_0.40	53.96_1.14	16.52_1.13	1.89_0.34	20.88	71.46	7.54_0.17
Composable	27.83	63.33_0.59	21.87_1.01	3.25_0.45	0.23_0.18	69.61_0.99	29.96_0.84	6.89_0.38	0.73_0.22	-	75.57	13.81_0.15
Layout	43.59	93.22_0.69	60.15_1.58	19.49_0.88	2.27_0.44	96.05_0.34	67.83_0.90	21.93_1.34	2.35_0.41	-	74.00	18.89_0.20
Structured	29.64	90.40_1.06	48.64_1.32	10.71_0.92	0.68_0.25	89.25_0.72	53.05_1.20	15.76_0.86	1.74_0.49	21.13	71.68	7.74_0.17
Attn-Exct	45.13	93.64_0.76	65.10_1.24	28.01_0.90	6.01_0.61	91.74_0.49	62.51_0.94	26.12_0.78	5.89_0.40	-	71.68	25.43_4.89
TokenCompose (Ours)	52.15	98.08_0.40	76.16_1.04	28.81_0.95	3.28_0.48	97.75_0.34	76.93_1.09	33.92_1.47	6.21_0.62	20.19	71.13	7.56_0.14

🆕 Models

Stable Diffusion Version	Checkpoint 1	Checkpoint 2
v1.4	TokenCompose_SD14_A	TokenCompose_SD14_B
v2.1	TokenCompose_SD21_A	TokenCompose_SD21_B

Our finetuned models do not contain any extra modules and can be used directly in a standard diffusion model library (e.g., Hugging Face's Diffusers) by replacing the pretrained U-Net with our finetuned U-Net in a plug-and-play manner. We provide a demo Jupyter notebook that uses our model checkpoint to generate images.

You can also use the following code to download our checkpoints and generate images:

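import torch
from diffusers import StableDiffusionPipeline

# TokenCompose checkpoint to download (see the Models table above)
model_id = "mlpc-lab/TokenCompose_SD14_A"
device = "cuda"

# Load the pipeline with the finetuned U-Net and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
pipe = pipe.to(device)

# Generate one image from a multi-category prompt
prompt = "A cat and a wine glass"
image = pipe(prompt).images[0]

image.save("cat_and_wine_glass.png")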
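
The plug-and-play U-Net swap described above can be sketched as follows. This is a minimal sketch rather than the authors' code. It assumes that the checkpoint repository stores its U-Net weights in a `unet` subfolder (the usual Diffusers layout), that `CompVis/stable-diffusion-v1-4` is the matching base model for the v1.4 checkpoints, and that `TokenCompose_SD14_B` lives under the same `mlpc-lab` namespace as checkpoint A. The helper names `checkpoint_id` and `load_pipeline_with_tokencompose_unet` are hypothetical.

```python
BASE_MODEL_SD14 = "CompVis/stable-diffusion-v1-4"  # assumed base model for the v1.4 checkpoints


def checkpoint_id(variant: str = "A") -> str:
    """Return the Hub id of a TokenCompose SD 1.4 checkpoint (namespace assumed for variant B)."""
    if variant not in ("A", "B"):
        raise ValueError(f"unknown checkpoint variant: {variant!r}")
    return f"mlpc-lab/TokenCompose_SD14_{variant}"


def load_pipeline_with_tokencompose_unet(variant: str = "A", device: str = "cuda"):
    """Load the base Stable Diffusion pipeline, replacing only its U-Net."""
    # Imported lazily so checkpoint_id() works without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusionPipeline, UNet2DConditionModel

    # Assumption: the finetuned U-Net sits in the "unet" subfolder of the checkpoint repo.
    unet = UNet2DConditionModel.from_pretrained(
        checkpoint_id(variant), subfolder="unet", torch_dtype=torch.float32
    )
    # The VAE, text encoder, and scheduler all come from the base model.
    pipe = StableDiffusionPipeline.from_pretrained(
        BASE_MODEL_SD14, unet=unet, torch_dtype=torch.float32
    )
    return pipe.to(device)
```

Calling `load_pipeline_with_tokencompose_unet("A")` should return a pipeline that can be prompted in the same way as `pipe` in the snippet above.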

📊 MultiGen

See MultiGen for details.

Method	COCO MG2	COCO MG3	COCO MG4	COCO MG5	ADE20K MG2	ADE20K MG3	ADE20K MG4	ADE20K MG5
SD 1.4	90.72_1.33	50.74_0.89	11.68_0.45	0.88_0.21	89.81_0.40	53.96_1.14	16.52_1.13	1.89_0.34
Composable	63.33_0.59	21.87_1.01	3.25_0.45	0.23_0.18	69.61_0.99	29.96_0.84	6.89_0.38	0.73_0.22
Layout	93.22_0.69	60.15_1.58	19.49_0.88	2.27_0.44	96.05_0.34	67.83_0.90	21.93_1.34	2.35_0.41
Structured	90.40_1.06	48.64_1.32	10.71_0.92	0.68_0.25	89.25_0.72	53.05_1.20	15.76_0.86	1.74_0.49
Attn-Exct	93.64_0.76	65.10_1.24	28.01_0.90	6.01_0.61	91.74_0.49	62.51_0.94	26.12_0.78	5.89_0.40
Ours	98.08_0.40	76.16_1.04	28.81_0.95	3.28_0.48	97.75_0.34	76.93_1.09	33.92_1.47	6.21_0.62
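
To compare methods programmatically, the table cells can be parsed into numbers. The sketch below assumes that a cell such as `98.08_0.40` denotes a mean followed by its standard deviation (an interpretation of the rendered subscript, not something stated in this README). `Cell` and `parse_cell` are hypothetical helpers, and the two rows are copied from the COCO columns above.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    mean: float
    std: float


def parse_cell(text: str) -> Cell:
    """Split a 'mean_std' cell; the part after '_' is assumed to be a standard deviation."""
    mean, _, std = text.partition("_")
    return Cell(float(mean), float(std) if std else 0.0)


COLUMNS = ["MG2", "MG3", "MG4", "MG5"]

# COCO rows copied from the table above.
sd14 = [parse_cell(c) for c in "90.72_1.33 50.74_0.89 11.68_0.45 0.88_0.21".split()]
ours = [parse_cell(c) for c in "98.08_0.40 76.16_1.04 28.81_0.95 3.28_0.48".split()]

# Absolute gain of TokenCompose over SD 1.4 in the mean of each column.
gains = {
    name: round(o.mean - b.mean, 2)
    for name, o, b in zip(COLUMNS, ours, sd14)
}
print(gains)
```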

💻 Environment Setup

If you want to use our codebase to train your own diffusion models with token-level objectives, follow the instructions below:

conda create -n TokenCompose python=3.8.5
conda activate TokenCompose
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt

We have verified the environment setup using these specific package versions, but we expect it to work with newer versions as well.
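
A quick way to compare an existing environment against the versions pinned above is sketched below. It is only a convenience check, not part of the repository: `PINNED`, `base_version`, `compare_versions`, and `installed_versions` are hypothetical names, and the script assumes the packages are visible to `importlib.metadata` under the distribution names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata

# Versions pinned in the setup commands above.
PINNED = {"torch": "1.13.1", "torchvision": "0.14.1", "torchaudio": "0.13.1"}


def base_version(version: str) -> str:
    """Strip a local build suffix such as '+cu117' from a version string."""
    return version.split("+", 1)[0]


def compare_versions(installed: dict, pinned: dict = PINNED) -> dict:
    """Map each pinned package to 'ok', 'missing', or 'differs (<installed>)'."""
    report = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found is None:
            report[name] = "missing"
        elif base_version(found) == wanted:
            report[name] = "ok"
        else:
            report[name] = f"differs ({found})"
    return report


def installed_versions(names) -> dict:
    """Look up installed versions; packages that are not installed are skipped."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return versions


# Report the status of the current environment.
for name, status in compare_versions(installed_versions(PINNED)).items():
    print(f"{name}: {status}")
```

A status of `differs` is not necessarily a problem, since newer versions are expected to work; it only flags where the environment departs from the verified setup.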

🛠️ Dataset Setup

If you want to use your own data, please refer to preprocess_data for details.

If you want to use our training data as examples or for research purposes, please follow the instructions below:

1. Set Up the COCO Image Data

cd train/data
# download COCO train2017
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
rm train2017.zip
bash coco_data_setup.sh

After this step, you should have the following structure under the train/data directory:

train/data/
    coco_gsam_img/
        train/
            000000000142.jpg
            000000000370.jpg
            ...

2. Set Up Token-wise Grounded Segmentation Maps

Download the COCO segmentation data from Google Drive and place it under the train/data directory.

After this step, you should have the following structure under the train/data directory:

train/data/
    coco_gsam_img/
        train/
            000000000142.jpg
            000000000370.jpg
            ...
    coco_gsam_seg.tar

Then, run the following commands to extract the segmentation data:

cd train/data
tar -xvf coco_gsam_seg.tar
rm coco_gsam_seg.tar

After the setup, you should have the following structure under the train/data directory:

train/data/
    coco_gsam_img/
        train/
            000000000142.jpg
            000000000370.jpg
            ...
    coco_gsam_seg/
        000000000142/
            mask_000000000142_bananas.png
            mask_000000000142_bread.png
            ...
        000000000370/
            mask_000000000370_bananas.png
            mask_000000000370_bread.png
            ...
        ...
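
The final layout above can be checked with a short script before training. This is a minimal sketch under stated assumptions: mask files are assumed to follow the `mask_<image id>_<token>.png` pattern shown in the tree, and `mask_token` and `check_layout` are hypothetical helpers that are not part of the repository.

```python
from pathlib import Path


def mask_token(mask_path: Path, image_id: str) -> str:
    """Extract the token from a name like 'mask_000000000142_bananas.png'."""
    prefix = f"mask_{image_id}_"
    stem = mask_path.stem
    if not stem.startswith(prefix):
        raise ValueError(f"unexpected mask name: {mask_path.name}")
    return stem[len(prefix):]


def check_layout(data_root) -> dict:
    """Map each image id to the tokens that have a mask.

    Images without a matching folder under coco_gsam_seg/ map to an empty list.
    """
    root = Path(data_root)
    tokens_by_image = {}
    for image in sorted((root / "coco_gsam_img" / "train").glob("*.jpg")):
        image_id = image.stem
        seg_dir = root / "coco_gsam_seg" / image_id
        if seg_dir.is_dir():
            masks = sorted(seg_dir.glob(f"mask_{image_id}_*.png"))
        else:
            masks = []
        tokens_by_image[image_id] = [mask_token(m, image_id) for m in masks]
    return tokens_by_image
```

Running `check_layout("train/data")` from the repository root and listing the ids that map to an empty list shows which images have no segmentation masks.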

📈 Training

We use wandb to log training curves and visualizations. Log in to wandb before running the scripts.

wandb login

Then, to run TokenCompose, use the following command:

cd train
bash train.sh

The results will be saved under the train/results directory.

🏷️ License

This repository is released under the Apache 2.0 license.

🙏 Acknowledgement

Our code is built upon diffusers, prompt-to-prompt, VISOR, Grounded-Segment-Anything, and CLIP. We thank the authors of these projects for open-sourcing their code and for their contributions to the community.

📝 Citation

If you find our work useful, please consider citing:

@InProceedings{Wang2024TokenCompose,
    author    = {Wang, Zirui and Sha, Zhizhou and Ding, Zheng and Wang, Yilin and Tu, Zhuowen},
    title     = {TokenCompose: Text-to-Image Diffusion with Token-level Supervision},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {8553-8564}
}
