MaskBit: Embedding-free Image Generation via Bit Tokens

This repository contains an implementation of the paper "MaskBit: Embedding-free Image Generation via Bit Tokens" accepted to TMLR with featured and reproducibility certifications.

We present a modernized VQGAN+ and a novel image generation framework leveraging bit tokens. As a result, MaskBit uses a shared representation in the tokenizer and generator, which yields state-of-the-art results (at time of publication) while using a significantly smaller model compared to autoregressive models.

🚀 Contributions

We study the key ingredients of recent closed-source VQGAN tokenizers and develop a publicly available, reproducible, and high-performing VQGAN model, called VQGAN+, achieving a significant improvement of 6.28 rFID over the original VQGAN developed three years ago.

Building on our improved tokenizer framework, we leverage modern Lookup-Free Quantization (LFQ). We analyze the latent representation and observe that embedding-free bit token representation exhibits highly structured semantics.

Motivated by these discoveries, we develop a novel embedding-free generation framework, MaskBit, which builds on top of the bit tokens and achieves state-of-the-art performance on the ImageNet 256×256 class-conditional image generation benchmark.
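To make the idea of bit tokens more concrete, here is a minimal sketch of sign-based lookup-free quantization as it is commonly described: each latent channel is binarized by its sign, and the resulting K bits can optionally be packed into an integer index. The function names `quantize_to_bits` and `bits_to_index`, the example latent vector, and the bit ordering are assumptions made for illustration; this is not the implementation shipped in this repository.

```python
from typing import List


def quantize_to_bits(latent: List[float]) -> List[int]:
    """Binarize each latent channel by its sign: positive -> +1, otherwise -> -1."""
    return [1 if value > 0 else -1 for value in latent]


def bits_to_index(bits: List[int]) -> int:
    """Pack K bits in {-1, +1} into an integer in [0, 2**K), lowest bit first (assumed order)."""
    return sum((1 << i) for i, bit in enumerate(bits) if bit > 0)


# Hypothetical 4-channel latent vector for a single spatial position.
latent = [0.7, -1.2, 0.1, -0.3]
bits = quantize_to_bits(latent)
index = bits_to_index(bits)
print(bits, index)
```

No codebook lookup is involved in this sketch: the bits themselves serve as the token representation, which is what an embedding-free generator can consume directly.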

Updates

12/06/2024: Code release and tokenizer models.
12/01/2024: Accepted to TMLR with featured and reproducibility certifications.
09/24/2024: The tech report of MaskBit is available.

Model Zoo

All models are trained on ImageNet with an input shape of 256x256. All models downsample the images to a spatial size of 16x16, leading to a latent representation of 16x16xK bits per image.
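As a quick sanity check of these shapes, the following sketch computes the downsampling factor, the number of tokens, the total number of bits per image, and the implied vocabulary size for a given K. The helper name `latent_stats` is hypothetical, and the vocabulary size of 2**K assumes that each K-bit token is counted as one entry.

```python
def latent_stats(image_size: int = 256, latent_size: int = 16, bits_per_token: int = 10):
    """Return (downsample factor, tokens per image, bits per image, vocabulary size)."""
    downsample_factor = image_size // latent_size
    num_tokens = latent_size * latent_size
    total_bits = num_tokens * bits_per_token
    vocabulary_size = 2 ** bits_per_token
    return downsample_factor, num_tokens, total_bits, vocabulary_size


# Bit widths listed in the table below.
for k in (10, 12, 14, 16, 18):
    print(k, latent_stats(bits_per_token=k))
```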

Model	Link	reconstruction FID	config
VQGAN+ (10 bits, from the paper)	checkpoint	1.67	config
VQGAN+ (10 bits)	TODO	1.52
VQGAN+ (12 bits)	checkpoint	1.39	config
-------------	-------------	-------------	-------------
MaskBit-Tokenizer (10 bits)	checkpoint	1.76	config
MaskBit-Tokenizer (12 bits)	checkpoint	1.52	config
MaskBit-Tokenizer (14 bits)	checkpoint	1.37	config
MaskBit-Tokenizer (16 bits)	checkpoint	1.29	config
MaskBit-Tokenizer (18* bits)	checkpoint	1.16	config
-------------	-------------	-------------	-------------
Taming-VQGAN (10 bits)	checkpoint	7.96	config
MaskGIT-Tokenizer (10 bits)	checkpoint	1.96	config

*In practice, only 17 bits are used, as one bit never changes. We did not put any effort into fixing "dead" bits, as such a large vocabulary was not needed for ImageNet.

Since the initial release of the paper, we have made some small changes to the training recipe to improve the reconstruction quality of the tokenizer.

Please note that these models are trained only on the limited academic dataset ImageNet and are intended for research purposes only. We will release the Stage-II models soon.
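The effect of a constant bit on the reachable vocabulary can be worked out directly. The sketch below assumes that codebook usage is measured as the fraction of the 2**K possible codes that can occur; the helper name `max_codebook_usage` is hypothetical.

```python
def max_codebook_usage(total_bits: int, constant_bits: int) -> float:
    """Upper bound on the fraction of codes reachable when some bits never change."""
    reachable_codes = 2 ** (total_bits - constant_bits)
    return reachable_codes / 2 ** total_bits


# 18-bit tokenizer with one bit that never changes: 2**17 of 2**18 codes are reachable.
print(max_codebook_usage(18, 1))  # 0.5
```

Under that assumption, a single dead bit caps the usage at one half, which would be consistent with the codebook usage of 0.5 listed for the 18-bit model under Detailed Results.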

Installation

The codebase was tested with Python 3.9 and PyTorch 2.2.2. After setting up PyTorch, you can use the following command to install the additional requirements.

pip3 install -r requirements.txt

Training

Tokenizer (Stage-I)

Please first follow the install guide and the data preparation doc.

We use the accelerate library for multi-device training, which means the following command needs to be started on each worker:

PYTHONPATH=./ WORKSPACE=./ accelerate launch --num_machines=1  --machine_rank=0 --main_process_ip=127.0.0.1 --main_process_port=9999 --same_network scripts/train_tokenizer.py config=./configs/tokenizer/maskbit_tokenizer_10bit.yaml

For more instructions on how to use the accelerate library, we refer to their website. Moreover, run-specific config changes can be made by passing them on the command line. For example, training.per_gpu_batch_size=32 would use a batch size of 32 for this run.
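The dotted `key=value` syntax can be read as a path into the nested config. The sketch below illustrates that semantics on a plain dictionary; it is not how the training scripts parse their arguments, and the helper names `parse_value` and `apply_overrides` as well as the default batch size of 64 are hypothetical.

```python
import ast
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret a value as a Python literal (int, float, ...) when possible, else keep the string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply dotted `key=value` overrides in place to a nested dict and return it."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return config


# Hypothetical default; the real values live in the YAML file passed via `config=`.
config = {"training": {"per_gpu_batch_size": 64}}
apply_overrides(config, ["training.per_gpu_batch_size=32"])
print(config)
```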

Generator (Stage-II)

Please first follow the install guide and the data preparation doc.

We use the accelerate library for multi-device training, which means the following command needs to be started on each worker:

PYTHONPATH=./ WORKSPACE=./ accelerate launch --num_machines=1  --machine_rank=0 --main_process_ip=127.0.0.1 --main_process_port=9999 --same_network scripts/train_maskbit.py config=./configs/generator/maskbit_generator_10bit.yaml

For more instructions on how to use the accelerate library, we refer to their website. Moreover, run-specific config changes can be made by passing them on the command line. For example, training.per_gpu_batch_size=32 would use a batch size of 32 for this run.
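Since the launch command has to be started on each worker, it can help to generate the per-worker command lines in one place. The sketch below assumes that only `--num_machines`, `--machine_rank`, and `--main_process_ip` differ from the single-machine command above; the helper name `build_launch_commands` and the IP address are hypothetical.

```python
from typing import List


def build_launch_commands(script: str, config: str, num_machines: int,
                          main_process_ip: str, main_process_port: int = 9999) -> List[str]:
    """Return one `accelerate launch` command line per machine, indexed by machine rank."""
    commands = []
    for rank in range(num_machines):
        commands.append(
            "PYTHONPATH=./ WORKSPACE=./ accelerate launch "
            f"--num_machines={num_machines} --machine_rank={rank} "
            f"--main_process_ip={main_process_ip} --main_process_port={main_process_port} "
            f"--same_network {script} config={config}"
        )
    return commands


for command in build_launch_commands(
    script="scripts/train_maskbit.py",
    config="./configs/generator/maskbit_generator_10bit.yaml",
    num_machines=2,
    main_process_ip="10.0.0.1",  # hypothetical address of the rank-0 machine
):
    print(command)
```

Each printed line is meant to be run on the machine whose rank it names; the same helper works for the tokenizer by passing its script and config instead.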

We will release checkpoints and configs for the generator soon.

Testing on ImageNet-1K Benchmark

Tokenizer (Stage-I)

Please first follow the install guide and the data preparation doc.

After choosing the model config and checkpoint, the following command will run the evaluation:

PYTHONPATH=./ python3 scripts/eval_tokenizer.py config=./configs/tokenizer/maskbit_tokenizer_12bit.yaml experiment.vqgan_checkpoint=/PATH_TO_MODEL/maskbit_tokenizer_12bit.bin

Generator (Stage-II)

Coming soon.

Detailed Results

Model	reconstruction FID	Inception Score	PSNR	SSIM	Codebook Usage
VQGAN+ (10 bits, from the paper)	1.67	186.5	20.9	0.53	1.0
VQGAN+ (10 bits)	1.52	182.4	21.1	0.54	1.0
VQGAN+ (12 bits)	1.39	193.9	21.0	0.55	1.0
-------------	-------	-------	-------	-------	-------
MaskBit-Tokenizer (10 bits)	1.76	177.6	20.8	0.53	1.0
MaskBit-Tokenizer (12 bits)	1.52	184.3	21.2	0.55	1.0
MaskBit-Tokenizer (14 bits)	1.37	190.3	21.5	0.56	1.0
MaskBit-Tokenizer (16 bits)	1.29	193.6	21.8	0.58	1.0
MaskBit-Tokenizer (18* bits)	1.16	197.8	22.0	0.59	0.5
-------------	-------	-------	-------	-------	-------
Taming-VQGAN (10 bits)	7.96	115.9	20.18	0.52	1.0
MaskGIT-Tokenizer (10 bits)	1.96	178.3	18.6	0.47	0.45
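For readers unfamiliar with the PSNR column, the following sketch shows the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of pixel values. It assumes 8-bit pixel values (a maximum of 255); the evaluation code in this repository may differ in value range or averaging, and the function name `psnr` is chosen here for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values, for illustration only.
print(psnr([10, 20, 30, 40], [12, 18, 33, 40]))
```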

Citing

If you use our work in your research, please use the following BibTeX entry.

@article{weber2024maskbit,
  author    = {Mark Weber and Lijun Yu and Qihang Yu and Xueqing Deng and Xiaohui Shen and Daniel Cremers and Liang-Chieh Chen},
  title     = {MaskBit: Embedding-free Image Generation via Bit Tokens},
  journal   = {arXiv:2409.16211},
  year      = {2024}
}
