# Update README.md for better readability, formatting and grammar #41

**Open** · wants to merge 1 commit into base: `develop`
`README.md`: 32 changes (18 additions, 14 deletions)
@@ -10,7 +10,7 @@

## 💻 Links: [DeepFloyd.AI](https://deepfloyd.ai) | [Discord](https://discord.gg/umz62Mgr) | [Twitter](https://twitter.com/deepfloydai)

We introduce DeepFloyd IF, a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. DeepFloyd IF is a modular system composed of a frozen text encoder and three cascaded pixel diffusion modules: a base model that generates a 64x64 px image from a text prompt and two super-resolution models, each designed to generate images of increasing resolution: 256x256 px and 1024x1024 px. All stages of the model utilize a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.

<p align="center">
<img src="./pics/deepfloyd_if_scheme.jpg" width="100%">
@@ -19,10 +19,14 @@
*Inspired by* [*Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding*](https://arxiv.org/pdf/2205.11487.pdf)

## Minimum requirements to use all IF models:
- `xformers` installed and the environment variable `FORCE_MEM_EFFICIENT_ATTN=1` set (see the sketch at the end of this section)

#### The following require 16GB of VRAM
- IF-I-XL (4.3B text to 64x64 base module)
- IF-II-L (1.2B to 256x256 upscaler module)

#### The following require 16GB of VRAM

> **Review suggestion** (on the header above): `#### The following require 16GB of VRAM` → `#### The following require 24GB of VRAM`
>
> **Author:** Whoops, thanks!

- IF-I-XL (4.3B text to 64x64 base module)
- IF-II-L (1.2B to 256x256 upscaler module)
- Stable x4 (to 1024x1024 upscaler)

## Quick Start
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
@@ -46,12 +50,12 @@ The Dream, Style Transfer, Super Resolution or Inpainting modes are available in

IF is also integrated with the 🤗 Hugging Face [Diffusers library](https://github.com/huggingface/diffusers/).

Diffusers runs each stage individually, allowing the user to customize the image generation process and easily inspect intermediate results.

### Example

Before you can use IF, you need to accept its usage conditions. To do so:
1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in
2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0)
3. Make sure to log in locally. Install `huggingface_hub`:
```sh
pip install huggingface_hub
```

@@ -68,15 +72,15 @@ login()
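
The collapsed hunk above ends at a `login()` call; a minimal sketch of that step, using the standard `huggingface_hub` API:

```python
from huggingface_hub import login

# Prompts for a Hugging Face Hub access token and stores it locally.
login()
```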

and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/security-tokens#what-are-user-access-tokens).

Next, we install `diffusers` and other dependencies:

```sh
pip install diffusers accelerate transformers safetensors
```

We can now run the model locally.

By default, `diffusers` makes use of [model cpu offloading](https://huggingface.co/docs/diffusers/optimization/fp16#model-offloading-for-fast-inference-and-memory-savings) to run the whole IF pipeline with as little as 14 GB of VRAM.

If you are using `torch>=2.0.0`, make sure to **delete all** `enable_xformers_memory_efficient_attention()` calls.
@@ -131,7 +135,7 @@ image[0].save("./if_stage_III.png")
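
The full three-stage example is collapsed in this view (it ends at the `image[0].save("./if_stage_III.png")` line above). As a hedged, stage-I-only sketch of what running IF through `diffusers` with CPU offloading looks like; the prompt is just an illustration, and it assumes you have accepted the license and logged in as described above:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil

# Stage I: text -> 64x64. enable_model_cpu_offload() keeps peak VRAM low by
# moving sub-models onto the GPU only while they are actually needed.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()

prompt = "a photo of a red panda reading a book under an oak tree"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

generator = torch.manual_seed(0)
image = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images

# Inspect the intermediate 64x64 result before feeding it to the upscaler stages.
pt_to_pil(image)[0].save("./if_stage_I.png")
```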
- 🚀 [Optimizing for inference time](https://huggingface.co/docs/diffusers/api/pipelines/if#optimizing-for-speed)
- ⚙️ [Optimizing for low memory during inference](https://huggingface.co/docs/diffusers/api/pipelines/if#optimizing-for-memory)
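
As one concrete illustration of the trade-off those two guides cover (these are generic `diffusers`/PyTorch switches, sketched here as an assumption rather than settings this README prescribes):

```python
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)

# Lowest memory: move sub-modules onto the GPU one at a time (slower).
stage_1.enable_sequential_cpu_offload()

# Or favor speed instead: keep the pipeline on the GPU and compile the UNet (torch>=2.0).
# stage_1.to("cuda")
# stage_1.unet = torch.compile(stage_1.unet)
```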

For more detailed information about how to use IF, please have a look at the [IF blog post](https://huggingface.co/blog/if) and [the documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/if) 📖.

## Run the code locally

@@ -184,7 +188,8 @@ if_III.show(result['III'], size=14)
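
The model-loading code and the Dream example in this section are collapsed in the diff view (the hunk above ends at `if_III.show(result['III'], size=14)`). A minimal sketch of what the loading step generally looks like; the class names and arguments below are assumptions about the `deepfloyd_if` API rather than something shown on this page:

```python
from deepfloyd_if.modules import IFStageI, IFStageII, StableStageIII
from deepfloyd_if.modules.t5 import T5Embedder

device = "cuda:0"

# The three cascaded stages plus the frozen T5 text encoder; these are the
# `if_I`, `if_II`, `if_III` and `t5` objects the pipeline helpers below expect.
if_I = IFStageI("IF-I-XL-v1.0", device=device)
if_II = IFStageII("IF-II-L-v1.0", device=device)
if_III = StableStageIII("stable-diffusion-x4-upscaler", device=device)
t5 = T5Embedder(device="cpu")  # keeping T5 on the CPU reduces GPU memory pressure
```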

![](./pics/img_to_img_scheme.jpeg)

In Style Transfer mode, the output of your prompt comes out in the style of the `support_pil_img`:

```python
from deepfloyd_if.pipelines import style_transfer
```
@@ -316,14 +321,13 @@ The link to download the weights as well as the model cards will be available so

The code in this repository is released under the bespoke license (see added [point two](https://github.com/deep-floyd/IF/blob/main/LICENSE#L13)).

The weights and licenses will be available soon via the [DeepFloyd organization](https://huggingface.co/DeepFloyd) on Hugging Face.

**Disclaimer:** *The initial release of the IF model is under a restricted research-purposes-only license temporarily to gather feedback, and after that we intend to release a fully open-source model in line with other Stability AI models.*

## Limitations and Biases

The models available in this codebase have known limitations and biases. Please refer to the [model card](https://huggingface.co/DeepFloyd/IF-I-L-v1.0) for more information.

## 🎓 DeepFloyd IF creators:
