VideoTuna


🤗🤗🤗 VideoTuna is an all-in-one codebase for text-to-video applications.
🌟 To the best of our knowledge, VideoTuna is the first repo that integrates multiple AI video generation models, covering text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), and video-to-video (V2V) generation, for both inference and fine-tuning.
🌟 To the best of our knowledge, VideoTuna is the first repo that provides comprehensive video generation pipelines, from pre-training and continuous training to fine-tuning and post-training (alignment).
🌟 An Emotion Control I2V model will be released soon.

Features

🌟 All-in-one framework: Run inference and fine-tuning with up-to-date video generation models.
🌟 Pre-training: Build your own foundational text-to-video model.
🌟 Continuous training: Keep improving your model with new data.
🌟 Domain-specific fine-tuning: Adapt models to your specific scenario.
🌟 Concept-specific fine-tuning: Teach your models with unique concepts.
🌟 Enhanced language understanding: Improve model comprehension through continuous training.
🌟 Post-processing: Enhance generated videos with a video-to-video enhancement model.
🌟 Post-training/Human preference alignment: Post-train with RLHF for more appealing results.

🔆 Updates

  • [2025-02-03] 🐟 We add automatic code formatting via PR#27. Thanks samidarko!
  • [2025-02-01] 🐟 We migrate to Poetry for better dependency management and script automation via PR#25. Thanks samidarko!
  • [2025-01-20] 🐟 We add fine-tuning support for Flux-T2I. Thanks VideoTuna team!
  • [2025-01-01] 🐟 We add training support for VideoVAE+ in this repo. Thanks VideoTuna team!
  • [2025-01-01] 🐟 We add inference support for Hunyuan Video and Mochi. Thanks VideoTuna team!
  • [2024-12-24] 🐟 We release VideoVAE+, a SOTA video VAE model, in this repo! It achieves better video reconstruction than NVIDIA's Cosmos-Tokenizer. Thanks VideoTuna team!
  • [2024-12-01] 🐟 We add inference support for CogVideoX-1.5 T2V&I2V and Video-to-Video Enhancement from ModelScope, as well as fine-tuning support for CogVideoX-1. Thanks VideoTuna team!
  • [2024-11-01] 🐟 We make VideoTuna V0.1.0 public! It supports inference of VideoCrafter1-T2V&I2V, VideoCrafter2-T2V, DynamiCrafter-I2V, OpenSora-T2V, CogVideoX-1-2B-T2V, CogVideoX-1-T2V, and Flux-T2I, as well as training and fine-tuning of some of these models. Thanks VideoTuna team!

Application Demonstration

Model Inference and Comparison

(Demo videos: "A mountain biker racing down a trail, dust flying behind" and "Fireworks exploding over a historic river, reflections twinkling in the water".)

Video VAE+

Video VAE+ can accurately compress and reconstruct the input videos with fine details.

(Video examples: Ground Truth vs. Reconstruction.)

Emotion Control I2V

(Examples: two input images, each animated with the emotions Anger, Disgust, Fear, Happy, Sad, and Surprise.)

Character-Consistent Storytelling Video Generation

The picture shows a cozy room with a little girl telling her travel story to her teddybear beside the bed. As night falls, teddybear sits by the window, his eyes sparkling with longing for the distant place. Teddybear was in a corner of the room, making a small backpack out of old cloth strips, with a map, a compass and dry food next to it. The first rays of sunlight in the morning came through the window, and teddybear quietly opened the door and embarked on his adventure. In the forest, the sun shines through the treetops, and teddybear moves among various animals and communicates with them.
Teddybear leaves his mark on the edge of a clear lake, surrounded by exotic flowers, and the picture is full of mystery and exploration. Teddybear climbs the rugged mountain road, the weather is changeable, but he is determined. The picture switches to the top of the mountain, where teddybear stands in the glow of the sunrise, with a magnificent mountain view in the background. On the way home, teddybear helps a wounded bird, the picture is warm and touching. Teddybear sits by the little girl's bed and tells her his adventure story, and the little girl is fascinated.
The scene shows a peaceful village, with moonlight shining on the roofs and streets, creating a peaceful atmosphere. cat sits by the window, her eyes twinkling in the night, reflecting her special connection with the moon and stars. Villagers gather in the center of the village for the annual Moon Festival celebration, with lanterns and colored lights adorning the night sky. cat feels the call of the moon, and her beard trembles with the excitement in her heart. cat quietly leaves her home in the night and embarks on a path illuminated by the silver moonlight.
A group of forest elves dance around glowing mushrooms, their costumes and movements full of magic and vitality. cat joins the celebration and dances with the elves, the picture is full of joy and freedom. A wise old owl reveals the secret power of the moon to cat and the light of the moon in the picture becomes brighter. cat closes her eyes in the moonlight, puts her hands together, and makes a wish, surrounded by the light of stars and the moon. cat feels the surge of power, and her eyes become more determined.

🔆 Information

Code Structure

VideoTuna/
    ├── assets       # images for the README
    ├── checkpoints  # model checkpoints
    ├── configs      # model and experiment configs
    ├── data         # data processing scripts and dataset files
    ├── docs         # documentation
    ├── eval         # evaluation scripts
    ├── inputs       # input examples for testing
    ├── scripts      # training and inference Python scripts
    ├── shscripts    # training and inference shell scripts
    ├── src          # model-related source code
    ├── tests        # testing scripts
    └── tools        # utility scripts

Supported Models

| T2V Models | HxWxL | Checkpoints |
| --- | --- | --- |
| HunyuanVideo | 720x1280x129 | Hugging Face |
| Mochi | 848x480, 3s | Hugging Face |
| CogVideoX-2B | 480x720x49 | Hugging Face |
| CogVideoX-5B | 480x720x49 | Hugging Face |
| Open-Sora 1.0 | 512x512x16 | Hugging Face |
| Open-Sora 1.0 | 256x256x16 | Hugging Face |
| Open-Sora 1.0 | 256x256x16 | Hugging Face |
| VideoCrafter2 | 320x512x16 | Hugging Face |
| VideoCrafter1 | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |

| I2V Models | HxWxL | Checkpoints |
| --- | --- | --- |
| CogVideoX-5B-I2V | 480x720x49 | Hugging Face |
| DynamiCrafter | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |

  • Note: H: height; W: width; L: length (number of frames)

Please check docs/CHECKPOINTS.md to download all the model checkpoints.

🔆 Get started

1. Prepare environment

conda create -n videotuna python=3.10 -y
conda activate videotuna
pip install poetry
poetry install
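
After installation, a quick way to confirm the environment is usable is to check that PyTorch can see your GPU. The snippet below is a minimal sanity-check sketch, not a script shipped with the repo; it only assumes PyTorch is installed as a dependency.

# Minimal sanity check (illustrative; run inside the activated "videotuna" env).
import torch

print("torch version:", torch.__version__)      # installed PyTorch version
print("CUDA available:", torch.cuda.is_available())  # True if a CUDA GPU is visible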

Flash-attn installation (Optional)

The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model will run in normal mode.

poetry run install-flash-attn 
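
To verify whether flash-attn is available, you can try importing it; this is a hypothetical check rather than a repo script. If the import fails, the Hunyuan model simply falls back to its normal attention path.

# Hypothetical check that flash-attn is importable; not part of the repo.
try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; Hunyuan will run in normal mode")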

2. Prepare checkpoints

Please follow docs/CHECKPOINTS.md to download model checkpoints.
After downloading, the model checkpoints should be arranged as shown in the Checkpoint Structure.

3. Run inference with state-of-the-art T2V/I2V/T2I models

  • Run inference with a set of text-to-video models in one command: bash tools/video_comparison/compare.sh
    • The default mode runs all models, e.g., inference_methods="videocrafter2;dynamicrafter;cogvideo-t2v;cogvideo-i2v;opensora"
    • To run inference with specific models only, modify the inference_methods variable in compare.sh and list the desired models separated by semicolons.
    • Also specify the input directory via the input_dir variable. This directory should contain a prompts.txt file, where each line is a prompt for video generation (see the example input layout at the end of this section). The default input_dir is inputs/t2v.
  • Run inference with a set of image-to-video models in one command: bash tools/video_comparison/compare_i2v.sh
  • To run inference with a single model, use the corresponding command from the table below:
| Task | Model | Command | Length (#frames) | Resolution | Inference Time (s) | GPU Memory (GiB) |
| --- | --- | --- | --- | --- | --- | --- |
| T2V | HunyuanVideo | poetry run inference-hunyuan | 129 | 720x1280 | 1920 | 59.15 |
| T2V | Mochi | poetry run inference-mochi | 84 | 480x848 | 109.0 | 26 |
| I2V | CogVideoX-5b-I2V | poetry run inference-cogvideox-15-5b-i2v | 49 | 480x720 | 310.4 | 4.78 |
| T2V | CogVideoX-2b | poetry run inference-cogvideo-t2v-diffusers | 49 | 480x720 | 107.6 | 2.32 |
| T2V | Open Sora V1.0 | poetry run inference-opensora-v10-16x256x256 | 16 | 256x256 | 11.2 | 23.99 |
| T2V | VideoCrafter-V2-320x512 | poetry run inference-vc2-t2v-320x512 | 16 | 320x512 | 26.4 | 10.03 |
| T2V | VideoCrafter-V1-576x1024 | poetry run inference-vc1-t2v-576x1024 | 16 | 576x1024 | 91.4 | 14.57 |
| I2V | DynamiCrafter | poetry run inference-dc-i2v-576x1024 | 16 | 576x1024 | 101.7 | 52.23 |
| I2V | VideoCrafter-V1 | poetry run inference-vc1-i2v-320x512 | 16 | 320x512 | 26.4 | 10.03 |
| T2I | Flux-dev | poetry run inference-flux-dev | 1 | 768x1360 | 238.1 | 1.18 |
| T2I | Flux-schnell | poetry run inference-flux-schnell | 1 | 768x1360 | 5.4 | 1.20 |

Flux-dev: Trained using guidance distillation, it requires 40 to 50 steps to generate high-quality images.

Flux-schnell: Trained using latent adversarial diffusion distillation, it can generate high-quality images in only 1 to 4 steps.
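
As a reference for the comparison scripts above, input_dir only needs a prompts.txt file with one prompt per line. The layout below is an illustrative sketch; the two example prompts are taken from the demo videos above, not from shipped files.

inputs/t2v/
    └── prompts.txt

Example prompts.txt contents (one prompt per line):

    A mountain biker racing down a trail, dust flying behind
    Fireworks exploding over a historic river, reflections twinkling in the water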

4. Fine-tune T2V models

4.1 Prepare dataset

Please follow docs/datasets.md to try the provided toy dataset or build your own datasets.

4.2 Fine-tune

1. VideoCrafter2 Full Fine-tuning

Before starting, we assume you have finished the following preliminary steps:

  1. Install the environment
  2. Prepare the dataset
  3. Download the checkpoints and confirm that these two checkpoint files exist:
  ll checkpoints/videocrafter/t2v_v2_512/model.ckpt
  ll checkpoints/stablediffusion/v2-1_512-ema/model.ckpt

First, run this command to convert the VC2 checkpoint, as we make minor modifications to the keys of its state dict. The converted checkpoint will be automatically saved at checkpoints/videocrafter/t2v_v2_512/model_converted.ckpt.

python tools/convert_checkpoint.py --input_path checkpoints/videocrafter/t2v_v2_512/model.ckpt
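
For intuition, this kind of conversion loads the checkpoint and rewrites some state-dict keys before saving it again. The snippet below is a hypothetical sketch of such a remapping; the actual key mapping is defined in tools/convert_checkpoint.py and may differ (the "old." and "new." prefixes are placeholders).

# Hypothetical sketch of state-dict key remapping; the real logic lives in
# tools/convert_checkpoint.py and its key mapping may differ.
import torch

ckpt = torch.load("checkpoints/videocrafter/t2v_v2_512/model.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

# Rename keys from an old prefix to a new one ("old." / "new." are placeholders).
converted = {key.replace("old.", "new."): value for key, value in state_dict.items()}

torch.save({"state_dict": converted},
           "checkpoints/videocrafter/t2v_v2_512/model_converted.ckpt")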

Second, run this command to start training on a single GPU. The training results will be automatically saved at results/train/${CURRENT_TIME}_${EXPNAME}.

poetry run train-videocrafter-v2

2. VideoCrafter2 LoRA Fine-tuning

We support LoRA fine-tuning so the model can learn new concepts/characters/styles.

  • Example config file: configs/001_videocrafter2/vc2_t2v_lora.yaml
  • Train a LoRA based on VideoCrafter2: bash shscripts/train_videocrafter_lora.sh
  • Run inference with the trained model: bash shscripts/inference_vc2_t2v_320x512_lora.sh

3. Open-Sora Fine-tuning

We support Open-Sora fine-tuning; simply run the following command:

# fine-tune Open-Sora v1.0
poetry run train-opensorav10

4. Flux LoRA Fine-tuning

We support Flux LoRA fine-tuning; simply run the following commands:

# fine-tune Flux with LoRA
poetry run train-flux-lora

# run inference with the trained LoRA
poetry run inference-flux-lora

If you want to build your own dataset, please organize your data as in inputs/t2i/flux/plushie_teddybear, which contains the training images and the corresponding text prompt files, as shown in the following directory structure. Then modify the instance_data_dir in configs/006_flux/multidatabackend.json.

owndata/
    ├── img1.jpg
    ├── img2.jpg  
    ├── img3.jpg           
    ├── ...
    ├── prompt1.txt      # prompt of img1.jpg
    ├── prompt2.txt      # prompt of img2.jpg
    ├── prompt3.txt      # prompt of img3.jpg
    ├── ...
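
If you already have images but no prompt files, a small helper like the hypothetical sketch below can write one .txt per image in the layout above. The directory name, file names, and prompt texts are placeholders, not part of the repo.

# Hypothetical helper: write one prompt .txt per image so the folder matches the
# layout above. File names and prompts here are placeholders.
from pathlib import Path

data_dir = Path("owndata")
data_dir.mkdir(exist_ok=True)

prompts = {
    "img1.jpg": "a plush teddy bear sitting on a wooden shelf",
    "img2.jpg": "a plush teddy bear wearing a tiny backpack",
}

for image_name, prompt in prompts.items():
    # img1.jpg -> prompt1.txt, img2.jpg -> prompt2.txt, ...
    txt_name = Path(image_name).stem.replace("img", "prompt") + ".txt"
    (data_dir / txt_name).write_text(prompt)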

5. Evaluation

We support VBench evaluation for measuring T2V generation performance. Please check eval/README.md for details.

Acknowledgement

We thank the following repos for sharing their awesome models and codes!

  • Mochi: A new SOTA in open-source video generation models
  • VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
  • VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
  • Open-Sora: Democratizing Efficient Video Production for All
  • CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
  • VADER: Video Diffusion Alignment via Reward Gradients
  • VBench: Comprehensive Benchmark Suite for Video Generative Models
  • Flux: Text-to-image models from Black Forest Labs.
  • SimpleTuner: A fine-tuning kit for text-to-image generation.

Some Resources

🍻 Contributors

📋 License

Please follow CC-BY-NC-ND. If you want a license authorization, please contact the project leads Yingqing He ([email protected]) and Yazhou Xing ([email protected]).

😊 Citation

@software{videotuna,
  author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
  title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
  month = {Nov},
  year = {2024},
  url = {https://github.com/VideoVerses/VideoTuna}
}

Star History
