# Pretrained models

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`. The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main.

| Model | Download link | File size |
|---|---|---|
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| Flow prediction network, large 44.1kHz, v2 (recommended) | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
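
If you download the weights manually (e.g., from Hugging Face), you can verify them against the MD5 checksums in `mmaudio/utils/download_utils.py`. A minimal sketch, where the expected digest is a placeholder (take the real values from that file):

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-GB checkpoints don't fill memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Placeholder digest -- substitute the real value from download_utils.py.
expected = "0123456789abcdef0123456789abcdef"
actual = md5_of(Path("weights/mmaudio_large_44k_v2.pth"))
assert actual == expected, f"checksum mismatch: {actual}"
```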

To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP will be downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz), not to the model size. The 44.1kHz vocoder will be downloaded automatically. The `_v2` model performs worse in benchmarking (e.g., in Fréchet distance) but, in my experience, generalizes better to new data.
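
To make the pairing concrete, here is a hypothetical mapping (an illustration, not part of MMAudio's API) from flow prediction checkpoints to the sampling-rate-specific VAE and vocoder they need, using the file names from the table above and the layout shown below:

```python
# Hypothetical pairing table (illustration only, not MMAudio's API):
# each flow prediction network needs the VAE/vocoder matching its
# sampling rate, plus the Synchformer visual encoder.
WEIGHT_SETS = {
    "mmaudio_small_16k": {
        "flow": "weights/mmaudio_small_16k.pth",
        "vae": "ext_weights/v1-16.pth",         # 16kHz VAE
        "vocoder": "ext_weights/best_netG.pt",  # 16kHz BigVGAN vocoder
    },
    "mmaudio_large_44k_v2": {
        "flow": "weights/mmaudio_large_44k_v2.pth",
        "vae": "ext_weights/v1-44.pth",  # 44.1kHz VAE
        "vocoder": None,  # the 44.1kHz vocoder is downloaded automatically
    },
}
SYNCHFORMER = "ext_weights/synchformer_state_dict.pth"  # needed in all cases
```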

The expected directory structure (full):

```
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
```

The expected directory structure (minimal, for the recommended model only):

```
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
```
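
Before running the demo, a quick sanity check (a sketch only, assuming you launch from the MMAudio repo root) can confirm the minimal layout above is in place:

```python
from pathlib import Path

# Files required for the recommended (large 44.1kHz v2) setup.
required = [
    Path("ext_weights/synchformer_state_dict.pth"),
    Path("ext_weights/v1-44.pth"),
    Path("weights/mmaudio_large_44k_v2.pth"),
]

missing = [str(p) for p in required if not p.exists()]
if missing:
    raise FileNotFoundError(f"missing weight files: {missing}")
print("all required weights found")
```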