The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
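If you suspect a corrupted download, you can check a file against its listed checksum with a short Python snippet. This is only a sketch: the expected hash below is a placeholder for the actual value in `mmaudio/utils/download_utils.py`.

```python
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

# Compare against the checksum listed in mmaudio/utils/download_utils.py
# (the string below is a placeholder, not a real hash).
expected = '<md5 from download_utils.py>'
actual = md5_of(Path('weights/mmaudio_large_44k_v2.pth'))
print('OK' if actual == expected else f'Mismatch: {actual}')
```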
The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
| Model | Download link | File size |
|---|---|---|
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| Flow prediction network, large 44.1kHz, v2 (recommended) | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP will be downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz), not to the model size. The 44.1kHz vocoder will also be downloaded automatically.

The `_v2` model performs worse on benchmarks (e.g., Fréchet distance) but, in my experience, generalizes better to new data.
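If the automatic download does not work in your environment, one way to fetch a checkpoint yourself is via `huggingface_hub`. This is a hedged sketch, not the project's own download path: it assumes the file sits at the repository root under the name from the table above.

```python
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

# Filename taken from the table above; its exact location inside the
# Hugging Face repo is an assumption -- adjust if the layout differs.
cached = hf_hub_download(repo_id='hkchengrex/MMAudio',
                         filename='mmaudio_large_44k_v2.pth')

# Place it where the directory layout below expects it.
Path('weights').mkdir(exist_ok=True)
shutil.copy(cached, Path('weights') / 'mmaudio_large_44k_v2.pth')
```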
The expected directory structure (full):
```
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
```
The expected directory structure (minimal, for the recommended model only):
```
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
```
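As a quick sanity check, a snippet like the following (assuming the minimal layout above and that you run it from the MMAudio root) confirms the required checkpoints are in place before launching the demo:

```python
from pathlib import Path

# Files required for the recommended large 44.1kHz v2 setup (minimal layout above).
required = [
    Path('ext_weights/synchformer_state_dict.pth'),
    Path('ext_weights/v1-44.pth'),
    Path('weights/mmaudio_large_44k_v2.pth'),
]

missing = [p for p in required if not p.exists()]
if missing:
    print('Missing:', ', '.join(str(p) for p in missing))
else:
    print('All required checkpoints found.')
```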