The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
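If you suspect a corrupted download, you can check a file against its listed checksum with a short Python snippet. This is only a sketch: the expected hash below is a placeholder for the actual value in `mmaudio/utils/download_utils.py`.

```python
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

# Compare against the checksum listed in mmaudio/utils/download_utils.py
# (the string below is a placeholder, not a real hash).
expected = '<md5 from download_utils.py>'
actual = md5_of(Path('weights/mmaudio_large_44k_v2.pth'))
print('OK' if actual == expected else f'Mismatch: {actual}')
```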
The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main
| Model | Download link | File size |
|---|---|---|
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| Flow prediction network, large 44.1kHz, v2 (recommended) | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP will be downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz), not to the model size. The 44.1kHz vocoder will also be downloaded automatically.

The `_v2` model performs worse on benchmarks (e.g., Fréchet distance) but, in my experience, generalizes better to new data.
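If the automatic download does not work in your environment, one way to fetch a checkpoint yourself is via `huggingface_hub`. This is a hedged sketch, not the project's own download path: it assumes the file sits at the repository root under the name from the table above.

```python
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

# Filename taken from the table above; its exact location inside the
# Hugging Face repo is an assumption -- adjust if the layout differs.
cached = hf_hub_download(repo_id='hkchengrex/MMAudio',
                         filename='mmaudio_large_44k_v2.pth')

# Place it where the directory layout below expects it.
Path('weights').mkdir(exist_ok=True)
shutil.copy(cached, Path('weights') / 'mmaudio_large_44k_v2.pth')
```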
The expected directory structure (full):
```
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
```
The expected directory structure (minimal, for the recommended model only):
```
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
```
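As a quick sanity check, a snippet like the following (assuming the minimal layout above and that you run it from the MMAudio root) confirms the required checkpoints are in place before launching the demo:

```python
from pathlib import Path

# Files required for the recommended large 44.1kHz v2 setup (minimal layout above).
required = [
    Path('ext_weights/synchformer_state_dict.pth'),
    Path('ext_weights/v1-44.pth'),
    Path('weights/mmaudio_large_44k_v2.pth'),
]

missing = [p for p in required if not p.exists()]
if missing:
    print('Missing:', ', '.join(str(p) for p in missing))
else:
    print('All required checkpoints found.')
```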