- Feb 17, 2025: 👋 We release the inference code and model weights of Step-Audio-Chat, Step-Audio-TTS-3B and Step-Audio-Tokenizer.
- Feb 17, 2025: 👋 We release the multi-turn audio benchmark of StepEval-Audio-360.
- Feb 17, 2025: 👋 We release the technical report of Step-Audio.
Step-Audio is the first production-ready open-source framework for intelligent speech interaction that harmonizes comprehension and generation, supporting multilingual conversations (e.g., Chinese, English, Japanese), emotional tones (e.g., joy/sadness), regional dialects (e.g., Cantonese/Sichuanese), adjustable speech rates, and prosodic styles (e.g., rap). Step-Audio demonstrates four key technical innovations:
- **130B-Parameter Multimodal Model**: A single unified model integrating comprehension and generation capabilities, performing speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis. We have made the 130B Step-Audio-Chat variant open source.
- **Generative Data Engine**: Eliminates traditional TTS's reliance on manual data collection by generating high-quality audio through our 130B-parameter multimodal model. Leverages this data to train and publicly release a resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following capabilities for controllable speech synthesis.
- **Granular Voice Control**: Enables precise regulation through instruction-based control design, supporting multiple emotions (anger, joy, sadness), dialects (Cantonese, Sichuanese, etc.), and vocal styles (rap, a cappella humming) to meet diverse speech generation needs.
- **Enhanced Intelligence**: Improves agent performance in complex tasks through ToolCall mechanism integration and role-playing enhancements.
In Step-Audio, audio streams are tokenized via a dual-codebook framework that combines parallel semantic (16.7Hz, 1024-entry codebook) and acoustic (25Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving. A 130B-parameter LLM foundation (Step-1) is further enhanced via audio-contextualized continual pretraining and task-specific post-training, enabling robust cross-modal speech understanding. A hybrid speech decoder combines flow matching with neural vocoding and is optimized for real-time waveform generation, while a streaming-aware architecture features speculative response generation (40% commit rate) and text-based context management (14:1 compression ratio) for efficient cross-modal alignment.
We implement a token-level interleaving approach to effectively integrate semantic tokenization and acoustic tokenization. The semantic tokenizer employs a codebook size of 1024, while the acoustic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details. Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two semantic tokens are paired with three acoustic tokens.
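As a concrete illustration of this 2:3 interleaving, here is a minimal sketch; the function name and the use of plain Python lists are our own, and the actual merging in Step-Audio (e.g., vocabulary offsets for the two codebooks) may differ.

```python
# Minimal sketch of the 2:3 dual-codebook interleaving described above.
# `semantic_ids` (16.7 Hz, 1024-entry codebook) and `acoustic_ids` (25 Hz,
# 4096-entry codebook) are assumed to be plain lists of token ids.

def interleave_dual_codebook(semantic_ids, acoustic_ids,
                             sem_per_group=2, aco_per_group=3):
    """Merge semantic and acoustic token streams at a fixed 2:3 temporal ratio."""
    merged, s, a = [], 0, 0
    while s < len(semantic_ids) or a < len(acoustic_ids):
        merged.extend(semantic_ids[s:s + sem_per_group])   # two semantic tokens
        merged.extend(acoustic_ids[a:a + aco_per_group])   # three acoustic tokens
        s += sem_per_group
        a += aco_per_group
    return merged

# Example: 4 semantic and 6 acoustic tokens cover the same time span.
print(interleave_dual_codebook([10, 11, 12, 13], [20, 21, 22, 23, 24, 25]))
# -> [10, 11, 20, 21, 22, 12, 13, 23, 24, 25]
```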
To enhance Step-Audio’s ability to effectively process speech information and achieve accurate speech-text alignment, we conducted audio continual pretraining based on Step-1, a 130-billion-parameter pretrained text-based large language model (LLM).
The speech decoder in Step-Audio serves a critical function in converting discrete speech tokens, which contain both semantic and acoustic information, into continuous time-domain waveforms that represent natural speech. The decoder architecture incorporates a flow matching model and a mel-to-wave vocoder. To optimize the intelligibility and naturalness of the synthesized speech, the speech decoder is trained using a dual-code interleaving approach, ensuring seamless integration of semantic and acoustic features throughout the generation process.
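The decode path can be pictured roughly as follows; this is only a hedged, high-level sketch in which `flow_matching_model` and `vocoder` are hypothetical stand-ins, not classes from this repository.

```python
# Hedged, high-level sketch of the decode path described above. The objects
# `flow_matching_model` and `vocoder` are hypothetical stand-ins for the
# actual flow matching model and mel-to-wave vocoder.

def decode_speech(interleaved_tokens, flow_matching_model, vocoder):
    """Convert interleaved dual-codebook speech tokens into a waveform."""
    # Flow matching maps discrete semantic + acoustic tokens to a mel-spectrogram.
    mel_spectrogram = flow_matching_model.generate(interleaved_tokens)
    # The neural vocoder renders the mel-spectrogram as a time-domain waveform.
    return vocoder.synthesize(mel_spectrogram)
```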
To enable real-time interactions, we have designed an optimized inference pipeline. At its core, the Controller module manages state transitions, orchestrates speculative response generation, and ensures seamless coordination between critical subsystems. These subsystems include Voice Activity Detection (VAD) for detecting user speech, the Streaming Audio Tokenizer for processing audio in real-time, the Step-Audio language model and Speech Decoder for processing and generating responses, and the Context Manager for preserving conversational continuity.
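To make the division of labor concrete, here is an illustrative sketch of such a controller loop; every class and method name below is a hypothetical stand-in, and the control flow is heavily simplified relative to the real pipeline.

```python
# Illustrative sketch of the streaming pipeline described above; all component
# interfaces (vad, tokenizer, llm, decoder, context) are hypothetical.

class Controller:
    def __init__(self, vad, tokenizer, llm, decoder, context):
        self.vad = vad              # Voice Activity Detection
        self.tokenizer = tokenizer  # streaming dual-codebook audio tokenizer
        self.llm = llm              # Step-Audio language model
        self.decoder = decoder      # speech decoder: tokens -> waveform
        self.context = context      # text-based conversation history (~14:1 compression)
        self.draft = None           # speculatively generated response

    def on_audio_chunk(self, chunk):
        """Process one chunk of user audio; return reply audio once the turn ends."""
        audio_tokens = self.tokenizer.encode(chunk)
        if self.vad.user_paused(chunk):
            # Speculatively draft a reply during the pause; only a fraction of
            # drafts is ultimately committed (the reported ~40% commit rate).
            self.draft = self.llm.generate(self.context.render(), audio_tokens)
        if self.vad.end_of_utterance(chunk):
            reply = self.draft or self.llm.generate(self.context.render(), audio_tokens)
            self.context.append_turn(audio_tokens, reply)
            self.draft = None
            return self.decoder.synthesize(reply)
        return None
```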
In the post-training phase, we conducted task-specific Supervised Fine-Tuning (SFT) for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). For Audio Input Text Output (AQTA) tasks, we implemented SFT using diversified high-quality datasets combined with Reinforcement Learning from Human Feedback (RLHF) to enhance response quality, enabling fine-grained control over emotional expression, speech speed, dialect, and prosody.
| Models | Links |
|---|---|
| Step-Audio-Tokenizer | [🤗huggingface](https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer) |
| Step-Audio-Chat | [🤗huggingface](https://huggingface.co/stepfun-ai/Step-Audio-Chat) |
| Step-Audio-TTS-3B | [🤗huggingface](https://huggingface.co/stepfun-ai/Step-Audio-TTS-3B) |
| Models | Links |
|---|---|
| Step-Audio-Tokenizer | modelscope |
| Step-Audio-Chat | modelscope |
| Step-Audio-TTS-3B | modelscope |
The following table shows the requirements for running the Step-Audio models (batch size = 1):

| Model | Setting (sample frequency) | GPU Minimum Memory |
|---|---|---|
| Step-Audio-Tokenizer | 41.6Hz | 1.5GB |
| Step-Audio-Chat | 41.6Hz | 265GB |
| Step-Audio-TTS-3B | 41.6Hz | 8GB |
- An NVIDIA GPU with CUDA support is required.
- The models have been tested on four A800 80GB GPUs.
- Recommended: 4× A800/H800 GPUs with 80GB memory each, for better generation quality.
- Tested operating system: Linux
- Python >= 3.10.0 (Recommend to use Anaconda or Miniconda)
- PyTorch >= 2.3-cu121
- CUDA Toolkit
```bash
git clone https://github.com/stepfun-ai/Step-Audio.git
conda create -n stepaudio python=3.10
conda activate stepaudio

cd Step-Audio
pip install -r requirements.txt
```
```bash
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-Chat
git clone https://huggingface.co/stepfun-ai/Step-Audio-TTS-3B
```
After downloading the models, `where_you_download_dir` should have the following structure:

```
where_you_download_dir
├── Step-Audio-Tokenizer
├── Step-Audio-Chat
├── Step-Audio-TTS-3B
```
Run end-to-end inference with audio/text input and audio/text output:

```bash
python offline_inference.py --model-path where_you_download_dir
```
Run TTS inference with the default speaker, or clone the voice of a new speaker:

```bash
python tts_inference.py --model-path where_you_download_dir --output-path where_you_save_audio_dir --synthesis-type use_tts_or_clone
```
A speaker information dict is required for clone mode, formatted as follows:
```json
{
    "speaker": "speaker id",
    "prompt_text": "content of prompt wav",
    "wav_path": "prompt wav path"
}
```
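For illustration only, a filled-in dict might look like the snippet below; the speaker id, transcript, and file paths are placeholders rather than files shipped with the repository.

```python
# Hypothetical example of a speaker-information dict for clone mode.
# "prompt_text" should be the exact transcript of the prompt wav.
import json

speaker_info = {
    "speaker": "my_voice",                       # an id you choose for the cloned speaker
    "prompt_text": "This is my reference recording.",
    "wav_path": "prompts/my_voice_prompt.wav",   # path to the reference recording
}

# Saved as JSON purely for illustration; see tts_inference.py for how the
# dict is actually supplied in clone mode.
with open("speaker_info.json", "w", encoding="utf-8") as f:
    json.dump(speaker_info, f, ensure_ascii=False, indent=2)
```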
Start a local server for online inference. Assume you have 4 GPUs available and have already downloaded all the models.
```bash
python app.py --model-path where_you_download_dir
```
Step-Audio-Chat is a 130B-parameter LLM, so we recommend running inference with vLLM and tensor parallelism.
Currently, the official vLLM does not support the Step-1 model; you can temporarily use our development branch for a local installation.
Because our attention mechanism is a variant of ALiBi, the official flash attention library is not compatible. We provide a custom flash attention library in the Step-Audio-Chat repository. Make sure to export the path to this library via the environment variable below before running the model.
```bash
export OPTIMUS_LIB_PATH=where_you_download_dir/Step-Audio-Chat/lib
vllm serve where_you_download_dir/Step-Audio-Chat --dtype auto -tp $tp --served-model-name step_chat_audio --trust-remote-code
```
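Once the server is up, vLLM exposes an OpenAI-compatible API (on port 8000 by default), so a quick text-only sanity check can be made with the `openai` client; the prompt below is just an example.

```python
# Minimal text-only query against the vLLM server started above,
# using vLLM's OpenAI-compatible endpoint (default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="step_chat_audio",  # must match --served-model-name
    messages=[{"role": "user", "content": "Introduce yourself briefly."}],
)
print(completion.choices[0].message.content)
```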
| | Hidden Feature Modeling | | | | Discrete Audio Token Modeling | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Whisper Large-v3 | Qwen2-Audio | MinMo | LUCY | Moshi | GLM-4-voice Base | GLM-4-voice Chat | Step-Audio Pretrain | Step-Audio-Chat |
| Aishell-1 | 5.14 | 1.53 | - | 2.4 | - | 2.46 | 226.47 | 0.87 | 1.95 |
| Aishell-2 ios | 4.76 | 3.06 | 2.69 | - | - | - | 211.3 | 2.91 | 3.57 |
| Wenetspeech test-net | 9.68 | 7.72 | 6.64 | 8.78 | - | - | 146.05 | 7.62 | 8.75 |
| Wenet test-meeting | 18.54 | 8.4 | 7.6 | 10.42 | - | - | 140.82 | 7.78 | 9.52 |
| Librispeech test-clean | 1.9 | 1.6 | 1.6 | 3.36 | 5.7 | 2.82 | 75.39 | 2.36 | 3.11 |
| Librispeech test-other | 3.65 | 3.6 | 3.82 | 8.05 | - | 7.66 | 80.3 | 6.32 | 8.44 |
| AVG | 7.28 | 4.32 | - | - | - | - | 146.74 | 4.64 | 5.89 |
| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |
- Note: Step-Audio-TTS-3B-Single denotes a dual-codebook backbone with a single-codebook vocoder.
| Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
| Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |
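For reference, the CER/WER numbers in the tables above are edit-distance-based metrics; the snippet below is a minimal, generic CER implementation, not the exact scoring script used to produce these results.

```python
# Generic character error rate (CER): Levenshtein distance between the
# reference and hypothesis character sequences, normalized by reference length.

def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    dp = list(range(len(hyp) + 1))  # edit distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]; dp[j-1] is dp[i][j-1]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1] / max(len(ref), 1)

# Two substitutions over eight reference characters -> 0.250
print(f"{cer('吃葡萄不吐葡萄皮', '吃葡萄不吐苹果皮'):.3f}")
```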
We release StepEval-Audio-360 as a new benchmark. It consists of 137 multi-turn Chinese prompts sourced from real users and is designed to evaluate the quality of generated responses across the following dimensions: Voice Instruction Following, Voice Understanding, Logical Reasoning, Role-playing, Creativity, Singing, Language Ability, Speech Emotion Control, and Gaming.
Model | Factuality (% ↑) | Relevance (% ↑) | Chat Score ↑ |
---|---|---|---|
GLM4-Voice | 54.7 | 66.4 | 3.49 |
Qwen2-Audio | 22.6 | 26.3 | 2.27 |
Moshi* | 1.0 | 0 | 1.49 |
Step-Audio-Chat | 66.4 | 75.2 | 4.11 |
- Note: Moshi is marked with "*" because its results should be considered for reference only.
Model | Llama Question | Web Questions | TriviaQA* | ComplexBench | HSK-6 |
---|---|---|---|---|---|
GLM4-Voice | 64.7 | 32.2 | 39.1 | 66.0 | 74.0 |
Moshi | 62.3 | 26.6 | 22.8 | - | - |
Freeze-Omni | 72.0 | 44.7 | 53.9 | - | - |
LUCY | 59.7 | 29.3 | 27.0 | - | - |
MinMo | 78.9 | 55.0 | 48.3 | - | - |
Qwen2-Audio | 52.0 | 27.0 | 37.3 | 54.0 | - |
Step-Audio-Chat | 81.0 | 75.1 | 58.0 | 74.0 | 86.0 |
- Note: Results marked with "*" on the TriviaQA dataset are considered for reference only.
| Category | Instruction Following | | Audio Quality | |
|---|---|---|---|---|
| | GLM-4-Voice | Step-Audio | GLM-4-Voice | Step-Audio |
| Languages | 1.9 | 3.8 | 2.9 | 3.3 |
| Role-playing | 3.8 | 4.2 | 3.2 | 3.6 |
| Singing / RAP | 2.1 | 2.4 | 2.4 | 4.0 |
| Voice Control | 3.6 | 4.4 | 3.3 | 4.1 |
The online version of Step-Audio can be accessed from the 跃问 (Yuewen) app, where some impressive examples can also be found.
| role | prompt wav | clone wav |
|---|---|---|
| 于谦 | google drive audio file | google drive audio file |
| 李雪琴 | google drive audio file | google drive audio file |
| prompt | response |
|---|---|
| Human: 说一个绕口令 (Say a tongue twister)<br>Assistant: 吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮 (Eat grapes without spitting out the skins; spit out the skins without eating grapes)<br>Human: 哎,你能把这个绕口令说的再快一点吗?(Hey, can you say this tongue twister a bit faster?) | google drive audio file |
| Human: 说一个绕口令 (Say a tongue twister)<br>Assistant: 吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮<br>Human: 哎,你能把这个绕口令说的再快一点吗?(Hey, can you say it a bit faster?)<br>Assistant: 吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮<br>Human: 呃,你再用非常非常慢的速度说一遍的。(Uh, now say it again very, very slowly.) | google drive audio file |
prompt | response |
---|---|
Human: 你这语气又不撒娇又不卖萌的,要不你撒个娇卖个萌吧。(Your tone is neither coquettish nor cute; how about acting a little coquettish and cute?) | google drive audio file |
Human: 怎么办?我感觉我的人生很失败。(What should I do? I feel like my life is a failure.) | google drive audio file |
Human: 小跃。你真的是。特别厉害。(Xiaoyue, you really are amazing.) | google drive audio file |
| prompt | response |
|---|---|
| Human: What did the speaker mean when they said, it's raining cats and dogs?<br>Assistant: When they say "It's raining cats and dogs," it just means it's raining really hard. The speaker isn't literally saying cats and dogs are falling from the sky! It's just a fun way to describe heavy rain. | google drive audio file |
| Human: こんにちは。(Hello.)<br>Assistant: こんにちは!何か手伝いましょうか?(Hello! Is there anything I can help you with?) | google drive audio file |
prompt | response |
---|---|
Human: 唱一段rap (Perform a rap) | google drive audio file |
Human: 唱一段中文的歌曲 (Sing a Chinese song) | google drive audio file |
Human: 唱一段日语的歌曲 (Sing a Japanese song) | google drive audio file |
Part of the code for this project comes from:
Thank you to all the open-source projects for their contributions to this project!
- The use of the weights of the Step-Audio series models requires following the licenses in the Step-Audio-Chat, Step-Audio-Tokenizer, and Step-Audio-TTS-3B repositories.
- The code in this open-source repository is licensed under the Apache 2.0 License.
```bibtex
@misc{huang2025stepaudiounifiedunderstandinggeneration,
    title={Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction},
    author={Ailin Huang and Boyong Wu and Bruce Wang and Chao Yan and Chen Hu and Chengli Feng and Fei Tian and Feiyu Shen and Jingbei Li and Mingrui Chen and Peng Liu and Ruihang Miao and Wang You and Xi Chen and Xuerui Yang and Yechang Huang and Yuxiang Zhang and Zheng Gong and Zixin Zhang and Brian Li and Changyi Wan and Hanpeng Hu and Ranchen Ming and Song Yuan and Xuelin Zhang and Yu Zhou and Bingxin Li and Buyun Ma and Kang An and Wei Ji and Wen Li and Xuan Wen and Yuankai Ma and Yuanwei Liang and Yun Mou and Bahtiyar Ahmidi and Bin Wang and Bo Li and Changxin Miao and Chen Xu and Chengting Feng and Chenrun Wang and Dapeng Shi and Deshan Sun and Dingyuan Hu and Dula Sai and Enle Liu and Guanzhe Huang and Gulin Yan and Heng Wang and Haonan Jia and Haoyang Zhang and Jiahao Gong and Jianchang Wu and Jiahong Liu and Jianjian Sun and Jiangjie Zhen and Jie Feng and Jie Wu and Jiaoren Wu and Jie Yang and Jinguo Wang and Jingyang Zhang and Junzhe Lin and Kaixiang Li and Lei Xia and Li Zhou and Longlong Gu and Mei Chen and Menglin Wu and Ming Li and Mingxiao Li and Mingyao Liang and Na Wang and Nie Hao and Qiling Wu and Qinyuan Tan and Shaoliang Pang and Shiliang Yang and Shuli Gao and Siqi Liu and Sitong Liu and Tiancheng Cao and Tianyu Wang and Wenjin Deng and Wenqing He and Wen Sun and Xin Han and Xiaomin Deng and Xiaojia Liu and Xu Zhao and Yanan Wei and Yanbo Yu and Yang Cao and Yangguang Li and Yangzhen Ma and Yanming Xu and Yaqiang Shi and Yilei Wang and Yinmin Zhong and Yu Luo and Yuanwei Lu and Yuhe Yin and Yuting Yan and Yuxiang Yang and Zhe Xie and Zheng Ge and Zheng Sun and Zhewei Huang and Zhichao Chang and Zidong Yang and Zili Zhang and Binxing Jiao and Daxin Jiang and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Shuchang Zhou and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu},
    year={2025},
    eprint={2502.11946},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2502.11946},
}
```