
中文  |   English   |  日本語



Step-Audio

🔥🔥🔥 News!!

Table of Contents

  1. Introduction
  2. Model Summary
  3. Model Download
  4. Model Usage
  5. Benchmark
  6. Online Engine
  7. Examples
  8. Acknowledgements
  9. License Agreement
  10. Citation

1. Introduction

Step-Audio is the first production-ready open-source framework for intelligent speech interaction that harmonizes comprehension and generation, supporting multilingual conversations (e.g., Chinese, English, Japanese), emotional tones (e.g., joy/sadness), regional dialects (e.g., Cantonese/Sichuanese), adjustable speech rates, and prosodic styles (e.g., rap). Step-Audio demonstrates four key technical innovations:

  • 130B-Parameter Multimodal Model: A single unified model integrating comprehension and generation capabilities, performing speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis. We have made the 130B Step-Audio-Chat variant open source.

  • Generative Data Engine: Eliminates traditional TTS's reliance on manual data collection by generating high-quality audio through our 130B-parameter multimodal model. Leverages this data to train and publicly release a resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following capabilities for controllable speech synthesis.

  • Granular Voice Control: Enables precise regulation through instruction-based control design, supporting multiple emotions (anger, joy, sadness), dialects (Cantonese, Sichuanese, etc.), and vocal styles (rap, a cappella humming) to meet diverse speech generation needs.

  • Enhanced Intelligence: Improves agent performance in complex tasks through ToolCall mechanism integration and role-playing enhancements.

2. Model Summary

In Step-Audio, audio streams are tokenized via a dual-codebook framework that combines parallel semantic (16.7 Hz, 1024-entry codebook) and acoustic (25 Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving. A 130B-parameter LLM foundation (Step-1) is further enhanced through audio-contextualized continual pretraining and task-specific post-training, enabling robust cross-modal speech understanding. A hybrid speech decoder combines flow matching with neural vocoding and is optimized for real-time waveform generation. A streaming-aware architecture features speculative response generation (40% commit rate) and text-based context management (14:1 compression ratio) for efficient cross-modal alignment.

2.1 Tokenizer

We implement a token-level interleaving approach to effectively integrate semantic tokenization and acoustic tokenization. The semantic tokenizer employs a codebook size of 1024, while the acoustic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details. Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two semantic tokens are paired with three acoustic tokens.
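As an illustration of the 2:3 interleaving described above, here is a minimal sketch (not the released tokenizer code) that merges the two token streams; the token ids are made up for the example:

```python
def interleave_dual_codebook(semantic_tokens, acoustic_tokens,
                             sem_group=2, aco_group=3):
    """Merge the semantic (16.7 Hz) and acoustic (25 Hz) streams in 2:3 groups."""
    merged, s, a = [], 0, 0
    while s < len(semantic_tokens) or a < len(acoustic_tokens):
        merged.extend(semantic_tokens[s:s + sem_group])
        merged.extend(acoustic_tokens[a:a + aco_group])
        s += sem_group
        a += aco_group
    return merged

# Made-up ids: semantic ids lie in [0, 1024), acoustic ids in [0, 4096).
semantic = [101, 102, 103, 104]
acoustic = [2001, 2002, 2003, 2004, 2005, 2006]
print(interleave_dual_codebook(semantic, acoustic))
# -> [101, 102, 2001, 2002, 2003, 103, 104, 2004, 2005, 2006]
```

Because one second of audio yields roughly 16.7 semantic and 25 acoustic tokens, the two streams remain temporally aligned under this 2:3 grouping.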

2.2 Language Model

To enhance Step-Audio’s ability to effectively process speech information and achieve accurate speech-text alignment, we conducted audio continual pretraining based on Step-1, a 130-billion-parameter pretrained text-based large language model (LLM).

2.3 Speech Decoder

The speech decoder in Step-Audio serves a critical function in converting discrete speech tokens, which contain both semantic and acoustic information, into continuous time-domain waveforms that represent natural speech. The decoder architecture incorporates a flow matching model and a mel-to-wave vocoder. To optimize the intelligibility and naturalness of the synthesized speech, the speech decoder is trained using a dual-code interleaving approach, ensuring seamless integration of semantic and acoustic features throughout the generation process.
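Schematically, the decoding path looks like the sketch below; `flow_matching_model` and `vocoder` are hypothetical stand-ins rather than the actual Step-Audio modules:

```python
import torch

def split_dual_codebook(tokens, sem_group=2, aco_group=3):
    """Undo the 2:3 interleaving to recover the semantic and acoustic streams."""
    semantic, acoustic = [], []
    period = sem_group + aco_group
    for i in range(0, len(tokens), period):
        chunk = tokens[i:i + period]
        semantic.extend(chunk[:sem_group])
        acoustic.extend(chunk[sem_group:])
    return semantic, acoustic

@torch.no_grad()
def decode_speech(interleaved_tokens, flow_matching_model, vocoder):
    """Dual-codebook tokens -> mel-spectrogram (flow matching) -> waveform (vocoder)."""
    semantic, acoustic = split_dual_codebook(interleaved_tokens)
    mel = flow_matching_model(semantic, acoustic)   # stage 1: tokens to mel
    return vocoder(mel)                             # stage 2: mel to waveform
```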

2.4 Real-time Inference Pipeline

To enable real-time interactions, we have designed an optimized inference pipeline. At its core, the Controller module manages state transitions, orchestrates speculative response generation, and ensures seamless coordination between critical subsystems. These subsystems include Voice Activity Detection (VAD) for detecting user speech, the Streaming Audio Tokenizer for processing audio in real-time, the Step-Audio language model and Speech Decoder for processing and generating responses, and the Context Manager for preserving conversational continuity.
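The sketch below illustrates one way such a loop can be organized. All component names (`vad`, `tokenizer`, `chat_model`, `speech_decoder`, `context`) are hypothetical stand-ins, not the actual Step-Audio interfaces:

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()    # waiting for the user to finish speaking
    RESPONDING = auto()   # streaming the assistant's reply

def run_pipeline(audio_frames, vad, tokenizer, chat_model, speech_decoder, context):
    """Simplified controller loop: VAD -> streaming tokenizer -> LLM -> decoder."""
    state = State.LISTENING
    user_tokens = []
    for frame in audio_frames:
        # The streaming audio tokenizer consumes fixed-size frames in real time.
        user_tokens.extend(tokenizer.feed(frame))
        if state is State.LISTENING and vad.speech_ended(frame):
            # End of the user's utterance: commit it to the text-based context.
            context.append_user(user_tokens)
            user_tokens = []
            state = State.RESPONDING
        if state is State.RESPONDING:
            # The LLM streams response tokens; the speech decoder renders them
            # chunk by chunk so playback can begin before generation finishes.
            for chunk in chat_model.generate_stream(context.history()):
                yield speech_decoder.decode(chunk)
            context.append_assistant(chat_model.last_response_text())
            state = State.LISTENING
```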

2.5 Post-training details

In the post-training phase, we conducted task-specific Supervised Fine-Tuning (SFT) for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). For Audio Input Text Output (AQTA) tasks, we implemented SFT using diversified high-quality datasets combined with Reinforcement Learning from Human Feedback (RLHF) to enhance response quality, enabling fine-grained control over emotional expression, speech speed, dialect, and prosody.

3. Model Download

3.1 Huggingface

| Models | Links |
|---|---|
| Step-Audio-Tokenizer | 🤗 huggingface |
| Step-Audio-Chat | 🤗 huggingface |
| Step-Audio-TTS-3B | 🤗 huggingface |

3.2 Modelscope

| Models | Links |
|---|---|
| Step-Audio-Tokenizer | modelscope |
| Step-Audio-Chat | modelscope |
| Step-Audio-TTS-3B | modelscope |

4. Model Usage

📜 4.1 Requirements

The following table shows the requirements for running the Step-Audio models (batch size = 1):

| Model | Setting (sample frequency) | Minimum GPU memory |
|---|---|---|
| Step-Audio-Tokenizer | 41.6 Hz | 1.5 GB |
| Step-Audio-Chat | 41.6 Hz | 265 GB |
| Step-Audio-TTS-3B | 41.6 Hz | 8 GB |
  • An NVIDIA GPU with CUDA support is required (a quick memory check is sketched below).
    • The models have been tested on four A800 80GB GPUs.
    • We recommend 4×A800/H800 GPUs with 80GB memory each for better generation quality.
  • Tested operating system: Linux
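As a rough pre-flight check, the snippet below (not part of the repository) sums the memory of all visible CUDA devices and compares it with the 265 GB minimum listed for Step-Audio-Chat:

```python
import torch

REQUIRED_GB = 265  # minimum GPU memory for Step-Audio-Chat at batch size 1

# Sum the total memory of every visible CUDA device, in GB.
total_gb = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
) / 1024**3

print(f"Visible GPUs: {torch.cuda.device_count()}, total memory: {total_gb:.0f} GB")
if total_gb < REQUIRED_GB:
    print(f"Warning: less than {REQUIRED_GB} GB of GPU memory is visible; "
          "Step-Audio-Chat may not fit even with tensor parallelism.")
```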

🔧 4.2 Dependencies and Installation

git clone https://github.com/stepfun-ai/Step-Audio.git
conda create -n stepaudio python=3.10
conda activate stepaudio

cd Step-Audio
pip install -r requirements.txt

git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-Chat
git clone https://huggingface.co/stepfun-ai/Step-Audio-TTS-3B

After downloading the models, where_you_download_dir should have the following structure:

where_you_download_dir
├── Step-Audio-Tokenizer
├── Step-Audio-Chat
├── Step-Audio-TTS-3B

🚀 4.3 Inference Scripts

Offline inference

End-to-end inference with audio/text input and audio/text output.

python offline_inference.py --model-path where_you_download_dir

TTS inference

Run TTS inference with the default speaker, or clone a new speaker's voice.

python tts_inference.py --model-path where_you_download_dir --output-path where_you_save_audio_dir --synthesis-type use_tts_or_clone

A speaker information dict is required for clone mode, formatted as follows:

{
    "speaker": "speaker id",
    "prompt_text": "content of prompt wav",
    "wav_path": "prompt wav path"
}
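For example, the dict can be prepared as a small JSON file; the speaker id, transcript, and wav path below are placeholders, and how the file is passed to tts_inference.py should be checked against the script's arguments:

```python
import json

# Placeholder values; replace with a real speaker id, the transcript of the
# prompt audio, and the path to that audio file.
speaker_info = {
    "speaker": "my_speaker",
    "prompt_text": "The transcript of the prompt wav goes here.",
    "wav_path": "prompts/my_speaker.wav",
}

with open("speaker_info.json", "w", encoding="utf-8") as f:
    json.dump(speaker_info, f, ensure_ascii=False, indent=2)
```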

Launch Web Demo

Start a local server for online inference. Assume you have 4 GPUs available and have already downloaded all the models.

python app.py --model-path where_you_download_dir

Inference Chat Model with vLLM (recommended)

Step-Audio-Chat is a 130B-parameter LLM, so we recommend using vLLM with tensor parallelism for inference.

Currently, the official vLLM does not support the Step 1 model. You can temporarily use our development branch for local installation.

Because our attention mechanism is a variant of ALiBi, the official flash attention library is not compatible. We have provided a custom flash attention library in the Step-Audio-Chat repository. Make sure to export the path to the custom flash attention library as an environment variable before running the model.

export OPTIMUS_LIB_PATH=where_you_download_dir/Step-Audio-Chat/lib

vllm serve where_you_download_dir/Step-Audio-Chat --dtype auto -tp $tp --served-model-name step_chat_audio --trust-remote-code
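Once the server is up, it exposes vLLM's standard OpenAI-compatible API (default port 8000), so a quick text-only smoke test can be run with the openai client. The prompt is illustrative only; the full audio pipeline is driven through the scripts above rather than raw text requests:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step_chat_audio",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response.choices[0].message.content)
```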

5. Benchmark

5.1 ASR result comparison

The compared systems fall into two groups, hidden feature modeling and discrete audio token modeling.

| Test set | Whisper Large-v3 | Qwen2-Audio | MinMo | LUCY | Moshi | GLM-4-voice Base | GLM-4-voice Chat | Step-Audio Pretrain | Step-Audio-Chat |
|---|---|---|---|---|---|---|---|---|---|
| Aishell-1 | 5.14 | 1.53 | - | 2.4 | - | 2.46 | 226.47 | 0.87 | 1.95 |
| Aishell-2 ios | 4.76 | 3.06 | 2.69 | - | - | - | 211.3 | 2.91 | 3.57 |
| Wenetspeech test-net | 9.68 | 7.72 | 6.64 | 8.78 | - | - | 146.05 | 7.62 | 8.75 |
| Wenet test-meeting | 18.54 | 8.4 | 7.6 | 10.42 | - | - | 140.82 | 7.78 | 9.52 |
| Librispeech test-clean | 1.9 | 1.6 | 1.6 | 3.36 | 5.7 | 2.82 | 75.39 | 2.36 | 3.11 |
| Librispeech test-other | 3.65 | 3.6 | 3.82 | 8.05 | - | 7.66 | 80.3 | 6.32 | 8.44 |
| AVG | 7.28 | 4.32 | - | - | - | - | 146.74 | 4.64 | 5.89 |

5.2 TTS

5.2.1 Performance comparison of content consistency (CER/WER) with GLM-4-Voice and MinMo.

| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |

5.2.2 Results of TTS Models on SEED Test Sets.

  • Step-Audio-TTS-3B-Single denotes a dual-codebook backbone with a single-codebook vocoder.
| Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |

5.2.3 Performance comparison of dual-codebook resynthesis with CosyVoice.

| Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |

5.3 AQTA Chat

We release StepEval-Audio-360 as a new benchmark, which consists of 137 multi-turn Chinese prompts sourced from real users and is designed to evaluate the quality of generated responses across the following dimensions: Voice Instruction Following, Voice Understanding, Logical Reasoning, Role-playing, Creativity, Singing, Language Ability, Speech Emotion Control, and Gaming.

5.3.1 StepEval-Audio-360

LLM judge metrics (GPT-4o)

Comparison of fundamental capabilities of voice chat on the StepEval-Audio-360.
| Model | Factuality (% ↑) | Relevance (% ↑) | Chat Score ↑ |
|---|---|---|---|
| GLM4-Voice | 54.7 | 66.4 | 3.49 |
| Qwen2-Audio | 22.6 | 26.3 | 2.27 |
| Moshi* | 1.0 | 0 | 1.49 |
| Step-Audio-Chat | 66.4 | 75.2 | 4.11 |
  • Note: Moshi is marked with "*"; its results should be considered for reference only.

Radar Chart (Human Evaluation)


5.3.2 Public Test Set

| Model | Llama Question | Web Questions | TriviaQA* | ComplexBench | HSK-6 |
|---|---|---|---|---|---|
| GLM4-Voice | 64.7 | 32.2 | 39.1 | 66.0 | 74.0 |
| Moshi | 62.3 | 26.6 | 22.8 | - | - |
| Freeze-Omni | 72.0 | 44.7 | 53.9 | - | - |
| LUCY | 59.7 | 29.3 | 27.0 | - | - |
| MinMo | 78.9 | 55.0 | 48.3 | - | - |
| Qwen2-Audio | 52.0 | 27.0 | 37.3 | 54.0 | - |
| Step-Audio-Chat | 81.0 | 75.1 | 58.0 | 74.0 | 86.0 |
  • Note: Results marked with "*" on the TriviaQA dataset should be considered for reference only.

5.3.3 Audio instruction following

| Category | Instruction Following: GLM-4-Voice | Instruction Following: Step-Audio | Audio Quality: GLM-4-Voice | Audio Quality: Step-Audio |
|---|---|---|---|---|
| Languages | 1.9 | 3.8 | 2.9 | 3.3 |
| Role-playing | 3.8 | 4.2 | 3.2 | 3.6 |
| Singing / RAP | 2.1 | 2.4 | 2.4 | 4.0 |
| Voice Control | 3.6 | 4.4 | 3.3 | 4.1 |

6. Online Engine

The online version of Step-Audio can be accessed from the 跃问 (Yuewen) app, where some impressive examples can be found as well.

QR code

7. Examples

Clone audio

| role | prompt wav | clone wav |
|---|---|---|
| 于谦 (Yu Qian) | google drive / audio file | google drive / audio file |
| 李雪琴 (Li Xueqin) | google drive / audio file | google drive / audio file |

Speed control

| prompt | response |
|---|---|
| Human: 说一个绕口令 (Say a tongue twister)<br>Assistant: 吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮<br>Human: 哎,你能把这个绕口令说的再快一点吗?(Hey, can you say this tongue twister a bit faster?) | google drive / audio file |
| Human: 说一个绕口令 (Say a tongue twister)<br>Assistant: 吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮<br>Human: 哎,你能把这个绕口令说的再快一点吗?(Hey, can you say this tongue twister a bit faster?)<br>Assistant: 吃葡萄不吐葡萄皮,不吃葡萄倒吐葡萄皮<br>Human: 呃,你再用非常非常慢的速度说一遍的。(Uh, now say it again very, very slowly.) | google drive / audio file |

High EQ (Emotional control & Tone control)

| prompt | response |
|---|---|
| Human: 你这语气又不撒娇又不卖萌的,要不你撒个娇卖个萌吧。(Your tone is neither coy nor cute. How about acting coy and cute for me?) | google drive / audio file |
| Human: 怎么办?我感觉我的人生很失败。(What should I do? I feel like my life is a failure.) | google drive / audio file |
| Human: 小跃。你真的是。特别厉害。(Xiaoyue, you really are amazing.) | google drive / audio file |

Multilingual (e.g., Chinese, English, Japanese)

| prompt | response |
|---|---|
| Human: What did the speaker mean when they said, it's raining cats and dogs?<br>Assistant: When they say "It's raining cats and dogs," it just means it's raining really hard. The speaker isn't literally saying cats and dogs are falling from the sky! It's just a fun way to describe heavy rain. | google drive / audio file |
| Human: こんにちは。(Hello.)<br>Assistant: こんにちは!何か手伝いましょうか?(Hello! Is there anything I can help you with?) | google drive / audio file |

Rap & Vocal

| prompt | response |
|---|---|
| Human: 唱一段rap (Rap a verse) | google drive / audio file |
| Human: 唱一段中文的歌曲 (Sing a Chinese song) | google drive / audio file |
| Human: 唱一段日语的歌曲 (Sing a Japanese song) | google drive / audio file |

8. Acknowledgements

Part of the code for this project comes from:

Thank you to all the open-source projects for their contributions to this project!

9. License Agreement

10. Citation

@misc{huang2025stepaudiounifiedunderstandinggeneration,
      title={Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction}, 
      author={Ailin Huang and Boyong Wu and Bruce Wang and Chao Yan and Chen Hu and Chengli Feng and Fei Tian and Feiyu Shen and Jingbei Li and Mingrui Chen and Peng Liu and Ruihang Miao and Wang You and Xi Chen and Xuerui Yang and Yechang Huang and Yuxiang Zhang and Zheng Gong and Zixin Zhang and Brian Li and Changyi Wan and Hanpeng Hu and Ranchen Ming and Song Yuan and Xuelin Zhang and Yu Zhou and Bingxin Li and Buyun Ma and Kang An and Wei Ji and Wen Li and Xuan Wen and Yuankai Ma and Yuanwei Liang and Yun Mou and Bahtiyar Ahmidi and Bin Wang and Bo Li and Changxin Miao and Chen Xu and Chengting Feng and Chenrun Wang and Dapeng Shi and Deshan Sun and Dingyuan Hu and Dula Sai and Enle Liu and Guanzhe Huang and Gulin Yan and Heng Wang and Haonan Jia and Haoyang Zhang and Jiahao Gong and Jianchang Wu and Jiahong Liu and Jianjian Sun and Jiangjie Zhen and Jie Feng and Jie Wu and Jiaoren Wu and Jie Yang and Jinguo Wang and Jingyang Zhang and Junzhe Lin and Kaixiang Li and Lei Xia and Li Zhou and Longlong Gu and Mei Chen and Menglin Wu and Ming Li and Mingxiao Li and Mingyao Liang and Na Wang and Nie Hao and Qiling Wu and Qinyuan Tan and Shaoliang Pang and Shiliang Yang and Shuli Gao and Siqi Liu and Sitong Liu and Tiancheng Cao and Tianyu Wang and Wenjin Deng and Wenqing He and Wen Sun and Xin Han and Xiaomin Deng and Xiaojia Liu and Xu Zhao and Yanan Wei and Yanbo Yu and Yang Cao and Yangguang Li and Yangzhen Ma and Yanming Xu and Yaqiang Shi and Yilei Wang and Yinmin Zhong and Yu Luo and Yuanwei Lu and Yuhe Yin and Yuting Yan and Yuxiang Yang and Zhe Xie and Zheng Ge and Zheng Sun and Zhewei Huang and Zhichao Chang and Zidong Yang and Zili Zhang and Binxing Jiao and Daxin Jiang and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Shuchang Zhou and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu},
      year={2025},
      eprint={2502.11946},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11946}, 
}

Star History

Star History Chart
