中文 | English
Blog | Paper | WeChat | Discord
We introduce Qwen2-Audio, the latest progress in the Qwen-Audio series of large-scale audio-language models. Qwen2-Audio accepts various audio signal inputs and can perform audio analysis or respond directly in text to spoken instructions. We introduce two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
- audio analysis: users can provide audio and text instructions for analysis during the interaction.
We are going to release two models of the Qwen2-Audio series soon: Qwen2-Audio and Qwen2-Audio-Chat.
An overview of the three-stage training process of Qwen2-Audio.
- 2024.7.15 🎉 We released the paper of Qwen2-Audio, introducing the model structure, training methods, and model performance. Check our report for details!
- 2023.11.30 🔥 We released the Qwen-Audio series.
We evaluated Qwen2-Audio's capabilities on 13 standard benchmarks, as follows:
| Task | Description | Dataset | Split | Metric |
|---|---|---|---|---|
| ASR | Automatic Speech Recognition | Fleurs | dev / test | WER |
| | | Aishell2 | test | |
| | | Librispeech | dev / test | |
| | | Common Voice | dev / test | |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common Voice | dev / test | GPT-4 Eval |
| | Chat-Benchmark-Sound | Clotho | dev / test | GPT-4 Eval |
| | Chat-Benchmark-Music | MusicCaps | dev / test | GPT-4 Eval |
| | Chat-Benchmark-Mixed-Audio | Common Voice, AudioCaps, MusicCaps | dev / test | GPT-4 Eval |
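The ASR results in this README are reported as Word Error Rate (WER). As a reference, here is a minimal sketch of how WER is conventionally computed — word-level edit distance divided by the reference length. This is an illustration only; the reported numbers come from the standard evaluation scripts of each benchmark:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution in a three-word reference gives a WER of 1/3 (reported as 33.3 in percentage form).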
The overall performance is shown below:
The details of evaluation are as follows:
(Note: the evaluation results we present are based on the initial model trained in the original framework; the scores declined after converting the model to the Huggingface framework. We therefore present both sets of results, starting with the initial results from the paper.)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
| | | SpeechNet | | - / - / 30.7 / - |
| | | SLM-FT | | - / - / 2.6 / 5.0 |
| | | SALMONN | | - / - / 2.1 / 4.9 |
| | | SpeechVerse | | - / - / 2.1 / 4.4 |
| | | Qwen-Audio | | 1.8 / 4.0 / 2.0 / 4.2 |
| | | Qwen2-Audio | | 1.3 / 3.4 / 1.6 / 3.6 |
| | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
| | | Qwen2-Audio | | 8.6 / 6.9 / 5.9 / 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.5 |
| | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
| | | Paraformer-large | | - / 2.9 / - |
| | | Qwen-Audio | | 3.3 / 3.1 / 3.3 |
| | | Qwen2-Audio | | 3.0 / 3.0 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
| | | SpeechLLaMA | | - / 27.1 / - / 12.3 |
| | | BLSP | | 14.1 / - / - / - |
| | | Qwen-Audio | | 25.1 / 33.9 / 41.5 / 15.7 |
| | | Qwen2-Audio | | 29.9 / 35.2 / 45.2 / 24.4 |
| | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
| | | Qwen-Audio | | 39.7 / 38.5 / 36.0 |
| | | Qwen2-Audio | | 40.0 / 38.5 / 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 Eval | 6.16 / 6.28 / 5.95 / 6.08 |
| | | BLSP | | 6.17 / 5.55 / 5.08 / 5.33 |
| | | Pandagpt | | 3.58 / 5.46 / 5.06 / 4.25 |
| | | Macaw-LLM | | 0.97 / 1.01 / 0.91 / 1.01 |
| | | SpeechGPT | | 1.57 / 0.95 / 0.95 / 4.13 |
| | | Next-gpt | | 3.86 / 4.76 / 4.18 / 4.13 |
| | | Qwen-Audio | | 6.47 / 6.95 / 5.52 / 6.08 |
| | | Gemini-1.5-pro | | 6.97 / 5.49 / 5.06 / 5.27 |
| | | Qwen2-Audio | | 7.18 / 6.99 / 6.79 / 6.77 |
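The S2TT results are reported in BLEU. Below is a simplified sentence-level sketch of the metric: clipped n-gram precisions combined by a geometric mean, times a brevity penalty. The reported scores come from standard corpus-level BLEU tooling (e.g. sacreBLEU), so this is illustrative only:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU-4 with brevity penalty (no smoothing)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        if overlap == 0:
            return 0.0  # without smoothing, one empty precision zeroes BLEU
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0 (i.e. BLEU 100); published scores such as those above are scaled to the 0-100 range and computed over whole test corpora rather than single sentences.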
The results below are after converting the model to the Huggingface framework:
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
| | | SpeechNet | | - / - / 30.7 / - |
| | | SLM-FT | | - / - / 2.6 / 5.0 |
| | | SALMONN | | - / - / 2.1 / 4.9 |
| | | SpeechVerse | | - / - / 2.1 / 4.4 |
| | | Qwen-Audio | | 1.8 / 4.0 / 2.0 / 4.2 |
| | | Qwen2-Audio | | 1.7 / 3.6 / 1.7 / 4.0 |
| | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
| | | Qwen2-Audio | | 8.7 / 6.5 / 5.9 / 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.0 |
| | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
| | | Paraformer-large | | - / 2.9 / - |
| | | Qwen-Audio | | 3.3 / 3.1 / 3.3 |
| | | Qwen2-Audio | | 3.2 / 3.1 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
| | | SpeechLLaMA | | - / 27.1 / - / 12.3 |
| | | BLSP | | 14.1 / - / - / - |
| | | Qwen-Audio | | 25.1 / 33.9 / 41.5 / 15.7 |
| | | Qwen2-Audio | | 29.6 / 33.6 / 45.6 / 24.0 |
| | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
| | | Qwen-Audio | | 39.7 / 38.5 / 36.0 |
| | | Qwen2-Audio | | 38.7 / 37.2 / 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 Eval | 6.16 / 6.28 / 5.95 / 6.08 |
| | | BLSP | | 6.17 / 5.55 / 5.08 / 5.33 |
| | | Pandagpt | | 3.58 / 5.46 / 5.06 / 4.25 |
| | | Macaw-LLM | | 0.97 / 1.01 / 0.91 / 1.01 |
| | | SpeechGPT | | 1.57 / 0.95 / 0.95 / 4.13 |
| | | Next-gpt | | 3.86 / 4.76 / 4.18 / 4.13 |
| | | Qwen-Audio | | 6.47 / 6.95 / 5.52 / 6.08 |
| | | Gemini-1.5-pro | | 6.97 / 5.49 / 5.06 / 5.27 |
| | | Qwen2-Audio | | 7.24 / 6.83 / 6.73 / 6.42 |
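To make the framework-conversion decline noted above concrete, the snippet below computes the per-split WER change for Qwen2-Audio on Librispeech, using the numbers copied directly from the two tables:

```python
# Qwen2-Audio Librispeech WER: dev-clean, dev-other, test-clean, test-other.
paper = [1.3, 3.4, 1.6, 3.6]        # initial model, original training framework
huggingface = [1.7, 3.6, 1.7, 4.0]  # after conversion to the Huggingface framework

# Absolute WER increase per split (positive = worse after conversion).
deltas = [round(hf - p, 2) for p, hf in zip(paper, huggingface)]
```

Each split degrades by 0.1 to 0.4 WER points, which is the decline the note above refers to; other tasks show similarly small shifts.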
We will provide all evaluation scripts to reproduce our results, and we will release the models on Huggingface and ModelScope, along with a Web UI, soon. We are working hard on the process; please allow a few days.
More impressive cases will be posted on Qwen's blog.
If you are interested in joining us as a full-time employee or intern, please contact us at qwen_audio@list.alibaba-inc.com.
Check the license of each model inside its HF repo. It is NOT necessary for you to submit a request for commercial usage.
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)
```BibTeX
@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}

@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}
```
If you are interested in leaving a message for either our research team or our product team, feel free to send an email to qianwen_opensource@alibabacloud.com.