Qwen2-Audio-train: A Python repository from chenpaopao

We introduce Qwen2-Audio, the latest large-scale audio-language model in the Qwen-Audio series. It accepts various audio signal inputs and either performs audio analysis or responds directly in text to spoken instructions. It supports two distinct audio interaction modes:

voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
audio analysis: users can provide audio and text instructions for analysis during the interaction.

We've released two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.

Architecture

An overview of the three-stage training process of Qwen2-Audio.

News and Updates

2024.8.9 🎉 We released the checkpoints of both Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct on ModelScope and Hugging Face.
2024.7.15 🎉 We released the paper of Qwen2-Audio, introducing the relevant model structure, training methods, and model performance. Check our report for details!
2023.11.30 🔥 We released the Qwen-Audio series.

Evaluation

We evaluated Qwen2-Audio's abilities on 13 standard benchmarks, as follows:

Task	Description	Dataset	Split	Metric
ASR	Automatic Speech Recognition	Fleurs	dev \| test	WER
		Aishell2	test
		Librispeech	dev \| test
		Common Voice	dev \| test
S2TT	Speech-to-Text Translation	CoVoST2	test	BLEU
SER	Speech Emotion Recognition	Meld	test	ACC
VSC	Vocal Sound Classification	VocalSound	test	ACC
AIR-Bench	Chat-Benchmark-Speech	Fisher SpokenWOZ IEMOCAP Common voice	dev \| test	GPT-4 Eval
	Chat-Benchmark-Sound	Clotho	dev \| test	GPT-4 Eval
	Chat-Benchmark-Music	MusicCaps	dev \| test	GPT-4 Eval
	Chat-Benchmark-Mixed-Audio	Common voice AudioCaps MusicCaps	dev \| test	GPT-4 Eval
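
The ASR rows above are scored with WER (word error rate). As a rough illustration of what that metric measures, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` is ours, and the sketch assumes plain whitespace tokenization; the scripts under eval_audio may normalize or tokenize text differently, so do not expect it to reproduce the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The function returns a fraction; the tables in this README report WER as a percentage.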

The overall performance is as follows:

The details of evaluation are as follows:
(Note: The evaluation results in the paper are based on the initial model in the original training framework. The scores fluctuated slightly after the model was converted to the Hugging Face framework. We therefore present both sets of results, starting with the initial model results from the paper.)

Task	Dataset	Model	Metrics	Results
ASR	Librispeech dev-clean \| dev-other \| test-clean \| test-other	SpeechT5	WER	2.1 \| 5.5 \| 2.4 \| 5.8
		SpeechNet		- \| - \| 30.7 \| -
		SLM-FT		- \| - \| 2.6 \| 5.0
		SALMONN		- \| - \| 2.1 \| 4.9
		SpeechVerse		- \| - \| 2.1 \| 4.4
		Qwen-Audio		1.8 \| 4.0 \| 2.0 \| 4.2
		Qwen2-Audio		1.3 \| 3.4 \| 1.6 \| 3.6
	Common Voice 15 en \| zh \| yue \| fr	Whisper-large-v3	WER	9.3 \| 12.8 \| 10.9 \| 10.8
	Common Voice 15 en \| zh \| yue \| fr	Qwen2-Audio	WER	8.6 \| 6.9 \| 5.9 \| 9.6
	Fleurs zh	Whisper-large-v3	WER	7.7
	Fleurs zh	Qwen2-Audio	WER	7.5
	Aishell2 Mic \| iOS \| Android	MMSpeech-base	WER	4.5 \| 3.9 \| 4.0
		Paraformer-large		- \| 2.9 \| -
		Qwen-Audio		3.3 \| 3.1 \| 3.3
		Qwen2-Audio		3.0 \| 3.0 \| 2.9
S2TT	CoVoST2 en-de \| de-en \| en-zh \| zh-en	SALMONN	BLEU	18.6 \| - \| 33.1 \| -
		SpeechLLaMA		- \| 27.1 \| - \| 12.3
		BLSP		14.1 \| - \| - \| -
		Qwen-Audio		25.1 \| 33.9 \| 41.5 \| 15.7
		Qwen2-Audio		29.9 \| 35.2 \| 45.2 \| 24.4
	CoVoST2 es-en \| fr-en \| it-en \|	SpeechLLaMA	BLEU	27.9 \| 25.2 \| 25.9
		Qwen-Audio		39.7 \| 38.5 \| 36.0
		Qwen2-Audio		40.0 \| 38.5 \| 36.3
SER	Meld	WavLM-large	ACC	0.542
		Qwen-Audio		0.557
		Qwen2-Audio		0.553
VSC	VocalSound	CLAP	ACC	0.4945
		Pengi		0.6035
		Qwen-Audio		0.9289
		Qwen2-Audio		0.9392
AIR-Bench	Chat Benchmark Speech \| Sound \| Music \| Mixed-Audio	SALMONN	GPT-4	6.16 \| 6.28 \| 5.95 \| 6.08
		BLSP		6.17 \| 5.55 \| 5.08 \| 5.33
		Pandagpt		3.58 \| 5.46 \| 5.06 \| 4.25
		Macaw-LLM		0.97 \| 1.01 \| 0.91 \| 1.01
		SpeechGPT		1.57 \| 0.95 \| 0.95 \| 4.13
		Next-gpt		3.86 \| 4.76 \| 4.18 \| 4.13
		Qwen-Audio		6.47 \| 6.95 \| 5.52 \| 6.08
		Gemini-1.5-pro		6.97 \| 5.49 \| 5.06 \| 5.27
		Qwen2-Audio		7.18 \| 6.99 \| 6.79 \| 6.77

(The second set of results was obtained after converting the model to Hugging Face.)

Task	Dataset	Model	Metrics	Results
ASR	Librispeech dev-clean \| dev-other \| test-clean \| test-other	SpeechT5	WER	2.1 \| 5.5 \| 2.4 \| 5.8
		SpeechNet		- \| - \| 30.7 \| -
		SLM-FT		- \| - \| 2.6 \| 5.0
		SALMONN		- \| - \| 2.1 \| 4.9
		SpeechVerse		- \| - \| 2.1 \| 4.4
		Qwen-Audio		1.8 \| 4.0 \| 2.0 \| 4.2
		Qwen2-Audio		1.7 \| 3.6 \| 1.7 \| 4.0
	Common Voice 15 en \| zh \| yue \| fr	Whisper-large-v3	WER	9.3 \| 12.8 \| 10.9 \| 10.8
	Common Voice 15 en \| zh \| yue \| fr	Qwen2-Audio	WER	8.7 \| 6.5 \| 5.9 \| 9.6
	Fleurs zh	Whisper-large-v3	WER	7.7
	Fleurs zh	Qwen2-Audio	WER	7.0
	Aishell2 Mic \| iOS \| Android	MMSpeech-base	WER	4.5 \| 3.9 \| 4.0
		Paraformer-large		- \| 2.9 \| -
		Qwen-Audio		3.3 \| 3.1 \| 3.3
		Qwen2-Audio		3.2 \| 3.1 \| 2.9
S2TT	CoVoST2 en-de \| de-en \| en-zh \| zh-en	SALMONN	BLEU	18.6 \| - \| 33.1 \| -
		SpeechLLaMA		- \| 27.1 \| - \| 12.3
		BLSP		14.1 \| - \| - \| -
		Qwen-Audio		25.1 \| 33.9 \| 41.5 \| 15.7
		Qwen2-Audio		29.6 \| 33.6 \| 45.6 \| 24.0
	CoVoST2 es-en \| fr-en \| it-en \|	SpeechLLaMA	BLEU	27.9 \| 25.2 \| 25.9
		Qwen-Audio		39.7 \| 38.5 \| 36.0
		Qwen2-Audio		38.7 \| 37.2 \| 35.2
SER	Meld	WavLM-large	ACC	0.542
		Qwen-Audio		0.557
		Qwen2-Audio		0.535
VSC	VocalSound	CLAP	ACC	0.4945
		Pengi		0.6035
		Qwen-Audio		0.9289
		Qwen2-Audio		0.9395
AIR-Bench	Chat Benchmark Speech \| Sound \| Music \| Mixed-Audio	SALMONN	GPT-4	6.16 \| 6.28 \| 5.95 \| 6.08
		BLSP		6.17 \| 5.55 \| 5.08 \| 5.33
		Pandagpt		3.58 \| 5.46 \| 5.06 \| 4.25
		Macaw-LLM		0.97 \| 1.01 \| 0.91 \| 1.01
		SpeechGPT		1.57 \| 0.95 \| 0.95 \| 4.13
		Next-gpt		3.86 \| 4.76 \| 4.18 \| 4.13
		Qwen-Audio		6.47 \| 6.95 \| 5.52 \| 6.08
		Gemini-1.5-pro		6.97 \| 5.49 \| 5.06 \| 5.27
		Qwen2-Audio		7.24 \| 6.83 \| 6.73 \| 6.42

We have provided all evaluation scripts to reproduce our results. Please refer to eval_audio/EVALUATION.md for details.

Requirements

The Qwen2-Audio code is available in the latest Hugging Face Transformers. We advise you to build from source with the command pip install git+https://github.com/huggingface/transformers; otherwise you might encounter the following error:

KeyError: 'qwen2-audio'
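
To check the environment before loading any checkpoint, a small guard like the following can help. This is only a sketch: the helper name `qwen2_audio_available` is ours, and it merely tests whether the installed transformers package exports the `Qwen2AudioForConditionalGeneration` class used in the examples in this README. It does not verify that a backend such as PyTorch is usable.

```python
def qwen2_audio_available() -> bool:
    """Return True if the installed transformers exports the Qwen2-Audio model class."""
    try:
        # Releases that predate Qwen2-Audio raise ImportError on this line.
        from transformers import Qwen2AudioForConditionalGeneration  # noqa: F401
    except (ImportError, RuntimeError):
        # ImportError: transformers is missing or too old.
        # RuntimeError: transformers failed while importing one of its submodules.
        return False
    return True


if not qwen2_audio_available():
    print("Qwen2-Audio not found; try: pip install git+https://github.com/huggingface/transformers")
```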

Quickstart

Below, we provide simple examples to show how to use Qwen2-Audio and Qwen2-Audio-Instruct with 🤗 Transformers. Before running the code, make sure you have set up the environment, met the above requirements, and installed the dependent libraries. You can then start with ModelScope or Transformers. Qwen2-Audio models currently perform best with audio clips under 30 seconds.
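
Because the models work best on clips under 30 seconds, a longer recording may need to be cut into pieces before inference. The sketch below shows one simple way to do that on a decoded waveform. `split_into_chunks` is a hypothetical helper, not part of this repository, and the 30-second default is taken from the note above. Fixed-length cuts can split a word in half, so a silence-aware splitter may work better in practice.

```python
from typing import List, Sequence


def split_into_chunks(samples: Sequence, sampling_rate: int,
                      max_seconds: float = 30.0) -> List[Sequence]:
    """Split a 1-D waveform into consecutive chunks of at most max_seconds each."""
    if sampling_rate <= 0 or max_seconds <= 0:
        raise ValueError("sampling_rate and max_seconds must be positive")
    chunk_len = int(sampling_rate * max_seconds)
    if chunk_len < 1:
        raise ValueError("max_seconds is too short for this sampling rate")
    # Slicing works for both Python lists and NumPy arrays (as returned by librosa.load).
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]


# Toy input: 70 "samples" at 1 Hz, so each chunk covers at most 30 samples.
chunks = split_into_chunks(list(range(70)), sampling_rate=1)
print([len(chunk) for chunk in chunks])
```

With a waveform loaded by librosa, the call would look like split_into_chunks(audio, processor.feature_extractor.sampling_rate), and each chunk can then be passed to the processor as a separate audio.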

🤗 Transformers

In the following, we demonstrate how to use Qwen2-Audio-7B-Instruct for inference in both voice chat and audio analysis modes. Note that we use the ChatML format for dialog; this demo shows how to leverage apply_chat_template for this purpose.
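
To give a feel for what ChatML text looks like, here is a simplified, hand-written renderer. It is an approximation for illustration only, not the template shipped with the processor: the function name `render_chatml` is ours, and the audio placeholder string is an assumption borrowed from the base-model prompt shown later in this README. The real template may add further text (for example a default system prompt or audio numbering), so always use apply_chat_template for actual inference.

```python
# Assumption: placeholder copied from the base-model prompt in this README.
AUDIO_PLACEHOLDER = "<|audio_bos|><|AUDIO|><|audio_eos|>"


def render_chatml(conversation, add_generation_prompt=True):
    """Render a conversation as ChatML-style text (simplified approximation)."""
    parts = []
    for message in conversation:
        content = message["content"]
        if isinstance(content, str):
            body = content
        else:
            pieces = []
            for ele in content:
                if ele["type"] == "audio":
                    # The audio itself is passed separately; the text only marks its position.
                    pieces.append(AUDIO_PLACEHOLDER + "\n")
                elif ele["type"] == "text":
                    pieces.append(ele["text"])
            body = "".join(pieces)
        parts.append(f"<|im_start|>{message['role']}\n{body}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "example.wav"},  # hypothetical file name
        {"type": "text", "text": "What's that sound?"},
    ]},
]))
```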

Voice Chat Inference

In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()), 
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every input tensor (not only input_ids) to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

Audio Analysis Inference

In audio analysis mode, users can provide both audio and text instructions for analysis:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every input tensor (not only input_ids) to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
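
The examples slice generate_ids with inputs.input_ids.size(1) before decoding because model.generate returns each prompt followed by the newly generated tokens. The toy sketch below mimics that step with plain Python lists instead of tensors. The helper name `strip_prompt` and the token ids are made up for illustration.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first prompt_length token ids from every generated sequence."""
    return [seq[prompt_length:] for seq in sequences]


# Made-up ids: a 3-token prompt followed by 2 newly generated tokens.
generated = [[101, 7, 8, 42, 43]]
print(strip_prompt(generated, prompt_length=3))
```

Decoding only the remaining ids keeps the prompt text out of the response string.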

Batch Inference

We also support batch inference:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()), 
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every input tensor (not only input_ids) to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
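
The audio-loading loops in the examples above all follow the same pattern: walk the messages in order, keep only the audio entries, and (for batches) concatenate the results conversation by conversation into one flat list. Our understanding is that this order is what lets the audios line up with the audio positions in the templated text. The sketch below isolates that traversal so it can be checked without downloading anything. `collect_audio_urls` is a hypothetical helper name and the file names are placeholders.

```python
def collect_audio_urls(conversation):
    """Return the audio URLs of one conversation, in the order they appear."""
    urls = []
    for message in conversation:
        content = message["content"]
        # Plain-string contents (e.g. assistant replies) carry no audio.
        if isinstance(content, list):
            for ele in content:
                if ele["type"] == "audio":
                    urls.append(ele["audio_url"])
    return urls


conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "a.mp3"},  # placeholder file names
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [{"type": "audio", "audio_url": "b.wav"}]},
]
conversation2 = [
    {"role": "user", "content": [{"type": "audio", "audio_url": "c.flac"}]},
]

# Batch case: flatten across conversations, preserving order.
all_urls = [url for conv in (conversation1, conversation2)
            for url in collect_audio_urls(conv)]
print(all_urls)
```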

Running the pretrained Qwen2-Audio base model is also simple:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)

prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=audio, return_tensors="pt")

generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

Finetuning

We would like to thank the Hugging Face open-source community for their contributions, which have made it easy for us to implement model fine-tuning with Accelerate and DeepSpeed. We support both LoRA (Low-Rank Adaptation) and full-parameter fine-tuning, with the code provided by Xiaoming Liu.

cd finetune && bash run.sh

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you solve issues concerning downloading checkpoints.
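
A minimal sketch of how snapshot_download is typically used is given below. The wrapper name `download_checkpoint` is ours, and the default model id is an assumption about how the checkpoint is named on ModelScope; please check the model page for the exact id. The modelscope package must be installed for the call to work.

```python
def download_checkpoint(model_id: str = "qwen/Qwen2-Audio-7B-Instruct") -> str:
    """Download a checkpoint from ModelScope and return the local directory."""
    # Imported lazily so that defining this helper does not require modelscope.
    from modelscope import snapshot_download

    # Assumption: model_id matches the repository name on ModelScope.
    return snapshot_download(model_id)
```

The returned directory can then be passed to from_pretrained in place of the Hugging Face model name, for example AutoProcessor.from_pretrained(model_dir).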

Demo

Web UI

We provide code for users to build a web UI demo. Before you start, make sure you install the following packages:

pip install -r requirements_web_demo.txt

Then run the command below and click on the generated link:

python demo/web_demo_audio.py

demos

More impressive cases will be posted on Qwen's blog.

We Are Hiring

If you are interested in joining us as a full-time employee or intern, please contact us at qwen_audio@list.alibaba-inc.com.

License Agreement

Check the license of each model inside its HF repo. It is NOT necessary for you to submit a request for commercial usage.

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}

@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}

Contact Us

If you would like to leave a message for either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.
