Best practice for Qwen2-Audio
Jintao-Huang opened this issue · 3 comments
Environment Preparation
# Install ms-swift
pip install git+https://github.com/modelscope/swift.git#egg=ms-swift[llm]
# Install the latest transformers
pip install git+https://github.com/huggingface/transformers.git
pip install librosa
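To quickly check that the installed transformers build already ships Qwen2-Audio support, here is a minimal sketch; it only verifies that the imports resolve, and the class names are those expected in recent transformers builds:

import transformers
import librosa

# Qwen2-Audio classes are only available in recent transformers builds.
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

print(f'transformers: {transformers.__version__}')
print(f'librosa: {librosa.__version__}')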
Inference
Instruct model:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b-instruct
# If using a local path
CUDA_VISIBLE_DEVICES=0 swift infer \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path '<local_path>'
Inference result:
<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav
Yes, I can guess that you are a female in your twenties.
--------------------------------------------------
<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav
每个人都希望被欣赏,所以如果你欣赏某人,不要把它保密。
--------------------------------------------------
<<< clear
<<< 你是谁
我是来自达摩院的语言模型,我叫通义千问。
Using Python:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = None
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
query = '<audio>这段语音说了什么'
audios = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav']
response, history = inference(model, template, query, audios=audios)
print(f'query: {query}')
print(f'response: {response}')
# streaming inference
query = '这段语音是男生还是女生'
gen = inference_stream(model, template, query, history, audios=audios)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: <audio>这段语音说了什么
response: 这段语音说的是:'今天天气真好呀'
query: 这段语音是男生还是女生
response: 男声。
history: [['<audio>这段语音说了什么', "这段语音说的是:'今天天气真好呀'"], ['这段语音是男生还是女生', '男声。']]
"""
Base Model:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b
Inference result:
<<< <audio>Generate the caption in English:
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3
Glass is breaking.
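The base model can also be called from Python. A minimal sketch, assuming ModelType.qwen2_audio_7b is the constant matching the CLI's qwen2-audio-7b model_type:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from swift.llm import (get_model_tokenizer, get_template, inference, ModelType,
                       get_default_template_type)

# Assumption: ModelType.qwen2_audio_7b mirrors the CLI's qwen2-audio-7b.
model_type = ModelType.qwen2_audio_7b
template_type = get_default_template_type(model_type)
model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)

# The base model is used with continuation-style prompts, as in the CLI example above.
query = '<audio>Generate the caption in English:'
audios = ['https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3']
response, _ = inference(model, template, query, audios=audios)
print(f'response: {response}')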
Fine-tuning
Fine-tuning a multimodal large model is typically done with a custom dataset. Here we show a demo that can be run directly: we fine-tune on the aishell1-zh-mini dataset, which is available on ModelScope at https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets
Using Python:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import sft_main, SftArguments, ModelType, DatasetName
sft_main(SftArguments(model_type=ModelType.qwen2_audio_7b_instruct,
                      model_id_or_path=None,
                      dataset=[DatasetName.aishell1_zh_mini]))
ZeRO2:
# If using a local path, add: `--model_id_or_path <local_path>`
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen2-audio-7b-instruct \
--dataset aishell1-zh-mini \
--deepspeed default-zero2
If you want to use a custom dataset, simply specify it as follows:
# val_dataset is optional; if not specified, a portion of the training dataset is split off as the validation set
--dataset train.jsonl \
--val_dataset val.jsonl \
Custom datasets support JSON and JSONL formats. Two example formats are shown below:
[
    {"conversations": [
        {"from": "user", "value": "<audio>audio_path</audio>11111"},
        {"from": "assistant", "value": "22222"}
    ]},
    {"conversations": [
        {"from": "user", "value": "<audio>audio_path</audio><audio>audio_path2</audio><audio>audio_path3</audio>aaaaa"},
        {"from": "assistant", "value": "bbbbb"},
        {"from": "user", "value": "<audio>audio_path</audio>ccccc"},
        {"from": "assistant", "value": "ddddd"}
    ]},
    {"conversations": [
        {"from": "user", "value": "AAAAA"},
        {"from": "assistant", "value": "BBBBB"},
        {"from": "user", "value": "CCCCC"},
        {"from": "assistant", "value": "DDDDD"}
    ]}
]
{"query": "<audio>55555", "response": "66666", "audios": ["audio_path"]}
{"query": "<audio><audio>eeeee", "response": "fffff", "history": [], "audios": ["audio_path1", "audio_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
Inference script after fine-tuning:
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true
# merge-lora and inference
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true --merge_lora true
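The same inference can be launched from Python. A minimal sketch, assuming InferArguments mirrors the CLI flags; the checkpoint path is a placeholder:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import infer_main, InferArguments

# Replace the placeholder checkpoint path with your actual output directory.
infer_main(InferArguments(
    ckpt_dir='output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx',
    load_dataset_config=True,
    merge_lora=True))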
Example of the fine-tuned model performing inference on the validation set. Due to time constraints, only 400 steps were run.
The training log does not report an acc value. Is this due to my settings?
export WANDB_API_KEY=""
swift sft \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path "" \
--sft_type full \
--freeze_parameters 0.999 \
--template_type AUTO \
--dtype AUTO \
--output_dir output \
--custom_train_dataset_path "" \
--val_dataset '' \
--val_dataset_sample -1 \
--train_dataset_sample -1 \
--num_train_epochs 1 \
--max_length 2048 \
--check_dataset_strategy warning \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--gradient_accumulation_steps 32 \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--lazy_tokenize true \
--evaluation_strategy 'no' \
--system '' \
--save_strategy "steps" \
--report_to 'wandb' \
--acc_strategy 'token' \
--acc_steps 10
Hi @Jintao-Huang ,
I'd be interested in further fine-tuning it to improve its German language performance. Are there any plans to include this architecture in mergekit? My thoughts were to either:
- Merge e.g. VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct into it (assuming that this merge won't damage the audio layers)
- Fine-tune it on a German dataset (most likely synthetic)
Any hints on how to proceed?
Best
Julian