Best practice for Qwen2-Audio
Jintao-Huang opened this issue · 37 comments
环境准备 (Environment Preparation)
# 安装ms-swift (Install ms-swift)
pip install git+https://github.com/modelscope/swift.git#egg=ms-swift[llm]
# 安装最新的transformers(Install the latest transformers.)
pip install git+https://github.com/huggingface/transformers.git
pip install librosa
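As an optional sanity check, the freshly installed transformers build should already expose the Qwen2-Audio classes; a minimal snippet just for verification:
import transformers
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

print(transformers.__version__)
print(Qwen2AudioForConditionalGeneration.__name__, AutoProcessor.__name__)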
推理(Inference)
instruct model:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b-instruct
# 如果是本地路径(If it is a local path.)
CUDA_VISIBLE_DEVICES=0 swift infer \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path '<local_path>'
推理效果:(Inference result:)
<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav
Yes, I can guess that you are a female in your twenties.
--------------------------------------------------
<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav
每个人都希望被欣赏,所以如果你欣赏某人,不要把它保密。
--------------------------------------------------
<<< clear
<<< 你是谁
我是来自达摩院的语言模型,我叫通义千问。
使用python调用:(Using Python)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = None
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
query = '<audio>这段语音说了什么'
audios = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav']
response, history = inference(model, template, query, audios=audios)
print(f'query: {query}')
print(f'response: {response}')
# 流式(streaming)
query = '这段语音是男生还是女生'
gen = inference_stream(model, template, query, history, audios=audios)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: <audio>这段语音说了什么
response: 这段语音说的是:'今天天气真好呀'
query: 这段语音是男生还是女生
response: 男声。
history: [['<audio>这段语音说了什么', "这段语音说的是:'今天天气真好呀'"], ['这段语音是男生还是女生', '男声。']]
"""
Base Model:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b
推理效果:(Inference result)
<<< <audio>Generate the caption in English:
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3
Glass is breaking.
微调(Fine-tuning)
通常,多模态大模型微调会使用自定义数据集进行微调。在这里,我们将展示可直接运行的demo。我们使用aishell1-zh-mini数据集进行微调,您可以在 modelscope 上找到该数据集:https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets
Typically, fine-tuning multimodal large models involves using custom datasets for the process. Here, we will demonstrate a runnable demo. We use the aishell1-zh-mini dataset for fine-tuning, which you can find on Modelscope at: https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets
使用python:(Using python)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import sft_main, SftArguments, ModelType, DatasetName
sft_main(SftArguments(model_type=ModelType.qwen2_audio_7b_instruct,
                      model_id_or_path=None,
                      dataset=[DatasetName.aishell1_zh_mini]))
ZeRO2:
# 如果是本地路径需要增加:`--model_id_or_path <local_path>` (If it is a local path, it needs to be added.)
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen2-audio-7b-instruct \
--dataset aishell1-zh-mini \
--deepspeed default-zero2
如果要使用自定义数据集,只需按以下方式进行指定:(If you want to use a custom dataset, simply specify it as follows:)
# val_dataset可选,如果不指定,则会从dataset中切出一部分数据集作为验证集 (val_dataset is optional; if not specified, part of the dataset will be split off as the validation set.)
--dataset train.jsonl \
--val_dataset val.jsonl \
自定义数据集支持json和jsonl样式。以下提供了两种自定义数据集格式:(Custom datasets support JSON and JSONL formats. Below are two formats for custom datasets:)
[
{"conversations": [
{"from": "user", "value": "<audio>audio_path</audio>11111"},
{"from": "assistant", "value": "22222"}
]},
{"conversations": [
{"from": "user", "value": "<audio>audio_path</audio><audio>audio_path2</audio><audio>audio_path3</audio>aaaaa"},
{"from": "assistant", "value": "bbbbb"},
{"from": "user", "value": "<audio>audio_path</audio>ccccc"},
{"from": "assistant", "value": "ddddd"}
]},
{"conversations": [
{"from": "user", "value": "AAAAA"},
{"from": "assistant", "value": "BBBBB"},
{"from": "user", "value": "CCCCC"},
{"from": "assistant", "value": "DDDDD"}
]}
]
{"query": "<audio>55555", "response": "66666", "audios": ["audio_path"]}
{"query": "<audio><audio>eeeee", "response": "fffff", "history": [], "audios": ["audio_path1", "audio_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
微调后推理脚本:(Fine-tuned inference script:)
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true
# merge-lora and inference
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true --merge_lora true
微调后模型对验证集进行推理的示例,时间原因,只跑了400个steps:(Example of the model performing inference on the validation set after fine-tuning. Due to time constraints, only 400 steps were run)
The logs during training do not report an acc value. Is this caused by my settings?
export WANDB_API_KEY=""
swift sft \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path "" \
--sft_type full \
--freeze_parameters 0.999 \
--template_type AUTO \
--dtype AUTO \
--output_dir output \
--custom_train_dataset_path "" \
--val_dataset '' \
--val_dataset_sample -1 \
--train_dataset_sample -1 \
--num_train_epochs 1 \
--max_length 2048 \
--check_dataset_strategy warning \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--gradient_accumulation_steps 32 \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--lazy_tokenize true \
--evaluation_strategy 'no' \
--system '' \
--save_strategy "steps" \
--report_to 'wandb' \
--acc_strategy 'token' \
--acc_steps 10
Hi @Jintao-Huang ,
I'd be interested in further fine-tuning it to improve its German-language performance. Are there any plans to include this architecture in mergekit? My thoughts were to either:
- Merge e.g. VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct into it (assuming that this merge won't damage the audio layers), or
- Fine-tune it on a German dataset (most likely synthetic)
Any hints on how to proceed?
Best
Julian
Which lora_target_modules can be chosen when fine-tuning Qwen2-Audio?
I checked, and peft_config.target_modules is empty.
A question: when I run LoRA SFT, after a few steps the loss becomes 0 and grad_norm becomes nan, and it stays that way from then on. I have tried different LoRA parameters and batch sizes, but it always ends up at 0 and nan; only the number of steps before it happens differs. Could anyone suggest where the problem might be?
{'loss': 2.09631252, 'grad_norm': 7.01568747, 'learning_rate': 3.4e-07, 'memory(GiB)': 63.49, 'train_speed(iter/s)': 0.027149, 'epoch': 0.0, 'global_step/max_steps': '1/5828', 'percentage': '0.02%', 'elapsed_time': '34s', 'remaining_time': '2d 7h 26m 48s'}
{'loss': 1.99507056, 'grad_norm': 6.31390953, 'learning_rate': 3.42e-06, 'memory(GiB)': 63.49, 'train_speed(iter/s)': 0.029137, 'epoch': 0.0, 'global_step/max_steps': '10/5828', 'percentage': '0.17%', 'elapsed_time': '5m 40s', 'remaining_time': '2d 7h 2m 59s'}
{'loss': 1.66510525, 'grad_norm': 4.81519556, 'learning_rate': 6.85e-06, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029251, 'epoch': 0.0, 'global_step/max_steps': '20/5828', 'percentage': '0.34%', 'elapsed_time': '11m 21s', 'remaining_time': '2d 6h 56m 50s'}
{'loss': 1.06762638, 'grad_norm': 3.60125303, 'learning_rate': 1.027e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029195, 'epoch': 0.01, 'global_step/max_steps': '30/5828', 'percentage': '0.51%', 'elapsed_time': '17m 4s', 'remaining_time': '2d 7h 1m 36s'}
{'loss': 0.48049116, 'grad_norm': 1.70112872, 'learning_rate': 1.37e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029208, 'epoch': 0.01, 'global_step/max_steps': '40/5828', 'percentage': '0.69%', 'elapsed_time': '22m 46s', 'remaining_time': '2d 6h 56m 30s'}
{'loss': 1.17152777, 'grad_norm': nan, 'learning_rate': 1.712e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029222, 'epoch': 0.01, 'global_step/max_steps': '50/5828', 'percentage': '0.86%', 'elapsed_time': '28m 28s', 'remaining_time': '2d 6h 50m 28s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.055e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029447, 'epoch': 0.01, 'global_step/max_steps': '60/5828', 'percentage': '1.03%', 'elapsed_time': '33m 54s', 'remaining_time': '2d 6h 20m 28s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.397e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.0296, 'epoch': 0.01, 'global_step/max_steps': '70/5828', 'percentage': '1.20%', 'elapsed_time': '39m 22s', 'remaining_time': '2d 5h 58m 35s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.74e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029741, 'epoch': 0.01, 'global_step/max_steps': '80/5828', 'percentage': '1.37%', 'elapsed_time': '44m 47s', 'remaining_time': '2d 5h 38m 5s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.082e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029822, 'epoch': 0.02, 'global_step/max_steps': '90/5828', 'percentage': '1.54%', 'elapsed_time': '50m 15s', 'remaining_time': '2d 5h 24m 5s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.425e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029877, 'epoch': 0.02, 'global_step/max_steps': '100/5828', 'percentage': '1.72%', 'elapsed_time': '55m 44s', 'remaining_time': '2d 5h 12m 53s'}
Data format:
{"conversations": [{"from": "user", "value": "<audio>xxxx.wav</audio>textabcd"}, {"from": "assistant", "value": "texthijk"}]}
Command-line arguments:
OMP_NUM_THREADS=4 NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=3,4 swift sft \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path ./Qwen2-Audio-7B-Instruct \
--tuner_backend peft \
--dataset ./total_audios_prompt_qwen2.jsonl \
--dataset_test_ratio 0.01 \
--dataloader_num_workers 1 \
--report_to "none" \
--max_length 1024 \
--save_steps 100 \
--eval_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--gradient_accumulation_steps 5 \
--output_dir output \
--save_total_limit 50 \
--lazy_tokenize true \
--preprocess_num_proc 1 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--sft_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--use_flash_attn false \
--dtype bf16 \
--warmup_ratio 0.05 \
--num_train_epochs 1
How to achieve batch inference based on swift framework? Is there any parameter like --batch-size to accelerate the swift infer script?
How can vllm or lmdeploy be used for acceleration?
When fine-tuning Qwen2-Audio, can LoRA be used to train only the audio-encoder part? How should this be configured? @Jintao-Huang
A question: when I run LoRA SFT, after a few steps the loss becomes 0 and grad_norm becomes nan, and it stays that way from then on. I have tried different LoRA parameters and batch sizes, but it always ends up at 0 and nan; only the number of steps before it happens differs. Could anyone suggest where the problem might be?
I ran into this problem too: one dataset fine-tuned smoothly, while another consistently produced NaN after a few steps. After a lot of debugging I found that corrupted data was being read. I suggest adding logging in trainer.py of the transformers package and, once NaN appears, carefully checking the data of the current step and the previous step.
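A minimal sketch of how such a check could be wired in with a transformers TrainerCallback, assuming the logged keys 'loss' and 'grad_norm' seen in the logs above (names here are illustrative):
import math
from transformers import TrainerCallback

class NanWatchCallback(TrainerCallback):
    # Flag training as soon as the logged loss collapses to 0 or grad_norm turns NaN,
    # so the current and previous batches can be inspected.
    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        loss = logs.get('loss')
        grad_norm = logs.get('grad_norm')
        bad_loss = loss is not None and loss == 0.0
        bad_grad = isinstance(grad_norm, float) and math.isnan(grad_norm)
        if bad_loss or bad_grad:
            print(f'[NanWatch] suspicious metrics at step {state.global_step}: {logs}')
            control.should_training_stop = True  # stop so the offending data can be checked

The callback can then be passed to a transformers Trainer via its callbacks argument; whether and how it can be injected into a given ms-swift version depends on that version's API.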
The inference result is incorrect. Why does the output not match the example?
query: 这段语音说了什么
response:
<|endoftext|>ORIZONTAL_RULE' is not defined
I ran into a problem: "You are attempting to use Flash Attention 2.0 without specifying a torch dtype."
It errors out when running. Which version of flash-attn does qwen2-audio require?
A question: when I run LoRA SFT, after a few steps the loss becomes 0 and grad_norm becomes nan, and it stays that way from then on. I have tried different LoRA parameters and batch sizes, but it always ends up at 0 and nan; only the number of steps before it happens differs. Could anyone suggest where the problem might be?
Training in fp32 does not produce nan. After investigation: the do_normalize argument of the whisper feature extractor's __call__ defaults to False, while the __call__ in processing_qwen2_audio.py invokes audio_inputs = self.feature_extractor(audios, sampling_rate=sampling_rate, return_attention_mask=True, padding="max_length", **kwargs) without passing do_normalize, i.e. the raw audio is never normalized. When raw audio values are too large or too small, they may exceed the range or precision of bf16/fp16 and cause the loss to become nan. I will submit a PR adding this optional do_normalize parameter to the processing code.
== Update: this was not actually the cause; please ignore.
mark
Which part does qwen2-audio fine-tuning train: the language_model part or the whole model? @Jintao-Huang
With LoRA, the default is the language_model; you can set --target_modules ALL to train all linear layers.
With full-parameter training, the default is all parameters; you can use --freeze_vit to freeze the encoder part.
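For illustration only (the flag names follow the reply above; exact argument names can differ between ms-swift versions, so check swift sft --help), the two setups might look like:
# LoRA on all linear layers instead of only the language_model defaults
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen2-audio-7b-instruct \
--sft_type lora \
--target_modules ALL \
--dataset aishell1-zh-mini

# Full-parameter training with the audio encoder frozen
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen2-audio-7b-instruct \
--sft_type full \
--freeze_vit true \
--dataset aishell1-zh-mini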
Hello, I fine-tuned Qwen2-Audio on my own dataset with your settings, but the outputs are strange and do not fully follow the instructions. Fine-tuning Qwen-Audio (v1) with the same script works without any problem. What do you think the cause might be?
Please share your shell script.
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.10/site-packages/swift/cli/sft.py", line 5, in <module>
[rank0]: sft_main()
[rank0]: File "/usr/local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]: result = llm_x(args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/swift/llm/sft.py", line 545, in llm_sft
[rank0]: return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)
[rank0]: File "/usr/local/lib/python3.10/site-packages/swift/llm/sft.py", line 495, in trainer_train
[rank0]: trainer.train(training_args.resume_from_checkpoint)
[rank0]: File "/usr/local/lib/python3.10/site-packages/swift/trainers/mixin.py", line 488, in train
[rank0]: res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2140, in train
[rank0]: return inner_training_loop(
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank0]: self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank0]: File "/usr/local/lib/python3.10/site-packages/swift/trainers/mixin.py", line 564, in _maybe_log_save_evaluate
[rank0]: super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3018, in _maybe_log_save_evaluate
[rank0]: self._save_checkpoint(model, trial)
[rank0]: File "/usr/local/lib/python3.10/site-packages/swift/trainers/mixin.py", line 386, in _save_checkpoint
[rank0]: result = super()._save_checkpoint(model, trial, metrics)
[rank0]: TypeError: Trainer._save_checkpoint() takes 3 positional arguments but 4 were given
Hello, running the following command produces the error above; could you take a look at how it should be fixed?
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen2-audio-7b-instruct \
--dataset aishell1-zh-mini \
--deepspeed default-zero2
Hello, I fine-tuned Qwen2-Audio on my own dataset with your settings, but the outputs are strange and do not fully follow the instructions. Fine-tuning Qwen-Audio (v1) with the same script works without any problem. What do you think the cause might be?
Please share your shell script.
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path Model_Files/Qwen2-Audio-7B-Instruct \
--tuner_backend peft \
--template_type AUTO \
--dtype AUTO \
--train_dataset_sample -1 \
--max_length 2048 \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--save_total_limit 2 \
--batch_size 4 \
--use_flash_attn false \
--lazy_tokenize true \
--dataset train_v1.jsonl \
--val_dataset test_v1.jsonl \
--output_dir /output \
>> ./sft-v1.log
The dataset has 65,000 samples for emotion description. At step 500 the loss is about 6.x, and at step 3,000 it is still about 6.x, oscillating around 6.5 and unable to converge further. With Qwen-Audio (v1), however, it converges to about 0.8.
A question: when I run LoRA SFT, after a few steps the loss becomes 0 and grad_norm becomes nan, and it stays that way from then on; any suggestions on where the problem might be?
I ran into this problem too: after a lot of debugging I found that corrupted data was being read. I suggest adding logging in trainer.py of the transformers package and, once NaN appears, carefully checking the data of the current step and the previous step.
I checked the data and found that some audio clips are too short. During whisper feature extraction the window size is 400 frames; if an audio clip is shorter than 400, the resulting mel feature length is 0, i.e. what actually enters the computation is a zero tensor, which causes strange problems.
Before training, validate the lengths of all audio and text and discard the ones that are too short; training then works normally.
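A minimal sketch of such a pre-filter for the query/response/audios JSONL format above (assuming soundfile is installed; the 400-sample threshold follows the explanation above, and the file names are illustrative):
import json
import soundfile as sf

MIN_SAMPLES = 400  # the whisper-style feature extractor uses a 400-frame analysis window

def is_valid(sample: dict) -> bool:
    # drop samples whose response text is empty
    if not sample.get('response', '').strip():
        return False
    # drop samples containing any audio clip shorter than the analysis window
    for audio_path in sample.get('audios', []):
        if sf.info(audio_path).frames < MIN_SAMPLES:
            return False
    return True

with open('train.jsonl') as src, open('train_filtered.jsonl', 'w') as dst:
    for line in src:
        if is_valid(json.loads(line)):
            dst.write(line)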
Hello, running the ZeRO-2 fine-tuning command above reports the Trainer._save_checkpoint TypeError shown; how should it be fixed?
Please use transformers<4.46 or upgrade ms-swift to 2.5.2.
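Concretely, that corresponds to one of the following installs (versions taken from the reply above; pick one):
# Option 1: pin transformers below 4.46
pip install 'transformers<4.46'
# Option 2: upgrade ms-swift
pip install 'ms-swift[llm]==2.5.2'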
The dataset has 65,000 samples for emotion description. At step 500 the loss is about 6.x, and at step 3,000 it is still about 6.x, oscillating around 6.5 and unable to converge further. With Qwen-Audio (v1), however, it converges to about 0.8.
See whether this is the cause:
I checked the data and found that some audio clips are too short. During whisper feature extraction the window size is 400 frames; if an audio clip is shorter than 400, the resulting mel feature length is 0, i.e. what actually enters the computation is a zero tensor, which causes strange problems. Validating the lengths of all audio and text before training and discarding the ones that are too short made training work normally.
OK, thanks for sharing.
Hello, a question: when I set sft_type to full with model_type=qwen2-audio-7b-instruct, the GPU memory usage on an A100 is only about 10 GB. For a 7B model that does not look right for full-parameter SFT; is some setting wrong? Thanks for any help.
P.S. I looked at auto_find_batch_size, but the bs in the generated sft_args.json is 1.
For multi-turn dialogue data, is only the last response used to compute the loss during training?
How is the aishell dataset in the example above constructed? Downloading the data directly from the ModelScope community and pairing it with the aishell wav files keeps producing errors, while the data downloaded to the cache folder by the swift script does not show the concrete JSON files it was built from. How should JSON files be constructed for data such as ASR or speech translation?
When fine-tuning Qwen2-Audio, can LoRA be used to train only the audio-encoder part? How should the command be configured? @HyacinthJingjing
I fine-tuned the audio_tower layers of qwen2-audio-7b-instruct with my own data (spoken English). Sample training data:
{"query": "what did this voice say", "response": "which is really quite long enough", "audios": ["/root/sent/k12/cz200/CRK100/CRK00206/CRK0096268218.wav"]}
{"query": "what did this voice say", "response": "daniel is my good friend he is always kind to others he is friendly and helpful", "audios": ["/root/sent/k12/md100_5000_snt/2dc2c32679bbc5c9d55ccadecc34bc80_1_0_6240.wav"]}
The fine-tuning command is as follows:
NPROC_PER_NODE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen2-audio-7b-instruct \
--model_cache_dir "/root/dev/qwen2_audio/Qwen2-Audio-7B-Instruct" \
--sft_type full \
--freeze_parameters 0.999 \
--additional_trainable_parameters audio_tower \
--dtype AUTO \
--template_type AUTO \
--output_dir "/root/dev/qwen2_audio/output" \
--dataset "/root/dev/qwen2_audio/train_en.jsonl" \
--dataset_test_ratio 0.01 \
--num_train_epochs 3 \
--max_length 1024 \
--check_dataset_strategy warning \
--gradient_checkpointing true \
--batch_size 6 \
--weight_decay 0.01 \
--learning_rate 1e-5 \
--gradient_accumulation_steps 32 \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 1000 \
--train_dataset_sample -1 \
--save_total_limit 10 \
--report_to tensorboard \
--logging_steps 10 \
--lazy_tokenize true
After fine-tuning, the WER on my own test set dropped from 13.23% to 9.37%. On top of the newly produced model I then applied LoRA fine-tuning with the same data; the command is as follows:
OMP_NUM_THREADS=4 NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen2-audio-7b-instruct \
--model_cache_dir "/root/dev/qwen2_audio/output/qwen2-audio-7b-instruct/v0-20241127-170841/checkpoint-7344" \
--output_dir "/root/dev/qwen2_audio/output_epochs_6_peft" \
--dataset "/root/dev/qwen2_audio/train_en.jsonl" \
--sft_type lora \
--tuner_backend peft \
--template_type AUTO \
--dtype AUTO \
--num_train_epochs 3 \
--max_length 2048 \
--check_dataset_strategy warning \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 2 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--gradient_accumulation_steps 16 \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 500 \
--logging_steps 10 \
--use_flash_attn false \
--lazy_tokenize true
Has anyone reproduced the paper's results on the ASR task? Also, what is the difference between qwen2-audio and qwen2-audio-7b-instruct?
Fine-tuning qwen2-audio-7b-instruct; sample training data as follows:
[
{
"query": "What's the mood of the speaker?",
"response": "Neutral",
"audios": [
"/home/mali/projects/CREMA-D/AudioMP3/1001_IEO_NEU_XX.mp3"
]
},
{
"query": "What's the mood of the speaker?",
"response": "Neutral",
"audios": [
"/home/mali/projects/CREMA-D/AudioMP3/1001_IEO_HAP_LO.mp3"
]
},
]
The fine-tuning command is as follows:
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=5,6,2,3 swift sft --model_type qwen2_audio --model /home/mali/projects/Qwen2-audio/Qwen2-Audio-7B-Instruct-SFT --dataset /home/mali/projects/qwen2-audio-sft/output.json
The error is as follows:
[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 447 with name base_model.model.language_model.model.layers.31.mlp.down_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
Library versions:
transformers==4.45.0
swift==2.5.0.dev0
Could anyone help take a look at what went wrong? @fmbao @Liufeiran123 @kindaQ @HyacinthJingjing @farmer21cn
During inference, num_beams cannot be set greater than 1:
NotImplementedError: Make sure that a `_reorder_cache` function is correctly implemented in transformers.models.qwen2.modeling_qwen2 to enable beam search for <class 'transformers.models.qwen2.modeling_qwen2.Qwen2ForCausalLM'>
Hello, during fine-tuning I noticed that with DeepSpeed enabled, the loss drops faster for the same number of training steps and the same learning rate. What could be the reason?
My DeepSpeed parameters are basically all set to auto:
{
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
}
Hello, I have read this topic a few times in a row and still need clarification on how to perform fine-tuning on a custom dataset.
I would like to fine-tune using LoRA.
I would like to use the template:
but may I pass a system message here?
And why do we use "from": "user" rather than "role": "user", as written on the official model page for conversations?
Also, when I fine-tune using LoRA, how should I merge the LoRA weights after training?
I would like to use Python code like on the official model page.
Is it possible to train on Qwen2-Audio-7B-Instruct-4bit,
or is that not advisable?
TY
For swift 3, please refer to: https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal