Error encountered while training qwen-2.5-3b model using Qwen2.5-Coder/finetuning/sft/train.py
Closed this issue · 5 comments
Please provide us with a minimal example to reproduce the error: training data (a small set is ok), binarize scripts and training scripts.
training data:
I have organized the spider dev data according to the format specified in Qwen2.5-Coder/finetuning/sft, and processed it using ./scripts/binarize_data.sh. Here is an example:
{
"messages": [
{"role": "user", "content": "CREATE TABLE \"stadium\" (\n\"Stadium_ID\" int,\n\"Location\" text,\n\"Name\" text,\n\"Capacity\" int,\n\"Highest\" int,\n\"Lowest\" int,\n\"Average\" int,\nPRIMARY KEY (\"Stadium_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM stadium LIMIT 3;\nStadium_ID Location Name Capacity Highest Lowest Average\n1 Raith Rovers Stark's Park 10104 4812 1294 2106\n2 Ayr United Somerset Park 11998 2363 1057 1477\n3 East Fife Bayview Stadium 2000 1980 533 864\n*/\n\nCREATE TABLE \"singer\" (\n\"Singer_ID\" int,\n\"Name\" text,\n\"Country\" text,\n\"Song_Name\" text,\n\"Song_release_year\" text,\n\"Age\" int,\n\"Is_male\" bool,\nPRIMARY KEY (\"Singer_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM singer LIMIT 3;\nSinger_ID Name Country Song_Name Song_release_year Age Is_male\n1 Joe Sharp Netherlands You 1992 52 F\n2 Timbaland United States Dangerous 2008 32 T\n3 Justin Brown France Hey Oh 2013 29 T\n*/\n\nCREATE TABLE \"concert\" (\n\"concert_ID\" int,\n\"concert_Name\" text,\n\"Theme\" text,\n\"Stadium_ID\" text,\n\"Year\" text,\nPRIMARY KEY (\"concert_ID\"),\nFOREIGN KEY (\"Stadium_ID\") REFERENCES \"stadium\"(\"Stadium_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM concert LIMIT 3;\nconcert_ID concert_Name Theme Stadium_ID Year\n1 Auditions Free choice 1 2014\n2 Super bootcamp Free choice 2 2 2014\n3 Home Visits Bleeding Love 2 2015\n*/\n\nCREATE TABLE \"singer_in_concert\" (\n\"concert_ID\" int,\n\"Singer_ID\" text,\nPRIMARY KEY (\"concert_ID\",\"Singer_ID\"),\nFOREIGN KEY (\"concert_ID\") REFERENCES \"concert\"(\"concert_ID\"),\nFOREIGN KEY (\"Singer_ID\") REFERENCES \"singer\"(\"Singer_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM singer_in_concert LIMIT 3;\nconcert_ID Singer_ID\n1 2\n1 3\n1 5\n*/\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\nQuestion: How many singers do we have?\n"},
{"role": "assistant", "content": "SELECT count(*) FROM singer"}
],
"format": "chatml"
}
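For anyone preparing a small reproduction set, a quick structural check of the raw JSONL before binarizing may help rule out malformed records. The following is only a minimal sketch: the helper names `validate_record` and `validate_jsonl` are hypothetical (they are not part of the Qwen2.5-Coder repository), and the allowed roles and the requirement that `format` equals `chatml` are assumptions inferred from the example above.

```python
import json

# Assumption: roles typically seen in ChatML-style data; adjust if your data differs.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one SFT record (empty list if none)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    # Assumption: every record should carry the same "format" value as the example.
    if record.get("format") != "chatml":
        problems.append("'format' is not 'chatml'")
    return problems


def validate_jsonl(path):
    """Yield (line_number, problems) for every problematic line in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                yield lineno, [f"invalid JSON: {exc}"]
                continue
            problems = validate_record(record)
            if problems:
                yield lineno, problems
```

Running `for lineno, problems in validate_jsonl("./raw_data/sft.jsonl"): print(lineno, problems)` would list any lines that do not match the structure shown above. This only checks the shape of the data and says nothing about the cause of the training error.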
Here is my binarize script:
export PATH=/path/to/miniconda3/envs/qwen/bin:$PATH;
# cd ./finetuning/sft/;
INPUT_PATH=${1}
OUTPUT_PATH=${2}
TOKENIZER_PATH=${3}
INPUT_PATH=${INPUT_PATH:-"./raw_data/sft.jsonl"}
OUTPUT_PATH=${OUTPUT_PATH:-"./processed/sft.jsonl"}
TOKENIZER_PATH=${TOKENIZER_PATH:-"/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"}
python binarize_data.py -input_path ${INPUT_PATH} -output_path ${OUTPUT_PATH} -workers 64 -tokenizer_path ${TOKENIZER_PATH}
And here is my training script:
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=enp0s31f6
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export CUDA_VISIBLE_DEVICES=0
export CUDA_LAUNCH_BLOCKING=1
export NCCL_NET_PLUGIN=none
export TORCHELASTIC_ERROR_FILE=error.json
export PATH=/home/yhw/miniconda3/envs/sft_env/bin:$PATH;
DATA_PATH=${1}
PRETRAINED_MODEL=${2}
OUTPUT_DIR=${3}
DATA_PATH=${DATA_PATH:-"./processed/sft.jsonl"}
PRETRAINED_MODEL=${PRETRAINED_MODEL:-"/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"}
OUTPUT_DIR=${OUTPUT_DIR:-"./checkpoints/lr${LR}-wr${WARMUP_STEPS}-wd${WEIGHT_DECAY}-bsz${BATCH_SIZE}-maxlen${MAX_LENGTH}/"}
GPUS_PER_NODE=$(python -c "import torch; print(torch.cuda.device_count());")
MASTER_ADDR=${MASTER_ADDR:-localhost}
NNODES=${WORLD_SIZE:-1}
NODE_RANK=${RANK:-0}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=${MASTER_PORT:-6105}
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
DEEPSPEED_CONFIG="./configs/default_offload_opt_param.json"
BATCH_SIZE=1024
MICRO_BATCH_SIZE=4
GRAD_ACCU=$(($BATCH_SIZE / $WORLD_SIZE / $MICRO_BATCH_SIZE))
LR=5e-5
MIN_LR=5e-6
WARMUP_STEPS=100
WEIGHT_DECAY=0.0
MAX_LENGTH=1280
echo $OUTPUT_DIR
echo "Pretrained Model" ${PRETRAINED_MODEL}
echo "WORLD_SIZE" $WORLD_SIZE "MICRO BATCH SIZE" $MICRO_BATCH_SIZE "GRAD_ACCU" $GRAD_ACCU
echo $DISTRIBUTED_ARGS
# cd ROOT_PATH="/path/to/sft/";
torchrun ${DISTRIBUTED_ARGS} train.py \
--model_name_or_path ${PRETRAINED_MODEL} \
--data_path $DATA_PATH \
--model_max_length ${MAX_LENGTH} \
--output_dir ${OUTPUT_DIR} \
--num_train_epochs 3 \
--per_device_train_batch_size ${MICRO_BATCH_SIZE} \
--gradient_accumulation_steps ${GRAD_ACCU} \
--per_device_eval_batch_size 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate ${LR} \
--weight_decay ${WEIGHT_DECAY} \
--warmup_steps ${WARMUP_STEPS} \
--lr_scheduler_type "cosine" \
--logging_strategy "steps" \
--logging_steps 1 \
--deepspeed ${DEEPSPEED_CONFIG} \
--report_to "tensorboard" \
--bf16 True \
--tf32 True \
--truncate_source False
I made some modifications to ensure that both scripts can run on my server. The binarize_data.sh script runs successfully, but the sft_qwencoder.sh script encounters the error mentioned above.
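For readers trying to reproduce this on a single GPU, it may be useful to see what the training script's batch arithmetic works out to. The sketch below mirrors the line `GRAD_ACCU=$(($BATCH_SIZE / $WORLD_SIZE / $MICRO_BATCH_SIZE))` in Python. The function name `grad_accumulation_steps` is hypothetical, and the inputs assume that `CUDA_VISIBLE_DEVICES=0` exposes exactly one GPU and that `WORLD_SIZE` is not already set in the environment, so `NNODES` falls back to 1.

```python
def grad_accumulation_steps(batch_size, gpus_per_node, nnodes, micro_batch_size):
    """Mirror GRAD_ACCU=$(($BATCH_SIZE / $WORLD_SIZE / $MICRO_BATCH_SIZE)) from the script."""
    # The script recomputes WORLD_SIZE as GPUS_PER_NODE * NNODES.
    world_size = gpus_per_node * nnodes
    # Bash $(( )) arithmetic uses integer division, evaluated left to right.
    return batch_size // world_size // micro_batch_size


# Values from the posted script; one visible GPU and one node are assumptions.
steps = grad_accumulation_steps(
    batch_size=1024, gpus_per_node=1, nnodes=1, micro_batch_size=4
)
print(steps)  # 256
```

Under those assumptions each optimizer update accumulates 256 micro-batches of 4 samples, which is 1024 samples per update.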
Does the same problem also occur with other model sizes (e.g. Qwen2.5-Coder-1.5B)?
I'm facing the same issue. Have you managed to resolve it? @Yhw109 @CSJianYang