QwenLM/Qwen2.5-Coder

Error encountered while training qwen-2.5-3b model using Qwen2.5-Coder/finetuning/sft/train.py

Closed this issue · 5 comments

Hello,
When I run the code and scripts in Qwen2.5-Coder/finetuning/sft/train.py to train the qwen-2.5-3b model, I encounter the following error:
[Two screenshots of the error output were attached here; their contents are not available.]

It seems to be a token range issue. Could you please advise me on how to resolve this problem? My library versions match the requirements.
Thank you!
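
One way to test the "token range" guess is to check whether any token id in the binarized data falls outside the model's embedding table, since an id outside that range would fail in the embedding lookup. The following is only a minimal sketch: `find_out_of_range_ids` is a hypothetical helper, and both how the ids are stored in ./processed/sft.jsonl and the embedding size to compare against are assumptions to verify against binarize_data.py and the model's config.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    samples: Iterable[List[int]],
    num_embeddings: int,
    ignore_ids: Tuple[int, ...] = (),
) -> List[Tuple[int, int, int]]:
    """Return (sample_index, position, token_id) for ids outside [0, num_embeddings)."""
    bad = []
    for i, ids in enumerate(samples):
        for pos, tok in enumerate(ids):
            if tok in ignore_ids:
                continue  # e.g. a label-masking value that is never embedded
            if tok < 0 or tok >= num_embeddings:
                bad.append((i, pos, tok))
    return bad


# Toy data only: pretend the embedding table has 10 rows.
toy_samples = [[1, 2, 9], [3, 10, -100]]
print(find_out_of_range_ids(toy_samples, num_embeddings=10, ignore_ids=(-100,)))
# prints [(1, 1, 10)]
```

If a check like this reports any ids for the real data, the tokenizer used for binarization and the model being trained probably disagree on vocabulary size.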

Please provide us with a minimal example to reproduce the error: training data (a small set is ok), binarize scripts and training scripts.

training data:
I organized the Spider dev data according to the format specified in Qwen2.5-Coder/finetuning/sft and processed it using ./scripts/binarize_data.sh. Here is an example:

{
"messages": [
{"role": "user", "content": "CREATE TABLE \"stadium\" (\n\"Stadium_ID\" int,\n\"Location\" text,\n\"Name\" text,\n\"Capacity\" int,\n\"Highest\" int,\n\"Lowest\" int,\n\"Average\" int,\nPRIMARY KEY (\"Stadium_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM stadium LIMIT 3;\nStadium_ID    Location    Name    Capacity    Highest    Lowest    Average\n1    Raith Rovers    Stark's Park    10104    4812    1294    2106\n2    Ayr United    Somerset Park    11998    2363    1057    1477\n3    East Fife    Bayview Stadium    2000    1980    533    864\n*/\n\nCREATE TABLE \"singer\" (\n\"Singer_ID\" int,\n\"Name\" text,\n\"Country\" text,\n\"Song_Name\" text,\n\"Song_release_year\" text,\n\"Age\" int,\n\"Is_male\" bool,\nPRIMARY KEY (\"Singer_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM singer LIMIT 3;\nSinger_ID    Name    Country    Song_Name    Song_release_year    Age    Is_male\n1    Joe Sharp    Netherlands    You    1992    52    F\n2    Timbaland    United States    Dangerous    2008    32    T\n3    Justin Brown    France    Hey Oh    2013    29    T\n*/\n\nCREATE TABLE \"concert\" (\n\"concert_ID\" int,\n\"concert_Name\" text,\n\"Theme\" text,\n\"Stadium_ID\" text,\n\"Year\" text,\nPRIMARY KEY (\"concert_ID\"),\nFOREIGN KEY (\"Stadium_ID\") REFERENCES \"stadium\"(\"Stadium_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM concert LIMIT 3;\nconcert_ID    concert_Name    Theme    Stadium_ID    Year\n1    Auditions    Free choice    1    2014\n2    Super bootcamp    Free choice 2    2    2014\n3    Home Visits    Bleeding Love    2    2015\n*/\n\nCREATE TABLE \"singer_in_concert\" (\n\"concert_ID\" int,\n\"Singer_ID\" text,\nPRIMARY KEY (\"concert_ID\",\"Singer_ID\"),\nFOREIGN KEY (\"concert_ID\") REFERENCES \"concert\"(\"concert_ID\"),\nFOREIGN KEY (\"Singer_ID\") REFERENCES \"singer\"(\"Singer_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM singer_in_concert LIMIT 3;\nconcert_ID    Singer_ID\n1    2\n1    3\n1    5\n*/\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\nQuestion: How many singers do we have?\n"},
{"role": "assistant", "content": "SELECT count(*) FROM singer"}
], 
"format": "chatml"
}
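
Before binarizing, it may help to confirm that every line of the raw JSONL file has the shape shown in the example above. This is a rough sketch only: `check_record` and `ALLOWED_ROLES` are hypothetical names, and the rules it applies (a non-empty "messages" list, string "content", "format" equal to "chatml") are inferred from this single example rather than taken from binarize_data.py.

```python
import json
from typing import List

# Assumption: the usual ChatML roles; the example above only shows user/assistant.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_record(line: str) -> List[str]:
    """Return the problems found in one JSONL line; an empty list means none."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty list")
        messages = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} content is not a string")
    if record.get("format") != "chatml":
        problems.append("'format' is not 'chatml'")
    return problems


good = (
    '{"messages": [{"role": "user", "content": "How many singers do we have?"}, '
    '{"role": "assistant", "content": "SELECT count(*) FROM singer"}], '
    '"format": "chatml"}'
)
print(check_record(good))  # prints []
```

Note that a raw line break inside a string value makes a line invalid JSON, so each record has to stay on a single line with newlines written as \n.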

Here is my binarize script:

export PATH=/path/to/miniconda3/envs/qwen/bin:$PATH;
# cd ./finetuning/sft/;
INPUT_PATH=${1}
OUTPUT_PATH=${2}
TOKENIZER_PATH=${3}
INPUT_PATH=${INPUT_PATH:-"./raw_data/sft.jsonl"}
OUTPUT_PATH=${OUTPUT_PATH:-"./processed/sft.jsonl"}
TOKENIZER_PATH=${TOKENIZER_PATH:-"/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"}
python binarize_data.py -input_path ${INPUT_PATH} -output_path ${OUTPUT_PATH} -workers 64 -tokenizer_path ${TOKENIZER_PATH}

And my training script:

export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=enp0s31f6
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export CUDA_VISIBLE_DEVICES=0
export CUDA_LAUNCH_BLOCKING=1
export NCCL_NET_PLUGIN=none
export TORCHELASTIC_ERROR_FILE=error.json
export PATH=/home/yhw/miniconda3/envs/sft_env/bin:$PATH;

DATA_PATH=${1}
PRETRAINED_MODEL=${2}
OUTPUT_DIR=${3}

# Hyperparameters come first because OUTPUT_DIR interpolates them below.
BATCH_SIZE=1024
MICRO_BATCH_SIZE=4
LR=5e-5
MIN_LR=5e-6
WARMUP_STEPS=100
WEIGHT_DECAY=0.0
MAX_LENGTH=1280

DATA_PATH=${DATA_PATH:-"./processed/sft.jsonl"}
PRETRAINED_MODEL=${PRETRAINED_MODEL:-"/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"}
OUTPUT_DIR=${OUTPUT_DIR:-"./checkpoints/lr${LR}-wr${WARMUP_STEPS}-wd${WEIGHT_DECAY}-bsz${BATCH_SIZE}-maxlen${MAX_LENGTH}/"}

GPUS_PER_NODE=$(python -c "import torch; print(torch.cuda.device_count());")
MASTER_ADDR=${MASTER_ADDR:-localhost}
NNODES=${WORLD_SIZE:-1}
NODE_RANK=${RANK:-0}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=${MASTER_PORT:-6105}
DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
DEEPSPEED_CONFIG="./configs/default_offload_opt_param.json"
# Accumulation steps needed to reach the global batch size.
GRAD_ACCU=$(($BATCH_SIZE / $WORLD_SIZE / $MICRO_BATCH_SIZE))

echo $OUTPUT_DIR
echo "Pretrained Model" ${PRETRAINED_MODEL}
echo "WORLD_SIZE" $WORLD_SIZE "MICRO BATCH SIZE" $MICRO_BATCH_SIZE "GRAD_ACCU" $GRAD_ACCU
echo $DISTRIBUTED_ARGS

# cd ROOT_PATH="/path/to/sft/";
torchrun ${DISTRIBUTED_ARGS} train.py \
    --model_name_or_path  ${PRETRAINED_MODEL} \
    --data_path $DATA_PATH \
    --model_max_length ${MAX_LENGTH} \
    --output_dir ${OUTPUT_DIR} \
    --num_train_epochs 3 \
    --per_device_train_batch_size ${MICRO_BATCH_SIZE} \
    --gradient_accumulation_steps ${GRAD_ACCU} \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 100 \
    --learning_rate ${LR} \
    --weight_decay ${WEIGHT_DECAY} \
    --warmup_steps ${WARMUP_STEPS} \
    --lr_scheduler_type "cosine" \
    --logging_strategy "steps" \
    --logging_steps 1 \
    --deepspeed ${DEEPSPEED_CONFIG} \
    --report_to "tensorboard" \
    --bf16 True \
    --tf32 True \
    --truncate_source False
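
For reference, the gradient accumulation value that the script above derives can be worked out by hand. The sketch below mirrors the shell expression for GRAD_ACCU; `grad_accumulation_steps` is a hypothetical helper, and the values of one GPU and one node are assumptions based on CUDA_VISIBLE_DEVICES=0 and on WORLD_SIZE not being set in the environment before the script runs.

```python
def grad_accumulation_steps(batch_size: int, world_size: int, micro_batch_size: int) -> int:
    """Mirror the shell expression $(($BATCH_SIZE / $WORLD_SIZE / $MICRO_BATCH_SIZE))."""
    # Shell arithmetic is integer division, applied left to right.
    return batch_size // world_size // micro_batch_size


gpus_per_node = 1  # assumption: CUDA_VISIBLE_DEVICES=0 exposes a single GPU
nnodes = 1         # assumption: WORLD_SIZE is unset, so NNODES falls back to 1
world_size = gpus_per_node * nnodes

steps = grad_accumulation_steps(1024, world_size, 4)
print(steps)                   # prints 256
print(steps * world_size * 4)  # prints 1024, the effective global batch size
```

Under these assumptions each optimizer step accumulates 256 micro-batches of 4 samples, so a small dataset such as the Spider dev set yields only a few optimizer steps per epoch.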

I made some modifications to ensure that both scripts can run on my server. The binarize_data.sh script runs successfully, but the sft_qwencoder.sh script encounters the error mentioned above.

Does the same problem also occur with other model sizes (e.g. Qwen2.5-Coder-1.5B)?

I'm facing the same issue. Have you managed to resolve it? @Yhw109 @CSJianYang

#189

Hi, please merge the latest PR, which may solve your problem. @nwputian @Yhw109