hiyouga/LLaMA-Factory

Errors while fine-tuning internlm2-chat-20b with QLoRA


Reminder

  • I have read the README and searched the existing issues.

Reproduction

```
CUDA_VISIBLE_DEVICES=1 llamafactory-cli example/......
```
The YAML file used is below:

```yaml
### model
model_name_or_path: /home/ybh/ybh/models/internlm2-chat-20b
quantization_bit: 4

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: wqkv

### dataset
dataset: text_classification_coarse
template: intern2
cutoff_len: 6144
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/ybh/ybh/nlpcc/LLaMA-Factory/saves/internlm2-chat-20b/qlora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 10
```
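For reference, `lora_target: wqkv` matches the fused attention projection name used by the internlm2 modeling code. If there is any doubt about which module names a given checkpoint exposes, a minimal sketch like the one below (hypothetical, assuming `transformers` and `accelerate` are installed; the path is the one from the config above) lists the linear-layer names without loading any weights:

```python
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

path = "/home/ybh/ybh/models/internlm2-chat-20b"  # path taken from the config above

# Build the model skeleton on the meta device, just to inspect module names.
cfg = AutoConfig.from_pretrained(path, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(cfg, trust_remote_code=True)

linear_names = sorted({name.split(".")[-1]
                       for name, module in model.named_modules()
                       if isinstance(module, torch.nn.Linear)})
print(linear_names)  # for internlm2 this should include "wqkv" among the projections
```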

Expected behavior

No response

System Info

```
[INFO|trainer.py:2048] 2024-05-18 00:07:10,006 >> ***** Running training *****
[INFO|trainer.py:2049] 2024-05-18 00:07:10,006 >> Num examples = 122
[INFO|trainer.py:2050] 2024-05-18 00:07:10,006 >> Num Epochs = 5
[INFO|trainer.py:2051] 2024-05-18 00:07:10,006 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2054] 2024-05-18 00:07:10,006 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2055] 2024-05-18 00:07:10,006 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2056] 2024-05-18 00:07:10,006 >> Total optimization steps = 75
[INFO|trainer.py:2057] 2024-05-18 00:07:10,007 >> Number of trainable parameters = 2,621,440
  0%|          | 0/75 [00:00<?, ?it/s]
/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Traceback (most recent call last):
  File "/home/ybh/miniconda3/envs/nlpcc/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/cli.py", line 65, in main
    run_exp()
  File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/ybh/nlpcc/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/accelerate/accelerator.py", line 2121, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/ybh/miniconda3/envs/nlpcc/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
  0%|          | 0/75 [00:00<?, ?it/s]
```
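For context, the second UserWarning ("None of the inputs have requires_grad=True") and the final RuntimeError describe the same underlying condition: the loss tensor is detached from the autograd graph. A minimal PyTorch-only sketch (independent of the LLaMA-Factory code path) reproduces the same message:

```python
import torch

# No leaf tensor requires grad, so the result has no grad_fn ...
x = torch.randn(4, 4)          # requires_grad defaults to False
loss = (x * 2.0).sum()

# ... and backward() fails with the message seen in the traceback above:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
loss.backward()
```

Under QLoRA with gradient checkpointing, this condition can appear when none of the checkpointed inputs require grad, which is what the warning from torch/utils/checkpoint.py:91 is pointing at.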

Others

When I used LoRA to fine-tune internlm-chat-7b instead, this error did not happen.

Yes, same here, though in my case I tried it with internlm2-20b (base, non-chat)

The same configuration applied to internlm2-7b appears to work (I did not let it run to completion, since I am not interested in that model).