jianzhnie/LLamaTuner

Fails on 3090

Closed this issue · 6 comments

(gh_Chinese-Guanaco) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco$ python3 qlora_int8_finetune.py --model_name_or_path /data-ssd-1t/hf_model/llama-7b-hf --data_path tatsu-lab/alpaca --output_dir work_dir_lora/ --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 500 --save_total_limit 5 --learning_rate 1e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --model_max_length 2048 --logging_steps 1
[2023-06-11 00:48:41,928] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: Loading binary /home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Traceback (most recent call last):
File "/home/ub2004/llm_dev/Chinese-Guanaco/qlora_int8_finetune.py", line 338, in
train(load_in_8bit=True)
File "/home/ub2004/llm_dev/Chinese-Guanaco/qlora_int8_finetune.py", line 234, in train
model = AutoModelForCausalLM.from_pretrained(
File "/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "/home/ub2004/anaconda3/envs/gh_Chinese-Guanaco/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2819, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.

(gh_Chinese-Guanaco) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco$

It seems that your GPU's memory is insufficient. Can you tell me about your computer configuration?

RTX 3090 24GB

CPU memory is 32 GB
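
For reference, the ValueError in the first log is transformers refusing to 8-bit-quantize modules that accelerate has dispatched to the CPU or disk; that log also shows a CPU-only bitsandbytes build (libbitsandbytes_cpu.so, "compiled without GPU support"), which needs fixing regardless. If CPU offload is actually intended, the quantization docs linked in the error describe the pattern; a minimal sketch, assuming a GPU-enabled bitsandbytes and a recent transformers (not code from this repo):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Let modules that do not fit on the GPU stay on the CPU in fp32,
# while everything else is loaded in 8-bit on the GPU.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "/data-ssd-1t/hf_model/llama-7b-hf",
    quantization_config=quant_config,
    device_map="auto",  # or a custom dict mapping module names to 0 / "cpu"
    torch_dtype=torch.float16,
)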

Under the "examples" folder, I have added a minimal example for fine-tuning the llama7b model. Please feel free to try it again.

(base) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco/examples$ python3 finetune_llama_with_qlora.py --model_name_or_path /data-ssd-1t/hf_model/llama-7b-hf --data_path tatsu-lab/alpaca --output_dir work_dir_lora/ --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 100 --save_total_limit 5 --learning_rate 1e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --model_max_length 128 --logging_steps 1

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /home/ub2004/anaconda3 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/ub2004/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 91%|██████████████████████████████████████████████████████████████████████████████████████████████████▏ | 30/33 [00:23<00:02, 1.27it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ub2004/llm_dev/Chinese-Guanaco/examples/finetune_llama_with_qlora.py:72 in │
│ │
│ 69 │ │ we will partially dequantize it when needed and do all the computations with a 1 │
│ 70 │ """ │
│ 71 │ # So now we can load the model in 4-bit: │
│ ❱ 72 │ model = AutoModelForCausalLM.from_pretrained( │
│ 73 │ │ model_id, quantization_config=bnb_config, device_map={'': 0}) │
│ 74 │ │
│ 75 │ # Then, we enable gradient checkpointing, to reduce the memory footprint of the mode │
│ │
│ /home/ub2004/anaconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:471 │
│ in from_pretrained │
│ │
│ 468 │ │ │ ) │
│ 469 │ │ elif type(config) in cls._model_mapping.keys(): │
│ 470 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
│ ❱ 471 │ │ │ return model_class.from_pretrained( │
│ 472 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
│ 473 │ │ │ ) │
│ 474 │ │ raise ValueError( │
│ │
│ /home/ub2004/anaconda3/lib/python3.10/site-packages/transformers/modeling_utils.py:2795 in │
│ from_pretrained │
│ │
│ 2792 │ │ │ │ mismatched_keys, │
│ 2793 │ │ │ │ offload_index, │
│ 2794 │ │ │ │ error_msgs, │
│ ❱ 2795 │ │ │ ) = cls._load_pretrained_model( │
│ 2796 │ │ │ │ model, │
│ 2797 │ │ │ │ state_dict, │
│ 2798 │ │ │ │ loaded_state_dict_keys, # XXX: rename? │
│ │
│ /home/ub2004/anaconda3/lib/python3.10/site-packages/transformers/modeling_utils.py:3123 in │
│ _load_pretrained_model │
│ │
│ 3120 │ │ │ │ ) │
│ 3121 │ │ │ │ │
│ 3122 │ │ │ │ if low_cpu_mem_usage: │
│ ❱ 3123 │ │ │ │ │ new_error_msgs, offload_index, state_dict_index = _load_state_dict_i │
│ 3124 │ │ │ │ │ │ model_to_load, │
│ 3125 │ │ │ │ │ │ state_dict, │
│ 3126 │ │ │ │ │ │ loaded_keys, │
│ │
│ /home/ub2004/anaconda3/lib/python3.10/site-packages/transformers/modeling_utils.py:698 in │
│ _load_state_dict_into_meta_model │
│ │
│ 695 │ │ │ state_dict_index = offload_weight(param, param_name, state_dict_folder, stat │
│ 696 │ │ elif not load_in_8bit: │
│ 697 │ │ │ # For backward compatibility with older versions of accelerate
│ ❱ 698 │ │ │ set_module_tensor_to_device(model, param_name, param_device, **set_module_kw │
│ 699 │ │ else: │
│ 700 │ │ │ if param.dtype == torch.int8 and param_name.replace("weight", "SCB") in stat │
│ 701 │ │ │ │ fp16_statistics = state_dict[param_name.replace("weight", "SCB")] │
│ │
│ /home/ub2004/anaconda3/lib/python3.10/site-packages/accelerate/utils/modeling.py:149 in │
│ set_module_tensor_to_device │
│ │
│ 146 │ │ if value is None: │
│ 147 │ │ │ new_value = old_value.to(device) │
│ 148 │ │ elif isinstance(value, torch.Tensor): │
│ ❱ 149 │ │ │ new_value = value.to(device) │
│ 150 │ │ else: │
│ 151 │ │ │ new_value = torch.tensor(value, device=device) │
│ 152 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.70 GiB total capacity; 23.04 GiB already allocated; 14.12 MiB free; 23.04 GiB reserved in
total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
(base) ub2004@ub2004-B85M-A0:~/llm_dev/Chinese-Guanaco/examples$

Try adding --max_memory_MB 48000 to the command.
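
For context, in QLoRA-style scripts a --max_memory_MB flag is usually turned into a per-GPU max_memory cap passed to from_pretrained; whether 48000 is meaningful on a single 24 GB card depends on how this repo interprets the flag. A rough sketch of the usual pattern (argument names are assumptions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

max_memory_mb = 48000  # value suggested above; a 3090 itself only has 24 GB
# Cap how much of each visible GPU accelerate may fill while loading the model.
max_memory = {i: f"{max_memory_mb}MB" for i in range(torch.cuda.device_count())}

model = AutoModelForCausalLM.from_pretrained(
    "/data-ssd-1t/hf_model/llama-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    max_memory=max_memory,
)

Note that the OOM above happened while the checkpoint shards were still loading (23.04 GiB already allocated, 14 MiB free), so allocator tweaks such as PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 may not be enough on their own.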

Under the "examples" folder, I have added a minimal example for fine-tuning the llama7b model. Please feel free to try it again.
I have tried it, but I still get an error:
Traceback (most recent call last):
  File "/home/xdx/baichuan1/finetune/Efficient-Tuning-LLMs/baichuan7b_demo.py", line 23, in <module>
    main(load_in_8bit, model_path)
  File "/home/xdx/baichuan1/finetune/Efficient-Tuning-LLMs/baichuan7b_demo.py", line 8, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/xdx/miniconda3/envs/baichuan/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/xdx/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 658, in from_pretrained
    return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
  File "/home/xdx/miniconda3/envs/baichuan/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2959, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/xdx/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 531, in __init__
    if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
TypeError: 'BitsAndBytesConfig' object is not subscriptable

Here is the command I ran on Linux:
CUDA_VISIBLE_DEVICES=4,5 python baichuan7b_demo.py \
    --model_name_or_path ../../../../baichuan-inc/Baichuan2-7B-Chat \
    --dataset_cfg ./data/alpaca_zh_pcyn.yaml \
    --output_dir ../../../..oasst1-baichuan-7b \
    --num_train_epochs 4 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --save_strategy steps \
    --save_total_limit 5 \
    --save_steps 100 \
    --logging_strategy steps \
    --logging_steps 1 \
    --learning_rate 0.0002 \
    --warmup_ratio 0.03 \
    --weight_decay 0.0 \
    --lr_scheduler_type constant \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --max_new_tokens 32 \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --double_quant \
    --quant_type nf4 \
    --fp16 \
    --bits 4 \
    --gradient_checkpointing \
    --trust_remote_code \
    --do_train \
    --do_eval \
    --sample_generate \
    --data_seed 42 \
    --seed 0 \
    --max_memory_MB 48000
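
On the last traceback: modeling_baichuan.py (line 531 in the trace) indexes config.quantization_config['load_in_4bit'], which works when quantization_config is a plain dict but not once transformers has wrapped it in a BitsAndBytesConfig object. A small illustration of the difference (illustrative only, not a fix for the remote Baichuan code):

from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)

# What the Baichuan modeling code effectively does -> TypeError,
# because BitsAndBytesConfig is an object, not a dict:
#     quant_config["load_in_4bit"]

# These forms work on the object:
print(quant_config.load_in_4bit)               # attribute access
print(quant_config.to_dict()["load_in_4bit"])  # dict view

So the failure comes from the trust_remote_code Baichuan modeling file rather than from this repo; working around it would mean either patching that file to use attribute access or using a transformers version where config.quantization_config is still a plain dict.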