[FSDP+QLoRA] ValueError: Expected a cuda device, but got: cpu
iseesaw opened this issue · 7 comments
System Info
pip list
accelerate 0.29.3
bitsandbytes 0.43.1
datasets 2.14.6
huggingface-hub 0.20.3
llama-recipes 0.0.1
peft 0.10.0
safetensors 0.4.2
tokenizers 0.19.1
torch 2.1.2
transformers 4.40.0
cupy-cuda12x 12.1.0
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
8xA6000 48G, CUDA Version: 12.2
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder
- My own task or dataset (give details below)
Reproduction
Code from https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-141b-A35b
Set use_dora=True in LoraConfig.
Run my modified version of the following command:
ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file recipes/accelerate_configs/fsdp.yaml scripts/run_orpo.py recipes/zephyr-141b-A35b/orpo/config_qlora.yaml
This raises a ValueError:
Traceback (most recent call last):
File "/root/kyzhang/llms/UltraMedical/llm_dpo/run_sft.py", line 209, in <module>
main()
File "/root/kyzhang/llms/UltraMedical/llm_dpo/run_sft.py", line 141, in main
trainer = SFTTrainer(
File "/root/miniconda3/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 228, in __init__
model = get_peft_model(model, peft_config)
File "/root/miniconda3/lib/python3.10/site-packages/peft/mapping.py", line 136, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 1094, in __init__
super().__init__(model, peft_config, adapter_name)
File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 129, in __init__
self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 136, in __init__
super().__init__(model, config, adapter_name)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 148, in __init__
self.inject_adapter(self.model, adapter_name)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 325, in inject_adapter
self._create_and_replace(peft_config, adapter_name, target, target_name, parent, current_key=key)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 220, in _create_and_replace
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 295, in _create_new_module
new_module = dispatcher(target, adapter_name, lora_config=lora_config, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/bnb.py", line 506, in dispatch_bnb_4bit
new_module = Linear4bit(target, adapter_name, **fourbit_kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/bnb.py", line 293, in __init__
self.update_layer(
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 126, in update_layer
self.dora_init(adapter_name)
File "/root/miniconda3/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 186, in dora_init
weight = dequantize_bnb_weight(weight, state=quant_state) # no-op if not bnb
File "/root/miniconda3/lib/python3.10/site-packages/peft/utils/integrations.py", line 58, in dequantize_bnb_weight
return bnb.functional.dequantize_4bit(weight.data, weight.quant_state)
File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1353, in dequantize_4bit
device = pre_call(A.device)
File "/root/miniconda3/lib/python3.10/site-packages/bitsandbytes/functional.py", line 459, in pre_call
torch.cuda.set_device(device)
File "/root/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 402, in set_device
device = _get_device_index(device)
File "/root/miniconda3/lib/python3.10/site-packages/torch/cuda/_utils.py", line 35, in _get_device_index
raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: cpu
Expected behavior
I successfully trained the LLaMA-3-70B model using the script from the official PEFT example: run_peft_qlora_fsdp.sh.
However, I'm still encountering this problem when I set use_dora=True in the code.
Thanks for reporting. It looks like the model is still on CPU at initialization time. Initializing DoRA requires us to dequantize the bnb weights, which is not supported on CPU, hence this error. This should hopefully not be that hard to fix on our side. Meanwhile, perhaps you can adjust your scripts so that the base model is sent to GPU before calling get_peft_model and check if that works.
Edit: Honestly not sure how the weights can be on CPU here, maybe some form of offloading? In that case, the problem probably runs deeper. Are you aware if any offloading goes on here?
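To check whether the weights really are on CPU (or on the meta device) right before get_peft_model is called, a small diagnostic along these lines may help. This is only a sketch: the helper names summarize_param_devices and all_params_on_cuda are made up for illustration and are not part of PEFT, and the code assumes nothing beyond the model exposing named_parameters() with a .device on each parameter, as a torch.nn.Module does.

```python
from collections import Counter


def summarize_param_devices(model):
    """Count parameters per device type ("cpu", "cuda", "meta", ...).

    Works on any object exposing named_parameters() whose items carry a
    .device with a .type attribute, as torch.nn.Module parameters do.
    """
    counts = Counter()
    for _name, param in model.named_parameters():
        counts[param.device.type] += 1
    return dict(counts)


def all_params_on_cuda(model):
    """Return True only if there is at least one parameter and none is off-GPU."""
    counts = summarize_param_devices(model)
    return bool(counts) and set(counts) == {"cuda"}


# Hypothetical usage in a training script, just before get_peft_model:
#   print(summarize_param_devices(model))
#   print(all_params_on_cuda(model))
```

If the summary shows "cpu" or "meta" entries, DoRA initialization would hit the dequantization path with non-CUDA weights, which matches the traceback above.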
I have the same issue. LoRA/DoRA, DDP LoRA/DoRA, QLoRA/QDoRA, DDP QLoRA/QDoRA, FSDP LoRA/DoRA, and FSDP QLoRA all work for me, but FSDP QDoRA does not.
This fixed the issue I was having, but when using DoRA/QDoRA with FSDP it errors out:
[rank0]: Traceback (most recent call last):
[rank0]: File "trl_finetune.py", line 401, in <module>
[rank0]: trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank0]: output = super().train(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank0]: return inner_training_loop(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2002, in _inner_training_loop
[rank0]: self.model = self.accelerator.prepare(self.model)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]: result = tuple(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1459, in prepare_model
[rank0]: model = FSDP(model, **kwargs)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
[rank0]: _auto_wrap(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
[rank0]: _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]: [Previous line repeated 2 more times]
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
[rank0]: return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
[rank0]: return wrapper_cls(module, **kwargs)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank0]: _init_param_handle_from_module(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank0]: _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank0]: handle = FlatParamHandle(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank0]: self._init_flat_param_and_metadata(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank0]: ) = self._validate_tensors_to_flatten(params)
[rank0]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
[rank0]: raise ValueError(
[rank0]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3541/3541 [00:00<00:00, 12989.37 examples/s]
/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
warnings.warn(
[rank1]: Traceback (most recent call last):
[rank1]: File "trl_finetune.py", line 401, in <module>
[rank1]: trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank1]: output = super().train(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank1]: return inner_training_loop(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2002, in _inner_training_loop
[rank1]: self.model = self.accelerator.prepare(self.model)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1292, in prepare
[rank1]: result = tuple(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1459, in prepare_model
[rank1]: model = FSDP(model, **kwargs)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
[rank1]: _auto_wrap(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
[rank1]: _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank1]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank1]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank1]: wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]: [Previous line repeated 2 more times]
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
[rank1]: return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
[rank1]: return wrapper_cls(module, **kwargs)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank1]: _init_param_handle_from_module(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank1]: _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank1]: handle = FlatParamHandle(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank1]: self._init_flat_param_and_metadata(
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank1]: ) = self._validate_tensors_to_flatten(params)
[rank1]: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
[rank1]: raise ValueError(
[rank1]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
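The error above says FSDP found both torch.bfloat16 and torch.float32 parameters in a group it tried to flatten together. To see which parameters carry the odd dtype before accelerator.prepare runs, something like the following sketch could be used. The helper names group_params_by_dtype and minority_dtype_params are hypothetical (not part of PEFT, accelerate, or torch), and treating the most common dtype as the "expected" one is an assumption that only serves as a rough heuristic.

```python
from collections import defaultdict


def group_params_by_dtype(model):
    """Map each dtype (as a string) to the names of the parameters using it."""
    groups = defaultdict(list)
    for name, param in model.named_parameters():
        groups[str(param.dtype)].append(name)
    return dict(groups)


def minority_dtype_params(model):
    """Return names of parameters whose dtype differs from the most common one."""
    groups = group_params_by_dtype(model)
    if len(groups) <= 1:
        return []  # uniform dtype (or no parameters): nothing stands out
    # Assumption: the dtype used by the most parameters is the intended one.
    majority = max(groups, key=lambda dtype: len(groups[dtype]))
    return [
        name
        for dtype, names in groups.items()
        if dtype != majority
        for name in names
    ]


# Hypothetical usage on the PEFT-wrapped model, before trainer.train():
#   for name in minority_dtype_params(model):
#       print(name)
```

The printed names only show where the dtypes diverge; whether those parameters should be cast or wrapped separately depends on the setup and is not something this sketch decides.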