bitsandbytes-foundation/bitsandbytes

device_map='auto' not working with bitsandbytes (transformers)


System Info

Hardware: Amazon Linux EC2 instance, 8× NVIDIA A10G (23 GB each)

Python 3.10.14
CUDA Version: 12.4
accelerate==0.34.2
bitsandbytes==0.44.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
torch==2.4.1
transformers==4.45.1

Reproduction

from accelerate import infer_auto_device_map
from transformers import AutoModelForCausalLM
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto', quantization_config=bnb_config)

device_map = infer_auto_device_map(model, max_memory={0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('', 0)])

However, if I load without the quantization_config, there is no issue at all:

model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', device_map='auto')
device_map = infer_auto_device_map(model, max_memory={0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", 4: "23GB", 5: "23GB", 6: "23GB", 7: "23GB"})
print(device_map)
--> OrderedDict([('model.embed_tokens', 0), ('lm_head', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7.self_attn', 0), ('model.layers.7.mlp.gate_proj', 0), ('model.layers.7.mlp.up_proj', 0), ('model.layers.7.mlp.down_proj', 1), ('model.layers.7.mlp.act_fn', 1), ('model.layers.7.input_layernorm', 1), ('model.layers.7.pre_feedforward_layernorm', 1), ('model.layers.7.post_feedforward_layernorm', 1), ('model.layers.7.post_attention_layernorm', 1), ('model.layers.8', 1), ('model.layers.9', 1), ('model.layers.10', 1), ('model.layers.11', 1), ('model.layers.12', 1), ('model.layers.13', 1), ('model.layers.14', 1), ('model.layers.15', 1), ('model.layers.16', 1), ('model.layers.17.self_attn', 1), ('model.layers.17.mlp.gate_proj', 1), ('model.layers.17.mlp.up_proj', 1), ('model.layers.17.mlp.down_proj', 2), ('model.layers.17.mlp.act_fn', 2), ('model.layers.17.input_layernorm', 2), ('model.layers.17.pre_feedforward_layernorm', 2), ('model.layers.17.post_feedforward_layernorm', 2), ('model.layers.17.post_attention_layernorm', 2), ('model.layers.18', 2), ('model.layers.19', 2), ('model.layers.20', 2), ('model.layers.21', 2), ('model.layers.22', 2), ('model.layers.23', 2), ('model.layers.24', 2), ('model.layers.25', 2), ('model.layers.26', 2), ('model.layers.27.self_attn', 2), ('model.layers.27.mlp.gate_proj', 2), ('model.layers.27.mlp.up_proj', 2), ('model.layers.27.mlp.down_proj', 3), ('model.layers.27.mlp.act_fn', 3), ('model.layers.27.input_layernorm', 3), ('model.layers.27.pre_feedforward_layernorm', 3), ('model.layers.27.post_feedforward_layernorm', 3), ('model.layers.27.post_attention_layernorm', 3), ('model.layers.28', 3), ('model.layers.29', 3), ('model.layers.30', 3), ('model.layers.31', 3), ('model.layers.32', 3), ('model.layers.33', 3), ('model.layers.34', 3), ('model.layers.35', 3), ('model.layers.36', 3), ('model.layers.37.self_attn', 3), ('model.layers.37.mlp.gate_proj', 3), ('model.layers.37.mlp.up_proj', 3), ('model.layers.37.mlp.down_proj', 4), ('model.layers.37.mlp.act_fn', 4), ('model.layers.37.input_layernorm', 4), ('model.layers.37.pre_feedforward_layernorm', 4), ('model.layers.37.post_feedforward_layernorm', 4), ('model.layers.37.post_attention_layernorm', 4), ('model.layers.38', 4), ('model.layers.39', 4), ('model.layers.40', 4), ('model.layers.41', 4), ('model.layers.42', 4), ('model.layers.43', 4), ('model.layers.44', 4), ('model.layers.45', 4), ('model.norm', 4)])
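For reference, the accelerate docs usually compute the device map from an empty (meta) model before any weights are loaded. A minimal sketch of that pattern (the no_split_module_classes entry for Gemma 2 is my assumption):

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('google/gemma-2-27b-it')
with init_empty_weights():
    # Instantiates the model on the meta device; no memory is allocated for weights.
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={i: "23GB" for i in range(8)},
    no_split_module_classes=["Gemma2DecoderLayer"],  # assumed decoder-layer class name
)
print(device_map)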

Expected behavior

The model is (mostly) loaded onto the last GPU, but I'd expect it to be spread across all of the GPUs. Moreover, infer_auto_device_map does not seem to be working.
I have run into a very similar issue on different hardware.
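For what it's worth, the workaround I would normally try is to pass the per-GPU budget directly to from_pretrained via max_memory instead of relying on the automatic balancing; a minimal sketch (I have not confirmed whether this sidesteps the problem):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
max_memory = {i: "23GB" for i in range(8)}  # one entry per A10G
model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-27b-it',
    device_map='auto',
    max_memory=max_memory,           # explicit budget instead of letting accelerate probe the GPUs
    quantization_config=bnb_config,
)
print(model.hf_device_map)           # the placement transformers actually used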

I'm getting the same issue. Can anyone answer?

I think I've isolated part of the issue. When I hide one GPU, the model is split across the GPUs:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
I don't know whether the issue is version-specific or only happens in setups with more than 7 GPUs. Interestingly enough, 8 GPUs worked fine for Mistral-7B.
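In case it helps, the same restriction can be applied from inside the script, as long as it is set before torch (or anything else that initializes CUDA) is imported; a sketch of that, not a fix for the underlying bug:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6"  # must come before the torch import

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    'google/gemma-2-27b-it',
    device_map='auto',
    quantization_config=bnb_config,
)
print(torch.cuda.device_count())  # should report 7
print(model.hf_device_map)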

This was an issue with accelerate; the fix is here: huggingface/accelerate#3244