huggingface/transformers

Can the BNB quantization process be on GPU?

mxjmtxrm opened this issue · 2 comments

System Info

  • transformers version: 4.41.0.dev0
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0a0+81ea7a4 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@SunMarc and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I noticed that when the quantization config is not None and is_deepspeed_zero3_enabled() is True, the device map is set to 'cpu', so the quantization process runs on the CPU.
Why is that? Could the quantization be run on the GPUs instead?
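For illustration, here is a minimal sketch of the behaviour I am describing. resolve_device_map is a hypothetical helper, not the actual transformers code; only is_deepspeed_zero3_enabled comes from the library:

from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

def resolve_device_map(device_map, quantization_config):
    # Hypothetical helper illustrating the reported behaviour:
    # with a quantization config and DeepSpeed ZeRO-3 enabled,
    # weights are materialized (and quantized) on CPU rather than on GPU.
    if quantization_config is not None and is_deepspeed_zero3_enabled():
        return "cpu"
    return device_map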

Expected behavior

--

Hi @mxjmtxrm
Thanks for the issue! Do you have a small reproducer so we can better picture what is going on?

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_storage=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    torch_dtype=torch.float16,
    trust_remote_code=True,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)

The command is:

accelerate launch --config_file "configs/deepspeed_config_z3.yaml" test.py

And the deepspeed_config_z3.yaml is

compute_environment: LOCAL_MACHINE                                                                                                                                           
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

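To confirm that the ZeRO-3 code path is the one being taken, I added a quick check at the top of test.py (just a sketch):

from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

# Should print True when launched with the ZeRO-3 accelerate config above.
print("deepspeed zero3 enabled:", is_deepspeed_zero3_enabled())
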
The GPU memory usage during from_pretrained stays very low, as the quantization process is running on the CPU.
The same happens with other quantization methods, such as EETQ and AWQ.
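For what it's worth, this is roughly how I checked where the quantized weights end up (a sketch; Linear4bit comes from bitsandbytes, and model is the object returned by from_pretrained above):

import torch
import bitsandbytes as bnb

# Rough check of where quantization happened: report GPU memory and the
# device of the first 4-bit linear layer right after from_pretrained.
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        print(name, module.weight.device)
        break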