Can the BNB quantization process be on GPU?
mxjmtxrm opened this issue · 2 comments
mxjmtxrm commented
System Info
- transformers version: 4.41.0.dev0
- Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.21.4
- Safetensors version: 0.4.2
- Accelerate version: 0.28.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0a0+81ea7a4 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I noticed that when quantization_config is not None and is_deepspeed_zero3_enabled() is True, the device map is set to 'cpu', so the quantization process runs on the CPU.
Why is that? Can the quantization be run on the GPUs instead?
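A simplified sketch of the branch I mean (an illustrative paraphrase, not the exact transformers source; resolve_device_map is a hypothetical helper name):
from transformers.integrations import is_deepspeed_zero3_enabled

def resolve_device_map(device_map, quantization_config):
    # Illustrative only: under DeepSpeed ZeRO-3, a quantized load
    # gets forced onto CPU, so the quantization itself runs on CPU.
    if quantization_config is not None and is_deepspeed_zero3_enabled():
        return "cpu"
    return device_map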
Expected behavior
--
younesbelkada commented
Hi @mxjmtxrm
Thanks for the issue! Do you have a small reproducer so we can better picture what is going on?
mxjmtxrm commented
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_storage=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    torch_dtype=torch.float16,
    trust_remote_code=True,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)
The command is:
accelerate launch --config_file "configs/deepspeed_config_z3.yaml" test.py
And the deepspeed_config_z3.yaml is:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
The GPU memory usage during from_pretrained stays very low, since the quantization process is running on the CPU.
The same happens with other quantization methods, such as EETQ and AWQ.
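For illustration, a quick check along these lines (a sketch assuming the model object from the reproducer above) shows the placement and the low GPU allocation:
import torch

# Inspect where the loaded weights actually live; in this setup they
# end up on CPU instead of the GPUs.
print("parameter devices:", {p.device for p in model.parameters()})

# GPU allocation stays near zero while the quantization runs on CPU.
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")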