HanGuo97/flute

Only CUDA devices are supported, but got: {device} ({device.type})


Hi, after installing with pip install flute-kernel, I ran

    CUDA_VISIBLE_DEVICES=0 python -m flute.integrations.base --pretrained_model_name_or_path /extra_data/llama/Meta-Llama-3-8B-Instruct --save_directory /extra_data/llama/Meta-Llama-3-8B-Instruct-Flute --num_bits 4 --group_size 128

It printed the warnings below and then stopped. Could you tell me how to solve this? I am using an A6000 (48 GB).

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.71it/s]
/extra_data/miniconda3/envs/Ominiquant/lib/python3.10/site-packages/flute/integrations/base.py:46: UserWarning: Quantization always happen on 1st GPU
warnings.warn(f"Quantization always happen on 1st GPU")
/extra_data/miniconda3/envs/Ominiquant/lib/python3.10/site-packages/flute/utils.py:51: UserWarning: Only CUDA devices are supported, but got: cpu (cpu)
warnings.warn(f"Only CUDA devices are supported, but got: {device} ({device.type})")

Hi, thanks for trying it!

This is a benign warning; feel free to ignore it. (It was originally added for a different reason.) If it seems to stall, that's probably just because quantization takes some time.

But the process just quit...

Do you have any error message? Without more to go on, my a priori guess is that you hit a CPU OOM. (But you are only quantizing the 8B model, so that would be a bit odd.)

In that case, how could I put it on the GPU? I thought it was running on the GPU. No messages were printed; it just quit after the warnings.

Then maybe it actually finished successfully? (Try listing the directory you specified.)

Let me explain a bit of what's going on behind the scenes.

In order to prepare the model for FLUTE, we need to quantize it and apply some FLUTE-specific packing. To avoid GPU OOM, we put the model on the CPU first. Then, layer by layer, we move each layer to the GPU and quantize it. You could put the whole model on the GPU, but we made that choice so it also works well with the 70B model.
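
For intuition, here is a minimal sketch of that layer-by-layer pattern; quantize_and_pack is a hypothetical placeholder, not FLUTE's actual API:

    import torch

    # The full model stays on the CPU; each layer visits the GPU only while it is
    # being quantized, so peak GPU memory stays at roughly one layer's worth.
    def quantize_layers_one_at_a_time(layers, quantize_and_pack, device="cuda"):
        for layer in layers:
            layer.to(device)              # move just this layer to the GPU
            quantize_and_pack(layer)      # quantize/pack its weights in place
            layer.to("cpu")               # move it back to keep GPU memory bounded
            torch.cuda.empty_cache()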

Yes, it finished, but there is no tokenizer in the output directory, so I could not use the PPL script to get the PPL result.

Ah yeah, good catch! We will fix that once we have the Learned Normal Float Quantization code pushed into the codebase.

In the meantime, a simple workaround is to pass --tokenizer /extra_data/llama/Meta-Llama-3-8B-Instruct.

Thanks, but if I copy tokenizer.json / tokenizer_config.json from Meta-Llama-3-8B-Instruct, there are still errors:

    Traceback (most recent call last):
      File "/extra_data/datasets/evaluate/ppl_eval.py", line 72, in <module>
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="cuda")  # float16
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
        return model_class.from_pretrained(
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
        ) = cls._load_pretrained_model(
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
        new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
      File "/extra_data/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
        set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
      File "/extra_data/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 362, in set_module_tensor_to_device
        raise ValueError(
    ValueError: Trying to set a tensor of shape torch.Size([1024, 14336]) in "weight" (which has shape torch.Size([4096, 14336])), this look incorrect.

Unfortunately, we don't support "loading" quantized models in HF. (Mostly because we target inference on other platforms such as vLLM.) That being said, there's a simple workaround we used internally for prototyping: you can quantize the model using the Python API and use it in the same Python session. The tricky part is making sure the model loads on the GPU, not the CPU, by default.

@radi-cho should have the code snippet. I'm away from my laptop right now (it's midnight in my timezone), but I can send you the code in the morning.

So how can I get the PPL for wikitext2? Thanks!

If you use the Python API directly, I believe it should work. For example,

    import torch
    from transformers import AutoModelForCausalLM, LlamaForCausalLM, Gemma2ForCausalLM
    from flute.integrations.base import prepare_model_flute

    pretrained_model_name_or_path = "/extra_data/llama/Meta-Llama-3-8B-Instruct"

    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path,
        device_map="cpu",  # <-- replace this with "cuda"/"auto" so the model lives on the GPU
        torch_dtype=torch.bfloat16)

    # Quantize and pack the decoder layers in place (same settings as your CLI run above).
    if isinstance(model, (LlamaForCausalLM, Gemma2ForCausalLM)):
        prepare_model_flute(
            module=model.model.layers,
            num_bits=4,
            group_size=128,
            fake=False)  # False = real (packed) quantization rather than simulated
    else:
        raise NotImplementedError

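Once prepare_model_flute returns, the weights in that session are already FLUTE-packed, so you can point your perplexity script at this in-memory model instead of re-loading the saved directory with from_pretrained.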
Closing the issue as I assume this is fixed. Feel free to reopen if you still need help!

@LiMa-cas For perplexity calculation, you can follow this Hugging Face example. It should work the same way for a quantized or dense model.
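
For completeness, here is a minimal sketch of that sliding-window perplexity evaluation on WikiText-2, along the lines of the Hugging Face guide. It assumes model and tokenizer are already loaded in the current session (e.g., via the snippet above); the max_length and stride values are illustrative choices, not prescribed settings.

    import torch
    from tqdm import tqdm
    from datasets import load_dataset

    # Assumes `model` (quantized in this session) and `tokenizer` already exist.
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

    max_length = 2048  # evaluation context window (illustrative)
    stride = 512       # sliding-window stride (illustrative)
    seq_len = encodings.input_ids.size(1)

    nlls = []
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len, stride)):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # only score tokens not counted in the previous window
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100   # mask out the overlapping prefix

        with torch.no_grad():
            neg_log_likelihood = model(input_ids, labels=target_ids).loss

        nlls.append(neg_log_likelihood)
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

    ppl = torch.exp(torch.stack(nlls).mean())
    print(f"WikiText-2 perplexity: {ppl.item():.3f}")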