juncongmoo/pyllama

Can't Load Quantized Model with GPTQ-for-LLaMa

chigkim opened this issue · 2 comments

I was able to convert the LLaMA weights, quantize them, and run inference using qwopqwop200/GPTQ-for-LLaMa.
However, I can't load the quantized model with pyllama. The steps I followed are below.
Thanks!

  1. Clone: git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
  2. Install: pip install -r requirements.txt
  3. setup_cuda: python setup_cuda.py install
  4. Convert the LLaMA weights: python convert_llama_weights_to_hf.py --input_dir ./7B --model_size 7B --output_dir ./llama-hf
  5. Quantize: CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save ./llama-hf/llama7b-4bit-128g.pt
  6. Inference: CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load ./llama-hf/llama7b-4bit-128g.pt --text "this is llama"
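
For reference, the checkpoint written in step 5 can be inspected as a plain PyTorch state dict. This is only a sanity-check sketch: it assumes the .pt file was saved with torch.save, and the exact tensor names (qweight, scales, etc.) depend on the GPTQ-for-LLaMa version, so they are not guaranteed.

  # Sanity-check sketch for the quantized checkpoint from step 5.
  # Assumption: the .pt file is a state dict written with torch.save.
  import torch

  state_dict = torch.load("./llama-hf/llama7b-4bit-128g.pt", map_location="cpu")
  print(f"{len(state_dict)} tensors in checkpoint")
  for name, tensor in list(state_dict.items())[:5]:
      print(name, tuple(tensor.shape), tensor.dtype)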

However, when I try to load the model with pyllama like this, I get an error:

python inference.py --ckpt_dir ./llama-hf/llama-7b --tokenizer_path ./llama-hf/llama-7b/tokenizer.model
Traceback (most recent call last):
  File "/content/text-generation-webui/repositories/pyllama/inference.py", line 82, in <module>
    run(
  File "/content/text-generation-webui/repositories/pyllama/inference.py", line 50, in run
    generator = load(
  File "/content/text-generation-webui/repositories/pyllama/inference.py", line 17, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 1
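
For what it's worth, the MP=0 in the assertion suggests the loader found zero checkpoint shards. pyllama's example loader, like Meta's original LLaMA code, appears to glob the checkpoint directory for consolidated *.pth shards and assert that their count matches the model-parallel world size; an HF-converted directory (config.json plus pytorch_model-*.bin) contains no *.pth files, so the count comes back as 0. A minimal sketch of that check, under those assumptions:

  # Sketch of the failing check, assuming the loader globs for *.pth shards
  # (as the original LLaMA example code does); paths are the ones from above.
  from pathlib import Path

  ckpt_dir = Path("./llama-hf/llama-7b")
  world_size = 1  # single process / single GPU
  checkpoints = sorted(ckpt_dir.glob("*.pth"))
  # The HF-converted directory holds pytorch_model-*.bin shards instead,
  # so len(checkpoints) == 0, which shows up as "MP=0" in the message.
  assert world_size == len(checkpoints), (
      f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
  )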

Also, I tried to run inference with the quantized model like this, but I get the following error:
python quant_infer.py --wbits 4 --load ./llama-hf/llama7b-4bit-128g.pt --text "The meaning of life is" --max_length 24 --cuda cuda:0
I attached the log because it's too long.
log (1).txt

What OS are you using?

Colab Linux