Can't Load Quantized Model with GPTQ-for-LLaMa
chigkim opened this issue · 2 comments
chigkim commented
I was able to convert the LLaMA weights, quantize them, and run inference using qwopqwop200/GPTQ-for-LLaMa.
However, I can't load the result with pyllama.
Below are the steps I followed and the errors I get. Thanks!
- Clone:
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
- Install:
pip install -r requirements.txt
- Build the CUDA extension:
python setup_cuda.py install
- Convert the LLaMA weights to Hugging Face format:
python convert_llama_weights_to_hf.py --input_dir ./7B --model_size 7B --output_dir ./llama-hf
- Quantize:
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save ./llama-hf/llama7b-4bit-128g.pt
- Inference:
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load ./llama-hf/llama7b-4bit-128g.pt --text "this is llama"
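For reference, if I understand the GPTQ-for-LLaMa code correctly, the quantize step just saves the model's state dict with torch.save, so the resulting .pt file can be inspected with plain PyTorch. A quick sanity-check sketch (the key names mentioned in the comment are what I'd expect from the cuda branch and may differ):
# Sketch: inspect the quantized checkpoint written by llama.py --save.
# It should be an ordinary PyTorch state dict; quantized layers typically carry
# packed tensors such as qweight / qzeros / scales (names may vary by branch).
import torch
sd = torch.load("./llama-hf/llama7b-4bit-128g.pt", map_location="cpu")
for name in list(sd.keys())[:10]:
    print(name, tuple(sd[name].shape), sd[name].dtype)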
However, when I try to load the model using pyllama like this, I get an error:
python inference.py --ckpt_dir ./llama-hf/llama-7b --tokenizer_path ./llama-hf/llama-7b/tokenizer.model
Traceback (most recent call last):
File "/content/text-generation-webui/repositories/pyllama/inference.py", line 82, in <module>
run(
File "/content/text-generation-webui/repositories/pyllama/inference.py", line 50, in run
generator = load(
File "/content/text-generation-webui/repositories/pyllama/inference.py", line 17, in load
assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 1
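If I'm reading the traceback right, pyllama's inference.py expects the original Meta checkpoint layout (consolidated *.pth shards plus params.json) in --ckpt_dir, so it finds zero checkpoint files in the HF-converted directory. A rough reconstruction of the failing check, based only on the error message rather than the actual source:
# Sketch of the check that raises the AssertionError above (reconstructed, not verbatim).
# The HF-converted directory has no *.pth shards, so len(checkpoints) == 0
# and the message reads "MP=0" while world_size is 1.
from pathlib import Path
def load(ckpt_dir: str, world_size: int = 1):
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert world_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
    )
    # ... model construction continues here in the real script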
I also tried to run inference on the quantized model with quant_infer.py like this, but that fails as well.
python quant_infer.py --wbits 4 --load ./llama-hf/llama7b-4bit-128g.pt --text "The meaning of life is" --max_length 24 --cuda cuda:0
The error output is too long to paste here, so I've attached the log:
log (1).txt
george-adams1 commented
What OS are you using?
chigkim commented
Linux, on Google Colab.