Already quantized to 4-bit and got the model pyllama-7B4b.pt, but it cannot run on an RTX 3080. Reports torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 10.00 GiB total capacity; 9.24 GiB already allocated;
elven2016 opened this issue · 2 comments
The error is as follows:
python webapp_single.py --ckpt_dir $CKPT_DIR --tokenizer_path $TOKENIZER_PATH
Traceback (most recent call last):
File "/home/xxxx/chatllama/pyllama/apps/gradio/webapp_single.py", line 80, in
generator = load(
File "/home/u/chatllama/pyllama/apps/gradio/webapp_single.py", line 42, in load
model = Transformer(model_args)
File "/home/xxxx/miniconda3/envs/chatllama/lib/python3.10/site-packages/llama/model_single.py", line 199, in init
self.layers.append(TransformerBlock(layer_id, params))
File "/home/xxxx/miniconda3/envs/chatllama/lib/python3.10/site-packages/llama/model_single.py", line 167, in init
self.feed_forward = FeedForward(
File "/home/xxxx/miniconda3/envs/chatllama/lib/python3.10/site-packages/llama/model_single.py", line 154, in init
self.w3 = nn.Linear(dim, hidden_dim, bias=False)
File "/home/xxxx/miniconda3/envs/chatllama/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in init
self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 10.00 GiB total capacity; 9.24 GiB already allocated; 0 bytes free; 9.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
du -sh pyllama-7B4b.pt
3.6G pyllama-7B4b.pt
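For what it's worth, the OOM is raised while the unquantized Transformer is still being built, before the 3.6G checkpoint is even read: load() in webapp_single.py constructs every nn.Linear with full fp16 weights directly on the GPU (the 86.00 MiB allocation in the trace is exactly one 4096x11008 fp16 matrix), and ~7B parameters in fp16 need roughly 13 GiB, which cannot fit in the 3080's 10 GiB. Below is a minimal sketch of the arithmetic plus the generic "meta device" trick that defers allocation during construction; whether pyllama's 4-bit loader can then populate such a model is an assumption on my part, the point is only that the crash happens at construction time:

```python
import torch
import torch.nn as nn

# Rough memory check: fp16 weights alone exceed a 10 GiB card.
n_params = 6.7e9                                         # approx. LLaMA-7B parameter count
print(f"fp16 weights: {n_params * 2 / 2**30:.1f} GiB")   # ~12.5 GiB > 10 GiB

# With a recent PyTorch, layers built under the meta device allocate no real
# storage, so this step cannot OOM; real weights would have to be materialized
# later from the (quantized) checkpoint.
with torch.device("meta"):
    w3 = nn.Linear(4096, 11008, bias=False)              # the exact layer from the traceback
print(w3.weight.device)                                   # -> meta
```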
I think you need to use the HF version instead of the Meta version for the quantized models.
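If you go the HF route, one common setup (an assumption on my part, not necessarily the loader pyllama expects for its own .pt checkpoint) is to let transformers/bitsandbytes quantize to 4-bit at load time, so full-precision weights never land on the GPU. The model path below is a placeholder:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")   # placeholder path
model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",              # placeholder: HF-format LLaMA weights
    quantization_config=bnb_config,     # quantize layer by layer while loading
    device_map="auto",                  # place modules on the GPU as they fit
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```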
Did you find a way to run it on the 3080?