Minimum GPU device requirement for inference (with OOM issue)
nigue3025 opened this issue · 2 comments
Hi,
I am very new to this.
I only have a single RTX 2080 Ti (with just 11GB of VRAM) to run the model with text-generation-inference.
After executing the .sh file, GPU memory consumption gradually increases,
and the message "waiting for shard to be ready ... rank=0" appears repeatedly.
Finally, it ends with "torch.cuda.OutOfMemoryError: CUDA out of memory...".
I attempted to set PYTORCH_CUDA_ALLOC_CONF to different values, but it still does not work.
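For example, one of the variants I tried looked like this before launching the .sh file (the exact value here is just illustrative):
# example value only; I tried several different settings here
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512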
Does that mean I have to upgrade to a card with more VRAM (e.g. an RTX 4090 with 24GB) if I insist on running this 13B model (rather than a 4-bit GPTQ model) on GPU?
Any advice would be appreciated.
I recommend 4-bit quantization when using low-memory GPUs, e.g.
docker run --gpus 'device=0' -p 8085:80 \
-v ./Models:/Models \
ghcr.io/huggingface/text-generation-inference:sha-5485c14 \
--model-id /Models/TaiwanLlama-13B \
--quantize "bitsandbytes-nf4" \
--max-input-length 1500 \
--max-total-tokens 2000 \
--max-batch-prefill-tokens 1500 \
--max-batch-total-tokens 2000 \
--max-best-of 1 \
--max-concurrent-requests 128
Since I don't possess an 11GB RTX GPU, I simulated this situation using the --cuda-memory-fraction 0.45 parameter. In this scenario, it consumes about 10,000 MiB of my GPU memory. I believe TGI's paged attention mechanism will then consume all of the remaining GPU memory.
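Once the shard is ready, you can sanity-check the server with a request like the following (port 8085 as mapped above; the prompt and token count are just examples):
curl 127.0.0.1:8085/generate \
-X POST \
-d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":50}}' \
-H 'Content-Type: application/json'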
Great, it works! BTW, I slightly modified the parameters as follows to make the RTX 2080 Ti work:
--max-input-length 1000
--max-total-tokens 1500
--max-batch-prefill-tokens 1000
--max-batch-total-tokens 1500
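For reference, that is the same docker run command as above, with only those four limits lowered:
docker run --gpus 'device=0' -p 8085:80 \
-v ./Models:/Models \
ghcr.io/huggingface/text-generation-inference:sha-5485c14 \
--model-id /Models/TaiwanLlama-13B \
--quantize "bitsandbytes-nf4" \
--max-input-length 1000 \
--max-total-tokens 1500 \
--max-batch-prefill-tokens 1000 \
--max-batch-total-tokens 1500 \
--max-best-of 1 \
--max-concurrent-requests 128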
Once again, thanks for your kind help!!