MiuLab/Taiwan-LLM

Minimum GPU device requirement for inference (with OOM issue)

nigue3025 opened this issue · 2 comments

Hi,
I am very new to this.
I only have a single RTX 2080 Ti (11 GB of VRAM) to run the model with text-generation-inference.
After executing the .sh file, GPU memory consumption gradually increases, and the message "waiting for shard to be ready ... rank=0" keeps appearing.
It finally ends with "torch.cuda.OutOfMemoryError: CUDA out of memory...".
I tried setting PYTORCH_CUDA_ALLOC_CONF to different values, but it still does not work.
Does this mean I have to upgrade to a card with more VRAM (e.g. an RTX 4090 with 24 GB) if I want to run this 13B model on the GPU (rather than the 4-bit GPTQ model)?
Any advice would be appreciated.
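For reference, this is roughly how I set it before launching (the value here is just one of the guesses I tried, not a recommendation):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# when TGI runs inside Docker, the variable has to be passed into the container instead,
# e.g. by adding -e PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 to the docker run command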

I recommend 4-bit quantization when using low-memory GPUs, e.g.

docker run --gpus 'device=0' -p 8085:80 \
    -v ./Models:/Models \
    ghcr.io/huggingface/text-generation-inference:sha-5485c14 \
    --model-id /Models/TaiwanLlama-13B \
    --quantize "bitsandbytes-nf4" \
    --max-input-length 1500 \
    --max-total-tokens 2000 \
    --max-batch-prefill-tokens 1500 \
    --max-batch-total-tokens 2000 \
    --max-best-of 1 \
    --max-concurrent-requests 128
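
Once the shard is ready, you can sanity-check the endpoint with a plain HTTP request (the prompt is just an example):

curl http://localhost:8085/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hello, please introduce yourself.", "parameters": {"max_new_tokens": 64}}'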

Since I don't have an 11 GB RTX GPU, I simulated this situation using the --cuda-memory-fraction 0.45 parameter. In this scenario, it consumes about 10,000 MiB of my GPU memory; I believe TGI's paged attention mechanism will consume all of the remaining GPU memory.
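
As a rough back-of-envelope check (weights only, ignoring the KV cache and activations, and assuming ~13B parameters):

# fp16 weights: ~13e9 params * 2 bytes   ≈ 24 GiB  -> far beyond an 11 GB card
# nf4 weights:  ~13e9 params * 0.5 bytes ≈ 6 GiB   -> leaves headroom for the KV cache
python3 -c "print('fp16: %.1f GiB' % (13e9*2/2**30)); print('nf4 : %.1f GiB' % (13e9*0.5/2**30))"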

Great, it works! By the way, I slightly modified the parameters as follows to make it run on the RTX 2080 Ti (the full command is sketched below):
--max-input-length 1000
--max-total-tokens 1500
--max-batch-prefill-tokens 1000
--max-batch-total-tokens 1500
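
Putting it together, the full command I ended up running looks roughly like this (same image, port, and model path as in your example, with only the four token limits changed):

docker run --gpus 'device=0' -p 8085:80 \
    -v ./Models:/Models \
    ghcr.io/huggingface/text-generation-inference:sha-5485c14 \
    --model-id /Models/TaiwanLlama-13B \
    --quantize "bitsandbytes-nf4" \
    --max-input-length 1000 \
    --max-total-tokens 1500 \
    --max-batch-prefill-tokens 1000 \
    --max-batch-total-tokens 1500 \
    --max-best-of 1 \
    --max-concurrent-requests 128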

Once again, thanks for your kind help!!