microsoft/DeepSpeedExamples

CPU OOM when running inference on Llama3-70B-Chinese-Chat

GORGEOUSLCX opened this issue · 0 comments

Code: text-generation demo
Command:
deepspeed --num_gpus 2 inference-test.py --dtype float16 --batch_size 4 --max_new_tokens 200 --model ../Llama3-70B-Chinese-Chat
Hardware: two A100 80GB GPUs, 250 GB of CPU RAM
Problem: When DeepSpeed loads the float16 model it consumes far too much CPU memory, and 250 GB of RAM is not enough to load the 70B model. I suspect each of the two ranks launched by `deepspeed --num_gpus 2` loads the full fp16 checkpoint into CPU RAM before partitioning it, i.e. roughly 2 × 140 GB (70B parameters × 2 bytes) ≈ 280 GB, which exceeds the 250 GB available. When I instead load the model with plain Transformers, `model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")`, inference runs without exhausting CPU memory.
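For reference, this is the Transformers path that works on my machine. As far as I understand, passing `device_map="auto"` implies `low_cpu_mem_usage=True`, so checkpoint shards are streamed directly onto the GPUs instead of being fully materialized in CPU memory first:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "../Llama3-70B-Chinese-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" streams the checkpoint shards onto the two GPUs,
# so the ~140 GB of fp16 weights never sits in CPU RAM all at once.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```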
How can I reduce CPU memory usage when loading the model with DeepSpeed?
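One direction that might help (a minimal sketch, not verified on this model): DeepSpeed inference supports meta-tensor loading, where each rank builds the model skeleton on the `meta` device (so no CPU RAM is allocated for weights) and `deepspeed.init_inference` then loads only that rank's partition from the checkpoint files. The `checkpoints.json` contents below, in particular the `"type"` and `"version"` fields, are an assumption on my part; the exact format expected for a given model is described in the DeepSpeed inference tutorial.

```python
import json
import os
from pathlib import Path

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "../Llama3-70B-Chinese-Chat"
world_size = int(os.getenv("WORLD_SIZE", "2"))

# Build the model on the meta device: parameters are shape-only
# placeholders, so no CPU memory is spent on weights here.
config = AutoConfig.from_pretrained(model_id)
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# checkpoints.json lists the checkpoint shard files. The "type" and
# "version" values are assumptions -- check the DeepSpeed inference
# tutorial for the format your model requires.
ckpt_files = sorted(str(p) for p in Path(model_id).glob("*.safetensors"))
ckpt_json = "checkpoints.json"
with open(ckpt_json, "w") as f:
    json.dump({"type": "ds_model", "checkpoints": ckpt_files, "version": 1.0}, f)

# Each rank now loads only its own tensor-parallel slice of the weights
# instead of the full 140 GB checkpoint.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # assumes kernel injection supports Llama; otherwise try auto-TP without injection
    checkpoint=ckpt_json,
)
```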