FMInference/FlexLLMGen

Peak GPU memory usage does not scale linearly with the weight GPU percentage


Command 1:

```
python -m flexgen.flex_opt --model facebook/opt-30b --path _DUMMY_ --prompt-len 20 --gen-len 15 --percent 25 75 60 40 0 100 --gpu-batch-size 1 --num-gpu-batches 2 --cpu-cache-compute --debug fewer_batch
```

peak gpu mem: 6.0679 GB

Command 2:

```
python -m flexgen.flex_opt --model facebook/opt-30b --path _DUMMY_ --prompt-len 20 --gen-len 15 --percent 30 70 60 40 0 100 --gpu-batch-size 1 --num-gpu-batches 2 --cpu-cache-compute --debug fewer_batch
```

GPU OOM
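
For context, the six values of `--percent` are, per the FlexGen README, the GPU/CPU split for weights, KV cache, and activations, in that order (whatever a pair does not cover spills to disk). A minimal sketch of that mapping for the two commands above:

```python
# Sketch of the --percent layout, following the order documented in the
# FlexGen README: weight GPU/CPU, KV-cache GPU/CPU, activation GPU/CPU
# (any remainder of a pair is offloaded to disk).
fields = ["weight_gpu", "weight_cpu",
          "cache_gpu", "cache_cpu",
          "act_gpu", "act_cpu"]
command_1 = dict(zip(fields, [25, 75, 60, 40, 0, 100]))
command_2 = dict(zip(fields, [30, 70, 60, 40, 0, 100]))
# Only the weight GPU/CPU split differs between the two commands.
print(command_1)
print(command_2)
```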

The only difference between command 2 and command 1 is that the weight GPU percentage (the first value of `--percent`) increases from 25% to 30%.

My GPU has 24 GB of memory.
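
As a back-of-envelope check (my own estimate, assuming FP16 weights, i.e. OPT-30B ≈ 30e9 params × 2 bytes ≈ 60 GB in total): a 5-point increase in the weight GPU percentage should add only about 3 GB if memory scaled linearly, pushing the 6.07 GB peak to roughly 9 GB, which is well under the 24 GB capacity, yet command 2 runs out of memory:

```python
# Hedged back-of-envelope estimate: assumes FP16 (2-byte) weights and
# ~30e9 parameters for OPT-30B; exact sizes may differ in practice.
total_weight_gb = 30e9 * 2 / 1e9               # ~60 GB of weights overall
delta_gb = total_weight_gb * (30 - 25) / 100   # extra weights from +5 points
expected_peak_gb = 6.0679 + delta_gb           # if scaling were linear
print(f"extra weight memory: ~{delta_gb:.0f} GB")
print(f"expected peak: ~{expected_peak_gb:.1f} GB (vs. 24 GB capacity)")
```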