OpenBMB/CPM-Live

about serving

xv44586 opened this issue · 4 comments

I want to serve a cpm-plus-10b model, but everything I have tried has failed.
When I wrap the model with bminf it is too slow and always OOMs; since I have 4 GPUs, I then tried wrapping the model with deepspeed, but that failed too.
Is there any example code for serving?

Hi,
What GPUs do you have?
BMInf is designed for low-resource scenarios, especially for people who only have one GPU with limited memory.
For the OOM issue, you should set the memory-limit (here) to be smaller than the actual GPU memory, since some intermediate results also need to be stored.
For the speed issue, one option is to set quantization=True to enable model quantization; however, this may result in some performance loss.
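
For reference, a minimal sketch of what the BMInf wrapper setup looks like (assuming the BMInf 2.x Python API, where the limit is given in bytes; the model-loading line is a placeholder for however you currently load CPM-Ant+):

```python
import torch
import bminf

model = ...  # load CPM-Ant+ here as you normally do (placeholder)

with torch.cuda.device(0):
    model = bminf.wrapper(
        model,
        quantization=True,    # int8 quantization: faster and smaller, with some quality loss
        memory_limit=4 << 30, # cap BMInf's parameter memory on the GPU at ~4 GB; the rest is offloaded
    )
```

Leaving a few GB of headroom below the physical 16 GB is what avoids the OOM from intermediate activations.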

I have 4 T4 (16G) GPUs. I tried setting memory-limit=4, but it is too slow, and when the input text length is longer than 200 it sometimes OOMs.
Can I use deepspeed or another distributed approach to run inference with this model, and how should I do that?

I tried to reproduce the OOM issue by setting memory-limit=4 with input length over 500, and the peak GPU memory did not exceed 10G. Is your batch size greater than 1?
Moreover, we have developed a super fast and stable inference system and will adapt it to CPM-Bee.
For now, if you want to try other distributed inference methods, you can just treat CPM-Ant+ as a normal PyTorch model and follow their instructions. Further feedback on adaptation solutions is welcome!
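
If you do go the DeepSpeed route, here is a rough sketch of tensor-parallel inference (not an official recipe: mp_size=4 is assumed to match your four T4s, kernel injection is disabled because it is unlikely to support CPM-Ant+'s custom layers, and serve.py is a hypothetical script name):

```python
import torch
import deepspeed

model = ...  # load CPM-Ant+ as a plain PyTorch model (placeholder)

# Shard the model across the 4 GPUs with tensor parallelism.
model = deepspeed.init_inference(
    model,
    mp_size=4,                        # one shard per T4
    dtype=torch.half,                 # fp16 so 10B parameters fit in 4 x 16 GB
    replace_with_kernel_inject=False, # no fused kernels for custom model classes
)

# Launch with: deepspeed --num_gpus 4 serve.py
```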

Thanks, I will try keeping the batch size at 1.