about serving
xv44586 opened this issue · 4 comments
I want to serve a cpm-plus-10b model, but all my attempts have failed.
When I use the BMInf wrapper, it is too slow and always runs out of memory. Since I have 4 GPUs, I then tried wrapping the model with DeepSpeed, but that failed too.
Is there any example serving code?
Hi,
What GPUs do you have?
BMInf is designed for low-resource scenarios, especially for people who only have one GPU with limited memory.
For the OOM issue, you should set the memory-limit
(here) to be smaller than the actual GPU memory, as some intermediate results also need to be stored.
For the speed issue, one option is to set quantization=True
to enable model quantization; however, this may cause some loss in output quality.
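A minimal sketch of what the wrapper call could look like, assuming the model is already built as a regular PyTorch module (the `load_cpm_ant_plus` helper below is a placeholder, and BMInf's `memory_limit` argument is commonly given in bytes, so check which unit your launcher's memory-limit flag maps to):

```python
import bminf

# Hypothetical sketch: wrap an already-loaded CPM-Ant+ model with BMInf.
model = load_cpm_ant_plus()       # placeholder: build/load the PyTorch model as usual

model = bminf.wrapper(
    model,
    quantization=True,            # int8 quantization: faster and smaller, some quality loss
    memory_limit=8 << 30,         # keep below the physical GPU memory (here: 8 GiB)
)
# Use `model` for generation exactly as before; BMInf schedules layers
# to stay within the given memory budget.
```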
I have 4 T4 (16 GB) GPUs. I tried setting memory-limit=4, but it is still too slow, and when the input text is longer than 200 it sometimes OOMs.
Can I use DeepSpeed or another distributed method to run inference with this model? If so, how?
I tried to reproduce the OOM issue by setting memory-limit=4
with input length over 500, and the peak GPU memory did not exceed 10 GB. Is your batch size greater than 1?
Moreover, we have developed a very fast and stable inference system and will adapt it to CPM-Bee.
For now, if you want to try other distributed inference methods, you can just treat CPM-Ant+ as a normal PyTorch model and follow their instructions (see the sketch below). Further feedback on adaptation solutions is welcome!
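For example, a rough sketch of running the model through DeepSpeed's inference engine across the 4 T4s, assuming the checkpoint is already loaded into a plain PyTorch model (`load_cpm_ant_plus` is a placeholder, and the `mp_size`/kernel-injection settings may need tuning for this architecture):

```python
import torch
import deepspeed

# Hypothetical sketch: tensor-parallel inference over 4 GPUs.
# Launch with: deepspeed --num_gpus 4 infer.py
model = load_cpm_ant_plus()              # placeholder: build/load the PyTorch model as usual

engine = deepspeed.init_inference(
    model,
    mp_size=4,                           # tensor-parallel degree = number of T4s
    dtype=torch.half,                    # fp16 so a 10B model fits across 16 GB cards
    replace_with_kernel_inject=False,    # custom architectures often need this disabled
)
# `engine` (or `engine.module`) is called the same way as the original model.
```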
Thanks, I will try it with batch size always equal to 1.