about serving
xv44586 opened this issue · 4 comments
I want to serve a cpm-plus-10b model, but all my attempts have failed.
When I use the BMInf wrapper, it is too slow and always runs out of memory. Since I have 4 GPUs, I then tried wrapping the model with DeepSpeed, but that failed too.
Is there any example serving code?
Hi,
What GPUs do you have?
BMInf is designed for low-resource scenarios, especially for people who only have one GPU with limited memory.
For the OOM issue, you should set the memory-limit
(here) to be smaller than the actual GPU memory, as some intermediate results also need to be stored.
For the speed issue, one option is to set quantization=True
to enable model quantization; however, this may cause some loss in output quality.
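A minimal sketch of what the wrapper call could look like, assuming the model is already built as a regular PyTorch module (the `load_cpm_ant_plus` helper below is a placeholder, and BMInf's `memory_limit` argument is commonly given in bytes, so check which unit your launcher's memory-limit flag maps to):

```python
import bminf

# Hypothetical sketch: wrap an already-loaded CPM-Ant+ model with BMInf.
model = load_cpm_ant_plus()       # placeholder: build/load the PyTorch model as usual

model = bminf.wrapper(
    model,
    quantization=True,            # int8 quantization: faster and smaller, some quality loss
    memory_limit=8 << 30,         # keep below the physical GPU memory (here: 8 GiB)
)
# Use `model` for generation exactly as before; BMInf schedules layers
# to stay within the given memory budget.
```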
I have 4 T4 (16 GB) GPUs. I tried setting memory-limit=4, but it is still too slow, and when the input text is longer than 200 it sometimes OOMs.
Can I use DeepSpeed or another distributed method to run inference with this model? If so, how?
I tried to reproduce the OOM issue by setting memory-limit=4
with input length over 500, and the peak GPU memory did not exceed 10 GB. Is your batch size greater than 1?
Moreover, we have developed a very fast and stable inference system and will adapt it to CPM-Bee.
For now, if you want to try other distributed inference methods, you can just treat CPM-Ant+ as a normal PyTorch model and follow their instructions (see the sketch below). Further feedback on adaptation solutions is welcome!
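For example, a rough sketch of running the model through DeepSpeed's inference engine across the 4 T4s, assuming the checkpoint is already loaded into a plain PyTorch model (`load_cpm_ant_plus` is a placeholder, and the `mp_size`/kernel-injection settings may need tuning for this architecture):

```python
import torch
import deepspeed

# Hypothetical sketch: tensor-parallel inference over 4 GPUs.
# Launch with: deepspeed --num_gpus 4 infer.py
model = load_cpm_ant_plus()              # placeholder: build/load the PyTorch model as usual

engine = deepspeed.init_inference(
    model,
    mp_size=4,                           # tensor-parallel degree = number of T4s
    dtype=torch.half,                    # fp16 so a 10B model fits across 16 GB cards
    replace_with_kernel_inject=False,    # custom architectures often need this disabled
)
# `engine` (or `engine.module`) is called the same way as the original model.
```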
Thanks, I will try it with batch size always equal to 1.