Does Zero-Inference support TP?
preminstrel opened this issue · 11 comments
ZeRO-Inference is composable with Megatron-style TP. That is, the TP is implemented on the client side.
Hello, is that composable with KV cache offloading? I cannot find its API... @tjruwase Thanks!
I mean, offloading only the KV cache while keeping the whole model weights on the GPUs. All the example code looks like it is for a single GPU.
I assume you are referring to KV cache offloading in the latest ZeRO-Inference. We did not evaluate it with TP, but I expect it should work.
Thanks! But how can I make it work? Do you have an example command?
I tried setting num_gpus to 2, but it seems to create two identical copies of the model, one on each GPU.
This is because your model has not been pre-processed by a TP framework like Megatron. ZeRO-Inference does not perform TP slicing on any model.
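To make the distinction concrete, below is a minimal sketch (plain PyTorch, hypothetical helper name, not a ZeRO-Inference or Megatron API) of the column-parallel slicing a TP framework applies to each projection before the model ever reaches ZeRO-Inference. Without that pre-processing step, launching with num_gpus=2 simply loads one full replica per rank.

```python
import torch

def column_parallel_slice(linear: torch.nn.Linear, rank: int, world_size: int) -> torch.nn.Linear:
    """Keep only this rank's 1/world_size slice of the output features.

    A TP framework (e.g. Megatron) performs this kind of slicing for every
    attention/MLP projection; ZeRO-Inference never splits the weights itself.
    """
    out_per_rank = linear.out_features // world_size
    start = rank * out_per_rank
    shard = torch.nn.Linear(linear.in_features, out_per_rank,
                            bias=linear.bias is not None)
    with torch.no_grad():
        # nn.Linear stores weight as (out_features, in_features), so a
        # column-parallel shard is a contiguous block of rows.
        shard.weight.copy_(linear.weight[start:start + out_per_rank, :])
        if linear.bias is not None:
            shard.bias.copy_(linear.bias[start:start + out_per_rank])
    return shard
```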
> Thanks! But how can I make it work? Do you have an example command?
Below are commands for single-gpu inference with kv-cache-offload.
https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference#token-generation-with-zero-inference
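For completeness, the general single-GPU ZeRO-Inference setup (ZeRO stage 3 with parameter offload, as documented in the HuggingFace/DeepSpeed integration docs) looks roughly like the sketch below. The KV cache offload itself is wired in by the linked example script through its own command-line flags, so treat the model name and config values here as illustrative only.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_name = "facebook/opt-1.3b"  # illustrative model choice

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-Inference builds on stage 3
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) before from_pretrained so that the
# weights are loaded in a ZeRO-3 friendly way.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

tok = AutoTokenizer.from_pretrained(model_name)
inputs = tok("DeepSpeed ZeRO-Inference is", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```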
Yes, you are right! Thanks! The performance of single-GPU inference with KV cache offload is really nice. But I have a question:
I found that the fork of transformers allocates the KV cache buffer in a way that does not seem compatible with TP: it still allocates all self.num_heads heads for the KV cache on each GPU.
So basically there is no official implementation of TP + ZeRO-Inference + KV offload that I can run directly. Please correct me if I am wrong.
Are you planning to add this feature in the future? Btw, will TP help under this setting (since the attention computation is all on the CPU anyway)?
Thanks!
Glad that the KV cache offload performance is good for your scenario.
Yes, you are correct that there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have the bandwidth for this right now, but we welcome community contributions.
Yes, I agree that TP won't add much benefit on top of KV offload since (1) the GPU memory pressure is already mostly relieved, and (2) the attention computation runs on the CPU.
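To illustrate the point about self.num_heads: a TP-aware version of the cache would only need to hold the per-rank share of the heads, roughly as in the sketch below (hypothetical function and cache layout, not the transformers fork's actual code).

```python
import torch

def alloc_kv_cache(num_layers, batch, max_seq, num_heads, head_dim,
                   tp_world_size=1, device="cpu", dtype=torch.float16):
    """Illustrative KV cache allocation.

    With tp_world_size=1 this mirrors the behavior described above: the full
    num_heads worth of cache is kept per process. A TP-aware version would
    keep only num_heads // tp_world_size heads per rank, shrinking the
    per-rank buffer, but the attention over the offloaded cache still runs
    on the CPU, which is why TP adds little end-to-end benefit here.
    """
    heads_per_rank = num_heads // tp_world_size
    # 2 = separate key and value buffers per layer.
    shape = (num_layers, 2, batch, heads_per_rank, max_seq, head_dim)
    return torch.empty(shape, device=device, dtype=dtype)
```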
Thank you very much! Nice work!