Does Zero-Inference support TP?
preminstrel opened this issue · 11 comments
ZeRO-Inference is composable with Megatron-style TP. That is, the TP is implemented on the client side.
Hello, is that composable with KV cache offloading? I cannot find its API... @tjruwase Thanks!
I mean, offloading only the KV cache while keeping the whole model weights on the GPUs. All the example code looks like it is for a single GPU.
I assume you are referring to KV cache offloading in the latest ZeRO-Inference. We did not evaluate it with TP, but I expect it should work.
Thanks! But how can I make it work? Do you have an example command?
I tried setting num_gpus to 2, but it seems to create two identical copies of the model, one on each GPU.
This is because your model has not been pre-processed by a TP framework like Megatron. ZeRO-Inference does not perform TP slicing on any model.
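To make the distinction concrete, below is a minimal sketch (plain PyTorch, hypothetical helper name, not a ZeRO-Inference or Megatron API) of the column-parallel slicing a TP framework applies to each projection before the model ever reaches ZeRO-Inference. Without that pre-processing step, launching with num_gpus=2 simply loads one full replica per rank.

```python
import torch

def column_parallel_slice(linear: torch.nn.Linear, rank: int, world_size: int) -> torch.nn.Linear:
    """Keep only this rank's 1/world_size slice of the output features.

    A TP framework (e.g. Megatron) performs this kind of slicing for every
    attention/MLP projection; ZeRO-Inference never splits the weights itself.
    """
    out_per_rank = linear.out_features // world_size
    start = rank * out_per_rank
    shard = torch.nn.Linear(linear.in_features, out_per_rank,
                            bias=linear.bias is not None)
    with torch.no_grad():
        # nn.Linear stores weight as (out_features, in_features), so a
        # column-parallel shard is a contiguous block of rows.
        shard.weight.copy_(linear.weight[start:start + out_per_rank, :])
        if linear.bias is not None:
            shard.bias.copy_(linear.bias[start:start + out_per_rank])
    return shard
```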
> Thanks! But how can I make it work? Do you have an example command?
Below are commands for single-gpu inference with kv-cache-offload.
https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference#token-generation-with-zero-inference
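For completeness, the general single-GPU ZeRO-Inference setup (ZeRO stage 3 with parameter offload, as documented in the HuggingFace/DeepSpeed integration docs) looks roughly like the sketch below. The KV cache offload itself is wired in by the linked example script through its own command-line flags, so treat the model name and config values here as illustrative only.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_name = "facebook/opt-1.3b"  # illustrative model choice

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-Inference builds on stage 3
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) before from_pretrained so that the
# weights are loaded in a ZeRO-3 friendly way.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

tok = AutoTokenizer.from_pretrained(model_name)
inputs = tok("DeepSpeed ZeRO-Inference is", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```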
Yes, you are right! Thanks! The performance of single-GPU inference with KV cache offload is really nice. But I have a question:
I found that the fork of transformers allocates the KV cache buffer in a way that does not seem compatible with TP: it still allocates all self.num_heads heads for the KV cache on each GPU.
So basically there is no official implementation of TP + ZeRO-Inference + KV offload that I can run directly. Please correct me if I am wrong.
Are you planning to add this feature in the future? Btw, will TP help under this setting (since the attention computation is all on the CPU anyway)?
Thanks!
Glad that the KV cache offload performance is good for your scenario.
Yes, you are correct that there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have the bandwidth for this right now, but we welcome community contributions.
Yes, I agree that TP won't add much benefit on top of KV offload since (1) the GPU memory pressure is already mostly relieved, and (2) the attention computation runs on the CPU.
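To illustrate the point about self.num_heads: a TP-aware version of the cache would only need to hold the per-rank share of the heads, roughly as in the sketch below (hypothetical function and cache layout, not the transformers fork's actual code).

```python
import torch

def alloc_kv_cache(num_layers, batch, max_seq, num_heads, head_dim,
                   tp_world_size=1, device="cpu", dtype=torch.float16):
    """Illustrative KV cache allocation.

    With tp_world_size=1 this mirrors the behavior described above: the full
    num_heads worth of cache is kept per process. A TP-aware version would
    keep only num_heads // tp_world_size heads per rank, shrinking the
    per-rank buffer, but the attention over the offloaded cache still runs
    on the CPU, which is why TP adds little end-to-end benefit here.
    """
    heads_per_rank = num_heads // tp_world_size
    # 2 = separate key and value buffers per layer.
    shape = (num_layers, 2, batch, heads_per_rank, max_seq, head_dim)
    return torch.empty(shape, device=device, dtype=dtype)
```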
Thank you very much! Nice work!