SforAiDl/KD_Lib

Distributed Training

Opened this issue · 5 comments

We need to add support for distributed training. For now, we can directly build on PyTorch's DistributedDataParallel (DDP). Let me know if anyone wants to take this up.
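For context, a minimal sketch of what the DDP plumbing could look like (the models here are toy placeholders, not KD_Lib's actual API, and the `--local_rank` handling assumes the script is launched via `torch.distributed.launch`):

```python
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torch.distributed.launch passes --local_rank to each spawned process
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # One process per GPU; the launcher sets MASTER_ADDR/PORT, RANK, WORLD_SIZE
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    # Toy student/teacher models standing in for the real KD models
    student = nn.Linear(10, 2).cuda(args.local_rank)
    teacher = nn.Linear(10, 2).cuda(args.local_rank).eval()

    # Only the student trains, so only it needs gradient synchronisation;
    # the frozen teacher does not have to be wrapped in DDP.
    student = DDP(student, device_ids=[args.local_rank])

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```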

I wouldn't mind taking this up, but I'd need a little time to do it. Let me know if that works.

Yeah, take your time; we don't really need to release this immediately anyway.

Hi @Het-Shah and @avishreekh, thanks for creating this wonderful library with support for multiple KD algorithms. The code and implementation are nicely done and structured.

Wanted to know whether there is any update on distributed training. Currently, if I run `python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 vanilla_kd.py`, the library does not run. Multi-GPU training is crucial for this library to be really useful: both model sizes and dataset sizes are increasing, and we cannot get away from using multiple GPUs for training.
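For reference, besides wrapping the student in DDP, each process also needs to see its own shard of the data. A minimal sketch of that part (the dataset and loop are placeholders, not KD_Lib's API; this assumes the same `torch.distributed.launch` invocation as above):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# The launcher sets the env vars init_process_group needs (env:// init)
dist.init_process_group(backend="nccl")

# Dummy tensors standing in for the real KD training data
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# DistributedSampler gives each process a disjoint shard of the dataset
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffles the shards between epochs
    for inputs, targets in loader:
        pass  # forward/backward with the DDP-wrapped student goes here

dist.destroy_process_group()
```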

Thanks again!

Thank you @srikar2097. We are glad that this library could be useful to you.
We are working on the distributed training enhancement and hope to release it by mid-May.

Thank you for your patience.

There are certain design choices that we are still debating. We will add this feature once we have decided how to accommodate it efficiently in the existing framework. Thanks!