SforAiDl/KD_Lib

Distributed Training

Opened this issue · 5 comments

We need to add support for distributed training. For now, we can directly build on PyTorch's DistributedDataParallel (DDP). Let me know if anyone wants to take this up.
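For context, a minimal sketch of what the DDP plumbing could look like (the models here are toy placeholders, not KD_Lib's actual API, and the `--local_rank` handling assumes the script is launched via `torch.distributed.launch`):

```python
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torch.distributed.launch passes --local_rank to each spawned process
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # One process per GPU; the launcher sets MASTER_ADDR/PORT, RANK, WORLD_SIZE
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    # Toy student/teacher models standing in for the real KD models
    student = nn.Linear(10, 2).cuda(args.local_rank)
    teacher = nn.Linear(10, 2).cuda(args.local_rank).eval()

    # Only the student trains, so only it needs gradient synchronisation;
    # the frozen teacher does not have to be wrapped in DDP.
    student = DDP(student, device_ids=[args.local_rank])

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```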

I wouldn't mind taking this up, but I'd need a little time to do it. Let me know if that works.

Yeah, take your time; we don't really need to release this immediately anyway.

Hi @Het-Shah and @avishreekh, thanks for creating this wonderful library with support for multiple KD algorithms. The code and implementation are nicely done and structured.

Wanted to know whether there is any update on distributed training. Currently, if I run `python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 vanilla_kd.py`, the library does not run. Multi-GPU training is crucial for this library to be really useful: both model sizes and dataset sizes are increasing, and we cannot get away from using multiple GPUs for training.
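For reference, besides wrapping the student in DDP, each process also needs to see its own shard of the data. A minimal sketch of that part (the dataset and loop are placeholders, not KD_Lib's API; this assumes the same `torch.distributed.launch` invocation as above):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# The launcher sets the env vars init_process_group needs (env:// init)
dist.init_process_group(backend="nccl")

# Dummy tensors standing in for the real KD training data
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# DistributedSampler gives each process a disjoint shard of the dataset
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffles the shards between epochs
    for inputs, targets in loader:
        pass  # forward/backward with the DDP-wrapped student goes here

dist.destroy_process_group()
```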

Thanks again!

Thank you @srikar2097. We are glad that this library could be useful to you.
We are working on the distributed training enhancement and hope to release it by mid-May.

Thank you for your patience.

There are certain design choices that we are still debating. We will add this feature once we have decided how to accommodate it efficiently in the existing framework. Thanks!