pytorch/ignite

DeepSpeed support for ignite.distributed

Opened this issue · 8 comments

🚀 Feature

PyTorch Lightning recently added native support for MS DeepSpeed.

I believe it would also be helpful for users if ignite incorporated the DeepSpeed pipeline for memory-efficient distributed training.

1. For idist.auto_model?

To initialize the DeepSpeed engine:

model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)

And for the distributed environment setup, we need to replace torch.distributed.init_process_group(...) with deepspeed.init_distributed().
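
Putting the two together, here is a minimal setup sketch. The argparse wiring via deepspeed.add_config_arguments and the toy model are assumptions for illustration, not part of any ignite API:

import argparse
import deepspeed
import torch.nn as nn

parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed, --deepspeed_config, ...
cmd_args = parser.parse_args()

# Replaces the usual torch.distributed.init_process_group(...) call (NCCL by default)
deepspeed.init_distributed()

model = nn.Linear(10, 2)  # toy model for illustration
params = [p for p in model.parameters() if p.requires_grad]
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)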

2. checkpoint handler

Checkpointing works slightly differently:

model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)
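
As a rough sketch, this could hook into an ignite Engine through an event handler. Everything here (the trainer, the tag naming) is hypothetical, and recent DeepSpeed versions name the keyword client_state:

from ignite.engine import Engine, Events

def save_deepspeed_checkpoint(engine):
    # save_checkpoint is a collective call: every rank must execute it,
    # since DeepSpeed writes the (possibly sharded) states itself
    client_sd = {"epoch": engine.state.epoch, "iteration": engine.state.iteration}
    model_engine.save_checkpoint(args.save_dir, f"epoch_{engine.state.epoch}",
                                 client_state=client_sd)

trainer.add_event_handler(Events.EPOCH_COMPLETED, save_deepspeed_checkpoint)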

@Kashu7100 Thank you for this suggestion!

I confirm that it would be very nice to support DeepSpeed with idist. Maybe a new backend could be introduced; what do you think, @vfdev-5 and @fco-dv?

Currently we have a docker environment configured with MS DeepSpeed:

https://github.com/pytorch/ignite/tree/master/docker/msdp

Would you like to contribute to this? It seems you already know how to do it 😉

@sdesrozis Do you think it is possible to reuse the idist.Parallel pipeline without modifications?

with idist.Parallel(backend=backend, **spawn_kwargs) as parallel:
    parallel.run(main, config)
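
For reference, a minimal sketch of what that reuse could look like. This is hypothetical, not an existing example; it assumes deepspeed.init_distributed() detects and reuses the process group that idist.Parallel has already initialized:

import deepspeed
import ignite.distributed as idist
import torch.nn as nn

def main(local_rank, cmd_args):
    # torch.distributed is already initialized by idist.Parallel here;
    # deepspeed.init_distributed() should detect the existing group
    deepspeed.init_distributed()
    model = nn.Linear(10, 2)  # toy model
    model_engine, optimizer, _, _ = deepspeed.initialize(
        args=cmd_args, model=model, model_parameters=model.parameters()
    )
    # ... training loop driven by model_engine ...

with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
    parallel.run(main, cmd_args)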

It depends on what you want to do. The feature list of msdp is quite long, and some features would have a deeper impact on idist than others.

For instance, I think that pipeline parallelism would be a very nice feature to have, but it is not trivial to adapt.

Maybe a first step could be distributed data parallelism using the simplified API, as you mentioned. That would mean developing a new backend and integrating it into our idist.Parallel.

You can have a look here. By the way, it's not an easy task, and maybe I'm wrong about what to do. @vfdev-5 was looking further into this; maybe he could help in the discussion.

@Kashu7100 Finally, introducing a new backend does not seem to be the right option. Have a look here, and you will see that native PyTorch distributed is used when the distributed environment variables are set.

That is good news for simple use cases.
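
Concretely, these are the standard torch.distributed rendezvous variables involved; the single-process defaults below are just a sketch (in practice a launcher sets them):

import os

# When these are present, deepspeed.init_distributed() reuses the
# standard torch.distributed rendezvous instead of its own launcher logic
defaults = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
}
for name, value in defaults.items():
    os.environ.setdefault(name, value)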

@sdesrozis Do you think it is possible to reuse the idist.Parallel pipeline without modifications?

I would say yes.

@Kashu7100 thanks for the feature request!

Yes, we plan to improve our support of the DeepSpeed framework, which roughly consists of:

  • cmd line launcher + config file
  • model_engine wrapper
  • various modern optimizers
  • pipeline parallelism
  • amp using nvidia/apex
  • customized distributed (support azure) on top of torch distributed

Our idea was to provide basic integration examples of how to use ignite and deepspeed together. I looked at it multiple times, and due to a certain overlap between the two frameworks it was not obvious where to put the split.
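
To make that overlap concrete, here is a minimal training-step sketch combining the two. It assumes a model_engine from deepspeed.initialize, a train_loader, and a classification loss; none of this is an existing ignite example:

import torch.nn.functional as F
from ignite.engine import Engine

def train_step(engine, batch):
    x = batch[0].to(model_engine.local_rank)
    y = batch[1].to(model_engine.local_rank)
    loss = F.cross_entropy(model_engine(x), y)
    # DeepSpeed owns the backward/step pair: loss scaling, gradient
    # accumulation boundaries and ZeRO partitioning happen inside these
    # two calls, replacing the usual zero_grad/backward/step sequence
    model_engine.backward(loss)
    model_engine.step()
    return loss.item()

trainer = Engine(train_step)
trainer.run(train_loader, max_epochs=5)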

@sdesrozis I'm not sure whether we should add it as a new backend or not. Let's first create a basic integration example and see which parts of the DeepSpeed code could be simplified using idist.

customized distributed (support azure) on top of torch distributed

I think this could be integrated into our native backend, alongside slurm.

@sdesrozis I'm not sure whether we should add it as a new backend or not.

IMO it is not necessary.

Let's first create a basic integration example and see which parts of the DeepSpeed code could be simplified using idist.

That is a good option. As discussed a few weeks ago, the specific engine should be the tricky part; otherwise, the auto helpers could do the job, I suppose.

Hi, is there any update on this?

@saifullah3396 Well, this feature is not really a priority right now. If you would like to help with it, we can guide your development from the ignite side.