ehsanhaghighat/sciann

Distributed training

Opened this issue · 11 comments

Is it possible to do distributed training on multiple GPUs and machines using SciANN?
For example, can something like Horovod or TensorFlow's distributed training be used readily?

Hi @ehsanhaghighat,
I tried to run SciANN in distributed settings on a g4dn.12xlarge with 4 Nvidia T4 GPUs.

I tried:

  1. Default TensorFlow distributed training, which fails for some reason, probably because of SciANN's custom training routines.
  2. Horovod, which works seamlessly.

I can confirm that Horovod works very well with SciANN, so people looking for distributed training with SciANN should be able to use it.

@ehsanhaghighat, one more request outside the scope of this issue: is it possible to have Neumann BCs in SciANN? If yes, could you please share how this can be achieved?

Would you be interested in sharing a simple example, with details on how to use Horovod, in the SciANN repo?

Wow, this is awesome news!
Thanks for checking and for the update.

Yeah, sure, that would be great. Where would you like me to raise the pull request: in the SciANN repo or the sciann-applications repo?

I think we can add a section on distributed training support to the main repo's README, linking to the relevant folder in the sciann-applications repo.

Let me know your thoughts.

Check this example: https://github.com/sciann/sciann-applications/blob/master/SciANN-Elasticity/Elasticity-Forward.ipynb. There, BC_left_2, BC_right_2, and BC_top_2 are all Neumann type.

There is really no difference between how you implement Neumann and Dirichlet BCs in strong-form PINNs. In our examples, we usually have both types. Note that in strong form, you need to add all BCs (even natural ones that are automatically satisfied in weak form). Is that clear?
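To make that concrete, here is one common way to write a strong-form PINN loss with both BC types. The notation is assumed for illustration only and is not taken from the SciANN docs: $u_\theta$ is the network, $\mathcal{N}[u] = f$ is the PDE on the domain, $\bar{u}$ is the prescribed value on the Dirichlet boundary, $\bar{g}$ is the prescribed normal derivative on the Neumann boundary, and $N_r$, $N_D$, $N_N$ are the numbers of sample points in each set.

```latex
\mathcal{L}(\theta)
  = \frac{1}{N_r} \sum_{i=1}^{N_r} \bigl| \mathcal{N}[u_\theta](x_i) - f(x_i) \bigr|^2
  + \frac{1}{N_D} \sum_{j=1}^{N_D} \bigl| u_\theta(x_j) - \bar{u}(x_j) \bigr|^2
  + \frac{1}{N_N} \sum_{k=1}^{N_N} \bigl| \partial_n u_\theta(x_k) - \bar{g}(x_k) \bigr|^2
```

The Neumann term has the same form as the Dirichlet term; only the constrained quantity changes from the value $u_\theta$ to the normal derivative $\partial_n u_\theta$, which is why both are handled the same way.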

Thanks a lot @ehsanhaghighat for sharing this.

Dear Pradyumna,

I have recently been working with a large dataset and a complex model that require multiple GPUs for distributed training. Could you please share an example or a simple demonstration of how to train a SciANN model with Horovod?
I was unable to get it working and could not find any examples in either the SciANN or the sciann-applications repo.

Thanks for sharing in advance!
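For anyone else landing here, below is a minimal sketch of the standard Horovod setup for Keras-based training. It is not the configuration used in the test reported earlier in this thread, which was never posted. The helper names `scaled_learning_rate` and `horovod_setup` are hypothetical, and the SciANN hookup in the trailing comments assumes that `SciModel` forwards an optimizer object to Keras and that `train` accepts `callbacks`; please verify both against your installed SciANN version.

```python
def scaled_learning_rate(base_lr, world_size):
    """Linear scaling rule often used with data-parallel training."""
    return base_lr * world_size


def horovod_setup(base_lr=1e-3):
    """Initialise Horovod and return (optimizer, callbacks, is_root)."""
    # Imported lazily so this module loads on machines without Horovod.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to a single GPU, selected by its local rank.
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Wrap a regular Keras optimizer so gradients are averaged across workers.
    opt = tf.keras.optimizers.Adam(scaled_learning_rate(base_lr, hvd.size()))
    opt = hvd.DistributedOptimizer(opt)

    # Broadcast initial weights from rank 0 so all workers start in sync.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    return opt, callbacks, hvd.rank() == 0


# Hypothetical SciANN hookup (assumption, unverified):
#   opt, callbacks, is_root = horovod_setup()
#   model = sn.SciModel(inputs, targets, optimizer=opt)
#   model.train(x_data, y_data, epochs=100, callbacks=callbacks,
#               verbose=1 if is_root else 0)
```

A script built this way would typically be launched with something like `horovodrun -np 4 python train.py` on a 4-GPU machine, where `train.py` is a placeholder for your own script.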
