Distributed training

Question

Opened this issue 2 years ago · 11 comments

Is it possible to do distributed training on multiple GPUs and machines using SciANN?
For example, can something like Horovod or TensorFlow's distributed training be used readily?

Answer 1 · 2023-03-26T15:51:00.000Z

It should be possible, since the backend is all Keras, but I have never worked on it.

Answer 2 · 2023-04-24T05:45:03.000Z

Hi @ehsanhaghighat,
I tried to run SciANN in a distributed setting on a g4dn.12xlarge instance with 4 NVIDIA T4 GPUs.

I tried:

1. TensorFlow's default distributed training, which fails for some reason, probably because of SciANN's custom training routines.
2. Horovod, which works seamlessly.

I can confirm that Horovod works very well with SciANN, so people looking for distributed training with SciANN should be able to use it.
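
The thread does not include the script that was used, so the following is only a minimal sketch of the generic Horovod pattern for Keras (initialize, pin one GPU per process, wrap the optimizer, broadcast initial weights). It uses a plain `tf.keras` model rather than a SciANN `SciModel`; whether the wrapped optimizer and the callback can be passed through SciANN in the same way is an assumption that is not confirmed here. The helper names `scaled_learning_rate`, `shard`, and `train`, the toy data, and the `--train` flag are hypothetical.

```python
import sys


def scaled_learning_rate(base_lr, world_size):
    # Common Horovod convention: scale the learning rate linearly with the worker count.
    return base_lr * world_size


def shard(samples, rank, world_size):
    # Strided slice so each worker trains on a disjoint part of the data.
    return samples[rank::world_size]


def train(epochs=5, base_lr=1e-3):
    # Heavy imports live here so the helpers above work without a GPU stack.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per GPU, started by horovodrun

    # Pin each process to its own GPU.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Toy regression data (assumption). Keep shards equal in size so every
    # worker runs the same number of steps per epoch.
    x = np.linspace(-1.0, 1.0, 4096, dtype="float32").reshape(-1, 1)
    y = np.sin(np.pi * x)
    x_local = shard(x, hvd.rank(), hvd.size())
    y_local = shard(y, hvd.rank(), hvd.size())

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="tanh", input_shape=(1,)),
        tf.keras.layers.Dense(32, activation="tanh"),
        tf.keras.layers.Dense(1),
    ])

    # Wrap the optimizer so gradients are averaged across workers.
    # Depending on the TensorFlow/Horovod versions, Horovod's Keras examples
    # pass extra compile options; check the docs for your installation.
    opt = tf.keras.optimizers.Adam(scaled_learning_rate(base_lr, hvd.size()))
    model.compile(optimizer=hvd.DistributedOptimizer(opt), loss="mse")

    # Start every worker from rank 0's initial weights.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    model.fit(x_local, y_local, batch_size=64, epochs=epochs,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)


if __name__ == "__main__" and "--train" in sys.argv:
    train()
```

Saved as, say, `train_hvd.py` (a hypothetical file name), this would be launched on a 4-GPU machine with `horovodrun -np 4 python train_hvd.py --train`, which starts one process per GPU.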

Answer 3 · 2023-04-24T05:46:48.000Z

Wow, this is awesome news! Thanks for checking and for the update.

Answer 4 · 2023-04-24T05:48:18.000Z

Would you be interested in sharing a simple example, with details on how to use Horovod, in the SciANN repo?

Answer 5 · 2023-04-24T05:48:26.000Z

@ehsanhaghighat, one more question outside the scope of this issue: is it possible to have Neumann BCs in SciANN? If so, could you please share how to set them up?

Answer 6 · 2023-04-24T05:51:23.000Z

There is really no difference between how you implement Neumann and Dirichlet BCs in strong-form PINNs. In our examples, we usually have both types. Note that in strong form you need to add all BCs, even the natural ones that the weak form satisfies automatically. Is that clear?
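
To make the point concrete, here is a sketch of a typical strong-form PINN loss with both boundary types. The notation is an assumption introduced for illustration and is not taken from SciANN: $\mathcal{N}[u] = f$ is the PDE in the domain, $u = g$ on the Dirichlet boundary, $\partial u / \partial n = h$ on the Neumann boundary, $u_\theta$ is the network, and all terms are unweighted mean squared residuals.

```latex
\mathcal{L}(\theta) =
  \frac{1}{N_\Omega} \sum_{i=1}^{N_\Omega}
    \left| \mathcal{N}[u_\theta](x_i) - f(x_i) \right|^2
+ \frac{1}{N_D} \sum_{j=1}^{N_D}
    \left| u_\theta(x_j) - g(x_j) \right|^2
+ \frac{1}{N_N} \sum_{k=1}^{N_N}
    \left| n(x_k) \cdot \nabla u_\theta(x_k) - h(x_k) \right|^2
```

The Neumann term has the same form as the Dirichlet one: a residual evaluated at boundary points and driven to zero. The only difference is that the residual contains a derivative of the network output instead of the output itself.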

Answer 7 · 2023-04-24T05:53:01.000Z

Check this example: https://github.com/sciann/sciann-applications/blob/master/SciANN-Elasticity/Elasticity-Forward.ipynb. BC_left_2, BC_right_2, and BC_top_2 are all Neumann type.

Answer 8 · 2023-04-24T05:54:54.000Z

Yeah, sure, that would be great. Where would you like me to raise the pull request: in the SciANN repo or the sciann-applications repo?

I think we can add a section on distributed training support to the main repo's README, linking to the relevant folder in the sciann-applications repo.

Let me know your thoughts.

Answer 9 · 2023-04-24T06:08:32.000Z

Thanks a lot @ehsanhaghighat for sharing this.

Answer 10 · 2023-04-24T06:42:59.000Z

sciann-applications is where I usually upload all examples.

Answer 11 · 2024-03-11T02:23:18.000Z

Dear Pradyumna,

I have recently been working with a large dataset and a complex model that require multiple GPUs for distributed training. Would you please share an example or a simple demonstration of how to train a SciANN model with Horovod?
I was unable to get this working and could not find any examples in either the SciANN or the sciann-applications repo.

Thanks in advance for sharing!
