bytedance/SALMONN

About distributed training

Closed this issue · 3 comments

How do I run your code with distributed training? I tried setting "use_distributed: True" in your configuration file, but it didn't work; it seems to only support single-GPU mode.

The current version supports both single-node multi-GPU and multi-node multi-GPU training, so just launch the train.py script with torchrun. If you run into any problems, feel free to discuss them here!
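For reference, a launch might look roughly like the sketch below (it assumes train.py takes a --cfg-path argument pointing at the YAML config; the GPU counts, addresses, and paths are placeholders to adapt to your setup):

    # single node, 4 GPUs
    torchrun --nproc_per_node=4 train.py --cfg-path configs/config.yaml

    # two nodes, 8 GPUs each; run on every node with its own --node_rank
    torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 \
        --master_addr=10.0.0.1 --master_port=29500 \
        train.py --cfg-path configs/config.yaml

torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each process, which is what typical DDP setup code reads, so beyond use_distributed: True no launch-specific config changes should be needed.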

@Yu-Doit

How about inference on multiple GPUs?

It's the same as training. The only difference is that you should set run.evaluate: True in your config, which will skip training and run inference directly.
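As a sketch, the config change and the launch could look like this (run.evaluate: True comes from the comment above; the paths and flags are the same illustrative placeholders as before):

    # in the YAML config
    run:
      evaluate: True   # skip training and go straight to inference

    # launch exactly as for training, e.g. single node with 4 GPUs
    torchrun --nproc_per_node=4 train.py --cfg-path configs/config.yaml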