bytedance/SALMONN

About distributed training

Closed this issue · 3 comments

How do I run your code with distributed training? I tried setting "use_distributed: True" in your configuration file, but it didn't work; it seems to only support single-GPU mode.

The current version supports both single-node multi-GPU and multi-node multi-GPU training, so just launch the train.py script with torchrun. If you run into any problems, feel free to discuss them here!
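For reference, a launch might look roughly like the sketch below (it assumes train.py takes a --cfg-path argument pointing at the YAML config; the GPU counts, addresses, and paths are placeholders to adapt to your setup):

    # single node, 4 GPUs
    torchrun --nproc_per_node=4 train.py --cfg-path configs/config.yaml

    # two nodes, 8 GPUs each; run on every node with its own --node_rank
    torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 \
        --master_addr=10.0.0.1 --master_port=29500 \
        train.py --cfg-path configs/config.yaml

torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each process, which is what typical DDP setup code reads, so beyond use_distributed: True no launch-specific config changes should be needed.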

@Yu-Doit

How about inference on multiple GPUs?

It's the same as training. The only difference is that you should set run.evaluate: True in your config, which will skip training and run inference directly.
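As a sketch, the config change and the launch could look like this (run.evaluate: True comes from the comment above; the paths and flags are the same illustrative placeholders as before):

    # in the YAML config
    run:
      evaluate: True   # skip training and go straight to inference

    # launch exactly as for training, e.g. single node with 4 GPUs
    torchrun --nproc_per_node=4 train.py --cfg-path configs/config.yaml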