MCG-NJU/MeMOTR

Error while training in distributed mode

etema19 opened this issue · 2 comments

Hello, I ran into an error while training the code in distributed mode. The error is as follows: "torch.distributed.elastic.multiprocessing.errors.ChildFailedError: main.py FAILED"

Any idea what could cause this?

Thanks!

Could you please share the complete error message? There should be other output before the line you posted, since ChildFailedError only reports that a worker process failed and not why.
It would also help if you could provide more details about the script you are running.
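
If the underlying traceback is hard to find in the interleaved worker output, one option is to print it explicitly with the worker's rank. The following is only a minimal sketch, not part of this repository: `format_rank_traceback` and `run_and_report` are hypothetical helper names, and it assumes the launcher sets the `RANK` environment variable for each worker, as torchrun does.

```python
import os
import sys
import traceback


def format_rank_traceback(exc: BaseException) -> str:
    """Return the full traceback of `exc`, prefixed with the worker's rank."""
    # Assumption: the launcher exports RANK per worker; use "?" otherwise.
    rank = os.environ.get("RANK", "?")
    body = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return f"[rank {rank}] worker failed:\n{body}"


def run_and_report(entry_point):
    """Call `entry_point`; on failure print the full traceback, then re-raise."""
    try:
        return entry_point()
    except Exception as exc:
        # Flush so the message is not lost when the process is torn down.
        print(format_rank_traceback(exc), file=sys.stderr, flush=True)
        raise
```

Wrapping the training entry point this way (for example `run_and_report(main)`, where `main` stands for whatever function the script starts from) should make the real exception visible above the ChildFailedError summary. PyTorch also provides a `record` decorator in `torch.distributed.elastic.multiprocessing.errors` for a similar purpose.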

Thanks~

Since I have not received a reply for a long time, I am closing this issue for now.