google-research/vdm

The training gets stuck at `self.p_train_step` at a random step.

baofff opened this issue · 2 comments

I trained on 8 A100s or 8 GeForce RTX 2080 Tis. On both sets of devices, the training proceeds for a few steps and then gets stuck at a random step. Has anyone had the same issue?

Same.

> I trained on 8 A100s or 8 GeForce RTX 2080 Tis. On both sets of devices, the training proceeds for a few steps and then gets stuck at a random step. Has anyone had the same issue?

Do you have any error message?
Have you checked GPU utilization when this happens?
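
One way to see where the process is stuck (just a sketch, not specific to this repo) is to register Python's `faulthandler` at startup, so you can signal the hung process later and get a traceback of every thread:

```python
import faulthandler
import signal

# Register a handler so that `kill -USR1 <pid>` dumps the Python
# traceback of every thread to stderr. Useful for seeing whether the
# process is blocked inside p_train_step, the data pipeline, or a
# collective op. Unix-only; call this once at the start of training.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

When training hangs, `kill -USR1 <pid>` from another shell prints each thread's stack without killing the run; combined with `nvidia-smi` (to see whether GPU utilization is stuck at 0% or 100%), that usually narrows down whether the hang is on the host side or in a device collective.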

I also had a similar issue during training. In my case, the lock file used by the custom torch ops was the problem: because a previous training run had crashed, the torch extension's lock file was never deleted, so later runs waited forever to use that op.

You should check that such locks are cleaned up before starting training.
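
For reference, a minimal sketch for clearing stale extension locks, assuming the default torch extensions cache location (`~/.cache/torch_extensions`, or `TORCH_EXTENSIONS_DIR` if set) and lock files named `lock`; the exact layout can differ between PyTorch versions, so verify the paths before deleting anything:

```python
import os
from pathlib import Path

# Default build root for torch C++/CUDA extensions; can be overridden
# by the TORCH_EXTENSIONS_DIR environment variable.
cache_root = Path(os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    Path.home() / ".cache" / "torch_extensions",
))

# Remove leftover lock files from crashed runs. Only do this when no
# other job is currently building an extension in the same cache.
for lock_file in cache_root.rglob("lock"):
    print(f"Removing stale lock: {lock_file}")
    lock_file.unlink()
```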