lopuhin/transformer-lm

Silent failure when training on GPU

Closed this issue · 5 comments

Hi,

I'm seeing a strange phenomenon when training a model on GPU: training just fails silently. This happens both on the data I'm trying to train on and on the Shakespeare test.

I get something like this

Resuming from seen_tokens 1,280,600,064
device cuda:0 initializing process group
Resuming from seen_tokens 1,280,600,064
device cuda:1 initializing process group
process group for cuda:0 initialized
process group for cuda:1 initialized
epochs: 150it [00:08, 17.41it/s]                                                                                        
  2%|█▌                                                                     | 190464/8536064 [00:08<06:17, 22110.78it/s]

and then it just stops (note the iterations per second: it's far too high to be realistic, since I only have a few tens of megabytes of data).

Have you seen this type of error before? Any pointers on how to find out what is going on? It works fine with CPU.

Many thanks!

@vilhub I see, thanks for the report. This is using PyTorch, right? I haven't seen such an error, to be honest, and I'm not sure what could cause it. Does this happen only with two GPUs, or with one as well?

Hey, thanks for your answer. Yes, exactly, using PyTorch. It happens with both one and two GPUs. I will investigate some more to see if I can at least find an error message or understand why it terminates, and report back shortly.

Hi again,
So the issue, in the end, was that I was continuing training from a loaded model, so the restored seen_tokens value was already greater than epochs * epoch_size and the training loop had nothing left to do. Perhaps printing the seen_tokens value would be useful for anyone who runs into the same issue?
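For anyone hitting the same thing, here is a minimal sketch of the kind of check involved, using the numbers from the log above (the variable names are illustrative, not the repo's actual code):

seen_tokens = 1_280_600_064   # value restored from the checkpoint (log above)
epoch_size = 8_536_064        # tokens per epoch, i.e. size of the train set
epochs = 150

# The training loop has nothing to do once seen_tokens already meets the target:
if seen_tokens >= epochs * epoch_size:
    print(f'Nothing left to train: seen_tokens={seen_tokens:,} >= '
          f'epochs * epoch_size = {epochs * epoch_size:,}')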

Thanks again for the reply and the code!

Edit: This wouldn't explain the failure of the Shakespeare test though; I will look into that as well.

OK, the Shakespeare test works too; I don't know why it didn't at the time. I'll close this now, thanks!

I am reloading the German model and training on my own dataset, with a vocab size of 50,000.
When I start training, it does not complete the set number of epochs; it silently finishes, as mentioned above.
Loading dataset from tests/german-encoded
Train dataset has 806,414 tokens
Validation dataset has 210,954 tokens
135,790 "sentences" found for sampling
34,354 "sentences" found for sampling
Resuming from seen_tokens 11,962,060,800
epochs: 0% 0/10 [00:00<?, ?it/s]

I set max-tokens to 15,962,060,800, but it still stops at 21%.
My code
%%shell
gpt-2 \
    tests/de345-root/ \
    tests/german-encoded \
    tests/de345-root/sp-model.model \
    --batch-size 3 \
    --g-accum-gradients 1 \
    --gradient_checkpointing 1 \
    --n-ctx 1024 \
    --n-embed 1024 \
    --n-hidden 1024 \
    --n-head 16 \
    --n-layer 24 \
    --epochs 10 \
    --log-every 2 \
    --save-every 50 \
    --validate-every 100 \
    --sample-sentences \
    --max_tokens 15,962,060,800
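For what it's worth, the same arithmetic as in the earlier comment seems to apply here; below is a rough sketch under the assumption that training stops once seen_tokens reaches epochs * epoch_size, as described above. The checkpoint path and the 'seen_tokens' key are guesses about the checkpoint layout, not a documented interface.

import torch

seen_tokens = 11_962_060_800   # restored from the pretrained German checkpoint (log above)
epoch_size = 806_414           # tokens in the new train set
epochs = 10

print(epochs * epoch_size)     # 8,064,140 -- far below seen_tokens,
                               # so the loop would exit right after resuming

# Hypothetical workaround: reset the counter in the checkpoint before
# continuing training (path and key name are assumptions).
state = torch.load('tests/de345-root/model.pt', map_location='cpu')
if 'seen_tokens' in state:
    state['seen_tokens'] = 0
    torch.save(state, 'tests/de345-root/model.pt')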