Lightning-AI/lit-llama

[question] error message while finetuning

nevermet opened this issue · 2 comments

Dear all,
I ran finetuning, and during validation I encountered this error:
iter 3198: loss nan, time: 123.08ms
Validating ...
.......
lit-llama/generate.py", line 74, in generate
idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype)
RuntimeError: probability tensor contains either inf, nan or element < 0
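The RuntimeError is raised by torch.multinomial when the probability distribution it is asked to sample from contains inf, nan, or negative entries, which is exactly what you get once the loss itself has gone to nan. A stdlib-only sketch of the same validation (check_probs is a hypothetical helper for illustration, not part of lit-llama):

```python
import math

def check_probs(probs):
    # Mirrors torch.multinomial's input validation: every entry must be
    # a finite, non-negative number, otherwise sampling is undefined.
    bad = [p for p in probs if not math.isfinite(p) or p < 0]
    if bad:
        raise ValueError(f"probability tensor contains invalid entries: {bad}")
    return probs

check_probs([0.7, 0.2, 0.1])          # a valid distribution passes
try:
    check_probs([0.7, float("nan")])  # a nan loss propagates into probs like this
except ValueError as e:
    print(e)
```

In other words, the multinomial call is only where the problem surfaces; the nan originates earlier, in the training step that produced "loss nan".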

Could you tell me how I can solve this problem?

Thanks in advance.

It may or may not be related, but are you using --precision 16-true? I've noticed that it produces NaNs when training some models. If your GPU supports it, can you try brain-float precision, i.e. --precision bf16-true?
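For context on why the precision flag matters: IEEE float16 (16-true) has a maximum finite value of 65504, so large activations or loss values overflow to inf and then turn into nan, whereas bfloat16 keeps float32's exponent range (up to roughly 3.4e38) and avoids this particular failure mode. A stdlib-only illustration of the float16 limit, using struct's half-precision format:

```python
import struct

def fits_in_float16(x: float) -> bool:
    # "e" is the IEEE 754 half-precision (float16) format; packing a value
    # beyond its range raises OverflowError.
    try:
        struct.pack("e", x)
        return True
    except OverflowError:
        return False

print(fits_in_float16(65504.0))  # True: the largest finite float16
print(fits_in_float16(70000.0))  # False: overflows float16, but is trivially
                                 # representable in bfloat16/float32
```

This is only a sketch of the numeric range difference; whether it explains your NaN depends on the model and data, which is why the suggestion above is hedged.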

No, I did not use --precision 16-true.