Lightning-AI/lit-llama

[question] error message while finetuning

nevermet opened this issue · 2 comments

Dear all,
I ran finetuning, and during validation I encountered this error:
iter 3198: loss nan, time: 123.08ms
Validating ...
.......
lit-llama/generate.py", line 74, in generate
idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype)
RuntimeError: probability tensor contains either inf, nan or element < 0
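The RuntimeError is raised by torch.multinomial when the probability distribution it is asked to sample from contains inf, nan, or negative entries, which is exactly what you get once the loss itself has gone to nan. A stdlib-only sketch of the same validation (check_probs is a hypothetical helper for illustration, not part of lit-llama):

```python
import math

def check_probs(probs):
    # Mirrors torch.multinomial's input validation: every entry must be
    # a finite, non-negative number, otherwise sampling is undefined.
    bad = [p for p in probs if not math.isfinite(p) or p < 0]
    if bad:
        raise ValueError(f"probability tensor contains invalid entries: {bad}")
    return probs

check_probs([0.7, 0.2, 0.1])          # a valid distribution passes
try:
    check_probs([0.7, float("nan")])  # a nan loss propagates into probs like this
except ValueError as e:
    print(e)
```

In other words, the multinomial call is only where the problem surfaces; the nan originates earlier, in the training step that produced "loss nan".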

Could you tell me how I can solve this problem?

Thanks in advance.

It may or may not be related, but are you using --precision 16-true? I've noticed that it produces NaNs when training some models. If your GPU supports it, can you try brain-float precision, i.e. --precision bf16-true?
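For context on why the precision flag matters: IEEE float16 (16-true) has a maximum finite value of 65504, so large activations or loss values overflow to inf and then turn into nan, whereas bfloat16 keeps float32's exponent range (up to roughly 3.4e38) and avoids this particular failure mode. A stdlib-only illustration of the float16 limit, using struct's half-precision format:

```python
import struct

def fits_in_float16(x: float) -> bool:
    # "e" is the IEEE 754 half-precision (float16) format; packing a value
    # beyond its range raises OverflowError.
    try:
        struct.pack("e", x)
        return True
    except OverflowError:
        return False

print(fits_in_float16(65504.0))  # True: the largest finite float16
print(fits_in_float16(70000.0))  # False: overflows float16, but is trivially
                                 # representable in bfloat16/float32
```

This is only a sketch of the numeric range difference; whether it explains your NaN depends on the model and data, which is why the suggestion above is hedged.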

No, I did not use --precision 16-true.