thu-coai/DA-Transformer

Divide by zero error

Closed · 2 comments

Hello, great work! Since there is no nvcc on the server shared by our laboratory, I chose to use torch to compute the DAG loss. At runtime, I find that logging_outputs and ntokens are 0, and training fails with the error below:
[screenshot: ZeroDivisionError traceback]
There is a divide-by-zero error, which I suspect is caused by a version mismatch. My experimental environment is as follows:
PyTorch 1.10.1+cu102, Python 3.7.11, GCC 7.5.0, fairseq 1.0.0a0+2d06841
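For reference, here is a minimal sketch (assumed names, not the repo's actual code) of how all-zero logging outputs would turn into this error when losses are averaged:

```python
# Sketch of fairseq-style logging aggregation. If every batch reported
# zero tokens, sample_size sums to 0 and the division below raises
# ZeroDivisionError -- the symptom described above.
logging_outputs = [{"loss": 0.0, "ntokens": 0, "sample_size": 0}]

sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
loss_sum = sum(log.get("loss", 0.0) for log in logging_outputs)
avg_loss = loss_sum / sample_size  # raises ZeroDivisionError here
```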
Could you please advise on how to resolve this? Thank you very much! @hzhwcmhf

@GengRuotong I do not think it's a problem caused by nvcc.
What are your dataset, max-tokens, and glat-p settings? You can also print your mini-batch to see whether it contains any samples.
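A minimal helper along the lines of that suggestion (field names follow fairseq's usual sample dict; this is a sketch, not code from the repo) could be called at the top of the criterion's forward():

```python
import torch

def inspect_batch(sample):
    """Print basic statistics of a fairseq-style mini-batch dict."""
    if not sample:
        print("WARNING: empty mini-batch")
        return
    print("ids:", sample.get("id"))
    print("ntokens:", sample.get("ntokens"))
    target = sample.get("target")
    if torch.is_tensor(target):
        print("target shape:", tuple(target.shape))  # 0 rows => no samples
```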


Thank you for your reply! After checking, I found that the problem was actually a CUDA out-of-memory error, which can be solved by reducing --max-tokens.
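A likely explanation for the symptom: fairseq can catch a CUDA OOM and skip the offending batch, leaving logging outputs with zero ntokens, which then crashes the averaging step. Besides lowering --max-tokens, a defensive guard in the aggregation (again a sketch, not the repo's code) would turn the crash into a harmless zero:

```python
def safe_average(loss_sum, sample_size):
    # Guard against batches skipped after OOM: report 0.0 instead of
    # raising ZeroDivisionError when no tokens were processed.
    return loss_sum / sample_size if sample_size > 0 else 0.0

print(safe_average(12.5, 0))  # 0.0 instead of a crash
```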