thu-coai/DA-Transformer

Divide by zero error

Closed · 2 comments

Hello, great work! Since there is no nvcc on the server shared by our laboratory, I chose to use torch to compute the DAG loss. At runtime, I find that logging_outputs and ntokens are 0, and training fails with the error below:
[screenshot: ZeroDivisionError traceback]
There is a divide-by-zero error, which I suspect is caused by a version mismatch. My experimental environment is as follows:
PyTorch 1.10.1+cu102, Python 3.7.11, GCC 7.5.0, fairseq 1.0.0a0+2d06841
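For reference, here is a minimal sketch (assumed names, not the repo's actual code) of how all-zero logging outputs would turn into this error when losses are averaged:

```python
# Sketch of fairseq-style logging aggregation. If every batch reported
# zero tokens, sample_size sums to 0 and the division below raises
# ZeroDivisionError -- the symptom described above.
logging_outputs = [{"loss": 0.0, "ntokens": 0, "sample_size": 0}]

sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
loss_sum = sum(log.get("loss", 0.0) for log in logging_outputs)
avg_loss = loss_sum / sample_size  # raises ZeroDivisionError here
```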
Could you please advise on how to resolve this? Thank you very much! @hzhwcmhf

@GengRuotong I do not think it's a problem caused by nvcc.
What are your dataset, max-tokens, and glat-p settings? You can also print your mini-batch to see whether it contains any samples.
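A minimal helper along the lines of that suggestion (field names follow fairseq's usual sample dict; this is a sketch, not code from the repo) could be called at the top of the criterion's forward():

```python
import torch

def inspect_batch(sample):
    """Print basic statistics of a fairseq-style mini-batch dict."""
    if not sample:
        print("WARNING: empty mini-batch")
        return
    print("ids:", sample.get("id"))
    print("ntokens:", sample.get("ntokens"))
    target = sample.get("target")
    if torch.is_tensor(target):
        print("target shape:", tuple(target.shape))  # 0 rows => no samples
```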


Thank you for your reply! After checking, I found that the problem was actually a CUDA out-of-memory error, which can be solved by reducing --max-tokens.
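A likely explanation for the symptom: fairseq can catch a CUDA OOM and skip the offending batch, leaving logging outputs with zero ntokens, which then crashes the averaging step. Besides lowering --max-tokens, a defensive guard in the aggregation (again a sketch, not the repo's code) would turn the crash into a harmless zero:

```python
def safe_average(loss_sum, sample_size):
    # Guard against batches skipped after OOM: report 0.0 instead of
    # raising ZeroDivisionError when no tokens were processed.
    return loss_sum / sample_size if sample_size > 0 else 0.0

print(safe_average(12.5, 0))  # 0.0 instead of a crash
```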