Tsingularity/FRN

About the training process

Closed this issue · 3 comments

Excellent work!

The results of multiple training runs are inconsistent: sometimes training does not converge, and sometimes "NaN or Inf found in input tensor." appears. Is this related to the fixed random seed?

Hi, thanks for your interest in our work!

For the NaN issue, I suspect it's due to a PyTorch version change. Could you please install the conda environment we provide and re-run the training? Please let me know if the error persists.

I don't think this is related to the random seed. In my previous experiments, our methods were quite robust to different random seeds.

Thanks for your reply. I am using the same environment configuration (PyTorch 1.7.0); the graphics card is a 3090 with CUDA 11.3. Because the random seed is not completely fixed, the results vary somewhat across runs. During training, there is a high probability that "NaN or Inf found in input tensor." appears in the log, and it takes several reruns to get a normal run. Can you give me some advice? Thank you very much!
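For reference, one common way to fix the seed more completely in a PyTorch project is sketched below. This is not taken from the FRN code base: `seed_everything` is a hypothetical helper name, and the NumPy and PyTorch parts are skipped if those packages are not installed. Even with all of this set, some CUDA operations are nondeterministic, so small run-to-run differences may remain.

```python
import os
import random


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed every RNG that is available."""
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing more to seed

    torch.manual_seed(seed)           # PyTorch's default generator
    torch.cuda.manual_seed_all(seed)  # all GPUs; ignored if CUDA is unavailable
    # Prefer deterministic cuDNN kernels over auto-tuned ones (may be slower).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
```

Calling the helper once at the very start of the training script, before the model and data loaders are built, is the usual pattern.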

Just wondering, when does this error happen: during pre-training or fine-tuning?
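One way to answer that question is to stop at the first step where the loss becomes non-finite. The snippet below is only a sketch under assumptions: `assert_finite` is a hypothetical helper, and the `losses` list stands in for a real training loop, where the value passed in would be something like `loss.item()`.

```python
import math


def assert_finite(value: float, step: int, name: str = "loss") -> float:
    """Hypothetical guard: raise at the first NaN or Inf scalar."""
    if not math.isfinite(value):
        raise FloatingPointError(f"{name} is {value} at step {step}")
    return value


# Stand-in for a training loop; in a real PyTorch loop, pass loss.item().
losses = [0.9, 0.7, float("nan"), 0.5]
first_bad_step = None
for step, loss_value in enumerate(losses):
    try:
        assert_finite(loss_value, step)
    except FloatingPointError as err:
        first_bad_step = step  # remember where training first went wrong
        print(err)
        break
```

If the failing step needs to be traced back to a specific operation, PyTorch's `torch.autograd.set_detect_anomaly(True)` may also help, at the cost of slower training.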

Could you please also provide the script you use so that I can reproduce the error on my end?

Thanks!