AILAB-CEFET-RJ/stconvs2s

MAE and RMSE are NaN

Closed this issue · 11 comments

When I used five frames to predict fifteen, I found that MAE and RMSE were both NaN.

Hi, could you give me more info so I can investigate? Which dataset did you use, or does it happen with both? What model? Are you training from scratch?
Thanks.

I have both datasets. I use python main.py -i 10 -v 4 -s 15 --plot > output/full-dataset/results/cfsr-stconvs2s-step15-rmse-v4.out to run the code.

I used the stconvs2s model.

I trained from scratch.

OK, I'll take a look.

I was unable to reproduce this issue. Below are the results of running 2 iterations over the CHIRPS dataset with the STConvS2S model.

chirps-stconvs2s-step15-rmse-v4.txt

First, check whether the dataset you downloaded has the same checksum as reported in the Zenodo repository.
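If it helps, a minimal way to compute the checksum locally (the file name below is just an example; compare the printed hash with the value listed on Zenodo):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Read the file in chunks so large datasets do not need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example path only; point it at the file you actually downloaded.
print(md5sum("data/dataset-chirps.nc"))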

Since this could be a problem caused by different environment settings, especially the GPU (see 1 and 2), I suggest you do the following:

To get a quick training run, use --small-dataset and a small number of iterations (-i 2). You can use the script below on the GPU to see if it still produces NaN results in this simple training. After that, force device = torch.device('cpu') here, to see what happens on the CPU (a minimal sketch of that override follows the command below).

python main.py -i 2 -v 4 -s 15 --plot --small-dataset > output/full-dataset/results/cfsr-stconvs2s-step15-rmse-v4-small.out
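For the CPU test, the idea is simply to override the device selection. A minimal, self-contained sketch of that override (the model and tensor shapes here are placeholders, not the repo's actual setup):

import torch
import torch.nn as nn

# Force CPU execution to rule out GPU/driver issues. In the repo the device
# is chosen in the training setup code, which is where the override would go.
device = torch.device("cpu")

# Any model and batch moved to this device run entirely on the CPU.
model = nn.Conv3d(1, 16, kernel_size=3, padding=1).to(device)
batch = torch.randn(2, 1, 5, 32, 32, device=device)
output = model(batch)
print(output.device, torch.isnan(output).any().item())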

My environment settings:

PyTorch 1.0
Python 3.6
CUDA Version: 10.1
GPU GeForce GTX 1080

Please inform your environment settings.
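If in doubt, something like this prints the relevant versions (assuming a working PyTorch install):

import platform
import torch

# Prints the versions that matter for reproducing GPU-related issues.
print("Python :", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU    :", torch.cuda.get_device_name(0))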

Have you tried it on the CFSR dataset? When I used five frames to predict fifteen on CFSR, "No checkpoint found" appeared.

Results of training with the CFSR dataset to predict 15 frames.
cfsr-stconvs2s-step15-rmse-v4.txt

Send your results and your environment settings so I can check them. Also, try training on the CPU as I said in the previous comment.

I can't run on the CPU. I don't have enough memory.

RUN MODEL: STConvS2s
Device: cuda:0
=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-105020.pth.tar
timestamp: 984.9105520248413
Training time: 0:16:24.91
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-110655.pth.tar
timestamp: 985.0962965488434
Training time: 0:16:25.10
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-112329.pth.tar
timestamp: 985.7570700645447
Training time: 0:16:25.76
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-114005.pth.tar
timestamp: 985.1541411876678
Training time: 0:16:25.15
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-115639.pth.tar
timestamp: 984.8205137252808
Training time: 0:16:24.82
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-121313.pth.tar
timestamp: 986.3730194568634
Training time: 0:16:26.37
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-122949.pth.tar
timestamp: 987.6874010562897
Training time: 0:16:27.69
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-124627.pth.tar
timestamp: 985.93021941185
Training time: 0:16:25.93
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-130302.pth.tar
timestamp: 985.8921804428101
Training time: 0:16:25.89
STConvS2s RMSELoss: nan

=> No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-131937.pth.tar
timestamp: 986.9276404380798
Training time: 0:16:26.93
STConvS2s RMSELoss: nan

timestamp: 985.8549034357071

Errors: [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Times: [984.9105520248413, 985.0962965488434, 985.7570700645447, 985.1541411876678, 984.8205137252808, 986.3730194568634, 987.6874010562897, 985.93021941185, 985.8921804428101, 986.9276404380798]

Mean and standard deviation after 10 iterations
=> Test Error: mean: nan, std: nan
=> Training time: mean_readable: 0:16:25.85, mean: 985.8549, std: 0.884432

Based on your log, the NaN problem is due to this: => No checkpoint found at /root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s/cfsr_rmseloss_4_20200521-105020.pth.tar, which means that during the training phase the model wasn't able to save the checkpoints in this folder.
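A quick way to confirm that is to check whether the process can actually create and write to that folder. A small sketch, using the path copied from your log (adjust it to your installation):

import os

# Checkpoint folder taken from the log above.
ckpt_dir = "/root/lny/stconvs2s-master/output/full-dataset/checkpoints/STConvS2s"

# Create the folder if it is missing and confirm the process can write to it.
os.makedirs(ckpt_dir, exist_ok=True)
print("exists  :", os.path.isdir(ckpt_dir))
print("writable:", os.access(ckpt_dir, os.W_OK))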

Check the latest release, and for more information in the log you can add the --verbose parameter.