MDIL-SNU/SevenNet

Getting NaN values during training

Closed this issue · 4 comments

Dear,

I am trying to train a simple model, but all of the loss values come out as NaN.
The log file is attached.

Could you give me some guidance?
Thank you so much

log.log

It seems like your data has no stress label or the label is strange (see 'Stress distribution' of log).

Have you tried setting is_train_stress to False? The key is in the train section of the input file.
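
As a minimal sketch of that suggestion, the fragment below shows only the relevant key. Everything else in the train section is omitted here and should stay as it is in your own input file.

```yaml
train:
    # Skip the stress term when the dataset has no usable stress labels.
    is_train_stress: False
```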

Hi @YutackPark,
I set it to False.

Now training stops without any error. The log ends at:

Trainer initialized, ready to training
------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
Epoch 1/10  lr: 0.001000
------------------------------------------------------------------------------------------------------------------------

Do you know why?
This is my input:
input.txt

Firstly, you should uncomment "# - ['TotalLoss', 'None']". SevenNet needs total loss to determine the best checkpoint to save.
However, if the missing total loss were the cause, SevenNet should have raised an error and quit instead of stopping silently.
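
A sketch of what the uncommented entry could look like. This assumes the list sits under an error_record key in the train section, as in SevenNet's example input. The Energy and Force entries are illustrative only and are not taken from the attached input.txt.

```yaml
train:
    error_record:                 # assumed key name, check your own input file
        - ['Energy', 'RMSE']      # illustrative entry
        - ['Force', 'RMSE']       # illustrative entry
        - ['TotalLoss', 'None']   # uncommented; used to pick the best checkpoint
```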

I failed to reproduce the issue with the same input but a different training set. Maybe training is just very slow. Could you share your dataset if you don't mind?

Hi @YutackPark,

The dataset is at this link.

With PR #89, you can set the input as:

data_format: 'ase'
data_format_args:
    energy_key: 'TotEnergy'
    force_key: 'force'

Then you can reproduce the problem.
I confirm that the problem above occurs on Windows. When I test on Linux, the problem disappears and the code runs well.