Invalid loss; hello-world-datapipeline
Hello,
Thanks for sharing this awesome example of using wtte-rnn.
When I try to run it, I get an invalid-loss error during the training phase. Playing around with it, the model seems to be very sensitive to the initial value of alpha.
Epoch 35/100
1000/1000 [==============================] - 1s 1ms/step - loss: 1.4000 - val_loss: 1.9154
Epoch 36/100
1000/1000 [==============================] - 1s 1ms/step - loss: 1.3995 - val_loss: 1.9376
Epoch 37/100
1000/1000 [==============================] - 2s 2ms/step - loss: 1.3992 - val_loss: 1.9075
Epoch 38/100
800/1000 [=======================>......] - ETA: 0s - loss: 1.3988
Batch 8: Invalid loss, terminating training
900/1000 [==========================>...] - ETA: 0s - loss: nan
Do you happen to know why this may be?
Many thanks,
Andris
Hi there,
That's a correct observation. Initialization is by far the most important cause of exploding gradients. If the initial value is far off, the optimizer takes a huge gradient step in one direction, overshooting the target and/or becoming numerically unstable due to the large magnitudes.
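For concreteness, here is a minimal sketch of data-driven initialization, assuming the `wtte` package API shown in the repo README (`wtte.output_lambda`, `wtte.loss`) and roughly following the hello-world notebook; the dummy data and the exact `init_alpha` formula are illustrative assumptions, not a fixed recipe:

```python
import numpy as np
import wtte.wtte as wtte
from keras.models import Sequential
from keras.layers import Dense, Lambda, GRU

# Dummy shapes, purely illustrative: (samples, timesteps, features);
# targets carry (time-to-event, censoring-indicator) per step.
n, t, f = 1000, 100, 2
x_train = np.random.normal(size=(n, t, f)).astype('float32')
y_train = np.abs(np.random.normal(10.0, 2.0, size=(n, t, 2))).astype('float32')
y_train[:, :, 1] = 1.0  # pretend every observation is uncensored

# Initialize alpha near the mean time-to-event of the training set, so
# the first gradient steps start from a sensible Weibull scale.
tte_mean_train = np.nanmean(y_train[:, :, 0])
init_alpha = -1.0 / np.log(1.0 - 1.0 / (tte_mean_train + 1.0))

model = Sequential()
model.add(GRU(1, input_shape=(t, f), activation='tanh',
              return_sequences=True))
model.add(Dense(2))  # raw (alpha, beta) pre-activations
model.add(Lambda(wtte.output_lambda,
                 arguments={'init_alpha': init_alpha,
                            'max_beta_value': 4.0}))  # bound beta for stability
model.compile(loss=wtte.loss(kind='discrete').loss_function,
              optimizer='adam')
```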
99.9% of NaN cases at later stages of training are caused by errors in the data or in the chosen architecture:
- Ground truth is leaked/overfitted, so a perfect (infinite or zero) prediction becomes possible
- Censoring is predictable (leading to infinity-predictions)
- Wrong magnitude of input data (see the sketch after this list)
- Unbounded activation functions in the pre-output layer (like relu or similar) leading to instability
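To illustrate the last two bullets, a small sketch in generic Keras (not specific to this repo): standardize input magnitudes and keep the pre-output layer bounded.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import GRU, Dense

# Wrong magnitude of input data: standardize features to roughly
# zero mean / unit variance before feeding them to the network.
x = np.random.uniform(0, 1000, size=(500, 50, 3)).astype('float32')
x = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-8)

# Unbounded pre-output activations: prefer a bounded activation such as
# tanh before the output layer instead of relu, so the raw (alpha, beta)
# pre-activations cannot blow up.
model = Sequential()
model.add(GRU(16, input_shape=(50, 3), activation='tanh',
              return_sequences=True))
model.add(Dense(8, activation='tanh'))  # bounded, not relu
model.add(Dense(2))                     # raw outputs for the output lambda
```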
Recommended reading:
Hello,
This algorithm is awesome, thank you for sharing these examples!
I also ran into this problem, but with data-pipeline-template.ipynb. The model actually explodes in the first epoch:
Train on 1141 samples, validate on 171 samples
Epoch 1/200
600/1141 [==============>...............] - ETA: 2s - loss: nan
Batch 1: Invalid loss, terminating training
I am just running the Jupyter notebook exactly as it is. The only difference is in tensorflow.csv: I am using a tensorflow.csv that I regenerated with the provided code (which may contain a couple more months of data). I tried filtering out the newer data to approximate the dataframe from the original run, but it still failed...
python 3.6.5
pandas 0.21.0
numpy 1.12.1
keras 2.1.6
theano 1.0.2
keras epsilon: 1e-08
Any ideas why this is happening, since I am following approximately the same steps as the example? I.e. I am not sure whether the topics discussed under 'Recommended reading' would apply here... please correct me if I am wrong.
Thank you very much!
Gabriel
EDIT:
Just saw that this problem was already addressed on the develop branch. It is working now! Thank you!
@gabrielgonzaga Yes, in those months I think some high-frequency committer churned or something, but yes, it suddenly exploded lol. It'll be addressed in ragulpr/wtte-rnn#41; until then, just find the right initial alpha.
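Until that lands, a hedged sketch of what "find the right initial alpha" can look like: sweep a few candidate values and keep the first one that survives a short run. `build_model` is a hypothetical helper wrapping the model definition sketched earlier, and the multipliers are arbitrary; `keras.callbacks.TerminateOnNaN` is the callback that prints the "Invalid loss, terminating training" message seen in the logs above.

```python
import numpy as np
from keras.callbacks import TerminateOnNaN

# Coarse sweep around the data-driven estimate; multipliers are arbitrary.
candidates = [init_alpha * m for m in (0.25, 0.5, 1.0, 2.0, 4.0)]

for alpha in candidates:
    model = build_model(init_alpha=alpha)  # hypothetical helper, see sketch above
    history = model.fit(x_train, y_train, epochs=5, batch_size=100,
                        verbose=0, callbacks=[TerminateOnNaN()])
    if not np.isnan(history.history['loss'][-1]):
        print('stable init_alpha: %.3f' % alpha)
        break
```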