ragulpr/wtte-rnn-examples

Invalid loss; hello-world-datapipeline

Opened this issue · 3 comments

Hello,

Thanks for sharing this awesome example of using wtte-rnn.
When I try to run it, I get an invalid-loss error during the training phase. Playing around with it, training seems to be very sensitive to the initial value of alpha.

Epoch 35/100
1000/1000 [==============================] - 1s 1ms/step - loss: 1.4000 - val_loss: 1.9154
Epoch 36/100
1000/1000 [==============================] - 1s 1ms/step - loss: 1.3995 - val_loss: 1.9376
Epoch 37/100
1000/1000 [==============================] - 2s 2ms/step - loss: 1.3992 - val_loss: 1.9075
Epoch 38/100
800/1000 [=======================>......] - ETA: 0s - loss: 1.3988Batch 8: Invalid loss, terminating training
900/1000 [==========================>...] - ETA: 0s - loss: nan
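To see why a bad initial alpha produces exactly this kind of NaN, here is a minimal numpy sketch of the discrete Weibull log-likelihood that WTTE-RNN optimizes for an uncensored observation (a hand-rolled illustration with my own variable names, not the library's loss code):

```python
import numpy as np

def discrete_weibull_nll(t, alpha, beta):
    """Negative log-likelihood of an uncensored discrete Weibull observation:
    P(T = t) = exp(-(t/alpha)^beta) - exp(-((t+1)/alpha)^beta)
    """
    p = np.exp(-(t / alpha) ** beta) - np.exp(-((t + 1) / alpha) ** beta)
    with np.errstate(divide="ignore"):
        return -np.log(p)

# An alpha on the scale of the observed time-to-event gives a finite loss:
print(discrete_weibull_nll(t=100, alpha=100.0, beta=1.0))  # finite, ~5.6

# An alpha far below the data's scale underflows both exp terms to 0,
# so the log blows up -- the same failure mode as the NaN in the log above:
print(discrete_weibull_nll(t=100, alpha=0.1, beta=1.0))    # inf
```

The gradient w.r.t. alpha behaves the same way: the further the start is from the data's scale, the larger the first steps, which is why initialization matters so much here.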

Do you happen to know why this may be?

Many thanks,
Andris

Hi there,
That's a correct observation. Initialization is by far the most important cause of exploding gradients. If the initial value is far from the target, the first updates take huge gradient steps in one direction, overshooting the target and/or causing numerical instability due to large magnitudes.

99.9% of NaN cases at later stages of training come down to errors in the data or the chosen architecture:

  1. Ground truth is leaked/overfitted, so a perfect (infinite or zero) prediction becomes possible
  2. Censoring is predictable (leading to infinite predictions)
  3. Wrong magnitude of input data
  4. Unbounded activation functions in the pre-output layer (like relu or similar) leading to instability
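Point 4 is the easiest one to guard against: keep the pre-output activations bounded. Here is a minimal numpy sketch in the spirit of wtte-rnn's output layer (an illustration under my own assumptions and parameter names, not the library's actual code):

```python
import numpy as np

def output_activation(x, init_alpha=1.0, max_beta=5.0):
    """Map the two raw pre-output values to (alpha, beta), keeping both
    positive and beta bounded -- a guard against the instability in point 4.
    """
    raw_a, raw_b = x[..., 0], x[..., 1]
    alpha = init_alpha * np.exp(raw_a)      # positive, centered at init_alpha
    beta = max_beta / (1.0 + np.exp(-raw_b))  # bounded in (0, max_beta)
    return alpha, beta

# At raw output 0 the network starts exactly at (init_alpha, max_beta / 2),
# so alpha begins on the right scale and beta can never explode:
alpha, beta = output_activation(np.zeros(2), init_alpha=50.0)
print(alpha, beta)  # 50.0 2.5
```

With this kind of transform, even a wildly large raw activation maps beta into (0, max_beta), so the loss stays finite as long as alpha starts on a sane scale.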

Recommended reading:

Hello,

This algorithm is awesome, thank you for sharing those examples!

I also had this problem, but using data-pipeline-template.ipynb. The model actually explodes in the first epoch:

Train on 1141 samples, validate on 171 samples
Epoch 1/200
600/1141 [==============>...............] - ETA: 2s - loss: nan Batch 1: Invalid loss, terminating training

I am just running the Jupyter notebook exactly as it is. The only difference is the tensorflow.csv: I generated it with the provided code, so it may include a couple of months more data. I tried filtering the new data down to approximately the dataframe from the original run, but it still failed...

python 3.6.5
pandas 0.21.0
numpy 1.12.1
keras 2.1.6
theano 1.0.2
keras epsilon: 1e-08

Any ideas why this is happening, since I am following approximately the same steps as the example? I.e., I am not sure the topics under 'Recommended reading' apply here... please correct me if I am wrong.

Thank you very much!

Gabriel


EDIT:

Just saw that this problem was already addressed on the develop branch. It is working now! Thank you!

@gabrielgonzaga yes, in those months I think some high-frequency committer churned or something, but yes, it suddenly exploded lol. It'll be addressed in ragulpr/wtte-rnn#41; until then, just find the right initial alpha.
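For anyone landing here before that fix: one heuristic for "the right initial alpha" is to match the mean of a beta = 1 (geometric-like) discrete Weibull to the mean training time-to-event. This is a hedged sketch of that heuristic (my own helper name, and an assumption rather than the fix in #41):

```python
import numpy as np

def init_alpha_from_tte(tte):
    """Rough starting alpha from the data's scale.

    For beta = 1 the discrete Weibull is geometric-like with
    P(T > t) = exp(-t/alpha); setting its mean equal to the mean
    observed time-to-event m and solving for alpha gives:
    """
    m = np.nanmean(tte)
    return -1.0 / np.log(1.0 - 1.0 / (m + 1.0))

# Example: a few training time-to-event values (made-up numbers)
tte_train = np.array([3, 7, 12, 5, 9], dtype=float)
print(init_alpha_from_tte(tte_train))  # ~7.7, on the scale of the mean TTE
```

Starting alpha on the data's own scale keeps the first gradient steps small, which is usually enough to get past the NaN-at-epoch-1 failure.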