Can't start training

Question

Opened this issue 6 years ago · 7 comments

Hi,

When I execute

python nem.py with dataset.balls4mass64 network.r_nem nem.k=5

it prints the following but never starts training:

WARNING - R-NEM - No observers have been added to this run
INFO - R-NEM - Running command 'run'
INFO - R-NEM - Started
R-RNNEM/InputWrapper1/Conv/weights:0 [4, 4, 1, 16]
R-RNNEM/InputWrapper1/Conv/biases:0 [16]
R-RNNEM/LayerNormI1/LayerNorm/beta:0 [16]
R-RNNEM/LayerNormI1/LayerNorm/gamma:0 [16]
R-RNNEM/InputWrapper2/Conv/weights:0 [4, 4, 16, 32]
R-RNNEM/InputWrapper2/Conv/biases:0 [32]
R-RNNEM/LayerNormI2/LayerNorm/beta:0 [32]
R-RNNEM/LayerNormI2/LayerNorm/gamma:0 [32]
R-RNNEM/InputWrapper3/Conv/weights:0 [4, 4, 32, 64]
R-RNNEM/InputWrapper3/Conv/biases:0 [64]
R-RNNEM/LayerNormI3/LayerNorm/beta:0 [64]
R-RNNEM/LayerNormI3/LayerNorm/gamma:0 [64]
R-RNNEM/InputWrapper5/fully_connected/weights:0 [4096, 512]
R-RNNEM/InputWrapper5/fully_connected/biases:0 [512]
R-RNNEM/LayerNormI5/LayerNorm/beta:0 [512]
R-RNNEM/LayerNormI5/LayerNorm/gamma:0 [512]
R-RNNEM/NPE/fully_connected/weights:0 [250, 250]
R-RNNEM/NPE/fully_connected/biases:0 [250]
R-RNNEM/NPE/LayerNorm/beta:0 [250]
R-RNNEM/NPE/LayerNorm/gamma:0 [250]
R-RNNEM/NPE/fully_connected_1/weights:0 [500, 250]
R-RNNEM/NPE/fully_connected_1/biases:0 [250]
R-RNNEM/NPE/LayerNorm_1/beta:0 [250]
R-RNNEM/NPE/LayerNorm_1/gamma:0 [250]
R-RNNEM/NPE/fully_connected_2/weights:0 [250, 250]
R-RNNEM/NPE/fully_connected_2/biases:0 [250]
R-RNNEM/NPE/LayerNorm_2/beta:0 [250]
R-RNNEM/NPE/LayerNorm_2/gamma:0 [250]
R-RNNEM/NPE/fully_connected_3/weights:0 [250, 100]
R-RNNEM/NPE/fully_connected_3/biases:0 [100]
R-RNNEM/NPE/LayerNorm_3/beta:0 [100]
R-RNNEM/NPE/LayerNorm_3/gamma:0 [100]
R-RNNEM/NPE/fully_connected_4/weights:0 [100, 1]
R-RNNEM/NPE/fully_connected_4/biases:0 [1]
R-RNNEM/NPE/fully_connected_5/weights:0 [1012, 250]
R-RNNEM/NPE/fully_connected_5/biases:0 [250]
R-RNNEM/LayerNormR0/LayerNorm/beta:0 [250]
R-RNNEM/LayerNormR0/LayerNorm/gamma:0 [250]
R-RNNEM/OutputWrapper0/fully_connected/weights:0 [250, 512]
R-RNNEM/OutputWrapper0/fully_connected/biases:0 [512]
R-RNNEM/LayerNormO0/LayerNorm/beta:0 [512]
R-RNNEM/LayerNormO0/LayerNorm/gamma:0 [512]
R-RNNEM/OutputWrapper1/fully_connected/weights:0 [512, 4096]
R-RNNEM/OutputWrapper1/fully_connected/biases:0 [4096]
R-RNNEM/LayerNormO1/LayerNorm/beta:0 [4096]
R-RNNEM/LayerNormO1/LayerNorm/gamma:0 [4096]
R-RNNEM/OutputWrapper3/Conv/weights:0 [4, 4, 64, 32]
R-RNNEM/OutputWrapper3/Conv/biases:0 [32]
R-RNNEM/LayerNormO3/LayerNorm/beta:0 [32]
R-RNNEM/LayerNormO3/LayerNorm/gamma:0 [32]
R-RNNEM/OutputWrapper4/Conv/weights:0 [4, 4, 32, 16]
R-RNNEM/OutputWrapper4/Conv/biases:0 [16]
R-RNNEM/LayerNormO4/LayerNorm/beta:0 [16]
R-RNNEM/LayerNormO4/LayerNorm/gamma:0 [16]
R-RNNEM/OutputWrapper5/Conv/weights:0 [4, 4, 16, 1]
R-RNNEM/OutputWrapper5/Conv/biases:0 [1]
4951978 total variables

Answer 1 · 2018-08-06T20:44:42.000Z

That output looks perfectly fine to me. Logs are printed every epoch, which depending on your GPU may take a while.

Perhaps you can add a print statement after each batch to check whether this is the case? Happy to help you further if it really does not train.
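A per-batch print could look roughly like the sketch below. It is not taken from nem.py: `run_epoch`, `train_step`, and `report_every` are hypothetical names, and the real training loop in the repository may be structured differently.

```python
import time


def run_epoch(batches, train_step, report_every=1):
    """Call train_step on every batch and print progress along the way.

    `train_step` is a hypothetical callable that takes one batch and
    returns its loss as a float.
    """
    start = time.time()
    losses = []
    for i, batch in enumerate(batches, start=1):
        losses.append(train_step(batch))
        if i % report_every == 0:
            # flush=True makes the line show up immediately, even when
            # stdout is buffered (for example when redirected to a file).
            print("batch %d  loss %.4f  elapsed %.1fs"
                  % (i, losses[-1], time.time() - start), flush=True)
    return losses


# Toy usage with a fake training step that just halves the batch value.
demo_losses = run_epoch([1, 2, 3], lambda batch: batch * 0.5)
```

If lines appear after every batch, training is running and the epoch is simply slow; if nothing appears at all, the run is stuck before the first step.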

Answer 2 · 2018-08-06T21:15:13.000Z

Thanks for your quick response.
It really does not seem to be training, as there is no PID for it. I also checked by running nvidia-smi.

By the way, I'm using an NVIDIA Tesla P100 with 16 GB of device RAM.

Answer 3 · 2018-08-07T16:59:30.000Z

Hmm, that is strange indeed. Just to be sure, do you have tensorflow-gpu installed? If you installed the plain tensorflow package, it will run on the CPU by default.
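One way to check which devices TensorFlow actually sees is sketched below. `list_tf_devices` is a hypothetical helper, and it assumes `device_lib.list_local_devices()` is available in the installed TensorFlow version. The exact device names differ between versions, so no expected output is given.

```python
def list_tf_devices():
    """Return the device names TensorFlow can see.

    Returns None when TensorFlow cannot be imported at all, for example
    because the package is missing or a CUDA library fails to load.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError as err:
        print("TensorFlow could not be imported:", err)
        return None
    return [device.name for device in device_lib.list_local_devices()]


devices = list_tf_devices()
print(devices)
```

If the list contains only a CPU entry, the installed build is not using the GPU, which would also explain why nvidia-smi shows no process.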

Answer 4 · 2018-08-07T18:13:24.000Z

I installed tensorflow-gpu==1.2.1, but it shows the error

ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

I think the reason is that tensorflow 1.2.1 does not support CUDA 9. However, I'm using a GPU provided by our university, so I can't install CUDA 8 in place of CUDA 9...

Can you think of any other way to solve this problem?
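A quick way to confirm which CUDA runtime libraries the dynamic loader can find is sketched below. The library file names other than libcusolver.so.8.0 are assumptions (the CUDA 9 names depend on the exact minor version installed), and `can_load` is a hypothetical helper.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False


# Names assumed for illustration; adjust them to the CUDA version installed.
candidates = (
    "libcusolver.so.8.0",
    "libcusolver.so.9.0",
    "libcudart.so.8.0",
    "libcudart.so.9.0",
)
for name in candidates:
    print(name, "found" if can_load(name) else "missing")
```

If only the 9.x libraries are found, the ImportError above is expected, since the tensorflow-gpu 1.2.1 binary asks for libcusolver.so.8.0.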

Thank you.

Answer 5 · 2018-08-08T09:15:40.000Z

This code base is several TensorFlow versions behind, so I think upgrading the code to support a later version of TensorFlow would be your only alternative.

I can try to have a look at it at some point, but I don't have a lot of time available right now to do the required testing.

Answer 6 · 2018-08-08T18:01:17.000Z

Thank you so much for your help. I will let you know if I have any further questions.

Answer 7 · 2018-08-12T22:42:36.000Z

I was able to run the code on TensorFlow 1.9 after renaming the NEMCell property input_shape (I had to change it to something else because it conflicted with the RNNCell property names). Other than that, I don't think the code needs to be upgraded.
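One plausible reading of that conflict is that the base class already defines a read-only property with the same name, so assigning to it in the subclass fails. The sketch below shows this in plain Python with stand-in classes. `BaseCell`, `ClashingCell`, `RenamedCell`, and `_nem_input_shape` are all hypothetical names, not the actual TensorFlow or R-NEM code.

```python
class BaseCell(object):
    """Hypothetical stand-in for a framework base class (not the real RNNCell)."""

    @property
    def input_shape(self):
        # Read-only property owned by the base class.
        return "shape managed by the framework"


class ClashingCell(BaseCell):
    def __init__(self, input_shape):
        # Fails: the inherited property has no setter.
        self.input_shape = input_shape


class RenamedCell(BaseCell):
    def __init__(self, input_shape):
        # A different attribute name avoids the clash (hypothetical name).
        self._nem_input_shape = input_shape


try:
    ClashingCell((64, 64, 1))
    clash = False
except AttributeError:
    clash = True

cell = RenamedCell((64, 64, 1))
print(clash, cell._nem_input_shape)
```

Renaming the attribute leaves the base-class property untouched, which matches the one-line change described above.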