sjoerdvansteenkiste/Relational-NEM

Can't start training

Hi,

When I execute

python nem.py with dataset.balls4mass64 network.r_nem nem.k=5

It prints the following but never starts training:

WARNING - R-NEM - No observers have been added to this run
INFO - R-NEM - Running command 'run'
INFO - R-NEM - Started
R-RNNEM/InputWrapper1/Conv/weights:0 [4, 4, 1, 16]
R-RNNEM/InputWrapper1/Conv/biases:0 [16]
R-RNNEM/LayerNormI1/LayerNorm/beta:0 [16]
R-RNNEM/LayerNormI1/LayerNorm/gamma:0 [16]
R-RNNEM/InputWrapper2/Conv/weights:0 [4, 4, 16, 32]
R-RNNEM/InputWrapper2/Conv/biases:0 [32]
R-RNNEM/LayerNormI2/LayerNorm/beta:0 [32]
R-RNNEM/LayerNormI2/LayerNorm/gamma:0 [32]
R-RNNEM/InputWrapper3/Conv/weights:0 [4, 4, 32, 64]
R-RNNEM/InputWrapper3/Conv/biases:0 [64]
R-RNNEM/LayerNormI3/LayerNorm/beta:0 [64]
R-RNNEM/LayerNormI3/LayerNorm/gamma:0 [64]
R-RNNEM/InputWrapper5/fully_connected/weights:0 [4096, 512]
R-RNNEM/InputWrapper5/fully_connected/biases:0 [512]
R-RNNEM/LayerNormI5/LayerNorm/beta:0 [512]
R-RNNEM/LayerNormI5/LayerNorm/gamma:0 [512]
R-RNNEM/NPE/fully_connected/weights:0 [250, 250]
R-RNNEM/NPE/fully_connected/biases:0 [250]
R-RNNEM/NPE/LayerNorm/beta:0 [250]
R-RNNEM/NPE/LayerNorm/gamma:0 [250]
R-RNNEM/NPE/fully_connected_1/weights:0 [500, 250]
R-RNNEM/NPE/fully_connected_1/biases:0 [250]
R-RNNEM/NPE/LayerNorm_1/beta:0 [250]
R-RNNEM/NPE/LayerNorm_1/gamma:0 [250]
R-RNNEM/NPE/fully_connected_2/weights:0 [250, 250]
R-RNNEM/NPE/fully_connected_2/biases:0 [250]
R-RNNEM/NPE/LayerNorm_2/beta:0 [250]
R-RNNEM/NPE/LayerNorm_2/gamma:0 [250]
R-RNNEM/NPE/fully_connected_3/weights:0 [250, 100]
R-RNNEM/NPE/fully_connected_3/biases:0 [100]
R-RNNEM/NPE/LayerNorm_3/beta:0 [100]
R-RNNEM/NPE/LayerNorm_3/gamma:0 [100]
R-RNNEM/NPE/fully_connected_4/weights:0 [100, 1]
R-RNNEM/NPE/fully_connected_4/biases:0 [1]
R-RNNEM/NPE/fully_connected_5/weights:0 [1012, 250]
R-RNNEM/NPE/fully_connected_5/biases:0 [250]
R-RNNEM/LayerNormR0/LayerNorm/beta:0 [250]
R-RNNEM/LayerNormR0/LayerNorm/gamma:0 [250]
R-RNNEM/OutputWrapper0/fully_connected/weights:0 [250, 512]
R-RNNEM/OutputWrapper0/fully_connected/biases:0 [512]
R-RNNEM/LayerNormO0/LayerNorm/beta:0 [512]
R-RNNEM/LayerNormO0/LayerNorm/gamma:0 [512]
R-RNNEM/OutputWrapper1/fully_connected/weights:0 [512, 4096]
R-RNNEM/OutputWrapper1/fully_connected/biases:0 [4096]
R-RNNEM/LayerNormO1/LayerNorm/beta:0 [4096]
R-RNNEM/LayerNormO1/LayerNorm/gamma:0 [4096]
R-RNNEM/OutputWrapper3/Conv/weights:0 [4, 4, 64, 32]
R-RNNEM/OutputWrapper3/Conv/biases:0 [32]
R-RNNEM/LayerNormO3/LayerNorm/beta:0 [32]
R-RNNEM/LayerNormO3/LayerNorm/gamma:0 [32]
R-RNNEM/OutputWrapper4/Conv/weights:0 [4, 4, 32, 16]
R-RNNEM/OutputWrapper4/Conv/biases:0 [16]
R-RNNEM/LayerNormO4/LayerNorm/beta:0 [16]
R-RNNEM/LayerNormO4/LayerNorm/gamma:0 [16]
R-RNNEM/OutputWrapper5/Conv/weights:0 [4, 4, 16, 1]
R-RNNEM/OutputWrapper5/Conv/biases:0 [1]
4951978 total variables

That output looks perfectly fine to me. Logs are printed every epoch, which depending on your GPU may take a while.

Perhaps you can add a print statement after each batch to check whether this is the case? Happy to help further if it really does not train.
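A per-batch progress line like the one suggested above could look like the following minimal sketch. The names `run_epoch`, `train_step`, and `batches` are hypothetical stand-ins and are not taken from nem.py, so the actual training loop in the repository will differ. The explicit flush matters because buffered output can make a running job look idle.

```python
import sys
import time


def run_epoch(batches, train_step, log_every=1, out=sys.stdout):
    """Call train_step on every batch and print a progress line as it goes.

    `train_step` is a hypothetical callable that takes one batch and
    returns a float loss; it stands in for whatever the real loop does.
    """
    losses = []
    start = time.time()
    for i, batch in enumerate(batches):
        loss = train_step(batch)
        losses.append(loss)
        if i % log_every == 0:
            elapsed = time.time() - start
            out.write("batch {:d}  loss {:.4f}  elapsed {:.1f}s\n".format(i, loss, elapsed))
            out.flush()  # show the line immediately, even when stdout is redirected
    return losses


if __name__ == "__main__":
    # Toy usage: three fake batches and a fake training step.
    run_epoch([1, 2, 3], lambda batch: batch * 0.5)
```

If a line appears for every batch, training is running and the per-epoch log simply has not been reached yet.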

Thanks for your quick response.
It really does not seem to be training, as there's no PID. I also checked by executing nvidia-smi.

By the way, I'm using an NVIDIA Tesla P100 with 16 GB of device RAM.
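The nvidia-smi check mentioned above can also be scripted. The sketch below uses nvidia-smi's CSV query mode to list compute processes; the helper names `parse_compute_apps` and `gpu_compute_apps` are made up for illustration. It assumes each output row has the form `pid, process_name, used_memory`, and it returns `None` when nvidia-smi is not on the PATH.

```python
import shutil
import subprocess

# Query flags for nvidia-smi's CSV output mode.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(text):
    """Parse 'pid, name, memory' rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        pid, rest = line.split(",", 1)
        if "," not in rest or not pid.strip().isdigit():
            continue
        name, memory = rest.rsplit(",", 1)
        rows.append({"pid": int(pid), "name": name.strip(), "memory": memory.strip()})
    return rows


def gpu_compute_apps():
    """Return the processes currently using the GPU, or None without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return None
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_compute_apps(output)


if __name__ == "__main__":
    print(gpu_compute_apps())
```

An empty list while nem.py is running would match the observation that no PID shows up on the GPU.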

Hmm, that is strange indeed. Just to be sure, do you have tensorflow-gpu installed? If you installed the plain tensorflow package instead, it will run on the CPU by default.
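One way to see which devices TensorFlow itself can use is to ask it for its local device list. This is a rough sketch: `device_lib.list_local_devices()` lives in an internal TensorFlow module that is widely used with 1.x releases but may change between versions, and the helper names `list_tf_devices` and `has_gpu` are made up here. A failed import is reported rather than raised.

```python
def list_tf_devices():
    """Return (device_type, name) pairs seen by TensorFlow, or None if import fails."""
    try:
        # Imported lazily so that a broken install is reported, not fatal.
        from tensorflow.python.client import device_lib
    except ImportError as err:
        print("TensorFlow could not be imported: {}".format(err))
        return None
    return [(d.device_type, d.name) for d in device_lib.list_local_devices()]


def has_gpu(devices):
    """True if any listed device is a GPU."""
    return any(kind == "GPU" for kind, _ in devices)


if __name__ == "__main__":
    devices = list_tf_devices()
    if devices is not None:
        print(devices)
        print("GPU visible:", has_gpu(devices))
```

If only CPU devices are listed, the job will run on the CPU no matter which GPU is in the machine.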

I installed tensorflow-gpu==1.2.1, but it fails with the following error:

ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

I think the reason is that tensorflow 1.2.1 does not support CUDA 9. However, I'm using a GPU provided by our university, so I can't install CUDA 8 in place of 9.

Can you think of any other way to solve this problem?

Thank you.
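The library mismatch described above can be confirmed without importing TensorFlow by asking the dynamic loader directly. In this minimal sketch, `libcusolver.so.8.0` is the name from the error message, while `libcusolver.so.9.0` is an assumption about what a CUDA 9.0 install provides. The helper names `can_load` and `report` are hypothetical.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


def report(library_names):
    """Map each library name to whether it can be loaded on this machine."""
    return {name: can_load(name) for name in library_names}


if __name__ == "__main__":
    # The second name is an assumed CUDA 9.0 counterpart of the first.
    for name, ok in report(["libcusolver.so.8.0", "libcusolver.so.9.0"]).items():
        print("{}: {}".format(name, "found" if ok else "missing"))
```

If only the 9.x library loads, the installed TensorFlow build is looking for a CUDA version the machine does not have.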

This code base is several tf versions behind, so I think upgrading the code to support a later version of tensorflow would be your only alternative.

I can try to have a look at it at some point, but I don't have a lot of time available right now to do the required testing.

Thank you so much for your help. I will let you know if I have any further questions.

I was able to run the code on tensorflow 1.9 after renaming the NEMCell property input_shape (I had to change it to something else because it conflicts with the RNNCell property names). Other than that, I don't think the code needs to be upgraded.
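To illustrate the kind of name clash described above, here is a plain-Python sketch with stand-in classes. It does not use TensorFlow, and the exact cause in the repository is an assumption: the sketch supposes that the base class exposes `input_shape` as a read-only property, which a subclass then cannot assign to. The classes `BaseCell`, `ClashingCell`, and `RenamedCell`, and the new name `cell_input_shape`, are all hypothetical.

```python
class BaseCell(object):
    """Stand-in for a framework base class with a read-only property."""

    @property
    def input_shape(self):
        return None


class ClashingCell(BaseCell):
    def __init__(self, input_shape):
        # Fails: the inherited property has no setter.
        self.input_shape = input_shape


class RenamedCell(BaseCell):
    def __init__(self, input_shape):
        # Works: the attribute name no longer collides with the base class.
        self.cell_input_shape = input_shape


if __name__ == "__main__":
    try:
        ClashingCell((64, 64, 1))
    except AttributeError as err:
        print("clash:", err)
    print("renamed:", RenamedCell((64, 64, 1)).cell_input_shape)
```

After such a rename, every use of the old attribute name inside the cell has to be updated to match.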