Failed at training

Question

cctsou opened this issue 4 years ago · 3 comments

Hi, I tried to train a model on our own data but got the error below. Could you help me figure out how to fix it?

D:\AutoRT>python autort.py train -e 100 -b 256 -g models/base_models/model.json -u m -p 1 -i RT_train.pep.txt -sm min_max -l 30 -rlr -n 20 -o RT_model/
WARNING:tensorflow:From autort.py:4: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

Using TensorFlow backend.
Scaling method: min_max
New aa: 1 -> 20
Save aa coding data to file RT_model//aa.tsv
AA types: 21
Longest peptide in training data: 30

['1', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
RT range: 0 - 101

X_train shape:
(333047, 30, 21)
X_test shape:
(37006, 30, 21)
Modeling start ...
max_x_length: 30
Model file: 0 -> models/base_models/model_0.json
WARNING:tensorflow:From C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py:4185: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING:tensorflow:From C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py:131: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Model file: 1 -> models/base_models/model_1.json
Model file: 2 -> models/base_models/model_2.json
Model file: 3 -> models/base_models/model_3.json
WARNING:tensorflow:From C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py:3976: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

Model file: 4 -> models/base_models/model_4.json
Model file: 5 -> models/base_models/model_5.json
Model file: 6 -> models/base_models/model_6.json
Model file: 7 -> models/base_models/model_7.json
Model file: 8 -> models/base_models/model_8.json
Model file: 9 -> models/base_models/model_9.json
Training ...
Train model: 0
Build deep learning model ...
Save aa coding data to file RT_model//aa.tsv
AA types: 21
Longest peptide in training data: 30

['1', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
RT range: 0 - 101

X_train shape:
(333047, 30, 21)
X_test shape:
(37006, 30, 21)
Modeling start ...
Use input model ...
Use optimizer provided by user: adam
WARNING:tensorflow:From C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

optimizer: <class 'keras.optimizers.Adam'>
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d_1 (Conv1D)            (None, 30, 256)           16384
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 30, 256)           0
_________________________________________________________________
dropout_1 (Dropout)          (None, 30, 256)           0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 30, 256)           196864
_________________________________________________________________
activation_1 (Activation)    (None, 30, 256)           0
_________________________________________________________________
dropout_2 (Dropout)          (None, 30, 256)           0
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 30, 512)           393728
_________________________________________________________________
batch_normalization_1 (Batch (None, 30, 512)           2048
_________________________________________________________________
activation_2 (Activation)    (None, 30, 512)           0
_________________________________________________________________
dropout_3 (Dropout)          (None, 30, 512)           0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 30, 100)           169200
_________________________________________________________________
dropout_4 (Dropout)          (None, 30, 100)           0
_________________________________________________________________
flatten_1 (Flatten)          (None, 3000)              0
_________________________________________________________________
dense_1 (Dense)              (None, 512)               1536512
_________________________________________________________________
batch_normalization_2 (Batch (None, 512)               2048
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 512)               0
_________________________________________________________________
dropout_5 (Dropout)          (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 256)               131328
_________________________________________________________________
batch_normalization_3 (Batch (None, 256)               1024
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU)    (None, 256)               0
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257
=================================================================
Total params: 2,449,393
Trainable params: 2,446,833
Non-trainable params: 2,560
_________________________________________________________________
{'scaling_method': 'min_max', 'rt_max': 101.0, 'rt_min': 0.0}
Use ReduceLROnPlateau!
Use EarlyStopping: 20
Train on 333047 samples, validate on 37006 samples
Epoch 1/100
2020-09-02 23:37:34.011976: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-09-02 23:37:34.014701: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2020-09-02 23:37:34.096352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.83
pciBusID: 0000:01:00.0
2020-09-02 23:37:34.096535: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-09-02 23:37:34.097299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-09-02 23:37:34.779256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-02 23:37:34.779322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2020-09-02 23:37:34.779497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2020-09-02 23:37:34.780103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6692 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-09-02 23:37:38.321624: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.4.1.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2020-09-02 23:37:38.323234: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.4.1.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Traceback (most recent call last):
  File "autort.py", line 133, in <module>
    main()
  File "autort.py", line 99, in main
    add_reverse=add_reverse,add_ReduceLROnPlateau=add_ReduceLROnPlateau)
  File "D:\AutoRT\autort\RTModels.py", line 341, in ensemble_models
    add_ReduceLROnPlateau=add_ReduceLROnPlateau)
  File "D:\AutoRT\autort\RTModels.py", line 238, in train_model
    callbacks=all_callbacks)
  File "C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\engine\training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\engine\training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\Users\cct\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node conv1d_1_10/convolution}}]]
         [[loss/mul/_1123]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node conv1d_1_10/convolution}}]]
0 successful operations.
0 derived errors ignored.

Answer 1 · 2020-09-03T04:51:03.000Z

Hi @cctsou, AutoRT requires a GPU, with the corresponding CUDA/cuDNN libraries installed on the computer, to train models. If the libraries are not installed, you can follow the instructions at https://www.tensorflow.org/install/gpu to install them. Please make sure the libraries are compatible with the following versions of Keras and TensorFlow:

keras==2.2.4
tensorflow-gpu==1.13.1
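
For reference, the log in the question reports that the cuDNN runtime that was loaded (7.3.1) is older than the one TensorFlow was compiled against (7.4.1). The snippet below is a minimal sketch of that check using only the Python standard library. The function names `parse_cudnn_versions` and `is_compatible` are hypothetical helpers, not part of AutoRT or TensorFlow, and the compatibility rule is taken from the wording of the error message itself rather than from any official compatibility table.

```python
import re

# Matches the version pair in TensorFlow's cuDNN mismatch message,
# as it appears in the log posted in the question.
CUDNN_MISMATCH = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def parse_cudnn_versions(log_text):
    """Return ((loaded), (compiled)) version tuples, or None if no mismatch line is found."""
    match = CUDNN_MISMATCH.search(log_text)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


def is_compatible(loaded, compiled):
    """Apply the rule stated in the error message (an assumption based on that text).

    Major versions must match. For cuDNN 7.0 or later the loaded minor
    version may be equal or higher; for older releases it must be equal.
    """
    if loaded[0] != compiled[0]:
        return False
    if compiled[0] >= 7:
        return loaded[1] >= compiled[1]
    return loaded[1] == compiled[1]


if __name__ == "__main__":
    line = (
        "Loaded runtime CuDNN library: 7.3.1 but source was compiled with: 7.4.1.  "
        "CuDNN library major and minor version needs to match"
    )
    loaded, compiled = parse_cudnn_versions(line)
    print("loaded:", loaded, "compiled:", compiled)
    print("compatible:", is_compatible(loaded, compiled))
```

With the versions from the log, the check reports an incompatible pair, which is consistent with the message's own advice to upgrade the cuDNN library.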

Answer 2 · 2020-11-24T06:24:20.000Z

Hi @cctsou, I have updated AutoRT to support TensorFlow 2.x (2.3.1). The new version of AutoRT also supports CPU.
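
Before retraining with a TensorFlow 2.x install, it may help to confirm whether TensorFlow can see a GPU or will run on the CPU. The following is a minimal sketch, not part of AutoRT: `describe_training_device` is a hypothetical helper name, and it assumes `tf.config.list_physical_devices`, which is available in TensorFlow 2.1 and later.

```python
def describe_training_device():
    """Report which device type TensorFlow is likely to use.

    Returns one of "gpu", "cpu", "unknown", or "tensorflow-not-installed".
    """
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow-not-installed"

    # Older TensorFlow releases do not expose this function,
    # so fall back to "unknown" instead of raising.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        return "unknown"

    # An empty list means no usable GPU was detected.
    return "gpu" if list_devices("GPU") else "cpu"


if __name__ == "__main__":
    print("Training device:", describe_training_device())
```

If this prints "cpu" on a machine with a GPU, the CUDA/cuDNN libraries are probably not being found.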

Answer 3 · 2020-11-27T02:50:03.000Z

Please reopen this issue if you still have the same problem.