Training fails
Tomas0413 opened this issue · 11 comments
Hi, I'm having this issue when I run training:
python3 train.py ./data/train.csv.zip ./training_config.json
CRITICAL:root:Accuracy on test set: 0.9971641706053186
Traceback (most recent call last):
File "train.py", line 161, in <module>
train_cnn_rnn()
File "train.py", line 151, in train_cnn_rnn
os.rename(path, trained_dir + 'best_model.ckpt')
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_1486165230/model-2700' -> './trained_results_1486165230/best_model.ckpt'
I'll spend a bit of time tomorrow to see how to fix this problem.
Did you check the saved model directory? Looks like model-2700 doesn't exist.
@jiegzhan yes, the model-2700 files do exist, but there is no file or directory named exactly model-2700:
ls -lrt ./checkpoints_1486165230/
total 71404
-rw-r--r-- 1 root root 1433 Feb 3 23:41 model-1600.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1600.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:41 model-1600.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:41 model-1700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1700.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:41 model-1700.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2200.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2200.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2200.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2400.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2400.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2400.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001
-rw-r--r-- 1 root root 241 Feb 3 23:42 checkpoint
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2700.meta
I'll check whether train.py wrote something incorrectly or whether the os.rename call is wrong.
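The listing above suggests why os.rename fails: 'model-2700' is a prefix shared by several files (.index, .meta, .data-00000-of-00001), not a file or directory itself. Below is a minimal sketch of one possible workaround that moves every file sharing the prefix. The helper name move_checkpoint is hypothetical and not part of train.py, and whether the moved files can be restored as-is has not been verified here.

```python
import glob
import os
import shutil


def move_checkpoint(prefix, dest_prefix):
    """Move every file belonging to a checkpoint prefix to a new prefix.

    Hypothetical helper, not part of train.py. 'prefix' is something like
    './checkpoints_1486165230/model-2700'.
    """
    # Files named '<prefix>.<suffix>' (e.g. .index, .meta, .data-00000-of-00001).
    # glob.escape guards against glob metacharacters in the path itself.
    sources = sorted(glob.glob(glob.escape(prefix) + '.*'))
    # Assumption: older checkpoints may also have a file named exactly like the prefix.
    if os.path.isfile(prefix):
        sources.append(prefix)
    if not sources:
        raise FileNotFoundError('no checkpoint files match prefix: ' + prefix)

    moved = []
    for src in sources:
        suffix = src[len(prefix):]  # '' for the bare file, else '.index' etc.
        dest = dest_prefix + suffix
        shutil.move(src, dest)
        moved.append(dest)
    return moved
```

In train.py this would stand in for the failing call, for example move_checkpoint(path, trained_dir + 'best_model.ckpt'), assuming path holds the checkpoint prefix as it does in the traceback.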
python3 -c 'import tensorflow as tf; print(tf.__version__)'
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.12.1
My TensorFlow version is 0.9; it only produces two training files.
-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2700.meta
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001
The newer version has three training files, instead of two.
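To tell which layout a given checkpoint prefix has on disk, a small check like the following might help. It is only a sketch: the helper name checkpoint_format is hypothetical, and the naming rules it relies on (an .index file for the newer format, a file named exactly like the prefix for the older one) are assumptions based on the listings in this thread rather than a documented contract.

```python
import os


def checkpoint_format(prefix):
    """Guess which checkpoint layout exists on disk for 'prefix'.

    Hypothetical helper; the file-naming rules are assumptions inferred
    from the directory listings in this thread.
    """
    if os.path.isfile(prefix + '.index'):
        # Newer layout: '<prefix>.index' alongside '<prefix>.data-*' and '<prefix>.meta'.
        return 'V2'
    if os.path.isfile(prefix):
        # Assumed older layout: a data file named exactly like the prefix.
        return 'V1'
    # Nothing recognisable was found under this prefix.
    return None
```

A caller could branch on the result, using a plain os.rename only when it returns 'V1'.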
Do you get more files created in the checkpoints directory? I see *.meta, *.index, *.data-* and checkpoint.
My tensorflow version is 0.9, it only produces two training files.
Found this:
https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md
New checkpoint format becomes the default in tf.train.Saver. Old V1 checkpoints continue to be readable; controlled by the write_version argument, tf.train.Saver now by default writes out in the new V2 format. It significantly reduces the peak memory required and latency incurred during restore.
Set the write_version argument if you are in a hurry.
I will try to upgrade TensorFlow and make the changes soon.
Thanks for pointing this out.
Yep, testing it with V1 now.
Yep, works fine with:
saver = tf.train.Saver(tf.all_variables(), write_version=tf.train.SaverDef.V1)
This is what the warning message looks like:
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow:now on by default.
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************
Thanks
Hi guys,
Is there any solution for the above issue? If so, please reply.