Training fails

Question

Tomas0413 opened this issue 8 years ago · 11 comments

Hi, I'm having this issue when I run training:

python3 train.py ./data/train.csv.zip ./training_config.json

CRITICAL:root:Accuracy on test set: 0.9971641706053186
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    train_cnn_rnn()
  File "train.py", line 151, in train_cnn_rnn
    os.rename(path, trained_dir + 'best_model.ckpt')
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_1486165230/model-2700' -> './trained_results_1486165230/best_model.ckpt'

I'll spend a bit of time tomorrow to see how to fix this problem.

Answer 1 · 2017-02-04T01:20:51.000Z

Did you check the saved model directory? Looks like model-2700 doesn't exist.

os.rename(path, trained_dir + 'best_model.ckpt')
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_1486165230/model-2700' -> './trained_results_1486165230/best_model.ckpt'

Answer 2 · 2017-02-04T10:00:17.000Z

@jiegzhan Yes, the model-2700 files do exist, but there is no file or directory named exactly model-2700:

ls -lrt ./checkpoints_1486165230/
total 71404
-rw-r--r-- 1 root root 1433 Feb 3 23:41 model-1600.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1600.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:41 model-1600.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:41 model-1700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1700.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:41 model-1700.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2200.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2200.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2200.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2400.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2400.data-00000-of-00001
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2400.meta
-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001
-rw-r--r-- 1 root root 241 Feb 3 23:42 checkpoint
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2700.meta

I'll check whether train.py failed to write something correctly or whether the os.rename call is wrong.
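
For anyone hitting the same traceback: the listing above suggests that model-2700 is a filename prefix shared by several files rather than a single file, which would explain why os.rename on the bare prefix fails. Below is a minimal sketch of one possible workaround that moves every file sharing the prefix. The helper name move_checkpoint is hypothetical and is not part of train.py or TensorFlow; it assumes all checkpoint files are named either exactly the prefix or the prefix followed by a dot and a suffix.

```python
import glob
import os
import shutil


def move_checkpoint(prefix, dest_prefix):
    """Move all files named `prefix` or `prefix.<suffix>` to `dest_prefix<suffix>`.

    Hypothetical helper (not part of train.py or TensorFlow).
    """
    # Make sure the destination directory exists before moving anything.
    os.makedirs(os.path.dirname(dest_prefix) or ".", exist_ok=True)

    # A bare `prefix` file (older format) plus any `prefix.*` files (newer format).
    candidates = [prefix] + sorted(glob.glob(glob.escape(prefix) + ".*"))

    moved = []
    for src in candidates:
        if not os.path.isfile(src):
            continue
        suffix = src[len(prefix):]  # e.g. ".index", ".meta", or "" for a bare file
        dst = dest_prefix + suffix
        shutil.move(src, dst)
        moved.append(dst)

    if not moved:
        raise FileNotFoundError("no checkpoint files found for prefix: " + prefix)
    return moved
```

With the paths from the traceback, the call would look like move_checkpoint('./checkpoints_1486165230/model-2700', './trained_results_1486165230/best_model.ckpt'). Note that the separate checkpoint state file in the source directory is left untouched and may still point at the old path.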

Answer 3 · 2017-02-04T10:51:14.000Z

python3 -c 'import tensorflow as tf; print(tf.__version__)'
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.12.1

Answer 4 · 2017-02-04T18:32:58.000Z

My TensorFlow version is 0.9, so it only produces two training files.

-rw-r--r-- 1 root root 1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root 1543734 Feb 3 23:42 model-2700.meta
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001

The newer version writes three training files per checkpoint, like the ones listed above, instead of two.

Answer 5 · 2017-02-04T18:35:38.000Z

Do you get more files created in the checkpoints directory? I see *.meta, *.index, *.data-* and checkpoint.

Answer 6 · 2017-02-04T18:36:15.000Z

My TensorFlow version is 0.9, so it only produces two training files.

Answer 7 · 2017-02-04T18:38:26.000Z

Found this:
https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md

New checkpoint format becomes the default in tf.train.Saver. Old V1 checkpoints continue to be readable; controlled by the write_version argument, tf.train.Saver now by default writes out in the new V2 format. It significantly reduces the peak memory required and latency incurred during restore.
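
To tell which format a given checkpoint prefix was written in, a small file-system check can help. This is only a sketch: the function name checkpoint_format is hypothetical, and it assumes that a V2 checkpoint consists of a prefix.index file plus one or more prefix.data-* shards (as in the listing earlier in this thread), while a V1 checkpoint has a single data file named exactly as the prefix.

```python
import glob
import os


def checkpoint_format(prefix):
    """Guess the checkpoint format for `prefix` from the files on disk.

    Hypothetical helper; returns "V2", "V1", or None if nothing matches.
    """
    has_index = os.path.isfile(prefix + ".index")
    has_shards = bool(glob.glob(glob.escape(prefix) + ".data-*"))
    if has_index and has_shards:
        return "V2"  # prefix.index + prefix.data-XXXXX-of-YYYYY
    if os.path.isfile(prefix):
        return "V1"  # single file named exactly as the prefix
    return None
```

Under these assumptions, checkpoint_format('./checkpoints_1486165230/model-2700') would report V2 for the directory shown above, which matches the os.rename failure on the bare prefix.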

Answer 8 · 2017-02-04T18:42:58.000Z

Set the write_version argument if you are in a hurry.

I will try to upgrade TensorFlow and make the changes soon.

Thanks for pointing this out.

Answer 9 · 2017-02-04T18:43:41.000Z

Yep, testing it with V1 now.

Answer 10 · 2017-02-04T18:47:48.000Z

Yep, works fine with:

saver = tf.train.Saver(tf.all_variables(), write_version=tf.train.SaverDef.V1)

This is what the warning message looks like:

WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow:now on by default.
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************

Thanks

Answer 11 · 2017-08-30T04:46:47.000Z

Hi guys,
Is there any solution for the above issue? If so, please reply.