tensorflow/models

model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

wpq3142 opened this issue · 26 comments

System information

  • What is the top-level directory of the model you are using: /home/wpq/workspace/models-master/research
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.4.0-rc1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: cuDNN v7.0.3 (Sept 28, 2017), CUDA 9.0
  • GPU model and memory: GTX 650, 2 GB
  • Exact command to reproduce:
    python3 object_detection/train.py
    --clone_on_cpu true
    --logtostderr
    --pipeline_config_path /home/wpq/data/potato/model/rfcn_resnet101_coco.config
    --train_dir /home/wpq/data/potato/model/train

Describe the problem

I downloaded the new checkpoint: faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017.tar.gz

rfcn_resnet101_coco.config :
model {
faster_rcnn {
num_classes: 37
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 600
max_dimension: 1024
}
}
feature_extractor {
type: 'faster_rcnn_inception_resnet_v2'
first_stage_features_stride: 8
}

Source code / logs

2017-11-01 15:11:40.186072: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/wpq/data/potato/data/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
File "/home/wpq/workspace/models-master/research/object_detection/train.py", line 163, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/wpq/workspace/models-master/research/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/wpq/workspace/models-master/research/object_detection/trainer.py", line 254, in train
var_map, train_config.fine_tune_checkpoint))
File "/home/wpq/workspace/models-master/research/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 150, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/wpq/data/potato/data/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

Process finished with exit code 1

The file format is inconsistent. See this post:
http://votec.top/2016/12/24/tensorflow-r12-tf-train-Saver/

Change slim.get_or_create_global_step() to tf.train.get_or_create_global_step().

@wpq3142
This exception is raised here:

ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)

I haven't dug into the implementation of this API, but I suppose it is meant for the new checkpoint format.

jart commented

I'm assuming the model code here would need to be updated to maybe determine which format the checkpoint is written in, and if so, use the correct API? If so, that sounds like a straightforward change and we'd welcome contributions helping to clean up the model.

@wpq3142 Can you tell us how you are configuring this particular entry in the config:
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt".

It should look like
fine_tune_checkpoint: "/home/wpq/data/potato/data/model.ckpt"

Moreover, it looks like you are using rfcn_resnet101_coco.config with a faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017 checkpoint. These two are not compatible. You need to use rfcn_resnet101_coco_11_06_2017.tar.gz with rfcn_resnet101_coco.config.
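
A quick way to catch the two path mistakes discussed here (an unreplaced placeholder, or a path that names a single data file) is to inspect the config text before training. This is only a rough sketch: the helper name fine_tune_checkpoint_problems is hypothetical, it is not part of the Object Detection API, and it reads the config with a regular expression rather than the real protobuf parser.

```python
import re


def fine_tune_checkpoint_problems(config_text):
    """Return a list of human-readable problems with fine_tune_checkpoint."""
    # Assumption: the value is written on one line, in single or double quotes.
    match = re.search(r"""fine_tune_checkpoint:\s*["']([^"']*)["']""", config_text)
    if match is None:
        return ["fine_tune_checkpoint is not set"]
    path = match.group(1)
    problems = []
    if "PATH_TO_BE_CONFIGURED" in path:
        problems.append("placeholder path was not replaced")
    # A name ending in .data-XXXXX-of-XXXXX is one shard, not the checkpoint.
    if re.search(r"\.data-\d+-of-\d+$", path):
        problems.append("path points at a data shard, not the checkpoint prefix")
    return problems


print(fine_tune_checkpoint_problems(
    'fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"'))
# ['placeholder path was not replaced']
```

An empty list means neither of these two mistakes was found; it does not check that the checkpoint matches the model architecture in the config.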

@tombstone

I downloaded the latest model, and it's working now. The configuration is as follows:
--clone_on_cpu true
--logtostderr
--pipeline_config_path /home/wpq/data/potato/model/faster_rcnn_nas_coco.config
--train_dir /home/wpq/data/potato/model/train

One reason seems to be that I was missing a space between keys and values.

You just need to restore the .ckpt prefix, not the .ckpt.meta file.
Something like this 👍
sess = tf.Session()
saver.restore(sess, 'mymodel/model100-500-0.998.ckpt')

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.
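
The rule above can be sketched in plain Python. The helper name checkpoint_prefix is hypothetical (it is not a TensorFlow function), and it assumes the usual V2 file suffixes .index, .meta and .data-XXXXX-of-XXXXX. It only manipulates strings and never opens the files.

```python
import re

# Suffixes that appear after the prefix in V2 checkpoint file names.
# Assumption: shard files look like ".data-00000-of-00001".
_SUFFIX = re.compile(r"\.(?:data-\d+-of-\d+|index|meta)$")


def checkpoint_prefix(path):
    """Return the prefix to pass to a restore call for a checkpoint file path."""
    return _SUFFIX.sub("", path)


print(checkpoint_prefix("model.ckpt.data-00000-of-00001"))  # model.ckpt
print(checkpoint_prefix("model.ckpt-200000.index"))         # model.ckpt-200000
```

A path that is already a prefix is returned unchanged, so the helper is safe to apply to whatever the user typed.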

@pbashivan thank you so much

I fixed the issue like this:
replace model.ckpt with model.ckpt-200000,
where 200000 is your checkpoint number.

Solved on #7696

Hello all, just follow the video below and export your own model within 10 seconds:

https://youtu.be/w0Ebsbz7HYA

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.

This works. In my case, I used the longest common prefix among my checkpoint-related files, which was model.ckpt-1000000. I had the following three files in my folder:

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

I just thought this might be the case for some folks.
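
As a minimal sketch of the "longest common prefix" idea, the standard library's os.path.commonprefix can compute it from the three file names listed above. It compares character by character, so the trailing dot has to be stripped afterwards.

```python
import os

files = [
    "model.ckpt-1000000.data-00000-of-00001",
    "model.ckpt-1000000.index",
    "model.ckpt-1000000.meta",
]

# The shared part still ends with the "." that precedes each extension.
prefix = os.path.commonprefix(files).rstrip(".")
print(prefix)  # model.ckpt-1000000
```

This only gives a sensible answer when the list holds the files of a single checkpoint; with several checkpoints in one folder the common prefix becomes too short.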

I was running into this and this worked for me. All I had to do was run the following on my windows 10 x64 machine and it worked:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path ssd_mobilenet_v1_coco.config --trained_checkpoint_prefix models\model.ckpt-1000 --output_directory tuned_model

Instead of:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path ssd_mobilenet_v1_coco.config --trained_checkpoint_prefix models\model.ckpt-1000.data-###-### --output_directory tuned_model

tl;dr Don't reference a single file in the --trained_checkpoint_prefix flag. Reference the prefix shared by all three files.

Hope it helps.

@phosseini is correct. The model itself is made up of three different files with three different extensions showing what kind of model data each file stores.

For me too, using the longest shared file name prefix solved the issue.

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./model_dir/model.ckpt-1000000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

I am trying to run an open-source project properly. The code saves files as model-10.data-0000-of-0001, .index, and .meta.
The part of the code that saves and restores the files is shown below:

saver = tf.train.Saver(max_to_keep=50)

if self.pretrained_model is not None:
        print("Start training with pretrained Model..")
        saver.restore(sess, self.pretrained_model)



if (e + 1) % self.save_every == 0:
          saver.save(sess, self.model_path + 'model', global_step=e + 1)
          print("model-%s saved." % (e + 1))

One of the solutions in this issue is to change the file name:

model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta

How should I change the code in my situation? How do I change the file name? It looks like the save method determines the file name automatically. Or should I rename the files manually?

/////////////////////////////////////////////////////////////////////////////////////////////

The save call can be changed to:

if (e + 1) % self.save_every == 0:
                    saver.save(sess, self.model_path + 'model.ckpt', global_step=e + 1)
                    print("model-%s saved." % (e + 1))

but that alone is not enough. The restore call also has to use the prefix:

saver.restore(sess, self.model_path + cur_model2)

Here cur_model is 'model.ckpt-50.data-0000-of-0001' (stored alongside the .index and .meta files), and cur_model2 is derived from it:

cur_model2 = cur_model[0:cur_model.find('-') + cur_model[cur_model.find('-'):].find('.')]
saver.restore(sess, self.model_path + cur_model2)

Just pass the file name prefix to restore.

cur_model2 is 'model.ckpt-50'.
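
An alternative to slicing the file name by hand: tf.train.Saver.save normally also writes a small text file named checkpoint into the same directory, and that file already records the prefix. In TensorFlow itself, tf.train.latest_checkpoint(checkpoint_dir) reads it for you. The sketch below does the same lookup with the standard library only. The function name latest_prefix_from_state is hypothetical, and the example contents are an assumption of what the file looks like for the 'model.ckpt' save path used above.

```python
import re


def latest_prefix_from_state(state_text):
    """Return model_checkpoint_path from a 'checkpoint' state file, or None."""
    match = re.search(r'^model_checkpoint_path:\s*"([^"]+)"',
                      state_text, re.MULTILINE)
    return match.group(1) if match else None


# Assumed example contents of the 'checkpoint' file after two saves.
example = (
    'model_checkpoint_path: "model.ckpt-50"\n'
    'all_model_checkpoint_paths: "model.ckpt-40"\n'
    'all_model_checkpoint_paths: "model.ckpt-50"\n'
)
print(latest_prefix_from_state(example))  # model.ckpt-50
```

The returned value is what should be joined with the model directory and handed to saver.restore.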

none of the above worked.
model.ckpt-1000000
model.ckpt-1000000.index
model.ckpt-1000000.meta
solved this problem for me..

Apparently in V2 checkpoints, you should only include the filename up to ".ckpt". For instance if the checkpoint filename is model.ckpt.data-00000-of-00001 then you should only use model.ckpt. Using the full filename leads to getting a DataLossError.

you are a legend

in some models, it could also be caused by lacking a .meta file and / or a .index file.
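
To check for that case, one can list which of the expected files are present for a given prefix. This is a rough sketch with a hypothetical helper name (missing_parts). It takes the file names as a list (for example from os.listdir) so it does not touch the disk itself. Note that the .meta file holds the graph definition, so it usually matters only when the graph is rebuilt from the file.

```python
import re


def missing_parts(prefix, filenames):
    """List which parts of a V2 checkpoint are absent from filenames."""
    names = set(filenames)
    missing = []
    if prefix + ".index" not in names:
        missing.append(".index")
    # At least one data shard, e.g. prefix.data-00000-of-00001, must exist.
    shard = re.compile(re.escape(prefix) + r"\.data-\d+-of-\d+$")
    if not any(shard.match(name) for name in names):
        missing.append(".data-*")
    if prefix + ".meta" not in names:
        missing.append(".meta")
    return missing


print(missing_parts("model.ckpt-1000", ["model.ckpt-1000.meta",
                                        "model.ckpt-1000.data-00000-of-00001"]))
# ['.index']
```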

Hello all,
After training in a TensorFlow session, my files are not named with .ckpt, like
model.ckpt-1000000.data-00000-of-00001
model.ckpt-1000000.index
model.ckpt-1000000.meta
but instead
Pretrained.data-00000-of-00001
Pretrained.index
Pretrained.meta
What should I do to solve the data loss problem above with these saved files?

@Rajput245 I have the same problem. Were you able to fix it?

Hi guys, I don't know if it is still a problem for you, but I had the following files:
model.ckpt-100000.data-00000-of-00001
model.ckpt-100000.index
model.ckpt-100000.meta

The following code worked for me:

import tensorflow.compat.v1 as tf
import tf_slim as slim

checkpoint_path = "absolute_path_to/model.ckpt-100000"  # the checkpoint prefix, not a single file

init_fn = slim.assign_from_checkpoint_fn(
        checkpoint_path, slim.get_model_variables(model_variables))
sess = tf.Session()
init_fn(sess)

I hope this helps you!

In my situation I don't have "ckpt" at all.

I just have the following 2 files:
image

What do I do?

I would maybe try to just add the ckpt after 'variables'.

I just resolved this issue. I saved the model as a .h5 file and that worked.

import tensorflow as tf
from tensorflow.python.training import checkpoint_utils as cp
print(cp.list_variables('path/model_name.ckpt'))
# Use only the model name up to the .ckpt part. Do not add the other numbers (such as .data-00000-of-00001).