tfrecords files are multiple times larger than the raw image data
Hi, I am implementing an OCR system for my language using your code base as a reference. My train images folder is around 18 GB, and the test and val folders are each around 4 GB. I used write_tfrecords.py to generate tfrecords from that data with 4 threads (I have a 4-core CPU). The train images alone exploded into 4 tfrecords files of around 45 GB each. Is this normal, or am I doing something wrong?
@HtutLynn normal:)
@MaybeShewill-CV Alright, thanks. I made some modifications to write_tfrecords.py: instead of writing the raw numpy arrays into the tfrecords, I encoded the images obtained from cv2.imread with cv2.imencode before writing them. This significantly reduced the storage the generated tfrecords take.
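For reference, here is a minimal sketch of that change, assuming a simple bytes-feature helper and the 'images'/'labels' feature names (the actual feature layout and writer setup in write_tfrecords.py may differ):

```python
import cv2
import tensorflow as tf


def _bytes_feature(value):
    """Wrap raw bytes in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def write_encoded_example(writer, image_path, label):
    """Write one example with JPEG-encoded image bytes instead of a raw array."""
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)
    # A raw 32x960x3 uint8 array is ~90 KB per image; the JPEG-encoded
    # buffer is usually several times smaller, which is where the
    # tfrecords size reduction comes from.
    success, encoded = cv2.imencode('.jpg', image)
    if not success:
        raise ValueError('Failed to encode image: {}'.format(image_path))

    example = tf.train.Example(features=tf.train.Features(feature={
        'images': _bytes_feature(encoded.tobytes()),
        'labels': _bytes_feature(label.encode('utf-8')),
    }))
    writer.write(example.SerializeToString())


# Usage (TF 1.x style writer):
#   writer = tf.python_io.TFRecordWriter('train.tfrecords')
#   write_encoded_example(writer, 'sample.jpg', 'hello')
#   writer.close()
```

On the reading side, the input pipeline then has to decode the bytes again (e.g. with tf.image.decode_jpeg or cv2.imdecode) instead of simply reshaping the raw buffer.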
Btw, how long did it take for the model to converge on your system? I have around 5 million training images and plan to train on a GTX 1070.
train_image: 32 x 960
seq_length: 120
sample image:
@HtutLynn Sorry, I have forgotten how long it took :)
@MaybeShewill-CV Ok. Thanks for the great repo.