MaybeShewill-CV/CRNN_Tensorflow

tfrecords files are multiple times larger than the raw image data

Closed this issue · 4 comments

Hi, I am implementing an OCR system for my language using your code base as a reference.
My train images folder is around 18 GB, and the test and val folders are each around 4 GB. I used the write_tfrecords.py script to generate tfrecords from this data with 4 threads (I have a 4-core CPU). The train set exploded into 4 tfrecords files of around 45 GB each. Is this normal, or am I doing something wrong?
[Screenshot: sizes of the generated tfrecords files]

@HtutLynn normal:)
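The blow-up is expected: write_tfrecords.py writes the raw decoded numpy array for each image, while the source image files on disk are compressed. A rough back-of-envelope check, assuming grayscale 32x960 uint8 images (the dimensions mentioned below; the actual channel count may differ):

```python
# Rough size check: decoded pixel buffer vs. a compressed source file.
# ASSUMPTION: grayscale 32x960 uint8 images; adjust `channels` to match
# what write_tfrecords.py actually stores.
height, width, channels = 32, 960, 1
raw_bytes = height * width * channels   # 30720 bytes (~30 KB) per image
jpeg_bytes = 4 * 1024                   # a typical compressed size: a few KB
print('blow-up factor: ~%.1fx' % (raw_bytes / float(jpeg_bytes)))  # ~7.5x
```

A factor in that range is consistent with 18 GB of source images growing to roughly 180 GB of tfrecords.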

@MaybeShewill-CV Alright, thanks. I made some modifications to write_tfrecords.py: instead of writing the raw numpy array into the tfrecords, I encode the numpy data obtained from cv2.imread with cv2.imencode before writing. This significantly reduced the storage the generated tfrecords take (a sketch of the change is below).
[Screenshot: tfrecords file sizes after encoding]
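For reference, here is a minimal sketch of that modification, assuming TF 1.x-style tf.train.Example records. The feature keys ('images', 'labels') and function names are illustrative, not necessarily the ones write_tfrecords.py actually uses:

```python
import cv2
import tensorflow as tf


def _bytes_feature(value):
    # Wrap raw bytes in a TFRecord bytes feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def write_encoded_example(writer, image_path, label):
    # Store a JPEG-encoded buffer instead of the raw decoded array.
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)
    # cv2.imencode returns (success_flag, encoded_byte_buffer).
    ok, encoded = cv2.imencode('.jpg', image)
    if not ok:
        raise ValueError('Failed to encode {}'.format(image_path))
    example = tf.train.Example(features=tf.train.Features(feature={
        'images': _bytes_feature(encoded.tobytes()),
        'labels': _bytes_feature(label.encode('utf-8')),
    }))
    writer.write(example.SerializeToString())


def parse_encoded_example(serialized):
    # Reverse the encoding step when reading the tfrecords back.
    features = tf.io.parse_single_example(serialized, features={
        'images': tf.io.FixedLenFeature([], tf.string),
        'labels': tf.io.FixedLenFeature([], tf.string),
    })
    # tf.image.decode_jpeg undoes the cv2.imencode step at read time.
    image = tf.image.decode_jpeg(features['images'], channels=3)
    return image, features['labels']
```

The trade-off is that each image now has to be decoded when the records are read back, but JPEG decoding is usually cheap next to the disk space and I/O saved.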

Btw, how long did it take for the model to converge on your system? I have around 5 million images and plan to train on a GTX 1070.
train_image : 32 x 960
seq_length : 120
sample image:
[attached sample training image]

@HtutLynn Sorry, I have forgotten how long it took :)

@MaybeShewill-CV Ok. Thanks for the great repo.