AndresPMD/StacMR

Training data

aleks73337 opened this issue · 6 comments

Hello!
I downloaded the flickr30k dataset from Kaggle and searched all over Google, but I can't figure out how to build the dataset for training your model. Could you please provide a fully prepared archive with all the data needed for training?

Hello @aleks73337,

I added the full datasets with the OCR tokens ready for training. They consist of Flickr30k, TextCaps, and CTC. You can download them here:

https://drive.google.com/file/d/1K66sBXZ9XcfDke7pg8DBmA8nBpnp-z9r/view?usp=sharing

Alternatively, you can obtain the features yourself by running your preferred OCR and visual region extractor, similar to the approach of the Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering paper: https://arxiv.org/abs/1707.07998

You can find various implementations of this Faster R-CNN trained on Visual Genome (VG) for this purpose; a rough sketch of what region-feature extraction looks like is shown below.
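Just to illustrate the idea, here is a minimal sketch that uses torchvision's COCO-pretrained Faster R-CNN as a stand-in for the VG-trained bottom-up model from the paper (the image path "image.jpg", the 36-region cap, and the torchvision >= 0.13 `weights` API are assumptions on my side):

```python
# Rough sketch of region-feature extraction with torchvision's COCO-pretrained
# Faster R-CNN. This is a stand-in for the VG-trained bottom-up model used in
# the paper; the file name "image.jpg" and the 36-region cap are assumptions.
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = torchvision.io.read_image("image.jpg").float() / 255.0  # (C, H, W) in [0, 1]

with torch.no_grad():
    # 1) Detect region proposals (boxes are in original-image coordinates).
    boxes = model([img])[0]["boxes"][:36]      # keep up to 36 top-scoring regions

    # 2) Re-run the backbone and pool a feature vector for each box.
    images, _ = model.transform([img])         # resizes and normalizes the image
    scale_x = images.image_sizes[0][1] / img.shape[2]
    scale_y = images.image_sizes[0][0] / img.shape[1]
    boxes_t = boxes * torch.tensor([scale_x, scale_y, scale_x, scale_y])

    feats = model.backbone(images.tensors)                               # FPN feature maps
    pooled = model.roi_heads.box_roi_pool(feats, [boxes_t], images.image_sizes)
    region_feats = model.roi_heads.box_head(pooled)                      # (num_regions, 1024)

print(region_feats.shape)
```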

Thanks for your interest in our work.

@AndresPMD, thank you so much! Please don't close the issue; I'll try to use your data soon and may have some questions. Once again, thank you!

@AndresPMD, hello! Could you please explain how to create my own ocr_feats file? If possible, could you share the script you used for it? Thank you :)

@aleks73337 ,
Sorry, at the moment I don't have the script to extract the scene text from an image. You could use Google OCR or any other free repo on GitHub to do so.

The ocr_feats that I provide were obtained from the Google OCR API and fastText embeddings. However, this repo includes the JSON annotation file for the CTC dataset. With this annotation file, you can map each image file to a specific split (train, test, val), as sketched below. You will then need to run an OCR on top of these images. If you want, you can do the same for Flickr30k as well; for TextCaps it is not necessary, since that dataset already provides the scene text in its annotations.
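A hypothetical sketch of the split mapping; I am not listing the actual schema of the annotation file here, so the key names ("images", "file_name", "split") and the file path are assumptions you should adapt:

```python
# Hypothetical sketch: map image files to their split using the CTC annotation
# JSON. The keys "images", "file_name", and "split" are assumptions -- check
# the actual annotation file and adjust accordingly.
import json
from collections import defaultdict

with open("CTC_annotations.json") as f:   # path is an assumption
    anns = json.load(f)

split_to_images = defaultdict(list)
for entry in anns["images"]:
    split_to_images[entry["split"]].append(entry["file_name"])

print({split: len(files) for split, files in split_to_images.items()})
```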

The ocr_feats file is just the text contained in each image. The text proposals are sorted by OCR confidence, and only the top 20 proposals are kept. Each detected word is then embedded; you can use Word2vec, GloVe, fastText, BERT, etc.

Similarly to how you would handle region-level features, you need to store the features as a npy array of shape (top_proposals, word_embedding_dim, datasplit_size), e.g. (20, 300, 5000) for the CTC_train split. A minimal sketch follows.
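A minimal sketch of assembling such an array, assuming you already have, for each image in the split, a list of (word, confidence) pairs from your OCR of choice; the fastText model file ("cc.en.300.bin"), the file names, and the dummy OCR output below are assumptions:

```python
# Minimal sketch: embed the top-20 OCR tokens per image with fastText and stack
# them into a (top_proposals, word_embedding_dim, datasplit_size) npy array.
# The fastText model path, file names, and dummy inputs are assumptions.
import numpy as np
import fasttext

TOP_K = 20      # top text proposals per image (by OCR confidence)
EMB_DIM = 300   # fastText embedding dimension

ft = fasttext.load_model("cc.en.300.bin")

def embed_ocr_tokens(ocr_results):
    """ocr_results: list of (word, confidence) for one image -> (TOP_K, EMB_DIM)."""
    top_words = [w for w, _ in sorted(ocr_results, key=lambda x: x[1], reverse=True)[:TOP_K]]
    feats = np.zeros((TOP_K, EMB_DIM), dtype=np.float32)  # zero-pad when fewer than TOP_K words
    for i, word in enumerate(top_words):
        feats[i] = ft.get_word_vector(word)
    return feats

# Example inputs (replace with your real OCR output and split ordering):
ocr_per_image = {
    "img_0001.jpg": [("exit", 0.98), ("stop", 0.91)],
    "img_0002.jpg": [("coffee", 0.95)],
}
split_image_ids = ["img_0001.jpg", "img_0002.jpg"]

per_image = [embed_ocr_tokens(ocr_per_image[img_id]) for img_id in split_image_ids]

# Stack to (top_proposals, word_embedding_dim, datasplit_size), e.g. (20, 300, 5000)
ocr_feats = np.stack(per_image, axis=-1)
np.save("CTC_train_ocr_feats.npy", ocr_feats)
```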

@AndresPMD, thank you! I'll try :)

@AndresPMD, are there any prebuilt classes for OCR? How can I map the OCR output to the image annotations? Can I use another feature extractor, for example YOLOv5, without any problems?