roatienza/deep-text-recognition-benchmark

Is the network suited for long-text recognition?

WudiJoey opened this issue · 6 comments

Thanks for your work!
I read your paper and noticed that input images are resized to [224, 224]. In the case of long text lines, does this influence the accuracy?
Looking forward to your reply!

Adding: the width of a text image is often greater than its height. Can the image information be preserved to the greatest extent if the image is resized to a square?
Looking forward to your reply~

Hi. The resized images (224x224) are still human readable. The attention maps on the square images also appear to put proper weight on each character region. Beyond these observations, there is no empirical study of how the resizing affects accuracy. An alternative is to resize to (100, 32) and then pad up to 224x224.
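For reference, a minimal sketch of that alternative (resize the text crop to 100x32, then pad it onto a 224x224 canvas) using PIL. The function name, grayscale conversion, zero fill, and top-left placement are my own assumptions, not the repository's actual preprocessing:

```python
from PIL import Image

def resize_and_pad(img, target=(224, 224), text_size=(100, 32), fill=0):
    """Resize a text crop to a small fixed size, then paste it onto a
    blank square canvas so the network still sees a 224x224 input."""
    img = img.convert("L").resize(text_size, Image.BICUBIC)  # (width, height)
    canvas = Image.new("L", target, color=fill)               # blank square canvas
    canvas.paste(img, (0, 0))                                 # top-left; rest is padding
    return canvas

# usage (hypothetical file name):
# padded = resize_and_pad(Image.open("word_crop.jpg"))
```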

Thanks for your reply~
I will give it a try.

I'm trying to recognize a very long sentence. I resized the image with a fixed aspect ratio to height 32 and padded it to 224x224; for example, the image looks like the screenshot below. @WudiJoey, have you ever tried to train on long, wide images? Does it affect the accuracy even when the image is squeezed like this?
[Screenshot: Screen Shot 2021-11-10 at 12 00 15]

I haven't tried your resize method because I think the large blank area may introduce useless information. I just resize my images directly to a square and that works. But I think there is a better way to process long, wide images, such as cutting the image into pieces and arranging them into rows (see the sketch below).
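A rough sketch of that idea (my own, not from the repository): resize the long line to a fixed row height, slice it into fixed-width chunks, and paste the chunks as successive rows on a square canvas. The helper name, row height, and zero background fill are assumptions:

```python
from PIL import Image

def fold_wide_image(img, target=224, row_height=32):
    """Cut a very wide text line into fixed-width chunks and stack the
    chunks as rows on a square canvas, instead of squeezing it directly."""
    img = img.convert("L")
    w, h = img.size
    new_w = max(1, int(w * row_height / h))          # keep aspect ratio at row_height
    img = img.resize((new_w, row_height), Image.BICUBIC)
    canvas = Image.new("L", (target, target), color=0)
    rows = target // row_height
    for i in range(rows):
        left = i * target
        if left >= new_w:
            break                                     # no more text to place
        chunk = img.crop((left, 0, min(left + target, new_w), row_height))
        canvas.paste(chunk, (0, i * row_height))
    return canvas
```

One caveat: the label is still a single left-to-right string, so the model would have to learn the row-by-row reading order on its own.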

Thank you for the reply! Cutting the image and arranging the pieces into rows seems like a very good approach; I would like to give it a try.

Hmm... however, the input size currently seems to be fixed by the base VisionTransformer. Maybe we should find a way to handle variable-sized images, just as convolutions do. Perhaps the base Vision Transformer could be improved by adopting a more recent vision-transformer-based architecture.
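One common workaround for the fixed input size (a general ViT technique, not something this repo necessarily provides) is to interpolate the learned position embeddings to a new patch grid, so a model trained at one resolution can accept another. A minimal sketch, assuming a ViT with a single class token and learned absolute position embeddings:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bicubically interpolate learned ViT position embeddings so a model
    trained on one patch grid can accept a different input resolution.
    pos_embed: (1, 1 + old_h*old_w, dim), with a leading class token."""
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = grid_tok.shape[-1]
    grid_tok = grid_tok.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    grid_tok = F.interpolate(grid_tok, size=(new_h, new_w),
                             mode="bicubic", align_corners=False)
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([cls_tok, grid_tok], dim=1)
```

With something like this, a 224x224-trained backbone could in principle be fine-tuned on wider inputs (e.g. a 32x448 patch grid), though whether that helps long-text accuracy would still have to be verified experimentally.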