clovaai/deep-text-recognition-benchmark

different accuracy between paper and competition website

bliu3650 opened this issue · 19 comments

Hi Author,
Great work, and thanks for sharing.
I noticed that the accuracy of your best model on IC13 is 93.6% in the paper, but 95.98% on the Robust Reading Competition website.
Could you please explain this difference?
Thanks.

Good question!
I have been waiting for this question.
(So I will not close this issue, for the benefit of people who come from the ICDAR website.)

Three major points account for the accuracy difference between our paper and the ICDAR challenge.

  1. In our paper, we only used the images in MJSynth and SynthText whose labels are alphanumeric.
    For the ICDAR challenge, the training and evaluation datasets are different.
    The evaluation dataset of the ICDAR challenge contains special characters such as '!' and '?',
    but the training dataset in our paper does not contain special characters.
    To cover special characters, we generated more synthetic data and used it as training data.
    We also knew that real data improves accuracy, so we used additional real data for the ICDAR challenge.

  2. We conducted hyper-parameter tuning (e.g., channel size of feature extraction, hidden size of LSTM).
    We used a bigger model for the ICDAR challenge.

  3. We used the Adam optimizer instead of Adadelta.
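
As a rough illustration of point 1, the alphanumeric-only filtering of training labels could be sketched as below. This is a minimal sketch and not the authors' actual code: the helper name `is_alphanumeric_label`, the sample list, and the choice to lowercase labels before checking are all assumptions made for the example.

```python
# Hypothetical sketch: keep only samples whose label is purely alphanumeric.
# The 36-character set and the lowercasing step are assumptions for illustration.
ALPHANUMERIC = set("0123456789abcdefghijklmnopqrstuvwxyz")


def is_alphanumeric_label(label: str) -> bool:
    """Return True if the label is non-empty and every lowercased character is alphanumeric."""
    return bool(label) and all(ch in ALPHANUMERIC for ch in label.lower())


# Made-up (image path, label) pairs, not taken from any real dataset.
samples = [("img_1.png", "Hello"), ("img_2.png", "What?"), ("img_3.png", "2019")]
kept = [(path, label) for path, label in samples if is_alphanumeric_label(label)]
print(kept)
```

A label such as "What?" is dropped by this filter, which is why a model trained this way never sees '!' or '?' and needs extra data to handle the challenge's evaluation set.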

Best,
Baek.

@ku21fan Understood. Thanks for the clarification.
May we know the amount of extra generated/real data, and also the size of the bigger model you used for the ICDAR challenge? Thanks.

@brianliu3650
Yes, we used about 10M extra generated samples and about 200K real samples.

[Bigger model configuration]
channel size of feature extraction: 1024
hidden size of BiLSTM: 1024 (or 512 would be enough)
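
For readers who want to try a similar setup, the configuration above could be expressed as command-line options roughly as follows. This is only a sketch: the flag names (`--output_channel`, `--hidden_size`, `--adam`, `--sensitive`) and the defaults shown are assumed to mirror the repository's train.py and should be checked against your checkout; the parser here is a stand-in, not the real script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser. Flag names and defaults are assumptions, not copied from train.py.
    parser = argparse.ArgumentParser(description="sketch of challenge-style options")
    parser.add_argument("--output_channel", type=int, default=512,
                        help="channel size of the feature extractor")
    parser.add_argument("--hidden_size", type=int, default=256,
                        help="hidden size of the BiLSTM")
    parser.add_argument("--adam", action="store_true",
                        help="use Adam instead of Adadelta")
    parser.add_argument("--sensitive", action="store_true",
                        help="use the case-sensitive character set")
    return parser


# Values taken from the bigger model configuration in this comment.
challenge_args = ["--output_channel", "1024", "--hidden_size", "1024",
                  "--adam", "--sensitive"]
opt = build_parser().parse_args(challenge_args)
print(opt)
```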

P.S. We used a different character set from our paper (--sensitive mode for the ICDAR challenge), so we needed more training data.

opt.character = string.printable[:-6] # same with ASTER setting (use 94 char).

--sensitive mode results in

opt.character = 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
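
To see where the 94 characters come from, the slice can be checked with the standard library alone. `string.printable` ends with six whitespace characters, so dropping them leaves digits, letters, and punctuation. The variable name `charset` is just for this example.

```python
import string

# string.printable = digits + ascii_letters + punctuation + whitespace (6 chars).
# Dropping the last six characters removes the whitespace.
charset = string.printable[:-6]

print(len(charset))
print(charset)
```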

Best

@ku21fan
You have done awesome work on both the text detection and recognition sides, and I even loved the demo you have hosted online.
Can you please share some tips on synthetic data generation, such as the various kinds of noise you try to simulate and your thought process behind this task?
Thanks.

@tjdevWorks Thank you for your interest in our work.
Actually, I only did the recognition side :D
The detection side was done by Youngmin et al.

We referred to MJSynth and SynthText and their code to generate synthetic data.
One basic tip: the default hyperparameter settings of their code may not be good enough for other purposes, so you will need to test various hyperparameters for your own use case.

Best

@ku21fan
Thank you for sharing your awesome work. I checked your demo (https://demo.ocr.clova.ai/) with Japanese words, and it works really well, even with printed and handwritten characters. Out of curiosity, did you also use synthetic data for training/validating the model for detecting Japanese words? Do you have any tips or tricks for that?
Thanks

@hoainamken Thank you for your interest in our work :)
I am not sure whether the text detection part used synthetic data for Japanese.
(That is Youngmin's part, please ask him :D)
For the text recognition part, yes, we used synthetic data for Japanese words.
As in the comment above, we referred to MJSynth and SynthText and their code to generate synthetic data.
Basically, we used their code with our own materials, such as a vocabulary/corpus for Japanese.

Best

@ku21fan
Thank you so much.
I am working on the recognition part for Japanese. My model accuracy is low, about 60% on real-world test images; I apply CRAFT for text detection beforehand. In this case, I wonder whether the low performance comes from:

  1. Not having enough synthetic data, or
  2. The model being too complex while the number of words used for training is small (my toy project has only around 150 words, and I have generated around 7500 synthetic images using only 1 font and 3 backgrounds, with added noise such as Gaussian, median filter, sharpen, and smooth. The images are cropped by the word length).

Should I generate more synthetic data or reduce the complexity of the model instead? May I ask what you would do in this situation?
My model uses the configuration below: --Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
Best

@hoainamken
If I were in your situation, I would try 2 things first.

  1. Generate more words with diverse fonts and backgrounds, since 150 words is too small compared to MJSynth and SynthText.
  2. While generating more data, try our best model, --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn, since it usually has higher accuracy than your current configuration (CRNN).

Hope it helps :)

@ku21fan
Thanks a lot for your help. I tried your suggestions and they improved the accuracy significantly.
I have only one more question. Since I am creating synthetic data to predict only 150 sentences, I generated exactly those 150 sentences as synthetic data, but with different fonts and backgrounds.
The trained model performs well on two separate sentences (as expected), but fails when one image contains two sentences.
For example: when predicting image A (シャウエッセン) and image B (御堂筋事件) separately, the accuracy is 100%, but predicting image C (シャウエッセン 御堂筋事件) fails.
I think the reason may be the RNN model (BiLSTM): the model has never seen "御堂筋事件" following "シャウエッセン" during training.
imageA
predicted: シャウエッセン
actual: シャウエッセン
imageB
predicted: 御堂筋事件
actual: 御堂筋事件
imageC
predicted: けしゴム(消しゴム)冷蔵庫
actual: シャウエッセン 御堂筋事件
My question is about generating synthetic data. For English recognition, each synthetic image contains one word, such as "school", "teacher", or "student". But Japanese words are not separated by white space as in English, so how do you generate synthetic data from a Japanese corpus?
Sorry for continuing to comment on this issue.
Best.

@hoainamken
This repository is run for academic purposes, not for business.
So I'm afraid that we cannot answer all of your questions.

@ku21fan
No worries, thank you so much anyway. Once again, great work!

What is the learning rate for Adam? In addition, I noticed that learning rate decay is not used when training.

You have mentioned that opt.character = 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

However, the output of your method on the competition website seems to contain characters that are not in the list, such as Chinese characters. For example:

https://rrc.cvc.uab.es/?ch=2&com=evaluation&view=method_sample&task=3&m=50493&gtv=10&file=1&eval=1&sample=673

Did you train the model on multi-language datasets, too?

@klarajanouskova Hello,

No, I did not train this model on a multi-language dataset.

The character that you mentioned is '몲'.
First, I used an [UNK] token for unknown characters, which is commented out here
(since we didn't use the [UNK] token in the paper experiments).

Second, I simply replaced the [UNK] token with '몲', because [UNK] would otherwise be counted as 5 characters.
In other words, '몲' is just the result of simple post-processing so that [UNK] counts as 1 character.
Instead of '몲', other characters such as '1', 'a', or 'b' would also have been possible,
but for strict evaluation I wanted a character that is not in opt.character, so I used '몲'.
('몲' is Korean; I shortened '모르겠음' = 'don't know' to '몲'.)
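
The post-processing described here amounts to a single string replacement. A minimal sketch follows; the function name `collapse_unk` and the sample prediction are made up for illustration, and the actual submission code may have looked different.

```python
UNK_TOKEN = "[UNK]"      # literal five-character string emitted for unknown characters
UNK_PLACEHOLDER = "몲"   # single character that is not in opt.character


def collapse_unk(prediction: str) -> str:
    """Replace every [UNK] token with one placeholder character."""
    return prediction.replace(UNK_TOKEN, UNK_PLACEHOLDER)


raw = "caf[UNK]"  # hypothetical model output
print(len(raw), len(collapse_unk(raw)))  # 8 4
```

With the replacement, an unknown character costs one position in a character-level comparison instead of five.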

Best

@ku21fan Thanks a lot for the explanation!

Hi, regarding the earlier comment about an image that contains two sentences (image C above) being recognized incorrectly: can you tell me how you overcame it? I have tried using long sentences in the training data.

Hi, thanks for your awesome work. I'd really like to know where I can find the generation code for MJSynth and SynthText. Could you also share how you modified the code to generate the training data that you used in the ICDAR contest?

Best.

Hello @ku21fan,

I have scanned images of electronic theses and dissertations (ETDs) that contain typewritten text. I used this repository (https://github.com/clovaai/deep-text-recognition-benchmark) to perform OCR. Based on the instructions, it seems to work only on ICDAR and Imdb datasets. Correct me if I am wrong. I tried demo.py on the scanned ETDs, and it returns a single word per ETD with a low confidence score. If I am not using the right URL, could you please provide me with a link to a tool that does general OCR?

I also found this website (https://clova.ai/ocr), which does general OCR. So, is the general OCR not released yet?