intel/handwritten-chinese-ocr-samples

Data Preparation

YusenZhang826 opened this issue · 7 comments

Hello,
Could you please tell me how to run the scripts dgr2png.c and preparation_flow.py? I am having trouble dealing with the dataset.
Thank you!

Before your data preparation, please check the format of the data you downloaded from CASIA-HWDB. Note that dgr2png, which is compiled from dgr2png.c, only works with the DGR format, the previous version of this dataset.

If you have downloaded the latest CASIA-HWDB, which should be in the DGRL format, the same dgr2png will not be compatible. We will fix this issue shortly.
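If you are not sure which format your files are in, one quick check is the ASCII format code at the start of each file. This is a minimal sketch, assuming the CASIA header layout documented for these files (a 4-byte little-endian header size followed by an 8-byte format-code string such as "DGR" or "DGRL"); please verify against the official format description before relying on it:

```python
import struct
import sys

def sniff_casia_format(path):
    """Best-effort check of a CASIA-HWDB file's format code.

    Assumes the documented CASIA header layout: a 4-byte
    little-endian header size followed by an 8-byte ASCII
    format code (e.g. "DGR" or "DGRL", NUL-padded).
    """
    with open(path, "rb") as f:
        header_size = struct.unpack("<i", f.read(4))[0]
        code = f.read(8).split(b"\x00")[0].decode("ascii", "replace")
    return header_size, code

if __name__ == "__main__":
    size, code = sniff_casia_format(sys.argv[1])
    print(f"header size: {size}, format code: {code!r}")
```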

Thanks for your interest.

Hello, I have a question about the synthesized data.

Can preparation_flow.py generate the synthesized data?

Thank you.

preparation_flow.py can help generate the synthesized data, but the whole flow is a little complicated:

Step 1: Run preparation_flow.py without the "synthesize" option (this step needs both the hwdb1x and hwdb2x datasets) -> hwdb1x_img_gt_codes.txt and selected_alpha_symbol_codes.txt
Step 2: Run dgr2png (compiled from dgr2png.c) -> synthesized_data folder (images)
Step 3: Run preparation_flow.py with the "synthesize" option -> synthesized_data labels (see the sketch of the whole flow below)
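For orientation only, here is a minimal Python driver sketching the three steps above. The flag names and argument order are assumptions based on this thread, not the scripts' verified CLI; check the argparse definitions in preparation_flow.py for the actual options:

```python
import subprocess

# Step 1: produce hwdb1x_img_gt_codes.txt and selected_alpha_symbol_codes.txt.
# "--hwdb1x-dir"/"--hwdb2x-dir" are hypothetical option names; check the
# script's actual CLI before running.
subprocess.run(
    ["python", "preparation_flow.py",
     "--hwdb1x-dir", "data/HWDB1.x", "--hwdb2x-dir", "data/HWDB2.x"],
    check=True,
)

# Step 2: render the synthesized images with dgr2png (assumed to be already
# compiled from dgr2png.c); the argument order here is a guess.
subprocess.run(["./dgr2png", "hwdb1x_img_gt_codes.txt", "synthesized_data"],
               check=True)

# Step 3: generate labels for the images in the synthesized_data folder.
subprocess.run(["python", "preparation_flow.py", "--synthesize"], check=True)
```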

Please let us know if you run into further problems.


Thank you for your reply.
I would like to conduct further research building on yours.
I think the quality of the synthetic data has a crucial impact on accuracy, so for the convenience of others following this work, could you open-source your synthetic data via Baidu Cloud or Google Drive?

I'm afraid we cannot release the dataset in DGR format, as it is protected by CASIA for research use only. If possible, you can try contacting Professor Liu to download the previous version of the HWDB dataset.

To all who might want to reproduce this paper's work,
Sorry about the trouble. The current version of the dataset, in DGRL format, only includes line-level labels, so the data augmentation method mentioned in our paper will not help. A more practical way may be to synthesize new text-line images from isolated characters with a natural and robust strategy.
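For anyone trying that, here is a minimal sketch that reads a few isolated characters from an HWDB1.x .gnt file and pastes them side by side into a synthetic text line. It assumes the documented CASIA GNT record layout (4-byte sample size, 2-byte GB tag code, 2-byte width, 2-byte height, then the grayscale bitmap) and numpy; the paths are hypothetical and this is illustrative, not the repo's actual pipeline:

```python
import struct
import numpy as np

def read_gnt_samples(path, limit=10):
    """Yield (tag, image) pairs from a CASIA-HWDB1.x .gnt file.

    Assumes the documented record layout: uint32 sample size,
    2-byte GB tag code, uint16 width, uint16 height, then
    width*height grayscale bytes (255 = background).
    """
    with open(path, "rb") as f:
        for _ in range(limit):
            head = f.read(10)
            if len(head) < 10:
                break
            _, tag, width, height = struct.unpack("<I2sHH", head)
            bitmap = np.frombuffer(f.read(width * height), dtype=np.uint8)
            yield tag, bitmap.reshape(height, width)

def synthesize_line(images, gap=8, background=255):
    """Paste character images left to right into one line image."""
    line_h = max(img.shape[0] for img in images)
    line_w = sum(img.shape[1] for img in images) + gap * (len(images) - 1)
    line = np.full((line_h, line_w), background, dtype=np.uint8)
    x = 0
    for img in images:
        y = (line_h - img.shape[0]) // 2  # naive vertical centering
        line[y:y + img.shape[0], x:x + img.shape[1]] = img
        x += img.shape[1] + gap
    return line

# Example (hypothetical path):
# chars = [img for _, img in read_gnt_samples("data/HWDB1.1trn/001.gnt", limit=5)]
# line_img = synthesize_line(chars)
```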

Thanks.

Hello, I downloaded the HWDB data, but I encountered some problems while parsing it and would like to ask for advice.

First, when parsing, there are 0xFF characters. I found that these mark an abnormal type (incomplete or crossed-out characters), and there is a lot of this kind of data in the training set. How do you deal with this kind of data? And how do you deal with it when calculating CER?

In addition, when testing on ICDAR2013, is your training data only the data under the train folder of HWDB2.x, or the combined train and test data?

First, when parsing, there are 0xFF characters. I found that these mark an abnormal type (incomplete or crossed-out characters), and there is a lot of this kind of data in the training set. How do you deal with this kind of data? And how do you deal with it when calculating CER?

[A]: Yes, those "FFFF" characters are not meaningful. You can check line 73 in the function generate_text_img_gt in preparation_flow.py: we simply skip them. This character is not included in our experimental dictionary either.
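To illustrate both points, here is a minimal sketch of the skip logic and of a standard edit-distance CER. The names and the 0xFFFF constant are assumptions for illustration, not the repo's actual code:

```python
def filter_labels(codes, skip_code=0xFFFF):
    """Drop placeholder codes (incomplete/crossed-out characters)
    before writing ground truth. skip_code is an assumption; match
    it to what your parser actually reads from the label files."""
    return [c for c in codes if c != skip_code]

def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,      # deletion of reference[i-1]
                dp[j - 1] + 1,  # insertion of hypothesis[j-1]
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev = cur
    return dp[n] / max(m, 1)
```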

In addition, when testing on ICDAR2013, is your training data only the data under the train folder of HWDB2.x, or the combined train and test data?

[A]: Aiming for a fair comparison, we did not include the test set of HWDB2.x for training in our paper's work. But you can definitely add it for further training to get even better accuracy (roughly a 0.5%+ boost).