GXYM/DRRG

Could you please provide the detailed training steps for getting the final results on CTW1500?

bt2567 opened this issue · 1 comments

Hi, good day,

I have reproduced your promising ablation-study results for CTW1500 (baseline+GCN, r: 81.45, p: 83.75, f: 82.58) by following the instructions in the paper (pre-trained on SynthText for 2 epochs and fine-tuned on CTW1500).

However, it seems hard to reach your final results for CTW1500 (r: 83.02, p: 85.93, f: 84.45). I am a little confused, because you mention that all experiments are performed with a single image resolution, yet the paper also says: "In fine-tuning, for multi-scale training, we randomly crop the text region, and resize them to 640×640 (batch is 8), 800×800 (batch is 4), and 960 × 960 (batch is 4), respectively." Does this mean that in the fine-tuning stage you manually resized the images to 800×800 and 960×960 and trained at each resolution for several epochs?

Could you please provide the detailed training steps showing how you improved the results from the ablation study to the final results? I guess the gap may be caused by the lack of multi-scale training.

Also, there seems to be a big improvement in your final results for CTW1500 and Total-Text compared with the ablation-study results, but for TD500 the results are the same in both tables. Could you please explain? Many thanks.

GXYM commented

It is worth noting that the final results use MLT2017 for pre-training, which is mentioned in the paper, so there is an improvement in the final results. The TD500 results in both tables use MLT2017 for pre-training, which is why they are the same.

As for the sentence "In fine-tuning, for multi-scale training, we randomly crop the text region, and resize them to 640×640 (batch is 8), 800×800 (batch is 4), and 960 × 960 (batch is 4), respectively.", it means that in fine-tuning we first train the model on 640×640 images and then use 800×800 images to further fine-tune the model.
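For readers trying to reproduce this, the staged fine-tuning described above could be organized roughly as in the sketch below. This is only an illustration, not the repository's training code. The sizes and batch sizes come from the quoted paper sentence, and the 960×960 stage is taken from that sentence because the reply itself only mentions 640×640 followed by 800×800. The names `STAGES`, `random_crop_and_resize`, and `iterate_stages` are hypothetical, the number of epochs per stage is not stated in this thread, and the crop here is a plain random window rather than a text-aware crop.

```python
import random

import numpy as np

# (input size, batch size) for each fine-tuning stage, as quoted from the paper.
# The epochs per stage are NOT given in this thread, so they stay a parameter.
STAGES = [(640, 8), (800, 4), (960, 4)]


def random_crop_and_resize(image, size, rng):
    """Take a random window of the image and resize it to size x size.

    Assumption: a plain random window with nearest-neighbour resizing.
    A real pipeline would crop around text regions and would apply the
    same transform to the polygon annotations.
    """
    h, w = image.shape[:2]
    crop_h = rng.randint(h // 2, h)  # randint is inclusive on both ends
    crop_w = rng.randint(w // 2, w)
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    crop = image[top:top + crop_h, left:left + crop_w]
    # Nearest-neighbour resize via integer index maps.
    rows = np.arange(size) * crop_h // size
    cols = np.arange(size) * crop_w // size
    return crop[rows][:, cols]


def iterate_stages(images, epochs_per_stage, seed=0):
    """Yield (size, batch) pairs, one stage after another."""
    rng = random.Random(seed)
    for size, batch_size in STAGES:
        for _ in range(epochs_per_stage):
            order = list(range(len(images)))
            rng.shuffle(order)
            for start in range(0, len(order), batch_size):
                indices = order[start:start + batch_size]
                batch = np.stack(
                    [random_crop_and_resize(images[i], size, rng) for i in indices]
                )
                # A training step on `batch` would go here; the model keeps
                # its weights when moving on to the next resolution.
                yield size, batch
```

The point of the sketch is that the resolutions are visited one after another with the same model, rather than mixed within a single batch, which is how the reply describes the procedure.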