Hi, I ran into some problems during training. Could you give me some pointers? Thanks a lot!
Opened this issue · 14 comments
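@zhang0jhon
Hello, first of all thank you very much for your work. The model you open-sourced performs really well.
I tried to reproduce the training pipeline but ran into the following 3 problems:
1) Training is very slow. Using only the LSVT data, one epoch takes about 6 hours.
2) I tried multi-GPU training, but it is about as fast as a single GPU. I am using 2080 Ti cards.
3) I tested the model after 30 epochs and the recognition accuracy is very poor.
I would like to ask:
1) How many epochs are appropriate for training, and what initial lr and batch size should be used?
2) Is training also this slow for you with multiple GPUs? Is there any way to speed it up?
3) How should the weakly annotated data in LSVT be used? It has no text-region coordinates, so how is the mask processing done?
Thanks a lot!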
Did you select your GPU in the config file? It should take around 10-15 minutes per epoch. @zjz5250
@etatbak Thank you!
Yes, I use the GPU. I set gpus = [0] in config.py.
How many steps are there in one epoch when you train the model?
I found that reading the images takes a lot of time in every step.
I set the batch size to 16, and reading 16 images takes about 2 seconds.
@etatbak Did you change the value of steps_per_epoch? The default is 1500, but it should actually be a large number. For example, if the training set has 16000 samples and the batch size is 16, then steps_per_epoch should be 1000. Am I right?
I use the LSVT dataset, which has 238790 samples in total. I set the batch size to 16, so steps_per_epoch is 14924.
When I train the model, one epoch takes about 6 hours. What is worse, after 11 epochs the model does not work at all.
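The arithmetic in this comment can be checked with a short sketch. The helper name `steps_per_epoch` and its `drop_last` flag are made up for illustration and are not taken from the repository's config.py; the numbers are the ones quoted above.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, drop_last: bool = True) -> int:
    """Number of steps needed to pass over the dataset once.

    drop_last=True discards the final incomplete batch (floor division);
    drop_last=False keeps it (ceiling division).
    """
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# Example from the comment: 16000 samples, batch size 16.
small_steps = steps_per_epoch(16000, 16)

# LSVT subset from the comment: 238790 samples, batch size 16.
lsvt_steps = steps_per_epoch(238790, 16)
lsvt_steps_keep_last = steps_per_epoch(238790, 16, drop_last=False)

print(small_steps, lsvt_steps, lsvt_steps_keep_last)
```

With floor division the LSVT figure matches the 14924 quoted above; keeping the last partial batch would add one more step.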
@zjz5250 I didn't change many parameters. But I only used the ReCTS dataset, so I think it would also take longer if I used LSVT.
I think steps_per_epoch is 500. My batch_size is 10.
I trained for 1000 epochs but it doesn't work well, not even at an average level. I am not sure how to improve the performance.
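To compare the two setups in this thread on equal terms, it may help to convert epoch time into time per step. This is only a back-of-the-envelope sketch: the helper `seconds_per_step` is hypothetical, the 500-step figure was given with some uncertainty, and the batch sizes differ (10 versus 16).

```python
def seconds_per_step(epoch_seconds: float, steps: int) -> float:
    """Average wall-clock time of one training step."""
    return epoch_seconds / steps


# LSVT run: about 6 hours per epoch at 14924 steps (batch size 16).
lsvt = seconds_per_step(6 * 3600, 14924)

# ReCTS run: 10 to 15 minutes per epoch at roughly 500 steps (batch size 10).
rects_fast = seconds_per_step(10 * 60, 500)
rects_slow = seconds_per_step(15 * 60, 500)

print(round(lsvt, 2), rects_fast, rects_slow)
```

Under these assumptions the LSVT run lands at roughly 1.45 seconds per step, inside the 1.2 to 1.8 second range of the ReCTS run, which suggests the long epochs come mostly from the larger number of steps rather than from a slower step.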
@etatbak
Did you use a pb file converted from your newly trained model when you tested the accuracy?
"You must feed a value for placeholder tensor 'label' with dtype int32 and shape [?,33]"
Did you run into this problem, and how did you fix it?
@zjz5250 @etatbak @zhang0jhon Hi, I used all of the ReCTS, ArT, LSVT and IC2017MLT data and trained for 5 epochs on a single GPU (which takes a day). I got a training loss of around 2 but a very high validation loss. Do you have any idea what causes this?
@zhang0jhon Could you please share what training and validation loss you got with the final model? Thanks!
@zjz5250 Hello, I get an error during training because icdar_datasets.npy is missing. Could you send this file to my email, zhou19920226@126.com? I would be very grateful.
@zhang0jhon Hello, thank you for sharing the code. I failed to train the model. Could you send icdar_datasets.npy to my email, zhou19920226@126.com? Thank you very much.
@ustczhouyu Hi, you will need to run dataset.py first to generate the npy file
I got a validation loss of around 1.3. The model can recognize some parts of the text, but the overall accuracy is relatively poor.
I checked that the pretrained recognition model has a loss of around 0.5, so that should be the goal.
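I think the way the data is read should be changed. As far as I can see, the author's data pipeline loads the whole image, which is too slow. I plan to change it to load the cropped images instead.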
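The cropping idea could look roughly like the sketch below, which computes the axis-aligned bounding box of an annotated text polygon so that only that region needs to be stored and loaded. The function names and the polygon format (a list of (x, y) points) are assumptions for illustration and not the repository's code; a nested list stands in for the image so the example has no dependencies.

```python
def polygon_bbox(polygon, width, height):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a polygon, clamped to the image.

    x1 and y1 are exclusive, so the box can be used directly for slicing.
    """
    xs = [int(x) for x, _ in polygon]
    ys = [int(y) for _, y in polygon]
    x0, y0 = max(min(xs), 0), max(min(ys), 0)
    x1, y1 = min(max(xs) + 1, width), min(max(ys) + 1, height)
    return x0, y0, x1, y1


def crop(image, box):
    """Crop a row-major nested-list image to the given box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# A dummy 100 x 200 (height x width) image and one hypothetical text polygon.
height, width = 100, 200
image = [[0] * width for _ in range(height)]
polygon = [(20, 10), (80, 12), (78, 40), (18, 38)]

box = polygon_bbox(polygon, width, height)
patch = crop(image, box)
print(box, len(patch), len(patch[0]))
```

With an array library the same box would be applied as `image[y0:y1, x0:x1]`. Saving such patches once ahead of training means each step decodes a small crop rather than a full scene image.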
Hello, I have two questions about using the code:
1. When running test.py with the model text_recognition_5435.pb from the author's docker image, the call _ = tf.import_graph_def(graph_def, name='') fails with: InvalidArgumentError (see above for traceback): The second input must be a scalar, but it has shape [1,33]
2. Running train.py fails with:
File "/usr/local/lib/python3.5/dist-packages/tensorpack/train/config.py", line 119, in __init__
assert_type(model, ModelDescBase, 'model')
File "/usr/local/lib/python3.5/dist-packages/tensorpack/train/config.py", line 107, in assert_type
name, tp.__name__, v.__class__.__name__)
AssertionError: model has to be type 'ModelDescBase', but an object of type 'AttentionOCR' found.
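The traceback suggests that the failing check is a plain isinstance test on the model object. The sketch below reproduces that kind of check with stand-in classes. None of these classes are the real tensorpack or repository classes, and the body of `assert_type` is reconstructed from the error message only. It does not diagnose the cause here; it only shows that such a check passes when the model class inherits from the expected base class.

```python
class ModelDescBase:
    """Stand-in for the expected base class (not the tensorpack class)."""


class GoodModel(ModelDescBase):
    """Stand-in model that inherits from the expected base class."""


class AttentionOCR:
    """Stand-in model that does NOT inherit from the expected base class."""


def assert_type(v, tp, name):
    # Reconstructed from the error message in the traceback above.
    assert isinstance(v, tp), \
        "{} has to be type '{}', but an object of type '{}' found.".format(
            name, tp.__name__, v.__class__.__name__)


# Passes silently: GoodModel is a subclass of ModelDescBase.
assert_type(GoodModel(), ModelDescBase, 'model')

# Fails with the same wording as the reported error.
message = None
try:
    assert_type(AttentionOCR(), ModelDescBase, 'model')
except AssertionError as err:
    message = str(err)
print(message)
```

One thing worth checking in a real setup is whether the base class the model inherits from is the same class object that the installed tensorpack version checks against, for example by printing `type(model).__mro__`.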