Training
rustagiadi95 opened this issue · 9 comments
Can you tell me the exact steps to train the model?
With all the datasets, how far it should be trained, the learning rates, and so on... please help me out, brother.
Which dataset do you want to run experiments on? What have you tried so far?
https://drive.google.com/open?id=16PwjdAR7UWrovgHNumuLE_9u6Q7uyVj9
https://drive.google.com/open?id=1gkORLYpovnIQ2FNSD6YfPhBYzmhqsMID
These are two versions of the net you created in the STN-OCR paper. They are practically the same; you can open either of them.
I am working on all of the datasets: both the 32x32 dataset without bounding boxes (labels only) and the variable-size dataset with multiple bounding boxes. I have successfully extracted the data from the second one as well. Next I want to work on the FSNS dataset that you mentioned.
I tried to train the net on the 32x32 SVHN dataset and the training losses are not good. I understand it is the first dataset this net has encountered; I have used only 20,000 images of the dataset and 5 epochs. The learning rate range (0.00001 - 0.0000005) and the optimizer (SGD) you asked me to work with have not given good results so far. I am really curious: if I trained it on the full training set (~73K images) of this dataset, would it improve? And if I do, how many epochs should I use?
It requires a lot of computing power, which is why I am very cautious about this.
Secondly, what should I do to make it almost completely accurate?
I know these are a lot of questions, but I think your research is really commendable and deserves appreciation. Please help out.
Hmm,
looking at your code I can only say the following:

- try to use a lower learning rate like `0.0001` or even `0.00001`
- increase your batch size! Such a small batch size will never work, because the network uses BatchNorm. A batch size of `32` should work quite nicely
- try to use `Adam` instead of `SGD`. Adam converges more quickly.
- try to create a tool similar to the `BBoxPlotter` that I created (you can find it in the `insights` folder). This tool lets you observe the progress of the training. It does so by using the network to do a prediction on a given image for each iteration of the training. This image is then saved to the hard disk, so you can inspect the state of the network at a given time step. With such a tool you can very quickly determine whether the network diverges or not. This is something you cannot directly see from the loss values, so I highly recommend doing it! A rough sketch of such a setup is shown below.
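Here is a minimal sketch that puts the suggestions above together: Adam with a learning rate of 0.0001, a batch size of 32, and a `BBoxPlotter`-style snapshot of one fixed probe image during training. It is written in PyTorch purely for illustration; the tiny stand-in model, the SVHN loading, and the `insights/` output paths are assumptions for the sketch, not the repository's actual training code.

```python
import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.utils import save_image

os.makedirs("insights", exist_ok=True)

# SVHN 32x32, labels only (no bounding boxes), as in the experiment above.
train_set = datasets.SVHN(root="data", split="train", download=True,
                          transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=32, shuffle=True)  # batch size 32, not smaller

# Tiny stand-in recognizer (with BatchNorm) just so the sketch runs end to end;
# the real STN-OCR network is of course much larger.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam instead of SGD, lr 0.0001
criterion = nn.CrossEntropyLoss()

probe_image, _ = train_set[0]  # one fixed image to watch during training

for epoch in range(5):
    for step, (images, labels) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

        if step % 100 == 0:
            # BBoxPlotter-like monitoring: predict on the same image over and
            # over and dump the result to disk. The real tool also draws the
            # predicted boxes; here we only record the prediction in the file
            # name, which is already enough to spot divergence early.
            model.eval()
            with torch.no_grad():
                pred = model(probe_image.unsqueeze(0)).argmax(dim=1).item()
            model.train()
            save_image(probe_image,
                       f"insights/epoch{epoch:02d}_step{step:05d}_pred{pred}.png")
```

Watching how the prediction on the probe image evolves from iteration to iteration is usually enough to catch a diverging run long before the loss curve makes it obvious.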
Yes I can have a look at some sample data, but you'll need to attach them 😉
Sorry about that... I mailed you the data at that time. I was wondering: can we train the recognition part of the net individually, without the localization net?
Oh, you sent me a mail with the data? I think I did not receive such a mail...
Could you send it again?
Of course you can train the recognition part without the localization part, but then your model will not be different from other recognition models. Or do I get you wrong?
You got me right.
Regarding the data, there is no need to bother you with all the hassle of going through it. I understand that my model will not be different from any other model, but in my situation I am already getting the localized images, not at the character level, but at the word level within the whole image.
But I still think I would need the localization part if I want to get the individual characters within a localized word.
Anyway, I have some questions that I think I know the answer to, but I want to hear your answers to them...
Q1) How will the LSTM network in the localization net be able to distinguish whether it has already detected the same character/word in previous timesteps? This seems important, because one has to choose the number of timesteps one thinks will be needed for an image.
Q2) Will the WHOLE model work on the char74k dataset?
Okay, let me try to answer your questions:
- You cannot be entirely sure that the LSTM is able to distinguish that it already detected a character in a previous timestep, because there is no inhibition-of-return mechanism. We do know, however, that the LSTM is trained under a very harsh constraint: the loss for the whole network is the recognition loss. In the case of locating and recognizing single characters from an already cropped text line, we explicitly tell the network to use a different character at each timestep if we train the system with SoftmaxCrossEntropy; if we use CTC loss this constraint is not as harsh, the network only learns that it should span its localizations over the text region (the sketch after this list illustrates the difference between the two losses). So the number of timesteps is actually a hyperparameter. It depends on the language you are dealing with. That's all I can tell you right now...
- The whole model should work on the char74K dataset. If you use one timestep for the localization network and only predict one character, it should be able to zoom into a single character and maybe increase recognition accuracy, but I'm not sure it will make a huge difference.
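To make the difference between the two training losses concrete, here is a small PyTorch sketch; the timestep count, batch size, class count, and random targets are illustrative assumptions, not the model's real configuration. With per-timestep SoftmaxCrossEntropy every timestep is pinned to a specific ground-truth character, while CTC only sees the label sequence and marginalizes over alignments, so the localizations are far less constrained.

```python
import torch
import torch.nn as nn

# Illustrative shapes: T localization timesteps, batch size B, C character classes.
T, B, C = 6, 32, 37
logits = torch.randn(T, B, C)  # one class distribution per timestep

# (a) Per-timestep SoftmaxCrossEntropy: every timestep gets its own target
#     character, so the network is explicitly told what each timestep must find.
targets_per_step = torch.randint(0, C, (T, B))
ce = nn.CrossEntropyLoss()
loss_ce = sum(ce(logits[t], targets_per_step[t]) for t in range(T)) / T

# (b) CTC: only the label sequence and its length are given; the alignment
#     between timesteps and characters is marginalized out, so the constraint
#     on where each timestep looks is much weaker.
log_probs = logits.log_softmax(dim=-1)            # (T, B, C), class 0 = blank
target_lengths = torch.randint(1, T // 2 + 1, (B,))  # label length per sample
targets_flat = torch.randint(1, C, (int(target_lengths.sum().item()),))
input_lengths = torch.full((B,), T, dtype=torch.long)
ctc = nn.CTCLoss(blank=0)
loss_ctc = ctc(log_probs, targets_flat, input_lengths, target_lengths)

print(loss_ce.item(), loss_ctc.item())
```

Under (a) the number of timesteps has to match the per-character labeling exactly, which is why it effectively becomes a hyperparameter you pick for your language and data; under (b) the network only needs enough timesteps to span the text region.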