tobiasvanderwerff/full-page-handwriting-recognition

Can't get a decent output

EmreVurgun opened this issue · 10 comments

Hello. I have trained the model on forms for 6 epochs with batch size 4, and also tried around 50 epochs at line level, which shows good WER and CER in training. But when I test with my own handwriting I get rubbish output. Why is that? Is there a language model that tries to "correct" my output?

This is the image I am using:

[image: eng_line]

Sorry to hear you are not getting good output. It's possible that your handwriting is sufficiently different from the training data such that it is not recognized well by the model. You are right that there is an implicit language model, which could be biased towards continuations as they occur in the training data.

If you really want the model to work better on your own handwriting, you could try out the following: annotate a handful of images of your handwriting, and include these images in the training data, or finetune a model on these images. This should make the model work better for your particular style of writing. Hope that helps.

Thank you so much for your response. Is there a way to disable the language model? It would suit my needs better that way anyway. Also, how could I go about annotating images of my handwriting and using them with this project? And can I increase the vocabulary size somehow to train this implementation for another language? I'm sorry if I'm asking too many questions, this is the first time I'm doing something like this.

For example, I wasn't able to find where the labels come from: the dataset I downloaded doesn't seem to have any text annotations, but training works fine. Are they downloaded temporarily from somewhere on each training run?

Let me try to answer your questions :)

It is not possible to disable the language model, because it is part of the decoder model, which transforms the image into text output.

Increasing the vocabulary size should be possible without much trouble. As far as I remember, the vocabulary is determined purely by the unique set of characters that occur in the training data.
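
To illustrate the idea, here is a minimal sketch of a character-level vocabulary built from the training transcriptions. It is not the code from this repository: the function name `build_char_vocab` and the special tokens are assumptions made up for the example.

```python
def build_char_vocab(transcriptions, special_tokens=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Map special tokens and every unique character in the data to an integer id.

    The special tokens here are hypothetical placeholders, not necessarily
    the ones this repository uses.
    """
    # Collect the unique characters across all transcriptions, in a stable order.
    chars = sorted({ch for text in transcriptions for ch in text})
    tokens = list(special_tokens) + chars
    return {tok: idx for idx, tok in enumerate(tokens)}


english = ["a move to stop", "Mr. Gaitskell"]
turkish = ["çığ düştü", "gözlük"]

vocab_en = build_char_vocab(english)
vocab_tr = build_char_vocab(english + turkish)

# Adding Turkish text grows the vocabulary without any other change.
print(len(vocab_en), len(vocab_tr))
print("ğ" in vocab_en, "ğ" in vocab_tr)
```

With a scheme like this, characters such as "ğ" or "ş" enter the vocabulary as soon as they appear in the training transcriptions.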

Regarding where the text training data is stored: it should be part of the IAM dataset. If it wasn't there, you wouldn't be able to train the model. If I recall correctly, the text annotations are stored in XML files along with the images (perhaps in a different folder). If you want to annotate your own training data, I suggest you have a look at how the data is stored in the XML files, and try to duplicate that format. If you really want to go into detail, you could look at the dataset class in the code to see how the XML file is processed exactly.
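
As a rough starting point, the sketch below reads line-level transcriptions from an XML file. The element and attribute names (`line`, `id`, `text`) and the sample content are assumptions based on how I remember the IAM files being laid out, so check them against a real file from your data folder before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the IAM layout; ids and text are made up.
SAMPLE_XML = """<form id="sample-001">
  <handwritten-part>
    <line id="sample-001-00" text="Merhaba dünya">
      <word id="sample-001-00-00" text="Merhaba"/>
      <word id="sample-001-00-01" text="dünya"/>
    </line>
    <line id="sample-001-01" text="ikinci satır">
      <word id="sample-001-01-00" text="ikinci"/>
      <word id="sample-001-01-01" text="satır"/>
    </line>
  </handwritten-part>
</form>"""


def read_line_transcriptions(xml_text):
    """Return (line id, transcription) pairs for every <line> element."""
    root = ET.fromstring(xml_text)
    return [(line.get("id"), line.get("text")) for line in root.iter("line")]


for line_id, text in read_line_transcriptions(SAMPLE_XML):
    print(line_id, text)
```

If your own annotations follow the same structure, the existing dataset class should need fewer changes.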

Thank you for the answers, they've been really helpful. I tested with the IAM dataset's test set and it seems fine for the most part. I get around the same accuracy as you show in this repo, so it does seem that my style of writing is different enough to cause the issue.

I don't know why I didn't check the data folder. The text annotations are indeed there in XML files, so I'll use them as you suggested.

So if it's not possible to disable the language model, does that mean this project can't be used for any language other than English? Or can I tweak the code somehow to make it work on a similar language?

The language model is not tied to a specific language and you should be able to train on other languages. This would require you to find a sufficiently large dataset containing examples in that language. If you are able to find such a dataset, you will probably also have to tweak the code where the dataset is loaded, because your dataset may not have the same structure as IAM.

Great! That's exactly what I was looking for. I am creating a new dataset for the Turkish language with my university professor. If you want, I can share the results with you when we are finished.

One last question, if you don't mind. Where exactly is the language model in the decoder structure? I'd like to study it if possible.

That's very interesting! I hope the project goes well building the Turkish dataset.

You asked where the language model is located in the decoder. There is no straightforward answer to this question. The job of the decoder is to predict the next token in a sequence based on the tokens it has produced previously, also looking at the visual features produced by the encoder. This is why the language model is "implicit" (I believe they also say this in the paper). But in the end, the decoder is just a big Transformer model, which makes it hard to interpret.
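
To make "implicit" a bit more concrete, here is a toy sketch of greedy autoregressive decoding. It is not the model from this repository: `toy_scores` is a hypothetical stand-in for a Transformer decoder step, and the "visual features" are just hand-written per-position character scores.

```python
def greedy_decode(next_token_scores, visual_features, sos="<sos>", eos="<eos>", max_len=20):
    """Repeatedly pick the highest-scoring next token until <eos> or max_len."""
    tokens = [sos]
    for _ in range(max_len):
        # Each step sees both the image features and everything emitted so far.
        scores = next_token_scores(visual_features, tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return "".join(tokens[1:])


def toy_scores(visual_features, prev_tokens):
    """Hypothetical decoder step: visual evidence plus a learned text prior."""
    pos = len(prev_tokens) - 1
    if pos >= len(visual_features):
        return {"<eos>": 1.0}
    scores = dict(visual_features[pos])  # what the image suggests at this position
    # Prior picked up from training text: "u" is very likely after "q".
    if prev_tokens[-1] == "q":
        scores["u"] = scores.get("u", 0.0) + 0.5
    return scores


# Made-up features: on its own, the second position looks more like "v" than "u".
features = [{"q": 0.9, "g": 0.1}, {"v": 0.6, "u": 0.4}, {"i": 0.8, "l": 0.2}]
print(greedy_decode(toy_scores, features))  # prints "qui"
```

In the toy, the visual evidence alone would give "qvi", but conditioning on the previous token flips it to "qui". In the real decoder this kind of bias is spread across the Transformer weights rather than sitting in one place you could switch off.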

I see. That's what I assumed, so I'll look at the whole structure and try to understand it. Thank you for all the help.