Help with OCR, please
Hey Emil,
thanks a lot for sharing the code. This is extremely helpful!
I know this question has been asked before, but I thought I'd ask it one more time since you might have spent more time on it by now.
Were you able to make a model that does both OCR and markup? My main challenge lies in the fact that your approach relies on a vocabulary, so it seems to require a priori knowledge of the vocabulary.
On my quest to try and make it generalisable, I tried the following:
- only use the printable characters and the HTML tags
- changed `fit` to the `fit_generator` method of `Model` (otherwise it wouldn't fit in RAM); I only use an NVIDIA P100 and a training batch size of 2 or 3, also for RAM reasons
- set `grayscale=True` for `load_image` (for RAM reasons; my text is just black over a white background)
- increased the max length of each sequence (as it's now characters, not words)
- I didn't change the rest of the code
- for now, I'm training with ~350 images, all generated automatically from HTML snippets. The snippets only contain a few tags (`<p>`, `<h1>`, ...). I also don't pass the `<body>` tag to the training, as it's the same for all images.
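The `fit_generator` change can be sketched roughly like this. This is a hypothetical sketch: `load_image` and `encode_sequence` here are dummy stand-ins for the repo's actual helpers, and the image size and sequence length are made up.

```python
import numpy as np

# Stand-ins for the repo's helpers (hypothetical signatures):
def load_image(path, grayscale=True):
    # The real helper reads and resizes the screenshot;
    # this one just returns a dummy grayscale array.
    return np.zeros((64, 64, 1), dtype=np.float32)

def encode_sequence(markup, max_len=16):
    # Map characters to integer ids, padded with zeros to max_len.
    ids = [ord(c) % 100 for c in markup][:max_len]
    return np.array(ids + [0] * (max_len - len(ids)))

def data_generator(samples, batch_size=2):
    """Yield small batches forever so the whole dataset never sits in RAM."""
    while True:
        for i in range(0, len(samples) - batch_size + 1, batch_size):
            batch = samples[i:i + batch_size]
            images = np.stack([load_image(p, grayscale=True) for p, _ in batch])
            seqs = np.stack([encode_sequence(m) for _, m in batch])
            # Teacher forcing: input is the sequence shifted right by one.
            yield [images, seqs[:, :-1]], seqs[:, 1:]

samples = [("img%d.png" % i, "<p>hi</p>") for i in range(4)]
(x_img, x_seq), y = next(data_generator(samples, batch_size=2))
# Then, instead of model.fit(...):
# model.fit_generator(data_generator(samples, 2), steps_per_epoch=2, epochs=100)
```

The key point is that the generator materialises only one batch at a time, so memory use stays flat at batch size 2 or 3 regardless of dataset size.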
With this, I'm not able to bring the model loss below ~1.8, even after 100 epochs.
Do you have any leads, ideas on how to go forward?
Thanks!
Hi Adam,
I haven't tried it with OCR and markup, but I think it's an interesting area to explore.
There are a few things that come to mind:
- The model learns the tokens, so a larger vocabulary for OCR should not be a problem in itself
- Using the loss to measure progress is not ideal. Use a BLEU score on a test set, explained here: https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/
- The current image size is enough to recognize layouts, divs, and buttons, but I think it's too small for characters
- Start with fewer characters, then gradually add more
- More training samples
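On the BLEU point, here is a minimal, self-contained sketch of a smoothed sentence-level BLEU you could run on a held-out test set. The add-one smoothing and the whitespace tokenisation are my assumptions for illustration, not exactly what the blog post uses (NLTK's `sentence_bleu` is a ready-made alternative).

```python
import math
from collections import Counter

def bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        # Clipped n-gram overlap between candidate and reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short outputs.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * geo_mean

reference = "<h1> Title </h1> <p> hello world </p>".split()
generated = "<h1> Title </h1> <p> hello word </p>".split()
perfect = bleu(reference, reference)   # identical output scores 1.0
partial = bleu(reference, generated)   # one wrong token scores below 1.0
```

Unlike the raw loss, this measures whether the generated markup actually matches the reference token for token, which is closer to what you care about.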
HarvardNLP created a dataset with markup and characters. I'd try it on their dataset with 100K HTML samples. Here is the code and data: https://github.com/harvardnlp/im2markup . I've also uploaded the dataset on floydhub: https://www.floydhub.com/emilwallner/datasets/100k-html