emilwallner/Screenshot-to-code

Help with OCR, please


Hey Emil,

thanks a lot for sharing the code. This is extremely helpful!

I know this question has been asked before, but I thought I'd ask it one more time, as you might have spent more time on it since.

Were you able to make a model that does both OCR and markup? My main challenge lies in the fact that your approach relies on a vocabulary, so it seems to require a priori knowledge of the vocabulary.

On my quest to try and make it generalisable, I tried the following:

  • restricted the vocabulary to printable characters and the HTML tags
  • switched from fit to Model's fit_generator method (otherwise the data wouldn't fit in RAM); I only have an NVIDIA P100, so I use a training batch size of 2 or 3 for memory reasons (see the generator sketch after this list)
  • passed grayscale=True to load_image (also for RAM reasons; my text is just black on a white background)
  • increased the max length of each sequence (since it's now characters rather than words)
  • I didn't change the rest of the code
  • for now, I'm training with about 350 images, all generated automatically from HTML snippets. Each snippet contains only a few tags (<p>, <h1>, ...). I also don't pass the <body> tag to the training, since it's the same for all images.
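Here is a minimal sketch of the kind of generator I pass to fit_generator (simplified; the image size, helper names, and the character-level sequences are placeholders for my actual setup):

```python
import numpy as np
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def data_generator(image_paths, sequences, max_length, vocab_size, batch_size=2):
    """Yield small (image, partial sequence) -> next-token batches so everything fits in RAM."""
    while True:
        images, in_seqs, out_tokens = [], [], []
        for path, seq in zip(image_paths, sequences):
            # grayscale keeps memory low; my pages are black text on a white background
            img = img_to_array(load_img(path, grayscale=True, target_size=(256, 256))) / 255.0
            for i in range(1, len(seq)):
                images.append(img)
                in_seqs.append(seq[:i])
                out_tokens.append(to_categorical(seq[i], num_classes=vocab_size))
                if len(images) == batch_size:
                    yield ([np.array(images), pad_sequences(in_seqs, maxlen=max_length)],
                           np.array(out_tokens))
                    images, in_seqs, out_tokens = [], [], []

# model.fit_generator(data_generator(paths, seqs, max_length, vocab_size, batch_size=2),
#                     steps_per_epoch=..., epochs=100)
```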

With this, I'm not able to bring the model loss below ~1.8, even after 100 epochs.

Do you have any leads, ideas on how to go forward?

Thanks!

Hi Adam,

I haven't tried it with OCR and markup, but I think it's an interesting area to explore.

There are a few things that come to mind:

  • The model learns the tokens, so a larger vocabulary for OCR should not be a problem in itself
  • Using the loss to measure progress is not ideal. Use a BLEU score on a test set (see the sketch after this list), explained here: https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/
  • The current image size is enough to recognize layouts, divs, and buttons, but I think it's too small for characters
  • Start with fewer characters, then gradually add more
  • Use more training samples
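A rough sketch of the BLEU evaluation, using NLTK's corpus_bleu on toy token sequences (swap in your model's predicted and reference markup, tokenized the same way as your training data):

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy reference and predicted markup, tokenized like the training data
actual = [['<body>', '<p>', 'hello', 'world', '</p>', '</body>'],
          ['<body>', '<h1>', 'a', 'title', '</h1>', '</body>']]
predicted = [['<body>', '<p>', 'hello', 'world', '</p>', '</body>'],
             ['<body>', '<h1>', 'a', 'heading', '</h1>', '</body>']]

# corpus_bleu expects a list of reference lists per sample
references = [[ref] for ref in actual]
score = corpus_bleu(references, predicted)  # 4-gram BLEU; closer to 1.0 is better
print('BLEU-4: %.3f' % score)
```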

HarvardNLP created a dataset with markup and characters. I'd try it on their dataset with 100K HTML samples. Here is the code and data: https://github.com/harvardnlp/im2markup. I've also uploaded the dataset to FloydHub: https://www.floydhub.com/emilwallner/datasets/100k-html