tobiasvanderwerff/full-page-handwriting-recognition

What does the mark-up look like?

Closed this issue · 1 comment

Hi,
In your paper you mention several scenarios where text can appear in different regions. I assume that you label each region of a full page in some way, and that your network only works on the text regions, ignoring all the others.
What confuses me is how your network differentiates between the different regions. What does the annotation of a full page that you provide during training look like?

Thanks in advance

Hi there,

The ground truth data does not take into account the location of the text regions. The annotation only contains the raw text. For example, an annotation for a full document may look like this:

The quick brown fox jumps
over the lazy dog

That's it. The annotation contains no information about the location of the text regions. I guess the second part of your question is then how it is possible to transcribe such a document without providing the text locations during training. The answer is that the so-called attention mechanism employed by the Transformer decoder can attend to specific regions of the image when making text predictions. Although it is not explicitly labeled where the text is in the image, the raw text annotation should provide enough information for the model to find the location of the corresponding text regions. If this is still unclear to you, I suggest you have a closer look at the attention mechanism (for example, Lilian Weng wrote a very nice blog post about it).
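To make this a bit more concrete, here is a minimal, self-contained sketch (not the actual code from this repository) of the idea: an image encoder produces a sequence of page features, and a Transformer decoder with cross-attention predicts the characters of the raw-text annotation one by one, attending to whatever part of the feature map it needs. The vocabulary, the patch dimensions, and the `encode_annotation` helper are all illustrative assumptions, not values from the paper.

```python
# Minimal sketch: a Transformer decoder cross-attending over page features,
# trained only on raw-text targets -- no text-region coordinates are supplied.
import torch
import torch.nn as nn

VOCAB = ["<pad>", "<sos>", "<eos>", "\n", " "] + list("abcdefghijklmnopqrstuvwxyz")
PAD, SOS, EOS = 0, 1, 2
D_MODEL = 256

class FullPageHTRSketch(nn.Module):
    def __init__(self, patch_dim: int = 768):
        super().__init__()
        # Stand-in image encoder: projects flattened page patches to d_model.
        # In practice this would be a CNN or Transformer backbone.
        self.encoder_proj = nn.Linear(patch_dim, D_MODEL)
        self.char_emb = nn.Embedding(len(VOCAB), D_MODEL, padding_idx=PAD)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=D_MODEL, nhead=8, batch_first=True
        )
        # Cross-attention inside each decoder layer lets every predicted
        # character attend to any region of the page feature map.
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, len(VOCAB))

    def forward(self, patches: torch.Tensor, tgt_tokens: torch.Tensor):
        memory = self.encoder_proj(patches)               # (B, num_patches, D)
        tgt = self.char_emb(tgt_tokens)                   # (B, T, D)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                             # (B, T, vocab)

def encode_annotation(text: str) -> torch.Tensor:
    """Turn a raw-text page annotation into a character-level target sequence."""
    ids = [SOS] + [VOCAB.index(c) for c in text.lower() if c in VOCAB] + [EOS]
    return torch.tensor(ids).unsqueeze(0)

# Toy usage: the annotation is just the page text, newlines included.
annotation = "The quick brown fox jumps\nover the lazy dog"
tgt = encode_annotation(annotation)
model = FullPageHTRSketch()
patches = torch.randn(1, 196, 768)                        # fake page features
logits = model(patches, tgt[:, :-1])                      # teacher forcing
loss = nn.CrossEntropyLoss(ignore_index=PAD)(
    logits.reshape(-1, len(VOCAB)), tgt[:, 1:].reshape(-1)
)
```

The only link between a predicted character and a location on the page is the cross-attention inside the decoder layers, which is why the ground truth can get away with containing no region coordinates at all.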

Hope this clears things up.

Best, Tobias