Logical Tags/classes
I don't understand how the logical tags in hOCR should be used. Moreover, I see potential conflicts with other nested tags from the layout. AFAIK ocropus itself does not use any logical tags and tesseract only supports ocr_par. For most hocr logical classes there are equivalent html tags and therefore I don't see any advantage to add special logical hocr classes there.
Some more specific questions about the logical hocr classes:
- Is the ocr_document the same as the html document or can there be multiple ocr_documents in the same html document?
- "The standard HTML tags given in brackets specify the preferred HTML tags to use with those logical structuring elements." How exactly are these elements used? Are they just marking the beginning of something new or should they be nested into each other? What happens, for example, if the page break happens inside the abstract, i.e. the abstract is spread among two images?
- Should ocr_authors be used to indicate some "byline" area or should there be some metadata about the authors given there?
- What is ocr_display?
- Is ocr_linear a special case of ocr_par or why is it inside this subsection?
What do you think?
As @mittagessen said, the semantics of these tags are pure guesswork, since there is little in the spec beyond "These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others."
For most hocr logical classes there are equivalent html tags and therefore I don't see any advantage to add special logical hocr classes there.
If you don't need the logical classes, you can just use the typesetting classes. hOCR obviously comes from a time before HTML5 (newer tags, data- attributes, microdata etc.), it's more like microformats. You can then use e.g. nested <section|article|address|div> or some other tag/mechanism for organising logical structure. It would only be relevant if any tools expected these classes to have meaning but since no one produces them, no one consumes them.
Is the ocr_document the same as the html document
No, I wouldn't introduce that restriction, plus it would be redundant. I think it's more of an optional indicator where the OCR document begins vs. where the pages are.
can there be multiple ocr_documents in the same html document?
Yes, since we have no semantics, I would not restrict unless there's a good reason.
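A minimal sketch of what that permissive reading implies for a consumer, assuming a hypothetical file in which two ocr_document elements share one HTML document. The sample markup and the helper name count_class are made up for illustration and are not taken from the spec or from any existing tool:

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html_text, class_name):
    """Return how many elements in html_text carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Hypothetical input: two OCR documents inside a single HTML document.
SAMPLE = """
<html><body>
  <div class='ocr_document'><div class='ocr_page'>first scan</div></div>
  <div class='ocr_document'><div class='ocr_page'>second scan</div></div>
</body></html>
"""

print(count_class(SAMPLE, "ocr_document"))
```

A consumer written this way does not have to assume that the HTML document and the OCR document coincide.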
Should ocr_authors be used to indicate some "byline" area or should there be some metadata about the authors given there?
What is ocr_display?
I think @tmbdev was strongly inspired by LaTeX, both in terms of hierarchical structure as well as typesetting. In LaTeX, display math mode means block level formulas as opposed to inline. Cf. http://kba.github.io/hocr-spec/1.2/#ocr_math
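For readers less familiar with LaTeX, the inline/display distinction referred to here looks roughly like this. The formula is only an illustration, and whether ocr_display maps exactly onto display mode is the assumption under discussion:

```latex
% Inline math: the formula is set within the running text of the paragraph.
The identity $a^2 + b^2 = c^2$ holds for right triangles.

% Display math: the formula is set as its own block between paragraphs.
\[
  a^2 + b^2 = c^2
\]
```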
"The standard HTML tags given in brackets specify the preferred HTML tags to use with those logical structuring elements." How exactly are these elements used? Are they just marking the beginning of something new or should they be nested into each other?
Not sure if I understand your question. I think the tags in square brackets represent the HTML tag name you should use for an element with this class.
ocr_part [<h1>]
->
<h1 class='ocr_part'> ... </h1>
EDIT: Now I understand. That paragraph is in the wrong section, I'll fix it.
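The bracket notation can be read as a lookup from class to preferred tag. A small sketch under that reading: only the ocr_part to h1 pair is taken from the example above, while the other table entries, the div fallback and the function name render_logical are assumptions for illustration:

```python
from html import escape

# Class -> preferred HTML tag. Only ocr_part -> h1 comes from the example
# in this thread; the remaining entries are assumed for illustration.
PREFERRED_TAG = {
    "ocr_part": "h1",
    "ocr_section": "h2",
    "ocr_par": "p",
}


def render_logical(ocr_class, text, default_tag="div"):
    """Wrap text in the preferred tag for ocr_class (default_tag if unknown)."""
    tag = PREFERRED_TAG.get(ocr_class, default_tag)
    return f"<{tag} class='{escape(ocr_class, quote=True)}'>{escape(text)}</{tag}>"


print(render_logical("ocr_part", "Part I"))
print(render_logical("ocr_abstract", "a < b"))
```

Classes without a bracketed tag fall back to a neutral element here, which is a design choice of this sketch and not something the spec states.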
Is ocr_linear a special case of ocr_par or why is it inside this subsection?
No, that is just an error. We should turn those into something more compact, as done for metadata and HTML markup section.
There are too many things in the spec which are not very clear.
I see two possible solutions:
- Try to understand the original author's intent. This is often guesswork.
- Ask the original author (Tom) to clarify a few things for us.
I didn't see @kba's comment before sending mine (he sent his while I was editing mine).
It's funny we both used the word 'guesswork'.
- Ask the original author (Tom) to clarify a few things for us.
Agreed, but I'd say the semantics of the logical structuring elements are low priority and should probably just be handled in a subsection. HTML has good mechanisms for expressing the logical structure of a document.
I'm not saying we shouldn't ask Tom, but I think it's more important to specify the semantics and mechanics of features that might actually be used but aren't (?), such as reading order (ocr_linear), grouping and so on.
It's funny we both used the word 'guesswork'.
It is kinda telling :)
Okay, it looks like we agree that this section involves a lot of "guesswork", but the logical structure elements are not used much anyhow and are therefore only low priority. I added the label "postpone" here for now.
On Thu, Oct 20, 2016 at 4:40 AM, Konstantin Baierer <notifications@github.com> wrote:
As @mittagessen said, the
semantics of these tags are pure guesswork, since there is little in the
spec beyond "These logical tags have their standard meaning as used in the
publishing industry and tools like LaTeX, MS Word, and others."
I'm not sure what additional semantics you are looking for. The logical
markup in hOCR is basically the same as that found in LaTeX and is intended
to have the same semantics.
For most hocr logical classes there are equivalent html tags and therefore
I don't see any advantage to add special logical hocr classes there.
The reason hOCR defines how to encode logical markup as either HTML tags
or as hOCR classes is because there are different use cases that require
one or the other. Keep in mind that hOCR isn't just an encoding of OCR
output in HTML, it is actual HTML that can be displayed in a browser. When
you display it in a browser, you can copy and paste it, and the OCR
metadata gets copied along with the text itself (this is not true for
formats like ALTO). Sometimes, in such use cases, it is OK to use HTML tags
directly, in other cases, you want to keep the logical layout information
around but not have it affect the HTML presentation.
If you don't need the logical classes, you can just use the typesetting
classes. hOCR obviously comes from a time before HTML5 (newer tags, data-
attributes, microdata etc.), it's more like microformats
(http://microformats.org/).
hOCR was developed around the time HTML5 came out, but it seemed important
at the time to still support older versions of HTML. I'm not sure that is
still true. It may be worth revisiting that question.
Is the ocr_document the same as the html document
can there be multiple ocr_documents in the same html document?
Should ocr_authors be used to indicate some "byline" area or should there
be some metadata about the authors given there?
What is ocr_display?
I think @tmbdev was strongly inspired by
LaTeX, both in terms of hierarchical structure as well as typesetting. In
LaTeX, display math mode means block level formulas as opposed to inline.
Cf. http://kba.github.io/hocr-spec/1.2/#ocr_math
Correct. Basically, for any of these tags, the intent is to follow what
LaTeX does. For example, as in LaTeX, ocr_author does not encode document
metadata, it merely indicates that an area of the page contains author
information, in no particular format (it might even be an image). For
actual, machine readable document metadata, hOCR uses Dublin Core, but that
is unrelated to the logical layout tags.
Is ocr_linear a special case of ocr_par or why is it inside this
subsection?
No, that is just an error. We should turn those into something more
compact, as done for metadata and HTML markup section.
The nesting hierarchy is indicated in the figure below; probably the list
above should be merged with the hierarchy into a single figure.
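The point made earlier in this reply, that logical markup can be carried either by HTML tags or by hOCR classes alone, can be sketched as follows. This is only an illustration: the two sample fragments, the use of span as a presentation-neutral element, the ocr_section/h2 pairing and the helper name logical_elements are all assumptions, not taken from the spec text quoted in this thread:

```python
from html.parser import HTMLParser


class LogicalClassCollector(HTMLParser):
    """Collect (tag, class) pairs for elements carrying any wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = []

    def handle_starttag(self, tag, attrs):
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.wanted:
                self.found.append((tag, cls))


def logical_elements(html_text, wanted=("ocr_section", "ocr_par")):
    """Return the logical elements of html_text in document order."""
    collector = LogicalClassCollector(wanted)
    collector.feed(html_text)
    collector.close()
    return collector.found


# Encoding 1: the logical structure is also expressed by the HTML tags.
AS_TAGS = "<h2 class='ocr_section'>Intro</h2><p class='ocr_par'>Text</p>"
# Encoding 2: the same structure on neutral elements, so the browser's
# default heading and paragraph styling is not triggered.
AS_CLASSES = "<span class='ocr_section'>Intro</span><span class='ocr_par'>Text</span>"

print(logical_elements(AS_TAGS))
print(logical_elements(AS_CLASSES))
```

A consumer that keys on the class attribute sees the same logical sequence in both fragments; only the tag names, and hence the rendering in a browser, differ.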