kba/hocr-spec

Classes for Inline Representation


6 Inline Representations

6.1 Classes for Inline Representation
    6.1.1 ocr_glyph
    6.1.2 ocr_glyphs
    6.1.3 ocr_dropcap
    6.1.4 ocr_chem
    6.1.5 ocr_math
    6.1.6 Non-breaking space
    6.1.7 Non-default spaces
    6.1.8 Hyphenation
    6.1.9 Superscript and Subscript
    6.1.10 Ruby characters

The term 'classes' implies markup that carries a class="..." attribute.
So,

    6.1.6 Non-breaking space
    6.1.7 Non-default spaces
    6.1.8 Hyphenation
    6.1.9 Superscript and Subscript
    6.1.10 Ruby characters

should not be under 'Classes for Inline Representation'.
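
To make the distinction concrete, here is a minimal Python sketch that splits the inline tags of a fragment by whether they carry a class attribute. The sample fragment is made up for illustration, and it assumes superscripts, subscripts and ruby are written as plain <sup>, <sub> and <ruby> tags; `InlineSplitter` and `split_inline` are hypothetical names, not part of any hOCR tool.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written only for illustration.
SAMPLE = (
    '<span class="ocr_line">'
    '<span class="ocr_dropcap">T</span>he H<sub>2</sub>O '
    '<span class="ocr_math">x<sup>2</sup></span> '
    '<ruby>漢<rt>kan</rt></ruby>'
    '</span>'
)


class InlineSplitter(HTMLParser):
    """Collect start tags, split by whether they carry a class attribute."""

    def __init__(self):
        super().__init__()
        self.with_class = []     # (tag, class value) pairs
        self.without_class = []  # bare tag names

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls is not None:
            self.with_class.append((tag, cls))
        else:
            self.without_class.append(tag)


def split_inline(fragment):
    """Return (tags with a class attribute, tags without one)."""
    parser = InlineSplitter()
    parser.feed(fragment)
    parser.close()
    return parser.with_class, parser.without_class


with_class, without_class = split_inline(SAMPLE)
print(with_class)     # elements identified by class="..."
print(without_class)  # plain inline markup with no class
```

In this sample only the ocr_* elements land in the first list; the sub/sup/ruby markup lands in the second, which is the group that arguably does not belong under a 'Classes' heading.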

A more fitting subsection heading for these might be
'Special characters and inline markup',
still under 'Inline Representations'.

kba commented

This structure, with section titles matching class/property names, is fundamentally flawed. I did that mostly to have a reference to all terms via the table of contents, but I don't think that's necessary anymore. E.g. instead of

    6.1.1 ocr_glyph
    6.1.2 ocr_glyphs

there could be a single section titled "Unrecognized text as image" or similar.

kba commented

6.1.10 Ruby characters should be in "Font, Text Color, Language, Direction".

The remaining sections

    6.1.6 Non-breaking space
    6.1.7 Non-default spaces
    6.1.8 Hyphenation

could go under a heading like "Word segmentation" or "Spacing".