kba/hocr-spec

Terminology: Glyphs, characters, codepoints

kba commented

#17 (comment)

None of these are "per-glyph" because "glyph" isn't a uniquely defined
concept independent of font. As far as hOCR is concerned, you need to
output information per codepoint. There is no single correct way of doing
that: it depends on the script, the encoding, and the OCR engine.

For bounding boxes (or cuts) on accented Western scripts, my recommendation
would be: (1) view the whole accented letter as a single glyph, (2) use
normalized Unicode with composed characters, (3) if a single glyph
corresponds to multiple codepoints, output a bounding box for the first
codepoint and output empty bounding boxes for the remaining codepoints.
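
To make the three-step recommendation above concrete, here is a minimal Python sketch. It is only an illustration, not anything the spec or an OCR engine defines: the function name `boxes_per_codepoint`, the `(text, bbox)` input shape, and the choice of `(0, 0, 0, 0)` as the "empty" bounding box are all assumptions made for this example. Only `unicodedata.normalize` is a real standard-library call.

```python
import unicodedata

# Assumption: an "empty" bounding box is written as all zeros.
# The quoted comment does not say how an empty box should be encoded.
EMPTY_BBOX = (0, 0, 0, 0)


def boxes_per_codepoint(glyphs):
    """Expand per-glyph boxes into per-codepoint boxes.

    `glyphs` is a list of (text, bbox) pairs, one pair per recognized
    glyph, where bbox is (x0, y0, x1, y1). This input shape is
    hypothetical; a real engine will have its own structures.
    """
    result = []
    for text, bbox in glyphs:
        # Step (2): normalize to composed form (NFC) so that an accented
        # letter becomes a single codepoint whenever Unicode has one.
        composed = unicodedata.normalize("NFC", text)
        for index, codepoint in enumerate(composed):
            # Step (3): the first codepoint carries the glyph's box,
            # any remaining codepoints get an empty box.
            result.append((codepoint, bbox if index == 0 else EMPTY_BBOX))
    return result


if __name__ == "__main__":
    glyphs = [
        ("e\u0301", (10, 5, 20, 30)),  # e + combining acute, composes to U+00E9
        ("q\u0303", (22, 5, 32, 30)),  # q + combining tilde, no composed form
    ]
    for codepoint, bbox in boxes_per_codepoint(glyphs):
        print(f"U+{ord(codepoint):04X}", bbox)
```

The second glyph shows why step (3) is needed: Unicode has no precomposed "q with tilde", so even after normalization the glyph still maps to two codepoints, and only the first one gets a real box.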

We should define these three terms in the spec and replace "character" with "codepoint" (s/character/codepoint/).