Discussion: Transcription of decimal dot in numbers
wollmers opened this issue · 0 comments
In some numbers in the original images, the decimal dot does not sit near the baseline. It appears either at the height of the hyphen or at the top edge (the height of capitals).
IMHO, for broader use of the GT files (OCR training, benchmarks), an intermediate transcription level should be used, i.e. Unicode without PUA, staying as close as possible to the original glyphs (long s), spelling, etc. Converting to the basic level (current spelling, German keyboard) is easier than converting in the other direction.
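To illustrate why the downward conversion is the easy direction, here is a minimal sketch in Python. The function name `to_basic` and the rule set are hypothetical and only cover two cases (raised decimal dots between digits, and long s); they are not the project's actual conversion rules.

```python
import re

# Assumption: raised decimal dots were transcribed as MIDDLE DOT or DOT ABOVE.
RAISED_DOTS = "\u00B7\u02D9"

# Match a raised dot only when it stands between two ASCII digits.
DECIMAL_DOT = re.compile(r"(?<=[0-9])[" + RAISED_DOTS + r"](?=[0-9])")


def to_basic(text: str) -> str:
    """Fold an intermediate-level transcription down to the basic level."""
    text = DECIMAL_DOT.sub(".", text)    # raised decimal dot -> FULL STOP
    return text.replace("\u017F", "s")   # LATIN SMALL LETTER LONG S -> s


print(to_basic("3\u00B714 Wa\u017F\u017Fer"))
```

Going the other way is not possible from the text alone: once every dot is a FULL STOP, "3.14" no longer records at which height the dot was printed.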
Which dots are available in Unicode:
char  code point  name (general category)
'.' U+002E FULL STOP (Other_Punctuation)
'·' U+00B7 MIDDLE DOT (Other_Punctuation)
'˙' U+02D9 DOT ABOVE (Modifier_Symbol)
'·' U+0387 GREEK ANO TELEIA (Other_Punctuation)
'᛫' U+16EB RUNIC SINGLE PUNCTUATION (Other_Punctuation)
'․' U+2024 ONE DOT LEADER (Other_Punctuation)
'‧' U+2027 HYPHENATION POINT (Other_Punctuation)
'∙' U+2219 BULLET OPERATOR (Math_Symbol)
'⋅' U+22C5 DOT OPERATOR (Math_Symbol)
'⸱' U+2E31 WORD SEPARATOR MIDDLE DOT (Other_Punctuation)
'⸳' U+2E33 RAISED DOT (Other_Punctuation)
'・' U+30FB KATAKANA MIDDLE DOT (Other_Punctuation)
'ꞏ' U+A78F LATIN LETTER SINOLOGICAL DOT (Other_Letter)
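The listing above can be reproduced with Python's standard `unicodedata` module. This is a sketch; the names `CANDIDATES` and `describe` are made up for illustration. Note that `unicodedata.category` returns the two-letter abbreviations (`Po` for Other_Punctuation, `Sk` for Modifier_Symbol, `Sm` for Math_Symbol, `Lo` for Other_Letter).

```python
import unicodedata

# Code points taken from the listing above.
CANDIDATES = [
    0x002E, 0x00B7, 0x02D9, 0x0387, 0x16EB, 0x2024, 0x2027,
    0x2219, 0x22C5, 0x2E31, 0x2E33, 0x30FB, 0xA78F,
]


def describe(cp: int) -> tuple[str, str, str]:
    """Return (code point label, character name, general category)."""
    ch = chr(cp)
    return f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch)


for cp in CANDIDATES:
    label, name, category = describe(cp)
    print(f"{chr(cp)!r} {label} {name} ({category})")
```

One thing to keep in mind when choosing: GREEK ANO TELEIA (U+0387) is canonically equivalent to MIDDLE DOT (U+00B7), so `unicodedata.normalize("NFC", "\u0387")` yields U+00B7 and the distinction would not survive normalization.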
MIDDLE DOT appears frequently in current and old typography and is available in most fonts.
But I hesitate to use DOT ABOVE, because it is a modifier symbol. We can use it for now and perhaps convert later, after gathering some opinions.