ascii-string in the properties grammar BNF
wollmers commented
Version 1.2 of the spec has:
ascii-string = +(%x01-FF - semicolon) ; printable ascii without semicolon
delimited-string = doublequote ascii-string doublequote
delimited-string is mostly used in the title attribute for filenames or links.
The specs for HTML 4.01 and XHTML define the title string as CDATA, which depends on the character encoding.
XML has a better definition:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
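To make the ranges in that production concrete, here is a minimal Python sketch that tests a single code point against it. The function name is_xml_char is made up for illustration and is not part of any spec or library.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if the single code point `ch` matches XML's Char production.

    Hypothetical helper for illustration only; the ranges are copied from
    the production quoted above.
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to just before the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # after the surrogates, excluding FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )
```

Note that this excludes most C0 control characters such as x01, which the current hOCR range %x01-FF allows.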
First, the name ascii in the current hOCR definition (%x01-FF) is misleading, because ASCII ends at x7F. The range seems to target bytewise parsing or 8-bit encodings, not Unicode code points.
It would be better defined as any character except semicolon and double quote.
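A rough Python sketch of how a parser might check delimited-string under that suggested definition. This is only one possible reading: the names DELIMITED_STRING and is_delimited_string are hypothetical, and whether control characters should also be excluded is an open question that the sketch does not settle.

```python
import re

# Assumed reading of the suggestion: one or more characters, none of them
# a semicolon or a double quote, wrapped in double quotes. Python's `str`
# patterns work on Unicode code points, not bytes.
DELIMITED_STRING = re.compile(r'"[^";]+"')


def is_delimited_string(text: str) -> bool:
    """Return True if `text` is a quoted string free of ';' and inner '"'."""
    return DELIMITED_STRING.fullmatch(text) is not None
```

With this, a filename such as "Seite_ä.png" or "日本語.png" is accepted in the same way as a plain ASCII one, while "a;b" is rejected.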