kba/hocr-spec

ascii-string in the properties grammar BNF

Version 1.2 of the spec has:

ascii-string     = +(%x01-FF - semicolon)  ; printable ascii without semicolon
delimited-string = doublequote ascii-string doublequote

delimited-string is mostly used in the title attribute for filenames or links.

The specs for HTML 4.01 and XHTML define the title string as CDATA, so the allowed characters depend on the character encoding.

XML has a better definition:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

First, the name ascii in the current hOCR definition %x01-FF is misleading, because ASCII ends at x7F. The range seems aimed at bytewise parsing or 8-bit encodings rather than Unicode code points.
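
To illustrate the mismatch, here is a minimal Python sketch. It assumes the range %x01-FF is read as Unicode code points (one possible reading; a bytewise reading over UTF-8 would behave differently). The names `CURRENT_ASCII_STRING` and `matches_current` are made up for this example.

```python
import re

# Current hOCR 1.2 rule, read as code points: %x01-FF minus semicolon (x3B).
# Assumption: the range applies to code points, not to encoded bytes.
CURRENT_ASCII_STRING = re.compile(r"[\x01-\x3A\x3C-\xFF]+")


def matches_current(s: str) -> bool:
    """Return True if s matches ascii-string as currently defined."""
    return CURRENT_ASCII_STRING.fullmatch(s) is not None


print(matches_current("\x07"))    # BEL control character, not printable
print(matches_current("\u00e9"))  # e with acute accent, outside ASCII
print(matches_current("\u0142"))  # l with stroke, above xFF
print(matches_current('a"b'))     # doublequote inside the string
```

Under this reading the rule accepts control characters and Latin-1 letters, rejects anything above xFF, and also accepts a doublequote inside ascii-string, which makes the end of a delimited-string ambiguous.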

It seems it would be better defined as any character except semicolon and doublequote.
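
A rough sketch of what that could look like, written as a Python check rather than as grammar text. It assumes "any char" means the XML 1.0 Char production quoted above, which is only one way to read the proposal; `PROPOSED_STRING` and `is_valid_string` are hypothetical names.

```python
import re

# XML 1.0 Char production minus doublequote (x22) and semicolon (x3B).
# Assumption: "any char" is taken to mean XML Char; the spec may choose otherwise.
PROPOSED_STRING = re.compile(
    r"[\t\n\r\x20-\x21\x23-\x3A\x3C-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def is_valid_string(s: str) -> bool:
    """Return True if s is a non-empty run of allowed characters."""
    return PROPOSED_STRING.fullmatch(s) is not None


print(is_valid_string("scan-\u00fc.png"))  # Latin-1 letter in a filename
print(is_valid_string("\u65e5\u672c\u8a9e.tif"))  # CJK filename
print(is_valid_string('a"b'))  # doublequote
print(is_valid_string("a;b"))  # semicolon
```

With this definition a delimited-string can no longer contain its own delimiter, and filenames outside Latin-1 become valid.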