2.0: Replace title= props with data-ocr-* attributes
Reusing the title= attribute of HTML elements for OCR-specific values is bad practice. It's understandable, since at the time of hOCR's initial development there were few mechanisms to extend HTML, but HTML5 offers quite a few.
In a (possible) next major revision of the standard, we could use data-ocr-* attributes for that purpose.
<span id="line1" class="ocr_line" title="bbox 0 0 100 100">...</span>
could be expressed as
<span id="line1" data-ocr-tag="line" data-ocr-bbox="[0,0,100,100]"> ... </span>
This is more verbose, but it would make it much easier to specify behavior and work with the content. For example, in JavaScript you could do:
var line = document.querySelector("#line1");
var bbox = JSON.parse(line.dataset.ocrBbox);
var width = bbox[2] - bbox[0];
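To make the proposed mapping concrete, here is a rough sketch of how existing hOCR markup might be converted. The function name titleToDataAttributes is hypothetical, and the rules it applies (strip the ocr_ or ocrx_ prefix from the class, split the title on semicolons, serialize numeric values as a JSON array) are assumptions drawn from the example above, not anything the spec defines.

```javascript
// Hypothetical helper: turn an hOCR class name and title string into the
// data-ocr-* attributes proposed above. Naming and rules are assumptions.
function titleToDataAttributes(className, title) {
  var attrs = {};
  // "ocr_line" -> "line"; also strips the "ocrx_" prefix.
  attrs["data-ocr-tag"] = className.replace(/^ocrx?_/, "");
  title.split(";").forEach(function (prop) {
    var parts = prop.trim().split(/\s+/);
    var name = parts.shift();
    if (!name) return; // skip empty segments, e.g. after a trailing ";"
    var numbers = parts.map(Number);
    // Assumption: all-numeric values become a JSON array (as with bbox);
    // anything else is kept as the raw string.
    attrs["data-ocr-" + name] = numbers.some(isNaN)
      ? parts.join(" ")
      : JSON.stringify(numbers);
  });
  return attrs;
}

var attrs = titleToDataAttributes("ocr_line", "bbox 0 0 100 100");
var parsedBbox = JSON.parse(attrs["data-ocr-bbox"]);
var lineWidth = parsedBbox[2] - parsedBbox[0];
```

The serialized values can be read back with JSON.parse, just as with line.dataset.ocrBbox in the snippet above.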
I think the data-ocr-* attributes would be a good way to continue. But is there any reason to change the class as well? It is standard HTML and has very good support, e.g. document.getElementsByClassName("ocr_line").
It would be easier to map between formats (ALTO) and serializations if the OCR application profile of the HTML were uniform, i.e. if you didn't force a naming convention on class, id or title.