2.0: Replace title= props with data-ocr-* attributes

Question

Opened this issue 8 years ago · 2 comments

Reusing the title= attribute of HTML elements for OCR-specific values is bad practice. It was understandable when hOCR was first developed, since there were few mechanisms for extending HTML at the time, but HTML5 offers several, including custom data-* attributes.

In a (possible) next major revision of the standard, we could use data-ocr-* attributes for that purpose.

<span id="line1" class="ocr_line" title="bbox 0 0 100 100">...</span>

could be expressed as

<span id="line1" data-ocr-tag="line" data-ocr-bbox="[0,0,100,100]"> ... </span>

This is more verbose, but it would make it much easier to specify behavior and work with the content. For example, in JavaScript you could do:

var line = document.querySelector("#line1");
var bbox = JSON.parse(line.dataset.ocrBbox); // [x0, y0, x1, y1]
var width = bbox[2] - bbox[0];
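
For existing documents, the conversion from the current title= syntax to the proposed attributes could be sketched roughly as follows. This is only an illustration: titleToDataAttributes is a hypothetical helper, the data-ocr-<property> naming for anything other than bbox is an assumption, and quoted values (such as file names containing spaces) are not handled.

```javascript
// Hypothetical helper: derive the proposed data-ocr-* attributes from a
// legacy hOCR class name and title string.
function titleToDataAttributes(className, title) {
  const attrs = {};
  // "ocr_line" -> "line"; assumes the class carries an "ocr_" prefix.
  attrs["data-ocr-tag"] = className.replace(/^ocr_/, "");
  // hOCR separates properties in title= with semicolons.
  for (const prop of title.split(";")) {
    const [name, ...values] = prop.trim().split(/\s+/);
    if (!name) continue; // skip empty segments, e.g. after a trailing ";"
    // Numeric values become numbers so the result matches "[0,0,100,100]".
    const parsed = values.map((v) => (Number.isNaN(Number(v)) ? v : Number(v)));
    attrs[`data-ocr-${name}`] = JSON.stringify(parsed);
  }
  return attrs;
}

const attrs = titleToDataAttributes("ocr_line", "bbox 0 0 100 100");
console.log(attrs["data-ocr-tag"], attrs["data-ocr-bbox"]);
```

Each value is stored as a JSON array so that it can be read back with JSON.parse, as in the snippet above.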

Answer 1 · 2016-10-22T13:01:35.000Z

I think the data-ocr-* attributes would be a good way forward. But is there any reason to change the class as well? The class attribute is standard HTML and is very well supported, for example through document.getElementsByClassName("ocr_line").

Answer 2 · 2016-10-22T13:30:47.000Z

Mapping between formats (such as ALTO) and serializations would be easier if the OCR application profile of the HTML were uniform, i.e. if it did not force a naming convention on class, id or title.
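
As an illustration of such a mapping, here is a minimal sketch that turns the proposed attributes into ALTO-style geometry. It assumes the bbox is [x0, y0, x1, y1] in pixels and that data-ocr-tag="line" corresponds to ALTO's TextLine element; toAltoAttributes and TAG_TO_ALTO are hypothetical names, not part of either specification.

```javascript
// Hypothetical lookup from the proposed data-ocr-tag values to ALTO elements.
const TAG_TO_ALTO = { line: "TextLine" };

// Convert a plain object of data-ocr-* attributes into ALTO-style attributes.
// ALTO stores an origin plus a size (HPOS, VPOS, WIDTH, HEIGHT) rather than
// two corner points, so the size is computed from the bbox corners.
function toAltoAttributes(dataAttrs) {
  const [x0, y0, x1, y1] = JSON.parse(dataAttrs["data-ocr-bbox"]);
  return {
    element: TAG_TO_ALTO[dataAttrs["data-ocr-tag"]] || null, // null if unmapped
    HPOS: x0,
    VPOS: y0,
    WIDTH: x1 - x0,
    HEIGHT: y1 - y0,
  };
}

const alto = toAltoAttributes({
  "data-ocr-tag": "line",
  "data-ocr-bbox": "[0,0,100,100]",
});
console.log(alto);
```

Because the element type and the geometry are read from dedicated attributes, the mapping needs no knowledge of class, id or title conventions.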