kba/hocr-spec

cuts and x_bboxes

Opened this issue · 4 comments

kba commented

Why have mechanisms for both relative and absolute positioning of code points within a word/cinfo?

Why not a bboxes attribute without the engine-specific prefix?

Related to #69

kba commented

#17 (comment)

The "cuts" attribute is for representing cuts. It exists as a compact,
pixel-accurate representation of a character segmentation. Cuts are not
bounding boxes, and, in fact, are not all that useful unless you have the
original page image available.

kba commented

#17 (comment)

Cuts are for pixel-accurate segmentation in the presence of kerning,
something bounding boxes can't represent.

def decode_cuts(s, x=0, ymax=None):
    """Decode an hOCR "cuts" string into a list of cut polylines.

    Each whitespace-separated path starts with a horizontal offset from
    the previous cut, followed by deltas that alternate between vertical
    and horizontal moves.  If ymax is given, each cut is extended down
    to that row.
    """
    cuts = []
    for path in s.split():
        turns = [int(p) for p in path.split(",")]
        x += turns[0]                 # horizontal offset of this cut
        pos = [x, 0]
        cut = [tuple(pos)]
        for i, d in enumerate(turns[1:]):
            pos[(i + 1) % 2] += d     # alternate y, x, y, x, ...
            cut.append(tuple(pos))
        if ymax is not None:          # close the cut at the bottom edge
            pos[1] = ymax
            cut.append(tuple(pos))
        cuts.append(cut)
    return cuts
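
For example (the cuts string here is made up for illustration, not taken from the spec), decoding three cut paths for a word 40 pixels tall:

print(decode_cuts("10,5,2,8 12 9,6", ymax=40))
# [[(10, 0), (10, 5), (12, 5), (12, 13), (12, 40)],
#  [(22, 0), (22, 40)],
#  [(31, 0), (31, 6), (31, 40)]]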

To convert these to tight bounding boxes, you need the original binary
image (it's another 10-20 lines to do that conversion).
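
A rough sketch of what that conversion could look like, using numpy; this is an illustration, not the original code. It assumes the cuts were decoded in word-local coordinates (decode_cuts(..., x=0, ymax=y1 - y0)), that binary is the binarized page with foreground pixels non-zero, and that bboxes use an exclusive right/bottom convention:

import numpy as np

def cut_columns(cut, height):
    # For every row 0..height-1, the x position of one cut polyline
    # as returned by decode_cuts (alternating vertical/horizontal moves).
    xs = np.empty(height, dtype=int)
    px, py = cut[0]
    for qx, qy in cut[1:]:
        if qy != py:                       # vertical segment: fill its rows
            lo, hi = sorted((py, qy))
            xs[max(lo, 0):min(hi, height)] = px
        px, py = qx, qy
    xs[min(max(py, 0), height):] = px      # extend below the last point
    return xs

def cuts_to_bboxes(binary, bbox, cuts):
    # Tight per-character bounding boxes for one word, given the binarized
    # page, the word bbox (x0, y0, x1, y1) and the decoded word-local cuts.
    x0, y0, x1, y1 = bbox
    height = y1 - y0
    word = binary[y0:y1, x0:x1]
    ys, xs = np.nonzero(word)              # foreground pixels of the word
    # boundaries: left edge of the word, each cut, right edge of the word
    bounds = [np.zeros(height, dtype=int)]
    bounds += [cut_columns(c, height) for c in cuts]
    bounds.append(np.full(height, x1 - x0, dtype=int))
    boxes = []
    for left, right in zip(bounds[:-1], bounds[1:]):
        keep = (xs >= left[ys]) & (xs < right[ys])
        if not keep.any():
            boxes.append(None)             # no ink between these two cuts
            continue
        boxes.append((x0 + int(xs[keep].min()), y0 + int(ys[keep].min()),
                      x0 + int(xs[keep].max()) + 1, y0 + int(ys[keep].max()) + 1))
    return boxes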

kba commented

@mttagessen in #17 (comment)

My point about x_cuts, x_confs, x_* still stands even if you cut it down to a single engine and to re-encoding existing output. Without access to the particular model it is still impossible to align confidences/bboxes with code points, even when you can make sure that nobody "tampered" with the file by renormalizing it to a different Unicode normalization form. The fundamental reason is that there is no mapping between Unicode code points and recognition units. Formats like AbbyyXML actually allow this alignment because they are designed bottom-up (glyph-first) rather than top-down like hOCR. I use "glyph" for the lowest level of label an engine may produce.
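
To illustrate the normalization part of that argument (example made up for this discussion):

import unicodedata

word = "b\u00e4ume"                           # 'bäume' with precomposed ä
nfc = unicodedata.normalize("NFC", word)      # 5 code points
nfd = unicodedata.normalize("NFD", word)      # 6 code points: 'a' + U+0308
print(len(nfc), len(nfd))                     # -> 5 6

# A model trained on NFC transcriptions emits 5 recognition units for this
# word, an NFD-trained one emits 6.  A flat list of per-code-point
# confidences in the hOCR file cannot be realigned with the text unless
# you know which segmentation the engine actually used.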

While per-character bounding boxes are indeed rather useless (and techniques like CTC layers produce them more or less arbitrarily, if at all), quite a few people seem keen on confidences for postprocessing.