cisocrgroup/ocrd_cis

wrap scale estimation as separate processor for DPI estimation


It would be useful to have a dedicated processor for DPI estimation in OCR-D. That's because we cannot rely on DPI metadata, although we need to. (Most Ocropy segmentation steps now zoom according to the annotated DPI value in order to forgo the 300 DPI assumption. The situation is likely similar for other modules.)

Tesseract already has such functionality, which is based on its internal line segmentation: first the average scale gets estimated, then it gets multiplied by a constant to yield the DPI. This rests on the assumption that xheight is more or less homogeneous across the page. (Which it is not!) But Tesseract's API does not export that estimate, and does not give access to the TO_BLOCK_LIST which holds the average line_size.

So it's probably best to use ocrolib.psegutils.estimate_scale for this in the same fashion.

But since we know that pages can have widely varying font sizes, we should look at scales more locally, and then find a better statistic than just the median to estimate the typical xheight of a 12pt text line.

This could be achieved as follows: in estimate_scale, we add an option to look at the np.histogram of blob sizes (square root of box areas for connected components), trying to filter out both the tiny boxes originating from noise and the huge boxes from headings and drop-caps. Then we use that in a dedicated processor ocrd-cis-ocropy-estimate-density, which multiplies the estimated scale by a configurable constant factor (defaulting to, say, 10) to yield the DPI estimate. We annotate this in PAGE-XML under PcGts/Page/@imageXResolution and PcGts/Page/@imageYResolution with PcGts/Page/@imageResolutionUnit="PPI". A future OcrdExif in core can then use that information to override the EXIF data found in the binary image.
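To make the proposal more concrete, here is a minimal sketch of the histogram idea and the PAGE-XML annotation. It is not the existing ocrolib code: the function names (estimate_scale_from_boxes, estimate_dpi, annotate_resolution), the size limits of 3 and 100 pixels, the bin count, and the choice of "median within the most populated histogram bin" as the statistic are all assumptions for illustration. The sketch takes bounding boxes of connected components as input rather than a binarized image, and it assumes the 2019-07-15 PAGE namespace.

```python
import xml.etree.ElementTree as ET

import numpy as np

# Assumed PAGE namespace version; adjust to whatever the workspace uses.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def estimate_scale_from_boxes(boxes, min_size=3.0, max_size=100.0, bins=20):
    """Estimate the dominant blob size from (width, height) bounding boxes.

    Blob size is the square root of the box area. Sizes outside
    (min_size, max_size) are dropped as noise or oversized glyphs.
    Returns None if no box survives the filter.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 2)
    sizes = np.sqrt(boxes[:, 0] * boxes[:, 1])
    sizes = sizes[(sizes > min_size) & (sizes < max_size)]
    if sizes.size == 0:
        return None
    # Histogram of blob sizes; the most populated bin is taken as body text.
    counts, edges = np.histogram(sizes, bins=bins, range=(min_size, max_size))
    peak = int(np.argmax(counts))
    in_peak = sizes[(sizes >= edges[peak]) & (sizes < edges[peak + 1])]
    # Median within the peak bin, so headings and drop-caps in other bins
    # cannot pull the estimate away from the body text size.
    return float(np.median(in_peak))


def estimate_dpi(scale, factor=10.0):
    """Convert an estimated scale to DPI via a configurable constant factor."""
    return int(round(scale * factor))


def annotate_resolution(pcgts_xml, dpi):
    """Set the resolution attributes on PcGts/Page and return the new XML."""
    root = ET.fromstring(pcgts_xml)
    page = root.find("{%s}Page" % PAGE_NS)
    if page is None:
        raise ValueError("no Page element found in the assumed PAGE namespace")
    page.set("imageXResolution", str(dpi))
    page.set("imageYResolution", str(dpi))
    page.set("imageResolutionUnit", "PPI")
    return ET.tostring(root, encoding="unicode")
```

For example, fifty 30x30 boxes mixed with a few 1x1 noise boxes, some 90x90 heading glyphs and a couple of 200x200 drop-caps give a scale of 30, hence 300 DPI with the default factor of 10. Looking at scales more locally, as suggested above, could mean running the same estimate per region and combining the results; that part is left open here.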