OCR-D/ocrd_anybaseocr

Stricter cropping

beckstefan opened this issue · 2 comments

A DFG requirement for scanning is that part of the opposite page remains visible. On some pages this is a problem, because anybaseocr-crop does not crop away the text of that opposite page, and later tools then detect text/characters where they shouldn't.

Here are two examples.

cropping_1
cropping_2

What would be a strategy to tackle this?

AFAICT this processor tries to avoid textual noise via separator line detection. There are a couple of (crappy and badly documented) parameters for this (rular...), but IMHO your best shot here would be to increase the contrast, so that the binarized image shows a distinct, contiguous vertical line where the gutter/spine is.

Besides binarization settings, there is a second workflow detail that might help: if you deskew before cropping, these lines should be easier to detect.
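
To illustrate why both points matter, here is a minimal sketch in plain Python. It is not the processor's actual code, and the names `longest_run`, `find_gutter_column` and `min_fraction` are made up for this example. It assumes a binarized page given as rows of 0/1 values (1 = foreground) and looks for a single column holding an unbroken vertical run of foreground pixels covering most of the page height.

```python
from typing import List, Optional


def longest_run(column: List[int]) -> int:
    """Length of the longest unbroken run of foreground (1) pixels."""
    best = current = 0
    for pixel in column:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


def find_gutter_column(binary: List[List[int]],
                       min_fraction: float = 0.8) -> Optional[int]:
    """Return the x position of a contiguous vertical line, or None.

    binary: rows of 0/1 pixels, 1 = foreground (hypothetical input format).
    min_fraction: share of the page height the run must cover (assumed value).
    """
    height = len(binary)
    if height == 0:
        return None
    width = len(binary[0])
    best_col, best_len = None, 0
    for x in range(width):
        run = longest_run([row[x] for row in binary])
        if run > best_len:
            best_col, best_len = x, run
    # Accept the candidate only if the line is long enough to be a spine.
    if best_len >= min_fraction * height:
        return best_col
    return None


def make_page(height: int, width: int) -> List[List[int]]:
    """Blank (all background) test page."""
    return [[0] * width for _ in range(height)]


# A clean, upright gutter line in column 2.
upright = make_page(10, 8)
for y in range(10):
    upright[y][2] = 1

# Low contrast: the same line, but broken up by binarization.
broken = make_page(10, 8)
for y in range(0, 10, 2):
    broken[y][2] = 1

# Skewed page: the line drifts one column per row.
skewed = make_page(10, 10)
for y in range(10):
    skewed[y][y] = 1
```

In this toy setup only `upright` yields a column; the line in `broken` is interrupted and the one in `skewed` is spread over many columns, so neither passes the threshold. That is the intuition behind improving contrast before binarization and deskewing before cropping.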

@beckstefan is this gone with the reimplementation of the cropper?

(If you could post or link to the originals, I could run it...)