OCR-D/ocrd_segment

Processor segment-repair ends with Exception

j-panzer opened this issue · 7 comments

The processor 'segment-repair' ends with the exception "Exception: ocrd-segment-repair exited with non-zero return value 1" if it comes after processor 'cis-ocropy-segment' in the workflow.

In a modified workflow, where processor 'cis-ocropy-segment' is replaced by processor 'tesserocr-segment-line', the processing runs.

kba commented

The root cause of the error is

shapely.errors.TopologicalError: The operation 'GEOSWithin_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x2aaae5b3f898>

This is caused by _child_within_parent not getting a valid polygon from the coordinates of the parent region. Likely, the parent region's @coords has invalid points. I'll try to reproduce.
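
To illustrate what "invalid points" can mean here, the following is a minimal, standard-library-only sketch that flags a self-intersecting outline, one common reason a polygon is rejected as invalid. It is not the code used by ocrd_segment or shapely; the helper names (`parse_points`, `polygon_problems`) are hypothetical, and the check is deliberately narrower than a full validity test (it ignores touching or overlapping collinear edges, for example). It assumes a PAGE-style point string of the form `x1,y1 x2,y2 ...`.

```python
from itertools import combinations


def parse_points(points):
    """Parse a PAGE-style string 'x1,y1 x2,y2 ...' into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    val = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (val > 0) - (val < 0)


def _segments_cross(p1, p2, q1, q2):
    """True if the two segments cross at a single point interior to both."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def polygon_problems(coords):
    """Return a list of human-readable problems found in a polygon outline."""
    if len(coords) < 3:
        return ["fewer than 3 points"]
    problems = []
    if any(x < 0 or y < 0 for x, y in coords):
        problems.append("negative coordinate")
    n = len(coords)
    # Close the ring: the last point connects back to the first.
    edges = [(coords[i], coords[(i + 1) % n]) for i in range(n)]
    for (i, e), (j, f) in combinations(enumerate(edges), 2):
        # Edges that only share an endpoint never count as a proper crossing.
        if _segments_cross(*e, *f):
            problems.append(f"edges {i} and {j} cross")
    return problems


# A plain rectangle versus a "bowtie" whose outline crosses itself.
square = parse_points("0,0 10,0 10,10 0,10")
bowtie = parse_points("0,0 10,10 10,0 0,10")
print(polygon_problems(square))
print(polygon_problems(bowtie))
```

An outline like the bowtie is the kind of geometry for which a predicate such as "within" may fail instead of returning an answer.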

kba commented

input data: https://owncloud.gwdg.de/index.php/s/k96zk4XILHi3let

This was the workflow:

time ocrd process\
    "olena-binarize -I PRESENTATION -O OCR-D-BIN -P impl sauvola"\
    "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"\
    "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim"\
    "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page"\
    "cis-ocropy-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P level-of-operation page"\
    "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG"\
    "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true"\
    "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region"\
    "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region"\
    "cis-ocropy-segment -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE -P level-of-operation region"\
    "segment-repair -I OCR-D-SEG-LINE -O OCR-D-SEG-REPAIR-LINE -P sanitize true"\
    "cis-ocropy-dewarp -I OCR-D-SEG-REPAIR-LINE -O OCR-D-SEG-LINE-RESEG-DEWARP"\
    "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /usr/users/jpanzer/ocrd/models/calamari-models/\*.ckpt.json"

Thanks everyone for the detailed report! @kba, we are looking for a TextLine produced in the else branch at the bottom of ocrd_cis.ocropy.segment._process_element. This should be detectable via page validation, yes. (The polygons themselves are produced in ocrd_cis.ocropy.segment.masks2polygons and polygon_for_parent.)

EDIT: And we know that we are not looking for a bad TextRegion, because the error goes away when replacing the ocrd_cis line segmentation with the one from ocrd_tesserocr.
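
For narrowing down which TextLine is the culprit, a rough sketch along the following lines could scan a PAGE file by hand. It is only a coarse stand-in for real page validation: the function name `suspicious_lines` and the sample document are made up for illustration, it uses only Python's standard library (3.8+ for the `{*}` namespace wildcard), and it checks just two things, namely whether a line has fewer than 3 points and whether any point lies outside the bounding box of its parent region.

```python
import xml.etree.ElementTree as ET


def parse_points(points):
    """Parse a PAGE-style string 'x1,y1 x2,y2 ...' into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def suspicious_lines(page_xml):
    """Return (line id, reason) pairs for TextLines that look implausible."""
    root = ET.fromstring(page_xml)
    findings = []
    # '{*}' matches the element name in any namespace, so the PAGE version does not matter.
    for region in root.findall(".//{*}TextRegion"):
        xs, ys = zip(*parse_points(region.find("{*}Coords").get("points")))
        for line in region.findall("{*}TextLine"):
            pts = parse_points(line.find("{*}Coords").get("points"))
            if len(pts) < 3:
                findings.append((line.get("id"), "fewer than 3 points"))
            elif any(not (min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys))
                     for x, y in pts):
                findings.append((line.get("id"), "outside parent bounding box"))
    return findings


# Hypothetical sample: l1 is fine, l2 sticks out of the region, l3 is degenerate.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1"><Coords points="10,10 90,10 90,20 10,20"/></TextLine>
      <TextLine id="l2"><Coords points="10,30 120,30 120,40 10,40"/></TextLine>
      <TextLine id="l3"><Coords points="10,45 90,45"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""
print(suspicious_lines(SAMPLE))
```

The bounding-box test is only a necessary condition for a line lying within its region, so a clean result here does not rule out an invalid polygon.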

Note: repair has long since included a mechanism for PAGE input validation and automatic fixing – I have not tested this again, but I'm pretty sure it has been solved.