Processor segment-repair ends with Exception
j-panzer opened this issue · 7 comments
The processor 'segment-repair' ends with the exception "Exception: ocrd-segment-repair exited with non-zero return value 1" if it comes after the processor 'cis-ocropy-segment' in the workflow.
In a modified workflow, where processor 'cis-ocropy-segment' is replaced by processor 'tesserocr-segment-line', the processing runs.
The root cause of the error is:
shapely.errors.TopologicalError: The operation 'GEOSWithin_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x2aaae5b3f898>
This is caused by _child_within_parent not getting a valid polygon from the coordinates of the parent region. Likely, the parent region's @coords has invalid points. I'll try to reproduce.
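To illustrate what "invalid points" can mean here, the following is a minimal pure-Python sketch, not the check that shapely/GEOS or segment-repair actually performs. It assumes the PAGE convention of a points string of the form "x1,y1 x2,y2 ..." and flags two common defects: fewer than three distinct points, and edges that properly cross each other (a "bowtie" outline). The function names `parse_points` and `looks_valid` are hypothetical, and the test deliberately ignores subtler cases such as edges that only touch or overlap collinearly. With shapely installed, the polygon's `is_valid` property is the authoritative test.

```python
from itertools import combinations


def parse_points(points):
    """Parse a PAGE-style points string 'x1,y1 x2,y2 ...' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    val = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (val > 0) - (val < 0)


def _properly_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 cross at a point interior to both."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def looks_valid(points):
    """Rough validity check (an illustration only, see the assumptions above)."""
    pts = parse_points(points)
    if len(set(pts)) < 3:
        return False  # not enough distinct points to span an area
    n = len(pts)
    # Closed ring: the last point connects back to the first.
    edges = [(pts[i], pts[(i + 1) % n]) for i in range(n)]
    # Adjacent edges share an endpoint, which never counts as a proper crossing.
    return not any(_properly_cross(*edges[i], *edges[j])
                   for i, j in combinations(range(n), 2))


square = "0,0 2,0 2,2 0,2"   # simple outline
bowtie = "0,0 2,2 2,0 0,2"   # first and third edges cross at (1, 1)
print(looks_valid(square), looks_valid(bowtie))
```

A self-intersecting outline such as the bowtie is the kind of geometry for which a predicate like `within` cannot be evaluated reliably.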
input data: https://owncloud.gwdg.de/index.php/s/k96zk4XILHi3let
This was the workflow:
time ocrd process \
  "olena-binarize -I PRESENTATION -O OCR-D-BIN -P impl sauvola" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "cis-ocropy-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P level-of-operation page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
  "cis-ocropy-segment -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE -P level-of-operation region" \
  "segment-repair -I OCR-D-SEG-LINE -O OCR-D-SEG-REPAIR-LINE -P sanitize true" \
  "cis-ocropy-dewarp -I OCR-D-SEG-REPAIR-LINE -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /usr/users/jpanzer/ocrd/models/calamari-models/\*.ckpt.json"
Thanks everyone for the detailed report! @kba we are looking for a TextLine from the else part at the bottom of ocrd_cis.ocropy.segment._process_element. This should be detectable via page validation, yes. (The polygons themselves are produced in ocrd_cis.ocropy.segment.masks2polygons and polygon_for_parent.)
EDIT: And we know that we are not looking for a bad TextRegion, because the error goes away when replacing the ocrd_cis line segmentation with the one from ocrd_tesserocr.
Note: segment-repair has long since included a mechanism for PAGE input validation and automatic fixing – I have not tested this again, but I'm pretty sure this issue has been solved.