TopologicalError: GEOSIntersection_r could not be performed
Closed this issue · 2 comments
Environment
- Version: included in Docker image ocrd/all from 2020-08-04 (Docker image ID: 158ea3d64eae)
Current Behavior:
When executing something like:
docker run --rm -u "40366" -w /data -v "/home/aqayv/project/ulb-it-migration/WORKSPACE_OCR/203074":/data -v /usr/share/tesseract-ocr/4.00/tessdata:/usr/local/share/tessdata/ ocrd/all:2020-08-04 ocrd-make -f ulb-ocrd-vd18-02.mk .
the run fails with the following output:
make: Entering directory '/data'
make -R -C . -I /data/ -f /data/ulb-ocrd-vd18-02.mk 2>&1 | tee ..ulb-ocrd-vd18-02.log
make[1]: Entering directory '/data'
building OCR-D-SEGMENT-OCROPY from OCR-D-CLIP with pattern rule for ocrd-cis-ocropy-segment
STAMP=`test -e OCR-D-SEGMENT-OCROPY && date -Ins -r OCR-D-SEGMENT-OCROPY`; ocrd-cis-ocropy-segment -I OCR-D-CLIP -p OCR-D-SEGMENT-OCROPY.json -O OCR-D-SEGMENT-OCROPY --overwrite 2>&1 | tee OCR-D-SEGMENT-OCROPY.log && touch -c OCR-D-SEGMENT-OCROPY || { if test -z "$STAMP"; then rm -fr OCR-D-SEGMENT-OCROPY; else touch -c -d "$STAMP" OCR-D-SEGMENT-OCROPY; fi; false; }
05:42:29.063 WARNING matplotlib - Matplotlib created a temporary config/cache directory at /.config/matplotlib because the default path (/tmp/matplotlib-ib2pg3_l) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
05:42:39.158 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 238 1073 at 238 1073
Traceback (most recent call last):
  File "/usr/bin/ocrd-cis-ocropy-segment", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_segment())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 54, in ocrd_cis_ocropy_segment
    return ocrd_cli_wrap_processor(OcropySegment, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/processor/base.py", line 61, in run_processor
    processor.process()
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 387, in process
    region.id, file_id + '_' + region.id, zoom)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 653, in _process_element
    line_polygon = polygon_for_parent(line_polygon, element)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 676, in polygon_for_parent
    interp = childp.intersection(parentp)
  File "/usr/lib/python3.6/site-packages/shapely/geometry/base.py", line 649, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/usr/lib/python3.6/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/usr/lib/python3.6/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7fca99544160>
Makefile:320: recipe for target 'OCR-D-SEGMENT-OCROPY' failed
make[1]: *** [OCR-D-SEGMENT-OCROPY] Error 1
make[1]: Leaving directory '/data'
make: *** [.] Error 2
Makefile:205: recipe for target '.' failed
make: Leaving directory '/data'
Expected Behavior:
Please do not crash; log an error and move on gracefully.
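As a rough illustration of the requested behavior (not the actual ocrd_cis code), a consuming step could wrap the failing `intersection` call, log the problem, and return a marker that lets the caller skip the element. The helper name `intersect_or_none`, the logger name, and the `BrokenGeometry` stand-in are all hypothetical; with Shapely one would presumably pass the exception type seen in the traceback, `shapely.errors.TopologicalError`, as `errors`.

```python
import logging

LOG = logging.getLogger("segment_sketch")  # hypothetical logger name


def intersect_or_none(child, parent, errors=(Exception,)):
    """Return child.intersection(parent), or None if the operation raises.

    `errors` is the tuple of exception types treated as recoverable.
    """
    try:
        return child.intersection(parent)
    except errors as exc:
        # Log and signal "skip this element" instead of aborting the whole run.
        LOG.error("cannot clip element to its parent, skipping: %s", exc)
        return None


class BrokenGeometry:
    """Stand-in for an invalid polygon; its intersection always fails."""

    def intersection(self, other):
        raise ValueError("GEOSIntersection_r could not be performed")


# The failing call is logged and turned into None rather than a crash.
result = intersect_or_none(BrokenGeometry(), BrokenGeometry(), errors=(ValueError,))
```

The caller still has to decide what "move on" means for a `None` result, for example keeping the unclipped polygon or dropping the line.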
Thanks @M3ssman for the full report!
Looks similar to #62 and OCR-D/ocrd_tesserocr#149. I'd very much like to hunt this down, but the problem lies with the producers of invalid coordinates; we cannot make each and every consuming processor robust to that kind of error.
Looking into your workflow and PAGE results, there's a self-intersection in TextRegion region0010 with points 238,1073 240,1935 1929,1931 1927,932 1719,932 1719,909 238,913 238,936 238,1074 238,1073 (see the last 2 points). That region was introduced by ocrd-segment-repair (when reducing overlaps from bbox to polygon). I'll try to transfer the issue there and see what I can do.
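To make the problem with the last 2 points concrete, here is a small dependency-free sketch that flags vertices where a ring doubles back along the edge it just came from. It is a much narrower test than full polygon validity and is neither what GEOS does internally nor the fix applied in ocrd-segment-repair; the function name `find_spikes` is made up for illustration.

```python
def find_spikes(ring):
    """Return vertices where a closed ring reverses along its own edge."""
    # Drop the repeated closing point so every vertex is visited once.
    points = ring[:-1] if ring[0] == ring[-1] else list(ring)
    spikes = []
    n = len(points)
    for i in range(n):
        ax, ay = points[i - 1]
        bx, by = points[i]
        cx, cy = points[(i + 1) % n]
        ux, uy = bx - ax, by - ay  # incoming edge
        vx, vy = cx - bx, cy - by  # outgoing edge
        collinear = ux * vy - uy * vx == 0
        reversing = ux * vx + uy * vy < 0
        if collinear and reversing:
            spikes.append(points[i])
    return spikes


# Coordinates of TextRegion region0010 as quoted above.
region0010 = [(238, 1073), (240, 1935), (1929, 1931), (1927, 932), (1719, 932),
              (1719, 909), (238, 913), (238, 936), (238, 1074), (238, 1073)]
print(find_spikes(region0010))  # prints [(238, 1074)]
```

The ring runs up the x=238 edge to y=1074 and then steps back to y=1073, so the final edge overlaps the previous one. With Shapely available, `Polygon(region0010).is_valid` is the general-purpose check and should report this ring as invalid, in line with the "Self-intersection at or near point 238 1073" message in the log.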
@M3ssman I could run your workflow to completion with OCR-D/ocrd_segment#43. Can you please try this with a full document?