cisocrgroup/ocrd_cis

TopologicalError: GEOSIntersection_r could not be performed

Closed this issue · 2 comments

Environment

  • Version: included in Docker Image ocrd/all from 2020-08-04 (docker image id: 158ea3d64eae)

Current Behavior:

When executing something like: docker run --rm -u "40366" -w /data -v "/home/aqayv/project/ulb-it-migration/WORKSPACE_OCR/203074":/data -v /usr/share/tesseract-ocr/4.00/tessdata:/usr/local/share/tessdata/ ocrd/all:2020-08-04 ocrd-make -f ulb-ocrd-vd18-02.mk .:

make: Entering directory '/data'
make -R -C . -I /data/ -f /data/ulb-ocrd-vd18-02.mk  2>&1 | tee ..ulb-ocrd-vd18-02.log
make[1]: Entering directory '/data'
building OCR-D-SEGMENT-OCROPY from OCR-D-CLIP with pattern rule for ocrd-cis-ocropy-segment
STAMP=`test -e OCR-D-SEGMENT-OCROPY && date -Ins -r OCR-D-SEGMENT-OCROPY`; ocrd-cis-ocropy-segment   -I OCR-D-CLIP -p OCR-D-SEGMENT-OCROPY.json -O OCR-D-SEGMENT-OCROPY --overwrite 2>&1 | tee OCR-D-SEGMENT-OCROPY.log && touch -c OCR-D-SEGMENT-OCROPY || { if test -z "$STAMP"; then rm -fr OCR-D-SEGMENT-OCROPY; else touch -c -d "$STAMP" OCR-D-SEGMENT-OCROPY; fi; false; }
05:42:29.063 WARNING matplotlib - Matplotlib created a temporary config/cache directory at /.config/matplotlib because the default path (/tmp/matplotlib-ib2pg3_l) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
05:42:39.158 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 238 1073 at 238 1073
Traceback (most recent call last):
  File "/usr/bin/ocrd-cis-ocropy-segment", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_segment())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 54, in ocrd_cis_ocropy_segment
    return ocrd_cli_wrap_processor(OcropySegment, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/processor/base.py", line 61, in run_processor
    processor.process()
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 387, in process
    region.id, file_id + '_' + region.id, zoom)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 653, in _process_element
    line_polygon = polygon_for_parent(line_polygon, element)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 676, in polygon_for_parent
    interp = childp.intersection(parentp)
  File "/usr/lib/python3.6/site-packages/shapely/geometry/base.py", line 649, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/usr/lib/python3.6/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/usr/lib/python3.6/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7fca99544160>
Makefile:320: recipe for target 'OCR-D-SEGMENT-OCROPY' failed
make[1]: *** [OCR-D-SEGMENT-OCROPY] Error 1
make[1]: Leaving directory '/data'
make: *** [.] Error 2
Makefile:205: recipe for target '.' failed
make: Leaving directory '/data'

Expected Behavior:

Please do not crash; instead, log an error and move on gracefully.
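
A minimal sketch of the requested behavior, assuming Shapely is installed: check both polygons for validity before intersecting, log an error, and skip the element instead of raising. The helper name `safe_intersection` and the logger name are hypothetical and are not part of ocrd_cis.

```python
import logging

from shapely.geometry import Polygon
from shapely.validation import explain_validity

LOG = logging.getLogger("example.segment")  # hypothetical logger name


def safe_intersection(child, parent):
    """Return the intersection of child and parent, or None if either is invalid.

    Invalid inputs are reported via the logger instead of raising.
    """
    for name, geom in (("child", child), ("parent", parent)):
        if not geom.is_valid:
            LOG.error("skipping invalid %s polygon: %s",
                      name, explain_validity(geom))
            return None
    return child.intersection(parent)


# Illustrative inputs (not taken from the attached workspace).
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])   # self-intersecting
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])   # valid
```

A caller would treat a `None` result as "skip this line or region and continue with the next one".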

2020-09-10-bug-203074.zip

Thanks @M3ssman for the full report!

Looks similar to #62 and OCR-D/ocrd_tesserocr#149. I'd very much like to hunt this down, but the problem lies with the producers of invalid coordinates; we cannot make each and every consuming processor robust to that kind of error.

Looking into your workflow and PAGE results, there's a self-intersection in TextRegion region0010 with 238,1073 240,1935 1929,1931 1927,932 1719,932 1719,909 238,913 238,936 238,1074 238,1073 (see the last two points). That region was introduced by ocrd-segment-repair (when reducing overlaps from bbox to polygon). I'll try to transfer the issue there and see what I can do.
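
The self-intersection can be checked directly with Shapely using the coordinates quoted above. This is only a sketch: the `buffer(0)` call at the end is a common Shapely repair idiom shown as an assumption, not necessarily what the fix in ocrd_segment does.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity

# Coordinates of TextRegion region0010, copied from the comment above.
points = ("238,1073 240,1935 1929,1931 1927,932 1719,932 1719,909 "
          "238,913 238,936 238,1074 238,1073")
coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]

region = Polygon(coords)
print(region.is_valid)           # validity of the polygon as stored in PAGE
print(explain_validity(region))  # human-readable reason reported by GEOS

# Assumption: buffer(0) as a generic repair attempt (not the confirmed fix).
repaired = region.buffer(0)
print(repaired.is_valid)
```

The last segment, from 238,1074 back to 238,1073, retraces part of the preceding vertical edge, which is what makes the ring non-simple.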

@M3ssman I could run your workflow to completion with OCR-D/ocrd_segment#43. Can you please try this with a full document?