cisocrgroup/ocrd_cis

Segment crashes

rue-a opened this issue · 3 comments

rue-a commented
ocrd_cis/ocropy/common.py:643: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  sepslices = np.array(sepslices)
15:55:07.138 INFO processor.OcropySegment - Found 170 text lines for page "SBB-CROP_Ansiedlung_Korotschin_UZS_Sign_22a_0003"
15:56:49.378 INFO processor.OcropySegment - Found 84 text regions for page "SBB-CROP_Ansiedlung_Korotschin_UZS_Sign_22a_0003"
15:56:55.435 WARNING processor.OcropySegment - Label 1 contour 1 is too small (157/4808) in region "SBB-CROP_Ansiedlung_Korotschin_UZS_Sign_22a_0003"
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv/bin/ocrd-cis-ocropy-segment", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_segment())
  File "click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "ocrd_cis/ocropy/cli.py", line 53, in ocrd_cis_ocropy_segment
    return ocrd_cli_wrap_processor(OcropySegment, *args, **kwargs)
  File "ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "ocrd/processor/helpers.py", line 107, in run_processor
    processor.process()
  File "ocrd_cis/ocropy/segment.py", line 406, in process
    input_file.pageId, zoom, rogroup=rogroup)
  File "ocrd_cis/ocropy/segment.py", line 680, in _process_element
    min_area=640/zoom/zoom)
  File "ocrd_cis/ocropy/segment.py", line 232, in masks2polygons
    for baseline in baselines], name)
  File "ocrd_cis/ocropy/segment.py", line 232, in <listcomp>
    for baseline in baselines], name)
  File "shapely/geometry/base.py", line 582, in intersection
    return shapely.intersection(self, other, grid_size=grid_size)
  File "shapely/decorators.py", line 77, in wrapped
    return func(*args, **kwargs)
  File "shapely/set_operations.py", line 133, in intersection
    return lib.intersection(a, b, **kwargs)
FloatingPointError: invalid value encountered in intersection

MehmedGIT commented

I am facing a similar issue:

WARNING:processor.OcropyResegment:baseline part crosses existing x in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
WARNING:processor.OcropyResegment:baseline part crosses existing x in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
WARNING:processor.OcropyResegment:baseline part crosses existing x in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
WARNING:processor.OcropyResegment:baseline part crosses existing x in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
WARNING:processor.OcropyResegment:baseline part crosses existing x in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
WARNING:processor.OcropyResegment:baseline part crosses existing x in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py:852: ShapelyDeprecationWarning: The 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead.
  baseline.type in ['Point', 'MultiPoint']):
/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py:859: ShapelyDeprecationWarning: The 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead.
  if (baseline.type == 'GeometryCollection' or
/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py:860: ShapelyDeprecationWarning: The 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead.
  baseline.type.startswith('Multi')):
/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py:852: ShapelyDeprecationWarning: The 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead.
  baseline.type in ['Point', 'MultiPoint']):
/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py:859: ShapelyDeprecationWarning: The 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead.
  if (baseline.type == 'GeometryCollection' or
/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py:860: ShapelyDeprecationWarning: The 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead.
  baseline.type.startswith('Multi')):
WARNING:processor.OcropySegment:Label 204 contour 10 is too small (133/2097) in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
WARNING:processor.OcropySegment:Label 204 contour 9 is too small (193/2097) in region "FILE_0025_OCR-D-BIN-DENOISE-DESKEW"
12:03:54.743 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-cis-ocropy-segment'
Traceback (most recent call last):
  File "/home/mm/Desktop/core/ocrd/ocrd/processor/helpers.py", line 129, in run_processor
    processor.process()
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py", line 322, in process
    input_file.pageId, zoom, rogroup=rogroup)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py", line 596, in _process_element
    min_area=640/zoom/zoom)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py", line 148, in masks2polygons
    for baseline in baselines], name)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py", line 148, in <listcomp>
    for baseline in baselines], name)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/shapely/geometry/base.py", line 582, in intersection
    return shapely.intersection(self, other, grid_size=grid_size)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/shapely/decorators.py", line 77, in wrapped
    return func(*args, **kwargs)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/shapely/set_operations.py", line 133, in intersection
    return lib.intersection(a, b, **kwargs)
shapely.errors.GEOSException: TopologyException: Input geom 1 is invalid: Ring Self-intersection at or near point 657 659 at 657 659

for the following image (FILE_0025_DEFAULT.jpg of mets):
[image: FILE_0025_DEFAULT]

in a workflow having the following steps:

cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN
anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP
skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li
skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page
tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page
cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page
cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP
calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0

stweil commented

cis_ocropy_segment also crashes in the QuiVer benchmark tests, see OCR-D/quiver-benchmarks#22:

[...]
Launching `/app/workflows/workspaces/ballenstedt_delatio_1777_selected_pages_ocr/data/ballenstedt_delatio_1777/selected_pages_ocr.txt.nf` [stoic_turing] DSL2 - revision: 8ad3dbf42c
[...]
executor >  local (6)
[88/c15647] process > ocrd_cis_ocropy_binarize_0 [100%] 1 of 1 ✔
[a7/023237] process > ocrd_tesserocr_crop_1      [100%] 1 of 1 ✔
[e4/720726] process > ocrd_skimage_binarize_2    [100%] 1 of 1 ✔
[86/34c9af] process > ocrd_skimage_denoise_3     [100%] 1 of 1 ✔
[05/44d14c] process > ocrd_tesserocr_deskew_4    [100%] 1 of 1 ✔
[92/b8e041] process > ocrd_cis_ocropy_segment_5  [  0%] 0 of 1
[-        ] process > ocrd_cis_ocropy_dewarp_6   -
[-        ] process > ocrd_calamari_recognize_7  -
ERROR ~ Error executing process > 'ocrd_cis_ocropy_segment_5'

Caused by:
  Process `ocrd_cis_ocropy_segment_5` terminated with an error exit status (1)

Command executed:

  ocrd-cis-ocropy-segment -m mets.xml -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -p '{"level-of-operation": "page"}'

Command exit status:
  1

Command output:
  (empty)

Command error:
  21:31:06.567 INFO processor.OcropySegment - Found 5 separators for page "OCR-D-BIN-DENOISE-DESKEW_00005"
  21:31:06.674 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-cis-ocropy-segment'
  Traceback (most recent call last):
    File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
      processor.process()
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 404, in process
      self._process_element(page, ignore, page_image, page_coords,
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 750, in _process_element
      sep_polygons, _ = masks2polygons(seplines, None, element_bin,
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 139, in masks2polygons
      hole_idx = np.argmin([cv2.pointPolygonTest(contour, tuple(pt[0]), True)
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 139, in <listcomp>
      hole_idx = np.argmin([cv2.pointPolygonTest(contour, tuple(pt[0]), True)
  cv2.error: OpenCV(4.7.0) :-1: error: (-5:Bad argument) in function 'pointPolygonTest'
  > Overload resolution failed:
  >  - Can't parse 'pt'. Sequence item with index 0 has a wrong type
  >  - Can't parse 'pt'. Sequence item with index 0 has a wrong type
[...]

I'm pretty sure the OP's problem happened on an outdated version (so the original problem has been fixed).
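
For anyone stuck on an older install: both the FloatingPointError and the TopologyException tracebacks above end in the same polygon/baseline intersection inside masks2polygons, and the second one names an invalid input ring. The following is only a minimal sketch of the general technique (repair the polygon before intersecting), not the code that is in ocrd_cis. `safe_intersection` is a hypothetical helper name, and the bowtie polygon is made-up test data.

```python
from shapely.geometry import LineString, Polygon
from shapely.validation import make_valid  # available in Shapely >= 1.8


def safe_intersection(polygon, baseline):
    """Intersect a polygon with a baseline, repairing the polygon first.

    Hypothetical helper for illustration; not part of ocrd_cis.
    """
    if not polygon.is_valid:
        # make_valid may return a MultiPolygon or GeometryCollection;
        # intersection() works on those as well.
        polygon = make_valid(polygon)
    return polygon.intersection(baseline)


# A "bowtie": the ring crosses itself at (1, 1), so the polygon is invalid.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
baseline = LineString([(-1, 0.5), (3, 0.5)])

part = safe_intersection(bowtie, baseline)
print(bowtie.is_valid, part.geom_type, part.length)
```

Whether repairing or simply skipping such a contour is the better choice depends on how the result is used downstream; the sketch only shows that the intersection itself no longer has to operate on invalid input.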

Regarding @MehmedGIT's description, thanks for the detailed report. This likewise does not look like the version we have been using in ocrd_all (from the fix-alpha-shape branch, last changed in August). Also, in my case the workflow runs through. Here's the result for that page (OCR-D-OCR):

[image: page0025-segmentation]

– pretty bad indeed, but not crashing. (Ocropy cannot cope with empty pages, because it relies on connected-component statistics, which in this case will be just noise from the binarization, no actual glyphs.)
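
To illustrate what "connected-component statistics" means here, below is a rough sketch, not Ocropy's actual implementation. It labels the foreground components of a binarized page and takes the median bounding-box size of those that look glyph-sized. The function name `component_stats` and the size bounds 3 and 100 are assumptions for illustration. On a page that contains only binarization noise there is nothing plausible left to measure.

```python
import numpy as np
from scipy import ndimage


def component_stats(binary, min_size=3, max_size=100):
    """Return (count, median_size) of plausibly glyph-sized components.

    `binary` is a 2D boolean array with foreground (ink) as True.
    The size of a component is sqrt(width * height) of its bounding box.
    Hypothetical helper; the bounds are illustrative assumptions.
    """
    labels, _ = ndimage.label(binary)
    sizes = np.array([
        np.sqrt((sl[0].stop - sl[0].start) * (sl[1].stop - sl[1].start))
        for sl in ndimage.find_objects(labels)
    ])
    plausible = sizes[(sizes > min_size) & (sizes < max_size)]
    if plausible.size == 0:
        return 0, None
    return int(plausible.size), float(np.median(plausible))


# An "empty" page: only isolated single-pixel specks.
noise_page = np.zeros((200, 200), dtype=bool)
noise_page[10, 10] = True
noise_page[50, 120] = True

# The same page with two glyph-like blobs (10 rows x 8 columns each).
text_page = noise_page.copy()
text_page[100:110, 20:28] = True
text_page[100:110, 40:48] = True

print(component_stats(noise_page))
print(component_stats(text_page))
```

A caller could use the zero count as a cue to skip segmentation for that page instead of deriving line and region sizes from noise.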

@stweil your version is definitely outdated; I remember having fixed that long ago.