cisocrgroup/ocrd_cis

resegment: running for 155 minutes(?)...

and still running.

Workflow:

. /usr/local/ocrd_all/venv/bin/activate
export TMPDIR=/dwork/tmp
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
ocrd-create-mets.xml
( /usr/bin/time ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page" \
"pc-segmentation -I OCR-D-N5 -O OCR-D-N6" \
"cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region" \
"tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8" \
"cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9" \
"cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10" \
"calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json"

) >cmd.log 2>&1
ps axf
ls       66073  0.0  0.0   4384   744 pts/0    S    14:40   0:00                                  |   \_ /usr/bin/time ocrd process olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json
ls       66074  0.0  0.0 2423620 68968 pts/0   S    14:40   0:05                                  |       \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd process olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json
ls        2747  116  0.3 11505348 519324 pts/0 Rl   16:44 160:53                                  |           \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd-cis-ocropy-resegment --working-dir /_digi8+9/digitalisate8/ocr-d/testset/x,pc-segmentation,tesserocr-segment-line,calamari-frak19th --mets mets.xml --input-file-grp OCR-D-N8 --output-file-grp OCR-D-N9 --parameter {"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}

@bertsky: same image set as in last email.

PS: no cis-ocropy-clip for obvious reasons :-)

Finally went through; took hours.

Since this only occurs in combination with pc-segmentation and pc-segmentation seems to be currently the weakest segmentation method, I'll close this case.

I would really like to debug this, but unfortunately I have not been able to run ocrd-pc-segmentation in the past. So could you please provide me with the last input file? I.e. fileGrp OCR-D-N8 file with pageId OCR-D-N8_00062 – only the PAGE-XML (since you gave me the images already)...

Before we close, we should make sure this is not a bug on the ocrd_cis side. Could you please run ocrd workspace validate, esp. on OCR-D-N8?

I've let it run again...

Note: the complete workflow took longer than with sbb_textline; resegment alone took about 3:30 h of wall-clock time.

I don't know exactly which page drives the resegment execution time. Perhaps it is a consequence of very poor input to resegment. Let's wait and see whether someone else complains about this in combination with sbb_textline or similar.

21:19:33.979 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:24:12.068 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:24:12.073 INFO ocrd.task_sequence.run_tasks - Start processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 -p 
'{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 
0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, 
"operation_level": "page"}''
21:28:15.482 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 -p 
'{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 
0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, 
"operation_level": "page"}''
21:28:15.497 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:33:18.981 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3 -p 
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:33:18.989 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -p 
'{"level-of-operation": "page", "noise_maxsize": 3.0, "dpi": 0}''
21:34:40.901 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 
-p '{"level-of-operation": "page", "noise_maxsize": 3.0, "dpi": 0}''
21:34:40.910 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p 
'{"level-of-operation": "page", "maxskew": 5.0}''
21:48:11.411 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p 
'{"level-of-operation": "page", "maxskew": 5.0}''
21:48:11.421 INFO ocrd.task_sequence.run_tasks - Start processing task 'pc-segmentation -I OCR-D-N5 -O OCR-D-N6 -p 
'{"overwrite_regions": true, "xheight": 8, "model": "__DEFAULT__", "gpu_allow_growth": false, "resize_height": 300}''
21:55:00.789 INFO ocrd.task_sequence.run_tasks - Finished processing task 'pc-segmentation -I OCR-D-N5 -O OCR-D-N6 -p 
'{"overwrite_regions": true, "xheight": 8, "model": "__DEFAULT__", "gpu_allow_growth": false, "resize_height": 300}''
21:55:00.816 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -p 
'{"level-of-operation": "region", "maxskew": 5.0}''
22:08:02.059 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -p 
'{"level-of-operation": "region", "maxskew": 5.0}''
22:08:02.073 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 
-p '{"dpi": -1, "overwrite_lines": true}''
22:09:49.340 INFO ocrd.task_sequence.run_tasks - Finished processing task 'tesserocr-segment-line -I OCR-D-N7 -O 
OCR-D-N8 -p '{"dpi": -1, "overwrite_lines": true}''
22:09:49.356 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 -p 
'{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
01:39:31.500 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 
-p '{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
01:39:31.533 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 -p 
'{"dpi": 0, "range": 4.0, "max_neighbour": 0.05}''
01:58:02.010 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 
-p '{"dpi": 0, "range": 4.0, "max_neighbour": 0.05}''
01:58:02.061 INFO ocrd.task_sequence.run_tasks - Start processing task 'calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -p 
'{"checkpoint": "/usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json", "voter": 
"confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
03:01:22.186 INFO ocrd.task_sequence.run_tasks - Finished processing task 'calamari-recognize -I OCR-D-N10 -O OCR-D-OCR 
-p '{"checkpoint": "/usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json", "voter": 
"confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
03:01:22.268 INFO ocrd.cli.process - Finished
32186.10user 9665.10system 5:42:03elapsed 203%CPU (0avgtext+0avgdata 12763172maxresident)k
6303240inputs+45550456outputs (12729major+450089605minor)pagefaults 0swaps
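
To read per-task wall-clock times off a log like the one above without subtracting timestamps by hand, a small script can pair the Start and Finished lines. This is only a sketch: the function name `task_durations` and the sample variable are made up for illustration, it assumes the log format shown here (time of day only, no date), and it assumes no single task runs longer than 24 hours, so a negative difference means the run crossed midnight.

```python
import re
from datetime import datetime, timedelta

# Matches the first (unwrapped) part of a Start/Finished line; wrapped
# continuation lines carrying the JSON parameters are skipped.
LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d+) INFO ocrd\.task_sequence\.run_tasks - "
    r"(Start|Finished) processing task '(\S+) -I (\S+)"
)

def task_durations(log_lines):
    """Return [(processor, input_file_grp, elapsed)] in order of completion."""
    started = {}
    durations = []
    for line in log_lines:
        match = LINE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        event, key = match.group(2), (match.group(3), match.group(4))
        if event == "Start":
            started[key] = stamp
        elif key in started:
            elapsed = stamp - started.pop(key)
            if elapsed < timedelta(0):
                elapsed += timedelta(days=1)  # assumption: crossed midnight once
            durations.append((key[0], key[1], elapsed))
    return durations

# The resegment lines copied from the log above.
SAMPLE_LOG = """\
22:09:49.356 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 -p
'{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
01:39:31.500 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9
-p '{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
""".splitlines()

for processor, file_grp, elapsed in task_durations(SAMPLE_LOG):
    print(processor, file_grp, elapsed)  # cis-ocropy-resegment OCR-D-N8 3:29:42.144000
```

Keying on processor name plus input fileGrp keeps the two cis-ocropy-deskew steps (page and region level) apart.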

I was able to run ocrd-pc-segmentation now. I can reproduce the extremely long runtime of resegment afterwards.

From what I see, this is somewhat related to bad segmentation quality (undetected multi-column layouts). ocrd-pc-segmentation does produce invalid PAGE (negative coordinates etc).
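
For anyone who wants to spot such files quickly, here is a rough sketch of a check for negative coordinates. It relies only on the fact that PAGE-XML stores outlines in `Coords` elements with a `points="x,y x,y ..."` attribute. The function name `negative_coords` and the sample document are invented for illustration, and this is not a substitute for `ocrd workspace validate`.

```python
import xml.etree.ElementTree as ET

def negative_coords(page_xml):
    """Return [(parent id, points)] for every Coords with a negative x or y."""
    found = []
    root = ET.fromstring(page_xml)
    for parent in root.iter():
        for child in parent:
            # Compare the local name only, so any PAGE namespace version works.
            if child.tag.rsplit("}", 1)[-1] != "Coords":
                continue
            points = child.get("points", "")
            pairs = [tuple(float(v) for v in p.split(",")) for p in points.split()]
            if any(x < 0 or y < 0 for x, y in pairs):
                found.append((parent.get("id"), points))
    return found

# Hypothetical minimal page: one valid region containing one invalid line.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="x.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="0,0 50,0 50,20 0,20"/>
      <TextLine id="r1_l1">
        <Coords points="-3,2 48,2 48,18 -3,18"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(negative_coords(SAMPLE_PAGE))
```

Run against the PAGE-XML files of a fileGrp such as OCR-D-N6, this would list the segments whose outlines leave the image at the top or left; coordinates beyond the right or bottom edge would need the image size as well.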

But this also exposes a weakness in the resegmentation algorithm: if input regions are quite large, then the new line segmentation plus pair-wise comparison with existing lines and majority vote is inefficient.
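
To make the cost argument concrete, here is a toy sketch of a majority vote between existing lines and a new line labelling. It is not the ocrd_cis implementation. The function name `vote`, the data layout (pixel lists and a nested-list label image), and the way the threshold is applied are assumptions for illustration; only the parameter name `min_fraction` and its value 0.8 are taken from the log above. The point is that counting labels once per existing line costs time proportional to that line's area, whereas comparing every existing line against every new line separately multiplies that by the number of new lines, which grows with the size of the region.

```python
from collections import Counter

def vote(old_lines, new_labels, min_fraction=0.8):
    """Map each old line id to the new label covering most of it, or None.

    old_lines:  dict of line id -> list of (row, col) pixels of that line
    new_labels: 2D list of ints, 0 = background, k > 0 = k-th new line
    """
    result = {}
    for line_id, pixels in old_lines.items():
        # One pass over the old line's pixels tallies all new labels at once.
        counts = Counter(new_labels[r][c] for r, c in pixels)
        counts.pop(0, None)  # background gets no vote
        if not counts:
            result[line_id] = None
            continue
        label, hits = counts.most_common(1)[0]
        # Accept the winner only if it covers enough of the old line.
        result[line_id] = label if hits >= min_fraction * len(pixels) else None
    return result

new_labels = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 0, 0],
]
old_lines = {
    "a": [(r, c) for r in (0, 1) for c in range(4)],  # entirely inside label 1
    "b": [(r, c) for r in (1, 2) for c in range(4)],  # split between 1 and 2
    "c": [(2, c) for c in range(4)] + [(3, 0)],       # entirely inside label 2
}
print(vote(old_lines, new_labels))  # {'a': 1, 'b': None, 'c': 2}
```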

I'll have to think about this.

Could you please revisit with the current master version @jbarth-ubhd ?