resegment: running for 155 minutes(?)...
and still running.
Workflow:
. /usr/local/ocrd_all/venv/bin/activate
export TMPDIR=/dwork/tmp
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
ocrd-create-mets.xml
( /usr/bin/time ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page" \
"pc-segmentation -I OCR-D-N5 -O OCR-D-N6" \
"cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region" \
"tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8" \
"cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9" \
"cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10" \
"calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json"
) >cmd.log 2>&1
ps axf
ls 66073 0.0 0.0 4384 744 pts/0 S 14:40 0:00 | \_ /usr/bin/time ocrd process olena-binarize -I O[44/1843]
-O OCR-D-N1 -P impl wolf anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4
-P level-of-operation page cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR
-D-N6 -O OCR-D-N7 -P level-of-operation region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I
OCR-D-N9 -O OCR-D-N10 calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt
.json
ls 66074 0.0 0.0 2423620 68968 pts/0 S 14:40 0:05 | \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/
venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd process olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf anybaseocr-crop
-I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page cis-ocropy-de
skew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation
region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 calamari-recognize
-I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json
ls 2747 116 0.3 11505348 519324 pts/0 Rl 16:44 160:53 | \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_
all/venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd-cis-ocropy-resegment --working-dir /_digi8+9/digitalisate8/ocr-d/testset
/x,pc-segmentation,tesserocr-segment-line,calamari-frak19th --mets mets.xml --input-file-grp OCR-D-N8 --output-file-grp OCR-D-N9 --parameter {"dpi": 0, "min_fra
ction": 0.8, "extend_margins": 3}
@bertsky: same image set as in last email.
PS: no cis-ocropy-clip for obvious reasons :-)
aborting after 244 minutes...
Log file https://digi.ub.uni-heidelberg.de/diglitData/v/cmd-0026e786738187ab1652ac53ccc5184f.log
Finally went through; took hours.
Since this only occurs in combination with pc-segmentation and pc-segmentation seems to be currently the weakest segmentation method, I'll close this case.
I would really like to debug this, but unfortunately I have not been able to run ocrd-pc-segmentation in the past. So could you please provide me with the last input file? I.e. the file in fileGrp OCR-D-N8 with pageId OCR-D-N8_00062, only the PAGE-XML (since you gave me the images already)...
Before we close, we should make sure this is not a bug on the ocrd_cis side. Could you please run ocrd workspace validate, esp. on OCR-D-N8?
I've let it run again...
Note: the complete workflow took longer than with sbb_textline; resegment alone took 3:30 wall-clock time.
I don't know exactly which page affects the resegment execution time. Perhaps it is a consequence of very poor input to resegment. Let's wait and see whether someone else complains about this in combination with sbb_textline or similar.
21:19:33.979 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1 -p
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:24:12.068 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1 -p
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:24:12.073 INFO ocrd.task_sequence.run_tasks - Start processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 -p
'{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow":
0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95,
"operation_level": "page"}''
21:28:15.482 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 -p
'{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow":
0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95,
"operation_level": "page"}''
21:28:15.497 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3 -p
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:33:18.981 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3 -p
'{"impl": "wolf", "k": 0.34, "win-size": 0, "dpi": 0}''
21:33:18.989 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -p
'{"level-of-operation": "page", "noise_maxsize": 3.0, "dpi": 0}''
21:34:40.901 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4
-p '{"level-of-operation": "page", "noise_maxsize": 3.0, "dpi": 0}''
21:34:40.910 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p
'{"level-of-operation": "page", "maxskew": 5.0}''
21:48:11.411 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p
'{"level-of-operation": "page", "maxskew": 5.0}''
21:48:11.421 INFO ocrd.task_sequence.run_tasks - Start processing task 'pc-segmentation -I OCR-D-N5 -O OCR-D-N6 -p
'{"overwrite_regions": true, "xheight": 8, "model": "__DEFAULT__", "gpu_allow_growth": false, "resize_height": 300}''
21:55:00.789 INFO ocrd.task_sequence.run_tasks - Finished processing task 'pc-segmentation -I OCR-D-N5 -O OCR-D-N6 -p
'{"overwrite_regions": true, "xheight": 8, "model": "__DEFAULT__", "gpu_allow_growth": false, "resize_height": 300}''
21:55:00.816 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -p
'{"level-of-operation": "region", "maxskew": 5.0}''
22:08:02.059 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -p
'{"level-of-operation": "region", "maxskew": 5.0}''
22:08:02.073 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8
-p '{"dpi": -1, "overwrite_lines": true}''
22:09:49.340 INFO ocrd.task_sequence.run_tasks - Finished processing task 'tesserocr-segment-line -I OCR-D-N7 -O
OCR-D-N8 -p '{"dpi": -1, "overwrite_lines": true}''
22:09:49.356 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 -p
'{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
01:39:31.500 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9
-p '{"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}''
01:39:31.533 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 -p
'{"dpi": 0, "range": 4.0, "max_neighbour": 0.05}''
01:58:02.010 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10
-p '{"dpi": 0, "range": 4.0, "max_neighbour": 0.05}''
01:58:02.061 INFO ocrd.task_sequence.run_tasks - Start processing task 'calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -p
'{"checkpoint": "/usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json", "voter":
"confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
03:01:22.186 INFO ocrd.task_sequence.run_tasks - Finished processing task 'calamari-recognize -I OCR-D-N10 -O OCR-D-OCR
-p '{"checkpoint": "/usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json", "voter":
"confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
03:01:22.268 INFO ocrd.cli.process - Finished
32186.10user 9665.10system 5:42:03elapsed 203%CPU (0avgtext+0avgdata 12763172maxresident)k
6303240inputs+45550456outputs (12729major+450089605minor)pagefaults 0swaps
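For reference, the per-task wall-clock times can be derived from the Start/Finished timestamps in the log above. The following is only a sketch: `LINE` and `task_durations` are hypothetical names, not part of OCR-D. It assumes the log format shown above, that tasks run strictly one after another, and that a finish time earlier than its start time means the task ran past midnight (the log carries no dates).

```python
import re
from datetime import datetime, timedelta

# Hypothetical helper, not part of OCR-D: matches the first line of each
# Start/Finished entry in the log format shown above.
LINE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) INFO ocrd\.task_sequence\.run_tasks - "
    r"(Start|Finished) processing task '(\S+) -I (\S+)"
)


def task_durations(log_text):
    """Return [(task label, timedelta)] for sequentially executed tasks."""
    durations = []
    start = None
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # wrapped parameter lines, summary lines, etc.
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if match.group(2) == "Start":
            start = stamp
        elif start is not None:
            if stamp < start:
                # Assumption: the task crossed midnight exactly once.
                stamp += timedelta(days=1)
            label = f"{match.group(3)} ({match.group(4)})"
            durations.append((label, stamp - start))
            start = None
    return durations


# The two resegment entries, copied from the log above (first line of each).
SAMPLE = """\
22:09:49.356 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 -p
01:39:31.500 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9
"""

for name, delta in task_durations(SAMPLE):
    print(name, delta)
```

For the two resegment entries this gives 3:29:42.144000, which agrees with the 3:30 wall-clock time noted above.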
I was able to run ocrd-pc-segmentation now. I can reproduce the extremely long runtime of resegment afterwards.
From what I see, this is somewhat related to bad segmentation quality (undetected multi-column layouts). ocrd-pc-segmentation does produce invalid PAGE (negative coordinates etc).
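To spot such files quickly, one can scan the Coords elements of a PAGE-XML file for negative values. This is a minimal sketch using only the Python standard library, not a replacement for proper validation. `negative_coords` is a hypothetical helper and the sample document is made up for illustration. It assumes the points attribute holds integer "x,y" pairs separated by spaces, as PAGE prescribes.

```python
import xml.etree.ElementTree as ET


def negative_coords(page_xml):
    """Return [(parent id, points)] for every Coords with a negative value."""
    root = ET.fromstring(page_xml)
    hits = []
    for parent in root.iter():
        for child in parent:
            # Compare the local tag name only, so any PAGE namespace version works.
            if child.tag.rsplit("}", 1)[-1] != "Coords":
                continue
            points = child.get("points", "")
            values = [int(v) for pair in points.split() for v in pair.split(",")]
            if any(v < 0 for v in values):
                hits.append((parent.get("id"), points))
    return hits


# Made-up example document; region r2 sticks out to the left of the page.
SAMPLE_PAGE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1"><Coords points="0,0 50,0 50,20 0,20"/></TextRegion>
    <TextRegion id="r2"><Coords points="-3,30 50,30 50,60 -3,60"/></TextRegion>
  </Page>
</PcGts>
"""

for element_id, points in negative_coords(SAMPLE_PAGE):
    print(element_id, points)
```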
But the long runtime also exposes a weakness in the resegmentation algorithm: if the input regions are quite large, then computing the new line segmentation, comparing it pair-wise with the existing lines and taking a majority vote is inefficient.
I'll have to think about this.
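To illustrate why the pair-wise comparison gets expensive, here is a toy sketch. It is explicitly not the ocrd_cis implementation: `vote_pairwise`, `vote_single_pass` and the data layout (0/1 masks for existing lines, an integer label image for the new segmentation) are hypothetical. Reading min_fraction as the share of an existing line's pixels that the winning new label must cover is also an assumption. The first function rescans the whole region once per pair of existing line and new label; the second reaches the same vote while visiting each existing line's mask only once.

```python
from collections import Counter


def vote_pairwise(old_masks, new_labels, min_fraction=0.8):
    """Naive vote: one full region scan per (existing line, new label) pair."""
    labels = sorted({v for row in new_labels for v in row if v})
    result = {}
    for old_id, mask in old_masks.items():
        size = sum(cell for row in mask for cell in row)
        best, best_count = None, 0
        for label in labels:
            count = sum(
                1
                for mask_row, label_row in zip(mask, new_labels)
                for inside, value in zip(mask_row, label_row)
                if inside and value == label
            )
            if count > best_count:
                best, best_count = label, count
        ok = size and best_count / size >= min_fraction
        result[old_id] = best if ok else None
    return result


def vote_single_pass(old_masks, new_labels, min_fraction=0.8):
    """Same vote, but each existing line's mask is scanned only once."""
    result = {}
    for old_id, mask in old_masks.items():
        counts, size = Counter(), 0
        for mask_row, label_row in zip(mask, new_labels):
            for inside, value in zip(mask_row, label_row):
                if inside:
                    size += 1
                    if value:  # 0 means background in the new segmentation
                        counts[value] += 1
        best, best_count = counts.most_common(1)[0] if counts else (None, 0)
        ok = size and best_count / size >= min_fraction
        result[old_id] = best if ok else None
    return result


# Toy region: the new segmentation found two lines (labels 1 and 2).
NEW_LABELS = [
    [1, 1, 1, 1],
    [1, 1, 2, 2],
    [2, 2, 2, 2],
]
OLD_MASKS = {
    "line_a": [[1, 1, 1, 1], [1, 0, 0, 0], [0, 0, 0, 0]],
    "line_b": [[0, 0, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1]],
    "line_c": [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]],
}

print(vote_pairwise(OLD_MASKS, NEW_LABELS))
print(vote_single_pass(OLD_MASKS, NEW_LABELS))
```

In this toy data, line_c is split evenly between the two new lines, so no label reaches the 0.8 threshold and the line is left unassigned. With large regions, the per-pair rescans in the first variant multiply the cost by the number of new labels, which is one plausible source of the slowdown described above.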
Could you please revisit this with the current master version, @jbarth-ubhd?