qurator-spk/sbb_textline_detection

Good segmentation results but bad OCR when using sbb-textline-detection with OCR-D


I'm not sure whether this is the right place to ask, since sbb-textline-detector itself worked perfectly in our OCR-D workflows and the segmentation results look good. However, running any recognition afterwards (calamari-recognize as well as tesserocr-recognize) yields text output that is much worse than the segmentation quality would suggest.

I basically used the (formerly) recommended workflow and substituted everything starting from the region segmentation up to the line segmentation with sbb-textline-detector.

The region segmentation looks pretty good, and this impression is confirmed by the pixel accuracy evaluation we ran for several segmentation workflows (with cis-ocropy-segment, tesserocr-segment-region, …). The line segmentation looks pretty good as well and should be a good basis for OCR, but as stated above the results are surprisingly bad. I also tried running the recognition directly on the produced segmentation (OCR-D-SEG-LINE) without dewarping first, but the results are even worse that way.

Am I missing something obvious (e.g. adding a certain step after running sbb-textline-detector)?

Workflow steps
"olena-binarize -I input -O OCR-D-BIN -P impl sauvola"
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
"olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim"
"cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page"
"cis-ocropy-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P level-of-operation page"
"sbb-textline-detector -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-LINE -P model /home/mn/Desktop/sbbmodels/mixed"
"cis-ocropy-dewarp -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-RESEG-DEWARP"
"calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /home/mn/Desktop/ocrd_calamari/gt4histocr-calamari/*.ckpt.json"
    
Region segmentation output
Line segmentation output

Text output:

) weglaſſen. — Und — — großen, oder bald einen groͤßern, kleinern Raum dazwiſchen laſſen. ..C 2 ..H. A.. .. ſ. A
rts, ſondern gerade auf das Papier
rr uue— rſ — ewoͤhnlichſte Schrift iſt die Current— n .nehr von der Rechten zur Linken g me o ene i die unter oder uͤber die Linie hervor— uchſtaben alle gleich weit hervorragen. roßer Fehler, wenn die Buchſtaben zu br ao ſ — ſ o i ui — in Wort ausmachen, einzeln zu ſchrei— ern ſie muͤſſen, ſo viel moͤglich iſt, ſo en vorhergehenden, als mit den fol— — . ſRa o. ſ X2 itt — — —

The input image for the example page and the produced PAGE XML can be found here in case it helps.

Could you upload the result before the dewarping step, too? My hunch is that the dewarping produces too thick lines. Ideally, upload the whole workspace contents for this page.

There is also an issue that the line texts aren't matching the line images, but this could just be an issue with PAGEViewer:

image

The text box should give the result for the first line, but gives some text from the second line.
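To rule out a viewer problem, the line order, coordinates, and text can be read straight from the PAGE XML. The following is a minimal sketch, not part of any of the tools discussed here: `list_lines` is a hypothetical helper, and it assumes the usual PAGE layout in which each `TextLine` has a `Coords` child with a `points` attribute and optionally a `TextEquiv/Unicode` child. Tags are matched by local name so that the PAGE schema version does not matter.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def list_lines(page_xml):
    """Return (line id, Coords points, text) for every TextLine in a PAGE XML string."""
    root = ET.fromstring(page_xml)
    rows = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        points, text = None, None
        # Only direct children are inspected, so word-level TextEquiv is ignored.
        for child in elem:
            name = local(child.tag)
            if name == "Coords":
                points = child.get("points")
            elif name == "TextEquiv" and text is None:
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text
        rows.append((elem.get("id"), points, text))
    return rows
```

Printing these rows for the page in question shows which text is stored with which line polygon, independently of PAGEViewer or LAREX.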

> Could you upload the result before the dewarping step, too?

OCR-D-SEG-LINE_0005.xml (sbb-seg.zip) is the PAGE XML output by sbb-textline-detector; OCR-D-OCR2_0005.xml and OCR-TXT2_0005.txt are the OCR output when running calamari-recognize directly on OCR-D-SEG-LINE_0005.xml.

> There is also an issue that the line texts aren't matching the line images

That bug (?) appears in LAREX as well but my first thought was that it's a problem caused by LAREX as it's not really 100% compatible with OCR-D yet.

Here is the result with my (a lot simpler) my_ocrd_workflow.

2020-10-sbb_textline_detection-issue-42.zip

The result is fine (paragraphs 2+3):

Die gewoͤhnlichſte Schrift iſt die Current⸗
ſchrift, deren Buchſtaben nicht zu gerade herun—
ter, ſondern mehr von der Rechten zur Linken
herabliegend geſchrieben werden muͤſſen — Es iſt
gut, wenn die unter oder uͤber die Linie hervor⸗
ragende Buchſtaben alle gleich weit hervorragen.
Es iſt ein großer Fehler, wenn die Buchſtaben zu
gedraͤngt ſtehen, oder zu weit gedehnt ſind. Auch
muß man ſich huͤten, die Buchſtaben, die zu—
ſammen Ein Wort ausmachen, einzeln zu ſchrei—
ben, ſondern ſie muͤſſen, ſo viel moͤglich iſt, ſo
wohl mit den vorhergehenden, als mit den fol—
genden zuſammenhaͤngen. —
Man muß den Currentbuchſtaben nicht un—
nuͤtze Zierrathen anhaͤngen, oder ihre Schwei⸗
fungen zu ſehr vergroͤßern.

So the problem is somewhere in all the cropping/dewarping/deskewing or the handling thereof. This is going to take some time to debug. But I wanted to check out the dewarping anyway ;-)

Using your more minimal workflow with sbb-textline-detector gave me the same results (which look a bit more like the result I expected :D )

Yeah, superficially I only see problems with the hyphens.

I tried switching off different pre-processing steps before segmenting (seeing that minimal pre-processing seems to work just fine in this case), and it seems that cropping is responsible for the bad results.
The above workflow without anybaseocr-crop yields good results, whereas turning off the other pre-processing steps but leaving cropping in the workflow always yields bad results for this page.
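For anyone repeating this kind of ablation, here is a minimal sketch of dropping one step from a list of task strings like the ones above and rewiring the file groups, so that the next step reads what the removed step would have read. `drop_step` is a hypothetical helper, not an OCR-D function, and it assumes every task string has the form `processor -I IN -O OUT [-P ...]` with a single input and output group.

```python
import shlex


def drop_step(steps, processor):
    """Remove every step run by `processor` and rewire the steps that consumed its output."""
    kept, rename = [], {}
    for step in steps:
        args = shlex.split(step)
        grp_in = args[args.index("-I") + 1]
        grp_out = args[args.index("-O") + 1]
        if args[0] == processor:
            # Consumers of this step's output should now read its input instead.
            rename[grp_out] = rename.get(grp_in, grp_in)
            continue
        args[args.index("-I") + 1] = rename.get(grp_in, grp_in)
        kept.append(shlex.join(args))  # shlex.join needs Python 3.8+
    return kept
```

Calling `drop_step(steps, "anybaseocr-crop")` on the workflow from the first post leaves the second olena-binarize step reading OCR-D-BIN instead of OCR-D-CROP.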

Thanks for the analysis. I'll look into the problem; it could be an API problem in ocrd-sbb-textline-detector.

cneud commented

Other @OCR-D users also reported issues with anybaseocr-crop. But if the expected results can in fact be achieved with https://github.com/mikegerber/my_ocrd_workflow/, this rather hints at a problem in the OCR-D workflow or in the way ocrd-sbb-textline-detector writes its output PAGE-XML (cc @kba).

Btw there is also this nice fork https://github.com/sulzbals/gbn which provides a more granular API that is @OCR-D compliant, in case this may be useful for testing/debugging.

Regarding the way cropping and line-deskewing/dewarping are applied by sbb-textline-detector, @vahidrezanezhad can fill in the details much better than me.

I changed the code that retrieves the image and calculates the coordinates. Could you try again with current master (020ffbc)? (I don't have a setup of anybaseocr + cis-ocropy yet, so it would help if you could try again.)
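To illustrate why cropping can throw the coordinates off: if lines are detected on a cropped image but written to the PAGE XML without accounting for the crop, every polygon is shifted by the crop offset relative to the full page. The sketch below is not the change in 020ffbc. `to_page_coords`, `crop_left` and `crop_top` are hypothetical names, and the crop is assumed to be a pure translation (no rotation or scaling).

```python
def parse_points(points):
    """Parse a PAGE-style points string such as "0,0 10,0 10,5" into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def to_page_coords(points, crop_left, crop_top):
    """Translate a polygon from cropped-image coordinates to full-page coordinates."""
    shifted = [(x + crop_left, y + crop_top) for x, y in parse_points(points)]
    return " ".join(f"{x},{y}" for x, y in shifted)
```

For example, a line found at `0,0 10,0 10,5 0,5` in an image cropped at offset (100, 50) belongs at `100,50 110,50 110,55 100,55` on the page; without the shift, a recognizer would cut out the wrong strip of the page image.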

cneud commented

Possibly relates to #48