cisocrgroup/ocrd_cis

ocrd-cis-align producing unexpected xml

Closed this issue · 3 comments

b2m commented

When using ocrd-cis-align together with ocrd-dinglehopper, I noticed some unexpected behavior in the XML generated by ocrd-cis-align.

See qurator-spk/dinglehopper#37 for the discussion that led me here.

Here is the minimal workflow to reproduce the problem (also in the attached zip file as workflow.sh).

ocrd workspace init
ocrd workspace set-id "OCR-D-CIS-ALIGN-BUG"

ocrd workspace add --file-grp OCR-D-IMG --file-id OCR-D-IMG_f0001 --mimetype image/jpg --page-id PAGE_0001 OCR-D-IMG/FILE_0001.jpg

ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola -P k 0.2
ocrd-cis-ocropy-segment -I OCR-D-BIN -O OCR-D-SEG-REG -P level-of-operation page
ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-OCR-TESS1 -P textequiv_level word
ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-OCR-TESS2 -P textequiv_level word
ocrd-cis-align -I OCR-D-OCR-TESS1,OCR-D-OCR-TESS2 -O OCR-D-ALIGN

The workflow can be run with Docker and the attached data:

 docker run --rm -it -v ${WORKSPACE}:/data -w /data -- ocrd/all:maximum bash workflow.sh

The unexpected part is that the information for the text line from OCR-D-OCR-TESS2 is split across two XML nodes, an empty TextEquiv and a second one that carries the text:

<pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000"/>
<pc:TextEquiv conf="0.639380130767822" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000">
  <pc:Unicode>上оrеm টрsum</pc:Unicode>
</pc:TextEquiv>
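
For anyone who wants to check their own output for the same pattern, here is a minimal sketch in Python. It is not part of ocrd-cis-align; the function name `find_split_text_equivs`, the wrapping `pc:TextLine` element, and the namespace URI in the sample are assumptions made for illustration. It reports every `dataTypeDetails` value that appears both on an empty `TextEquiv` and on a `TextEquiv` with a `Unicode` child.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_split_text_equivs(xml_text):
    """Return dataTypeDetails values that occur on both an empty TextEquiv
    and a TextEquiv carrying a Unicode child (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    seen = defaultdict(lambda: {"empty": 0, "with_text": 0})
    for elem in root.iter():
        if local_name(elem.tag) != "TextEquiv":
            continue
        details = elem.get("dataTypeDetails")
        if details is None:
            continue
        has_unicode = any(local_name(child.tag) == "Unicode" for child in elem)
        seen[details]["with_text" if has_unicode else "empty"] += 1
    return sorted(d for d, c in seen.items() if c["empty"] and c["with_text"])


# The two nodes quoted in the report, wrapped in an assumed parent element.
# The namespace URI is an assumption; matching is done on local names only.
SAMPLE = """<pc:TextLine xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000"/>
  <pc:TextEquiv conf="0.639380130767822" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000">
    <pc:Unicode>上оrеm টрsum</pc:Unicode>
  </pc:TextEquiv>
</pc:TextLine>"""

print(find_split_text_equivs(SAMPLE))
```

To check a real file, read its contents and pass the text to `find_split_text_equivs`; an empty list means no split nodes of this kind were found.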

finkf commented

Thanks for reporting. I'll look into this issue.

finkf commented

I was able to fix the issue; see #78.

b2m commented

Confirmed: I can no longer reproduce this issue with the newest version of OCR-D.