ocrd-cis-align producing unexpected XML
Closed this issue · 3 comments
b2m commented
When using ocrd-cis-align together with ocrd-dinglehopper, I noticed some unexpected behavior in the XML generated by ocrd-cis-align.
See qurator-spk/dinglehopper#37 for the discussion that led me here.
Here is the minimal workflow to reproduce the problem (also included in the attached zip file as workflow.sh):
ocrd workspace init
ocrd workspace set-id "OCR-D-CIS-ALIGN-BUG"
ocrd workspace add --file-grp OCR-D-IMG --file-id OCR-D-IMG_f0001 --mimetype image/jpg --page-id PAGE_0001 OCR-D-IMG/FILE_0001.jpg
ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola -P k 0.2
ocrd-cis-ocropy-segment -I OCR-D-BIN -O OCR-D-SEG-REG -P level-of-operation page
ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-OCR-TESS1 -P textequiv_level word
ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-OCR-TESS2 -P textequiv_level word
ocrd-cis-align -I OCR-D-OCR-TESS1,OCR-D-OCR-TESS2 -O OCR-D-ALIGN
To reproduce the problem with Docker and the attached data, run:
docker run --rm -it -v ${WORKSPACE}:/data -w /data -- ocrd/all:maximum bash workflow.sh
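The unexpected part is that the information for the text line from OCR-D-OCR-TESS2 is split into two XML nodes, an empty TextEquiv carrying only the index and a second TextEquiv carrying the confidence and the text: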
<pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000"/>
<pc:TextEquiv conf="0.639380130767822" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000">
<pc:Unicode>上оrеm টрsum</pc:Unicode>
</pc:TextEquiv>
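A quick way to spot this pattern in an output file is to look for TextEquiv elements that have no Unicode child. The following is only a minimal sketch using the Python standard library. The helper names `local_name` and `find_empty_textequivs` are made up for illustration, and the enclosing `pc:TextLine` element and the namespace URI in the sample are placeholders, since the fragment above shows neither. Matching is done on local tag names so that the PAGE schema version does not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def find_empty_textequivs(xml_text):
    """Return the attributes of every TextEquiv that has no Unicode child."""
    root = ET.fromstring(xml_text)
    empty = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextEquiv":
            continue
        has_unicode = any(local_name(child.tag) == "Unicode" for child in elem)
        if not has_unicode:
            empty.append(dict(elem.attrib))
    return empty


# The two nodes from this report, wrapped in a placeholder TextLine.
# The namespace URI is a stand-in, not the real PAGE namespace.
SAMPLE = """<pc:TextLine xmlns:pc="http://example.org/page" id="line0000">
  <pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000"/>
  <pc:TextEquiv conf="0.639380130767822" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000">
    <pc:Unicode>上оrеm টрsum</pc:Unicode>
  </pc:TextEquiv>
</pc:TextLine>"""

if __name__ == "__main__":
    for attrs in find_empty_textequivs(SAMPLE):
        print(attrs)
```

To check a real file, read its contents and pass the string to `find_empty_textequivs`; any entry it returns corresponds to a TextEquiv node without text, like the first node above.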
finkf commented
Thanks for reporting. I'll look into this issue.
b2m commented
Confirmed, I could not reproduce this issue with the newest version of OCR-D.