Calamari-OCR/calamari

[Not an issue] Train a calamari model on RTL languages print or handwritten


Sorry to open this up as an issue.
I want to train a calamari model on RTL languages, print or handwritten:

  • Arabic
  • Hebrew
  • Samaritan
  • Syriac

Are there any additional steps needed for RTL languages and scripts? Do I have to specify the RTL direction somehow, or do I have to reverse my texts? Use python-bidi or anything like that?
I'm used to kraken and other OCR/HTR tools, but I've never tried calamari.

Thank you so much!

Training works out of the box for RTL texts, but you should set the direction in the preprocessing option bidi_direction to RTL to prevent the bidi algorithm from guessing the wrong order in some cases. Have a look at my demo notebook for an example training procedure and the command line options needed for Arabic texts. Ignore the printouts of the example predictions between epochs; this has been broken for ages and does not represent the performance of the model at all.
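For intuition on why the explicit direction helps: a bidi implementation such as python-bidi (which, as far as I know, is what Calamari's text preprocessing uses) guesses the base direction from the first strong character when none is given, so a line that happens to start with a Latin word or label can be treated as LTR. A minimal sketch, with a made-up sample line, of what fixing the base direction does:

```python
# Illustration only: how the bidi algorithm can guess the wrong base direction.
# Requires python-bidi (pip install python-bidi); the sample line is hypothetical.
from bidi.algorithm import get_display

# A line that starts with a Latin label before the Arabic text.
line = "No. 12 \u0645\u0631\u062d\u0628\u0627"

# With no base direction given, the first strong character (Latin) makes the
# line LTR, which can put the segments in the wrong visual order.
print(get_display(line))

# Forcing the base direction to RTL ('R') is what setting bidi_direction to RTL
# in Calamari's preprocessing corresponds to.
print(get_display(line, base_dir='R'))
```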

You might also want to start from the def_arabic model in calamari_models_experimental, at least for Arabic printed material.

Good luck with the training!

Thank you so much!!!
I'm a console guy, but anyway... is there something like eScriptorium (for kraken) for calamari? I mean a web UI where you can annotate transcriptions etc.
And a last question: for calamari I only need to train recognition models, not segmentation, if I understand correctly?

Many thanks!

Calamari needs the input to be pre-segmented, either with coordinates in PAGE XML or as line images. I've been using LAREX for the semi-automatic region segmentation, ocropus for the line segmentation, calamari for the training/recognition, and a custom web app for the manual post-correction. Especially for Arabic handwriting I'd expect kraken to perform much better on the segmentation task.
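If you go the line-image route, the usual Calamari/OCRopus convention is a plain-text transcription next to each line image, sharing its base name with a .gt.txt extension. A small sanity-check sketch (the directory name and glob pattern are just assumptions for illustration):

```python
# Check that every line image has a matching ground-truth transcription.
# Assumes the common layout: lines/0001.png alongside lines/0001.gt.txt.
from pathlib import Path

lines_dir = Path("lines")  # hypothetical directory of segmented line images

missing = []
for img in sorted(lines_dir.glob("*.png")):
    gt = img.with_name(img.stem + ".gt.txt")
    if not gt.exists():
        missing.append(img.name)

if missing:
    print(f"{len(missing)} line image(s) without a .gt.txt transcription:")
    for name in missing:
        print("  ", name)
else:
    print("All line images have a matching transcription.")
```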

If you want a full pipeline and user interface, there is https://www.ocr4all.org/ – it contains calamari and other OCR engines, segmentation tools etc., but I'm not sure if it has been thoroughly tested with RTL texts.

Thank you so much for your help!