Calamari-OCR/calamari

[Not an issue] Train a calamari model on RTL languages print or handwritten


Sorry to open this up as an issue.
I want to train a calamari model on RTL languages, print or handwritten:

  • Arabic
  • Hebrew
  • Samaritan
  • Syriac

Are there any additional steps needed for RTL languages and scripts? Do I have to specify the RTL direction somehow, or do I have to reverse my texts? Use python-bidi or anything like that?
I'm used to kraken and other OCR/HTR tools, but I've never tried calamari.

Thank you so much!

Training works out of the box for RTL texts, but you should set the direction in the preprocessing option bidi_direction to RTL to prevent the bidi algorithm from guessing the wrong order in some cases. Have a look at my demo notebook for an example training procedure and the command line options needed for Arabic texts. Ignore the printouts of the example predictions between epochs; this has been broken for ages and does not represent the performance of the model at all.
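For intuition on why the explicit direction helps: a bidi implementation such as python-bidi (which, as far as I know, is what Calamari's text preprocessing uses) guesses the base direction from the first strong character when none is given, so a line that happens to start with a Latin word or label can be treated as LTR. A minimal sketch, with a made-up sample line, of what fixing the base direction does:

```python
# Illustration only: how the bidi algorithm can guess the wrong base direction.
# Requires python-bidi (pip install python-bidi); the sample line is hypothetical.
from bidi.algorithm import get_display

# A line that starts with a Latin label before the Arabic text.
line = "No. 12 \u0645\u0631\u062d\u0628\u0627"

# With no base direction given, the first strong character (Latin) makes the
# line LTR, which can put the segments in the wrong visual order.
print(get_display(line))

# Forcing the base direction to RTL ('R') is what setting bidi_direction to RTL
# in Calamari's preprocessing corresponds to.
print(get_display(line, base_dir='R'))
```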

You might also want to start from the def_arabic model in calamari_models_experimental, at least for Arabic printed material.

Good luck with the training!

Thank you so much!!!
I'm a console guy, but anyway... is there something like eScriptorium (for kraken) for calamari? I mean a web UI where you can annotate transcriptions etc.
And a last question: for calamari I only need to train recognition models, not segmentation, if I understand correctly?

Many thanks!

Calamari needs the input to be pre-segmented, either with coordinates in PAGE XML or as line images. I've been using LAREX for the semi-automatic region segmentation, ocropus for the line segmentation, calamari for the training/recognition, and a custom web app for the manual post-correction. Especially for Arabic handwriting I'd expect kraken to perform much better on the segmentation task.
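If you go the line-image route, the usual Calamari/OCRopus convention is a plain-text transcription next to each line image, sharing its base name with a .gt.txt extension. A small sanity-check sketch (the directory name and glob pattern are just assumptions for illustration):

```python
# Check that every line image has a matching ground-truth transcription.
# Assumes the common layout: lines/0001.png alongside lines/0001.gt.txt.
from pathlib import Path

lines_dir = Path("lines")  # hypothetical directory of segmented line images

missing = []
for img in sorted(lines_dir.glob("*.png")):
    gt = img.with_name(img.stem + ".gt.txt")
    if not gt.exists():
        missing.append(img.name)

if missing:
    print(f"{len(missing)} line image(s) without a .gt.txt transcription:")
    for name in missing:
        print("  ", name)
else:
    print("All line images have a matching transcription.")
```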

If you want a full pipeline and user interface, there is https://www.ocr4all.org/ – it contains calamari and other OCR engines, segmentation tools etc., but I'm not sure if it has been thoroughly tested with RTL texts.

Thank you so much for your help!