plamola/ocr-joplin-notes

OCR with different languages

Opened this issue · 2 comments

Hey,

First of all, I wanted to thank you for your effort :).

In my case, there are images with text in two different languages (English and German). The problem is that the OCR can only process my notes once, and the language can only be one of the two.

Or does it even matter which language I choose if they share the same characters? I don't know why, but I had problems when I set the language setting to German, while with English it was fine. With English it was even possible to get German words.

Thanks

I would expect the German OCR to also support the 'scharfes s' (ß) and umlauts, while the English OCR would not.

If the OCR output is only intended to be used by the search in Joplin, another option would be to OCR the file with multiple languages in a single pass. The results of all processed languages could then be added as metadata.
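A rough sketch of how that could look, in case it helps the discussion. This is not how ocr-joplin-notes works today. The function names (`ocr_all_languages`, `merge_for_search`, `tesseract_ocr`) are made up for illustration, and it runs one OCR call per language rather than a true single pass. How the merged text would be attached to a note is left open.

```python
from typing import Callable, Dict, Iterable


def ocr_all_languages(
    image_path: str,
    languages: Iterable[str],
    ocr: Callable[[str, str], str],
) -> Dict[str, str]:
    """Call ocr(image_path, lang) once per language and keep the text per language."""
    return {lang: ocr(image_path, lang) for lang in languages}


def merge_for_search(results: Dict[str, str]) -> str:
    """Join the distinct words of all results, in first-seen order, into one string."""
    seen = []
    for text in results.values():
        for word in text.split():
            if word not in seen:
                seen.append(word)
    return " ".join(seen)


def tesseract_ocr(image_path: str, lang: str) -> str:
    """Hypothetical OCR backend using pytesseract (needs Tesseract and Pillow installed)."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

Usage would be something like `merge_for_search(ocr_all_languages("myscan.png", ["deu", "eng"], tesseract_ocr))`. Since the result is only meant to feed the search, duplicate words are dropped and a word misread by one language model can still be found through the other.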

Something to consider for a future improvement.

According to https://stackoverflow.com/questions/24379781/how-can-i-run-tesseract-with-multiple-languages-one-time and other sources like https://nanonets.com/blog/ocr-with-tesseract , Tesseract has supported using several languages at the same time since version 3.02. The individual three-letter language codes should be separated with a "+".

Modified examples I found in the linked sources:

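```shell
# OCR one image with German and English at the same time
tesseract myscan.png out -l deu+eng
```

```python
import pytesseract
from PIL import Image
from langdetect import detect_langs

# Load the scanned image (same file as in the command-line example)
img = Image.open('myscan.png')

# Same language combination via pytesseract
custom_config = r'-l deu+eng --psm 6'
txt = pytesseract.image_to_string(img, config=custom_config)

# Optionally estimate which languages the recognized text contains
print(detect_langs(txt))
```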
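If the languages were to come from a setting, the "+"-separated value could be built with a small helper. This is only a sketch. `build_lang_arg` and its `installed` parameter are names I made up, not part of pytesseract or this project.

```python
from typing import Iterable, Optional, Set


def build_lang_arg(codes: Iterable[str], installed: Optional[Set[str]] = None) -> str:
    """Join Tesseract language codes with '+', e.g. ['deu', 'eng'] -> 'deu+eng'."""
    unique = []
    for code in codes:
        code = code.strip()
        # Skip empty entries and duplicates, keep the original order
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    # Optional check against a caller-supplied set of installed languages
    if installed is not None:
        missing = [c for c in unique if c not in installed]
        if missing:
            raise ValueError("languages not installed: " + ", ".join(missing))
    return "+".join(unique)


print(build_lang_arg(["deu", "eng"]))  # deu+eng
```

The result can be passed as `lang=` to `pytesseract.image_to_string` or after `-l` on the command line. The set of installed languages can be taken from the output of `tesseract --list-langs`.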