Modern Greek data issues
chopinesque opened this issue · 10 comments
There are 2 major issues with the Greek data.
The models tend to produce µ (U+00B5, micro sign) instead of μ (U+03BC, Greek small letter mu), and despite choosing Modern Greek (ell), some output characters carry accents that belong to polytonic Greek.
https://github.com/tesseract-ocr/langdata_lstm/tree/main/ell contains training text and a word list with the same issues, so the model was trained to produce such results.
Right, so how can this be fixed? For example I can see in https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/desired_characters and https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/ell.unicharset the existence of polytonic characters which should not be there.
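To see which characters in a file such as desired_characters or the training text are affected, a small audit script can help. The following is a minimal sketch, assuming that "polytonic" can be approximated by the Unicode Greek Extended block (U+1F00 to U+1FFF) plus the micro sign; the function name `suspicious_chars` is made up for illustration and is not part of any Tesseract tooling.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # often confused with U+03BC GREEK SMALL LETTER MU


def suspicious_chars(text):
    """List characters that are unexpected in monotonic (Modern) Greek text.

    Assumption: anything in the Greek Extended block (U+1F00..U+1FFF) counts
    as polytonic, and the micro sign counts as a wrong stand-in for mu.
    Returns sorted (char, codepoint, unicode_name) tuples, one per character.
    """
    found = set()
    for ch in text:
        if ch == MICRO_SIGN or 0x1F00 <= ord(ch) <= 0x1FFF:
            found.add(ch)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in sorted(found)
    ]


if __name__ == "__main__":
    sample = "\u00b5 \u1f76 \u03bc \u03af"  # micro sign, iota with varia, mu, iota with tonos
    for ch, codepoint, name in suspicious_chars(sample):
        print(ch, codepoint, name)
```

Running it over the contents of each file in the ell directory would give a list of characters to review before deciding whether to remove or replace them.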
As a first step, you could send a pull request for langdata_lstm that fixes the files there. Ultimately, though, new training is required, perhaps based on the existing models for Greek.
OK, I may need some guidance please. I created a fork. So do I simply have to remove non-valid characters from above mentioned files?
I also see
tessedit_load_sublangs grc
https://github.com/chopinesque/langdata_lstm_modern_greek/blob/main/ell/ell.config#L2
I am not sure whether this line should be there going forward.
So do I simply have to remove non-valid characters from above mentioned files?
Remove or replace, whichever fits better.
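For the "replace" option, one possible approach is to decompose each Greek character, turn the polytonic accents into the single tonos, drop breathings and the iota subscript, and recompose. The sketch below is only an illustration under those assumptions; `to_monotonic` is a hypothetical helper, and the choice of which combining marks to map or drop is mine, not something taken from the langdata_lstm repository. A mechanical mapping like this also keeps accents on monosyllables (which monotonic spelling normally omits) and leaves standalone spacing accents untouched, so the result still needs a manual review.

```python
import unicodedata

TONOS = "\u0301"                      # combining acute, used as the monotonic tonos
TO_TONOS = {"\u0300", "\u0342"}       # varia (grave), perispomeni
DROP = {
    "\u0313",  # psili (smooth breathing)
    "\u0314",  # dasia (rough breathing)
    "\u0345",  # ypogegrammeni (iota subscript)
    "\u0304",  # macron
    "\u0306",  # breve (vrachy)
}


def _is_greek_base(ch):
    """True if ch lies in the Greek and Coptic block (U+0370..U+03FF)."""
    return 0x0370 <= ord(ch) <= 0x03FF


def to_monotonic(text):
    """Map polytonic Greek characters to monotonic ones (rough sketch)."""
    # Replace the micro sign with the Greek letter mu, then compose so that
    # each accented letter is a single code point we can inspect on its own.
    text = unicodedata.normalize("NFC", text.replace("\u00b5", "\u03bc"))
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if not _is_greek_base(decomposed[0]):
            out.append(ch)  # leave non-Greek text (e.g. Latin with grave) alone
            continue
        marks = []
        for mark in decomposed[1:]:
            if mark in DROP:
                continue
            if mark in TO_TONOS:
                mark = TONOS
            if mark not in marks:  # avoid doubling the tonos
                marks.append(mark)
        out.append(decomposed[0] + "".join(marks))
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    print(to_monotonic("\u1f76"))  # iota with varia
```

Applied to the training text and word list, this would replace rather than delete the affected words, which keeps the amount of training material unchanged.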
Thanks. If I replace, I need to know about the structure, for example,
ὶ 3 0,255,0,255,0,0,0,0,0,0 Greek 124 0 124 ὶ # ὶ [1f76 ]a
How is the 124 0 124 derived?
You can keep the unicharset file unmodified. A replacement will be created when a new training is run.
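On the structure question: as far as I understand the unicharset(5) format, each entry lists the character, a property bitmask, ten comma-separated glyph metrics, the script name, then the id of the other-case form, a direction code, the id of the mirrored form, and a normalized form. Under that reading, 124 0 124 would be other_case, direction, and mirror, with 124 most likely being the entry's own id because ὶ has no separate uppercase or mirrored entry in the set. The parser below is a sketch based on that assumption; the class and function names are invented for illustration, and treating the property field as hexadecimal is also an assumption.

```python
from typing import List, NamedTuple


class UnicharsetEntry(NamedTuple):
    unichar: str
    properties: int          # bitmask (assumed hex): alpha, lower, upper, digit, punctuation
    glyph_metrics: List[int]
    script: str
    other_case: int          # id of the upper/lower-case counterpart
    direction: int           # text direction code
    mirror: int              # id of the mirrored character
    normed: str              # normalized form


def parse_unicharset_line(line):
    """Split one full-format unicharset entry into named fields.

    Anything after the eighth whitespace-separated field (such as the
    trailing '# ...' comment) is ignored.
    """
    fields = line.split()
    if len(fields) < 8:
        raise ValueError("expected at least 8 fields, got %d" % len(fields))
    return UnicharsetEntry(
        unichar=fields[0],
        properties=int(fields[1], 16),
        glyph_metrics=[int(v) for v in fields[2].split(",")],
        script=fields[3],
        other_case=int(fields[4]),
        direction=int(fields[5]),
        mirror=int(fields[6]),
        normed=fields[7],
    )


if __name__ == "__main__":
    sample = "ὶ 3 0,255,0,255,0,0,0,0,0,0 Greek 124 0 124 ὶ # ὶ [1f76 ]a"
    print(parse_unicharset_line(sample))
```

Since the ids are positions within the file, they are regenerated whenever the character set changes, which fits the advice to leave the unicharset alone and let training rebuild it.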
tessedit_load_sublangs grc
That line tells Tesseract to always use grc in addition to ell. Therefore wrong glyphs can also come from grc as long as that configuration is there.
You can keep the unicharset file unmodified. A replacement will be created when a new training is run.
Not sure then which files I should change. I don't think I have the knowledge to do any training (I also use Windows).
tessedit_load_sublangs grc
That line tells Tesseract to always use grc in addition to ell. Therefore wrong glyphs can also come from grc as long as that configuration is there.
So this line should be removed.
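Removing the line from ell.config can be done by hand, but for completeness here is a minimal sketch that filters it out of a config file's text. The function name `drop_sublang_lines` and the sample config content are hypothetical; only the `tessedit_load_sublangs` key comes from the file linked above.

```python
def drop_sublang_lines(config_text):
    """Return config_text without any 'tessedit_load_sublangs ...' lines."""
    kept = [
        line
        for line in config_text.splitlines()
        # Compare the first whitespace-separated token so that other
        # settings, blank lines, and comments pass through unchanged.
        if line.split()[:1] != ["tessedit_load_sublangs"]
    ]
    return "".join(line + "\n" for line in kept)


if __name__ == "__main__":
    # Hypothetical config content, for illustration only.
    original = "some_other_setting 1\ntessedit_load_sublangs grc\n"
    print(drop_sublang_lines(original), end="")
```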