Wordlist cleaning (lots of incomplete words found in Thai wordlist)
bact opened this issue · 8 comments
I spotted a number of entries in langdata/tha/tha.wordlist that are definitely not valid Thai words, since they violate word formation rules (for example, a vowel sign that requires an immediately following consonant, but that consonant is missing).
For example [line number][instance]:
165 ส์
207 ต์
335 ย์
404 ท์
428 ด์
527 น์
580 ห์
629 อั
658 ล์
774 ค์
787 นั
798 ชั่
863 เอ็
886 สั
986 มั
1114 ว์
1187 ฮั
1244 ชั
1305 กั
1310 ษ์
1380 ลั
1487 บั
1554 ดั
...
7649 ลั่
7656 ยั
7666 ฉั
7733 เกี๋
7914 ล่
7931 น๊
8008 ส่
8045 ญั
8100 ข์
8148 ด่
...
Should we remove all these instances? They seem to follow a few patterns as well, such as:
- char + u0e31 : c ั
- char + u0e31 + tonemarks : c ั่
- char + u0e4c : c ์
- u0e40 + char + u0e47 : เc็
Is each entry in this wordlist meant to be a word in itself, or is it supposed to be a component of a larger word? In the latter case, it's totally OK to leave them as they are. But in the former case, we should remove them, as they are not words.
I'm not entirely sure how tesseract utilizes XXX.wordlist in langdata, so please correct me if this is irrelevant. Thank you.
This is a file from 3.04. I would suggest that you unpack the current traineddata, extract the wordlist from it and see if the list is the same.
If it is, try removing the error words from that list, combine the traineddata again and test for accuracy.
The commands should be similar to the following; please change the paths to match your setup.
combine_tessdata -u ./tessdata_best/tha.traineddata ./tessdata_TEST/tha.
dawg2wordlist ./tessdata_TEST/tha.lstm-unicharset ./tessdata_TEST/tha.lstm-word-dawg ./tessdata_TEST/tha.lstm-word-list
REVIEW & EDIT wordlist
wordlist2dawg ./tessdata_TEST/tha.lstm-word-list ./tessdata_TEST/tha.lstm-word-dawg ./tessdata_TEST/tha.lstm-unicharset
combine_tessdata ./tessdata_TEST/tha.
COMPARE accuracy of ./tessdata_best/tha.traineddata and ./tessdata_TEST/tha.traineddata
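The unpack and repack steps above can be sketched as a small Python helper. This is only an illustration, assuming the Tesseract training tools (combine_tessdata, dawg2wordlist, wordlist2dawg) are on the PATH; the function names build_commands and run_all are hypothetical, and the tool names and argument order are copied from the commands above.

```python
import subprocess
from pathlib import Path


def build_commands(best_dir, test_dir, lang="tha"):
    """Return (unpack, repack) command lists mirroring the steps above.

    Hypothetical helper: it only assembles argv lists and runs nothing.
    """
    best = Path(best_dir)
    test = Path(test_dir)
    prefix = f"{test / lang}."  # e.g. tessdata_TEST/tha. (trailing dot is intended)
    unicharset = prefix + "lstm-unicharset"
    dawg = prefix + "lstm-word-dawg"
    wordlist = prefix + "lstm-word-list"

    unpack = [
        # Unpack the traineddata into its component files.
        ["combine_tessdata", "-u", str(best / f"{lang}.traineddata"), prefix],
        # Convert the word DAWG back into a plain-text word list.
        ["dawg2wordlist", unicharset, dawg, wordlist],
    ]
    repack = [
        # Rebuild the DAWG from the edited word list.
        ["wordlist2dawg", wordlist, dawg, unicharset],
        # Combine the component files into a new traineddata.
        ["combine_tessdata", prefix],
    ]
    return unpack, repack


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

A possible use is to call run_all on the unpack list, edit the word list by hand, and then call run_all on the repack list.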
Thank you for detailed instructions. I will try that accordingly.
I saw those error words in the current tha.traineddata (from https://github.com/tesseract-ocr/tessdata_best) as well.
Current ./tessdata_best/tha.lstm-word-list : 9083 lines
Modified ./tessdata_TEST/tha.lstm-word-list : 8811 lines (272 error words removed)
I compared the two tessdata on a screenshot of a short text from https://prachatai.com/journal/2018/02/75448 (I chose two paragraphs containing Thai text only), with the options "--oem 1 -l tha" (LSTM, Thai).
Not much difference in accuracy, as both did badly :(
The modified tessdata is slightly (very slightly) better, though.
Example original text:
กิติภูมิ กล่าวว่า มาร์กบอกว่าการต่อสู้ทางชนชั้น
Output text from current tessdata:
ก ิ ต ิ ภู ม ิ ก ล ่ า ว ว ่ า ม า ร ์ ก บ อ ก ว ่ า ก า ร ต ่ อ ส ู ้ ท า ง ชน ชั ้ น
Output text from modified tessdata:
ก ิ ต ิ ภู ม ิ ก ล ่ า ว ว ่ า ม า ร ์ ก บ อ ก ว ่ า ก า ร ต ่ อ ส ู ้ ท า ง ชน ชั้น
The characters were recognized perfectly with both tessdata.
But as you can see, most of the characters are separated by spaces. They shouldn't be.
The only difference between the two outputs here is that the last word, "ชั้น", comes out of the modified tessdata combined as a proper word, with no spaces in between.
In general, removing character combinations that are impossible in Thai from the word list makes the output a little more accurate. But maybe I need to adjust some config.
Current tha.config:
segsearch_max_futile_classifications 10
language_model_ngram_on 1
language_model_ngram_space_delimited_language F
chop_enable 0
These are patterns of words that got removed:
^.[่้๊๋็ํั์]$
^.[ัื][่้๊๋]$
^เ.[็ิีื][่้๊๋]?$
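The three removal patterns above can be applied to a word list with a short Python sketch. The regular expressions are the ones listed above, written with Unicode escapes so the combining marks stay unambiguous; the names is_fragment and split_wordlist are hypothetical, and this is not necessarily how the list was actually edited.

```python
import re

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa.
TONE = "\u0e48\u0e49\u0e4a\u0e4b"

PATTERNS = [
    # One character followed by a single tone mark, maitaikhu (U+0E47),
    # nikhahit (U+0E4D), mai han-akat (U+0E31) or thanthakhat (U+0E4C).
    re.compile("^.[" + TONE + "\u0e47\u0e4d\u0e31\u0e4c]$"),
    # One character + mai han-akat or sara uee + a tone mark.
    re.compile("^.[\u0e31\u0e37][" + TONE + "]$"),
    # Sara e + one character + an upper vowel or maitaikhu + optional tone mark.
    re.compile("^\u0e40.[\u0e47\u0e34\u0e35\u0e37][" + TONE + "]?$"),
]


def is_fragment(word):
    """True if the word matches one of the removal patterns."""
    return any(p.match(word) for p in PATTERNS)


def split_wordlist(lines):
    """Split word-list lines into (kept, removed), skipping blank lines."""
    kept, removed = [], []
    for line in lines:
        word = line.strip()
        if not word:
            continue
        (removed if is_fragment(word) else kept).append(word)
    return kept, removed
```

To use it on a file, the lines can be read with open(path, encoding="utf-8") and the kept list written back one word per line.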
The extra spaces could be related to an issue reported earlier (for a different language); see tesseract-ocr/tesseract#1009
You may want to try OCR with
-c preserve_interword_spaces=1
to remove the extra spaces.
Thank you! The extra spaces are solved with -c preserve_interword_spaces=1.
On the same web page, tested with several different parts of the text, the current tessdata and the modified tessdata produced exactly the same output.
No improvement in accuracy could be measured in this test, so it looks like the wordlist is not used much in recognition.
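One way to put a number on such a comparison is a character error rate based on edit distance. The sketch below is an assumption about how the two outputs could be scored against the original text; it is not the method used in the test above, and the names edit_distance and char_error_rate are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(reference, hypothesis, ignore_spaces=False):
    """Edit distance divided by the reference length.

    With ignore_spaces=True, spaces are dropped from both strings first,
    which separates recognition errors from spurious inter-character spaces.
    """
    if ignore_spaces:
        reference = reference.replace(" ", "")
        hypothesis = hypothesis.replace(" ", "")
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Scoring each tessdata output against the same reference text, once with and once without ignore_spaces, would show whether a difference comes from the characters themselves or only from the spacing.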
preserve_interword_spaces=1 should be added to the config files in tessdata_fast for CJK languages and Thai.
Extra space problem identified in the comment above - #106 (comment)
Fixed via
tesseract-ocr/tessdata_fast#7
@zdenop Please close this issue, after PR is merged in tessdata_fast.