How to Add or Edit [script].unicharset in langdata folder?
sethleech opened this issue · 3 comments
- I also want to know how to get 'glyph_metrics' data from [font or several fonts].
Dear all,
I have been trying Tesseract recently and it is a really good product. Is there a tutorial, or a list of steps, on how to add to or edit [script].unicharset (for example han.unicharset)?
I want to add missing characters, specifically Unicode characters from CJK Extensions B, C, D, E and F:
CJK Unified Ideographs Extension B: U+20000–U+2A6D6
CJK Unified Ideographs Extension C: U+2A700–U+2B734
CJK Unified Ideographs Extension D: U+2B740–U+2B81D
CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
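To work out which characters fall into these blocks, and which of them are still missing from a unicharset, something like the following could help. It is only a sketch: the ranges are copied as listed above, `missing_chars` is a hypothetical helper name, and it assumes that every entry line starts with the character followed by a space, as in the sample lines in this post.

```python
# Ranges copied from the list above (as given in this post).
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(ch):
    """Return the extension letter ("B".."F") for one character, or None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_EXTENSIONS.items():
        if lo <= cp <= hi:
            return name
    return None


def missing_chars(wanted, unicharset_text):
    """Hypothetical helper: characters in `wanted` that start no entry line.

    Assumes the first line of the unicharset is the entry count and that
    each following line begins with the character and a space.
    """
    entry_lines = unicharset_text.splitlines()[1:]
    present = {line.split(" ", 1)[0] for line in entry_lines if line}
    return [ch for ch in wanted if ch not in present]
```

For example, `cjk_extension("𥮗")` reports Extension B, since U+25B97 lies inside the first range.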
For reference, this is what I tried when training Tesseract.
1st try :
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box
output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
Warning: properties incomplete for index 4 = 𥮗
output is [lang].unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
=> not changed
2nd try :
I edited the file langdata/han.unicharset:
- line 0 (the entry count): 23514 -> 23515
- added a new line at the end of the file:
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
- the glyph_metrics data 61,64,255,255,188,200,6,11,205,224 was copied from an arbitrary existing line (e.g. line 67)
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box
output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
no warning
output is [lang].unicharset :
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
=> changed
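The manual edit from the 2nd try (increase the count on line 0, append one entry, borrow the glyph_metrics from an existing entry) could be scripted roughly as below. This is a sketch under assumptions, not an official tool: `append_entry` is a hypothetical name, the properties value `1` and the `0` direction are simply the values used in this post, and the trailing `# ...` comment only imitates the format seen in the existing lines (the meaning of its final `x` is not verified here). It also handles single code point characters only.

```python
def append_entry(unicharset_text, new_char, script, template_char, properties="1"):
    """Return unicharset text with one entry appended and the count bumped.

    glyph_metrics is copied from the entry for `template_char`, which is
    a placeholder and not real data measured from a font.
    """
    lines = unicharset_text.splitlines()
    count = int(lines[0])
    entries = lines[1:]
    if len(entries) != count:
        raise ValueError("line 0 does not match the number of entries")

    metrics = None
    for entry in entries:
        fields = entry.split()
        if fields and fields[0] == new_char:
            raise ValueError("character is already present")
        if fields and fields[0] == template_char:
            metrics = fields[2]  # third field is glyph_metrics
    if metrics is None:
        raise ValueError("template character not found")

    # As in the post, the new entry's own index (the old count) is used
    # for both other_case and mirror.
    new_index = count
    new_line = (
        f"{new_char} {properties} {metrics} {script} "
        f"{new_index} 0 {new_index} {new_char} "
        f"# {new_char} [{ord(new_char):x} ]x"
    )
    return "\n".join([str(count + 1)] + entries + [new_line]) + "\n"
```

Called on the original han.unicharset with `new_char="𥮗"` and `script="Han"`, this should produce the same appended line as the manual edit above, provided the template entry carries the same metrics. Since the metrics are borrowed, the open question of how to derive real values from a font remains.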
What I found out:
- The [script].unicharset file is officially supported.
- Each entry has these properties: 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
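To make the field layout concrete, here is a small sketch that splits one entry line into the properties listed above. The names `UnicharEntry` and `parse_entry` are hypothetical, the field order is the one listed in this post, and the parser assumes the text after `#` is a comment (so it would not handle an entry for the `#` character itself). The 'properties' field is kept as a string because its encoding is not interpreted here.

```python
from typing import List, NamedTuple, Optional


class UnicharEntry(NamedTuple):
    character: str
    properties: str             # kept as text, not interpreted
    glyph_metrics: List[int]    # the comma-separated integers
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: Optional[str]  # absent in the unicharset_extractor output


def parse_entry(line):
    """Split one unicharset entry line into named fields."""
    body = line.split("#", 1)[0]  # drop the trailing comment
    fields = body.split()
    if len(fields) not in (7, 8):
        raise ValueError(f"unexpected number of fields: {len(fields)}")
    return UnicharEntry(
        character=fields[0],
        properties=fields[1],
        glyph_metrics=[int(v) for v in fields[2].split(",")],
        script=fields[3],
        other_case=int(fields[4]),
        direction=int(fields[5]),
        mirror=int(fields[6]),
        normed_form=fields[7] if len(fields) == 8 else None,
    )
```

Parsing the line from the 1st try gives script `NULL` with all-default metrics, while the line from the 2nd try gives script `Han` with the ten copied glyph_metrics values, which makes the difference between the two outputs easy to compare.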
How can I get 'glyph_metrics' data from [font or several fonts]?
Thank you in advance.
Regards,
My project runs on an Android device.
At the moment Tesseract 4.0 can't be used on Android because of build issues with "AVX" and "SSE".
So I have to use Tesseract 3.05.01.
Does anyone have any information, please?
I have the same question.