tesseract-ocr/langdata

How to Add or Edit [script].unicharset in langdata folder?

sethleech opened this issue · 3 comments

  • I want to know how to get 'glyph_metrics' data from a font (or from several fonts).

Dear all,

I have been trying Tesseract recently and it is a very good product. Is there a tutorial or a set of steps for adding to or editing [script].unicharset, for example han.unicharset?

I want to add missing characters, specifically the Unicode code points in CJK Unified Ideographs Extensions B, C, D, E, and F:
CJK Unified Ideographs Extension B: U+20000–U+2A6D6
CJK Unified Ideographs Extension C: U+2A700–U+2B734
CJK Unified Ideographs Extension D: U+2B740–U+2B81D
CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
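
As a quick way to check which of the extension ranges listed above a given character falls into, here is a minimal Python sketch. The ranges are copied verbatim from the list above; the names `CJK_EXTENSIONS` and `cjk_extension` are hypothetical helpers for illustration and are not part of Tesseract.

```python
# Ranges copied from the list above (inclusive code point bounds).
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(ch):
    """Return the extension letter for a single character, or None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_EXTENSIONS.items():
        if lo <= cp <= hi:
            return name
    return None


# U+25B97 is the character used in the examples in this issue.
print(cjk_extension("\U00025b97"))  # -> B
```

Running this over the characters in a box file would show which of them fall outside the ranges above and which need to be added.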

For reference, here is what I tried when training Tesseract.

1st try :
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights

Warning: properties incomplete for index 4 = 𥮗

output is [lang].unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
=> not changed

2nd try :
I edited the file langdata/han.unicharset:
line 0 (entry count) : 23514 -> 23515
appended a new line at the end of the file : 𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
The glyph_metrics values 61,64,255,255,188,200,6,11,205,224 were copied from an arbitrary existing line (e.g. line 67).
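
The manual edit described above (bump the count on the first line, append one entry at the end) could be scripted roughly as follows. This is only a sketch under assumptions taken from the example line: that the new entry's id equals the old count, and that 'other_case' and 'mirror' point back at that same id. The function name `append_unicharset_entry` is hypothetical, and the trailing `# ...` comment is omitted.

```python
def append_unicharset_entry(lines, char, glyph_metrics,
                            script="Han", properties="1"):
    """Return a new list of unicharset lines with one entry appended.

    `lines` is the file content split into lines (no trailing newlines);
    the first line is assumed to hold the entry count.
    """
    count = int(lines[0].strip())
    new_id = count  # assumption: ids are 0-based, as in the example above
    # Field order follows the example line: character, properties,
    # glyph_metrics, script, other_case, direction, mirror, normed_form.
    entry = (f"{char} {properties} {glyph_metrics} {script} "
             f"{new_id} 0 {new_id} {char}")
    return [str(count + 1)] + list(lines[1:]) + [entry]


# Made-up two-entry file, used only to show the call.
sample = [
    "2",
    "NULL 0 NULL 0",
    "x 1 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 x",
]
updated = append_unicharset_entry(
    sample, "\U00025b97", "61,64,255,255,188,200,6,11,205,224")
```

Note that the metrics passed in are still the values copied from another entry, so this only automates the bookkeeping and does not answer where proper metrics should come from.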

** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
no warning

output is [lang].unicharset :
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
=> changed

What I found out so far:

  1. The [script].unicharset file is officially supported.
  2. Each entry has these properties: 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
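
To make the entry layout listed above concrete, here is a minimal Python sketch that splits one unicharset line into those named properties. It assumes the fields are whitespace-separated in the listed order, that 'properties' is a hexadecimal bit mask, and that a `#` in the eighth position starts a trailing comment (so an entry whose normed form is literally `#` is not handled). `UnicharEntry` and `parse_unicharset_line` are hypothetical names, not Tesseract APIs.

```python
from typing import List, NamedTuple, Optional


class UnicharEntry(NamedTuple):
    character: str
    properties: int             # assumed to be a hex bit mask
    glyph_metrics: List[float]  # ten comma-separated numbers
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: Optional[str]  # absent in the extractor's raw output


def parse_unicharset_line(line):
    tokens = line.split()
    if len(tokens) < 7:
        raise ValueError("expected at least 7 fields: " + line)
    # Treat '#' in the eighth slot as the start of the trailing comment.
    normed = tokens[7] if len(tokens) > 7 and tokens[7] != "#" else None
    return UnicharEntry(
        character=tokens[0],
        properties=int(tokens[1], 16),
        glyph_metrics=[float(v) for v in tokens[2].split(",")],
        script=tokens[3],
        other_case=int(tokens[4]),
        direction=int(tokens[5]),
        mirror=int(tokens[6]),
        normed_form=normed,
    )


# The line produced by set_unicharset_properties in the 2nd try.
entry = parse_unicharset_line(
    "\U00025b97 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 "
    "\U00025b97 # \U00025b97 [25b97 ]x")
print(entry.script, entry.glyph_metrics)
```

Comparing the parsed output of the 1st and 2nd try shows exactly which fields set_unicharset_properties filled in: 'glyph_metrics', 'script', 'mirror', and 'normed_form'.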

How can I get 'glyph_metrics' data from a font, or from several fonts?

Thank you in advance.

Regards,

My project runs on an Android device.
For now, Tesseract 4.0 cannot be used on Android because of build issues related to "AVX" and "SSE",
so I am limited to Tesseract 3.05.01.

Does anyone have any information, please?

I have the same question.