How to Add or Edit [script].unicharset in langdata folder?

Question

sethleech opened this issue 7 years ago · 3 comments

I want to know how to get 'glyph_metrics' data from [font or several fonts].

Dear all,

I have been trying Tesseract recently and it is a very good product. Is there a tutorial or a set of steps on how to add or edit [script].unicharset, for example han.unicharset?

I want to add missing characters, specifically the Unicode code points in CJK Extensions B, C, D, E and F:
CJK Unified Ideographs Extension B: U+20000–U+2A6D6
CJK Unified Ideographs Extension C: U+2A700–U+2B734
CJK Unified Ideographs Extension D: U+2B740–U+2B81D
CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
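
To work out which characters from a training text fall into these blocks, a small lookup can help. This is only a sketch: the ranges are copied from the list above as stated in this issue, and the helper names `cjk_extension` and `missing_from` are made up for illustration, not part of Tesseract.

```python
# Ranges copied from the list above (as stated in this issue).
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(char):
    """Return the extension letter for a single character, or None."""
    code_point = ord(char)
    for name, (first, last) in CJK_EXTENSIONS.items():
        if first <= code_point <= last:
            return name
    return None


def missing_from(chars, known):
    """List characters in `chars` that are in Extensions B-F but not in `known`."""
    return [c for c in chars if cjk_extension(c) and c not in known]
```

For example, `cjk_extension("\U00025B97")` classifies the character 𥮗 used in the attempts below, and `missing_from(training_text, existing_chars)` would give the candidates to add, where `existing_chars` is whatever set of characters you have already read from han.unicharset.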

For reference, this is what I tried when training Tesseract.

1st try :
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights

Warning: properties incomplete for index 4 = 𥮗

output is [lang].unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
=> not changed

2nd try :
I edited the file langdata/han.unicharset:
line 0 (the entry count) : 23514 -> 23515
added a new line at the end of the file : 𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
The metrics 61,64,255,255,188,200,6,11,205,224 were copied from an arbitrary existing line, e.g. line 67.
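
The manual edit described above (bump the count on the first line, then append one entry that borrows its metrics from an existing line) could be scripted roughly as follows. This is a sketch of what the second attempt did, not an official procedure: `append_entry` is a hypothetical helper, it assumes the new id equals the old count, and it only handles single-code-point characters. The borrowed metrics are placeholders, not values measured from a font.

```python
def append_entry(unicharset_text, char, template_line):
    """Return unicharset text with `char` appended and the count incremented.

    Properties, glyph metrics, script and direction are copied from
    `template_line`; other_case and mirror point at the new entry itself,
    as in the hand-edited line from the second attempt.
    """
    lines = unicharset_text.splitlines()
    old_count = int(lines[0])      # first line holds the number of entries
    new_id = old_count             # assumption: ids are 0-based, so next id == old count
    template = template_line.split()
    props, metrics, script, direction = (
        template[1], template[2], template[3], template[5]
    )
    new_line = (
        f"{char} {props} {metrics} {script} {new_id} {direction} {new_id} {char}"
        f" # {char} [{ord(char):x} ]x"
    )
    lines[0] = str(old_count + 1)
    lines.append(new_line)
    return "\n".join(lines) + "\n"
```

With the real han.unicharset, the text would be read from and written back to the file with `encoding="utf-8"`; keep a backup, since the result has only been checked against the single example in this issue.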

** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
no warning

output is [lang].unicharset :
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
=> changed

I found out:

1. The [script].unicharset file is officially supported.
2. Entry properties : 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
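
The field order listed above can be checked against the sample lines in this issue with a small parser. This is a minimal sketch: `UnicharEntry` and `parse_entry` are illustrative names, the parser assumes fields are whitespace-separated and that a standalone `#` token starts the trailing comment, and it treats a missing eighth field (as in the unicharset_extractor output) as an empty normed form.

```python
from typing import NamedTuple


class UnicharEntry(NamedTuple):
    character: str
    properties: str
    glyph_metrics: tuple
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: str


def parse_entry(line):
    """Split one unicharset line into the eight fields listed above."""
    tokens = line.split()
    # Drop the trailing "# ..." comment; start at 1 so a '#' character entry survives.
    if "#" in tokens[1:]:
        tokens = tokens[:tokens.index("#", 1)]
    if len(tokens) < 7:
        raise ValueError(f"not a full entry line: {line!r}")
    return UnicharEntry(
        character=tokens[0],
        properties=tokens[1],
        glyph_metrics=tuple(int(v) for v in tokens[2].split(",")),
        script=tokens[3],
        other_case=int(tokens[4]),
        direction=int(tokens[5]),
        mirror=int(tokens[6]),
        normed_form=tokens[7] if len(tokens) > 7 else "",
    )
```

Parsing the line from the second attempt gives script `Han` and the ten copied metric values, while the line from the first attempt parses with script `NULL` and the default metrics, which makes the difference between the two outputs easy to compare.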

How can I get 'glyph_metrics' data from [font or several fonts]?

Thank you in advance.

Regards,

Answer 1 · 2017-10-24T08:27:44.000Z

Based on comments by @theraysmith, the other properties are not required for LSTM training.


Answer 2 · 2017-10-25T02:31:26.000Z

My project runs on an Android device.
At the moment Tesseract 4.0 can't be used on Android because of build issues with "AVX" and "SSE",
so I can only use Tesseract 3.05.01.

Is there any information for that version, please?

Answer 3 · 2020-02-11T21:17:30.000Z

I have the same question.