Slow building of user dictionary with surface form split info
yokomotod opened this issue · 0 comments
Building a user dictionary seems to take 30+ seconds per entry when the entries' split info is written as word info (構成語情報, constituent word information).
$ time sudachipy ubuild user.csv -o user.dic
reading the source file...1 words
writing the POS table...2 bytes
writing the connection matrix...4 bytes
building the trie...done
writing the trie...1028 bytes
writing the word-ID table...9 bytes
writing the word parameters...10 bytes
writing the word_infos...70 bytes
writing word_info offsets...4 bytes
real 0m38.654s
user 0m38.499s
sys 0m0.139s
user.csv:
舞台藝術,5146,5146,8000,舞台藝術,名詞,普通名詞,一般,*,*,*,ブタイゲイジュツ,舞台芸術,*,C,"舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ","舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ","舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ",*
With two such lines in user.csv, the build takes about 1 minute.
Note: there is no problem when the split info is given as word IDs:
舞台藝術,5146,5146,8000,舞台藝術,名詞,普通名詞,一般,*,*,*,ブタイゲイジュツ,舞台芸術,*,C,647312/659236,647312/659236,647312/659236,*
$ time sudachipy ubuild user.csv -o user.dic
real 0m0.925s
user 0m0.776s
sys 0m0.159s
$ sudachipy --version
sudachipy 0.5.2
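To check how the build time scales with the number of entries, a small script can generate both variants of user.csv. The following is only a sketch: the helper names (`make_row`, `make_csv`) are made up for illustration, and it assumes that repeating the same entry N times is an acceptable stand-in for N distinct entries. It uses only Python's standard `csv` module and does not call sudachipy itself.

```python
import csv
import io

# Fields shared by both variants, copied from the entry in this report.
BASE = ["舞台藝術", "5146", "5146", "8000", "舞台藝術",
        "名詞", "普通名詞", "一般", "*", "*", "*",
        "ブタイゲイジュツ", "舞台芸術", "*", "C"]

# Split info written as word info (the slow case in this report).
WORD_INFO_SPLIT = ("舞台,名詞,普通名詞,一般,*,*,*,ブタイ/"
                   "藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ")
# Split info written as word IDs (the fast case in this report).
WORD_ID_SPLIT = "647312/659236"


def make_row(split):
    """Return one 19-column row: base fields, the split info three times, then '*'."""
    return BASE + [split, split, split, "*"]


def make_csv(split, n_rows):
    """Render n_rows copies of the entry as CSV text.

    Assumption: duplicating one entry is good enough for a timing test.
    The csv writer quotes only the fields that contain commas.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for _ in range(n_rows):
        writer.writerow(make_row(split))
    return buf.getvalue()


if __name__ == "__main__":
    # Write a two-entry file in the slow (word info) format.
    with open("user.csv", "w", encoding="utf-8", newline="") as f:
        f.write(make_csv(WORD_INFO_SPLIT, 2))
```

Running the ubuild command from this report against files generated with different `n_rows` values, once with `WORD_INFO_SPLIT` and once with `WORD_ID_SPLIT`, should show whether the slowdown grows linearly with the number of word info entries.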