WorksApplications/SudachiPy

Slow building of user dictionary with surface form split info

yokomotod opened this issue · 0 comments

Building a user dictionary seems to take 30+ seconds per entry when the entry's split info is written as word info (構成語情報, i.e. surface form, part of speech, and reading) rather than as word IDs.

$ sudachipy ubuild user.csv -o user.dic
reading the source file...1 words
writing the POS table...2 bytes
writing the connection matrix...4 bytes
building the trie...done
writing the trie...1028 bytes
writing the word-ID table...9 bytes
writing the word parameters...10 bytes
writing the word_infos...70 bytes
writing word_info offsets...4 bytes

real	0m38.654s
user	0m38.499s
sys	0m0.139s

user.csv:

舞台藝術,5146,5146,8000,舞台藝術,名詞,普通名詞,一般,*,*,*,ブタイゲイジュツ,舞台芸術,*,C,"舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ","舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ","舞台,名詞,普通名詞,一般,*,*,*,ブタイ/藝術,名詞,普通名詞,一般,*,*,*,ゲイジュツ",*

With two such lines in user.csv, the build takes about 1 minute.
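To check how the build time grows with the number of entries, a small script along these lines could generate the CSV and time the build. This is only a sketch: `split_info`, `make_entry`, `write_user_csv`, and `time_build` are hypothetical helper names, the connection IDs, cost, and part of speech are copied from the entry above, and repeating the same line N times is an assumption about how to scale the test. The only SudachiPy call used is the `sudachipy ubuild` command shown above.

```python
import csv
import io
import subprocess
import time

# Part-of-speech columns copied from the sample entry (assumed the same for every part).
POS = ["名詞", "普通名詞", "一般", "*", "*", "*"]


def split_info(parts):
    """Join (surface, reading) pairs into surface-form split info separated by '/'."""
    return "/".join(",".join([surface, *POS, reading]) for surface, reading in parts)


def make_entry(surface, reading, normalized, parts):
    """Build one user.csv line whose split columns use word info, not word IDs."""
    split = split_info(parts)
    row = [surface, "5146", "5146", "8000", surface, *POS,
           reading, normalized, "*", "C", split, split, split, "*"]
    buf = io.StringIO()
    # The csv module quotes only the fields that contain commas (the split columns).
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")


def write_user_csv(path, n):
    """Write the sample entry n times (assumption: repeated lines are acceptable)."""
    line = make_entry("舞台藝術", "ブタイゲイジュツ", "舞台芸術",
                      [("舞台", "ブタイ"), ("藝術", "ゲイジュツ")])
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n):
            f.write(line + "\n")


def time_build(csv_path, dic_path):
    """Run `sudachipy ubuild` as in the report and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(["sudachipy", "ubuild", csv_path, "-o", dic_path], check=True)
    return time.perf_counter() - start
```

Calling `write_user_csv("user.csv", n)` followed by `time_build("user.csv", "user.dic")` for a few values of `n` (for example 1, 2, and 4) would show whether the time grows roughly linearly with the number of entries.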

Note: there is no problem when the split info is given as word IDs:

舞台藝術,5146,5146,8000,舞台藝術,名詞,普通名詞,一般,*,*,*,ブタイゲイジュツ,舞台芸術,*,C,647312/659236,647312/659236,647312/659236,*

$ sudachipy ubuild user.csv -o user.dic
real	0m0.925s
user	0m0.776s
sys	0m0.159s

$ sudachipy --version
sudachipy 0.5.2