Breaking by word a string containing Japanese and Latin characters
edouard opened this issue · 3 comments
Describe the bug
We’re using TwitterCldr::Segmentation::BreakIterator’s each_word method to count words in multiple languages. We just got an exception for a Japanese string that contains both Japanese and Latin characters. Such mixed strings are common, for instance when the text includes Western brand names.
To Reproduce
Steps to reproduce the behavior:
string = 'TWITTERド'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)
However, this string works:
string = 'WINDYのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> WINDY
の
アカウントを作成する
=> #<Enumerator: ...>
Interestingly enough, taking that string above and replacing WINDY with TWITTER doesn’t work 🤔:
string = 'TWITTERのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)
Expected behavior
The BreakIterator shouldn't raise an exception.
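Until that is the case, a caller could guard the word count so that one failing string does not abort a whole batch. This is only a minimal sketch: safe_word_count is a hypothetical helper, not part of twitter_cldr, and it assumes nothing more than an object that responds to each_word and yields each word to a block.

```ruby
# Hypothetical wrapper (not part of twitter_cldr): counts the words yielded by
# iterator.each_word and returns nil when segmentation raises ArgumentError,
# as it does for the strings reported in this issue.
def safe_word_count(iterator, string)
  count = 0
  iterator.each_word(string) { |_word| count += 1 }
  count
rescue ArgumentError
  nil # the caller decides how to handle strings that cannot be segmented
end

# Possible usage, assuming the twitter_cldr gem is installed:
#   require 'twitter_cldr'
#   iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
#   safe_word_count(iterator, 'TWITTERド')
```

Returning nil rather than a guessed count keeps the failure visible to the caller instead of silently reporting a wrong number.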
Environment
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-darwin21]
Answering my own questions here...
Interestingly enough, taking that string above and replacing WINDY with TWITTER doesn’t work 🤔:
string = 'TWITTERのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)
It seems to be due to the length of the Latin word:
string = 'TWITTのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> TWITT
の
アカウントを作成する
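To narrow down the length at which this starts failing, one could probe successively longer prefixes of the Latin word. The helper below is a rough sketch rather than anything from the library: first_failing_prefix_length is a hypothetical name, and the block passed to it is assumed to run the segmentation and raise ArgumentError on failure.

```ruby
# Hypothetical helper (not part of twitter_cldr): returns the length of the
# shortest prefix of `latin` that makes the block raise ArgumentError when the
# prefix is followed by `suffix`, or nil if no prefix fails.
def first_failing_prefix_length(latin, suffix)
  (1..latin.length).find do |len|
    begin
      yield(latin[0, len] + suffix)
      false # segmentation succeeded, keep looking
    rescue ArgumentError
      true  # this prefix length triggers the failure
    end
  end
end

# Possible usage, assuming the twitter_cldr gem is installed:
#   require 'twitter_cldr'
#   iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
#   first_failing_prefix_length('TWITTER', 'のアカウントを作成する') do |s|
#     iterator.each_word(s) { |_word| }
#   end
```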
Looks like the error we see has to do with the length of the Latin word.
twitter-cldr-rb/lib/twitter_cldr/segmentation/cj_break_engine.rb, lines 149 to 155 in 09a1db0
Cool! Thanks for fixing it so quickly! 👍🏽