twitter/twitter-cldr-rb

Breaking by word a string containing Japanese and Latin characters

edouard opened this issue · 3 comments

Describe the bug

We’re using TwitterCldr::Segmentation::BreakIterator’s each_word method to count words in multiple languages. We just got an exception for a Japanese string that contains both Japanese and Latin characters. Such strings are common, for instance when a Western brand name appears in Japanese text.

To Reproduce

Steps to reproduce the behavior:

string = 'TWITTERド'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)

However, this string works:

string = 'WINDYのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> WINDY

アカウントを作成する
 => #<Enumerator: ...> 

Interestingly enough, taking that string above and replacing WINDY with TWITTER doesn’t work 🤔:

string = 'TWITTERのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> /Users/edouard/.rvm/gems/ruby-3.1.2@webtranslateit.com/gems/twitter_cldr-6.11.3/lib/twitter_cldr/segmentation/cj_break_engine.rb:110:in `<': comparison of Integer with nil failed (ArgumentError)

Expected behavior

The BreakIterator shouldn't raise an exception.

Environment
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-darwin21]

Answering my own questions here...

It seems to be due to the length of the Latin word. The same string with a five-letter prefix works:

string = 'TWITTのアカウントを作成する'
iterator = TwitterCldr::Segmentation::BreakIterator.new(:ja)
iterator.each_word(string) {|word| puts word }
#=> TWITT

アカウントを作成する

The cost lookup below is indexed by word length, which would explain why the length of the Latin word matters:

def get_katakana_cost(word_length)
  if word_length > MAX_KATAKANA_LENGTH
    MAX_KATAKANA_COST
  else
    KATAKANA_COSTS[word_length]
  end
end
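
To illustrate how a lookup of this shape could produce "comparison of Integer with nil failed", here is a minimal standalone sketch. It does not use twitter_cldr, and the constants MAX_LENGTH, MAX_COST and COSTS, as well as the helper names, are hypothetical values picked for the demonstration, not the gem's real ones. The assumption is that the table has fewer entries than the length guard allows, so the lookup returns nil and a later comparison blows up. I have not confirmed that this is the actual cause.

```ruby
# Hypothetical values for illustration only; NOT twitter_cldr's real constants.
MAX_LENGTH = 8
MAX_COST   = 8192
# Deliberately too short: lengths 6..8 pass the guard but have no entry.
COSTS = [60, 50, 40, 30, 20, 10].freeze

# Same shape as the method above.
def unguarded_cost(word_length)
  if word_length > MAX_LENGTH
    MAX_COST
  else
    COSTS[word_length] # Array#[] returns nil past the end of the array
  end
end

# One possible guard: fall back to the maximum cost when the entry is missing.
def guarded_cost(word_length)
  COSTS.fetch(word_length, MAX_COST)
end

# Returns the message of the error raised when comparing an Integer
# with the given cost, or nil if the comparison succeeds.
def comparison_error(cost)
  0 < cost
  nil
rescue ArgumentError => e
  e.message
end

puts unguarded_cost(5).inspect            # a 5-letter word has an entry
puts unguarded_cost(7).inspect            # a 7-letter word does not
puts comparison_error(unguarded_cost(7))  # comparison of Integer with nil failed
puts guarded_cost(7)                      # falls back to MAX_COST
```

With these made-up numbers, a length of 5 (like WINDY or TWITT) finds an entry while a length of 7 (like TWITTER) does not, which matches the behaviour above. Array#fetch with a default is just one way to guard the lookup.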

Hey @edouard, thanks for reporting this. Please see #261 for fix details. The fix has been published in v6.11.4.

Cool! Thanks for fixing it so quickly! 👍🏽