Feature Request: UAX 29 word boundaries

Question

maxshortzp opened this issue 7 years ago · 2 comments

Hi,

The README mentions that UAX 29 word boundaries could potentially be implemented if someone wants them.

These would be very helpful for my use case.

I don't know enough about UAX 29, or text parsing in general, to implement this on my own, but I could contribute code if given some direction.

Answer 1 · 2018-05-03T21:24:16.000Z

Hey @maxshortzp, thanks for writing in. Looks like the README is out of date; word boundaries are already supported:

require 'twitter_cldr' # assumption: the twitter_cldr gem, which adds String#localize

"The quick brown fox".localize.each_word.to_a
# => ["The", " ", "quick", " ", "brown", " ", "fox"]

Unfortunately, only rule-based word segmentation is supported, so strings written in scripts that don't separate words with spaces (Japanese, Chinese, Thai, Khmer, etc.) won't be segmented correctly. As luck would have it, I'm currently (in my spare time) trying to add dictionary-based word segmentation support, but it's rather slow going. You can follow the progress on the dictionary_segmentation branch.
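
To illustrate the difference, here is a rough, self-contained sketch in plain Ruby. It is not the library's implementation and not the UAX 29 algorithm: `rule_based_words` is a toy stand-in that only groups runs of letters and non-letters, and `dictionary_words` is a hypothetical greedy longest-match segmenter over a tiny made-up word list. All names here are assumptions for illustration only.

```ruby
# Toy rule-based splitter: groups runs of letters and runs of non-letters.
# NOT the UAX 29 algorithm, only a stand-in for rules that rely on
# spaces and punctuation to find word boundaries.
def rule_based_words(text)
  text.scan(/\p{L}+|[^\p{L}]+/)
end

# Hypothetical greedy longest-match segmenter over a word list.
# Falls back to emitting a single character when nothing matches.
def dictionary_words(text, dictionary)
  max_len = dictionary.map(&:length).max || 1
  words = []
  pos = 0
  while pos < text.length
    # Try the longest candidate first, then shorter ones.
    longest = [max_len, text.length - pos].min
    len = longest.downto(1).find { |n| dictionary.include?(text[pos, n]) } || 1
    words << text[pos, len]
    pos += len
  end
  words
end

# Tiny made-up dictionary, for illustration only.
DICTIONARY = %w[東京 東京都 に 住む].freeze

rule_based_words("The quick brown fox")
# => ["The", " ", "quick", " ", "brown", " ", "fox"]

rule_based_words("東京都に住む")
# => ["東京都に住む"]

dictionary_words("東京都に住む", DICTIONARY)
# => ["東京都", "に", "住む"]
```

Without spaces to split on, the toy rules return the whole Japanese string as a single token, while the word list recovers the individual words. A real dictionary-based segmenter needs a much larger dictionary and a smarter matching strategy than greedy longest match.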

Answer 2 · 2018-05-04T17:49:14.000Z

Thanks @camertron. That works for our use case.