Feature Request: UAX 29 word boundaries
maxshortzp opened this issue · 2 comments
Hi,
The README mentions that UAX 29 word boundaries could potentially be implemented if someone wants them.
These would be very helpful for my use case.
I don't know enough about UAX 29 or text parsing in general to accomplish this by myself, but I could potentially contribute code if given direction.
Hey @maxshortzp, thanks for writing in. It looks like the README is out of date; word boundaries are already supported:
"The quick brown fox".localize.each_word.to_a
# => ["The", " ", "quick", " ", "brown", " ", "fox"]
Unfortunately, only rule-based word segmentation is supported, meaning strings written in Japanese, Chinese, Thai, Khmer, etc. won't be segmented correctly. As luck would have it, I'm currently (in my spare time) trying to add dictionary-based word segmentation support, but it's rather slow going. You can follow the progress on the dictionary_segmentation branch.
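For anyone wondering why a dictionary is needed at all, here is a rough, self-contained sketch of the difference. It is a toy illustration only, not the library's implementation and not what is on the branch: `naive_words` is a hypothetical, much cruder stand-in for real UAX 29 rules, `dictionary_words` is a plain greedy longest-match lookup, and the three-entry `DICTIONARY` is made up for the example.

```ruby
# Toy illustration only. None of these names come from the library.

# Crude rule-based splitter (an assumption, far simpler than UAX 29):
# a token is a run of letters/digits/marks, a run of whitespace,
# or any other single character.
def naive_words(text)
  text.scan(/[\p{L}\p{N}\p{M}]+|\s+|./m)
end

# Greedy longest-match segmentation against a word list.
# Unknown characters fall back to single-character tokens.
def dictionary_words(text, dictionary)
  max_len = dictionary.map(&:length).max || 1
  words = []
  i = 0
  while i < text.length
    # Try the longest candidate first, then shrink until it is in the
    # dictionary or only one character is left.
    len = [max_len, text.length - i].min
    len -= 1 while len > 1 && !dictionary.include?(text[i, len])
    words << text[i, len]
    i += len
  end
  words
end

DICTIONARY = %w[我 喜欢 猫].freeze # hypothetical three-entry dictionary

p naive_words("The quick brown fox")
# => ["The", " ", "quick", " ", "brown", " ", "fox"]

p naive_words("我喜欢猫")
# => ["我喜欢猫"]   (no spaces, so this simple rule finds no boundaries)

p dictionary_words("我喜欢猫", DICTIONARY)
# => ["我", "喜欢", "猫"]
```

Scripts written without spaces between words give character-level rules nothing to key off, which is why the word list is needed. Greedy matching is the simplest possible strategy and picks the wrong split when the longest match is not the right one, so treat this as a sketch of the idea rather than a workable segmenter.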
Thanks @camertron. That works for our use case.