/uax29

A tokenizer based on Unicode text segmentation (UAX 29), for Go

Primary language: Go. License: MIT.

This package tokenizes words, sentences and graphemes, based on Unicode text segmentation (UAX 29), for Unicode version 13.0.0.

This is a fork of github.com/clipperhouse/uax29/words, with the following modifications to the words package:

  • A max token length can be passed in. Tokens will be split upon hitting this limit.
  • Separators will be marked, so they can be omitted from the token stream if desired.
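
To make the two modifications above concrete, here is a minimal, standard-library-only sketch of the behavior they describe. It does not use this package's API: the `Token` type, the `IsSeparator` field, and the `tokenize` function are hypothetical names invented for illustration, and the boundary rule (runs of letters and digits versus everything else) is a deliberately simplified stand-in for the real UAX 29 word rules.

```go
package main

import (
	"fmt"
	"unicode"
)

// Token is a hypothetical type for this sketch, not the package's actual API.
type Token struct {
	Text        string
	IsSeparator bool // true for runs of non-letter, non-digit characters
}

// tokenize splits s into alternating runs of word characters (letters and
// digits) and separator characters. This is an assumed simplification, not
// the UAX 29 word boundary algorithm. Any run longer than maxLen runes is
// split at the limit; maxLen <= 0 disables the limit.
func tokenize(s string, maxLen int) []Token {
	var tokens []Token
	var cur []rune
	curSep := false

	// flush emits the pending run, if any, as a token.
	flush := func() {
		if len(cur) > 0 {
			tokens = append(tokens, Token{Text: string(cur), IsSeparator: curSep})
			cur = cur[:0]
		}
	}

	for _, r := range s {
		sep := !unicode.IsLetter(r) && !unicode.IsDigit(r)
		// Start a new token when the character class changes
		// or when the pending run has reached the length limit.
		if len(cur) > 0 && (sep != curSep || (maxLen > 0 && len(cur) >= maxLen)) {
			flush()
		}
		curSep = sep
		cur = append(cur, r)
	}
	flush()
	return tokens
}

func main() {
	// Print only non-separator tokens, with a max token length of 5 runes.
	for _, t := range tokenize("hello, wonderful world", 5) {
		if t.IsSeparator {
			continue
		}
		fmt.Println(t.Text)
	}
}
```

With a limit of 5, "wonderful" is emitted as "wonde" and "rful", and the comma and spaces are dropped because the caller skips tokens marked as separators. Marking separators rather than discarding them inside the tokenizer leaves that choice to the consumer.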