Package tokenizer is a Go library for multilingual tokenization. It is based on the blevesearch segment package, whose implementation follows the description in Unicode Standard Annex #29.
```shell
go get github.com/liuzl/tokenizer
```
```go
package main

import (
	"fmt"

	"github.com/liuzl/tokenizer"
)

func main() {
	c := `Life is like a box of chocolates. You never know what you're gonna get.`
	ret := tokenizer.Tokenize(c)
	for _, term := range ret {
		fmt.Println(term)
	}
}
```
- Segments UTF-8 strings as described in Unicode Standard Annex #29.
- Handles English contractions (e.g. "you're").
- Handles English possessives (e.g. "Forrest's").
- Handles numbers with units (e.g. "10km").
- Converts SBC (full-width) characters to DBC (half-width) characters.
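The SBC-to-DBC conversion in the last feature can be sketched in Go. The `toDBC` helper below is a hypothetical illustration of the idea, not this package's API: full-width code points in U+FF01..U+FF5E are shifted down to their ASCII counterparts, and the ideographic space U+3000 becomes a regular space.

```go
package main

import (
	"fmt"
	"strings"
)

// toDBC converts SBC (full-width) characters to their DBC (half-width)
// equivalents. Hypothetical helper for illustration only; the package's
// internal implementation may differ.
func toDBC(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 0x3000: // ideographic space → ASCII space
			return ' '
		case r >= 0xFF01 && r <= 0xFF5E: // full-width ASCII variants
			return r - 0xFEE0
		}
		return r // leave everything else untouched
	}, s)
}

func main() {
	fmt.Println(toDBC("Ｈｅｌｌｏ，　１２３")) // → Hello, 123
}
```

Normalizing full-width forms this way before segmentation lets the same word-boundary rules apply to mixed-width text.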
This package is licensed under the Apache License 2.0.