Multilingual Tokenizer

Introduction

Package tokenizer is a Go library for multilingual tokenization. It is based on the segment package from blevesearch, which implements the text segmentation rules described in Unicode Standard Annex #29.

Usage

go get github.com/liuzl/tokenizer

package main

import (
    "fmt"

    "github.com/liuzl/tokenizer"
)

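func main() {
    // Sample input containing a contraction ("you're").
    c := `Life is like a box of chocolates. You never know what you're gonna get.`
    // Tokenize splits the text into terms.
    ret := tokenizer.Tokenize(c)
    // Print one term per line.
    for _, term := range ret {
        fmt.Println(term)
    }
}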

Implementation Details

Segments UTF-8 strings as described in Unicode Standard Annex #29.
Handles English contractions.
Handles English possessives.
Handles numbers with units.
Converts SBC case (full-width) characters to DBC case (half-width).
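
The width conversion in the last item can be illustrated with a minimal sketch. It is not the package's actual code: toHalfWidth is a hypothetical helper written for this README, and it assumes that "SBC to DBC" means mapping the full-width forms U+FF01 to U+FF5E onto their ASCII counterparts (an offset of 0xFEE0) and the ideographic space U+3000 onto an ordinary space.

```go
package main

import (
	"fmt"
	"strings"
)

// toHalfWidth is a hypothetical helper (not part of the tokenizer package).
// It maps full-width ASCII variants to their half-width equivalents and
// leaves every other rune unchanged.
func toHalfWidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 0x3000: // ideographic (full-width) space
			return ' '
		case r >= 0xFF01 && r <= 0xFF5E: // full-width '!' through '~'
			return r - 0xFEE0
		}
		return r
	}, s)
}

func main() {
	fmt.Println(toHalfWidth("Ｇｏ　１．２１")) // prints "Go 1.21"
}
```

Normalizing width before segmentation means that full-width and half-width spellings of the same word or number produce the same terms. Whether the package applies the conversion before or after segmentation is not stated here.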

License

This package is licensed under the Apache License 2.0.