Multilingual Tokenizer

Introduction

Package tokenizer is a Go library for multilingual tokenization. It is built on the segment package from blevesearch, which implements the text segmentation rules described in Unicode Standard Annex #29.

Usage

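Install the package with go get:

go get github.com/liuzl/tokenizer

The following program tokenizes an English sentence and prints each term on its own line:

package main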

import (
    "fmt"

    "github.com/liuzl/tokenizer"
)

func main() {
    c := `Life is like a box of chocolates. You never know what you're gonna get.`
    // Tokenize splits the text into terms.
    ret := tokenizer.Tokenize(c)
    // Print one term per line.
    for _, term := range ret {
        fmt.Println(term)
    }
}

Implementation Details

  1. Segment the UTF-8 string as described in Unicode Standard Annex #29.
  2. Handle English contractions.
  3. Handle English possessives.
  4. Handle numbers with units.
  5. Convert SBC case (full-width) characters to DBC case (half-width).
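
To make steps 3 and 5 more concrete, here is a minimal sketch using only the Go standard library. The function names toHalfWidth and splitPossessive are hypothetical and are not part of this package's API; the package's actual rules and output may differ. The sketch assumes that full-width conversion means mapping U+FF01..U+FF5E and the ideographic space U+3000 to their ASCII counterparts, and that a possessive is split into the stem and its 's suffix.

```go
package main

import (
	"fmt"
	"strings"
)

// toHalfWidth is a hypothetical helper (not the package's API). It maps
// full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space
// (U+3000) to their half-width equivalents and leaves other runes alone.
func toHalfWidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 0x3000:
			return ' '
		case r >= 0xFF01 && r <= 0xFF5E:
			// Full-width forms sit at a fixed offset from ASCII.
			return r - 0xFEE0
		}
		return r
	}, s)
}

// splitPossessive is a hypothetical helper (not the package's API). It
// separates a trailing 's (straight or curly apostrophe) from a word and
// returns the word unchanged when no such suffix is present.
func splitPossessive(word string) []string {
	for _, suffix := range []string{"'s", "’s"} {
		stem := strings.TrimSuffix(word, suffix)
		if stem != word && stem != "" {
			return []string{stem, suffix}
		}
	}
	return []string{word}
}

func main() {
	fmt.Println(toHalfWidth("ＡＢＣ　１２３")) // prints "ABC 123"
	fmt.Println(splitPossessive("Tom's"))  // prints "[Tom 's]"
}
```

Running the conversion before any further token handling would let later rules match ASCII punctuation and digits only, which is one plausible reason to normalize width at all; the order in which the package applies its steps is not stated here.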

License

This package is licensed under the Apache License 2.0.