Word segmentation
Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like "thisisatest". Humans can immediately identify that the correct phrase should be "this is a test".
Source and credits
This package is heavily inspired by the Python module grantjenks/wordsegment.
Copyright (c) 2015 by Grant Jenks under the Apache 2 license
The package is based on code from the chapter Natural Language Corpus Data by Peter Norvig from the book Beautiful Data (Segaran and Hammerbacher, 2009).
Copyright (c) 2008-2009 by Peter Norvig
Getting started
You can grab this package with the following command:
go get gopkg.in/antoineaugusti/wordsegmentation.v0
Usage
To use the default English corpus:
package main

import (
	"fmt"

	"github.com/antoineaugusti/wordsegmentation"
	"github.com/antoineaugusti/wordsegmentation/corpus"
)

func main() {
	// Load the default English corpus, which is built from the bundled TSV files
	englishCorpus := corpus.NewEnglishCorpus()
	fmt.Println(wordsegmentation.Segment(englishCorpus, "thisisatest"))
}
Unigrams and bigrams
Information: an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
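As a small illustration of the definition above, here is a sketch that lists the unigrams and bigrams of a short phrase. The `ngrams` helper is hypothetical and is not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns every contiguous run of n words from the input slice.
// It is an illustrative helper, not a function of this package.
func ngrams(words []string, n int) []string {
	if n <= 0 || n > len(words) {
		return nil
	}
	out := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

func main() {
	words := strings.Fields("this is a test")
	fmt.Printf("%q\n", ngrams(words, 1)) // ["this" "is" "a" "test"]
	fmt.Printf("%q\n", ngrams(words, 2)) // ["this is" "is a" "a test"]
}
```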
This package ships with an English corpus by default that is ready to use. Data files are derived from the Google web trillion word corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
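To give an intuition for how unigram counts can drive segmentation, here is a minimal sketch in the spirit of Norvig's chapter: try every split point, score each candidate word by its log probability, and keep the best total. It is not this package's implementation. The `toyCounts` table, the unknown-word penalty, and all names are assumptions made for illustration, and bigrams are ignored entirely.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// toyCounts is a hypothetical, hand-written unigram table (not the real corpus).
var toyCounts = map[string]float64{
	"this": 50, "is": 80, "a": 100, "test": 20, "at": 30, "his": 10,
}

const toyTotal = 290.0 // sum of the counts in toyCounts

const maxWordLen = 20 // longest candidate word considered

// result holds a segmentation and its total log10 probability.
type result struct {
	words []string
	score float64
}

// score returns the log10 probability of a word under toyCounts.
// Unknown words receive a penalty that grows with their length.
func score(word string) float64 {
	if c, ok := toyCounts[word]; ok {
		return math.Log10(c / toyTotal)
	}
	return math.Log10(10.0 / (toyTotal * math.Pow(10, float64(len(word)))))
}

// segment returns the highest-scoring split of text, memoizing sub-results.
// It slices by bytes, so it assumes ASCII input.
func segment(text string, memo map[string]result) result {
	if text == "" {
		return result{}
	}
	if r, ok := memo[text]; ok {
		return r
	}
	best := result{score: math.Inf(-1)}
	for i := 1; i <= len(text) && i <= maxWordLen; i++ {
		head, tail := text[:i], text[i:]
		rest := segment(tail, memo)
		if total := score(head) + rest.score; total > best.score {
			best = result{words: append([]string{head}, rest.words...), score: total}
		}
	}
	memo[text] = best
	return best
}

func main() {
	r := segment("thisisatest", map[string]result{})
	fmt.Println(strings.Join(r.words, " ")) // prints "this is a test"
}
```

With a real corpus the table would hold hundreds of thousands of entries, and bigram counts would let the score of a word depend on the word before it.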
Using a custom corpus
To use a custom corpus, implement the Corpus interface and pass your implementation to the Segment function.
The interface is as follows:
// Corpus gives access to the bigrams, the unigrams,
// the total number of words in the corpus,
// and a function to clean a string.
type Corpus interface {
	Bigrams() *models.Bigrams
	Unigrams() *models.Unigrams
	Total() float64
	Clean(string) string
}
Take a look at the English corpus source code to help you start!
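As a rough starting point, here is a sketch of what the Total and Clean methods of a custom corpus might look like. The `toyCorpus` type and its `counts` field are hypothetical stand-ins. The sketch deliberately leaves out Bigrams and Unigrams, because building `models.Bigrams` and `models.Unigrams` values depends on the package's own types, so it does not satisfy the full Corpus interface on its own. The cleaning rule (lowercase, keep only letters and digits) is an assumption based on how the bundled data is described.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toyCorpus is a hypothetical stand-in, not a type from this package.
type toyCorpus struct {
	counts map[string]float64 // unigram counts keyed by word
}

// Total returns the sum of all unigram counts.
func (c toyCorpus) Total() float64 {
	var sum float64
	for _, n := range c.counts {
		sum += n
	}
	return sum
}

// Clean lowercases s and drops every rune that is not a letter or a digit.
func (c toyCorpus) Clean(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	c := toyCorpus{counts: map[string]float64{"this": 50, "is": 80}}
	fmt.Println(c.Clean("This is, a TEST!")) // prints "thisisatest"
	fmt.Println(c.Total())                   // prints 130
}
```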
Documentation
The documentation for this package can be found on GoDoc. Here is a list of the different packages:
- corpus: the default English corpus
- helpers: small functions to get the length of a string, remove special characters from a string, and get the minimum of two integers
- models: the various objects used (Unigrams, Bigrams, Arrangement, Candidate, Possibility)
- parsers: parsers that read tab-separated files into Unigrams and Bigrams
- segment: the 'main' package
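For readers preparing their own data, here is a sketch of reading tab-separated count data into a map. It does not use the parsers package. The `key<TAB>count` line layout and the `parseCounts` and `parseCountsString` names are assumptions for illustration, so check the bundled TSV files and the parsers documentation for the actual format.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// parseCounts reads lines of the assumed form "key<TAB>count" into a map.
// Empty lines are skipped; any other malformed line is an error.
func parseCounts(r io.Reader) (map[string]float64, error) {
	counts := make(map[string]float64)
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" {
			continue
		}
		fields := strings.Split(line, "\t")
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		n, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, fmt.Errorf("bad count in line %q: %w", line, err)
		}
		counts[fields[0]] = n
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return counts, nil
}

// parseCountsString is a convenience wrapper for in-memory data.
func parseCountsString(s string) (map[string]float64, error) {
	return parseCounts(strings.NewReader(s))
}

func main() {
	counts, err := parseCountsString("the\t100\nof the\t40\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["the"], counts["of the"]) // prints "100 40"
}
```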