Word segmentation

Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like "thisisatest". Humans can immediately identify that the correct phrase should be "this is a test".

Source and credits

This package is heavily inspired by the Python module grantjenks/wordsegment.

The package is based on code from the chapter "Natural Language Corpus Data" by Peter Norvig, in the book Beautiful Data (Segaran and Hammerbacher, 2009).

Getting started

Install the package with the following command:

go get gopkg.in/antoineaugusti/wordsegmentation.v0

Usage

To use the default English corpus:

package main

import (
    "fmt"

    "github.com/antoineaugusti/wordsegmentation"
    "github.com/antoineaugusti/wordsegmentation/corpus"
)

func main() {
    // Load the default English corpus, which is built from TSV files
    englishCorpus := corpus.NewEnglishCorpus()
    fmt.Println(wordsegmentation.Segment(englishCorpus, "thisisatest"))
}

Unigrams and bigrams

Note: an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Here the items are words, so a unigram is a single word and a bigram is a pair of consecutive words.
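To make the definition concrete, here is a minimal sketch that lists the unigrams and bigrams of a short phrase. The `ngrams` helper is hypothetical and written only for this illustration; it is not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams is a hypothetical helper (not part of this package).
// It returns every contiguous run of n words from words, joined by spaces.
// It returns nil when n is not positive or is larger than len(words).
func ngrams(words []string, n int) []string {
	if n <= 0 || n > len(words) {
		return nil
	}
	out := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

func main() {
	words := strings.Fields("this is a test")
	fmt.Printf("%q\n", ngrams(words, 1)) // ["this" "is" "a" "test"]
	fmt.Printf("%q\n", ngrams(words, 2)) // ["this is" "is a" "a test"]
}
```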

This package ships with a ready-to-use English corpus. Its data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This package contains only a subset of that data: the unigram data includes only the 333,000 most common words, and the bigram data includes only the 250,000 most common phrases. Every word and phrase is lowercased, with punctuation removed.
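To give a feel for how word counts can drive segmentation, here is a simplified sketch of the unigram idea from Norvig's chapter: try every split point, score each candidate word by its relative frequency, and keep the best-scoring split. This is not this package's implementation. The counts, the total, the names `logProb` and `segment`, and the length penalty for unknown words are all assumptions made for this example, and bigram data is ignored entirely.

```go
package main

import (
	"fmt"
	"math"
)

// Toy unigram counts, invented for this example only.
var counts = map[string]float64{
	"this": 50, "is": 80, "a": 100, "test": 20, "at": 30, "his": 10,
}

// Hypothetical total number of words in the toy corpus.
const total = 1000.0

// Longest candidate word considered, to bound the search.
const maxWordLen = 20

// logProb returns the base-10 log probability of a single word.
// Unknown words get a small probability that shrinks with their length.
func logProb(word string) float64 {
	if c, ok := counts[word]; ok {
		return math.Log10(c / total)
	}
	return math.Log10(10/total) - float64(len(word))
}

// result holds the best segmentation found for a piece of text.
type result struct {
	words []string
	score float64
}

// segment returns the highest-scoring split of text into words.
// memo caches results per suffix so each suffix is solved once.
// The text is sliced by bytes, so this sketch assumes ASCII input.
func segment(text string, memo map[string]result) result {
	if text == "" {
		return result{}
	}
	if r, ok := memo[text]; ok {
		return r
	}
	best := result{score: math.Inf(-1)}
	for i := 1; i <= len(text) && i <= maxWordLen; i++ {
		head, tail := text[:i], text[i:]
		rest := segment(tail, memo)
		score := logProb(head) + rest.score
		if score > best.score {
			words := append([]string{head}, rest.words...)
			best = result{words: words, score: score}
		}
	}
	memo[text] = best
	return best
}

func main() {
	r := segment("thisisatest", map[string]result{})
	fmt.Println(r.words) // [this is a test]
}
```

Working with log probabilities keeps the scores from underflowing when many small probabilities are multiplied together.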

Using a custom corpus

If you want to use a custom corpus, implement the Corpus interface and pass your implementation to the Segment function.

The interface is as follows:

// Corpus gives access to the bigrams, the unigrams and
// the total number of words in the corpus, and provides
// a function to clean a string.
type Corpus interface {
    Bigrams() *models.Bigrams
    Unigrams() *models.Unigrams
    Total() float64
    Clean(string) string
}

Take a look at the English corpus source code to help you start!
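As an illustration of what a Clean method could look like, here is a minimal sketch that lowercases its input and keeps only letters and digits. It is an assumption about reasonable behaviour, not a copy of the English corpus implementation; the standalone `clean` function is hypothetical, and whether digits should be kept depends on your corpus.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// clean is a hypothetical stand-in for a Corpus Clean method.
// It lowercases s and drops every rune that is not a letter or a digit,
// which also removes spaces and punctuation.
func clean(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(clean("This is, a TEST!")) // thisisatest
}
```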

Documentation

The documentation of this package can be found on GoDoc. It covers the following packages:

corpus - the default English corpus
helpers - small functions that return the length of a string, remove special characters from a string, and return the minimum of two integers
models - the various objects used (Unigrams, Bigrams, Arrangement, Candidate, Possibility)
parsers - parsers to read tab-separated files into Unigrams and Bigrams
segment - the 'main' package