WordSegmenter

A linear chain conditional random field model for word segmentation based on syllables

TABLE OF CONTENTS

1. About
2. Quickstart
3. Introduction
4. Compiling
5. Training
6. Testing
7. Additional remarks


1. ABOUT
   WordSegmenter is a word segmentation system that identifies
   the sequence of syllables in English words. A common application
   is the hyphenation of English words.

2. QUICKSTART
   Download this project and, on the command line, navigate to
   its root directory (the directory that contains this
   README.txt file).

   Then type:
     $> make all
     $> ./bin/WordSegmenter ./data/w.data

   Now start typing English words (without punctuation marks
   or spaces) and press Enter after each word. The hyphenated
   word will be displayed on the next line.

3. INTRODUCTION
   This program addresses the task of automatic syllabification
   of words. These words are represented using the BIO method of
   coding syllables, which relies on tagging each letter of a 
   word based on its position in the syllable. 
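
   As an illustration of the tagging idea, here is a minimal sketch.
   It is not taken from the project sources: the function names
   encodeBIO and decodeBIO are hypothetical, and it assumes that 'B'
   marks the first letter of a syllable and 'I' every other letter.
   The 'O' tag is left unused on the assumption that every letter
   belongs to some syllable. The tag set used by the program itself
   may differ.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the project sources.
// Turns a hyphenated word such as "hy-phen" into one tag per letter:
// 'B' for the first letter of a syllable, 'I' for any other letter.
std::string encodeBIO(const std::string& hyphenated) {
    std::string tags;
    bool startOfSyllable = true;
    for (char c : hyphenated) {
        if (c == '-') {              // a hyphen opens a new syllable
            startOfSyllable = true;
            continue;
        }
        tags += startOfSyllable ? 'B' : 'I';
        startOfSyllable = false;
    }
    return tags;
}

// Inverse direction: puts a hyphen before every 'B' except the first letter.
std::string decodeBIO(const std::string& word, const std::string& tags) {
    std::string out;
    for (std::size_t i = 0; i < word.size() && i < tags.size(); ++i) {
        if (i > 0 && tags[i] == 'B') out += '-';
        out += word[i];
    }
    return out;
}
```

   Under these assumptions, encodeBIO("hy-phen") gives "BIBIII", and
   decodeBIO("hyphen", "BIBIII") gives back "hy-phen".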

   We use a linear-chain conditional random field model, which
   uses a large number of indicator functions that rely on (1) the
   letter sequences in a word, (2) the position of these letters
   in the word, and (3) the encoded tag sequence of syllables for
   the letter sequence.
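
   To make the notion of an indicator function concrete, here is a
   minimal sketch of one possible feature template. The struct
   IndicatorFeature and the function countActivations are
   hypothetical and are not the feature set used by the program.
   This simplified version looks only at a letter n-gram and a pair
   of adjacent tags, and leaves out the absolute position of the
   letters in the word.

```cpp
#include <cstddef>
#include <string>

// Hypothetical feature template, for illustration only.
// Fires when a given letter n-gram starts at position i and the tags
// at positions i-1 and i equal a given pair.
struct IndicatorFeature {
    std::string ngram;  // letter sequence to match, e.g. "ph"
    char prevTag;       // required tag at position i-1
    char currTag;       // required tag at position i

    // Returns 1 if the feature is active at position i, otherwise 0.
    int operator()(const std::string& word, std::size_t i,
                   char yPrev, char yCurr) const {
        if (yPrev != prevTag || yCurr != currTag) return 0;
        if (i + ngram.size() > word.size()) return 0;
        return word.compare(i, ngram.size(), ngram) == 0 ? 1 : 0;
    }
};

// Sums the feature over a whole tagged word. Position 0 is skipped
// because it has no previous tag in this simplified sketch.
int countActivations(const IndicatorFeature& f, const std::string& word,
                     const std::string& tags) {
    int total = 0;
    for (std::size_t i = 1; i < word.size() && i < tags.size(); ++i) {
        total += f(word, i, tags[i - 1], tags[i]);
    }
    return total;
}
```

   In a linear-chain model, the score of a tag sequence is a weighted
   sum of such counts, with one weight per indicator function.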

   This program evaluates two different methods to train the
   parameters of this model (Contrastive Divergence and the
   Collins Perceptron) using a subset of the CELEX European lexical
   resource data set.
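
   As a rough sketch of the perceptron-style training step, the
   update below adds the feature counts of the correct tag sequence
   and subtracts those of the sequence the current model predicts.
   The names FeatureCounts and perceptronUpdate, the string feature
   keys and the rate parameter are assumptions made for this
   example; the actual training code may be organised differently.
   Contrastive Divergence has the same general shape, except that
   the subtracted counts would come from a sampled tag sequence
   rather than from the highest-scoring one.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical representation: feature name -> count (or weight).
using FeatureCounts = std::unordered_map<std::string, double>;

// One perceptron step: w += rate * (F(x, gold) - F(x, predicted)),
// where 'predicted' holds the feature counts of the tag sequence the
// current weights score highest. If the prediction equals the gold
// sequence, the two terms cancel and the weights stay unchanged.
void perceptronUpdate(FeatureCounts& weights,
                      const FeatureCounts& gold,
                      const FeatureCounts& predicted,
                      double rate = 1.0) {
    for (const auto& kv : gold) {
        weights[kv.first] += rate * kv.second;
    }
    for (const auto& kv : predicted) {
        weights[kv.first] -= rate * kv.second;
    }
}
```

   Features that appear in both sequences receive no net change, so
   only the features on which the prediction and the correct
   labelling disagree are adjusted.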

   During evaluation, the best word-level accuracy achieved using
   this model on a separate test set is 85.69%. When evaluated
   on the training data set, the word-level accuracy achieved
   is 94.3%.
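
   Word-level accuracy is read here as the fraction of words whose
   entire segmentation is correct. A minimal sketch of that metric,
   assuming each word's labelling is stored as one string, follows.
   The function name wordLevelAccuracy is hypothetical, and the Test
   binary may compute its figures differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: fraction of words whose predicted labelling
// matches the gold labelling exactly. Returns 0.0 for empty input or
// for lists of different length.
double wordLevelAccuracy(const std::vector<std::string>& gold,
                         const std::vector<std::string>& predicted) {
    if (gold.empty() || gold.size() != predicted.size()) return 0.0;
    std::size_t correct = 0;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        if (gold[i] == predicted[i]) ++correct;
    }
    return static_cast<double>(correct) / static_cast<double>(gold.size());
}
```

   A single wrong tag makes the whole word count as an error, so
   this measure is stricter than a per-letter accuracy.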

   Further details are provided in the paper: 
   http://dl.dropbox.com/u/3091691/Papers/WordSyllabificationUsingCRF.pdf

4. COMPILING
   The program can be compiled using the Unix make utility (or gmake).
   Running 'make all' creates all the binary files in the 'bin' directory.
   Note that the program is sensitive to the names of the binaries, so
   they must always be named 'WordSegmenter', 'Train' and 'Test'.

5. TRAINING
   The CRF model can be trained by providing a training data file
   and specifying which method to use for training.
   Command:
     $> ./bin/Train <training data file> [c|d] <output file>
   training data file: This is a file that contains words and their
                       correct output labels. A sample training
                       data file is provided in the 'data' 
                       directory, called 'train_set_60k.data'. This
                       file was obtained after pre-processing the
                       CELEX dataset, as explained in the paper
                       whose link you can find at the end of this file.
   [c|d]: This specifies the training method to use. 'c' indicates
                       the Collins Perceptron, while 'd' indicates
                       Contrastive Divergence. Both of these are
                       methods to approximate the stochastic gradient
                       descent update to the model parameters as 
                       explained in the paper.
   output file: This specifies the file where the parameter values
                       will be written. This file is needed to perform
                       testing, and to run 'WordSegmenter'.

6. TESTING
   After training, the model can be tested with:
     $> ./bin/Test <test set file> <parameter file>
   test set file: A sample test set file is available in the 'data'
                       directory: 'test_set_6k.data'
   parameter file: This should be the file obtained after training.
                       The file 'data/w.data' holds the parameters
                       from our own training run.

7. ADDITIONAL REMARKS
   The hyphenation has high recall, but its precision still needs
   improvement. When you run WordSegmenter, you will find that it
   places hyphens correctly at the locations that need one, but in
   some cases it also places a few stray additional hyphens (false
   positives). In general, these tend to occur after the first letter.
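
   To show what precision and recall mean for hyphen placement, here
   is a minimal sketch that compares a predicted hyphenation with
   the correct one. The names hyphenPositions, PrecisionRecall and
   scoreHyphens are hypothetical, and returning 1.0 when there are
   no hyphens to compare is a convention chosen for this example.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: letter indices before which a hyphen occurs,
// e.g. "hy-phen" -> {2}.
std::set<std::size_t> hyphenPositions(const std::string& hyphenated) {
    std::set<std::size_t> positions;
    std::size_t letters = 0;
    for (char c : hyphenated) {
        if (c == '-') positions.insert(letters);
        else ++letters;
    }
    return positions;
}

struct PrecisionRecall {
    double precision;  // correct predicted hyphens / all predicted hyphens
    double recall;     // correct predicted hyphens / all gold hyphens
};

// Compares the hyphen positions of one predicted word with the gold word.
PrecisionRecall scoreHyphens(const std::string& gold,
                             const std::string& predicted) {
    const std::set<std::size_t> g = hyphenPositions(gold);
    const std::set<std::size_t> p = hyphenPositions(predicted);
    std::size_t truePositives = 0;
    for (std::size_t pos : p) {
        if (g.count(pos) > 0) ++truePositives;
    }
    PrecisionRecall result;
    result.precision = p.empty() ? 1.0
        : static_cast<double>(truePositives) / static_cast<double>(p.size());
    result.recall = g.empty() ? 1.0
        : static_cast<double>(truePositives) / static_cast<double>(g.size());
    return result;
}
```

   For a gold hyphenation "hy-phen" and a prediction "h-y-phen" (a
   stray hyphen after the first letter), this gives a recall of 1.0
   and a precision of 0.5.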

   A further description of the tool can be found here:
   http://dl.dropbox.com/u/3091691/Papers/WordSyllabificationUsingCRF.pdf

   To hyphenate many words, pass them to the standard input of the
   program, one word per line. Alternatively, pass a single word as
   the second argument:
     $> ./bin/WordSegmenter ./data/w.data yourwordgoeshere