/sbd-maxent

Sentence Boundary Detector using a Maximum Entropy model

Primary Language: Python

Build and Usage guide
    Building and installing this program is somewhat complex, and it will only run on a Unix-based operating system.  The following tools must be installed before running make:

    * Python, development version (2.4.x, 2.5.x, or 2.6.x)
    * enchant (C library; `brew install enchant` on OS X)

All source code can be downloaded from the sbd-maxent GitHub repository at:
        git@github.com:hoylemd/sbd-maxent
Once all dependencies have been installed, cd into the sbd-maxent directory and run:
        $ make install
This will install all programs and copy the C++-based tools into the classifier directory.
Settings in the included makefile control what the four main commands do.  These commands are “train”, “test”, “demo”, and “execute”.  Each of them can be run with:
        $ make <command>

The “train” command generates a model.  The model is based on the corpus file pointed to by the ${Corpus} variable in the makefile, and is stored at the location pointed to by the ${modelName} variable.  By default, the training script does not use the entire corpus for training.  It begins by splitting the corpus into train, test, and execute samples.  The size of the train sample is set by the ${trainSize} variable.  The ${remainderSize} variable should be set to the number of sentences left in the corpus after the training sample has been removed.  The ${executeSample} variable sets the number of sentences set aside from the test sample for execution; this is meant for demonstrating the execute module.  The ${trainSize} variable can be set to any number less than or equal to the length of the corpus, but training times increase exponentially, with very little gain in accuracy above 5000 sentences.
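
The relationship between the three size variables can be illustrated with a minimal Python sketch.  This is not the project's training script: the function name split_corpus and the order in which the samples are sliced (train first, execute taken from the end of the remainder) are assumptions made only for illustration.

```python
def split_corpus(sentences, train_size, remainder_size, execute_size):
    """Split a list of sentences into (train, test, execute) samples.

    Hypothetical helper: the parameters mirror ${trainSize},
    ${remainderSize}, and ${executeSample} from the makefile.
    """
    if train_size + remainder_size > len(sentences):
        raise ValueError("trainSize + remainderSize exceeds the corpus length")
    if execute_size > remainder_size:
        raise ValueError("executeSample exceeds remainderSize")

    # The training sample is removed from the corpus first.
    train = sentences[:train_size]
    remainder = sentences[train_size:train_size + remainder_size]

    # Assumption: the execute sample is set aside from the end of the
    # remainder; whatever is left over becomes the test sample.
    cut = remainder_size - execute_size
    test = remainder[:cut]
    execute = remainder[cut:]
    return train, test, execute


# A toy 10-sentence corpus: 5 for training, 5 left over, 2 of those for execute.
corpus = ["Sentence %d." % i for i in range(10)]
train, test, execute = split_corpus(corpus, 5, 5, 2)
print("%d train, %d test, %d execute" % (len(train), len(test), len(execute)))
```

With these toy values, ${remainderSize} is 10 - 5 = 5, and the test sample is what remains of those 5 sentences once the execute sample is set aside.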

The “test” command uses the model generated by the train command to attempt to disambiguate all of the sentences in the test corpus.  A report is then generated at the location given by ${results}.  This command requires that the train command has been run first.

The “demo” command uses the generated model to attempt disambiguation on the text pointed to by ${demoSample}.  It writes the disambiguated file to ${output}.  This command requires that the train command has been run first.

The “execute” command is used to apply a model to a text for disambiguation.  It will use the file found at ${Input}, and will output a disambiguated file to ${output}.

Lastly, the “tidy” command can be used to clean up the directory.  It removes all files specific to any particular run of the system, while leaving the model and all installed components intact.  It should be run after a train command if no test or demo runs are to be made, or after the test or demo runs if they are.  This leaves the directory ready for later executions.
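
The ordering described above (train first, then test or demo, then tidy) can be scripted.  The following is a minimal sketch, assuming it is run from the sbd-maxent directory; the helper name run_targets and its runner parameter are hypothetical and not part of the project.

```python
import subprocess


def run_targets(targets, runner=subprocess.check_call):
    """Run each make target in order.

    With the default runner, subprocess.check_call raises
    CalledProcessError on a non-zero exit status, so a failed
    train step stops the sequence before test or tidy is attempted.
    """
    for target in targets:
        runner(["make", target])


# Dry run: record the commands instead of executing them.
planned = []
run_targets(["train", "test", "tidy"], runner=planned.append)
print(planned)
```

Calling run_targets(["train", "test", "tidy"]) without the runner argument would invoke make for real.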