# short-answer

Short Answer component

Based on: http://www.adampease.org/professional/GlobalWordNet2016.pdf

Build

Run "mvn package" and use the assembled jar (second fat jar) in your class path.

Demo

Run the Demo with the following (adapting the paths to your local structure):

java -Xmx7G -cp target/cobra-0.98.4-jar-with-dependencies.jar nlp.scripts.Demo /home/user/data/short-answer/short-answer-data/index /home/user/data/short-answer/short-answer-data/models question-classifier.pa770.ser questions.txt

This simple demo goes over the questions supplied in questions.txt (optional; otherwise it uses a small static set of questions), which contains one question per line, e.g. "What is the estimated population of Egypt?"

See below for how to obtain the needed data files, or how to train the models and create the needed supporting text files yourself.

Index

The candidate sentences that potentially contain answers to questions, a.k.a. the knowledge base, reside in a Lucene index. This index is created by the IndexSentences script, which can be run like this:

java -Xmx7G -cp target/cobra-0.98.4-jar-with-dependencies.jar nlp.scripts.IndexSentences [corpus-path] [index-path]

where corpus-path is a path to a directory containing text files with sentences to be indexed into the knowledge base, and index-path is a path to a non-existent directory that will contain the Lucene index and be used as the knowledge base.

A sample index can be found in the models dir in the repository.
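
For reference, here is a minimal sketch of what the indexing step amounts to, assuming standard Lucene APIs; the "sentence" field name and class name are illustrative assumptions, not the actual IndexSentences implementation:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

/** Sketch: index one sentence per line from every .txt file under a corpus dir. */
public class IndexSentencesSketch {
    public static void main(String[] args) throws IOException {
        Path corpus = Paths.get(args[0]);   // [corpus-path]
        Path index = Paths.get(args[1]);    // [index-path], must not exist yet
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(index), config);
             Stream<Path> files = Files.walk(corpus)) {
            files.filter(f -> f.toString().endsWith(".txt")).forEach(f -> {
                try {
                    for (String sentence : Files.readAllLines(f)) {
                        Document doc = new Document();
                        // "sentence" is an assumed field name; the real script may differ.
                        doc.add(new TextField("sentence", sentence, Field.Store.YES));
                        writer.addDocument(doc);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```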

Models

There are a few static models and one trained model. The static models, used mainly for featurization, are: Brown clusters, word vectors, and gazetteers. They can all be found in the models dir in the repository.

The trained model is the question classifier, which uses all of those features.

NOTE: glove.6B.50d.txt.gz is not included because of its size and should be downloaded from: http://nlp.stanford.edu/projects/glove/
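
The GloVe file is plain text: each line is a token followed by its 50 space-separated vector components. A minimal sketch of loading it into memory (class and method names here are illustrative, not the component's actual API):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Sketch: load glove.6B.50d.txt.gz into a word -> vector map. */
public class GloveLoaderSketch {
    public static Map<String, float[]> load(String path) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Paths.get(path)))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // GloVe line format: "<token> v1 v2 ... v50", space-separated.
                String[] parts = line.split(" ");
                float[] vec = new float[parts.length - 1];
                for (int i = 1; i < parts.length; i++) {
                    vec[i - 1] = Float.parseFloat(parts[i]);
                }
                vectors.put(parts[0], vec);
            }
        }
        return vectors;
    }
}
```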

Question Classifier

A pre-trained classifier is located inside the models dir in the repo, but its name has to be supplied to the demo script separately (just the name of the file within the models dir). During training, multiple versions of the classifier exist inside the models dir, so the demo asks for the particular version to be used.

In order to train the classifier, use the TrainQuestionClassifier script:

java -Xmx7G -cp target/cobra-0.98.4-jar-with-dependencies.jar nlp.scripts.TrainQuestionClassifier [models-output-path] [questions-train-test-set]

The models output path is where all versions of the classifier will be written. The dataset in the questions-train-test-set dir should reside in two folders, train and test; the files inside each should be formatted in the following way:

DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?
ENTY:animal What fowl grabs the spotlight after the Chinese Year of the Monkey ?
ABBR:exp What is the full form of .com ?

according to: http://cogcomp.cs.illinois.edu/Data/QA/QC/
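
Each line starts with a label of the form COARSE:fine, followed by the tokenized question; the part before the colon is the gross (coarse) label and the whole tag is the fine label. A minimal, illustrative sketch of splitting a line (not the trainer's actual parser):

```java
/** Sketch: split one TREC-format line into gross/fine label and question text. */
public class QuestionLabelSketch {
    public static void main(String[] args) {
        String line = "DESC:manner How did serfdom develop in and then leave Russia ?";
        int space = line.indexOf(' ');
        String fineLabel = line.substring(0, space);   // e.g. "DESC:manner"
        String grossLabel = fineLabel.split(":")[0];   // e.g. "DESC"
        String question = line.substring(space + 1);   // the tokenized question
        System.out.println(grossLabel + " / " + fineLabel + " -> " + question);
    }
}
```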

Testing the classifier

We were unable to reproduce the exact question classification results cited in the paper (http://www.adampease.org/professional/GlobalWordNet2016.pdf), but nonetheless came close: we measure 83% accuracy on fine and 89% on gross.

The classifier can be tested using the TestQuestionsClassifier script:

java -Xmx7G -cp target/cobra-0.98.4-jar-with-dependencies.jar nlp.scripts.TestQuestionClassifier [models-path] [classifier-name] [questions-data-path] [type=gross/fine]

This outputs the accuracy on the test set inside the questions-data-path dir. The models-path and classifier-name are as in the previous sections. Type is simply the string "gross" or "fine", indicating at which label granularity to test the classifier.

Alternative Ant-based Build

cd ~
echo "export SIGMA_SRC=/workspace/sigmakee" >> .bashrc
echo "export CORPORA=/corpora" >> .bashrc
source .bashrc
cd ~/workspace/
git clone https://github.com/ontologyportal/sigmanlp
wget 'http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip'
unzip stanford-corenlp-full-2015-12-09.zip
rm stanford-corenlp-full-2015-12-09.zip
cd ~/Programs/stanford-corenlp-full-2015-12-09/
unzip stanford-corenlp-3.6.0-models.jar
cp ~/Programs/stanford-corenlp-full-2015-12-09/stanford-corenlp-3.6.0.jar ~/workspace/short-answer/lib
cp ~/Programs/stanford-corenlp-full-2015-12-09/stanford-corenlp-3.6.0-models.jar ~/workspace/short-answer/lib
cd ~/workspace/short-answer/models
wget 'http://nlp.stanford.edu/data/glove.6B.zip'
unzip glove.6B.zip
java -Xmx9G -cp /workspace/short-answer/build/classes:/workspace/short-answer/build/lib/* nlp.scripts.Demo -t