This package provides a suite of tools for evaluating question-answering systems: the Watson Engagement Advisor (WEA), Solr, and the Natural Language Classifier (NLC)
You can install directly from github using the following command
pip install git+https://github.com/cognitive-catalyst/themis.git
You can also install from source by running
python setup.py install
Here is a step-by-step example of a typical Themis experiment.
See themis --help
for more details.
Download the corpus and the truth files from the Watson Experience Manager (XMGR) interface to WEA.
The corpus maps answer IDs to answer text. To download it, run the following command.
themis xmgr download-corpus XMGR-URL USERNAME PASSWORD
The XMGR-URL
points to an XMGR project instance, e.g. https://watson.ihost.com/instance/283/predeploy/$150dd167e4e
.
(Note that these may contain "$" characters that have to be escaped on the command line.)
This will create a corpus.csv
file.
The command may take multiple hours to run.
It saves intermediate state, so if it drops in the middle you can run it again and it will pick up where it left off.
Optionally you may specify a --retries
parameter which automatically restarts a specified number of times.
The truth maps answer IDs to questions they are known to answer. This is the information used to train the WEA instance and will be used to train the NLC model. To download the truth file, run the following command.
themis xmgr truth XMGR-URL USERNAME PASSWORD
This creates truth.json
and truth.csv
files.
The json file is a verbose archive of truth information, while the csv file is used in subsequent Themis commands.
Subsequent actions assume that the answer Ids referenced in the truth are all present in the corpus.
Sometimes this is not the case.
See themis xmgr validate-truth --help
for how to rectify this.
Download a usage report from XMGR.
The QuestionsData.csv
in the zip file contains records of the questions that were asked of Watson and the answers it
provided.
Use this to extract a set of questions that were asked to Watson along with the answers it gave.
themis question extract QuestionsData.csv > qa-pairs.csv
This file also records the number of times each question was asked.
Now we ask the questions in the test set to various Q&A systems and compare the answers they return.
First we "ask" questions to Watson by reading the answers out of the usage logs.
themis answer wea qa-pairs.csv answers.wea.csv qa-pairs.csv
The answers are written to answers.wea.csv
.
The qa-pairs.csv
file created in the previous step appears twice on the command line above because it is used as a
source of both questions and answers.
To ask questions to a Solr instance, import the corpus file using the example solr/schema.xml
.
themis answer solr qa-pairs.csv answers.solr.csv http://localhost:8983/solr/test
The answers are written to answers.solr.csv
.
To ask questions of the NLC we must first train a model using the truth file downloaded from XMGR as training data.
themis answer nlc train NLC-URL USERNAME PASSWORD truth.csv model-name
The NLC-URL
points to an NLC API endpoint,
e.g. https://gateway-s.watsonplatform.net/natural-language-classifier/api
.
This will return a model ID that can be used in subsequent commands.
Note that it may take some time to train the model before it is ready to be presented questions.
The Themis toolkit has utilities to manage NLC models.
See the help for more details.
After the model has been trained, you can submit questions to it.
themis answer nlc use NLC-URL USERNAME PASSWORD qa-pairs.csv answers.nlc.csv MODEL-ID corpus.csv
If the command to ask questions to either Solr or NLC fails you can rerun it and it will pick up where it left off.
A human annotator needs to judge whether the answers to the questions returned by the various systems are correct. The Annotation Assist tool provides a visual interface for doing this.
Annotation Assist takes corpus and question/answer pairs files as input. To generate the corpus file run
themis judge corpus corpus.csv > annotation-assist.corpus.json
To generate the question/answer pairs file run
themis judge pairs answers.wea.csv answers.solr.csv answers.nlc.csv > annotation-assist.pairs.csv
Annotation is a time consuming task, and it may be prohibitively difficult to annotate all system answers. You may optionally take a sample of questions with the following command
themis question sample qa-pairs.csv 1000 > sample.1000.csv
This will sample 1000 unique questions from the set of all questions in qa-pairs.csv
.
Questions are sampled from a distribution determined by the frequency with which they were asked in the usage logs.
The following command will generate annotation assist question/answer input for just these 1000 questions.
themis judge pairs --questions sample.1000.csv answers.wea.csv answers.solr.csv answers.nlc.csv > annotation-assist.pairs.csv
It is also possible to incorporate previous judgments. See the command help for details.
If the file has fields in TopAnswerText that are too long for the Annotation Assist system, you may truncate them with the themis util truncate-answers -h
command.
After annotating the answers you can download the results in a file: call it annotation-assist.judgments.csv
.
This can be converted into the format used by Themis with the following command.
themis judge interpret annotation-assist.judgments.csv > judgments.csv
The correctness judgments along with the question frequencies can then be used to plot precision and (receiver operating characteristic) ROC curves.
First collate the results form multiple systems into a single file using the following command.
themis analyze collate qa-pairs.csv answers.wea.csv answers.solr.csv answers.nlc.csv --labels WEA Solr NLC --judgments judgments.csv > collated.csv
The following command generates precision curve data for the WEA, Solr, and NLC systems from the collated data.
themis analyze plot precision collated.csv
This will create WEA.precision.csv
, Solr.precision.csv
, and NLC.precision.csv
files containing threshold values
and X and Y scatter plot values.
ROC curves can be generated with the roc
option in the place of precision
.
If you specify the --draw
option, the curves will be drawn.
See License.txt.