NLU-Evaluation-Scripts

Benchmark of Cognigy NLU based on Braun et al. 2017


Methodology

We reproduce the analysis from "Evaluating Natural Language Understanding Services for Conversational Question Answering Systems" by Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen (2017).

We use our own split for the Chatbot corpus, provided in TransportCorpusSplit.json, since the authors' split has not been published; this is the most transparent and reproducible approach. The Ask Ubuntu and Web Applications corpora use the split from the paper. Note that the results are therefore not directly comparable to the original paper.
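
For illustration, the snippet below shows one way such a reproducible split could be generated: a per-intent shuffle with a fixed seed. It assumes the NLU-Evaluation-Corpora JSON layout (a top-level `sentences` list whose entries carry `intent` and a boolean `training` flag), and the file names are placeholders. The split actually used for the benchmark is the one checked in as TransportCorpusSplit.json, not the output of this sketch.

```python
import json
import random
from collections import defaultdict

# Illustrative sketch only: the benchmark uses the split checked in as
# TransportCorpusSplit.json. This shows one reproducible way to build a
# comparable split, assuming the NLU-Evaluation-Corpora format (a
# top-level "sentences" list with "intent" and a boolean "training" flag).
def make_split(corpus_path, out_path, train_per_intent=50, seed=42):
    with open(corpus_path, encoding="utf-8") as f:
        corpus = json.load(f)

    by_intent = defaultdict(list)
    for sentence in corpus["sentences"]:
        by_intent[sentence["intent"]].append(sentence)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    for sentences in by_intent.values():
        rng.shuffle(sentences)
        # Mark the first train_per_intent sentences of each intent as training.
        for i, sentence in enumerate(sentences):
            sentence["training"] = i < train_per_intent

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(corpus, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    make_split("ChatbotCorpus.json", "MyChatbotSplit.json")
```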

We forked their NLU-Evaluation-Scripts; results are obtained by running the converter scripts, importing and setting up the respective bots, and finally running the analysis scripts.
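
Conceptually, the converter step boils down to separating each corpus into the utterances that are imported into the platform and those that are held out for testing. The minimal sketch below illustrates that, again assuming the NLU-Evaluation-Corpora field names; the actual converter scripts in the forked repo additionally emit the platform-specific import formats.

```python
import json

# Minimal sketch of the converter step's core logic (field names assumed
# from the NLU-Evaluation-Corpora format): separate the corpus into the
# training utterances that get imported into a bot platform and the test
# utterances that are later sent to the deployed bot.
def load_corpus(path):
    with open(path, encoding="utf-8") as f:
        corpus = json.load(f)
    train = [s for s in corpus["sentences"] if s["training"]]
    test = [s for s in corpus["sentences"] if not s["training"]]
    return train, test

if __name__ == "__main__":
    train, test = load_corpus("TransportCorpusSplit.json")
    print(f"{len(train)} training / {len(test)} test utterances")
    for sentence in train[:3]:
        print(f"  {sentence['intent']}: {sentence['text']}")
```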

We have benchmarked Microsoft LUIS and Google's Dialogflow as of November 2018.

| Corpus | Intents | Train | Test |
| --- | --- | --- | --- |
| Chatbot | 2 | 100 | 106 |
| Ask Ubuntu | 5 | 53 | 109 |
| Web Applications | 8 | 30 | 59 |

Results

The ZIP/JSON files needed to import the flows, as well as the full annotation and result files, are provided in the respective folders.

F1 scores for Cognigy NLU, LUIS, Dialogflow, and Watson, computed as in Braun et al. (a sketch of the scoring follows the table):

| Platform \ Corpus | Chatbot | Ask Ubuntu | Web Applications | Overall |
| --- | --- | --- | --- | --- |
| Cognigy NLU 2.0 | 0.97 | 0.91 | 0.92 | 0.93 |
| Dialogflow | 0.93 | 0.85 | 0.80 | 0.87 |
| LUIS | 0.98 | 0.90 | 0.81 | 0.91 |
| Watson | 0.97 | 0.92 | 0.83 | 0.92 |
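
The scoring follows the micro-averaged F1 used in Braun et al.: per test sentence, every expected intent or entity that is returned counts as a true positive, every missed one as a false negative, and every spurious one as a false positive. The sketch below illustrates that computation; the label encoding and the example annotations are made up for illustration, and the result files in this repo remain the authoritative numbers.

```python
# Sketch of micro-averaged F1 in the spirit of Braun et al.'s scoring.
# Each element of `annotations` pairs the expected labels (intent plus
# entities) of one test sentence with the labels the platform returned.
def micro_f1(annotations):
    tp = fp = fn = 0
    for expected, predicted in annotations:
        tp += len(expected & predicted)   # correctly returned labels
        fn += len(expected - predicted)   # expected but missing
        fp += len(predicted - expected)   # returned but not expected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: one sentence where the entity was missed,
# one where the intent was matched exactly.
print(micro_f1([
    ({"intent:DepartureTime", "entity:StationDest=Marienplatz"},
     {"intent:DepartureTime"}),
    ({"intent:FindConnection"}, {"intent:FindConnection"}),
]))  # -> 0.8
```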