NLU Evaluation Scripts

Methodology

We evaluate recall and macro F1 scores on a small and a large training set derived from the home automation bot dataset of "Benchmarking Natural Language Understanding Services for Building Conversational Agents" (2019). The data is available on GitHub.

The benchmark was conducted between 28th July and 4th August 2022.

All models were trained on a single fold, with the test set kept as a holdout dataset during the training phase.

Full predictions on the test dataset, with their confidence scores, are aligned with the ground-truth labels and provided in this folder.
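The metrics reported below can be sketched in pure Python. The label lists in the example are illustrative toy data, not taken from the benchmark:

```python
from collections import defaultdict

def macro_scores(y_true, y_pred):
    """Macro-averaged recall and F1 over intent labels.

    Each label contributes equally to the average, so rare intents
    weigh as much as frequent ones.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label got a false positive
            fn[truth] += 1  # true label got a false negative
    labels = set(y_true) | set(y_pred)
    recalls, f1s = [], []
    for lab in labels:
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        recalls.append(rec)
        f1s.append(f1)
    return sum(recalls) / len(labels), sum(f1s) / len(labels)

# Toy usage with hypothetical intent labels:
rec, f1 = macro_scores(
    ["alarm_query", "alarm_query", "calendar_query"],
    ["alarm_query", "alarm_set", "calendar_query"],
)
```

The same numbers can be obtained with `sklearn.metrics.recall_score` and `f1_score` using `average="macro"`.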

Disclaimer:

  • Google Cloud AutoML:
    • The benchmark results are based on the confidence threshold that yields the best F1 score.
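Threshold selection of this kind can be sketched as a sweep over the observed confidence values, treating any prediction below the threshold as a rejection. This is an illustrative reconstruction, not the exact procedure used by Google Cloud AutoML:

```python
def best_f1_threshold(preds):
    """Sweep confidence thresholds and return (threshold, micro_f1)
    for the threshold with the best micro-averaged F1.

    `preds` is a list of (truth, pred, confidence) tuples; predictions
    with confidence below the threshold count as rejections (misses).
    """
    best_t, best_f1 = 0.0, -1.0
    for t in sorted({conf for _, _, conf in preds}):
        tp = sum(1 for truth, pred, c in preds if c >= t and pred == truth)
        fp = sum(1 for truth, pred, c in preds if c >= t and pred != truth)
        fn = sum(1 for truth, pred, c in preds if c < t or pred != truth)
        f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy usage with hypothetical predictions:
t, f1 = best_f1_threshold([
    ("alarm_query", "alarm_query", 0.9),
    ("alarm_query", "alarm_set", 0.8),
    ("calendar_query", "calendar_query", 0.3),
])
```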

Small

640 Training Sentences - 10 Sentences per Intent

1076 Test Sentences

|            | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
|------------|----------|--------------|-----------------------|----------------|
| Recall     | 0.867    | 0.782        | 0.789                 | 0.725          |
| F1 (Macro) | 0.870    | 0.799        | 0.789                 | 0.700          |

Example sentences on Small models

Query: is there anything i need to be aware of

Ground Truth: calendar_query

|               | Sprinklr       | Google Cloud     | Azure Language Studio | AWS Comprehend  |
|---------------|----------------|------------------|-----------------------|-----------------|
| Intent (Pred) | calendar_query | general_dontcare | general_dontcare      | calendar_remove |
| Confidence    | 0.73           | 0.15             | 0.49                  | 0.09            |

Query: is there an alarm at four am

Ground Truth: alarm_query

|               | Sprinklr    | Google Cloud | Azure Language Studio | AWS Comprehend |
|---------------|-------------|--------------|-----------------------|----------------|
| Intent (Pred) | alarm_query | alarm_set    | alarm_set             | alarm_set      |
| Confidence    | 0.7         | 0.96         | 1.0                   | 0.27           |

Large

1908 Training Sentences - ~30 Sentences per Intent

5518 Test Sentences

|            | Sprinklr | Google Cloud | Azure Language Studio | AWS Comprehend |
|------------|----------|--------------|-----------------------|----------------|
| Recall     | 0.901    | 0.836        | 0.860                 | 0.876          |
| F1 (Macro) | 0.903    | 0.862        | 0.860                 | 0.867          |

Example sentences on Large models

Query: how many countries are in the European Union

Ground Truth: qa_factoid

|               | Sprinklr   | Google Cloud | Azure Language Studio | AWS Comprehend |
|---------------|------------|--------------|-----------------------|----------------|
| Intent (Pred) | qa_factoid | qa_currency  | qa_maths              | general_quirky |
| Confidence    | 0.72       | 0.42         | 0.34                  | 0.30           |

Query: let's go through all pending reminders

Ground Truth: calendar_query

|               | Sprinklr       | Google Cloud    | Azure Language Studio | AWS Comprehend |
|---------------|----------------|-----------------|-----------------------|----------------|
| Intent (Pred) | calendar_query | calendar_remove | general_quirky        | calendar_set   |
| Confidence    | 0.74           | 0.85            | 0.73                  | 0.29           |