Chat-bot-Evaluation-Project

Chat-bot evaluation project resources for the GCISD Science Fair in January 2018. In order to ensure that test subjects remain anonymous, the names of people from the original samples are not included.

Primary language: Python. License: MIT.

Key

chatbot_answers.txt: answers given by the chat-bot in response to the questions from the stratified random sample.
comparison.py: the platform used during the test to input the evaluations of the four judges.
deep.txt: conversation starters from https://conversationstartersworld.com/deep-conversation-topics/.
get_to_know.txt: conversation starters from https://conversationstartersworld.com/questions-to-get-to-know-someone/.
good_questions.txt: conversation starters from https://conversationstartersworld.com/good-questions-to-ask/.
ground_answers.txt: responses recorded from humans to compare against the responses of the chat-bot.
philosophical.txt: conversation starters from https://conversationstartersworld.com/philosophical-questions/.
question_sample.txt: the 40 questions sampled using a stratified random sample from the total population of 1089 conversation starters (combination of deep.txt, get_to_know.txt, good_questions.txt, philosophical.txt, and starters.txt).
reformatted_chatbot_answers.txt: reformatted version of chatbot_answers.txt (used in comparison.py).
reformatted_ground_answers.txt: reformatted version of ground_answers.txt (used in comparison.py).
roll.py: the script that samples questions from the total population of 1089 conversation starters and samples the judges using a simple random sample.
starters.txt: conversation starters from https://conversationstartersworld.com/250-conversation-starters/.

Requirements

NumPy 1.13.3+

TensorFlow 1.3.0+

Natural Language Toolkit 3.2.4+

python -m pip install --upgrade numpy tensorflow nltk

Synopsis

40 conversation starters are sampled using a stratified random sample from the aggregate produced by appending deep.txt, get_to_know.txt, good_questions.txt, philosophical.txt, and starters.txt (Daniels).

python roll.py sample_questions
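
The sampling step above can be sketched as follows. This is a minimal illustration, not the contents of roll.py: it assumes proportional allocation (each file contributes questions in proportion to its size), and the function name stratified_sample and the toy strata are hypothetical.

```python
import random


def stratified_sample(strata, total, seed=None):
    """Draw `total` items across strata, proportionally to stratum size.

    `strata` maps a stratum name (e.g. a source file) to its list of items.
    Proportional allocation is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    population = sum(len(items) for items in strata.values())
    if total > population:
        raise ValueError("cannot sample more items than the population holds")

    # Ideal (fractional) share of the sample for each stratum.
    quotas = {name: total * len(items) / population
              for name, items in strata.items()}

    # Largest-remainder rounding so the allocations sum to exactly `total`.
    alloc = {name: int(quota) for name, quota in quotas.items()}
    leftover = total - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda n: quotas[n] - alloc[n],
                          reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1

    # Sample without replacement inside each stratum.
    return {name: rng.sample(strata[name], alloc[name]) for name in strata}


# Toy strata standing in for the five conversation-starter files.
toy_strata = {
    "deep.txt": ["deep question %d" % i for i in range(10)],
    "starters.txt": ["starter %d" % i for i in range(30)],
}
toy_sample = stratified_sample(toy_strata, total=8, seed=0)
```

With the toy strata, deep.txt holds a quarter of the population, so it contributes 2 of the 8 sampled questions and starters.txt contributes the other 6.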

The chat-bot responses to these questions are recorded by feeding the questions into the Deep Q&A chat-bot and writing its exact responses to chatbot_answers.txt (Pot). The "ground truth" human responses are recorded by sampling five people from the 6th period Superchemistry class (names were listed in the omitted class.txt file) using a simple random sample and asking each participant to respond to eight of the 40 questions. Every participant gave verbal consent and was debriefed afterwards.

python roll.py sample_ground
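
As a rough sketch of the ground-truth step, the snippet below draws five people with a simple random sample and gives each of them eight of the 40 questions. It is not taken from roll.py: the function name assign_questions, the placeholder participant labels, and the choice to split the questions into disjoint shuffled blocks are all assumptions made for illustration.

```python
import random


def assign_questions(names, questions, n_people=5, seed=None):
    """Pick `n_people` at random and split the questions evenly among them.

    Splitting into disjoint blocks is an assumption of this sketch; the
    source only states that each participant answered eight questions.
    """
    rng = random.Random(seed)
    # Simple random sample: every subset of n_people is equally likely.
    people = rng.sample(names, n_people)

    # Shuffle a copy so the caller's question list is left untouched.
    shuffled = list(questions)
    rng.shuffle(shuffled)

    per_person = len(shuffled) // n_people
    return {
        person: shuffled[i * per_person:(i + 1) * per_person]
        for i, person in enumerate(people)
    }


# Placeholder roster and questions (the real class.txt is omitted).
roster = ["participant_%02d" % i for i in range(1, 21)]
sampled_questions = ["question %d" % i for i in range(1, 41)]
assignment = assign_questions(roster, sampled_questions, seed=0)
```

Each of the five sampled participants ends up with eight questions, and every one of the 40 questions is assigned exactly once.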

When it is time to conduct the study, judges are sampled from the file class.txt.

python roll.py sample_judges

The four judges were led to an isolated room, where they completed the study by interacting with the prompt controlled by comparison.py. Their objective was to determine which answer to each question "is the most appropriate." All four judges, drawn from the three simple random samples conducted, gave verbal consent both to having their responses recorded and to working with the group. Afterwards, all judges were debriefed because they had prior knowledge of the purpose of the study.

python comparison.py

One person who is not a judge interprets the judges' responses. If the judges strongly agree that one answer is more appropriate than the other, then 1+ or 2+ is recorded, where the number corresponds to the chosen answer. If the judges agree that one answer is more appropriate than the other but there is some disagreement or equivocation, then 1 or 2 is recorded, where the number again corresponds to the chosen answer. If the judges reach no agreement, then 0 is recorded. The program translates these recordings into scores. This system resembles a five-point Likert scale and is inspired by the human evaluation performed by Li et al. in A Persona-Based Neural Conversation Model.

Scores Key

+2 is scored if the human answer is confidently identified.
+1 is scored if the human answer is equivocally identified.
0 is scored if the judges cannot agree on a single answer.
-1 is scored if the chat-bot answer is equivocally identified.
-2 is scored if the chat-bot answer is confidently identified.
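
The translation from a recorded mark to a score can be sketched as below. This is an illustration rather than the code in comparison.py: it assumes that the answer the judges choose as most appropriate counts as the answer identified as human, and that the position of the human answer (1 or 2) is known for each question. The names MARKS and score are hypothetical.

```python
# Each mark maps to (chosen answer number, strength of agreement).
# Strength 2 means strong agreement, 1 means agreement with equivocation.
MARKS = {
    "1+": (1, 2),
    "1": (1, 1),
    "2": (2, 1),
    "2+": (2, 2),
}


def score(mark, human_position):
    """Convert a recorded mark into a score from -2 to +2.

    `human_position` is 1 or 2, the slot in which the human answer was
    shown (assumed to be known; this detail is not stated in the source).
    """
    if mark == "0":
        return 0  # the judges could not agree on a single answer
    chosen, strength = MARKS[mark]
    # Positive when the human answer was chosen, negative for the chat-bot.
    return strength if chosen == human_position else -strength


print(score("1+", human_position=1))  # prints 2
print(score("2", human_position=1))   # prints -1
```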

More Implementation Notes

I entered and interpreted the judges' decisions. To minimize the bias introduced, I did not clarify the meaning of responses (i.e., explain what each response means) and abstained from commenting. The only thing I clarified was specific vocabulary (e.g., the meaning of the word dystopia). Cuss words uttered by the chat-bot were censored.

Results

Using a chi-square test with a significance level of α=0.005, the null hypothesis (that the chat-bot is indistinguishable from a human) is rejected twice and fails to be rejected once. Overall, it appears that the chat-bot fails to completely emulate real replies. Qualitatively, a common failure case is the chat-bot giving nonsensical replies, which corresponds to a failure to interpret the question. Persiyanov (2017) identifies this difficulty with language as common in generative NLP models.
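
For readers unfamiliar with the test, here is a minimal sketch of one way a chi-square test could be applied to this kind of data. The source does not describe how the three tests were set up, so everything here is an assumption: the sketch runs a goodness-of-fit test on two counts (questions where the human answer was chosen versus questions where the chat-bot answer was chosen) against an even split, and the counts used are invented placeholders, not the study's results. The function name chi_square_even_split is hypothetical.

```python
import math


def chi_square_even_split(count_human, count_bot):
    """Goodness-of-fit test of two counts against a 50/50 split.

    Returns (statistic, p_value). With two categories there is one degree
    of freedom, for which the chi-square survival function has the closed
    form erfc(sqrt(x / 2)).
    """
    total = count_human + count_bot
    if total == 0:
        raise ValueError("at least one observation is required")
    expected = total / 2  # under the null, both outcomes are equally likely
    statistic = sum((observed - expected) ** 2 / expected
                    for observed in (count_human, count_bot))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


ALPHA = 0.005  # significance level stated in the Results section

# Placeholder counts for illustration only (NOT the study's data).
statistic, p_value = chi_square_even_split(count_human=30, count_bot=10)
reject_null = p_value < ALPHA
```

With the placeholder counts the statistic is 10.0 and the null hypothesis is rejected at α=0.005; an even 20/20 split would give a statistic of 0 and a p-value of 1.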

Bibliography

Abadi, Martín, et al. "TensorFlow: A System for Large-Scale Machine Learning." 12th USENIX Symposium on OSDI, 2016, https://arxiv.org/abs/1605.08695.
Bird, Steven, et al. Natural Language Processing with Python. O'Reilly, 2011.
Daniels, C. B. "Conversation Starters World." Conversation Starters World, 2017, https://conversationstartersworld.com/.
Hugunin, Jim. "The Python Matrix Object: Extending Python for Numerical Computation." Proceedings of the Third Python Workshop, Reston, Va., Dec. 1995, http://legacy.python.org/workshops/1995-12/papers/hugunin.html.
Li, Jiwei, et al. "A Persona-Based Neural Conversation Model." Association for Computational Linguistics, 2016, https://arxiv.org/abs/1603.06155.
Persiyanov, Dmitry. "Chatbots with Machine Learning: Building Neural Conversational Agents." Stats and Bots, Sept. 2017, https://blog.statsbot.co/chatbots-machine-learning-e83698b1a91e.
Pot, Etienne. Deep Question and Answer. GitHub repository, 2017, https://github.com/Conchylicultor/DeepQA.
Vinyals, Oriol, and Quoc V. Le. "A Neural Conversational Model." International Conference on Machine Learning, vol. 37, 2015, https://arxiv.org/abs/1506.05869.