This repository was created for a study on understanding indirect answers in NLP. We release a new dataset, Friends-QIA, to enable progress on this task.
The repository contains the code and datasets used in the study, which focuses on classifying indirect answers to polar questions.
The two datasets used for the study, Circa and Friends, can be found in the Data folder, in Circa_data and Friends_data, respectively. Friends_data is further divided into Original_QA_category_files, which holds the data in its original format divided by category (used for Exploratory Data Analysis), and Final_QA_datasets, which holds the dataset split into train, tune, dev and test sets used for the models. The Circa dataset was originally created by researchers at Google; the version presented here is split into train, tune, dev and test sets and is structured differently from the original.
The data_preparation folder contains Python files with functions for creating the datasets from the category files, as well as the preprocessing functions that prepare the input for the models.
The files model_MISC.py and model_SISC.py contain the two CNN models used in the study: multi input single channel and single input single channel, respectively.
The Predictions folder contains predictions generated by three runs of CNN with BERT and CNN with BERT + CrowdLayer, as well as the annotator-specific weights estimated by one of the models with CrowdLayer. All of them are used in performance_bert.ipynb and performance_bert_crowds.ipynb to obtain the performance scores.
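As a rough illustration of how predictions from three runs can be turned into summary scores, the sketch below computes accuracy and macro-averaged F1 per run and then reports the mean and standard deviation across runs. The choice of metrics, the function names, and the toy labels are assumptions made for illustration; the notebooks may compute their scores differently.

```python
from statistics import mean, stdev


def accuracy(gold, pred):
    """Fraction of instances where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarize_runs(gold, runs):
    """Return (mean, standard deviation) per metric across several runs."""
    accs = [accuracy(gold, run) for run in runs]
    f1s = [macro_f1(gold, run) for run in runs]
    return {
        "accuracy": (mean(accs), stdev(accs)),
        "macro_f1": (mean(f1s), stdev(f1s)),
    }


# Toy labels only (not taken from the datasets): one gold list, three runs.
gold = ["yes", "no", "yes", "no"]
runs = [
    ["yes", "no", "yes", "no"],
    ["yes", "yes", "yes", "no"],
    ["no", "no", "yes", "no"],
]
summary = summarize_runs(gold, runs)
```

Reporting the standard deviation alongside the mean makes it easier to see whether a difference between two model variants is larger than the run-to-run variation.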
All scripts were created with Python 3 and the following packages:
- jupyter-core==4.7.1
- Keras==2.4.3
- Keras-Preprocessing==1.1.2
- matplotlib==3.4.2
- notebook==6.3.0
- numpy==1.19.5
- pandas==1.1.1
- scikit-learn==0.24.2
- seaborn==0.11.1
- tensorflow==2.4.1
- transformers==4.6.1
- npy-append-array==0.9.6
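To check whether a local environment matches the versions listed above, a small script along these lines may help. It is only a sketch: `PINNED` and `check_versions` are names introduced here for illustration and are not part of the repository, and it relies on the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the package list above.
PINNED = {
    "jupyter-core": "4.7.1",
    "Keras": "2.4.3",
    "Keras-Preprocessing": "1.1.2",
    "matplotlib": "3.4.2",
    "notebook": "6.3.0",
    "numpy": "1.19.5",
    "pandas": "1.1.1",
    "scikit-learn": "0.24.2",
    "seaborn": "0.11.1",
    "tensorflow": "2.4.1",
    "transformers": "4.6.1",
    "npy-append-array": "0.9.6",
}


def check_versions(pinned):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed package is installed at the listed version.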
The results for the base CNN models without BERT embeddings were obtained with GloVe embeddings. They are available here.
The crowds study was inspired by the paper by Filipe Rodrigues and Francisco C. Pereira, and the code is strongly based on their implementation, which can be found in this GitHub repository.
The generated BERT embeddings are not included in the repository. They were created by following this article and used as input to the CNN models. Due to preprocessing issues, instance number 4678 was removed from the dataset (train+tune set) before creating the embeddings.
The code used for generating the embeddings for this study can be found in generate_bert_embeddings.py.
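If you regenerate the embeddings yourself, the same instance needs to be dropped first so that the embeddings stay aligned with the labels. The sketch below shows one way to do that, assuming each instance is a dictionary with a hypothetical `id` field; the field name, the `drop_instance` helper, and the toy records are illustrative and not taken from generate_bert_embeddings.py. If 4678 refers to a row position rather than an identifier, filter on the position instead.

```python
REMOVED_INSTANCE = 4678  # instance number mentioned above


def drop_instance(instances, instance_id=REMOVED_INSTANCE, key="id"):
    """Return a new list without the instance whose `key` equals `instance_id`.

    The `key` field is an assumption about how instances are identified.
    """
    return [row for row in instances if row[key] != instance_id]


# Toy records standing in for the combined train+tune set.
train_tune = [
    {"id": 4677, "question": "toy question A", "answer": "toy answer A"},
    {"id": 4678, "question": "toy question B", "answer": "toy answer B"},
    {"id": 4679, "question": "toy question C", "answer": "toy answer C"},
]
filtered = drop_instance(train_tune)
```

The helper returns a new list and leaves the input untouched, so the original split can still be inspected afterwards.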
If you use Friends-QIA, please cite:
@inproceedings{friendsqia,
title = "{``}{I}{'}ll be there for you{''}: The One with Understanding Indirect Answers",
author = "Damgaard, Cathrine and
Toborek, Paulina and
Eriksen, Trine and
Plank, Barbara",
booktitle = "Proceedings of the 2nd Workshop on Computational Approaches to Discourse (CODI) at EMNLP",
month = nov,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
}
For Circa, please cite:
@inproceedings{louis-etal-2020-id,
title = "{``}{I}{'}d rather just go to bed{''}: Understanding Indirect Answers",
author = "Louis, Annie and
Roth, Dan and
Radlinski, Filip",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.601",
doi = "10.18653/v1/2020.emnlp-main.601",
pages = "7411--7425"
}