Code for the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums (COLING 2018).
In this paper, we propose a ContextuAl SarCasm DEtector (CASCADE), which adopts a hybrid approach of both content- and context-driven modeling for sarcasm detection in online social media discussions (Reddit).
- Clone this repo.
- Python (2.7 or 3.3-3.6)
- Install your preferred version of TensorFlow 1.4.0 (for CPU, GPU; from PyPI, compiled, etc).
- Install the rest of the requirements:
pip install -r requirements.txt
- Download the FastText pre-trained embeddings and extract them somewhere.
- Download the comments.json dataset file [1] and place it in data/.
- If you want to run the Preprocessing steps (optional), install YAJL 2, download the train-balanced.csv file, save it under data/ and continue with the Preprocessing instructions. Otherwise, just download user_gcca_embeddings.npz, place it in users/user_embeddings/ and go directly to the Running CASCADE section.
- User Embeddings: Stylometric features
The file data/comments.json contains Reddit users and their corresponding comments. A user may have multiple comments, so we concatenate all the comments belonging to the same user with the <END> tag:
cd users
python create_per_user_paragraph.py
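The concatenation step above can be sketched as follows. This is a minimal illustration, not the code in create_per_user_paragraph.py: the function name build_user_paragraphs, the (author, text) input pairs, and the spaces around the <END> tag are all assumptions, since the layout of comments.json is not described here.

```python
from collections import OrderedDict

# Separator placed between a user's comments (spacing is an assumption).
END_TAG = " <END> "


def build_user_paragraphs(comments):
    """Group comment texts by author and join each group with the <END> tag.

    `comments` is an iterable of (author, text) pairs, a hypothetical layout
    chosen for this sketch.
    """
    per_user = OrderedDict()
    for author, text in comments:
        per_user.setdefault(author, []).append(text.strip())
    # One "paragraph" per user, in order of first appearance.
    return OrderedDict(
        (user, END_TAG.join(texts)) for user, texts in per_user.items()
    )


sample = [
    ("alice", "Great, another Monday."),
    ("bob", "Sure."),
    ("alice", "Love it."),
]
paragraphs = build_user_paragraphs(sample)
```

Each resulting paragraph can then be treated as one document when training the ParagraphVector model.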
The ParagraphVector algorithm is used to generate the stylometric features. First, train the model:
python train_stylometric.py
Generate user_stylometric.csv (user stylometric features) using the trained model:
python generate_stylometric.py
- User Embeddings: Personality features
Pre-train a CNN-based model to detect personality features from text. The code uses two datasets for training. The second dataset [2] can be obtained by requesting it from the original authors.
python process_data.py [path/to/FastText_embedding]
python train_personality.py
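process_data.py takes the path to the FastText embeddings as its argument. As an illustration only, the sketch below reads vectors from the FastText text (.vec) format, in which the first line holds the vocabulary size and the dimension, and every following line holds a word followed by its values. Whether process_data.py expects this format is an assumption, and load_vec with its vocab parameter is a hypothetical helper, not part of this repo.

```python
import io


def load_vec(handle, vocab=None):
    """Read FastText .vec text data from an open file-like object.

    Returns (dim, vectors), where vectors maps word -> list of floats.
    If `vocab` is given, only words in it are kept, which saves memory.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if vocab is not None and word not in vocab:
            continue
        if len(values) != dim:
            continue  # skip malformed rows
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# Tiny in-memory stand-in for a real .vec file.
demo = io.StringIO(u"2 3\nhello 0.1 0.2 0.3\nworld 1 2 3\n")
demo_dim, demo_vectors = load_vec(demo, vocab={"world"})
```

For a real file, the handle would come from something like io.open(path, encoding="utf-8"), with vocab restricted to the words that occur in the dataset.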
Generate user_personality.csv (user personality features) using this model:
python generate_user_personality.py
To use the pre-trained model from our experiments, download the model weights and unzip them inside the folder user/.
- User Embeddings: Multi-view fusion
Merge the user_stylometric.csv and user_personality.csv files into a single merged user_view_vectors.csv file:
python merge_user_views.py
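The merge step above can be pictured with the following sketch. It assumes each CSV row is a user ID followed by that user's feature values, with no header row; the actual column layout used by merge_user_views.py is not documented here, so read_view and merge_views are hypothetical helpers.

```python
import csv
import io


def read_view(handle):
    """Map user ID -> list of feature strings (first column assumed to be the ID)."""
    return {row[0]: row[1:] for row in csv.reader(handle) if row}


def merge_views(stylometric, personality):
    """Keep users present in both views and concatenate their feature lists."""
    merged = []
    for user in sorted(set(stylometric) & set(personality)):
        merged.append([user] + stylometric[user] + personality[user])
    return merged


# In-memory stand-ins for user_stylometric.csv and user_personality.csv.
sty = read_view(io.StringIO(u"alice,0.1,0.2\nbob,0.3,0.4\n"))
per = read_view(io.StringIO(u"bob,0.9\nalice,0.5\ncarol,0.7\n"))
rows = merge_views(sty, per)
```

Users missing from either view are dropped in this sketch, so every merged row carries both views for the fusion step.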
Multi-view fusion of the user views (stylometric and personality) is performed using GCCA (~ CCA for two views). Generate the fused user embeddings user_gcca_embeddings.npz using the following command:
python user_wgcca.py --input user_embeddings/user_view_vectors.csv --output user_embeddings/user_gcca_embeddings.npz --k 100 --no_of_views 2
This implementation of GCCA has been adapted from the wgcca repo.
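To give a rough idea of what the fusion computes, here is a dense NumPy sketch of the standard MAX-VAR formulation of GCCA: the shared embedding G consists of the top-k eigenvectors of the sum of the (regularized) projection matrices of the views. It is not the adapted wgcca code, it builds an n_users × n_users matrix and so only suits small inputs, and the function name gcca and the reg parameter are assumptions made for this example.

```python
import numpy as np


def gcca(views, k, reg=1e-6):
    """Dense MAX-VAR GCCA sketch.

    views: list of (n_users, d_j) arrays whose rows are aligned by user.
    Returns G of shape (n_users, k) with orthonormal columns, plus one
    (d_j, k) linear map per view so that centered_view.dot(map) approximates G.
    """
    n = views[0].shape[0]
    projection_sum = np.zeros((n, n))
    solved = []
    for x in views:
        x = x - x.mean(axis=0)  # center each feature
        cov = np.dot(x.T, x) + reg * np.eye(x.shape[1])
        inv_xt = np.linalg.solve(cov, x.T)  # (d_j, n)
        projection_sum += np.dot(x, inv_xt)  # regularized projection matrix
        solved.append(inv_xt)
    # Symmetrize to remove round-off before the symmetric eigensolver.
    projection_sum = (projection_sum + projection_sum.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(projection_sum)
    top = np.argsort(eigvals)[::-1][:k]
    g = eigvecs[:, top]
    maps = [np.dot(inv_xt, g) for inv_xt in solved]
    return g, maps


# Two synthetic views generated from a shared 3-dimensional latent signal.
rng = np.random.RandomState(0)
latent = rng.randn(50, 3)
view_a = latent.dot(rng.randn(3, 5)) + 0.01 * rng.randn(50, 5)
view_b = latent.dot(rng.randn(3, 4)) + 0.01 * rng.randn(50, 4)
G, view_maps = gcca([view_a, view_b], k=2)
```

In this sketch, k plays the role of the embedding size, which presumably corresponds to the --k option in the command above.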
Finally:
cd ..
- Discourse Embeddings
Similar to user stylometric features, create the discourse features for each discussion forum (sub-reddit):
cd discourse
python create_per_discourse_paragraph.py
The ParagraphVector algorithm is used to generate the discourse features. First, train the model:
python train_discourse.py
Generate discourse.csv (discourse features) using the trained model:
python generate_discourse.py
Finally:
cd ..
Hybrid CNN combining user-embeddings and discourse-features with textual modeling.
cd src
python process_data.py [path/to/FastText_embedding]
python train_cascade.py
The CNN codebase has been adapted from the repo cnn-text-classification-tf from Denny Britz.
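As a rough picture of how the hybrid model can combine its inputs, the NumPy sketch below computes text features with a single convolution width and max-over-time pooling, then concatenates them with a user embedding and a discourse embedding. Fusion by concatenation, the single filter width, and the names text_cnn_features and hybrid_features are assumptions for illustration; the actual model is the TensorFlow code run by train_cascade.py.

```python
import numpy as np


def text_cnn_features(word_vectors, filters):
    """Convolve over word windows, apply ReLU, then max-pool over time.

    word_vectors: (n_words, emb_dim); filters: (n_filters, width, emb_dim).
    Returns a vector of length n_filters.
    """
    width = filters.shape[1]
    n_windows = word_vectors.shape[0] - width + 1
    # Stack all sliding windows: (n_windows, width, emb_dim).
    windows = np.stack([word_vectors[i:i + width] for i in range(n_windows)])
    # Dot each window with each filter: (n_windows, n_filters).
    scores = np.tensordot(windows, filters, axes=([1, 2], [1, 2]))
    return np.maximum(0.0, scores).max(axis=0)


def hybrid_features(word_vectors, filters, user_embedding, discourse_embedding):
    """Concatenate content features with the two contextual embeddings."""
    text = text_cnn_features(word_vectors, filters)
    return np.concatenate([text, user_embedding, discourse_embedding])


rng = np.random.RandomState(1)
words = rng.randn(7, 4)        # a 7-word comment with 4-dim word vectors
conv = rng.randn(3, 2, 4)      # 3 filters spanning 2 words each
user_vec = rng.randn(5)        # stand-in for a fused user embedding
discourse_vec = rng.randn(6)   # stand-in for a discourse embedding
features = hybrid_features(words, conv, user_vec, discourse_vec)
```

A classifier layer on top of the concatenated vector would then produce the sarcasm prediction.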
If you use this code in your work then please cite the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums with the following:
@InProceedings{C18-1156,
author = "Hazarika, Devamanyu
and Poria, Soujanya
and Gorantla, Sruthi
and Cambria, Erik
and Zimmermann, Roger
and Mihalcea, Rada",
title = "CASCADE: Contextual Sarcasm Detection in Online Discussion Forums",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "1837--1848",
location = "Santa Fe, New Mexico, USA",
url = "http://aclweb.org/anthology/C18-1156"
}
[1]. Khodak, Mikhail, Nikunj Saunshi, and Kiran Vodrahalli. "A large self-annotated corpus for sarcasm." Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018.
[2]. Celli, Fabio, et al. "Workshop on computational personality recognition (shared task)." Proceedings of the Workshop on Computational Personality Recognition. 2013.