/CASCADE--Contextual-Sarcasm-Detection

This repo contains code to detect sarcasm from text in discussion forum using deep learning

Primary LanguagePython

CASCADE: Contextual Sarcasm Detection in Online Discussion Forums

Code for the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums (COLING 2018).

Description

In this paper, we propose a ContextuAl SarCasm DEtector (CASCADE), which adopts a hybrid approach of both content and context-driven modeling for sarcasm detection in online social media discussions (Reddit).

Requirements

  1. Clone this repo.
  2. Python (2.7 or 3.3-3.6)
  3. Install your preferred version of TensorFlow 1.4.0 (for CPU, GPU; from PyPI, compiled, etc).
  4. Install the rest of the requirements: pip install -r requirements.txt
  5. Download the FastText pre-trained embeddings and extract it somewhere.
  6. Download the comments.json dataset file [1] and place it in data/.
  7. If you want to run the Preprocessing steps (optional), install YAJL 2, download the train-balanced.csv file, save it under data/ and continue with the Preprocessing instructions. Otherwise, just download user_gcca_embeddings.npz, place it in users/user_embeddings/ and go directly to Running CASCADE section.

Preprocessing

User Embeddings

  1. User Embeddings: Stylometric features.

    The file data/comments.json has Reddit users and their corresponding comments. Per user, there might be multiple number of comments. Hence, we concatenate all the comments corresponding to the same user with the <END> tag:

    cd users
    python create_per_user_paragraph.py

    The ParagraphVector algorithm is used to generate the stylometric features. First, train the model:

    python train_stylometric.py

    Generate user_stylometric.csv (user stylometric features) using the trained model:

    python generate_stylometric.py
  2. User Embeddings: Personality features

    Pre-train a CNN-based model to detect personality features from text. The code utilizes two datasets to train. The second dataset [2] can be obtained by requesting it to the original authors.

    python process_data.py [path/to/FastText_embedding]
    python train_personality.py

    Generate user_personality.csv (user personality features) using this model:

    python generate_user_personality.py

    To use the pre-trained model from our experiments, download the model weights and unzip them inside the folder user/.

  3. User Embeddings: Multi-view fusion

    Merge the user_stylometric.csv and user_personality.csv files into a single merged user_view_vectors.csv file:

    python merge_user_views.py

    Multi-view fusion of the user views (stylometric and personality) is performed using GCCA (~ CCA for two views). Generate fused user embeddings user_gcca_embeddings.npz using the following command:

    python user_wgcca.py --input user_embeddings/user_view_vectors.csv --output user_embeddings/user_gcca_embeddings.npz --k 100 --no_of_views 2

    This implementation of GCCA has been adapted from the wgcca repo.

    Finally:

    cd ..
  4. Discourse Embeddings

    Similar to user stylometric features, create the discourse features for each discussion forum (sub-reddit):

    cd discourse
    python create_per_discourse_paragraph.py

    The ParagraphVector algorithm is used to generate the stylometric features. First, train the model:

    python train_discourse.py

    Generate discourse.csv (user stylometric features) using the trained model:

    python generate_discourse.py

    Finally:

    cd ..

Running CASCADE

Hybrid CNN

Hybrid CNN combining user-embeddings and discourse-features with textual modeling.

cd src
python process_data.py [path/to/FastText_embedding]
python train_cascade.py

The CNN codebase has been adapted from the repo cnn-text-classification-tf from Denny Britz.

Citation

If you use this code in your work then please cite the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums with the following:

@InProceedings{C18-1156,
  author = 	"Hazarika, Devamanyu
		and Poria, Soujanya
		and Gorantla, Sruthi
		and Cambria, Erik
		and Zimmermann, Roger
		and Mihalcea, Rada",
  title = 	"CASCADE: Contextual Sarcasm Detection in Online Discussion Forums",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"1837--1848",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1156"
}

References

[1]. Khodak, Mikhail, Nikunj Saunshi, and Kiran Vodrahalli. "A large self-annotated corpus for sarcasm." Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018.

[2]. Celli, Fabio, et al. "Workshop on computational personality recognition (shared task)." Proceedings of the Workshop on Computational Personality Recognition. 2013.