Machine Learning and Deep Learning Approaches to Sarcasm Detection

This project addresses the problem of sarcasm detection - often quoted as a subtask of sentiment analysis. There are two main scripts used to begin using this code - train.py requires a fair amount of setup, however console.py can be run very quickly, so long as the correct dependencies are installed (listed below)

train.py : trains and evaluates new models on chosen dataset, saving these models to Code/pkg/trained_models/
- To run train.py, follow the Data configuration and Setup instructions before proceeding
console.py : makes predictions using existing trained models, where user input can be provided via a console. A visualisation of attention weights is produced in /colorise.html
- Our best-performing model, the Bidirectional Long Short-Term Memory model trained on ElMo vectors with Attention is provided to get started
- It is possible to interact with other models, however they will need to be trained first using train.py

To use this code, clone this repository then navigate to the root directory.

Move to Code/, then:
- On Windows, execute the command "python console.py" or "python train.py"
- On Linux, execute the command "python3 console.py" or "python3 train.py"

List of dependencies:

en-core-web-md == 2.2.5
Keras == 2.3.1
matplotlib == 3.1.3
numpy == 1.18.0
pandas == 0.25.3
pycorenlp == 0.3.0
scikit-learn == 0.22
spacy == 2.2.3
tensorflow == 2.1.0
tensorflow-hub == 0.7.0
tweepy == 3.8.0
twitter == 1.18.0

Data configuration and Setup

Datasets can be collected from the following sources:

Twitter data - Ptáček et al. (2014):
- Collected from: http://liks.fav.zcu.cz/sarcasm/
- Uses the EN balanced corpus containing 100,000 tweet IDs that must be scraped from the Twitter API - Twitter scraper can be found in Code/pkg/datasets/ptacek/processing_scrips/TwitterCrawler
- Once downloaded, move normal.txt and sarcastic.txt (files from download) into Code/pkg/datasets/ptacek/raw_data\
News headlines - Misra et al. (2019):
- Data is downloaded in JSON format
- https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection
- Once downloaded, move Sarcasm_Headlines_Dataset_v2.json (files from download) into Code/pkg/datasets/news_headlines/raw_data
Amazon reviews - Filatova et al. (2012):
- Data is downloaded in .rar format
- https://github.com/ef2020/SarcasmAmazonReviewsCorpus/
- Only Ironic.rar and Regular.rar is used in this project
- Convert Ironic.rar and Regular.rar (files from download) into regular folders, then move them to Code/pkg/datasets/amazon_reviews/raw_data

After downloading the data - proceed to reformat it into a csv and apply our data cleaning processes:

Run p1_create_raw_csv.py followed by p2_clean_original_data.py to achieve the correct configuration => NOTE: p1_create_raw_csv.py will take some time on the Twitter dataset, as it is slow to scrape the Tweets given their ids
e.g. Code/pkg/datasets/news_headlines/
                                                  ├── /processed_data
                                                                      ├── ...
                                                                      ├── /CleanData.csv
                                                                      ├── /OriginalData.csv
                                                  ├── /processing_scripts/...
                                                  ├── /raw_Data/...

Language models can be downloaded from the following sources:

ELMO:
- Download the ELMo tensorflow-hub module
- Move the elmo contents into a directory named elmo e.g. Code/pkg/language_models/elmo/
GloVe
- Download the GloVe database
- Select the glove.twitter.27B.50d.txt file and place it in a subdirectory called glove, e.g. Code/pkg/language_models/glove/

nouramoubayed/Sarcasm-Detection

Machine Learning and Deep Learning Approaches to Sarcasm Detection

List of dependencies:

Data configuration and Setup