User-Augmented Transformer-based Sarcasm Detector on the SARC Dataset
Link to the subreddit dataset
This is a list of objects mapping 'text' to the raw post and 'subreddit' to the originating subreddit. It contains all posts, both ancestors and responses, that appear in the unbalanced dataset.
Each text file contains a single line per example:
example_number sarcasm_label list_of_space_separated_tokens
The user_tok directory contains data of the format:
example_number sarcasm_label user_id list_of_space_separated_tokens
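For illustration, here is a minimal Python sketch for parsing these line formats (the helper name and the has_user_id flag are our own, not part of the repo):

```python
def parse_line(line, has_user_id=False):
    """Parse one example line from main_tok/pol_tok (or user_tok)."""
    parts = line.split()
    example_number, sarcasm_label = int(parts[0]), int(parts[1])
    if has_user_id:
        # user_tok format: example_number sarcasm_label user_id tokens...
        return example_number, sarcasm_label, parts[2], parts[3:]
    # main_tok/pol_tok format: example_number sarcasm_label tokens...
    return example_number, sarcasm_label, None, parts[2:]
```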
Note: there are 4 RoBERTa-tokenized posts in main_tok that exceed the 512-token sequence length limit.
The data included is the processed data from the SARC dataset. main_tok and pol_tok contain tokenized versions of the data in main and pol from the SARC dataset, produced with the NLTK word tokenizer and the Hugging Face RoBERTa tokenizer. The included examples are single-post responses.
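As a rough sketch, a raw post could be tokenized both ways like this (the exact preprocessing used to build main_tok and pol_tok may differ; truncation guards against the few posts noted above that exceed RoBERTa's 512-token limit):

```python
from nltk.tokenize import word_tokenize      # requires nltk.download("punkt")
from transformers import RobertaTokenizer

post = "Oh great, another Monday."

nltk_tokens = word_tokenize(post)            # NLTK word tokens

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta_ids = tokenizer(post, truncation=True, max_length=512)["input_ids"]
```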
Reddit posts appear in pairs: odd-numbered posts are children of the preceding even-numbered post.
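Given that pairing, ancestor/response pairs can be recovered from consecutive lines, for example (reusing the hypothetical parse_line helper above, with a hypothetical file name, and assuming 0-based example numbers):

```python
with open("main_tok/train.txt") as f:   # hypothetical file name
    examples = [parse_line(line) for line in f]

# examples[0::2] are even-numbered ancestors, examples[1::2] their responses
pairs = list(zip(examples[0::2], examples[1::2]))
```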
Below is the structure of the data-holding variables found in the load_data notebook, and the structure of the data in main_tok and pol_tok:
Pictured above: the distribution of posts per author (left) and the distribution of posts per subreddit (right).
Number of posts: 321,748
Vocabulary size (using the NLTK tokenizer): 145,542 words
Average post length: 10 tokens, for both sarcastic and non-sarcastic posts
Percentage of users with 10 or more posts: 7.35%
Percentage of subreddits with 10 or more posts: 43.17%
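These counts can be recomputed from the parsed user_tok data with a few lines (reusing the hypothetical examples list from the sketch above, here parsed with has_user_id=True):

```python
from collections import Counter

posts_per_user = Counter(user for _, _, user, _ in examples)
heavy_users = sum(1 for c in posts_per_user.values() if c >= 10)
print(f"Users with 10+ posts: {100 * heavy_users / len(posts_per_user):.2f}%")

vocab = {tok for _, _, _, toks in examples for tok in toks}
print(f"Posts: {len(examples):,}  Vocabulary size: {len(vocab):,}")
```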
External dependencies: nltk, transformers, numpy, matplotlib
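They can be installed with pip:

```
pip install nltk transformers numpy matplotlib
```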
Download the main and pol folders here to explore the data further and to reproduce the included data.
(Optional) Download a GloVe embedding from https://nlp.stanford.edu/projects/glove/ and modify the variable at the top of the notebook according to the embedding you choose.
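If you do use GloVe, the vectors can be loaded into a dictionary along these lines (the file name depends on which embedding you download):

```python
import numpy as np

glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:  # example file name
    for line in f:
        word, *vals = line.split()
        glove[word] = np.asarray(vals, dtype=np.float32)
```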
For our hybrid model, we first pretrain a user classification model; the code for this step is in models/ubert.py. We also defined several classifiers, all of which are in models/classifier.py.
The main training code for the hybrid model is in models/main.py. To run it with user embeddings, use the command below:
main.py --modeltype uhybrid --epoch 10
To run it with subreddit embeddings, use --modeltype subybrid instead.
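For example, assuming the remaining flags are unchanged:
main.py --modeltype subybrid --epoch 10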
The emotion embedding model code is in emotion_vectorization.ipynb. It only needs to be run once: we computed the emotion representations and saved them in emo_vecs.npy. After downloading our data, you can use this notebook to recreate the emotion vectors.
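The saved vectors can then be loaded directly with NumPy (the array's shape depends on the notebook's output):

```python
import numpy as np

emo_vecs = np.load("emo_vecs.npy")
```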
The subreddit classification code is in SubReddit-Classifier.ipynb. We checkpoint the model trained there and later load it in our code.