Past work in sarcasm detection mostly makes use of Twitter datasets collected using hashtag-based supervision.
Each record consists of three attributes:
- is_sarcastic: 1 if the record is sarcastic, otherwise 0
- headline: the headline of the news article
- article_link: link to the original news article; useful for collecting supplementary data
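For reference, the records can be loaded with pandas. A minimal sketch, assuming the JSON Lines file name from the Kaggle release (Sarcasm_Headlines_Dataset.json); adjust the path to your copy:

```python
import pandas as pd

# Assumed file name from the Kaggle release; each line is one JSON record.
df = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)

print(df.shape)
print(df[["is_sarcastic", "headline", "article_link"]].head())
```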
The NOTEBOOKS folder contains the following notebooks:
1) Preprocessing.ipynb: loading and preprocessing the data.
2) Feature_extraction.ipynb: feature extraction.
3) BOW Modeling.ipynb: modeling with bag-of-words vectorization.
4) TF-IDF Modeling.ipynb: modeling with TF-IDF vectorization.
Common issues that we generally face during the data preparation phase:
- too many spelling mistakes in the text.
- too many numbers and punctuation marks.
- too many emojis, emoticons, usernames, and links.
- Some parts of the text are not in English; the data mixes more than one language.
- Some words are joined by hyphens, contracted, or repeated.
Here I clean the text with the following steps (a sketch follows the list):
- Lowercasing the data.
- Removing punctuation.
- Removing numbers.
- Removing extra spaces.
- Expanding contractions.
- Removing HTML tags.
- Finding and removing URLs and email ids.
- Removing stop words.
- Stemming.
- Spell correction.
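A minimal sketch of such a cleaning pipeline, using NLTK for stop words and stemming. The contraction map is a small illustrative sample, and spell correction (e.g., with TextBlob) is omitted for brevity:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

# Illustrative sample only; a real pipeline would use a fuller contraction map.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_text(text: str) -> str:
    text = text.lower()                                    # lowercasing
    text = re.sub(r"<.*?>", " ", text)                     # HTML tags
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)                   # email ids
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)                   # expand contractions
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d+", " ", text)                       # numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop words
    tokens = [STEMMER.stem(t) for t in tokens]             # stemming
    return " ".join(tokens)                                # also drops extra spaces

print(clean_text("Don't miss <b>this</b>: 10 tips at https://example.com!"))
```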
Machine learning algorithms cannot work on text directly, so we need to convert the text to numbers.
- I used the CountVectorizer (bag-of-words) feature extraction algorithm.
- I used the TF-IDF feature extraction algorithm. Term Frequency (TF): how often a term appears in a single document. Inverse Document Frequency (IDF): downweights terms that appear across many documents. TF-IDF scores are word frequency scores that try to highlight words that are more interesting, i.e., frequent in one document but rare across the corpus.
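Both vectorizers are available in scikit-learn. A sketch with illustrative headlines and default hyperparameters (not necessarily those used in the notebooks):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "local man wins award for best headline",        # illustrative examples,
    "scientists reveal award winning headline tips", # not from the dataset
]

# Bag of words: raw term counts per headline.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)

# TF-IDF: counts reweighted so terms common across documents score lower;
# scikit-learn's default is idf(t) = ln((1 + n) / (1 + df(t))) + 1.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(X_bow.shape, X_tfidf.shape)
```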
The vectorized data is included in the VECTORS folder.
In this project, I used the Support Vector Machine (SVM) and Multinomial Naive Bayes algorithms to classify the headlines.
The best trained model is saved in the MODELS folder.
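A minimal training-and-selection sketch, assuming df is the DataFrame loaded earlier; LinearSVC stands in for the SVM here, and the hyperparameters and file names are illustrative:

```python
import os

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(
    df["headline"], df["is_sarcastic"], test_size=0.2, random_state=42
)

candidates = {
    "multinomial_nb": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

best_name, best_model, best_acc = None, None, 0.0
for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"{name}: {acc:.3f}")
    if acc > best_acc:
        best_name, best_model, best_acc = name, model, acc

# Persist the winning pipeline (illustrative path; the repo uses MODELS).
os.makedirs("MODELS", exist_ok=True)
joblib.dump(best_model, f"MODELS/{best_name}.joblib")
```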
The models are evaluated with:
- Accuracy
- Confusion Matrix
- Classification Report
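A short evaluation sketch with scikit-learn's metrics, continuing from the training sketch above (best_model, X_test, y_test):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = best_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["not sarcastic", "sarcastic"]))
```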
This dataset was collected from TheOnion and HuffPost.
@article{misra2019sarcasm,
  title   = {Sarcasm Detection using Hybrid Neural Network},
  author  = {Misra, Rishabh and Arora, Prahal},
  journal = {arXiv preprint arXiv:1908.07414},
  year    = {2019}
}

@book{misra2021sculpting,
  author = {Misra, Rishabh and Grover, Jigyasa},
  title  = {Sculpting Data for ML: The first act of Machine Learning},
  year   = {2021},
  month  = {01},
  isbn   = {978-0-578-83125-1}
}