Predicting emojis from tweets using ML

CS3244 Twemoji - Emoji Prediction

About

This project aims to predict the emojis associated with tweets from the tweet text and the hashtag annotations supplied by users. We conduct sentiment analysis using two different embeddings and test them on a range of classical and neural network models.

Structure

CS3244-Twemoji
├── Datasets
│   ├── Extracting
│   │   ├── Dataset_Exploration_Notebook
│   │   └── Pre-processing_Notebook
│   ├── full_train_preprocessed_subset.csv
│   ├── full_val_preprocessed_subset.csv
│   └── full_test_preprocessed_subset.csv
│
├── Models
│   ├── BERTweet
│   │   └── BERTweet_Notebook
│   ├── Random forest
│   │   ├── Trees_Glove_Notebook
│   │   └── Trees_TF-IDF_Notebook
│   ├── SVM
│   │   ├── SVM_Glove_Notebook
│   │   └── SVM_TF_IDF_Notebook
│   ├── DistilBert
│   │   └── DistilBERT_Notebook
│   ├── BiLSTM
│   │   ├── BiLSTM_Glove_Notebook
│   │   └── BiLSTM_TF_IDF_Notebook
│   ├── CNN
│   │   └── CNN_Notebook
│   ├── Simple NN
│   │   └── NN_Notebook
│   └── Logistic Regression
│       └── Logistic_Regression_Notebook
│
├── Academic_Declaration_Group_22.docx
├── README.md
└── Final_Slides.pptx

Dataset

The dataset was compiled by the Amsterdam University of Applied Sciences and published in 2018. It is a collection of 13 million tweets (instances), with features such as tweet IDs, annotations from the emoji CSV, and links to images attached to the tweets.

It can be found here - Twemoji Dataset

We also supplemented our dataset with additional tweets using Twitter scraping APIs.

Emojis Used

We first chose the 20 most frequently used emojis in the training data. However, model training took excessive time because of the large amount of data and the similarity between some emojis, so we scaled down to 5 emojis with distinct meanings.

  • 0 - ❤️ (186)
  • 1 - 😂 (1381)
  • 2 - 😅 (1384)
  • 3 - 😍 (1392)
  • 4 - 🙄 (1447)
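
The frequency-based first step described above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual preprocessing: the helper names (`top_k_labels`, `build_label_index`, `filter_examples`) and the sample tweets are hypothetical. The final narrowing to five emojis with distinct meanings was a manual choice and is not reproduced here.

```python
from collections import Counter


def top_k_labels(labels, k):
    """Return the k most frequent labels, most frequent first."""
    return [label for label, _ in Counter(labels).most_common(k)]


def build_label_index(kept_labels):
    """Map each kept label to a contiguous class index starting at 0."""
    return {label: idx for idx, label in enumerate(kept_labels)}


def filter_examples(examples, label_index):
    """Keep only (text, label) pairs whose label was kept,
    replacing the label with its class index."""
    return [
        (text, label_index[label])
        for text, label in examples
        if label in label_index
    ]


# Toy (text, emoji) pairs standing in for the real training data (hypothetical).
examples = [
    ("love this", "❤️"),
    ("so funny", "😂"),
    ("lol", "😂"),
    ("oops", "😅"),
    ("haha", "😂"),
    ("my heart", "❤️"),
]

kept = top_k_labels([label for _, label in examples], k=2)
label_index = build_label_index(kept)
filtered = filter_examples(examples, label_index)
```

With `k=20` on the full training set, the same pattern would give the initial 20-emoji candidate list.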

Word Embeddings

For a text classification task like this, words must be encoded as numeric vectors that machine learning models can work with. The two embeddings we ultimately chose for our models are:

1. Pre-trained GloVe Embeddings

GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm that learns word vectors from how often words co-occur with one another within a corpus. As a result, the vectors capture semantic relationships between words.

We used the pre-trained GloVe embeddings published by Stanford, specifically the model trained on a Twitter corpus of 2 billion tweets (27 billion tokens), with 50-dimensional vectors.

It can be found here - GloVe Embeddings
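
As a rough illustration of how pre-trained vectors like these can be turned into tweet features, the sketch below parses GloVe's plain-text format (one token per line, followed by its vector components) and averages the vectors of the known tokens in a tweet. Averaging is an assumption made for the example; it is one common approach and may differ from what the notebooks do. The function names and the 3-dimensional sample are hypothetical stand-ins for the real 50-dimensional file.

```python
import io


def load_glove(lines):
    """Parse GloVe text format: a token followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def tweet_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Tiny 3-dimensional stand-in for the real embedding file (hypothetical values).
sample = io.StringIO("good 1.0 0.0 2.0\nday 3.0 2.0 0.0\n")
vectors = load_glove(sample)

# "unseen" is out of vocabulary, so it is ignored in the average.
vec = tweet_vector(["good", "day", "unseen"], vectors, dim=3)
```

With the downloaded embeddings, `load_glove` would be given an open file handle instead of the `StringIO` sample, and `dim` would be 50.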

2. TF-IDF Vectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure of how significant a word is to a document in a corpus: it considers how frequently a word appears in a document and down-weights words that appear often across documents.

We used sklearn's TfidfVectorizer to convert our collection of preprocessed tweets into a matrix of TF-IDF features.

It can be found here - Sklearn's TFIDF Embedding
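
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It uses raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document, which to our understanding matches the defaults documented for `TfidfVectorizer`. It is a simplification for illustration, not a replacement for the library: tokenisation here is a plain whitespace split, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights
    for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenised for tok in set(toks))

    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm > 0 else row)
    return vocab, rows


docs = ["good morning", "good night", "good good day"]
vocab, rows = tfidf_matrix(docs)
```

In the first document, "morning" receives a higher weight than "good" because "good" appears in every document and is therefore down-weighted.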

Models Considered

We ran the emoji prediction task on both classical and neural network models, listed below.

  • SVM
  • Random forest
  • Logistic Regression
  • Simple NN
  • CNN
  • BiLSTM
  • BERTweet
  • DistilBERT

References