
turkish-tweets-sentiment-analysis


This sentiment analysis project determines whether tweets posted in Turkish on Twitter are positive or negative. Since Turkish is not among the most-studied languages in NLP, labeled data for it is scarce. Therefore, I created a new dataset of 15,000 tweets by combining multiple datasets.

Text Preprocessing

  • Convert to lower case
  • Remove @ mentions and hyperlinks
  • Remove punctuation, emojis, and numbers
  • Remove stop words and rare words
  • Tokenization
  • Sentence normalization
  • Lemmatization
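
A minimal sketch of the first four cleanup steps (the regex patterns and the tiny stop-word list are illustrative, not the project's exact code):

```python
import re

# Illustrative stop-word list; the project would use a full Turkish list
TURKISH_STOPWORDS = {"ve", "bu", "da", "de", "ama", "gibi"}

def clean_tweet(text: str) -> str:
    text = text.lower()                                 # note: str.lower() does not handle Turkish I/ı perfectly
    text = re.sub(r"@\w+", " ", text)                   # remove @ mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^a-zçğıöşü\s]", " ", text)         # remove punctuation, emojis, and numbers
    tokens = [t for t in text.split() if t not in TURKISH_STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("@kullanici Bu film harika! 10/10 https://t.co/abc"))  # -> "film harika"
```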

Zemberek

Zemberek is a natural language processing (NLP) library for Turkish. This project uses it for the tokenization, sentence normalization, and lemmatization steps of the text preprocessing.
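
A minimal sketch of those three steps, assuming the zemberek-python port of the library (class and method names follow that package and may differ between versions):

```python
from zemberek import TurkishMorphology, TurkishSentenceNormalizer, TurkishTokenizer

morphology = TurkishMorphology.create_with_defaults()
normalizer = TurkishSentenceNormalizer(morphology)
tokenizer = TurkishTokenizer.DEFAULT

sentence = "bugun hava cok guzelll"
normalized = normalizer.normalize(sentence)                   # sentence normalization
tokens = [t.content for t in tokenizer.tokenize(normalized)]  # tokenization

# lemmatization via morphological analysis and disambiguation
analyses = morphology.analyze_and_disambiguate(normalized).best_analysis()
lemmas = [a.item.lemma for a in analyses]
```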

Data Visualization

Positive Negative Balance

The chart below shows the percentage of positive and negative tweets in the dataset.

[Figure: positive/negative tweet distribution]

Visualizing N-Grams

An N-gram language model predicts the probability of a word given the words that precede it in a sequence. With a good N-gram model, we can estimate p(w | h), the probability of seeing the word w given a history h of the previous n-1 words.

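As a sketch, the top N-grams shown in the figures below can be counted with scikit-learn's CountVectorizer (the helper name and parameters are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(corpus, n=1, top_k=10):
    """Return the top_k most frequent n-grams of length n in the corpus."""
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    counts = vec.transform(corpus).sum(axis=0)  # total count of each n-gram
    freqs = [(term, counts[0, idx]) for term, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda x: x[1], reverse=True)[:top_k]

# e.g. top_ngrams(tweets, n=2) for the most frequent bigrams
```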

Top Unigrams

When N=1, the model is a unigram model, which considers the words in the text one by one. According to the word-frequency bar chart, "bir" (Turkish for "one"/"a") is the most frequent unigram.

[Figure: top unigrams bar chart]

Top Bigrams

When N=2, it is called a bigram, and words are handled in consecutive pairs. In this dataset, the most frequent bigram is "orospu çocuk" (a profanity).

[Figure: top bigrams]

Top Trigrams

Finally, when N=3, it is a trigram: word frequencies are evaluated over groups of three consecutive words. The most frequent trigram in the dataset is "allah bela vermek" (a curse phrase).

[Figure: top trigrams]

WordCloud

A word cloud is a visualization technique for text data in which each word is drawn at a size proportional to its importance or frequency in the text. It is a handy way to grasp the gist of a corpus at a glance. In the word cloud below, the word sizes reflect the most frequently used words across all tweets in the dataset.

[Figure: word cloud of the most frequent words]
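
A minimal sketch with the wordcloud package (df["text"] is an assumed column holding the cleaned tweets):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

all_text = " ".join(df["text"])  # df["text"]: assumed column of cleaned tweets
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```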

Deep Learning

Baseline Model

The baseline model begins with 2 densely connected layers of 64 hidden units each. The input_shape of the first layer equals the number of words allowed in the dictionary, for which one-hot-encoded features were created. To predict the 2 sentiment classes, the last layer has 2 units, and its softmax activation ensures the two probabilities sum to 1.
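
A sketch of such a baseline in Keras (NUM_WORDS, the assumed dictionary size, and the optimizer choice are illustrative):

```python
from tensorflow.keras import layers, models

NUM_WORDS = 10000  # assumed size of the one-hot vocabulary

baseline_model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(NUM_WORDS,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),  # two sentiment classes
])
baseline_model.compile(optimizer="adam",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])
```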

Handling Overfitting

  • Reducing the network's size

I reduced the size of the network by removing one layer and shrinking the remaining hidden layer to 32 units. According to the graph, the loss increases much more slowly than in the baseline model, and it takes more epochs before the reduced model starts overfitting.

[Figure: loss curves of the reduced model vs. the baseline]
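
Continuing the baseline sketch above, the reduced variant might look like this:

```python
reduced_model = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(NUM_WORDS,)),  # single, smaller hidden layer
    layers.Dense(2, activation="softmax"),
])
```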

  • Adding regularization

I added L2 regularization to the model to deal with overfitting. According to the graph, the regularized model starts overfitting earlier than the baseline model, but afterwards its loss increases much more slowly.

[Figure: loss curves of the regularized model vs. the baseline]
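
A sketch of the same model with L2 weight penalties (the regularization factor 0.001 is illustrative):

```python
from tensorflow.keras import regularizers

regularized_model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(NUM_WORDS,),
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(2, activation="softmax"),
])
```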

  • Adding dropout layers

Lastly, I added dropout layers to the model. It starts overfitting a bit later, and its loss also increases more slowly than the baseline model's.

[Figure: loss curves of the dropout model vs. the baseline]
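
A sketch with dropout (the 0.5 rate is illustrative):

```python
dropout_model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(NUM_WORDS,)),
    layers.Dropout(0.5),  # randomly zero half of the activations during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),
])
```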

Embedding Layer

Keras provides an Embedding layer that learns task-specific word embeddings from the training data. It maps each word in the vocabulary to a dense multi-dimensional vector.

[Figures: embedding-layer model results]
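
A sketch of an Embedding-based model, assuming the tweets are integer-encoded and padded to a fixed length (MAX_LEN and EMBEDDING_DIM are illustrative):

```python
MAX_LEN = 40         # assumed padded tweet length
EMBEDDING_DIM = 100  # assumed embedding size

embedding_model = models.Sequential([
    layers.Embedding(input_dim=NUM_WORDS, output_dim=EMBEDDING_DIM, input_length=MAX_LEN),
    layers.GlobalMaxPooling1D(),  # collapse the sequence dimension
    layers.Dense(2, activation="softmax"),
])
```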

Pre-trained Word Embedding: GloVe

Since the training data is not very large, the model might not be able to learn good embeddings for sentiment analysis. Luckily, we can load pre-trained word embeddings built on much larger corpora. GloVe provides multiple pre-trained embeddings, including vectors trained specifically on tweets.

[Figures: GloVe model results]
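
A sketch of loading GloVe vectors into a frozen Embedding layer (glove.twitter.27B.100d.txt is one of the published GloVe Twitter files; `tokenizer` is an assumed fitted Keras Tokenizer):

```python
import numpy as np

embeddings_index = {}
with open("glove.twitter.27B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_matrix = np.zeros((NUM_WORDS, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():  # assumed fitted Tokenizer
    if i < NUM_WORDS:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

glove_embedding = layers.Embedding(NUM_WORDS, EMBEDDING_DIM,
                                   weights=[embedding_matrix],
                                   trainable=False)  # keep pre-trained vectors frozen
```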

Model Performance

According to the accuracy scores obtained at the end of the project, the best model is the Regularized Model, with an accuracy of 87.03%. Although the Embedding layer and the pre-trained GloVe embeddings were also modeled separately, their performance was not as strong as expected.

[Figure: accuracy comparison of the models]