This sentiment analysis project determines whether tweets posted in Turkish on Twitter are positive or negative. Because Turkish is a relatively under-studied language, there is not much labeled data available, so I created a new dataset of 15,000 tweets by combining multiple existing datasets.
- Convert to lower case
- Remove @ mentions and hyperlinks
- Remove punctuation, emojis, and numbers
- Remove stop words and rare words (see the cleaning sketch after this list)
- Tokenization
- Sentence normalization
- Lemmatization
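The non-Zemberek steps above can be sketched with plain Python and regular expressions. The snippet below is only a minimal illustration; the stop-word list and the rare-word threshold are placeholders, not the exact ones used in the project:

```python
import re
from collections import Counter

# Placeholder stop-word list; the project's actual list may differ.
STOP_WORDS = {"ve", "bir", "bu", "da", "de", "için", "ama", "çok"}

def clean_tweet(text: str) -> str:
    text = text.lower()                                 # convert to lower case
    text = re.sub(r"@\w+", " ", text)                   # remove @ mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^\w\s]|_", " ", text)              # remove punctuation and emojis
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def drop_rare_words(cleaned_tweets, min_count=2):
    """Remove words that appear fewer than min_count times in the whole corpus."""
    counts = Counter(w for t in cleaned_tweets for w in t.split())
    return [" ".join(w for w in t.split() if counts[w] >= min_count)
            for t in cleaned_tweets]

tweets = [clean_tweet(t) for t in ["@user Bugün hava çok güzel! :) https://t.co/xyz"]]
print(drop_rare_words(tweets, min_count=1))
```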
Zemberek is a natural language processing (NLP) library for the Turkish language. This project uses it for the tokenization, sentence normalization, and lemmatization steps of the text preprocessing.
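A rough sketch of these three steps with the Loodos zemberek-python port is shown below. The class and method names (TurkishMorphology, TurkishSentenceNormalizer, TurkishTokenizer, analyze_and_disambiguate) follow that port's documented examples and are my assumption; the project's actual code may differ:

```python
# Assumes the zemberek-python package (https://github.com/Loodos/zemberek-python);
# the API calls below follow its documented examples, not this project's source.
from zemberek import TurkishMorphology, TurkishSentenceNormalizer, TurkishTokenizer

morphology = TurkishMorphology.create_with_defaults()
normalizer = TurkishSentenceNormalizer(morphology)
tokenizer = TurkishTokenizer.DEFAULT

sentence = "bugün hava çok güzel"

# Sentence normalization (fixes informal spellings).
normalized = normalizer.normalize(sentence)

# Tokenization.
tokens = [token.content for token in tokenizer.tokenize(normalized)]

# Lemmatization via morphological analysis and disambiguation.
analyses = morphology.analyze_and_disambiguate(normalized).best_analysis()
lemmas = [a.item.lemma for a in analyses]

print(tokens)
print(lemmas)
```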
The chart below shows what percentage of the tweets in the dataset are positive and what percentage are negative.
An N-gram model is a language model that predicts the probability of a given N-gram within any sequence of words in the language. With a good N-gram model, we can estimate p(w | h), the probability of seeing the word w given a history h of the previous n-1 words.
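In practice this is made tractable with a Markov assumption: the history is truncated to the last N-1 words, and the probabilities are estimated from counts. For example, for a bigram:

```latex
P(w_i \mid h) \;\approx\; P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{\operatorname{count}(w_{i-1}, w_i)}{\operatorname{count}(w_{i-1})}
```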
When N=1, the model is a unigram: it considers the words in the text one by one. According to the word-frequency bar chart, "bir" is the most used unigram.
When N=2, it is called a bigram, and words are handled in pairs. In this dataset, the most used bigram is "orospu çocuk".
Finally, when N=3, it is a trigram: word frequencies are evaluated in groups of three. Accordingly, the most frequent trigram in the dataset is "allah bela vermek".
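Frequency charts like these can be reproduced with scikit-learn's CountVectorizer. The snippet below is a generic sketch of the technique, not necessarily how this project computes the counts; `cleaned_tweets` stands in for the list of preprocessed tweet strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(corpus, n, k=10):
    """Return the k most frequent n-grams in the corpus."""
    vectorizer = CountVectorizer(ngram_range=(n, n))
    counts = vectorizer.fit_transform(corpus)
    totals = counts.sum(axis=0).A1                  # total count per n-gram
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, totals), key=lambda x: x[1], reverse=True)
    return ranked[:k]

cleaned_tweets = ["bugün hava çok güzel", "hava çok kötü bugün"]  # placeholder data
for n in (1, 2, 3):
    print(f"Top {n}-grams:", top_ngrams(cleaned_tweets, n, k=5))
```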
A WordCloud is a visualization technique for text data in which each word is drawn at a size proportional to its importance or frequency in the context, which makes it a handy way to grasp the gist of a text. In the WordCloud below, the words are sized according to how often they are used across all tweets in the dataset.
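Such a figure can be generated with the wordcloud package; the snippet below is a generic sketch rather than the project's exact code, with `cleaned_tweets` again standing in for the preprocessed tweets:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cleaned_tweets = ["bugün hava çok güzel", "hava çok kötü bugün"]  # placeholder data
text = " ".join(cleaned_tweets)

# Each word is scaled by its frequency across all tweets.
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```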
The Baseline Model starts with 2 densely connected layers of 64 hidden units each. The input_shape of the first layer is equal to the number of words we allowed in the dictionary and for which we created one-hot-encoded features. In order to predict the 2 different sentiments, the last layer has 2 hidden units, and the softmax activation function makes sure the two probabilities sum up to 1.
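A sketch of such a baseline in Keras follows, assuming the one-hot (bag-of-words) input described above has NB_WORDS features; the vocabulary size, optimizer, and loss are illustrative assumptions rather than the project's exact settings:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

NB_WORDS = 10000  # assumed vocabulary size; the project's value may differ

baseline_model = Sequential([
    Dense(64, activation="relu", input_shape=(NB_WORDS,)),
    Dense(64, activation="relu"),
    Dense(2, activation="softmax"),   # two sentiment classes
])
baseline_model.compile(optimizer="adam",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])
baseline_model.summary()
```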
- Reducing the network's size
I reduced the size of the network by removing one layer and cutting the number of hidden units in the remaining layer to 32. According to the graph, the loss increases much more slowly than in the baseline model, and it takes more epochs before the reduced model starts overfitting.
- Adding regularization
I added L2 regularization to the model to deal with overfitting. According to the graph, it starts overfitting earlier than the baseline model, but the loss increases much more slowly afterward.
- Adding dropout layers
Lastly, I added dropout layers to the model. It starts overfitting a bit later, and the loss also increases more slowly than in the baseline model.
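The three variations can be sketched as follows (again assuming NB_WORDS input features; the L2 factor and dropout rate are illustrative values, not necessarily those used in the project):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers

NB_WORDS = 10000  # assumed vocabulary size

# 1. Reduced network: a single hidden layer with 32 units.
reduced_model = Sequential([
    Dense(32, activation="relu", input_shape=(NB_WORDS,)),
    Dense(2, activation="softmax"),
])

# 2. L2 regularization on the baseline's hidden layers.
regularized_model = Sequential([
    Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.001),
          input_shape=(NB_WORDS,)),
    Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    Dense(2, activation="softmax"),
])

# 3. Dropout between the baseline's layers.
dropout_model = Sequential([
    Dense(64, activation="relu", input_shape=(NB_WORDS,)),
    Dropout(0.5),
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(2, activation="softmax"),
])
```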
Keras provides an Embedding Layer that helps to train specific word embeddings based on the training data. It converts the words in the vocabulary to multi-dimensional vectors.
Since the training data is not very large, the model might not be able to learn good embeddings for sentiment analysis. Luckily, we can load pre-trained word embeddings built on much larger corpora. The GloVe project provides multiple pre-trained word embeddings, including more specific embeddings trained on tweets.
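Both approaches are sketched below. The GloVe file name, vocabulary size, and sequence length are assumptions for illustration; the idea is simply to map each vocabulary word to a row of a pre-trained embedding matrix and hand that matrix to the Embedding layer:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.initializers import Constant

NB_WORDS = 10000   # assumed vocabulary size
EMB_DIM = 100      # embedding dimension
MAX_LEN = 30       # assumed maximum tweet length in tokens

# --- Embedding layer trained from scratch on the tweet data ---
emb_model = Sequential([
    Embedding(NB_WORDS, EMB_DIM, input_length=MAX_LEN),
    GlobalAveragePooling1D(),
    Dense(2, activation="softmax"),
])

# --- Pre-trained GloVe embeddings ---
# word_index is assumed to map each vocabulary word to its integer id.
def build_glove_matrix(word_index, path="glove.twitter.27B.100d.txt"):
    """Fill an embedding matrix with GloVe vectors for the words we know."""
    glove = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            glove[parts[0]] = np.asarray(parts[1:], dtype="float32")
    matrix = np.zeros((NB_WORDS, EMB_DIM))
    for word, i in word_index.items():
        if i < NB_WORDS and word in glove:
            matrix[i] = glove[word]
    return matrix

# glove_matrix = build_glove_matrix(word_index)
# glove_model = Sequential([
#     Embedding(NB_WORDS, EMB_DIM, input_length=MAX_LEN,
#               embeddings_initializer=Constant(glove_matrix),
#               trainable=False),  # keep the pre-trained vectors fixed
#     GlobalAveragePooling1D(),
#     Dense(2, activation="softmax"),
# ])
```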
According to the accuracy scores obtained at the end of the project, the best model is the Regularized Model, with an accuracy of 87.03%. Although models with the Embedding Layer and with the pre-trained GloVe word embeddings were built separately, their performance was not as good as expected.