I compared two word-weighting methods, Count Vectorizer and TF-IDF, for classifying the sentiment of tweets about the motorcycle racing events at the Mandalika Circuit. The dataset was collected from Twitter and covers tweets from February 04, 2022, to March 27, 2022. The raw data is first cleaned with text preprocessing techniques to produce a clean dataset for the next stage, and the cleaned tweets are then labeled for sentiment using the VADER lexicon. The last stage is data modeling with the Random Forest Classifier, which yields the final classification model and the comparison between Count Vectorizer and TF-IDF word weighting. The results show that Count Vectorizer is still the better choice for sentiment analysis of the Mandalika Circuit tweets, reaching an accuracy of 92.76%, compared with 92.36% for TF-IDF.
There are some general library requirements for the project and some that are specific to individual methods. The general requirements are as follows.

- NumPy [https://numpy.org] - Fast and versatile; NumPy's vectorization, indexing, and broadcasting concepts are the de facto standard of array computing today.
- Pandas [https://pandas.pydata.org] - Fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool.
- NLTK [https://www.nltk.org] - Standard Python NLP library with prebuilt functions and utilities for ease of use and implementation.
- Scikit-learn [https://scikit-learn.org] - Simple and efficient tools for predictive data analysis.

The library requirements specific to some methods are:

- Swifter [https://pypi.org/project/swifter] - A package that efficiently applies any function to a pandas DataFrame or Series in the fastest available manner.
- Seaborn [https://seaborn.pydata.org] - Provides a high-level interface for drawing attractive statistical graphics.
- Matplotlib [https://matplotlib.org] - Comprehensive library for creating static, animated, and interactive visualizations.
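Assuming a standard Python 3 environment, all of the above can be installed from PyPI in one step (no versions are pinned here):

```
pip install numpy pandas nltk scikit-learn swifter seaborn matplotlib
```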
Run preprocessing.ipynb to see the clean dataset results. The clean dataset can be seen in the preprocessing results. Text preprocessing is carried out to ensure good data quality before the data is used in the analysis and to make the data more structured. The stages of text preprocessing are Case Folding, Tokenizing, Normalization, Stemming, and Filtering, described below; a combined sketch of the pipeline follows the list.
- Case Folding => The program reads the text row by row in the Text column; any uppercase characters are converted to lowercase.
- Tokenizing => The text is separated into pieces called tokens, which are then analyzed. Words, numbers, symbols, punctuation marks, and other important entities can all be treated as tokens. In NLP, tokens are usually defined as words, although tokenizing can also be done on paragraphs or sentences. The tokenizing process here also removes symbols, numbers, extra spaces, and links.
- Normalization => Non-standard words are replaced with standard words, and typos, abbreviations, and slang are corrected by matching each token from the tokenizing stage against a normalization dictionary. If a word appears in the dictionary, it is replaced with the correct form. For this step I added a normalization dictionary in Excel (.xlsx) named kata_baku.xlsx, obtained from https://github.com/teguhary/Automated-labelling-Inset-Lexicon/blob/master/Data/kamus_kata_alay.xlsx.
- Stemming => Stemming returns each word that passed normalization to its root: prefixes and suffixes are removed until only the base word remains. I also use the swifter library here to speed up stemming on the DataFrame by running the task in parallel; processing can be twice as fast or more with swifter.
- Filtering => Filtering removes words that carry no meaning, such as conjunctions. I used the English stopwords from the NLTK library to filter the DataFrame, adding the stopword list as a .txt file named stopword inggris.txt.
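The notebooks are the reference implementation; as a rough guide, the whole preprocessing pipeline could look like the sketch below. The input file name tweets.csv, the slang/baku column names in kata_baku.xlsx, and the choice of Sastrawi as the stemmer are illustrative assumptions, not details taken from this repository:

```python
import re

import nltk
import pandas as pd
import swifter  # noqa: F401 -- importing swifter registers the .swifter accessor on pandas
from nltk.tokenize import word_tokenize
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory  # pip install Sastrawi (assumed stemmer)

nltk.download("punkt")  # tokenizer models needed by word_tokenize

# Case folding: read the text of each row and convert every character to lowercase.
def case_folding(text: str) -> str:
    return text.lower()

# Tokenizing: remove links, numbers, symbols, and extra spaces, then split into tokens.
def tokenize(text: str) -> list[str]:
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)          # numbers, symbols, punctuation
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return word_tokenize(text)

# Normalization: map slang/abbreviations to standard words via kata_baku.xlsx.
# The column names "slang" and "baku" are assumptions; match them to the actual file.
norm_df = pd.read_excel("kata_baku.xlsx")
norm_dict = dict(zip(norm_df["slang"], norm_df["baku"]))

def normalize(tokens):
    return [norm_dict.get(tok, tok) for tok in tokens]

# Stemming: strip prefixes and suffixes to return each token to its base word.
stemmer = StemmerFactory().create_stemmer()

def stem(tokens):
    return [stemmer.stem(tok) for tok in tokens]

# Filtering: drop stopwords listed in stopword inggris.txt.
with open("stopword inggris.txt", encoding="utf-8") as f:
    stopwords = set(f.read().split())

def filter_tokens(tokens):
    return [tok for tok in tokens if tok not in stopwords]

# Apply the full pipeline; .swifter parallelizes the slow stemming step.
df = pd.read_csv("tweets.csv")  # assumed input file with a "Text" column
df["clean"] = df["Text"].swifter.apply(
    lambda t: " ".join(filter_tokens(stem(normalize(tokenize(case_folding(t))))))
)
```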
Run vader sentiment.ipynb to see the labeled dataset results. The labeled dataset can be seen in the sentiment results. Data labeling uses the VADER lexicon method, relying on the NLTK data server, which provides the VADER lexicon. The individual scores are combined into a compound score normalized to the range -1 to +1; the compound score in VADER is this combined result, roughly an average. There are three sentiment labels: positive, negative, and neutral. A compound score above 0.05 is positive, below -0.05 is negative, and between -0.05 and 0.05 is neutral.
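A minimal sketch of this labeling step, assuming the df["clean"] column from the preprocessing sketch above:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # fetch the VADER lexicon from the NLTK data server

analyzer = SentimentIntensityAnalyzer()

def label_sentiment(text: str) -> str:
    # polarity_scores returns neg/neu/pos scores plus a compound normalized to [-1, +1]
    compound = analyzer.polarity_scores(text)["compound"]
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

df["sentiment"] = df["clean"].apply(label_sentiment)
```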
Run modeling.ipynb to see the modeling results. This phase consists of four stages: Data Division, Word Weighting, Data Classification, and Model Visualization, described below; a condensed sketch covering all four follows the list.
- Data Division => The labeled tweet data is divided into training data and testing data. The training data is used by the classification algorithm to build the classifier model, while the testing data measures how well the classifier classifies correctly. The sentiment data is split randomly, with 80% used for training and 20% for testing.
- Word Weighting => Once the dataset has been divided into training and testing data, the training data is weighted using the Count Vectorizer and TF-IDF methods.
- Data Classification => The dataset is classified into three categories: positive, negative, and neutral. This research uses the Random Forest algorithm for classification. Random Forest is a classifier consisting of a collection of tree-structured classifiers, where each tree casts a unit vote for the most popular class at input x; in other words, Random Forest is a set of decision trees whose collective votes assign the data to a class.
- Model Visualization => Because the classification model has three categories or classes, the resulting Confusion Matrix is 3x3, with the matrix table comparing actual and predicted data. From the Confusion Matrix, the average accuracy, precision, and recall are obtained. The model visualization uses the Seaborn and Matplotlib libraries.
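The sketch below condenses all four stages under the same assumptions as above (the clean and sentiment columns); random_state and the default Random Forest hyperparameters are placeholders rather than the notebook's actual settings:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Data division: random 80/20 split of the labeled tweets.
X_train, X_test, y_train, y_test = train_test_split(
    df["clean"], df["sentiment"], test_size=0.2, random_state=42
)

labels = ["positive", "negative", "neutral"]

# Word weighting + classification: run once per weighting scheme to compare them.
for name, vectorizer in [("Count Vectorizer", CountVectorizer()),
                         ("TF-IDF", TfidfVectorizer())]:
    X_train_vec = vectorizer.fit_transform(X_train)  # fit the vocabulary on training data only
    X_test_vec = vectorizer.transform(X_test)

    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)

    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))  # per-class precision and recall

    # Model visualization: 3x3 confusion matrix of actual vs. predicted classes.
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    sns.heatmap(cm, annot=True, fmt="d",
                xticklabels=labels, yticklabels=labels, cmap="Blues")
    plt.title(f"Confusion Matrix ({name})")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
```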