AlmaBetter Verified Project - AlmaBetter School
We have built a system that determines a user's sentiment from their tweets and categorizes it as positive or negative.
- Coronavirus_tweet_sentiment_analysis.ipynb - Includes all functions required for classification operations.
- Google Colab - All the outputs are visible in the provided Colab notebook.
Sentiment analysis refers to identifying and classifying the sentiments expressed in a text source. Tweets, once analyzed, yield a vast amount of sentiment data. This data is useful for understanding people's opinions on a variety of topics.
Therefore, we need to develop an automated machine learning sentiment analysis model to gauge customer perception. Tweets mix useful data with non-useful characters (collectively termed noise), which makes it difficult to apply models to the raw text.
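The kind of noise removal described above can be illustrated with a small cleaning function. This is only a sketch under assumptions: the function name `clean_tweet` and the specific rules (lowercasing, dropping URLs, mentions and non-letter characters) are illustrative choices, not necessarily the steps used in the notebook.

```python
import re

# Patterns for common kinds of tweet noise (illustrative choices).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Return a lowercased tweet with URLs, mentions and non-letters removed."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @user mentions
    text = NON_LETTER_RE.sub(" ", text)  # drop digits, punctuation and '#'
    return " ".join(text.split())        # collapse repeated whitespace


print(clean_tweet("@user Stock up on #food now! https://t.co/abc123 :)"))
```

Hashtag words are kept here (only the `#` symbol is removed), on the assumption that they often carry sentiment.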
The original dataset contains 6 columns and 41157 rows. The Location column contains null values, so we dropped the rows where it is missing. We also added a new column, "clean_tweets", which holds the cleaned tweet text. After these two steps the dataset has 7 columns and 32567 rows. For the analysis we required only two columns, "OriginalTweet" and "Sentiment"; columns such as "UserName" and "ScreenName" do not give any meaningful insight.
The dataset has five types of sentiment: Extremely Positive, Positive, Extremely Negative, Negative and Neutral. We relabeled Extremely Positive as Positive and Extremely Negative as Negative, which leaves three types of sentiment: Positive, Negative and Neutral.
The pie chart shows the proportion of each sentiment. The bar plot of unique values shows the number of unique values in each column. The plot of the top 10 locations shows that most of the tweets came from London, followed by the United States.
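The preprocessing steps described above could be sketched in pandas roughly as follows. The toy DataFrame and its values are invented for illustration, and it includes only the columns named in this README; the real dataset has more columns and rows.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "UserName": [1, 2, 3, 4, 5],
    "ScreenName": [101, 102, 103, 104, 105],
    "Location": ["London", None, "United States", "London", None],
    "OriginalTweet": ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"],
    "Sentiment": ["Extremely Positive", "Negative", "Neutral",
                  "Extremely Negative", "Positive"],
})

# Drop rows whose Location is missing.
df = df.dropna(subset=["Location"]).copy()

# Fold the "Extremely" labels into their base labels.
df["Sentiment"] = df["Sentiment"].replace({
    "Extremely Positive": "Positive",
    "Extremely Negative": "Negative",
})

# Counts behind a sentiment pie chart and a top-locations plot.
sentiment_counts = df["Sentiment"].value_counts()
top_locations = df["Location"].value_counts().head(10)

print(sentiment_counts)
print(top_locations)
```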
- For multiclass classification, the best model for this dataset would be Logistic Regression.
- For binary classification, the best model for this dataset would be Stochastic Gradient Descent.
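As a rough illustration of the two setups, the sketch below trains both models with scikit-learn on a handful of invented tweets. Several things here are assumptions rather than details taken from the notebook: TF-IDF features, `SGDClassifier` as the Stochastic Gradient Descent model, default hyperparameters, and forming the binary task by leaving out Neutral tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline

# Invented example tweets, already cleaned; not taken from the dataset.
texts = [
    "great support from local stores",
    "thankful for the delivery workers",
    "panic buying is terrible",
    "prices are rising and it is awful",
    "supermarket opens at nine",
    "new store hours announced",
]
labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

# Multiclass: Positive / Negative / Neutral with Logistic Regression.
multiclass_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
multiclass_model.fit(texts, labels)

# Binary: Positive / Negative only (assumption: Neutral tweets are left out).
binary_pairs = [(t, y) for t, y in zip(texts, labels) if y != "Neutral"]
binary_texts = [t for t, _ in binary_pairs]
binary_labels = [y for _, y in binary_pairs]
binary_model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
binary_model.fit(binary_texts, binary_labels)

multi_pred = multiclass_model.predict(["terrible panic at the store"])
binary_pred = binary_model.predict(["thankful for great support"])
print(multi_pred[0], binary_pred[0])
```

With so few examples the predictions are not meaningful; the point is only the shape of the two pipelines.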
The order of execution of the Colab notebook is as follows:
1) Coronavirus_tweet_sentiment_analysis.ipynb
First, click the "Open in Colab" button at the top center of the notebook.
2) Kaggle Dataset
Download the dataset from Kaggle through the provided link. Then connect to the runtime and either execute the cell that mounts your drive or upload the data file to the current runtime.
3) Cell Path
Finally, delete the path in the dataset loading cell and replace it with the path of your current data file. Run each cell to see the output below it.
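The dataset loading step might look something like the sketch below. The path, the file name and the `load_tweets` helper are hypothetical, and `encoding="latin-1"` is an assumption; adjust them to match your own copy of the data file.

```python
from pathlib import Path

import pandas as pd

# Hypothetical path: replace it with the location of your own data file.
DATA_PATH = Path("/content/drive/MyDrive/coronavirus_tweets.csv")


def load_tweets(path):
    """Read the tweet CSV into a DataFrame.

    encoding="latin-1" is an assumption; change it if your file is UTF-8.
    """
    return pd.read_csv(path, encoding="latin-1")


# In Colab, Google Drive can be mounted first with:
#   from google.colab import drive
#   drive.mount("/content/drive")
if DATA_PATH.exists():
    df = load_tweets(DATA_PATH)
    print(df.shape)
```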
Vivek Pawar | Data Scientist | Machine Learning Engineer
Contact me for Data Science Project Collaborations