
Sentiment Analysis of Italian Tweets with Python and Spark



This is the project for my thesis in Computer Science at the University of Palermo, carried out under the supervision of Professor Roberto Pirrone.

The goal was to build a data analysis pipeline with Big Data technologies, covering the following steps:

  • Data collection
  • Data pre-processing
  • Data labeling
  • Machine Learning model tuning
  • Application of the Naive Bayes algorithm
  • Model evaluation
  • Insight extraction
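
As a rough illustration of the Naive Bayes step, here is a minimal sketch in plain Python. It is not the project's code, which runs on Apache Spark: the function names train_naive_bayes and predict, the smoothing value alpha=1.0, and the toy training tweets are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, alpha=1.0):
    """Fit a multinomial Naive Bayes model on (tokens, label) pairs.

    alpha is the Laplace smoothing constant (assumed value, not from the project).
    """
    label_counts = Counter()            # number of tweets per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocabulary = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "label_counts": label_counts,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
        "alpha": alpha,
    }


def predict(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, doc_count in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Start from the log prior of the label.
        score = math.log(doc_count / total_docs)
        for token in tokens:
            if token not in model["vocabulary"]:
                continue  # ignore words never seen during training
            # Add the smoothed log likelihood of the word given the label.
            score += math.log(
                (counts[token] + alpha) / (total_words + alpha * vocab_size)
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data (already tokenized); not taken from the thesis dataset.
training = [
    (["bellissimo", "concerto"], "positive"),
    (["canzone", "stupenda"], "positive"),
    (["album", "orribile"], "negative"),
    (["concerto", "noioso"], "negative"),
]
model = train_naive_bayes(training)
print(predict(model, ["canzone", "stupenda"]))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a word that never appeared under a label from forcing that label's probability to zero.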

The technologies used are:

  • Python 3.7
  • Tweepy, a Python library for the Twitter API
  • Pandas, Python Data Analysis Library
  • NLTK, Natural Language Toolkit Library
  • Apache Spark 2.4

The project consists of four Python scripts:

  • tweetSave.py collects the tweets; it is set up to collect Italian tweets matching music keywords
  • tweetClean.py cleans and pre-processes the data
  • tweetSentimentRadici.py labels each tweet with a positive, negative, or neutral sentiment
  • tweetSpark.py applies the machine learning tools (runs on Spark)
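
To give an idea of what the cleaning and labeling steps might look like, here is a small sketch using only the Python standard library. It does not reproduce tweetClean.py or tweetSentimentRadici.py: the functions clean_tweet and label_tweet, the regular expressions, and the short lists of word roots are hypothetical and chosen only for illustration. Matching on word roots is an assumption suggested by the file name ("radici" is Italian for "roots").

```python
import re

# Patterns for common tweet noise (assumed; the real script may differ).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-zàèéìòù\s]")

# Hypothetical word roots, made up for this example.
POSITIVE_ROOTS = ("bell", "stupend", "ador")
NEGATIVE_ROOTS = ("orribil", "noios", "brutt")


def clean_tweet(text):
    """Lowercase a tweet, strip URLs, mentions and punctuation, return tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")        # keep the hashtag word, drop the symbol
    text = NON_LETTER_RE.sub(" ", text)  # drop digits, punctuation, emoji
    return text.split()


def label_tweet(tokens):
    """Label tokens as positive, negative or neutral by counting root matches."""
    score = 0
    for token in tokens:
        if token.startswith(POSITIVE_ROOTS):
            score += 1
        elif token.startswith(NEGATIVE_ROOTS):
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = clean_tweet("Che bel concerto! @amico https://t.co/abc #musica")
print(tokens, label_tweet(tokens))
```

A tweet with no matching root, or with as many positive as negative matches, is labeled neutral in this sketch.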

Write to me if you have any questions or suggestions for improving the solution.