Dataset - IMDB Dataset of 50K Movie Reviews (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/notebooks)
Examined a dataset of 50,000 movie reviews from IMDB to predict the sentiment (positive or negative) of each review from its text.
Cleaned and preprocessed the review text: BeautifulSoup for removing HTML tags, NLTK (Natural Language Toolkit) for stop-word removal and tokenization ahead of vectorization, and word clouds for visualization.
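The cleaning step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: to stay dependency-free it swaps in the standard library's html.parser for BeautifulSoup and a tiny hand-written stop-word set for NLTK's English list, and the names strip_html, clean_review and STOP_WORDS are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word set (assumption); NLTK's English list is far longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "of", "to", "in"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the visible text of `raw` with tags such as <br /> removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def clean_review(raw):
    """Strip HTML, lowercase, tokenize on letters, and drop stop words."""
    text = strip_html(raw).lower()
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

Calling clean_review on a raw review string returns a list of lowercase tokens, which is the form a vectorizer would typically consume.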
Prepared the cleaned text for modeling through feature engineering, building ‘Bag of Words’ and ‘TF-IDF’ vector representations.
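To make the two representations concrete, here is a small hand-rolled sketch using only the standard library. It uses the textbook definitions (term frequency = count / document length, inverse document frequency = log(N / df)); vectorizer libraries usually apply smoothed or normalized variants, so their numbers will differ. The function names bag_of_words and tf_idf are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) using tf = count/len, idf = log(N/df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for row in counts:
        total = sum(row) or 1  # guard against empty documents
        weighted.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, weighted
```

A term that occurs in every document (such as "movie" in a corpus of movie reviews) gets an IDF of zero under this definition, which is the intended effect: TF-IDF down-weights words that do not help tell documents apart.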
Performed sentiment analysis (supervised learning) on each representation using several machine learning algorithms: univariate and multivariate classification, Random Forest, Logistic Regression, and SVM (Support Vector Machine).
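As an illustration of one of the listed classifiers, the sketch below trains a binary logistic regression with full-batch gradient descent in plain Python. It assumes each review has already been converted to a numeric feature vector and that labels are 1 (positive) and 0 (negative). The project itself most likely relied on a library implementation; the function names, learning rate and epoch count here are assumptions chosen for readability.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, row)))
            err = p - label  # gradient of the log loss w.r.t. the score
            for j, v in enumerate(row):
                grad_w[j] += err * v
            grad_b += err
        # Step against the averaged gradient.
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights, bias, row):
    """Return 1 (positive) if the predicted probability is at least 0.5, else 0."""
    score = bias + sum(w * v for w, v in zip(weights, row))
    return 1 if sigmoid(score) >= 0.5 else 0
```

The same train/predict shape applies to the other classifiers in the list; only the model fitted inside the training step changes.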
Achieved a final accuracy of 90.65% on the TF-IDF vectorized data (70:30 random train-test split) using Random Forest.
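For reference, a 70:30 random split and the accuracy metric can be sketched as below. With 50,000 reviews this split yields 35,000 training rows and 15,000 test rows. The helper names and the seed parameter are hypothetical, and the snippet does not reproduce the reported figure; it only shows how such a figure is computed.

```python
import random


def train_test_split_70_30(rows, seed=0):
    """Shuffle a copy of `rows` and cut it into 70% train and 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Accuracy is measured only on the held-out 30%, so it estimates how the model performs on reviews it did not see during training.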