🎬 Sentiment Analysis with IMDB Movie Reviews 🎥

Project Overview 📝

This project classifies IMDB movie reviews as positive or negative using classical machine learning. It covers the full workflow, from text cleaning and preprocessing to TF-IDF feature extraction and model training, with Naive Bayes and Support Vector Machine (SVM) classifiers.

  • Type: Natural Language Processing (NLP)
  • Language: Python

🛠️ Libraries Used

Explore the powerful libraries that drive this project:

  • Pandas: For seamless data manipulation and analysis
  • NumPy: For efficient numerical operations
  • Matplotlib: To visualize data in style
  • Scikit-Learn: To implement and evaluate machine learning models
  • NLTK: For tokenizing and lemmatizing review text
  • Regular Expressions (re): To clean and refine text data

📊 Dataset

We're working with the IMDB Movie Reviews Dataset, a treasure trove of movie reviews! The dataset file, IMDB Dataset.csv, includes:

  • review: The actual movie review text
  • sentiment: The sentiment label (positive or negative)
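
A minimal sketch of loading and inspecting the data with pandas. The helper name `summarize_reviews` and the three-row in-memory sample are illustrative assumptions so the snippet runs without the real file; only the `review` and `sentiment` column names come from the dataset description above.

```python
import io

import pandas as pd


def summarize_reviews(df: pd.DataFrame) -> dict:
    """Return row count, missing values per column, and label counts.

    Hypothetical helper, not part of the original project.
    """
    return {
        "rows": len(df),
        "missing": df.isnull().sum().to_dict(),
        "label_counts": df["sentiment"].value_counts().to_dict(),
    }


# Stand-in for the real file; in the project this would be
# pd.read_csv("IMDB Dataset.csv").
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful little film.",positive\n'
    '"Dull, predictable plot.",negative\n'
    '"Loved every minute of it.",positive\n'
)
df = pd.read_csv(sample_csv)
summary = summarize_reviews(df)
print(summary)
```

Checking the label counts early shows whether the classes are balanced, which affects how much weight plain accuracy deserves later on.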

📝 Steps

Here's how we bring this project to life:

  1. Import Libraries: Get the essential tools ready for data processing, visualization, and machine learning.
  2. Load and Inspect Data: Peek into the dataset, check for any missing values, and understand the data distribution.
  3. Data Preprocessing: Transform text to lowercase, clean out HTML tags, tokenize reviews, and perform lemmatization.
  4. Data Preparation: Split the data into training and testing sets, encode labels, and convert text into TF-IDF features.
  5. Model Training and Evaluation: Train and test Naive Bayes and Support Vector Machine models, then evaluate their performance with accuracy scores, confusion matrices, and classification reports.
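
The preprocessing in step 3 could look roughly like the sketch below. It uses only the standard library for lowercasing, HTML stripping, and tokenizing, and accepts the lemmatizer as an optional callable. The function name `preprocess`, the regular expressions, and the pluggable-lemmatizer design are assumptions for illustration, not the project's actual code.

```python
import html
import re
from typing import Callable, List, Optional

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <br />
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")    # words, keeping contractions


def preprocess(text: str,
               lemmatize: Optional[Callable[[str], str]] = None) -> List[str]:
    """Lowercase, strip HTML, tokenize, and optionally lemmatize a review."""
    text = html.unescape(text)      # turn entities like &amp; into characters
    text = TAG_RE.sub(" ", text)    # replace tags with a space so words stay apart
    tokens = TOKEN_RE.findall(text.lower())
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return tokens


tokens = preprocess("Great movie!<br /><br />The actors were AMAZING.")
print(tokens)
```

With NLTK installed and the WordNet corpus downloaded (`nltk.download("wordnet")`), one option is to pass `nltk.stem.WordNetLemmatizer().lemmatize` as the `lemmatize` argument.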

✨ Features

Our project shines with the following features:

  • Data Preprocessing: Clean and tokenize text, strip HTML tags, and normalize text.
  • Feature Extraction: Convert text into numerical features using TF-IDF vectorization.
  • Model Training: Build and train Naive Bayes and SVM classifiers.
  • Evaluation: Assess model performance with accuracy scores, confusion matrices, and detailed classification reports.
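
As a rough illustration of the feature extraction step, the sketch below encodes labels, splits the data, and builds TF-IDF features with scikit-learn. The toy reviews, the 25% test size, and the `random_state` value are assumptions chosen for the example, not settings taken from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

reviews = [
    "a wonderful and moving film",
    "brilliant acting and a great script",
    "i loved this movie",
    "a truly great experience",
    "dull and predictable plot",
    "terrible acting and a boring script",
    "i hated this movie",
    "a truly awful experience",
]
sentiments = ["positive"] * 4 + ["negative"] * 4

# Encode string labels as integers; classes are sorted, so
# negative -> 0 and positive -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(sentiments)

# Hold out 25% for testing; stratify keeps both classes in each split.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    reviews, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the vocabulary and IDF weights on the training text only, then reuse
# them for the test text so no test information leaks into the features.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

print(X_train.shape, X_test.shape)
```

Calling `fit_transform` on the training split and only `transform` on the test split is the detail that matters most here: the test set should never influence the vocabulary.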

Usage 🚀

  1. Preprocess the data: Clean and tokenize the text data.
  2. Train the model: Fit a machine learning model on the training data.
  3. Evaluate the model: Test the model on the test data and calculate metrics like accuracy, precision, recall, etc.
  4. Predict sentiment: Use the trained model to predict the sentiment of new reviews.
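
The last step, predicting the sentiment of new reviews, is easiest when the vectorizer and classifier are bundled together. The sketch below does this with a scikit-learn pipeline and a Multinomial Naive Bayes classifier. The training sentences are made up for the example, and choosing `MultinomialNB` is an assumption, since the README does not say which Naive Bayes variant the project uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_reviews = [
    "wonderful film with great acting",
    "great story and wonderful cast",
    "loved it, a great film",
    "terrible film with awful acting",
    "awful story and terrible cast",
    "hated it, an awful film",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# The pipeline applies the same TF-IDF transform at fit and predict time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_reviews, train_labels)

new_reviews = ["a wonderful, great story", "terrible and awful"]
predictions = model.predict(new_reviews)
print(list(predictions))
```

Because the fitted pipeline accepts raw strings, new reviews only need the same text cleaning that was applied to the training data before being passed to `predict`.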

Modeling 🧠

The project explores several machine learning models, including:

  • Logistic Regression
  • Support Vector Machines (SVM)
  • Naive Bayes
  • Random Forest

We also experimented with hyperparameter tuning to improve model performance.
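
The README does not say how the tuning was done. One common approach is a cross-validated grid search, sketched below for a TF-IDF plus linear SVM pipeline. The grid values, the 3-fold setting, the choice of `LinearSVC`, and the toy data are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

reviews = [
    "a wonderful and moving film",
    "brilliant acting and a great script",
    "i loved this movie",
    "a truly great experience",
    "great cast and a wonderful story",
    "loved the brilliant ending",
    "dull and predictable plot",
    "terrible acting and a boring script",
    "i hated this movie",
    "a truly awful experience",
    "awful cast and a boring story",
    "hated the terrible ending",
]
labels = ["positive"] * 6 + ["negative"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# Parameter names follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1.0, 10.0],
}

# Every combination is scored by 3-fold cross-validated accuracy.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(reviews, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Tuning the whole pipeline, rather than the classifier alone, means the vectorizer is refit inside each fold, so the cross-validation scores are not inflated by vocabulary learned from held-out reviews.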

Evaluation 📈

The performance of each model is evaluated using metrics such as:

  • Accuracy
  • Precision
  • Recall
  • F1 Score

The confusion matrix is also used to visualize the performance of the models.
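
The metrics above can be computed with `sklearn.metrics`, as in the short sketch below. The true and predicted labels are invented for the example and are not results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)

y_true = ["positive", "positive", "positive",
          "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative",
          "negative", "negative", "positive"]

# Fix the label order so the matrix rows and columns are predictable:
# rows are true labels, columns are predicted labels.
labels = ["negative", "positive"]

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)
report = classification_report(y_true, y_pred, labels=labels)

print(f"accuracy = {accuracy:.3f}")
print(cm)
print(report)
```

The classification report lists precision, recall, and F1 score per class. To draw the matrix with Matplotlib, `sklearn.metrics.ConfusionMatrixDisplay.from_predictions(y_true, y_pred)` is one option in scikit-learn 1.0 and later.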

📈 Results

See how well our models perform! We evaluate them based on accuracy, confusion matrices, and classification reports to gauge their sentiment classification prowess.

Contributing 🤝

Contributions are welcome! If you have suggestions for improvements, feel free to fork the repository and create a pull request.

🙏 Acknowledgements

A big shoutout to:

  • Dataset: The amazing IMDB movie reviews dataset, courtesy of Kaggle.
  • Libraries: Our project's backbone includes pandas, numpy, matplotlib, scikit-learn, and nltk.
  • Inspiration: Inspired by fantastic sentiment analysis tutorials and groundbreaking NLP research.

👨‍💻 Author

  • Santhosh VS - Connect with me on LinkedIn

📧 Contact

Got questions or feedback? Drop me a line at santhosh02vs@gmail.com. I'd love to hear from you!