Sentiment Analysis on IMDB Dataset

Overview

This project is a sentiment analysis task performed on the IMDB dataset. The notebook applies several machine learning models like Logistic Regression, SVM, and Random Forest to predict whether a review is positive or negative. The analysis involves text preprocessing, vectorization techniques (CountVectorizer and TF-IDF), and evaluation using confusion matrices and classification reports.

Prerequisites
Installation
Dataset
Running the Notebook
Models Used
Evaluation
Results
License

Prerequisites

Make sure you have the following libraries installed:

Python 3.10+
pandas
scikit-learn
spacy
nltk
seaborn
matplotlib

Required Commands:

Install Spacy model:

!pip install spacy
!python -m spacy download en_core_web_sm

Install NLTK:
```
!pip install nltk
```

Installation

Clone the repository:

git clone https://github.com/your-repo/NLPL_Assignment1.git

Open the notebook in Google Colab or Jupyter Notebook:
```
NLPL_Assignment_1_095.ipynb
```
Install the required packages (already in Google Colab):
```
!pip install -r requirements.txt
```

Dataset

The dataset used in this project is the IMDB movie reviews dataset. It contains 50,000 labeled reviews, with an even distribution of positive and negative sentiments.

Columns:
- review: The actual movie review text.
- sentiment: The label (positive or negative).

The dataset can be downloaded from IMDB Dataset.

Running the Notebook

Load the dataset:

df = pd.read_csv('/content/drive/MyDrive/IMDB Dataset.csv')

Preprocess the text data using Spacy:

def clean_text(text):
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    doc = nlp(text.lower())
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    return ' '.join(tokens)
df['cleaned_review'] = df['review'].apply(clean_text)

Split the dataset into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['sentiment'], test_size=0.2, random_state=42)

Apply vectorization (CountVectorizer and TF-IDF):

count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

Train and evaluate models using the evaluate_model function:

def evaluate_model(model, X_train, X_test, y_train, y_test, vectorizer_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

Models Used

The following models are implemented:

Logistic Regression: A simple linear model for classification.
SVM (Support Vector Machine): A robust classifier for binary sentiment classification.
Random Forest Classifier: An ensemble model based on decision trees.

Evaluation

The models were evaluated based on the following metrics:

Confusion Matrix: Visualized using Seaborn to compare predicted vs. actual results.
Classification Report: Precision, recall, f1-score, and accuracy for both positive and negative classes.
Model Comparison: Bar plots comparing precision, recall, f1-score, and accuracy across different models.

Results

Logistic Regression (CountVectorizer):
- Accuracy: 88%
- Precision: 87%
- Recall: 89%
SVC (TF-IDF):
- Accuracy: 90%
- Precision: 88%
- Recall: 91%
Random Forest Classifier (TF-IDF):
- Accuracy: 85%
- Precision: 85%
- Recall: 85%

Conclusion

Among the models tested, the SVM model using TF-IDF Vectorizer yielded the best results with an accuracy of 90%, followed closely by Logistic Regression. Random Forest underperformed in comparison.

License

This project is licensed under the MIT License - see the LICENSE file for details.

SaiBhujbal/NLP-Sentiments