This project performs sentiment analysis on the IMDB movie reviews dataset. The notebook applies several machine learning models (Logistic Regression, SVM, and Random Forest) to predict whether a review is positive or negative. The analysis covers text preprocessing, vectorization (CountVectorizer and TF-IDF), and evaluation using confusion matrices and classification reports.
Make sure you have Python and the following libraries installed:
- Python 3.10+
- pandas
- scikit-learn
- spacy
- nltk
- seaborn
- matplotlib
- Install the spaCy model:
    !pip install spacy
    !python -m spacy download en_core_web_sm
- Install NLTK:
    !pip install nltk
- Clone the repository:
    git clone https://github.com/your-repo/NLPL_Assignment1.git
- Open the notebook in Google Colab or Jupyter Notebook:
    NLPL_Assignment_1_095.ipynb
- Install the required packages (already in Google Colab):
    !pip install -r requirements.txt
The dataset used in this project is the IMDB movie reviews dataset. It contains 50,000 labeled reviews, with an even distribution of positive and negative sentiments.
- Columns:
  - review: The actual movie review text.
  - sentiment: The label (positive or negative).
The dataset can be downloaded from IMDB Dataset.
- Load the dataset:
    import pandas as pd

    df = pd.read_csv('/content/drive/MyDrive/IMDB Dataset.csv')
- Preprocess the text data using spaCy:
    import re
    import spacy

    nlp = spacy.load('en_core_web_sm')

    def clean_text(text):
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
        # Keep only letters and whitespace
        text = re.sub(r'[^A-Za-z\s]', '', text)
        doc = nlp(text.lower())
        # Lemmatize and drop stop words
        tokens = [token.lemma_ for token in doc if not token.is_stop]
        return ' '.join(tokens)

    df['cleaned_review'] = df['review'].apply(clean_text)
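To see what the two regular expressions in clean_text do before spaCy is involved, here is a minimal sketch that applies only those substitutions to a made-up review. The helper name strip_urls_and_symbols and the sample text are assumptions for illustration; lemmatization and stop-word removal are deliberately left out.

```python
import re

def strip_urls_and_symbols(text):
    """Hypothetical helper: the regex part of clean_text, without spaCy."""
    # Drop anything that looks like a URL
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Keep only ASCII letters and whitespace
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Lowercase and collapse runs of whitespace
    return ' '.join(text.lower().split())

sample = "Great movie!!! See http://example.com/review?id=1 <br /> 10/10"
cleaned = strip_urls_and_symbols(sample)
print(cleaned)  # great movie see br
```

Note that the letters of an HTML tag such as `<br />` survive as the token "br", because only the angle brackets and slash are removed. If the raw reviews contain such tags, stripping them before this step may be worth considering.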
- Split the dataset into training and testing sets:
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        df['cleaned_review'], df['sentiment'], test_size=0.2, random_state=42
    )
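As a small illustration of the 80/20 split, the sketch below runs train_test_split on ten made-up reviews. The stratify argument is an addition that the original call does not use; it is shown here only as an option for keeping the positive/negative ratio equal in both splits.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for df['cleaned_review'] and df['sentiment'] (assumed data)
reviews = [f"review {i}" for i in range(10)]
labels = ["positive", "negative"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    reviews,
    labels,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # assumption: not in the original call
)

print(len(X_train), len(X_test))  # 8 2
```

With the full 50,000-review dataset, the same test_size gives 40,000 training reviews and 10,000 test reviews.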
- Apply vectorization (CountVectorizer and TF-IDF):
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    count_vectorizer = CountVectorizer()
    tfidf_vectorizer = TfidfVectorizer()
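The snippet above only constructs the vectorizers. A minimal sketch of how they are typically applied follows, assuming the usual pattern of fitting on the training text and reusing the learned vocabulary on the test text. The toy documents and the variable names ending in _counts and _tfidf are illustrative, not taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for the cleaned training and test reviews (assumed data)
train_docs = ["good movie", "bad movie", "good good plot"]
test_docs = ["good unseen plot"]

count_vectorizer = CountVectorizer()
# Learn the vocabulary from the training text only
X_train_counts = count_vectorizer.fit_transform(train_docs)
# Reuse that vocabulary; words never seen in training are ignored
X_test_counts = count_vectorizer.transform(test_docs)

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_docs)
X_test_tfidf = tfidf_vectorizer.transform(test_docs)

vocab = sorted(count_vectorizer.vocabulary_)
print(vocab)                    # ['bad', 'good', 'movie', 'plot']
print(X_test_counts.toarray())  # [[0 1 0 1]]
```

Calling transform rather than fit_transform on the test set keeps test-only words out of the vocabulary, so the evaluation does not leak information from the test reviews.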
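- Train and evaluate models using the evaluate_model function:
    from sklearn.metrics import classification_report

    def evaluate_model(model, X_train, X_test, y_train, y_test, vectorizer_name):
        # X_train and X_test are the already vectorized feature matrices
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Label the report with the model and vectorizer being evaluated
        print(f'{type(model).__name__} ({vectorizer_name})')
        print(classification_report(y_test, y_pred))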
The following models are implemented:
- Logistic Regression: A simple linear model for classification.
- SVM (Support Vector Machine): A robust classifier for binary sentiment classification.
- Random Forest Classifier: An ensemble model based on decision trees.
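The three models can be sketched with scikit-learn as shown below. The hyperparameters (max_iter, a linear kernel for SVC, n_estimators, random_state) and the toy reviews are assumptions, since the README does not state the settings used in the notebook. The loop records accuracy only, rather than reproducing the notebook's evaluate_model output.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Toy stand-ins for the cleaned reviews and their labels (assumed data)
train_texts = [
    "loved great wonderful film", "great acting wonderful story",
    "terrible boring awful film", "awful plot boring acting",
    "wonderful great loved it", "boring terrible awful mess",
]
train_labels = ["positive", "positive", "negative",
                "negative", "positive", "negative"]
test_texts = ["great wonderful story", "awful boring film"]
test_labels = ["positive", "negative"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Hyperparameters here are illustrative guesses, not the notebook's values
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    results[name] = accuracy_score(test_labels, model.predict(X_test))

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Keeping the models in a dictionary makes it straightforward to repeat the same loop for each vectorizer and collect the scores for comparison.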
The models were evaluated based on the following metrics:
- Confusion Matrix: Visualized using Seaborn to compare predicted vs. actual results.
- Classification Report: Precision, recall, f1-score, and accuracy for both positive and negative classes.
- Model Comparison: Bar plots comparing precision, recall, f1-score, and accuracy across different models.
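A confusion matrix of the kind described above can be sketched as follows. The labels and predictions are made up, and plot_confusion_matrix is a hypothetical helper, not the notebook's function. The plotting imports are kept inside the helper so the numeric part runs without matplotlib or seaborn installed.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for illustration
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]
labels = ["negative", "positive"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[2 0]
#  [1 2]]

def plot_confusion_matrix(cm, labels):
    """Hypothetical helper: draw the matrix as an annotated Seaborn heatmap."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                     xticklabels=labels, yticklabels=labels)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    plt.show()

# In a notebook: plot_confusion_matrix(cm, labels)
```

In this toy example one positive review is predicted as negative, which appears as the 1 in the bottom-left cell.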
- Logistic Regression (CountVectorizer):
  - Accuracy: 88%
  - Precision: 87%
  - Recall: 89%
- SVC (TF-IDF):
  - Accuracy: 90%
  - Precision: 88%
  - Recall: 91%
- Random Forest Classifier (TF-IDF):
  - Accuracy: 85%
  - Precision: 85%
  - Recall: 85%
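The F1-score is listed among the evaluation metrics but not reported above. As a rough worked example, it can be approximated from the listed precision and recall using F1 = 2PR / (P + R). These values are only estimates: the inputs are rounded to whole percentages, and the README does not say whether they refer to a single class or an average over both.

```python
# Precision and recall as listed in the results above
reported = {
    "Logistic Regression (CountVectorizer)": (0.87, 0.89),
    "SVC (TF-IDF)": (0.88, 0.91),
    "Random Forest Classifier (TF-IDF)": (0.85, 0.85),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

approx_f1 = {name: f1_from(p, r) for name, (p, r) in reported.items()}

for name, f1 in approx_f1.items():
    print(f"{name}: F1 is approximately {f1:.3f}")
```

Under these assumptions the estimates come out near 0.880, 0.895, and 0.850, which preserves the same ranking as the accuracy figures.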
Among the models tested, the SVM with TF-IDF features gave the best results at 90% accuracy, followed closely by Logistic Regression with CountVectorizer at 88%. Random Forest with TF-IDF trailed both at 85%.
This project is licensed under the MIT License - see the LICENSE file for details.